There are some pretty useful sites out there, but some interfaces are just plain annoying.
Take pof.com for example: they have millions of users but haven’t touched their interface since the beginning, and if you get lots of messages it becomes a pain to go through them all.
I figured it might be easier to use my hacking skills to create my own interface.
Step 1: Login
First take a look at the source code of the form on the login page:
<form action="https://www.pof.com/processLogin.aspx" method="post" id="frmLogin" name="frmLogin" class="form right">
<div id="login-box">
<input name="url" id="url" class="title" type="hidden">
<input name="username" id="username" class="title input" type="text" value="l33tman">
<label class="headline txtBlue size12 label username" for="username">Username</label>
<input name="password" id="password" class="title input" type="password">
<label class="headline txtBlue size12 label password" for="password">Password</label>
<script type="text/javascript">
var nowt = new Date(),
tempt_F = nowt.getTimezoneOffset();
document.write('<input type=\'hidden\' value=\'' + tempt_F + '\' name=\'tfset\'/>');
</script><input type="hidden" value="300" name="tfset">
<input name="login" id="login" class="button norm-blue submit" type="submit" value="Check Mail!">
<input name="callback" id="callback" type="hidden" value="http%3a%2f%2fwww.pof.com%2fstart.aspx">
<input name="sid" id="sid" type="hidden" value="wcqugtcmwbpb2rvn345x4mxk">
</div>
<script type="text/javascript">
if (document.getElementsByTagName("html").lang == undefined || document.getElementsByTagName("html").lang == null) {
var html = document.getElementsByTagName("html")[0];
html["lang"] = "en";
}
</script>
</form>
We will use python-requests to make all our requests with a simulated user session. See http://docs.python-requests.org/en/latest/ for more details.
We will start by passing in all those input values to requests.post:
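import requests

# Replace these placeholders with your own credentials.
username = "your_username"
password = "your_password"

session = requests.session()
payload = dict(username=username,
               password=password,
               tfset="300",
               callback="http%3a%2f%2fwww.pof.com%2finbox.aspx",
               sid="wcqugtcmwbpb2rvn345x4mxk")
# Post to the same URL as the form's action attribute.
response = session.post("https://www.pof.com/processLogin.aspx", data=payload)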
By using session.post instead of the plain requests.post, we retain all the cookie information necessary to simulate an actual logged-in user.
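The sid and tfset values in the payload above were copied by hand from the page source. If sid turns out to be issued per visit (an assumption; nothing here confirms it), it could instead be read from the login page before posting. Here is a minimal sketch of that idea for Python 3, using only the standard library's html.parser. HiddenInputCollector and hidden_fields are hypothetical names, and the sample markup is trimmed from the form shown earlier:

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if attrs.get("type") == "hidden" and attrs.get("name"):
            # A missing value attribute is treated as an empty string.
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden input fields found in the markup."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Sample markup trimmed from the login form shown earlier.
sample = '''
<form action="https://www.pof.com/processLogin.aspx" method="post">
<input name="url" id="url" type="hidden">
<input name="username" id="username" type="text" value="l33tman">
<input type="hidden" value="300" name="tfset">
<input name="sid" id="sid" type="hidden" value="wcqugtcmwbpb2rvn345x4mxk">
</form>
'''
fields = hidden_fields(sample)
print(sorted(fields))
```

The resulting dictionary could then be merged with the username and password before calling session.post.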
Step 2: Collect the message links
BeautifulSoup makes parsing html extremely simple. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for the docs.
Say we have an html string and we’d like to find all the html elements with the “message” class. Here’s how we would do that with BeautifulSoup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
for message in soup.find_all('a', 'message'):
    pass  # process your message
In step 1 we logged into pof.com and got a response object. We can pass the HTML contents of this object to BeautifulSoup to begin parsing.
For our case we need the next_page link and the links to the messages (the HTML of POF is terrible, so some hackery was necessary to get the elements properly):
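import re

soup = BeautifulSoup(response.text)
# find() returns None on the last page, which has no 'Next Page' link.
next_page_link = soup.find('a', text='Next Page')
next_page = next_page_link and next_page_link.attrs['href']
message_links = []
for message_link in soup.find_all(attrs={'href': re.compile('viewallmessages.*')}):
    message_links.append(message_link.attrs['href'])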
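The href values collected above appear to be relative to the site root, since the full script prefixes each one with https://www.pof.com/ before requesting it. As an alternative sketch (not what the script does), the standard library's urljoin performs the same join on Python 3 and also leaves already-absolute links alone. absolute_url is a hypothetical helper and the sample hrefs are made up for illustration:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.pof.com/"


def absolute_url(href, base=BASE_URL):
    """Resolve a scraped href against the site root."""
    return urljoin(base, href)


# Hypothetical hrefs of the kind the scraper collects.
relative = absolute_url("viewallmessages.aspx?id=1")
rooted = absolute_url("/inbox.aspx")
already_absolute = absolute_url("https://www.pof.com/start.aspx")
```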
Step 3: Collect the content
We now need to go to each link and fetch the message content and user data.
Continuing to use the session object for all requests, we get:
def parse_all_messages(links):
    messages = []
    for link in links:
        comment_page = session.get(link)
        soup = BeautifulSoup(comment_page.text)
        for message in soup.find_all(attrs={'style': re.compile('width:500px.*')}):
            user = soup.find('span', 'username-inbox')
            user_image_url = soup.find('td', attrs={'width':"60px"}).img.attrs['src']
            messages.append(dict(user_username=clean_string(user.text),
                                 user_url=pof_url(user.a.attrs['href']),
                                 user_image_url=user_image_url,
                                 date=user.parent.find('div').text,
                                 message=clean_string(message.text)))
    return sorted(messages, key=lambda m: to_date(m['date']), reverse=True)
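The sort at the end relies on to_date, which is defined in the full script at the end of this post and parses timestamps with the format '%m/%d/%Y %I:%M:%S %p'. Here is a small sketch of how that ordering behaves. The sample dates are invented, and the AM/PM parsing assumes an English locale:

```python
from datetime import datetime


def to_date(date_string):
    # Same format string as in the full script: 12-hour clock with AM/PM.
    return datetime.strptime(date_string, '%m/%d/%Y %I:%M:%S %p')


# Invented sample messages; only the 'date' key matters here.
messages = [
    {'date': '3/14/2013 9:05:07 AM', 'message': 'morning'},
    {'date': '3/14/2013 9:05:07 PM', 'message': 'evening'},
    {'date': '12/1/2012 11:59:59 PM', 'message': 'last year'},
]

# Newest first, as in parse_all_messages.
newest_first = sorted(messages, key=lambda m: to_date(m['date']), reverse=True)
order = [m['message'] for m in newest_first]
```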
Step 4: Pretty Print the data
I have opted to use Jinja2 to render the HTML, but this is not at all necessary. Jinja2 is a simple templating library that is used in many Python web frameworks. See http://jinja.pocoo.org/docs/ for a more in-depth tutorial.
It’s fairly simple to use:
>>> from jinja2 import Template
>>> template = Template('Hello {{ name }}!')
>>> template.render(name='John Doe')
u'Hello John Doe!'
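One thing to keep in mind when the rendered values come from scraped pages: a Template created this way does not HTML-escape variables by default, so any markup inside a message would be written to the output as-is. Here is a minimal sketch of turning escaping on with the autoescape argument (the sample message is made up):

```python
from jinja2 import Template

# autoescape=True makes {{ ... }} HTML-escape its value.
template = Template('<div class="message">{{ message }}</div>', autoescape=True)

rendered = template.render(message='<b>hi</b> & bye')
```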
Be careful to properly encode your strings when using Jinja2. POF has some malformed characters, which required cleaning the strings with string.encode('ascii', 'ignore').
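To make the cleaning step concrete: encoding to ASCII with the 'ignore' error handler silently drops every character that has no ASCII equivalent. The sketch below is a Python 3 variant of the script's clean_string. On Python 3, encode returns bytes, so this version decodes back to text, which is an addition not present in the original script:

```python
def clean_string(string):
    # Drop non-ASCII characters, then decode so the result is text, not bytes.
    return string.encode('ascii', 'ignore').decode('ascii')


# Made-up sample containing an accented letter and a symbol.
cleaned = clean_string('caf\u00e9 \u2665 hi')
```

Note that this is lossy: accented letters are removed rather than transliterated.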
Step 5: Run it!
Below is the script in its entirety.
####################################################################
# my_pof_messages.py
#
# A simple script to scrape your pof messages and
# print them to single html file. Also outputs to json.
#
# Usage:
# sudo pip install beautifulsoup4 requests jinja2
# python my_pof_messages.py <username> <password> <output_prefix>
# firefox output_prefix.html
#
# Author:
# Ramin Rahkhamimov
# ramin32@gmail.com
# http://raminrakhamimov.com
#####################################################################
import requests
from bs4 import BeautifulSoup
import re
from jinja2 import Template
import json
import sys
from datetime import datetime
pof_url = lambda x: "https://www.pof.com/%s" % x
session = requests.session()
def append_message_links(e, links):
    soup = BeautifulSoup(e.text)
    for a in soup.find_all(attrs={'href': re.compile('viewallmessages.*')}):
        links.append(pof_url(a.attrs['href']))
    next_page = soup.find('a', text='Next Page')
    return next_page and pof_url(next_page.attrs['href'])
def get_all_message_links(username, password):
    links = []
    payload = dict(username=username,
                   password=password,
                   tfset="300",
                   callback="http%3a%2f%2fwww.pof.com%2finbox.aspx",
                   sid="ikdnixh1pblvis1dlqaa0mb3")
    e = session.post(pof_url("processLogin.aspx"), data=payload)
    next_page = append_message_links(e, links)
    while next_page:
        e = session.get(next_page)
        next_page = append_message_links(e, links)
    return set(links)
def clean_string(string):
    return string.encode('ascii', 'ignore')

def to_date(date_string):
    return datetime.strptime(date_string, '%m/%d/%Y %I:%M:%S %p')
def parse_all_messages(links):
    messages = []
    for link in links:
        comment_page = session.get(link)
        soup = BeautifulSoup(comment_page.text)
        for message in soup.find_all(attrs={'style': re.compile('width:500px.*')}):
            user = soup.find('span', 'username-inbox')
            user_image_url = soup.find('td', attrs={'width':"60px"}).img.attrs['src']
            messages.append(dict(user_username=clean_string(user.text),
                                 user_url=pof_url(user.a.attrs['href']),
                                 user_image_url=user_image_url,
                                 date=user.parent.find('div').text,
                                 message=clean_string(message.text)))
    return sorted(messages, key=lambda m: to_date(m['date']), reverse=True)
def save_messages(messages, prefix):
    template = Template("""
    <html>
        <head>
            <style>
                .user, .message, .date {
                    display: inline-block;
                    vertical-align: top;
                }
                .message {
                    width: 500px;
                    padding-left: 10px;
                }
            </style>
        </head>
        <body>
            <ol>
                {% for message in messages %}
                <li>
                    <a href="{{message.user_url}}" class="user">
                        <img src="{{message.user_image_url}}"/>
                        <div>
                            {{message.user_username}}
                        </div>
                    </a>
                    <div class="message">
                        {{message.message}}
                    </div>
                    <div class="date">
                        {{message.date}}
                    </div>
                </li>
                {% endfor %}
            </ol>
        </body>
    </html>
    """)
    with open('%s.html' % prefix, 'w') as f:
        f.write(template.render(messages=messages))
    with open('%s.json' % prefix, 'w') as f:
        f.write(json.dumps(messages))
if __name__ == '__main__':
    if len(sys.argv) != 4:
        print("Usage: my_pof_messages.py <username> <password> <output_prefix>")
        sys.exit(1)  # stop here instead of failing on a missing argument
    links = get_all_message_links(sys.argv[1], sys.argv[2])
    messages = parse_all_messages(links)
    save_messages(messages, sys.argv[3])
Install requests, beautifulsoup4 and jinja2, then run the script with Python. Depending on your inbox size, this may take a couple of minutes. Once the script has finished, open the newly created HTML file with your favorite browser:
sudo pip install requests beautifulsoup4 jinja2
python my_pof_messages.py your_username your_password output
firefox output.html
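Since the script also writes output.json next to the HTML file, the messages can be processed further without scraping again. Here is a rough sketch that groups messages by sender using the keys the script emits (user_username, date, message). messages_by_user is a hypothetical helper and the sample data is invented:

```python
import json
from collections import defaultdict


def messages_by_user(messages):
    """Group message dicts by their 'user_username' key."""
    grouped = defaultdict(list)
    for message in messages:
        grouped[message['user_username']].append(message)
    return dict(grouped)


# Invented sample in the same shape as the script's JSON output.
sample_json = json.dumps([
    {'user_username': 'alice', 'date': '3/14/2013 9:05:07 PM', 'message': 'hi'},
    {'user_username': 'bob', 'date': '3/15/2013 1:00:00 PM', 'message': 'hello'},
    {'user_username': 'alice', 'date': '3/16/2013 2:30:00 PM', 'message': 'again'},
])

grouped = messages_by_user(json.loads(sample_json))
counts = {user: len(msgs) for user, msgs in grouped.items()}
```

With a real run, calling json.load on the opened output.json file would replace the sample string.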
This script can be easily tweaked to be used with your favorite service provider.


