Friday, April 26, 2013

Capturing On-Line Data in Python With BeautifulSoup

No matter what package or language you use for most of your analysis, it might be a good idea to learn a little Python, if only for its HTML parsing tools. Here we'll look at the screen scraping library BeautifulSoup.

There is lots of data available on-line for analysis, but much more of it is stored in HTML tables than in fixed or delimited files. It can be tempting to manually copy data from a browser and paste it into a spreadsheet. But copying by hand means that any fixes, updates, or additions will require more hand editing. It's much better to write a script to acquire data from a table. Of course, any single grab of data from a website is quicker to do by hand than by finding and learning to use a screen scraping library. That's why you should take a minute right now to learn BeautifulSoup, so you are ready when the time comes.

Let's take a look at the current ESPN MLB Standings report. It's familiar data, so it should be easy to check that we're reading it correctly. And it's a clean page with only one table, so we won't need extra logic to make sure we're reading the right table. First let's just read the webpage into a list.

>>> import urllib2
>>> URL = "http://espn.go.com/mlb/standings"
>>> connection = urllib2.urlopen(URL) # open url
>>> siteTextL = connection.readlines() # read all data
>>> connection.close()  # close
>>> print len(siteTextL) # view length of list
654
>>> print siteTextL[:3] # view beginning of list
['<!DOCTYPE html>\n', '<html xmlns:fb="http://www.facebook.com/2008/fbml">\n', '<meta charset="iso-8859-1">\n']


Above we read the page into a list with one element per line of HTML text, and browse the start of the list. (By the way, recalling the post on prettifying, the command to prettify a temp.txt holding the console history was pygmentize -O full,style=colorful -f html -l pycon temp.txt.) Next we join that list back into one blob of text. Note that readlines() keeps the trailing newline on each element, so joining on the empty string is enough; connection.read() would also have given us the blob in one step.

>>> siteText = "".join(siteTextL) # convert list to text
>>> print "len(siteText)", len(siteText)
len(siteText) 66564
>>> siteText[:50]
'<!DOCTYPE html>\n<html xmlns:fb="http://www.faceboo'

Now that the text of the site is gathered, we can load it into a BeautifulSoup object. The BeautifulSoup findAll method then provides a list of all the tables in the page, and we check that the page really does have only one.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(siteText) # parse page into soup
>>> tableL = soup.findAll('table') # list all tables
>>> print "len(tableL)", len(tableL) # count tables.
len(tableL) 1

The only table in the list, tableL[0], is also a BeautifulSoup object, and using its findAll method we can get a list of rows. The list has 38 entries because there is a row for each of the 30 teams, 6 divisions, and 2 leagues. Each row is in turn a BeautifulSoup object whose findAll method returns a list of table data cells.

>>> type(tableL[0])
<class 'bs4.element.Tag'>
>>> type(tableL[0].findAll('tr'))
<type 'list'>
>>> len(tableL[0].findAll('tr'))
38
>>> type(tableL[0].findAll('tr')[2])
<class 'bs4.element.Tag'>
>>> tableL[0].findAll('tr')[2].findAll('td')[:3]
[<td align="left"><a href="http://espn.go.com/mlb/team/_/name/bos/boston-red-sox">Boston</a></td>, <td>14</td>, <td>7</td>]

Looking for 'table', then 'tr', then 'td', findAll has taken us all the way to the data. The elements of the list of table data cells are also BeautifulSoup objects, but instead of iterating further, we just extract what we are looking for. For a row with team data, the wins and losses are the getText of the [1] and [2] cells. The team's link comes from findAll('a') on the [0] cell, and the team name is the getText of that link. And if we wanted to crawl further through ESPN, we could get the URL for the team page with the link's get('href').

>>> wins = tableL[0].findAll('tr')[2].findAll('td')[1].getText()
>>> losses = tableL[0].findAll('tr')[2].findAll('td')[2].getText()
>>> map(str, (wins, losses))
['14', '7']
>>> link = tableL[0].findAll('tr')[2].findAll('td')[0].findAll('a')[0]
>>> name = link.getText()
>>> str(name)
'Boston'
>>> url = link.get('href')
>>> str(url)
'http://espn.go.com/mlb/team/_/name/bos/boston-red-sox'
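
As an aside, following that link is no harder than reading the standings page was. Here is a minimal sketch of the start of such a crawl, reusing the url we just extracted. What you would then dig out of teamSoup depends on the team page's layout, which we haven't examined, so treat this as illustration only.

>>> teamConnection = urllib2.urlopen(url) # open the team page
>>> teamText = teamConnection.read() # read() gives one blob directly
>>> teamConnection.close()
>>> teamSoup = BeautifulSoup(teamText) # parse the team page
>>> teamTableL = teamSoup.findAll('table') # its tables, ready to explore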

Putting it all together, we can easily load the team names, wins, and losses into lists.

>>> nameL, winL, lossL = [], [], [] # initialize lists
>>> for row in tableL[0].findAll('tr'):
    dataL = row.findAll('td')
    try:
        link = dataL[0].findAll('a')[0]
    except IndexError:
        continue # rows we want start with links
    nameL.append(str(link.getText()))
    winL.append(str(dataL[1].getText()))
    lossL.append(str(dataL[2].getText()))

>>> nameL[:3]
['Boston', 'Baltimore', 'NY Yankees']
>>> winL[:3]
['14', '12', '11']
>>> lossL[:3]
['7', '9', '9']
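
Finally, if this scrape will be rerun as the season goes on, it's worth pasting the steps into a small function in a script. Here is a minimal sketch; the function name and the parallel-list return value are my own choices rather than anything BeautifulSoup requires, and it assumes the page keeps its one-table layout.

import urllib2
from bs4 import BeautifulSoup

def scrapeStandings(url="http://espn.go.com/mlb/standings"):
    """Return parallel lists of team names, wins, and losses."""
    connection = urllib2.urlopen(url)
    soup = BeautifulSoup(connection.read()) # parse the whole page
    connection.close()
    nameL, winL, lossL = [], [], [] # parallel result lists
    for row in soup.findAll('table')[0].findAll('tr'):
        dataL = row.findAll('td')
        try:
            link = dataL[0].findAll('a')[0]
        except IndexError:
            continue # rows we want start with links
        nameL.append(str(link.getText()))
        winL.append(str(dataL[1].getText()))
        lossL.append(str(dataL[2].getText()))
    return nameL, winL, lossL

Calling scrapeStandings() returns the same three lists we built by hand above.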

Monday, April 1, 2013

Plots in R and Python

It makes sense for any analyst to learn R, if only for making graphs. Making graphs in R is very natural. Below we'll make the same simple plot of the same data, first in R and then in Python with Matplotlib. The data plotted is just the y = x + noise points generated in the script shown in the post on prettifying code and stored as sampleData.csv.txt. The idea here is just to show how easy it is to make plots.

First, the R script. 

# import a csv file to a data frame and plot
# setwd("...") # set the working directory to where the data is, if needed
sdf <- read.csv("sampleData.csv.txt") # load data frame
summary(sdf) # check that the file has been read
attach(sdf) # use the data frame's column names below
png("simpleRPlot.png", type='quartz') # prepare png (quartz works on a Mac)
plot(x, y, type="p", main="sampleData y by x R plot")
dev.off() # cleanup

And the Python script. 

"""
Import a csv file of x and y, and plot y by x.
"""
import numpy
import matplotlib.pyplot as plt

# load the sample data from a csv file into parallel
# arrays x and y
infn = "sampleData.csv.txt" 
x, y = numpy.loadtxt(infn, delimiter=",", \
                     skiprows=1, unpack=True)

fig = plt.figure() # open up a figure
fig.set_size_inches(5, 5) # set the size of the figure 
ax = fig.add_subplot(111) # put one plot on the figure
ax.plot(x, y, 'ko') # y by x, black (k) circles (o) 
ax.set_xlabel('x') # label the x axis 'x'
ax.set_ylabel('y') # label y axis 'y'
ax.set_title("sampleData y by x Python plot") 
# The following font size commands are to make the
# chart look like a chart made with the default
# settings in R.
ax.title.set_fontsize(11) 
for item in (ax.xaxis.label, ax.yaxis.label):
    item.set_fontsize(9)
for L in (ax.get_xticklabels(), ax.get_yticklabels()):
    for item in L:
        item.set_fontsize(9)

fig.savefig("simplePyPlot.png") # save the figure
plt.close() # cleanup; important in interactive sessions

The resulting plots are shown together below.


While the charts themselves are very similar, the R script is shorter than the Python script. Some of that is because the Python script adjusts font sizes to match the R chart. But some of it is because R makes good default assumptions about what a chart should include, such as labeling the x and y axes with the names of the plotted variables, that have to be set explicitly in Matplotlib.
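
One way to close some of that gap is to set Matplotlib's defaults once rather than adjusting every figure. Here is a minimal sketch using rcParams, with the same sizes used in the script above; with these in place, the font-size fiddling at the end of that script could be dropped.

import matplotlib.pyplot as plt

# set R-like font sizes once; they apply to every figure made afterward
plt.rcParams['axes.titlesize'] = 11  # chart title
plt.rcParams['axes.labelsize'] = 9   # x and y axis labels
plt.rcParams['xtick.labelsize'] = 9  # x tick labels
plt.rcParams['ytick.labelsize'] = 9  # y tick labels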