There is a lot of data available online for analysis, but far more of it is stored in HTML tables than in fixed-width or delimited files. It can be tempting to copy data from a browser and paste it into a spreadsheet, but a manual process means that every fix, update, or addition will require more manual editing. It's much better to write a script to acquire data from a table. Of course, for any single grab you could copy and paste faster than you could find and learn a screen-scraping library. That's why you should take a few minutes right now to learn BeautifulSoup, so you're ready when the time comes.
Let's take a look at the current ESPN MLB Standings report. It's familiar data, so it will be easy to check that we're reading it correctly. And it's a clean page with only one table, so the example doesn't need extra logic to pick out the right one. First, let's just read the webpage into a list.
>>> import urllib2
>>> URL = "http://espn.go.com/mlb/standings"
>>> connection = urllib2.urlopen(URL) # open url
>>> siteTextL = connection.readlines() # read all data
>>> connection.close() # close
>>> print len(siteTextL) # view length of list
654
>>> print siteTextL[:3] # view beginning of list
['<!DOCTYPE html>\n', '<html xmlns:fb="http://www.facebook.com/2008/fbml">\n', '<meta charset="iso-8859-1">\n']
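One caveat worth knowing: if the site is down or the URL is wrong, urllib2.urlopen raises an exception rather than returning a page. A minimal guard, sketched here as a script rather than as part of the session above, might look like this:
import urllib2

URL = "http://espn.go.com/mlb/standings"
try:
    connection = urllib2.urlopen(URL)   # open url
    siteTextL = connection.readlines()  # read all data
    connection.close()                  # close
except urllib2.URLError as e:           # raised on network and HTTP errors
    print "could not fetch %s: %s" % (URL, e)
    siteTextL = []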
Above we read the page into a list with one element per line of HTML text, and browse what's in the list. (By the way, recalling the post on prettifying, the line to prettify a temp.txt of console history was pygmentize -O full,style=colorful -f html -l pycon temp.txt.) Next we join the list into a single blob of text, rather than a list. Note that readlines() keeps each line's trailing newline, so joining on "\n" doubles them; that's harmless for parsing, as the output below shows.
>>> siteText = "\n".join(siteTextL) # convert list to text
>>> print "len(siteText)", len(siteText)
len(siteText) 67217
>>> siteText[:50]
'<!DOCTYPE html>\n\n<html xmlns:fb="http://www.facebo'
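Incidentally, if you never need the per-line list, the connection's read() method returns the whole page as one string, and the join step disappears. A sketch:
connection = urllib2.urlopen(URL)
siteText = connection.read()  # whole page as one string, no join needed
connection.close()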
Now that the text of the site is gathered, we can load it into a BeautifulSoup object. The BeautifulSoup findAll method then gives us a list of all the tables in the page, and we check that there is only one table in the page.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(siteText) # parse page into soup
>>> tableL = soup.findAll('table') # list all tables
>>> print "len(tableL)", len(tableL) # count tables.
len(tableL) 1
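We got lucky here: one page, one table. On a messier page, findAll accepts a dictionary of attributes to narrow the search. A sketch, where the class name is hypothetical (inspect the page source for the real one):
# 'standings' is a hypothetical class name -- check the page source
standingsTableL = soup.findAll('table', {'class': 'standings'})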
The only table in the list, tableL[0], is also a BeautifulSoup object, and using its findAll method we can get a list of rows. The length of the list is 38 because there is a row for each of the 30 teams, 6 divisions, and 2 leagues. Each row is also a BeautifulSoup object whose findAll method returns a list of table data cells.
>>> type(tableL[0])
<class 'bs4.element.Tag'>
>>> type(tableL[0].findAll('tr'))
<type 'list'>
>>> len(tableL[0].findAll('tr'))
38
>>> type(tableL[0].findAll('tr')[2])
<class 'bs4.element.Tag'>
>>> tableL[0].findAll('tr')[2].findAll('td')[:3]
[<td align="left"><a href="http://espn.go.com/mlb/team/_/name/bos/boston-red-sox">Boston</a></td>, <td>14</td>, <td>7</td>]
Calling findAll for 'table', 'tr', and 'td' has taken us all the way down to the data. The elements of the table-data list are also BeautifulSoup objects, but instead of iterating over them, we just extract what we are looking for. Working with the rows that hold team data, we get the wins and losses by calling getText on elements [1] and [2] of the list. We can get the link for the team using findAll('a') on element [0] of the team data. The team name is the getText of that link. And if we wanted to crawl further through ESPN, we could get the URL for the team page with a get('href') of the link.
>>> wins = tableL[0].findAll('tr')[2].findAll('td')[1].getText()
>>> losses = tableL[0].findAll('tr')[2].findAll('td')[2].getText()
>>> map(str, (wins, losses))
['14', '7']
>>> link = tableL[0].findAll('tr')[2].findAll('td')[0].findAll('a')[0]
>>> name = link.getText()
>>> str(name)
'Boston'
>>> url = link.get('href')
>>> str(url)
'http://espn.go.com/mlb/team/_/name/bos/boston-red-sox'
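As an aside, BeautifulSoup also has a find method that returns the first match directly, so the findAll('a')[0] chain can be written a little more compactly; here the first anchor in the row is the team link:
link = tableL[0].findAll('tr')[2].find('a')  # first <a> in the row
print link.getText(), link.get('href')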
Putting it all together, we can easily load the team names, wins, and losses into lists.
>>> nameL, winL, lossL = [[], [], []] # initialize lists
>>> for row in tableL[0].findAll('tr'):
...     dataL = row.findAll('td')
...     try:
...         link = dataL[0].findAll('a')[0]
...     except IndexError:
...         continue # rows we want start with links
...     nameL.append(str(link.getText()))
...     winL.append(str(dataL[1].getText()))
...     lossL.append(str(dataL[2].getText()))
...
>>> nameL[:3]
['Boston', 'Baltimore', 'NY Yankees']
>>> winL[:3]
['14', '12', '11']
>>> lossL[:3]
['7', '9', '9']
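From there it's a short step to land the data where a spreadsheet can use it. A sketch that writes the three lists to a csv file (the filename is arbitrary):
import csv

with open('standings.csv', 'wb') as f:  # 'wb' for the Python 2 csv module
    writer = csv.writer(f)
    writer.writerow(['team', 'wins', 'losses'])
    for name, wins, losses in zip(nameL, winL, lossL):
        writer.writerow([name, wins, losses])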