Thursday, May 23, 2013

Reading and Writing Excel Workbooks with Python

Lots of data is stored in Excel workbooks, so it's convenient to be able to read them. Workbooks can also be a great way of delivering results, since they make it easy for your clients to perform ad hoc checks. The PyWin32 (formerly win32all) module provides the ability to create and access many different types of Windows files on a Windows machine. But if you want to work with Excel files without Windows, you can use the Python Excel modules xlrd, xlwt, and xlutils for reading, writing, and filtering Excel files. Below we'll show simple examples of reading and writing Excel workbooks with xlrd and xlwt.

The packages have an excellent tutorial pdf, and the examples below are just a subset of what's shown there. Maybe the most interesting part of the tutorial is the last page, which shows the power of the packages by suggesting exercises an instructor might use for a workshop, including:
  • inserting a row into a worksheet
  • splitting a workbook into one file per worksheet
  • scanning a directory and reporting the location of error cells
Let's start by reading a workbook using the open_workbook function from xlrd. For this workbook, we know there are 3 worksheets.

>>> # get to directory with workbook
>>> import os
>>> os.chdir(os.path.expanduser("~/blog"))
>>> os.chdir("xlutils")
>>> # open workbook
>>> from xlrd import open_workbook 
>>> book = open_workbook("workbook.xls")
>>> print book.nsheets # check number of worksheets
3
>>> print book.sheet_names() # check sheet names
[u'letters', u'numbers', u'Sheet3']

And we can obtain the first worksheet using the sheet_by_index method on the workbook.

>>> sheet = book.sheet_by_index(0) # get first worksheet
>>> sheet.name # check name
u'letters'
>>> sheet.nrows # get number of rows
3
>>> sheet.ncols # and cols
1
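
If we wanted that overview for every worksheet at once, we could loop over them. Here is a minimal sketch, reusing the book object from above (book.sheets() returns the worksheets in index order):

# print the name and dimensions of each worksheet in the book
for s in book.sheets():
    print s.name, s.nrows, s.ncols
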
Finally, the values in the worksheet can be obtained with the worksheet's cell method. Using that method in a list comprehension over the number of rows already obtained, we load the first column of the worksheet into a list.

>>> val = sheet.cell(0, 0)
>>> val.value
u'a'
>>> valL = [sheet.cell(i, 0).value for i in range(sheet.nrows)]
>>> valL
[u'a', u'b', u'c']
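
The same pattern extends to the whole worksheet. Here is a minimal sketch, reusing the sheet object from above, that loads every cell into a list of rows; xlrd's row_values method gives the same result one row at a time:

# load every cell of the worksheet into a list of rows (a list of lists)
rowsL = [[sheet.cell(i, j).value for j in range(sheet.ncols)]
         for i in range(sheet.nrows)]
# equivalent, using the worksheet's row_values method
rowsL = [sheet.row_values(i) for i in range(sheet.nrows)]
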
Writing a workbook is even simpler. First we create a workbook with xlwt's Workbook class, and obtain a worksheet with the workbook's add_sheet method. We can write values to the sheet with its write method, whose arguments are (row, column, value). The row and column are zero-based, so row 0, column 0 is cell A1. Finally, we save the workbook with its save method.

>>> from xlwt import Workbook
>>> wb = Workbook() # obtain a new workbook
>>> ws = wb.add_sheet('s0') # obtain a new worksheet named s0
>>> ws.write(0, 0, '1') # write a '1' in the top left cell
>>> ws.write(0, 1, 1) # write a 1 in B1
>>> wb.save('workbook2.xls') # save workbook
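
Putting the reading and writing pieces together covers one of the tutorial exercises listed above. Here is a minimal sketch of splitting workbook.xls into one file per worksheet; the output file names are derived from the sheet names purely for illustration:

from xlrd import open_workbook
from xlwt import Workbook

book = open_workbook("workbook.xls")
for sheet in book.sheets():
    out = Workbook()
    ws = out.add_sheet(sheet.name)
    # copy every cell of this worksheet into the new workbook
    for i in range(sheet.nrows):
        for j in range(sheet.ncols):
            ws.write(i, j, sheet.cell(i, j).value)
    out.save("%s.xls" % sheet.name)  # e.g. letters.xls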

Friday, April 26, 2013

Capturing On-Line Data in Python With BeautifulSoup

No matter what package or language you use for most of your analysis, it might be a good idea to learn a little Python, if only for the html parsing tools. Here we'll look at the screen scraping library BeautifulSoup.

There is lots of data available on-line for analysis, but much more of it is stored in html tables than in fixed-width or delimited files. It can be tempting to copy data from a browser by hand and paste it into a spreadsheet, but then any fixes, updates, or additions will also have to be made by hand. It's much better to write a script to acquire data from a table. Of course, any single time you need to grab data from a website, doing it by hand is quicker than finding and learning a screen scraping library. That's why you should take a minute right now to learn BeautifulSoup, so you are ready when the time comes.

Let's take a look at the current ESPN MLB Standings report. It's familiar data, so it should be easy to check that we're reading it correctly. And it's a clean page with only one table, so we don't need extra logic to make sure we're looking at the right table. First let's just read the webpage into a list.

>>> import urllib2
>>> URL = "http://espn.go.com/mlb/standings"
>>> connection = urllib2.urlopen(URL) # open url
>>> siteTextL = connection.readlines() # read all data
>>> connection.close()  # close
>>> print len(siteTextL) # view length of list
654
>>> print siteTextL[:3] # view beginning of list
['<!DOCTYPE html>\n', '<html xmlns:fb="http://www.facebook.com/2008/fbml">\n', '<meta charset="iso-8859-1">\n']


Above we read the page into a list that has one element per line of html text, and browse what's in the list. (By the way, recalling the post on prettifying, the command used to prettify the temp.txt file holding the console history was pygmentize -O full,style=colorful -f html -l pycon temp.txt.) Next we join that list into a single blob of text.

>>> siteText = "\n".join(siteTextL) # convert list to text
>>> print "len(siteText)", len(siteText)
len(siteText) 67217
>>> siteText[:50]
'<!DOCTYPE html>\n\n<html xmlns:fb="http://www.facebo'

Now that the text of the site is gathered, we can load it into a BeautifulSoup object. The BeautifulSoup findAll method then provides a list of all the tables in the page, and we check that there is only one.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(siteText) # parse page into soup
>>> tableL = soup.findAll('table') # list all tables
>>> print "len(tableL)", len(tableL) # count tables.
len(tableL) 1

The only table in the list, tableL[0], is also a BeautifulSoup object, and using its findAll method we can get a list of rows. The length of the list is 38, because there is a row for each of the 30 teams, 6 divisions, and 2 leagues. Each row is also a BeautifulSoup object whose findAll method returns a list of table data cells.

>>> type(tableL[0])
<class 'bs4.element.Tag'>
>>> type(tableL[0].findAll('tr'))
<type 'list'>
>>> len(tableL[0].findAll('tr'))
38
>>> type(tableL[0].findAll('tr')[2])
<class 'bs4.element.Tag'>
>>> tableL[0].findAll('tr')[2].findAll('td')[:3]
[<td align="left"><a href="http://espn.go.com/mlb/team/_/name/bos/boston-red-sox">Boston</a></td>, <td>14</td>, <td>7</td>]

The findAll method, looking for 'table', 'tr', and 'td', has taken us all the way to the data. The elements of the list of table data are also BeautifulSoup objects, but instead of iterating over them, we just need to extract what we are looking for. Working with the rows that hold team data, we get the wins and losses by calling the getText method on the [1] and [2] elements of the list. We can get the link for the team using findAll('a') on the [0] element, and the team name is the getText of the link. And if we wanted to crawl further through ESPN, we could get the URL for the team page with a get('href') on the link.

>>> wins = tableL[0].findAll('tr')[2].findAll('td')[1].getText()
>>> losses = tableL[0].findAll('tr')[2].findAll('td')[2].getText()
>>> map(str, (wins, losses))
['14', '7']
>>> link = tableL[0].findAll('tr')[2].findAll('td')[0].findAll('a')[0]
>>> name = link.getText()
>>> str(name)
'Boston'
>>> url = link.get('href')
>>> str(url)
'http://espn.go.com/mlb/team/_/name/bos/boston-red-sox'
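
Since crawling further is mentioned above, here is a minimal sketch that collects every team's page URL into a dictionary keyed by team name, reusing the tableL parsed above:

# build a {team name: team page url} dictionary for later crawling
teamUrlD = {}
for row in tableL[0].findAll('tr'):
    dataL = row.findAll('td')
    try:
        link = dataL[0].findAll('a')[0]  # rows we want start with a link
    except IndexError:
        continue
    teamUrlD[str(link.getText())] = str(link.get('href'))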

Putting it all together, we can easily load the team names, wins, and losses into lists.

>>> nameL, winL, lossL = [[], [], []] # initialize lists
>>> for row in tableL[0].findAll('tr'):
...     dataL = row.findAll('td')
...     try:
...         link = dataL[0].findAll('a')[0]
...     except IndexError:
...         continue # rows we want start with links
...     nameL.append(str(link.getText()))
...     winL.append(str(dataL[1].getText()))
...     lossL.append(str(dataL[2].getText()))
...

>>> nameL[:3]
['Boston', 'Baltimore', 'NY Yankees']
>>> winL[:3]
['14', '12', '11']
>>> lossL[:3]
['7', '9', '9']
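
With the lists filled, delivering the results is straightforward. Here is a minimal sketch that writes the scraped standings to a csv file; the output file name standings.csv is just for illustration:

import csv

# write the scraped standings out as a csv file
with open("standings.csv", "wb") as outf:         # "wb" for the Python 2 csv module
    writer = csv.writer(outf)
    writer.writerow(["team", "wins", "losses"])   # header record
    writer.writerows(zip(nameL, winL, lossL))     # one record per team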

Monday, April 1, 2013

Plots in R and Python

It makes sense for any analyst to learn R if only for making graphs, which R makes very natural. Below we'll make the same simple plot, from the same data, in R and then in Python using Matplotlib. The data plotted is just the y = x + 1 + noise points generated by the script shown in the post on prettifying code and stored as sampleData.csv.txt. The idea here is just to show how easy it is to make plots.

First, the R script. 

# import a csv file to a data frame and plot
# setwd(might need to setwd to where data is) 
read.csv("sampleData.csv.txt") -> sdf # load dataframe
summary(sdf) # check that file has been read
attach(sdf) # use names in the dataframe below
png("simpleRPlot.png", type='quartz') # prepare png
plot(x, y, type="p", main="sampleData y by x R plot")
dev.off() # cleanup

And the Python script. 

"""
Import a csv file of x and y, and plot y by x.
"""
import numpy
import matplotlib.pyplot as plt

# load the sample data from a csv file into parallel
# arrays x and y
infn = "sampleData.csv.txt" 
x, y = numpy.loadtxt(infn, delimiter=",", \
                     skiprows=1, unpack=True)

fig = plt.figure() # open up a figure
fig.set_size_inches(5, 5) # set the size of the figure 
ax = fig.add_subplot(111) # put one plot on the figure
ax.plot(x, y, 'ko') # y by x, black (k) circles (o) 
ax.set_xlabel('x') # label the x axis 'x'
ax.set_ylabel('y') # label y axis 'y'
ax.set_title("sampleData y by x Python plot") 
# The following font size commands are to make the 
# chart look like a chart made with the default
# settings in R. 
ax.title.set_fontsize(11) 
for item in (ax.xaxis.label, ax.yaxis.label):
    item.set_fontsize(9)
for L in (ax.get_xticklabels(), ax.get_yticklabels()):
    for item in L:
        item.set_fontsize(9)

fig.savefig("simplePyPlot.png") # save the figure
plt.close() # cleanup important for interactive

The resulting plots are shown together below.


While the charts themselves are very similar, the R script is shorter than the Python script. Some of that is because the Python script includes font changes to make its chart look like the R chart. But some of it is because R makes good default assumptions about what a chart should include, such as how to label the x and y axes, that have to be set explicitly in Matplotlib.
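
As an aside, those per-item font calls in the Python script could instead be set once, up front. Here is a minimal sketch of the same idea using matplotlib's rcParams; the sizes are the ones used in the script above, not matplotlib defaults:

import matplotlib.pyplot as plt

# set the font sizes once, before any figures are created
plt.rcParams['axes.titlesize'] = 11   # chart title
plt.rcParams['axes.labelsize'] = 9    # x and y axis labels
plt.rcParams['xtick.labelsize'] = 9   # x tick labels
plt.rcParams['ytick.labelsize'] = 9   # y tick labels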

Tuesday, March 19, 2013

Using Pygments to prettify code for online

It's much easier to write code that's correct the first time using a modern editor (emacs, vim) or an IDE that provides syntax highlighting. Syntax highlighting also makes code more readable for discussions. Because whitespace is meaningful in Python, it's especially important for Python code to look right in a blog post, and simply pasting code text into html paragraphs will not render correctly. Many blog engines and wikis have tools to add syntax highlighting to articles, but they are all different. An alternative to learning several tools is to learn one tool that creates syntax-highlighted html representations of the code, which can be placed anywhere html can be placed. Below we discuss the highlighter Pygments.

Pygments is a code highlighter that generates output for a large variety of languages (including Bash, HTML, Java, Matlab, Python, and S, but unfortunately not SAS) in a variety of formats (including HTML, RTF, and LaTeX). It can be used in three ways:

  1. A Python script can import the pygments library and use its functions to format text. This is the most practical way to use pygments when generating html from a script (see the sketch after this list). 
  2. A script pygmentize makes the functionality of pygments available from the command line. This is the most natural way to add highlighting to one or a few files at a time, and it's what we'll show below.
  3. The Pygments homepage has a form at the bottom that allows users to enter code, select a language, and see what Pygments highlighting looks like. Users can elect for their examples to be stored in a database of examples, and also browse examples from other users. 
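
For the first approach, here is a minimal sketch that produces the same kind of standalone html document from a script; the input and output file names are just for illustration:

from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

# read the source, highlight it, and write a complete html document
code = open("genSample.py").read()
formatter = HtmlFormatter(full=True, style="colorful", linenos=True)
open("genSample.py.html", "w").write(highlight(code, PythonLexer(), formatter))
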
To see what pygmentize does, consider a simple Python script, genSample.py, that writes a csv file read by a simple R script, importData.r. Using the original Blogger editor and converting the end-of-line marks to <p> would produce a total mess, because Python code has so much meaningful whitespace. The new Compose mode of editing allows an author to simply paste the Python code in, and it generates the &nbsp; entities needed to make the indentation look correct. But we can get a highlighted html file from the shell by entering:

bash$ pygmentize -O full,style=colorful,linenos=1 -f html genSample.py > genSample.py.html

The resulting file is a complete html document, from which I've inserted below the text from the first <head> to the </body>. The arguments and options for pygmentize are explained when it's run with the --help option.

"""
Generate a csv sample file of x and y, where y = b + m*x + uniform noise
The name of the file, number of lines of output, intercept, slope, and
magnitude of noise are set in constants. 
"""
import random

# CONSTANTS
OUTF = open("sampleData.csv.txt", "w")
LINES = 20
OFFSET = 1 
SLOPE = 1
NOISE = .3

def gety(x, b=OFFSET, m=SLOPE, e=NOISE):
    """
    Return y for a given x
    """
    eps = random.uniform(-e, e) 
    y = m*x + b + eps
    return y

print >> OUTF, "x,y" # labels record
for i in range(LINES):
    x = i/10. # x takes on 0, 0.1, ... 1.9
    y = gety(x)
    print >> OUTF, "%f,%f" %(x, y)

OUTF.close()

Similarly, running

bash$ pygmentize -O full,style=colorful,linenos=1 -f html -l r importData.r > importData.r.html

provides the highlighted text below:

# import a csv file to a data frame and summary it
# the user might need to setwd to the directory where the data is
# setwd("/pathToDirectoryWithData") 
read.csv("sampleData.csv.txt") -> sdf
summary(sdf)

The line numbers and highlighting in the generated html make Python and R snippets easier to understand and discuss.

Free Software for Data Analysis

Data analysis was one of the first fields to embrace computing. In the 1960s, the first commercial statistical packages were developed, giving analysts access to a wide set of robust statistical procedures. Two of the most popular packages, SAS and SPSS, are still widely used today. The popularity of these packages facilitates collaboration, since you can find other users to discuss or share work with.

But you can't collaborate with just anyone, because they are commercial packages. Free statistical software allows for even greater collaboration, because you can give a script to anyone and they can run it without having to obtain a license. The packages we will focus on are R, a free implementation of the S language introduced by the commercial package S-PLUS, and the Python libraries (including SciPy, NumPy, and matplotlib). Both R and SciPy are available for Windows, Linux, and OS X.

In addition to the advantages of free software, there are some ease-of-use advantages. S-PLUS and R have long been popular with SAS users because it is so easy to make high-quality plots in them. The scripting language for SAS is older than C, and it shows, while S and Python are much more modern languages that are simpler to learn and develop with. The commercial packages are themselves modernizing: Python has been embedded as a scripting language for SPSS since 2005, and SAS began introducing elements of Java with Version 9.2 in 2008.