zero budget science: March 2013

It's much easier to write code that's correct the first time using a modern editor (emacs, vim) or an IDE that provides syntax highlighting. Syntax highlighting also makes code more readable for discussions. Because whitespace is meaningful in Python, it's especially important for Python code to look right in a blog post, and simply pasting code text into html paragraphs will not look right when rendered. Many blog engines and wikis have tools to add syntax highlighting to articles, but they are all different. An alternative to learning several tools is to learn one tool that creates syntax-highlighted html representations of the code, which can be placed anywhere html can be placed. Below we discuss the highlighter Pygments.

Pygments is a code highlighter that accepts input in a large variety of languages (including Bash, HTML, Java, Matlab, Python, and S, but unfortunately not SAS at the time of writing) and generates output in a variety of formats (including HTML, RTF, and LaTeX). It can be used in three ways:

A Python script can import the pygments library and use its functions to format text. This is the most practical way to use pygments if generating html from a script.
A script pygmentize makes the functionality of pygments available from the command line. This is the most natural way to add highlighting to one or a few files at a time, and it's what we'll show below.
The Pygments homepage has a form at the bottom that allows users to enter code, select a language, and see what Pygments highlighting looks like. Users can elect for their examples to be stored in a database of examples, and also browse examples from other users.
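The first of these, calling the library from a script, can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: the function name to_html_document and the SAMPLE string are hypothetical, and the fallback to a plain escaped <pre> block when Pygments is not installed is an addition of this sketch. The Pygments calls themselves (highlight, PythonLexer, HtmlFormatter) are part of the library.

```python
import html

try:
    from pygments import highlight
    from pygments.formatters import HtmlFormatter
    from pygments.lexers import PythonLexer
except ImportError:
    # Assumption of this sketch: degrade to plain text if Pygments is missing.
    highlight = None


def to_html_document(source):
    """Return a complete html document showing the Python source."""
    if highlight is None:
        # No colors, but <pre> keeps the indentation intact.
        return "<html><body><pre>%s</pre></body></html>" % html.escape(source)
    # full=True asks for a standalone document with the style sheet embedded;
    # style and linenos are the same settings used with pygmentize in this post.
    formatter = HtmlFormatter(full=True, style="colorful", linenos=True)
    return highlight(source, PythonLexer(), formatter)


# Hypothetical sample input; any Python source text would do.
SAMPLE = "for i in range(3):\n    print(i)\n"

page = to_html_document(SAMPLE)
print(page[:200])  # peek at the start of the generated document
```

A script that generates many pages could call a function like this once per source file and write each result to disk.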

To see what pygmentize does, consider a simple Python script, genSample.py, that writes a csv file read by a simple R script, importData.r. Using the original Blogger editor and converting the end-of-line marks to <p> would result in a total mess, because indentation is significant in Python and would be lost. The new Compose mode of editing allows an author to simply paste the Python code in, and it generates the &nbsp; entities needed to make the indentation look correct. But we can get a highlighted html file in the shell by entering the following:

bash$ pygmentize -O full,style=colorful,linenos=1 -f html genSample.py > genSample.py.html

The resulting file is a complete html document; below I've inserted its contents from the first <head> to the </body>. The arguments and options for pygmentize are listed when it's run with --help.

"""
Generate a csv sample file of x and y, where y = b + m*x + uniform noise
The name of the file, number of lines of output, intercept, slope, and
magnitude of noise are set in constants.
"""
import random

# CONSTANTS
OUTNAME = "sampleData.csv.txt"
LINES = 20
OFFSET = 1
SLOPE = 1
NOISE = .3

def gety(x, b=OFFSET, m=SLOPE, e=NOISE):
    """
    Return y for a given x
    """
    eps = random.uniform(-e, e)
    y = m*x + b + eps
    return y

# the with block closes the file even if an error occurs
with open(OUTNAME, "w") as outf:
    print("x,y", file=outf) # labels record
    for i in range(LINES):
        x = i/10. # x takes on 0, 0.1, ... 1.9
        y = gety(x)
        print("%f,%f" % (x, y), file=outf)

Similarly, running

bash$ pygmentize -O full,style=colorful,linenos=1 -f html -l r importData.r

provides the highlighted text below:

# import a csv file to a data frame and summarize it
# the user might need to setwd to the directory where the data is
# setwd("/pathToDirectoryWithData")
sdf <- read.csv("sampleData.csv.txt")
summary(sdf)

The line numbers and highlighting make Python and R snippets easier to understand and discuss.
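When the highlighted code is going into an existing page rather than a standalone file, cutting the relevant part out of a full document by hand can be avoided by asking Pygments for the fragment and the style sheet separately. The sketch below shows one way this might look. The function name highlight_fragment and the sample R text are hypothetical, and the plain <pre> fallback for a missing Pygments install is an assumption of this sketch. get_lexer_by_name("r") corresponds to the -l r option of pygmentize, and get_style_defs is the HtmlFormatter method that returns the CSS rules.

```python
import html

try:
    from pygments import highlight
    from pygments.formatters import HtmlFormatter
    from pygments.lexers import get_lexer_by_name
    HAVE_PYGMENTS = True
except ImportError:
    # Assumption of this sketch: keep working without Pygments installed.
    HAVE_PYGMENTS = False


def highlight_fragment(code, language="python", style="colorful"):
    """Return (css, fragment) suitable for pasting into an existing page."""
    if not HAVE_PYGMENTS:
        # Plain fallback: no style sheet, whitespace preserved by <pre>.
        return "", "<pre>%s</pre>" % html.escape(code)
    # Without full=True the formatter emits only the highlighted block.
    formatter = HtmlFormatter(style=style, linenos=True, cssclass="highlight")
    css = formatter.get_style_defs(".highlight")
    fragment = highlight(code, get_lexer_by_name(language), formatter)
    return css, fragment


# Hypothetical sample input in R.
R_SNIPPET = 'sdf <- read.csv("sampleData.csv.txt")\nsummary(sdf)\n'

css, fragment = highlight_fragment(R_SNIPPET, language="r")
print("<style>\n%s\n</style>" % css)
print(fragment)
```

The style rules only need to appear once per page, so several fragments highlighted with the same style can share one <style> block.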

Data analysis was one of the first fields to embrace computing. In the 1960s, the first commercial statistical packages were developed, giving analysts access to a wide set of robust statistical procedures. Two of the most popular packages, SAS and SPSS, are still widely used today. The popularity of these packages facilitates collaboration, since you can find other users with whom to discuss or share work.

But you can't collaborate with just anyone, because they are commercial packages and every collaborator needs a license. Free software statistical packages allow for even greater collaboration, because you can give a script to anyone and they can run it without having to obtain a license. The packages we will focus on are R, a free implementation of the S language that was commercialized as S-PLUS, and the Python libraries (including SciPy, NumPy, and matplotlib). Both R and SciPy are available for Windows, Linux, and OS X.

In addition to the advantages of free software, there are ease-of-use advantages. S-PLUS and R have long been popular with SAS users because it is so easy to make high-quality plots in them. The scripting language for SAS is older than C, and it shows, while S and Python are much more modern languages that are simpler to learn and to develop with. The commercial packages are themselves modernizing: Python has been embedded as a scripting language for SPSS since 2005, and SAS began introducing elements of Java with Version 9.2 in 2008.


Tuesday, March 19, 2013

Using Pygments to prettify code for online

Free Software for Data Analysis
