Sunday, April 22, 2018

SQL on pandas Tables with sqlite3

It can be extremely convenient to use SQL with pandas dataframes. Maybe you are working with pandas dataframes and want to implement logic that you are borrowing from another environment. Or you could be trying to demonstrate something about SQL and want a database populated with data that will work on your audience's desktops. The sqlite3 module makes this simple. The writeup on python.org is very good, but more examples are always better. Here we'll populate some dataframes with wikiHelp, load them into an sqlite3 database, and join them with sql. Of course, we can also join dataframes in pandas, so we will compare the results of the joins.
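As a preview of how little code this takes, here is the whole round trip on a throwaway dataframe (the demo table and its columns are just for illustration):

import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:') # throwaway database in RAM
demoDF = pd.DataFrame({"k": [1, 2], "v": ["a", "b"]})
demoDF.to_sql("demo", conn, index=False) # dataframe -> sql table
print(pd.read_sql("select * from demo where k = 2", conn)) # sql -> dataframe
conn.close()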

Loading dataframes

Here we populate dataframes with extracts from the Wikipedia pages List of countries by wealth per adult and List of sovereign states and dependent territories by continent.

from wikiHelp import WIKI, getWtable

# CONST
WEALTH = "List_of_countries_by_wealth_per_adult"
CONTINENTS = """List of sovereign states
and dependent territories by continent""" # make narrow for show
CONTINENT = "_".join(CONTINENTS.split()) # string for wiki

# get table 1 from the wealth-per-adult page
wealthDF = getWtable("%s%s" %(WIKI, WEALTH), tabNum=1) 
print (wealthDF.iloc[0])
# get table 3 (North America) from the continent page
noAmDF = getWtable("%s%s" %(WIKI, CONTINENT), tabNum=3)
print (noAmDF.iloc[0])
The first record from each dataframe is below.

wealthDF

  • country World
  • totalWealth 280,289
  • wealthPerAdult 56,541
  • medianWealthPerAdult 3,582
  • Name: 0, dtype: object

noAmDF

  • flag
  • name Anguilla
  • capital The Valley
  • status Overseas territory of the United Kingdom
  • Name: 0, dtype: object
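wikiHelp is a small helper module from an earlier post, so it won't be on most machines. A minimal stand-in built on pandas.read_html might look something like the sketch below; the WIKI url and the tabNum behavior are assumptions about what the helper does, and the real helper evidently also produces the tidy column names shown above, which this sketch does not.

import pandas as pd

WIKI = "https://en.wikipedia.org/wiki/" # base url the helper presumably prepends

def getWtable(url, tabNum=0):
    """Return table number tabNum from the page at url as a dataframe.
    A stand-in for the real helper; it does not rename columns."""
    return pd.read_html(url)[tabNum]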

Loading dataframes into an sqlite3 database

Below, we instantiate an sqlite3 connection and load our dataframes into it using the pandas to_sql method. The method requires a string table name and a connection, and we also supply index=False so the dataframe index is not stored as a column. The database is created in RAM just so that there is no file to clean up, but the code to create it on disk instead is shown in a comment.

import sqlite3
import pandas as pd # for read_sql below
conn = sqlite3.connect(':memory:') # create connection in ram
# conn = sqlite3.connect('example.db') # create conn on disk
wealthDF.to_sql("wealth", conn, index=False) # load wealthDF
noAmDF.to_sql("northAmerica", conn, index=False) # load noAmDF
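One wrinkle worth knowing: calling to_sql a second time against a table that already exists raises a ValueError by default. The if_exists argument controls that behavior:

# by default to_sql fails if the table exists; "replace" drops and
# recreates it, and "append" adds the new rows to the existing table
wealthDF.to_sql("wealth", conn, index=False, if_exists="replace")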
We can now access the northAmerica and wealth tables in the database with sql, or use pandas to access the same data in the noAmDF and wealthDF dataframes. We will use the pandas read_sql function, supplying it a string of sql and a connection. For some sql that's quick and easy to check, we will simply count the records in the tables with "select count(*) from tablename", which we can compare to the len of each dataframe.
qS = 'select count(*) as n from %s' # query string
for tableName, df in zip("wealth northAmerica".split() \
                                 , (wealthDF, noAmDF)):
    tableLen = pd.read_sql(qS %(tableName), conn).iloc[0].n
    print ("%s\t%d\t%d" %(tableName, tableLen, len(df)))
  • wealth 107 107
  • northAmerica 45 45
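For a quick sanity check, the same counts can also be pulled straight from the sqlite3 connection, with no pandas involved:

for tableName in "wealth northAmerica".split():
    n = conn.execute("select count(*) from %s" % tableName).fetchone()[0]
    print("%s\t%d" % (tableName, n)) # count straight from the cursor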
The counts match the dataframe lengths. The simplicity of the read_sql call is really amazing, so we'll look at it again with a join. Let's pull the wealth per adult and the capital only for countries in North America. The sql would be
select wealth.country, wealth.wealthPerAdult
    , continent.capital
from northAmerica continent
inner join wealth wealth
    on wealth.country = continent.name
In Python, we send that sql through read_sql and view the resulting dataframe.
qS = """select wealth.country, wealth.wealthPerAdult
    , continent.capital
from northAmerica continent
inner join wealth wealth
    on wealth.country = continent.name """
joinDF = pd.read_sql(qS, conn) # join 2 tables in sql
print(joinDF.to_string(index=False))
The Python to perform the same join on the original dataframes is below, and we can also check that the result matches the one from above (a quick way to do that follows the merge). The output shows the same entries in a different order, because a sql result set has no guaranteed row order unless we impose one with ORDER BY, and we made no effort to do so.
keepL = "country wealthPerAdult capital".split() # ordered field list 
mergeDF = wealthDF.merge(noAmDF, left_on="country", right_on="name" \
                         , how="inner")[keepL] # merge 2 dataframes
print(mergeDF.to_string(index=False))
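As mentioned above, we can check that the sql join and the pandas merge really do agree. One way (a sketch, assuming the shared country column is enough to sort on) is to order both results the same way, reset the index, and compare:

# sort both results identically, drop the old indexes, and compare
sqlSorted = joinDF.sort_values("country").reset_index(drop=True)
pandasSorted = mergeDF.sort_values("country").reset_index(drop=True)
print(sqlSorted.equals(pandasSorted)) # True when the two joins match

conn.close() # done with the in-memory database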
