Loading dataframes
Here we populate dataframes with extracts from Wikipedia List of 100 wealthiest countries and List of countries by continent.
from wikiHelp import WIKI, getWtable # CONST WEALTH = "List_of_countries_by_wealth_per_adult" CONTINENTS = """List of sovereign states and dependent territories by continent""" # make narrow for show CONTINENT = "_".join(CONTINENTS.split()) # string for wiki # get 1th table from wealth per adult wealthDF = getWtable("%s%s" %(WIKI, WEALTH), tabNum=1) print (wealthDF.iloc[0]) # get 3th continent table noAmDF = getWtable("%s%s" %(WIKI, CONTINENT), tabNum=3) print (noAmDF.iloc[0])The first record from each dataframe is below.
wealthDF
- country World
- totalWealth 280,289
- wealthPerAdult 56,541
- medianWealthPerAdult 3,582
- Name: 0, dtype: object
- flag
- name Anguilla
- capital The Valley
- status Overseas territory of the United Kingdom
- Name: 0, dtype: object
Loading dataframes into sqlite3 database Below, we instantiate an sqlite3 connection, and load our dataframes into it using the pandas to_sql method. The method requires a string table name and a connection, and we supply also an index=False. The database is created in RAM just so that there is no file to clean up, but the code to create instead on disk is shown.
import sqlite3 conn = sqlite3.connect(':memory:') # create connection in ram # conn = sqlite3.connect('example.db') # create conn on disk wealthDF.to_sql("wealth", conn, index=False) # load wealthDF noAmDF.to_sql("northAmerica", conn, index=False) # load noAmDFWe can now access the northAmerica and wealth tables in the database with sql, or use pandas to access the same data he noAmDF and wealthDF. We will use the pandas read_sql method, which we will supply with a string of sql and a connection. For some sql that's quick and easy to check, we will simply count the records in the tables with "select count(*) from tablename ", which we can compare to the len of the dataframe.
qS = 'select count(*) as n from %s' # query string for tableName, df in zip("wealth northAmerica".split() \ , (wealthDF, noAmDF)): tableLen = pd.read_sql(qS %(tableName), conn).iloc[0].n print ("%s\t%d\t%d" %(tableName, tableLen, len(df)))
- wealth 107 107
- northAmerica 45 45
select wealth.country, wealth.wealthPerAdult , continent.capital from northAmerica continent inner join wealth wealth on wealth.country = continent.nameIn Python, we send that sql through the read_sql and view the resulting dataframe.
qS = """select wealth.country, wealth.wealthPerAdult , continent.capital from northAmerica continent inner join wealth wealth on wealth.country = continent.name """ joinDF = pd.read_sql(qS, conn) # join 2 tables in sql print(joinDF.to_string(index=False))The Python to perform the same join on the original dataframes is below, were we also check that the resulting dataframe is identical to the one from above. The output shows the same entried in a different order because the database is unordered like a Python dictionary and we made no effort to impose an order on the output.
keepL = "country wealthPerAdult capital".split() # ordered field list mergeDF = wealthDF.merge(noAmDF, left_on="country", right_on="name" \ , how="inner")[keepL] # merge 2 dataframe print(mergeDF.to_string(index=False))
I made a notebook that uses a couple of these examples to show how to get set differences in SQL.
ReplyDeleteThis comment has been removed by the author.
DeleteThis comment has been removed by the author.
ReplyDelete