Edgar data with requests and bs4¶
THe SEC EDGAR archive is an immense store of structured data. There are afew libraries for accessing EDGAR and before evaluating them it seems like a good idea to try less specific tools. Below we use the requests and bs4 modules to pull a dataframe containing the CIK, company name, and location for an SIC.
Imports¶
In [14]:
# imports
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
# CONST
SIC = 6189 # the sic searched for later
PAGEMAX = 100 # max entries per page available at this facility
URL = "https://www.sec.gov/cgi-bin/browse-edgar" # base url
REQL = ["company=", "match=", "CIK=", "filenum=", "State=" \
, "Country=", "SIC=%04d", "owner=exclude", "Find=Find+Companies" \
, "action=getcompany", "count=%d", "start=%d"]
print("imports done")
pulling a single page¶
We can pull a single page with a url like the string below. This was modeled after the url from a browser session.
In [16]:
# reqS = "&".join(REQL) %(SIC, 100, 0)
reqS = "?".join((URL, "&".join(REQL) %(SIC, 100, 0)))
reqS
Out[16]:
Obtaining the table with requests, and finding the table with bs4¶
In [19]:
respS = requests.get(reqS).content.decode() # do search
soup = BeautifulSoup(respS, "lxml") # pull into bs4 structure
table = soup.findAll('table')[0] # find 0th table in bs4 structure
respDF = pd.read_html(str(table), header=0)[0] # get 0th dataframe from table
respDF.head()
Out[19]:
Function to grab pages till there are no more pages¶
In [21]:
def getSIC(sic, url=URL, pagesize=PAGEMAX):
"""
pull dataframes from url till they are empty,
then concat and return
"""
dfL = []
while 1:
time.sleep(1)
reqS = "?".join((url, "&".join(REQL) %(SIC, pagesize, pagesize*len(dfL))))
try:
respS = requests.get(reqS).content.decode()
except:
break
soup = BeautifulSoup(respS, "lxml")
try:
table = soup.findAll('table')[0]
except:
break
respDF = pd.read_html(str(tableL[0]), header=0)[0]
dfL.append(respDF)
if len(respDF) == 0:
break
return pd.concat(dfL)
Calling function and checking on it's length.¶
In [25]:
testDF = getSIC(SIC)
testDF.shape
Out[25]:
In [23]:
testDF.head()
Out[23]:
Available as a notebook.
ReplyDelete