Saturday, June 2, 2018

EDGAR Data with requests and bs4

The SEC EDGAR archive is an immense store of structured data. There are a few libraries for accessing EDGAR, but before evaluating them it seems like a good idea to try less specialized tools. Below we use the requests and bs4 modules to pull a dataframe containing the CIK, company name, and location for every registrant under a given SIC code.

Imports

In [14]:
# imports
import pandas as pd
import requests 
from bs4 import BeautifulSoup
import time

# CONST
SIC = 6189 # the SIC code searched for later
PAGEMAX = 100 # max entries per page from this endpoint
URL = "https://www.sec.gov/cgi-bin/browse-edgar" # base url
REQL = ["company=", "match=", "CIK=", "filenum=", "State=",
        "Country=", "SIC=%04d", "owner=exclude", "Find=Find+Companies",
        "action=getcompany", "count=%d", "start=%d"]
print("imports done")
imports done

Pulling a single page

We can pull a single page with a URL like the string below, modeled on the URL produced by a browser session.

In [16]:
reqS = "?".join((URL, "&".join(REQL) %(SIC, 100, 0)))
reqS
Out[16]:
'https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=6189&owner=exclude&Find=Find+Companies&action=getcompany&count=100&start=0'
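
As an aside, requests can assemble the query string itself from a dict passed to its params argument. A minimal sketch of the same request built that way, trimmed to the parameters that actually carry values (whether EDGAR needs the empty fields is an assumption I have not tested):

# sketch: let requests urlencode the query from a dict
params = {"action": "getcompany", "SIC": "%04d" % SIC,
          "owner": "exclude", "count": PAGEMAX, "start": 0}
resp = requests.get(URL, params=params)
print(resp.url) # should match the hand-built string, minus the empty fields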

Obtaining the page with requests and finding the table with bs4

In [19]:
respS = requests.get(reqS).content.decode() # do search
soup = BeautifulSoup(respS, "lxml") # pull into bs4 structure 
table = soup.findAll('table')[0] # find 0th table in bs4 structure 
respDF = pd.read_html(str(table), header=0)[0] # get 0th dataframe from table 
respDF.head()
Out[19]:
CIK Company State/Country
0 1738503 20 Times Square Trust 2018-20TS NY
1 1715687 280 Park Avenue 2017-280P Mortgage Trust NY
2 1740866 5 Bryant Park 2018-5BP Mortgage Trust NY
3 1053428 A I RECEIVABLES CORP NY
4 1094412 A.I. RECEIVABLES TRANSFER CORP. NJ
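
A caveat: EDGAR's fair access policy asks automated clients to identify themselves, and anonymous scripts may eventually see empty or refused responses. A sketch of the same request with a declared User-Agent (the contact string is a placeholder):

# sketch: identify the script to EDGAR; contact string is a placeholder
HEADERS = {"User-Agent": "research-script yourname@example.com"}
respS = requests.get(reqS, headers=HEADERS).content.decode()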

A function to grab pages until there are no more pages

In [21]:
def getSIC(sic, url=URL, pagesize=PAGEMAX):
    """
    pull dataframes from url until they are empty,
    then concat and return
    """
    dfL = []
    while True:
        time.sleep(1) # be polite to EDGAR between requests
        reqS = "?".join((url, "&".join(REQL) %(sic, pagesize, pagesize*len(dfL))))
        try:
            respS = requests.get(reqS).content.decode()
        except requests.RequestException:
            break
        soup = BeautifulSoup(respS, "lxml")
        tableL = soup.findAll('table')
        if not tableL: # no results table means we are past the last page
            break
        respDF = pd.read_html(str(tableL[0]), header=0)[0] # 0th table -> dataframe
        dfL.append(respDF)
        if len(respDF) == 0:
            break
    return pd.concat(dfL) if dfL else pd.DataFrame()

Calling the function and checking its length.

In [25]:
testDF = getSIC(SIC)
testDF.shape
Out[25]:
(11600, 3)
In [23]:
testDF.head()
Out[23]:
CIK Company State/Country
0 1738503 20 Times Square Trust 2018-20TS NY
1 1715687 280 Park Avenue 2017-280P Mortgage Trust NY
2 1740866 5 Bryant Park 2018-5BP Mortgage Trust NY
3 1053428 A I RECEIVABLES CORP NY
4 1094412 A.I. RECEIVABLES TRANSFER CORP. NJ
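
One small cleanup: pd.concat stitches together pages that each carry their own 0-99 index, so the combined frame has repeated index labels. A sketch:

# sketch: give the concatenated frame a fresh 0..N-1 index
testDF = testDF.reset_index(drop=True)

Alternatively, pd.concat(dfL, ignore_index=True) inside getSIC would do the same thing at the source.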