Edgar data with requests and bs4¶

THe SEC EDGAR archive is an immense store of structured data. There are afew libraries for accessing EDGAR and before evaluating them it seems like a good idea to try less specific tools. Below we use the requests and bs4 modules to pull a dataframe containing the CIK, company name, and location for an SIC.

Imports¶

In [14]:

# imports
import pandas as pd
import requests 
from bs4 import BeautifulSoup
import time

# CONST
SIC = 6189 # the sic searched for later
PAGEMAX = 100 # max entries per page available at this facility
URL = "https://www.sec.gov/cgi-bin/browse-edgar" # base url
REQL = ["company=", "match=", "CIK=", "filenum=", "State=" \
        , "Country=", "SIC=%04d", "owner=exclude", "Find=Find+Companies" \
        , "action=getcompany", "count=%d", "start=%d"] 
print("imports done")

imports done

pulling a single page¶

We can pull a single page with a url like the string below. This was modeled after the url from a browser session.

In [16]:

# reqS = "&".join(REQL)  %(SIC, 100, 0)
reqS = "?".join((URL, "&".join(REQL) %(SIC, 100, 0)))
reqS

Out[16]:

'https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=6189&owner=exclude&Find=Find+Companies&action=getcompany&count=100&start=0'

Obtaining the table with requests, and finding the table with bs4¶

In [19]:

respS = requests.get(reqS).content.decode() # do search
soup = BeautifulSoup(respS, "lxml") # pull into bs4 structure 
table = soup.findAll('table')[0] # find 0th table in bs4 structure 
respDF = pd.read_html(str(table), header=0)[0] # get 0th dataframe from table 
respDF.head()

Out[19]:

	CIK	Company	State/Country
0	1738503	20 Times Square Trust 2018-20TS	NY
1	1715687	280 Park Avenue 2017-280P Mortgage Trust	NY
2	1740866	5 Bryant Park 2018-5BP Mortgage Trust	NY
3	1053428	A I RECEIVABLES CORP	NY
4	1094412	A.I. RECEIVABLES TRANSFER CORP.	NJ

Function to grab pages till there are no more pages¶

In [21]:

def getSIC(sic, url=URL, pagesize=PAGEMAX):
    """
    pull dataframes from url till they are empty,
    then concat and return
    """    
    dfL = []
    while 1:
        time.sleep(1)
        reqS = "?".join((url, "&".join(REQL) %(SIC, pagesize, pagesize*len(dfL))))
        try:
            respS = requests.get(reqS).content.decode()
        except:
            break
        soup = BeautifulSoup(respS, "lxml")
        try:
            table = soup.findAll('table')[0]
        except:
            break
        respDF = pd.read_html(str(tableL[0]), header=0)[0]
        dfL.append(respDF)
        if len(respDF) == 0:
            break
    return pd.concat(dfL)

Calling function and checking on it's length.¶

In [25]:

testDF = getSIC(SIC)
testDF.shape

Out[25]:

(11600, 3)

In [23]:

testDF.head()

Out[23]:

	CIK	Company	State/Country
0	1738503	20 Times Square Trust 2018-20TS	NY
1	1715687	280 Park Avenue 2017-280P Mortgage Trust	NY
2	1740866	5 Bryant Park 2018-5BP Mortgage Trust	NY
3	1053428	A I RECEIVABLES CORP	NY
4	1094412	A.I. RECEIVABLES TRANSFER CORP.	NJ

zero budget science

Saturday, June 2, 2018

EDGAR Data with requests and bs4

Edgar data with requests and bs4¶

Imports¶

pulling a single page¶

Obtaining the table with requests, and finding the table with bs4¶

Function to grab pages till there are no more pages¶

Calling function and checking on it's length.¶

1 comment:

Followers

Blog Archive

About Me