Pandas Read_html Clean Up Before Or After Read
Solution 1:
It is always better to clean the original data, because any later processing can introduce artifacts. Your HTML table is built with spanned cells (colspan), which is why it is impossible to extract the data in a generic way if you clean the DataFrame after HTML parsing. So I suggest installing a small module intended for exactly this task: extracting data out of HTML tables. Run in your command line
pip install html-table-extractor
After this, get the raw HTML of the page (you will also need requests), process the table, and clean up the duplicate entries:
import requests
import pandas as pd
from collections import OrderedDict
from html_table_extractor.extractor import Extractor

pd.set_option('display.width', 400)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)

# get raw html
resp = requests.get('https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm')

# find last table
beg = resp.text.rfind('<table')
end = resp.text.rfind('</table')
html = resp.text[beg:end+8]

# process table
ex = Extractor(html)
ex.parse()
list_of_lines = ex.return_list()

# now you have some columns with recurrent values
df_dirty = pd.DataFrame(list_of_lines)
# print(df_dirty)

# we need to consolidate some columns
# find column names
names_line = 2
col_names = OrderedDict()
# for each column find repetitions
for el in list_of_lines[names_line]:
    col_names[el] = [i for i, x in enumerate(list_of_lines[names_line]) if x == el]

# now consolidate repetitive values
storage = OrderedDict()  # this will contain columns
for k in col_names:
    res = []
    for line in list_of_lines[names_line+1:]:  # first 2 lines are empty, third is column names
        joined = []  # <- this list will accumulate *unique* values to become a single cell
        for idx in col_names[k]:
            el = line[idx]
            if joined and joined[-1] == el:  # if the value is already in the cell, skip it
                continue
            joined.append(el)  # add unique value to cell
        res.append(''.join(joined))  # add cell to column
    storage[k] = res  # add column to storage

df = pd.DataFrame(storage)
print(df)
This produces the following result, which is very close to the original:
   Q1`17 Q2`17 Q3`17 Q4`17 FY 2017 Q1`18
0  (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands)
1  (Unaudited) (Unaudited) (Unaudited) (Unaudited) (Unaudited) (Unaudited)
2  Customer metrics
3  Customer accounts (1) 57,000+ 61,000+ 65,000+ 70,000+ 70,000+ 74,000+
4  Customer accounts added in period (1) 3,300+ 4,000+ 4,100+ 4,700+ 16,100+ 3,900+
5  Deals greater than $100,000 (2) 294 372 337 590 1,593 301
6  Customer accounts that purchased greater than $1 million during the quarter (1,2) 10151327137
8  Annual recurring revenue metrics
9  Total annual recurring revenue (3) $439,001 $483,578 $526,211 $596,244 $596,244 $641,946
10 Subscription annual recurring revenue (4) $71,950 $103,538 $139,210 $195,488 $195,488 $237,533
11
12 Geographic revenue metrics - ASC 606
13 United States and Canada — — — — — $167,799
14 International — — — — — $78,408
.. ... ... ... ... ... ... ...
23
24 Additional revenue metrics - ASC 606
25 Remaining performance obligations (5) — — — — $99,580 $114,523
26
27 Additional revenue metrics - ASC 605
28 Ratable revenue as % of total revenue (6) 54% 56% 63% 60% 59% 72%
29 Ratable license revenue as % of total license revenue (7) 19% 23% 34% 34% 28% 54%
30 Services revenues as a % of maintenance and services revenue (8) 12% 13% 12% 13% 13% 11%
31
32 Bookings metrics - ASC 605
33 Ratable bookings as % of total bookings (2) 55% 61% 65% 70% 64% 72%
34 Ratable license bookings as % of total license bookings (2) 26% 37% 45% 51% 41% 59%
35
36 Other metrics
37 Worldwide employees 3,193 3,305 3,418 3,489 3,489 3,663
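The consolidation loop above keeps only the first of each run of consecutive equal cell values, because colspan expansion duplicates a cell's text into every column it spans. The same idea can be expressed more compactly with itertools.groupby; a minimal sketch, independent of the SEC table (the sample values are illustrative):

```python
from itertools import groupby

def collapse_cells(cells):
    """Collapse consecutive duplicate cell values into one string.

    Colspan expansion repeats a cell's text across every column it
    spans, so keeping the first element of each consecutive run of
    equal values recovers the original cell content.
    """
    return ''.join(key for key, _ in groupby(cells))

# a value spread over two spanned columns collapses back to one
print(collapse_cells(['$439,001', '$439,001']))  # → $439,001
# distinct neighbours are joined, matching the answer's behaviour
print(collapse_cells(['a', 'a', 'b']))           # → ab
```

This is equivalent to the `joined`/`continue` logic in the answer, just written as a one-liner over each group of column positions.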
Solution 2:
The code below extracts the tables from the website using pd.read_html(). Additional parameters can be tuned depending on the table format.
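One such parameter is match: pd.read_html keeps only the tables whose text matches the given string or regex, which can be more robust than hard-coding a positional index like a[23]. A sketch with made-up inline HTML (the table contents here are illustrative, not from the SEC page):

```python
import pandas as pd
from io import StringIO

# two tables; only the one containing the match text is returned
html = """
<table><tr><td>unrelated</td></tr></table>
<table><tr><td>Worldwide employees</td><td>3,663</td></tr></table>
"""
tables = pd.read_html(StringIO(html), match='Worldwide employees')
print(len(tables))  # the non-matching table is filtered out
```

Note that read_html raises a ValueError when no table matches, so a too-specific pattern fails loudly rather than silently returning the wrong table.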
# Import libraries
import pandas as pd

# Read the tables
link = 'https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm'
a = pd.read_html(link, header=None, skiprows=1)

# Save the dataframe
df = a[23]

# Remove NaN rows/columns
col_list = df.iloc[1]
df = df.loc[4:, [0, 1, 3, 5, 7, 9, 11]]  # keep only the data-bearing columns
df.columns = col_list[:len(df.columns)]
df.head(7)
Note: empty cells in the original table are replaced with NaNs.
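Those leftover all-NaN rows and columns can also be dropped wholesale with dropna using how='all'. A minimal sketch on a toy frame imitating the parsed layout (the values and spacer positions are illustrative):

```python
import pandas as pd

# toy frame: a spacer column and a spacer row that are entirely empty
df = pd.DataFrame({
    0: ['Customer accounts (1)', None, 'Worldwide employees'],
    1: [None, None, None],   # spacer column from the HTML layout
    2: ['57,000+', None, '3,193'],
})

# drop columns that are entirely NaN, then rows that are entirely NaN
clean = df.dropna(axis=1, how='all').dropna(axis=0, how='all')
print(clean.shape)  # → (2, 2)
```

With how='all' only fully empty rows/columns are removed, so legitimate cells that happen to sit next to a NaN are preserved.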