Can I Execute A Function In "apply" To Pandas Dataframe Asynchronously?
I have a pandas dataframe, and I would like to execute a function on each row. However, the function includes an I/O call to a remote server, and thus it is very slow if I call it simp
Solution 1:
An asynchronous I/O approach using the well-known asyncio and aiohttp libraries, demonstrated on a sample DataFrame with simple webpage-processing routines (to show the mechanics of the approach).
Let's say we need to count all header, link (<a>) and span tags across all the URLs and store the resulting counters in the source dataframe.
import asyncio

import aiohttp
import pandas as pd
from bs4 import BeautifulSoup

def count_headers(html):
    return len(html.select('h1,h2,h3,h4,h5,h6'))

def count_links(html):
    return len(html.find_all('a'))

def count_spans(html):
    # The tag name is 'span' (singular). The sample output below was produced
    # with 'spans', which matches no tag, hence the zeros in span_c.
    return len(html.find_all('span'))

df = pd.DataFrame({'id': [1, 2, 3],
                   'url': ['https://stackoverflow.com/questions',
                           'https://facebook.com',
                           'https://wiki.archlinux.org']})
df['head_c'], df['link_c'], df['span_c'] = [None, None, None]
# print(df)

async def process_url(df, url):
    # Download the page without blocking the other requests
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            content = await resp.text()
    soup = BeautifulSoup(content, 'html.parser')
    headers_count = count_headers(soup)
    links_count = count_links(soup)
    spans_count = count_spans(soup)
    print("Done")
    # Write the counters into the row(s) that hold this URL
    df.loc[df['url'] == url, ['head_c', 'link_c', 'span_c']] = \
        [[headers_count, links_count, spans_count]]

async def main(df):
    # Schedule one coroutine per URL and wait for all of them
    await asyncio.gather(*[process_url(df, url) for url in df['url']])
    print(df)

# asyncio.run creates the event loop, runs main to completion and closes the loop
asyncio.run(main(df))
The output:
Done
Done
Done
   id                                  url head_c link_c span_c
0   1  https://stackoverflow.com/questions     25    306      0
1   2                 https://facebook.com      3     55      0
2   3           https://wiki.archlinux.org     15     91      0
Enjoy the performance difference.
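Because asyncio.gather returns its results in the same order as the awaitables it was given, the results can also be assigned back as a whole column instead of being matched by URL inside each task. The following is only a minimal sketch of that variation: fetch_length is a hypothetical stand-in for the remote call (asyncio.sleep simulates the latency), and the cap of 10 concurrent calls is an assumption, not something from the answer above.

```python
import asyncio

import pandas as pd


async def fetch_length(url: str, limit: asyncio.Semaphore) -> int:
    """Hypothetical stand-in for a slow remote call."""
    async with limit:              # at most `max_concurrent` calls run at once
        await asyncio.sleep(0.1)   # simulates network latency
        return len(url)


async def apply_async(df: pd.DataFrame, max_concurrent: int = 10) -> list:
    limit = asyncio.Semaphore(max_concurrent)
    tasks = [fetch_length(url, limit) for url in df["url"]]
    # gather preserves the order of the awaitables, so results line up with rows
    return await asyncio.gather(*tasks)


df = pd.DataFrame({"id": [1, 2, 3],
                   "url": ["https://stackoverflow.com/questions",
                           "https://facebook.com",
                           "https://wiki.archlinux.org"]})
df["url_len"] = asyncio.run(apply_async(df))
print(df)
```

The semaphore is optional, but with many rows it keeps the number of simultaneous requests to the remote server bounded.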
Solution 2:
Another idea for an optimized pandas apply is to use Swifter: https://github.com/jmcarpenter2/swifter
Solution 3:
The solution using dask is simple:
import dask.dataframe as dd

npartitions = 24  # number of chunks the dataframe is split into
# f is the function to run on each row
dd.from_pandas(df, npartitions=npartitions).apply(f, meta=list, axis=1).compute()
A more common use-case might be for the apply function to return a dataframe. For that, return pd.Series({'x': x, 'y': y, 'z': z}) from your function and then pass e.g. meta={'x': float, 'y': float, 'z': float} to apply.