Why Is My Python Dataframe Performing So Slowly
Solution 1:
iterrows doesn't take advantage of vectorized operations. Most of the benefits of using pandas come from vectorized and parallel operations.
Replace for index, row in df_wf.iterrows(): with df_wf.apply(something, axis=1) where something is a function that encapsulates the logic you needed from iterrows, and uses numpy vectorized operations.
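For instance, here is a hedged sketch of that migration (the frame and column names a and b are invented, since the original dataframe isn't shown in the question):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for df_wf; the real columns are not shown in the question
df_wf = pd.DataFrame({"a": np.arange(5), "b": np.arange(5) * 10})

# slow: explicit row-wise iteration with iterrows
tot = 0
for index, row in df_wf.iterrows():
    tot += row["a"] + row["b"]

# better: apply a function across rows (still row-wise, but with less overhead)
tot_apply = df_wf.apply(lambda row: row["a"] + row["b"], axis=1).sum()

# best: a fully vectorized expression over whole columns
tot_vec = (df_wf["a"] + df_wf["b"]).sum()

# all three compute the same aggregate
assert tot == tot_apply == tot_vec
```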
Also if your df doesn't fit in memory such that you need to batch read, consider using dask or spark over pandas.
Further reading: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html
Solution 2:
a few comments about your code:

- all those global variables are scaring me! what's wrong with passing parameters and returning state?
- you're not using any functionality from pandas; creating a dataframe just to do a dumb iteration over its rows forces it to do lots of unnecessary work
- the standard csv module (which can be used with delimiter='|') provides a much closer interface if row-by-row processing really is the best way to do this
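for example, a minimal sketch of reading pipe-delimited data with the stdlib csv module (the sample data and field names here are invented, since the original file isn't shown):

```python
import csv
import io

# hypothetical pipe-delimited input standing in for the original file
data = io.StringIO("name|value\nalpha|1\nbeta|2\n")

# DictReader yields one dict per row, keyed by the header line
reader = csv.DictReader(data, delimiter="|")
rows = list(reader)
```

this streams rows one at a time, so it never builds a whole-table structure the way pandas does.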
this might be a better question for https://codereview.stackexchange.com/
just playing with performance of alternative ways of working row wise. the take home from the below seems to be that working "row wise" is basically always slow with Pandas
start by creating a dataframe to test this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1, 1e6, (10_000, 2)))
df[1] = df[1].apply(str)
this takes 3.65 ms to create a dataframe with int and str columns. next I try the iterrows approach:
tot = 0
for i, row in df.iterrows():
    tot += row[0] / 1e5 < len(row[1])
the aggregation is pretty dumb; I just wanted something that uses both columns. it takes a scarily long 903 ms. next I try iterating manually:
tot = 0
for i in range(df.shape[0]):
    tot += df.loc[i, 0] / 1e5 < len(df.loc[i, 1])
which reduces this down to 408 ms. next I try apply:
def fn(row):
    return row[0] / 1e5 < len(row[1])
sum(df.apply(fn, axis=1))
which is basically the same at 368 ms. finally, I find some code that Pandas is happy with:
sum(df[0] / 1e5 < df[1].apply(len))
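as an aside, the per-element apply(len) can itself be replaced with the vectorized .str accessor; a quick sketch (timings will vary by machine, so none are claimed here):

```python
import numpy as np
import pandas as pd

# same setup as above: one int column, one str column
df = pd.DataFrame(np.random.randint(1, 1_000_000, (10_000, 2)))
df[1] = df[1].apply(str)

# .str.len() computes string lengths column-wide, avoiding a Python-level call per row
tot = (df[0] / 1e5 < df[1].str.len()).sum()

# both forms count the same rows
assert tot == sum(df[0] / 1e5 < df[1].apply(len))
```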
which takes 4.15 ms. and another approach that occurred to me:
tot = 0
for a, b in zip(df[0], df[1]):
tot += a / 1e5 < len(b)
which takes 2.78 ms. while another variant:
tot = 0
for a, b in zip(df[0] / 1e5, df[1]):
tot += a < len(b)
takes 2.29 ms.