Why Is My Python Dataframe Performing So Slowly
Solution 1:
`iterrows` doesn't take advantage of vectorized operations. Most of the benefits of using pandas come from vectorized and parallel operations.

Replace

```python
for index, row in df_wf.iterrows():
```

with

```python
df_wf.apply(something, axis=1)
```

where `something` is a function that encapsulates the logic you need from `iterrows` and uses numpy vectorized operations.

Also, if your `df` doesn't fit in memory and you need to read it in batches, consider using `dask` or `spark` instead of pandas.
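To make this concrete, here's a minimal sketch of the difference (the frame `df_wf`, its columns `a`/`b`, and the row function `something` are invented for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for df_wf with two numeric columns
df_wf = pd.DataFrame({"a": np.arange(5), "b": np.arange(5, 10)})

# slow: explicit row-wise iteration
tot_rows = 0
for index, row in df_wf.iterrows():
    tot_rows += row["a"] * row["b"]

# better: apply a row function along axis=1
def something(row):
    return row["a"] * row["b"]

tot_apply = df_wf.apply(something, axis=1).sum()

# best: one vectorized expression over whole columns
tot_vec = (df_wf["a"] * df_wf["b"]).sum()
```

All three give the same answer; only the last lets numpy do the looping in C.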
Further reading: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html
Solution 2:
A few comments about your code:

- all those `global` variables are scaring me! what's wrong with passing parameters and returning state?
- you're not using any functionality from Pandas; creating a dataframe just to do a dumb iteration over rows causes lots of unnecessary work
- the standard `csv` module (which can be used with `delimiter='|'`) provides a much closer interface, if this is really the best way you can find to do this

This might be a better question for https://codereview.stackexchange.com/
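For comparison, reading pipe-delimited rows with the standard `csv` module looks like this (the sample data here is invented, standing in for a real file):

```python
import csv
import io

# invented pipe-delimited sample, in an in-memory "file"
data = io.StringIO("name|score\nalice|3\nbob|7\n")

reader = csv.reader(data, delimiter="|")
header = next(reader)  # first row: the column names
rows = list(reader)    # remaining rows, as lists of strings
```

No dataframe is built, so iterating the rows costs only what the iteration itself costs.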
Just playing with the performance of alternative ways of working row-wise. The take-home from the below seems to be that working "row wise" is basically always slow with Pandas.

Start by creating a dataframe to test this:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 1e6, (10_000, 2)))
df[1] = df[1].apply(str)
```
this takes 3.65 ms to create a dataframe with `int` and `str` columns. next I try the `iterrows` approach:
```python
tot = 0
for i, row in df.iterrows():
    tot += row[0] / 1e5 < len(row[1])
```
the aggregation is pretty dumb, I just wanted something that uses both columns. it takes a scarily long 903 ms. next I try iterating manually:
```python
tot = 0
for i in range(df.shape[0]):
    tot += df.loc[i, 0] / 1e5 < len(df.loc[i, 1])
```
which reduces this down to 408 ms. next I try `apply`:
```python
def fn(row):
    return row[0] / 1e5 < len(row[1])

sum(df.apply(fn, axis=1))
```
which is basically the same at 368 ms. finally, I find some code that Pandas is happy with:
```python
sum(df[0] / 1e5 < df[1].apply(len))
```
which takes 4.15 ms. and another approach that occurred to me:
```python
tot = 0
for a, b in zip(df[0], df[1]):
    tot += a / 1e5 < len(b)
```
which takes 2.78 ms. while another variant:
```python
tot = 0
for a, b in zip(df[0] / 1e5, df[1]):
    tot += a < len(b)
```
takes 2.29 ms.
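If you want to reproduce these comparisons yourself, a rough harness along these lines works (`timeit` with `number=1` here; the absolute numbers will differ per machine, but the ordering should hold):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 1_000_000, (10_000, 2)))
df[1] = df[1].apply(str)

def row_wise():
    # the slow iterrows version from above
    tot = 0
    for i, row in df.iterrows():
        tot += row[0] / 1e5 < len(row[1])
    return tot

def zipped():
    # the fast zip-over-columns version from above
    tot = 0
    for a, b in zip(df[0], df[1]):
        tot += a / 1e5 < len(b)
    return tot

# one pass each; iterrows pays for building a Series per row
t_rows = timeit.timeit(row_wise, number=1)
t_zip = timeit.timeit(zipped, number=1)
```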