
Why Is My Python Dataframe Performing So Slowly

I'm building an application that provides some very simple analysis on large datasets. These datasets are delivered in CSV files of 10+ million rows with about 30 columns.

Solution 1:

iterrows doesn't take advantage of vectorized operations. Most of the benefits of using pandas come from vectorized and parallel operations.

Replace for index, row in df_wf.iterrows(): with df_wf.apply(something, axis=1) where something is a function that encapsulates the logic you needed from iterrows, and uses numpy vectorized operations.
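As a rough sketch of what that replacement can look like (the df_wf column names here are invented for illustration, not taken from the question):

import pandas as pd

# hypothetical stand-in for df_wf; "price" and "quantity" are made-up columns
df_wf = pd.DataFrame({"price": [1.5, 2.0, 3.25], "quantity": [10, 4, 7]})

# row-wise apply in place of the explicit iterrows loop
df_wf["total"] = df_wf.apply(lambda row: row["price"] * row["quantity"], axis=1)

# a fully vectorized column expression avoids the per-row Python call entirely
df_wf["total"] = df_wf["price"] * df_wf["quantity"]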

Also, if your df doesn't fit in memory and you need to read it in batches, consider using dask or spark instead of pandas.
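A minimal dask sketch for that case might look like the following (the file name, delimiter, and column names are assumptions, not details from the question):

import dask.dataframe as dd

# lazily read a pipe-delimited CSV that is too big for memory;
# dask splits it into partitions and processes them in parallel
ddf = dd.read_csv("big_file.csv", sep="|")

# mostly the same API as pandas; .compute() triggers the actual work
result = ddf.groupby("some_column")["some_value"].mean().compute()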

Further reading: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html

Solution 2:

a few comments about your code:

  • all those global variables are scaring me! what's wrong with passing parameters and returning state?
  • you're not using any functionality from Pandas; creating a dataframe just to do a dumb iteration over its rows causes a lot of unnecessary work
  • the standard csv module (which can be used with delimiter='|') is a much closer fit if row-wise processing really is the best way to do this (see the sketch just below this list)
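
a minimal sketch of that plain-csv approach (the file name and column positions are assumptions for illustration):

import csv

tot = 0
with open("big_file.csv", newline="") as f:
    reader = csv.reader(f, delimiter="|")
    next(reader)  # skip the header row
    for row in reader:
        # row is a plain list of strings; index the columns you need directly
        tot += float(row[0]) / 1e5 < len(row[1])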

this might be a better question for https://codereview.stackexchange.com/

just playing with the performance of alternative ways of working row-wise. the take-home from the below seems to be that working "row-wise" is basically always slow with Pandas

start by creating a dataframe to test this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 1e6, (10_000, 2)))
df[1] = df[1].apply(str)

this takes 3.65 ms to create a dataframe with int and str columns. next I try the iterrows approach:

tot = 0
for i, row in df.iterrows():
    tot += row[0] / 1e5 < len(row[1])

the aggregation is pretty dumb, I just wanted something that uses both columns. it takes a scarily long 903 ms. next I try iterating manually:

tot = 0
for i in range(df.shape[0]):
    tot += df.loc[i, 0] / 1e5 < len(df.loc[i, 1])

which reduces this down to 408 ms. next I try apply:

def fn(row):
    return row[0] / 1e5 < len(row[1])

sum(df.apply(fn, axis=1))

which is basically the same at 368 ms. finally, I find some code that Pandas is happy with:

sum(df[0] / 1e5 < df[1].apply(len))

which takes 4.15 ms. and another approach that occurred to me:

tot = 0
for a, b in zip(df[0], df[1]):
    tot += a / 1e5 < len(b)

which takes 2.78 ms. while another variant:

tot = 0
for a, b in zip(df[0] / 1e5, df[1]):
    tot += a < len(b)

takes 2.29 ms.
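
if you want to reproduce timings like these without a notebook, a sketch using the standard timeit module might look like this (it just repeats the same test frame built at the top of this answer):

import timeit

import numpy as np
import pandas as pd

# same test frame as above: an int column and a str column
df = pd.DataFrame(np.random.randint(1, 1_000_000, (10_000, 2)))
df[1] = df[1].apply(str)

def fastest_variant():
    # the last variant from above: divide the whole column once, then zip
    tot = 0
    for a, b in zip(df[0] / 1e5, df[1]):
        tot += a < len(b)
    return tot

# average over repeated runs for a stable per-call estimate
n = 100
print(timeit.timeit(fastest_variant, number=n) / n)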
