
Memory Efficient Way For List Comprehension Of Pandas Dataframe Using Multiple Columns

I want to run a function on the rows of a pandas dataframe in a list comprehension. The dataframe can have a varying number of columns. How can I make use of these columns? import pan

Solution 1:

The code shown is extremely memory efficient, and should be faster than an iterrows-based solution.

But from your comment, it is not that code that causes the memory error. The problematic code is:

df[list(df.columns.values)].values

or:

df[list(df.columns.values)].to_numpy(copy=False)

because both involve a full copy of the dataframe values unless all columns have the same dtype.
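A small sketch of why the dtype matters, assuming a hypothetical three-column frame (the column names are borrowed from the example further down this thread). Converting a mixed-dtype frame to one array forces a common dtype, so a new array has to be built, while a single column keeps its own dtype.

```python
import pandas as pd

# Hypothetical sample data; not taken from the original question.
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [10000, 10100, 12000],
    "S1": [1, 1, 1],
})

# One 2-D array for the whole frame: strings and integers need a common dtype.
whole = df.to_numpy()
print(whole.dtype)  # object

# One 1-D array for a single column: no upcasting is needed.
start = df["start"].to_numpy()
print(start.dtype.kind)  # i (an integer dtype)
```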

If you want to process an unknown number of columns, the safe way is:

[func(row) for row in zip(*[df[i].values for i in df.columns])]

No copy is required here because df[i].values will return the underlying numpy arrays.
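As a minimal sketch of the per-column approach, assuming a hypothetical func that sums the sample columns: the column arrays are unpacked into zip() with *, so each row arrives as a plain tuple with one value per column.

```python
import pandas as pd

# Sample data modelled on the output shown later in this thread.
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [10000, 10100, 12000],
    "S1": [1, 1, 1],
    "S2": [2, 2, 2],
    "S3": [3, 3, 3],
})

def func(row):
    # Hypothetical row function: sum everything after chrom and start.
    return sum(row[2:])

# One array per column, whatever the number of columns.
columns = [df[name].to_numpy() for name in df.columns]

# zip(*columns) walks the arrays in parallel and yields one tuple per row.
results = [func(row) for row in zip(*columns)]
print(len(results))  # 3
```

Without the *, zip() would receive a single list and yield one 1-tuple per column instead of one tuple per row.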


By the way, if you only need to iterate over the returned values once, you can save some memory by using a generator instead of a list:

(func(row) for row in zip(*[df[i].values for i in df.columns]))

Solution 2:

Thanks for your answers.

In the meantime, I found the following solution:

df_columns = list(df.columns.values)
[func_using_list_comp(
                row,
                var1,
                var2,
                var3,
                ...,
                df_columns) for row in df[df_columns].values]

This way, I did not need the zip function, and it works for any number of columns.

I hope this is also memory efficient. By the way, I'm accumulating results in var1, var2, var3 each time I process a row.

If I use a generator instead of a list, how much will it affect my memory usage, and will I still get all the accumulated data after processing all rows?

I ask because I return var1, var2, var3 after all rows are processed.
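On the generator question, here is a rough sketch under stated assumptions: the function body and the single totals dictionary are hypothetical stand-ins for func_using_list_comp and var1, var2, var3. A generator expression runs nothing until it is consumed, so the accumulators are only filled once the generator has been iterated to the end.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"start": [10000, 10100, 12000], "S1": [1, 1, 1]})
df_columns = list(df.columns)

def func_using_list_comp(row, totals, df_columns):
    # Hypothetical body: keep a running total per column.
    for name, value in zip(df_columns, row):
        totals[name] = totals.get(name, 0) + int(value)

totals = {}
gen = (func_using_list_comp(row, totals, df_columns)
       for row in df[df_columns].to_numpy())

before = dict(totals)
print(before)  # {} because the generator has not run yet

# Consuming the generator is what actually processes the rows.
for _ in gen:
    pass

print(totals)  # {'start': 32100, 'S1': 3}
```

The saving is only the list of return values that is never built; the array produced by df[df_columns].to_numpy() is still created in full.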


Solution 3:

Your list comprehension method seems a bit more confusing than it needs to be, especially considering pandas dataframes have an iterrows() method. You can replace your version with this:

for index, row in df.iterrows():
    func(row)

But I only suggest the above method because your function seems to only print out the row. Depending on what your func really does, you may want to consider using df.apply():

df.apply(func, axis=1)
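A short sketch comparing the two calls, assuming a hypothetical func that adds up the sample columns. Both hand each row to func as a pandas Series, so values can be read by column name.

```python
import pandas as pd

# Sample data modelled on the output shown later in this thread.
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [10000, 10100, 12000],
    "S1": [1, 1, 1],
    "S2": [2, 2, 2],
    "S3": [3, 3, 3],
})

def func(row):
    # Hypothetical row function: row is a Series indexed by column name.
    return row["S1"] + row["S2"] + row["S3"]

# iterrows() yields (index, Series) pairs.
looped = [func(row) for _, row in df.iterrows()]

# apply(axis=1) calls func once per row and collects the results in a Series.
applied = df.apply(func, axis=1)
print(applied.tolist())  # [6, 6, 6]
```

Building a Series for every row is convenient but adds overhead compared with iterating over raw numpy values.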

Solution 4:

In your example, which prints the full row, the [0] or * simply unwraps the one-element tuple that zip() puts around each numpy row:

[func(*row) for row in zip(df[['chrom','start','S1','S2','S3']].to_numpy())]

or

[func(row[0]) for row in zip(df[['chrom','start','S1','S2','S3']].to_numpy())]

['chr1' 10000 1 2 3]
['chr1' 10100 1 2 3]
['chr1' 12000 1 2 3]

printing only the third column:

[func(row[0][2]) for row in zip(df[['chrom','start','S1','S2','S3']].to_numpy())]

1
1
1

P.S.: the console also shows [None, None, None] at the end, but that is just the list built by the comprehension, because print() returns None; it is not part of the printed output.
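To make the wrapping visible, here is a small sketch on a hypothetical two-column frame: zip() over a single 2-D array yields each row inside a 1-tuple, and iterating over the array directly gives the same rows without the extra layer.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"start": [10000, 10100, 12000], "S1": [1, 1, 1]})
arr = df[["start", "S1"]].to_numpy()

# zip() with one argument wraps every row in a tuple of length 1.
wrapped = list(zip(arr))
first = wrapped[0]
print(type(first).__name__, len(first))  # tuple 1

# row[0] unwraps the tuple; iterating over arr needs no unwrapping.
unwrapped = [row[0].tolist() for row in zip(arr)]
direct = [row.tolist() for row in arr]
print(unwrapped == direct)  # True
```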

EDIT:

Please use df.iloc and df.loc instead of df[[...]], see Selecting multiple columns in a pandas dataframe

