Assign (add) A New Column To A Dask Dataframe Based On Values Of 2 Existing Columns - Involves A Conditional Statement

July 02, 2022

I would like to add a new column to an existing dask dataframe based on the values of 2 existing columns, using a conditional statement to check for nulls: DataFrame defi

Solution 1:

You can use either fillna (fast) or apply (slow but flexible).

Fillna

import pandas as pd
import dask.dataframe as dd

# Example data: y has one missing value
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)

# Use y where it is present, otherwise 100 + x
ddf['z'] = ddf.y.fillna(100 + ddf.x)

>>> df

   x      y
0  1  0.200
1  2    NaN
2  3  0.345
3  4  0.400
4  5  0.150

>>> ddf.compute()

   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

Note that your original function still uses y when y is null, so the result would be null as well. I'm assuming that you didn't intend this, so I changed the output slightly: where y is null, z falls back to 100 + x.

Use apply

As any Pandas expert will tell you, using apply comes with a 10x to 100x slowdown penalty. Please beware.

That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. You are telling apply that the function produces a dataframe, when in fact I think that your function was intended to produce a series. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:

In [1]: import pandas as pd
   ...: 
   ...: import dask.dataframe as dd
   ...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
   ...: ddf = dd.from_pandas(df, npartitions=2)
   ...: 

In [2]: def func(row):
   ...:     if pd.isnull(row['y']):
   ...:         return row['x'] + 100
   ...:     else:
   ...:         return row['y']
   ...:     

In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)

In [4]: ddf.compute()
Out[4]: 
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)

In [6]: ddf.compute()
Out[6]: 
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

Solution 2:

I do not have any experience with dask, but the boolean test in your funcUpdate will not catch the 2nd element as null. In pandas, null values are None or NaN, not the empty string "".

def funcUpdate(row):
    try:
        return round((1 + row['x']) / (1 + 1 / row['y']), 4)
    except Exception:
        # y could not be used in the formula (e.g. None or ""), so return it unchanged
        return row['y']

This is a possible workaround, but you would need to run data validation beforehand.
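To make the point about nulls concrete, the sketch below checks which values pd.isnull treats as missing, then shows a variant of the function that tests y up front instead of relying on an exception. The name func_update_checked and the choice to treat blank strings as missing are my assumptions, not part of the original question, and a y of zero is deliberately left unhandled.

```python
import pandas as pd

# pd.isnull treats None and NaN as missing, but not the empty string.
checks = {repr(v): bool(pd.isnull(v)) for v in (None, float('nan'), '')}
print(checks)

def func_update_checked(row):
    # Hypothetical variant of funcUpdate with an explicit missing-value test
    y = row['y']
    is_blank = isinstance(y, str) and y.strip() == ''
    if pd.isnull(y) or is_blank:
        # Nothing to compute with; return y unchanged
        return y
    return round((1 + row['x']) / (1 + 1 / y), 4)

print(func_update_checked({'x': 1, 'y': 0.2}))
print(func_update_checked({'x': 2, 'y': None}))
```

An explicit test like this keeps unrelated errors visible, whereas a broad except would hide them.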
