Assign (add) A New Column To A Dask Dataframe Based On Values Of 2 Existing Columns - Involves A Conditional Statement
Solution 1:
You can either use fillna
(fast) or you can use apply
(slow but flexible)
Fillna
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)
ddf['z'] = ddf.y.fillna((100 + ddf.x))
>>> df
x y
0 1 0.200
1 2 NaN
2 3 0.345
3 4 0.400
4 5 0.150
>>> ddf.compute()
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
Of course in this case though because your function uses y
if y
is a null, the result will be null as well. I'm assuming that you didn't intend this, so I changed the output slightly.
Use apply
As any Pandas expert will tell you, using apply
comes with a 10x to 100x slowdown penalty. Please beware.
That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. You are telling apply that the function produces a dataframe, when in fact I think that your function was intended to produce a series. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:
In [1]: import pandas as pd
...:
...: import dask.dataframe as dd
...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
...: ddf = dd.from_pandas(df, npartitions=2)
...:
In [2]: def func(row):
...: if pd.isnull(row['y']):
...: return row['x'] + 100
...: else:
...: return row['y']
...:
In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
warnings.warn(msg)
In [4]: ddf.compute()
Out[4]:
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)
In [6]: ddf.compute()
Out[6]:
x y z
0 1 0.200 0.200
1 2 NaN 102.000
2 3 0.345 0.345
3 4 0.400 0.400
4 5 0.150 0.150
Solution 2:
I do not have any experience with dask but your boolean test will not catch that 2nd element as null in funcUpdate. Null values with pandas are equal to None or NaN/Nan, not "".
def funcUpdate(row):
try:
return round((1 + row['x'])/(1+ 1/row['y']),4)
except:
return row['y']
Is a possible workaround but you would need to run data validation before hand.
Post a Comment for "Assign (add) A New Column To A Dask Dataframe Based On Values Of 2 Existing Columns - Involves A Conditional Statement"