How To Compute Weighted Sum Of All Elements In A Row In Pandas?
I have a pandas data frame with multiple columns. I want to create a new column weighted_sum from the values in the row and another column vector dataframe weight weighted_sum sho
Solution 1:
The problem is that you're multiplying a frame with a frame of a different size with a different row index. Here's the solution:
In [121]: df = DataFrame([[1,2.2,3.5],[6.1,0.4,1.2]], columns=list('abc'))
In [122]: weight = DataFrame(Series([0.5, 0.3, 0.2], index=list('abc'), name=0))
In [123]: df
Out[123]:
a b c
01.002.203.5016.100.401.20
In [124]: weight
Out[124]:
0
a 0.50
b 0.30
c 0.20
In [125]: df * weight
Out[125]:
0 a b c
0 nan nan nan nan
1 nan nan nan nan
a nan nan nan nan
b nan nan nan nan
c nan nan nan nan
You can either access the column:
In[126]: df * weight[0]Out[126]:
abc00.500.660.7013.050.120.24In[128]: (df * weight[0]).sum(1)
Out[128]:
01.8613.41dtype: float64
Or use dot
to get back another DataFrame
In [127]: df.dot(weight)
Out[127]:
001.8613.41
To bring it all together:
In [130]: df['weighted_sum'] = df.dot(weight)
In [131]: df
Out[131]:
a b c weighted_sum
0 1.00 2.20 3.50 1.86
1 6.10 0.40 1.20 3.41
Here are the timeit
s of each method, using a larger DataFrame
.
In [145]: df = DataFrame(randn(10000000, 3), columns=list('abc'))
weight
In [146]: weight = DataFrame(Series([0.5, 0.3, 0.2], index=list('abc'), name=0))
In [147]: timeit df.dot(weight)
10 loops, best of 3: 57.5 ms per loop
In [148]: timeit (df * weight[0]).sum(1)
10 loops, best of 3: 125 ms per loop
For a wide DataFrame
:
In [162]: df = DataFrame(randn(10000, 1000))
In [163]: weight = DataFrame(randn(1000, 1))
In [164]: timeit df.dot(weight)
100 loops, best of 3: 5.14 ms per loop
In [165]: timeit (df * weight[0]).sum(1)
10 loops, best of 3: 41.8 ms per loop
So, dot
is faster and more readable.
NOTE: If any of your data contain NaN
s then you should not use dot
you should use the multiply-and-sum method. dot
cannot handle NaN
s since it is just a thin wrapper around numpy.dot()
(which doesn't handle NaN
s).
Solution 2:
Assuming weights is a Series of weights for each columns, you can just multiply and do the sum:
In [11]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
In [12]: weights = pd.Series([7, 8, 9], index=['a', 'b', 'c'])
In [13]: (df * weights)
Out[13]:
a b c
0716271284054
In [14]: (df * weights).sum(1)
Out[14]:
0501122
dtype: int64
The benefit of this approach is it takes care of columns which you don't want to weigh:
In [21]: weights = pd.Series([7, 8], index=['a', 'b'])
In [22]: (df * weights)
Out[22]:
a b c
0 7 16 NaN
1 28 40 NaN
In [23]: (df * weights).sum(1)
Out[23]:
0 23
1 68
dtype: float64
Post a Comment for "How To Compute Weighted Sum Of All Elements In A Row In Pandas?"