
Removing Rows In Pandas Based On Multiple Columns

In Pandas, I have a DataFrame with ZipCode, Age, and a bunch of columns that should all have values 1 or 0, i.e.:

ZipCode  Age  A  B  C  D
  12345   21  0  1  1  1
  12345   22  1  0  1  4
  23456

Solution 1:

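Use isin to test for membership and all(axis=1) to check that every value in a row passes, then use the resulting boolean mask to filter the df. The column slice is written with df.loc here; older pandas versions also allowed df.ix, which has since been removed:

In [12]:
df[df.loc[:,'A':].isin([0,1]).all(axis=1)]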

Out[12]:
   ZipCode  Age  A  B  C  D
0    12345   21  0  1  1  1
2    23456   45  1  0  1  1
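A minimal, self-contained sketch of this approach follows. The sample frame is rebuilt from the rows visible in the question and in the outputs; the variable names `flags`, `mask`, and `clean` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample frame rebuilt from the rows shown in the question and the outputs.
df = pd.DataFrame({
    "ZipCode": [12345, 12345, 23456],
    "Age": [21, 22, 45],
    "A": [0, 1, 1],
    "B": [1, 0, 0],
    "C": [1, 1, 1],
    "D": [1, 4, 1],
})

# Label-based slice: every column from 'A' through the last column.
flags = df.loc[:, "A":]

# isin gives a boolean frame; all(axis=1) collapses it to one boolean per row.
mask = flags.isin([0, 1]).all(axis=1)

# Keep only the rows where every flag column is 0 or 1.
clean = df[mask]
print(clean)
```

The row with index 1 is dropped because its D value is 4.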

Solution 2:

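Another vectorized option is to count, per row, how many values are 0 or 1 and keep the rows where that count equals the number of columns checked (4 here):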

In [64]: df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]

Out[64]:
   ZipCode  Age  A  B  C  D
0    12345   21  0  1  1  1
2    23456   45  1  0  1  1
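The literal 4 ties the filter to exactly four columns. A small sketch, assuming the same sample frame, that derives the expected count from the column list instead; `cols` and `valid_count` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the rows shown in the question and the outputs.
df = pd.DataFrame({
    "ZipCode": [12345, 12345, 23456],
    "Age": [21, 22, 45],
    "A": [0, 1, 1],
    "B": [1, 0, 0],
    "C": [1, 1, 1],
    "D": [1, 4, 1],
})

cols = ["A", "B", "C", "D"]

# Summing booleans counts the True values, i.e. the valid entries per row.
valid_count = df[cols].isin([0, 1]).sum(axis=1)

# A row passes only if every checked column was valid.
clean = df[valid_count == len(cols)]
print(clean)
```

If columns are later added to or removed from `cols`, the comparison stays correct without touching the filter line.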

Solution 3:

The other two solutions work well, but if you are interested in speed, you should look at NumPy's in1d function:

data=df.loc[:, 'A':]

In [72]:  df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
Out[72]:
   ZipCode  Age  A  B  C  D
0    12345   21  0  1  1  1
2    23456   45  1  0  1  1
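Recent NumPy releases deprecate in1d in favour of np.isin, which keeps the shape of its first argument, so the reshape step is not needed. A sketch of the same idea under that assumption, using the sample frame from the question (`mask` and `clean` are illustrative names):

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the rows shown in the question and the outputs.
df = pd.DataFrame({
    "ZipCode": [12345, 12345, 23456],
    "Age": [21, 22, 45],
    "A": [0, 1, 1],
    "B": [1, 0, 0],
    "C": [1, 1, 1],
    "D": [1, 4, 1],
})

data = df.loc[:, "A":]

# np.isin returns a boolean array with the same 2-D shape as the input,
# so all(axis=1) can be applied directly, one result per row.
mask = np.isin(data.to_numpy(), [0, 1]).all(axis=1)

# A boolean ndarray with one entry per row can index the frame directly.
clean = df[mask]
print(clean)
```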

Timing:

In [73]: %timeit data=df.loc[:, 'A':]; df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
1000 loops, best of 3: 558 us per loop

In [74]: %timeit df[df.ix[:,'A':].isin([0,1]).all(axis=1)]
1000 loops, best of 3: 843 us per loop

In [75]: %timeit df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]
1000 loops, best of 3: 1.44 ms per loop
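The numbers above come from a three-row frame in an IPython session, so they may not carry over to other data sizes or library versions. A rough sketch for re-running the comparison outside IPython with the standard timeit module follows. The larger frame (the sample rows repeated 1000 times) and the function names are assumptions made for illustration, and np.isin stands in for in1d.

```python
import timeit

import numpy as np
import pandas as pd

# Sample rows from the question, repeated to get a larger (hypothetical) frame.
small = pd.DataFrame({
    "ZipCode": [12345, 12345, 23456],
    "Age": [21, 22, 45],
    "A": [0, 1, 1],
    "B": [1, 0, 0],
    "C": [1, 1, 1],
    "D": [1, 4, 1],
})
df = pd.concat([small] * 1000, ignore_index=True)
cols = ["A", "B", "C", "D"]


def with_all():
    # Solution 1: isin + all on a label slice.
    return df[df.loc[:, "A":].isin([0, 1]).all(axis=1)]


def with_sum():
    # Solution 2: count valid values per row.
    return df[df[cols].isin([0, 1]).sum(axis=1) == len(cols)]


def with_numpy():
    # Solution 3: membership test on the underlying NumPy array.
    data = df.loc[:, "A":]
    return df[np.isin(data.to_numpy(), [0, 1]).all(axis=1)]


runs = 100
for fn in (with_all, with_sum, with_numpy):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e6:.0f} us per call")
```

All three functions should return the same rows; only the time per call is expected to differ.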
