
How Can I Summarize Several Pandas Dataframe Columns Into A Parent Column Name?

I've a dataframe which looks like this some feature another feature label sample 0 ... ... ... and I'd like to get a dataframe with multiind

Solution 1:

"From the API it's not clear to me how to use from_arrays(), from_product(), from_tuples() or from_frame() correctly."

These constructors are mainly used when you build a MultiIndex that is independent of the original column names.

That is the case when you need a completely new MultiIndex, e.g. from lists or arrays:

a = ['a','a','b']
b = ['x','y','z']
df.columns = pd.MultiIndex.from_arrays([a,b])
print (df)
        a     b
        x  y  z
sample         
0       2  3  5
1       4  5  7
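As a rough sketch of how the other constructors from the question relate to from_arrays: from_tuples takes one tuple per column, while from_product builds every combination of the given levels. The sample frame and its original column names (c1, c2, c3) are made up here for illustration.

```python
import pandas as pd

# Hypothetical sample frame; the original column names do not matter here.
df = pd.DataFrame({'c1': [2, 4], 'c2': [3, 5], 'c3': [5, 7]},
                  index=pd.Index([0, 1], name='sample'))

# One list per level ...
by_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b'], ['x', 'y', 'z']])
# ... or one tuple per column; both describe the same three columns.
by_tuples = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'z')])

# from_product builds all combinations, so it only fits a "full grid" of columns.
by_product = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']])

df.columns = by_tuples
under_a = df['a']  # selecting a top-level key returns the sub-columns below it
```

from_product would produce four columns here, so it is not a match for a three-column frame; from_arrays or from_tuples is the safer choice when the groups have different sizes.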

EDIT1: To put every column except the last one under the same parent level:

a = ['parent'] * (len(df.columns) - 1) + ['label']
b = df.columns[:-1].tolist() + ['val']
df.columns = pd.MultiIndex.from_arrays([a,b])
print (df)
          parent           label
       feature a feature b   val
sample                          
0              2         3     5
1              4         5     7
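For completeness, a self-contained version of the same idea, assuming the original columns are named 'feature a', 'feature b' and 'label' as in the printed output. It also shows how the grouped columns can be selected afterwards.

```python
import pandas as pd

# Assumed input, reconstructed from the printed output above.
df = pd.DataFrame({'feature a': [2, 4], 'feature b': [3, 5], 'label': [5, 7]},
                  index=pd.Index([0, 1], name='sample'))

# First level: 'parent' for every column except the last, which gets 'label'.
top = ['parent'] * (len(df.columns) - 1) + ['label']
# Second level: the original names, with 'val' for the last column.
bottom = df.columns[:-1].tolist() + ['val']
df.columns = pd.MultiIndex.from_arrays([top, bottom])

features = df['parent']        # DataFrame with the two feature columns
labels = df[('label', 'val')]  # Series, selected by the full column tuple
```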

Splitting the column names is also possible, but any column without the separator gets NaN in the second level, because MultiIndex and non-MultiIndex columns cannot be combined (actually they can, but the MultiIndex columns are then turned into tuples):

print (df)
        feature_a  feature_b  label
sample                             
0               2          3      5
1               4          5      7

df.columns = df.columns.str.split('_', expand=True)
print (df)
       feature    label
             a  b   NaN
sample                 
0            2  3     5
1            4  5     7
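A small sketch to check the missing second level programmatically, assuming the separator is an underscore as in the column names printed above:

```python
import pandas as pd

df = pd.DataFrame({'feature_a': [2, 4], 'feature_b': [3, 5], 'label': [5, 7]},
                  index=pd.Index([0, 1], name='sample'))

# expand=True on an Index returns a MultiIndex; 'label' has no second part.
df.columns = df.columns.str.split('_', expand=True)

# Boolean mask of columns whose second level is missing.
second_level_missing = df.columns.get_level_values(1).isna()
```

Columns flagged in the mask are the ones worth moving into the index before splitting.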

So it is better to first move every column without a separator into the index (Index/MultiIndex) with DataFrame.set_index:

df = df.set_index('label')
df.columns = df.columns.str.split('_', expand=True)
print (df)
      feature   
            a  b
label           
5           2  3
7           4  5

To keep the original index, use the append=True parameter:

df = df.set_index('label', append=True)
df.columns = df.columns.str.split('_', expand=True)
print (df)
             feature   
                   a  b
sample label           
0      5           2  3
1      7           4  5
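Put together as one runnable sketch, again assuming underscore-separated column names and an index named 'sample':

```python
import pandas as pd

df = pd.DataFrame({'feature_a': [2, 4], 'feature_b': [3, 5], 'label': [5, 7]},
                  index=pd.Index([0, 1], name='sample'))

# Move the separator-free column into the index, keeping the existing 'sample' level.
df = df.set_index('label', append=True)
# Every remaining column now splits cleanly into two levels.
df.columns = df.columns.str.split('_', expand=True)

features = df['feature']                         # sub-columns 'a' and 'b'
labels = df.index.get_level_values('label')      # label values, now part of the index
```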
