Pandas: Split Columns Into Multiple Columns By Two Delimiters

October 21, 2024

I have data like this:

ID  INFO
1   A=2;B=2;C=5
2   A=3;B=4;C=1
3   A=1;B=3;C=2

I want to split the INFO column into:

ID  A  B  C
1   2  2  5
2   3  4  1
3   1  3  2

Solution 1:

You could use named groups together with Series.str.extract, then concat the 'ID' column back at the end. This assumes every line contains A=, B= and C= in that order, each followed by a single digit.

pd.concat([df['ID'],
           df['INFO'].str.extract(r'A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')], axis=1)

#    ID  A  B  C
# 0   1  2  2  5
# 1   2  3  4  1
# 2   3  1  3  2
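The pattern above matches exactly one digit per value. As a minimal sketch, assuming values may span several digits and should end up as integers, the same idea can be written with `\d+` and an `astype(int)` step. Both of those changes, and the sample frame rebuilt from the question, are additions here rather than part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({"ID": [1, 2, 3],
                   "INFO": ["A=2;B=2;C=5", "A=3;B=4;C=1", "A=1;B=3;C=2"]})

# Named groups become the column names; \d+ allows multi-digit values
pattern = r"A=(?P<A>\d+);B=(?P<B>\d+);C=(?P<C>\d+)"
extracted = df["INFO"].str.extract(pattern).astype(int)

# Put the ID column back in front of the extracted columns
result = pd.concat([df["ID"], extracted], axis=1)
print(result)
```

Note that `astype(int)` would fail if any row did not match the pattern, since `str.extract` returns NaN for non-matching rows.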

If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2', you can split on ';' and partition on '=', then pivot at the end to get your desired output.

### Starting Data
# ID   INFO
# 1    A=2;B=2;C=5
# 2    A=3;B=4;C=1
# 3    A=1;B=3;C=2
# 4    A=1;C=2

(df.set_index('ID')['INFO']
   .str.split(';', expand=True)
   .stack()
   .str.partition('=')
   .reset_index(-1, drop=True)
   .pivot(columns=0, values=2)
)

#     A    B  C
# ID
# 1   2    2  5
# 2   3    4  1
# 3   1    3  2
# 4   1  NaN  2
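For an end-to-end run, here is a minimal sketch of the same chain on the four-row starting data. The explicit `dropna()`, the numeric conversion and the final `reset_index()` are assumptions added for illustration (they keep the chain independent of how `stack` treats missing cells and turn the string values into numbers); they are not part of the original answer.

```python
import pandas as pd

# Starting data, including a row with no B value
df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "INFO": ["A=2;B=2;C=5", "A=3;B=4;C=1", "A=1;B=3;C=2", "A=1;C=2"]})

wide = (df.set_index("ID")["INFO"]
          .str.split(";", expand=True)   # one column per key=value pair
          .stack()                       # long format: one pair per row
          .dropna()                      # drop the empty cell from the short row
          .str.partition("=")            # columns 0 (key), 1 ('='), 2 (value)
          .reset_index(-1, drop=True)    # keep only ID as the index
          .pivot(columns=0, values=2))   # keys become columns

# Values are still strings here; convert them and restore ID as a column
wide = wide.apply(pd.to_numeric).rename_axis(columns=None).reset_index()
print(wide)
```

Column B comes out as a float column because it contains a missing value.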

Solution 2:

Iterating over a Series is much faster than iterating across the rows of a dataframe.

So I would do:

pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()

It gives as expected:

   ID  A  B  C
0   1  2  2  5
1   2  3  4  1
2   3  1  3  2

It should be faster than splitting the dataframe columns twice.
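Because each row is turned into its own dict, this approach should also cope with rows that lack a key. A minimal sketch, assuming a shortened two-row sample where the second row has no B (the sample data and the variable names `records` and `out` are illustrative):

```python
import pandas as pd

# Two-row sample; the second row has no B entry
df = pd.DataFrame({"ID": [1, 2], "INFO": ["A=2;B=2;C=5", "A=1;C=2"]})

# One dict per row: {'A': '2', 'B': '2', 'C': '5'}, {'A': '1', 'C': '2'}
records = [dict(pair.split("=", 1) for pair in info.split(";"))
           for info in df["INFO"]]

# Missing keys are filled with NaN when the dicts are combined
out = pd.DataFrame(records, index=df["ID"]).reset_index()
print(out)
```

The values stay as strings, so a numeric conversion is still needed if numbers are wanted.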

Solution 3:

values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)

This will give you the desired output:

ID  INFO         a  b  c
1   a=1;b=2;c=3  1  2  3
2   a=4;b=5;c=6  4  5  6
3   a=7;b=8;c=9  7  8  9

Explanation: The first line converts every value to a dictionary. e.g.

x = 'a=1;b=2;c=3' 
dict(item.split("=") for item in x.split(";"))

results in: {'a': '1', 'b': '2', 'c': '3'}

DataFrame can take a list of dicts as an input and turn it into a dataframe.

Then you only need to assign the dataframe to the columns you want: df[['a', 'b', 'c']] = pd.DataFrame(values)
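The dictionary values are strings, as the example above shows, so the new columns hold text rather than numbers. A minimal sketch, assuming every value is an integer and every row has all three keys, that converts them and attaches them with `join` so rows are matched by index label (the `astype(int)` and `join` steps are additions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "INFO": ["a=1;b=2;c=3", "a=4;b=5;c=6", "a=7;b=8;c=9"]})

values = [dict(item.split("=") for item in value.split(";")) for value in df["INFO"]]

# Reuse df's index so the parsed rows line up even if the index is not 0..n-1
parsed = pd.DataFrame(values, index=df.index).astype(int)

df = df.join(parsed)
print(df)
```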

Solution 4:

Another solution is to use Series.str.findall to extract the values and then apply(pd.Series):

df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop(columns="INFO")

Details:

df = pd.DataFrame([[1, "A=2;B=2;C=5"],
                [2, "A=3;B=4;C=1"],
                [3, "A=1;B=3;C=2"]],
                 columns=["ID", "INFO"])

print(df.INFO.str.findall(r'=(\d+)'))
# 0    [2, 2, 5]
# 1    [3, 4, 1]
# 2    [1, 3, 2]

df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
#    ID         INFO  A  B  C
# 0   1  A=2;B=2;C=5  2  2  5
# 1   2  A=3;B=4;C=1  3  4  1
# 2   3  A=1;B=3;C=2  1  3  2

# Remove INFO column
df = df.drop(columns="INFO")
print(df)
#    ID  A  B  C
# 0   1  2  2  5
# 1   2  3  4  1
# 2   3  1  3  2
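The pattern `=(\d+)` captures only the values, so the column names are assigned by position and a row with a missing key would be misaligned. A possible variant, sketched here under the assumption that keys are word characters and values are digits, captures both sides of each pair so the keys decide the columns (the two-group pattern and the names `pairs` and `parsed` are illustrative, not from the original answer):

```python
import pandas as pd

# Second row has no B entry
df = pd.DataFrame({"ID": [1, 2], "INFO": ["A=2;B=2;C=5", "A=1;C=2"]})

# With two groups, findall returns a list of (key, value) tuples per row
pairs = df["INFO"].str.findall(r"(\w+)=(\d+)")

# dict() turns each list of tuples into a {key: value} mapping
parsed = pd.DataFrame([dict(p) for p in pairs], index=df.index)

result = df.drop(columns="INFO").join(parsed)
print(result)
```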

Solution 5:

Another solution:

# split on ';'
# explode
# then split on '='
# and pivot
df_INFO = (df.INFO
           .str.split(';')
           .explode()
           .str.split('=', expand=True)
           .pivot(columns=0, values=1)
           )

pd.concat([df.ID, df_INFO], axis=1)

   ID  A  B  C
0   1  2  2  5
1   2  3  4  1
2   3  1  3  2
