Pandas: Split Columns Into Multiple Columns By Two Delimiters
Solution 1:
You could use named groups together with Series.str.extract, then concat the 'ID' column back at the end. This assumes every line always contains A=, B=, and C=, in that order.
pd.concat([df['ID'],
           df['INFO'].str.extract(r'A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')], axis=1)
#   ID  A  B  C
#0   1  2  2  5
#1   2  3  4  1
#2   3  1  3  2
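For anyone who wants to run this end to end, here is a minimal sketch. The sample frame is an assumption reconstructed from the output above, and the pattern uses \d+ instead of \d so that multi-digit values would also match. The extracted values are strings, so the sketch converts them before concatenating.

```python
import pandas as pd

# Sample data assumed from the outputs shown in this thread.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "INFO": ["A=2;B=2;C=5", "A=3;B=4;C=1", "A=1;B=3;C=2"],
})

# Named groups become the column names; \d+ also accepts multi-digit values.
pattern = r"A=(?P<A>\d+);B=(?P<B>\d+);C=(?P<C>\d+)"
extracted = df["INFO"].str.extract(pattern)

# str.extract returns strings, so cast before doing arithmetic on the result.
result = pd.concat([df["ID"], extracted.astype(int)], axis=1)
print(result)
```

If a row does not match the whole pattern, str.extract returns NaN for every group in that row, and the astype(int) cast would then fail, which is why the flexible variant may be the safer choice for irregular data.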
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2', then we can split on ';' and partition on '='. pivot in the end to get to your desired output.
### Starting Data
#ID         INFO
#1   A=2;B=2;C=5
#2   A=3;B=4;C=1
#3   A=1;B=3;C=2
#4       A=1;C=2
(df.set_index('ID')['INFO']
.str.split(';', expand=True)
.stack()
.str.partition('=')
.reset_index(-1, drop=True)
.pivot(columns=0, values=2)
)
#    A    B  C
#ID
#1   2    2  5
#2   3    4  1
#3   1    3  2
#4   1  NaN  2
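Here is the same chain as a self-contained sketch, built on the starting data listed above. The explicit dropna() is my addition, not part of the original answer: older pandas versions drop the empty slots inside stack(), but I am assuming newer versions may keep them, and the extra call is harmless either way.

```python
import pandas as pd

# Starting data from the example above, including the short row for ID 4.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "INFO": ["A=2;B=2;C=5", "A=3;B=4;C=1", "A=1;B=3;C=2", "A=1;C=2"],
})

wide = (df.set_index("ID")["INFO"]
          .str.split(";", expand=True)   # one column per key=value pair
          .stack()                       # long format: one pair per row
          .dropna()                      # assumption: remove empty slots if stack kept them
          .str.partition("=")            # columns 0 (key), 1 ("="), 2 (value)
          .reset_index(-1, drop=True)    # drop the pair-position level, keep ID
          .pivot(columns=0, values=2))   # keys become columns, ID stays the index
print(wide)
```

The values in the result are still strings, and the missing B for ID 4 shows up as NaN.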
Solution 2:
Iterating over a Series is much faster than iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()
It gives as expected:
   ID  A  B  C
0   1  2  2  5
1   2  3  4  1
2   3  1  3  2
It should be faster than splitting the dataframe column twice.
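A side effect of building dicts is that a row with a missing key simply gets NaN in that column. The sketch below shows that on an assumed sample frame (the ID 4 row is borrowed from Solution 1) and converts the string values to numbers with pd.to_numeric, which is an extra step the answer above does not include.

```python
import pandas as pd

# Assumed sample data; the last row deliberately has no B entry.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "INFO": ["A=2;B=2;C=5", "A=3;B=4;C=1", "A=1;B=3;C=2", "A=1;C=2"],
})

# One dict per row: {"A": "2", "B": "2", "C": "5"}, and so on.
records = [dict(pair.split("=", 1) for pair in info.split(";"))
           for info in df["INFO"]]

# Keys missing from a dict become NaN in the resulting frame.
out = pd.DataFrame(records, index=df["ID"]).reset_index()

# The parsed values are strings; convert them column by column.
out[["A", "B", "C"]] = out[["A", "B", "C"]].apply(pd.to_numeric)
print(out)
```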
Solution 3:
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID         INFO  a  b  c
 1  a=1;b=2;c=3  1  2  3
 2  a=4;b=5;c=6  4  5  6
 3  a=7;b=8;c=9  7  8  9
Explanation: The first line converts every value to a dictionary. For example:
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in:
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)
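One thing to watch: as far as I know, assigning a DataFrame to columns aligns on the index, so if df does not have the default 0..n-1 index (for example after filtering), the new columns can come out as NaN. A minimal sketch of a workaround, assuming a frame with a made-up index of 10, 20, 30, is to build the parsed frame on df.index:

```python
import pandas as pd

# Hypothetical frame with a non-default index, e.g. left over from filtering.
df = pd.DataFrame(
    {"ID": [1, 2, 3],
     "INFO": ["a=1;b=2;c=3", "a=4;b=5;c=6", "a=7;b=8;c=9"]},
    index=[10, 20, 30],
)

values = [dict(item.split("=") for item in value.split(";")) for value in df["INFO"]]

# Reusing df.index keeps each parsed row matched to its source row.
parsed = pd.DataFrame(values, index=df.index)
df[["a", "b", "c"]] = parsed
print(df)
```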
Solution 4:
Another solution is Series.str.findall to extract values and then apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop(columns="INFO")
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
[2, "A=3;B=4;C=1"],
[3, "A=1;B=3;C=2"]],
columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0    [2, 2, 5]
# 1    [3, 4, 1]
# 2    [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
#    ID         INFO  A  B  C
# 0   1  A=2;B=2;C=5  2  2  5
# 1   2  A=3;B=4;C=1  3  4  1
# 2   3  A=1;B=3;C=2  1  3  2
# Remove INFO column
df = df.drop(columns="INFO")
print(df)
#    ID  A  B  C
# 0   1  2  2  5
# 1   2  3  4  1
# 2   3  1  3  2
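Since apply(pd.Series) builds one Series per row, it can get slow on larger frames. A possible variant, sketched here on the same sample data, passes the lists from findall straight to the DataFrame constructor instead. Like the original, it matches values by position, so it assumes every row has A, B and C in that order.

```python
import pandas as pd

df = pd.DataFrame([[1, "A=2;B=2;C=5"],
                   [2, "A=3;B=4;C=1"],
                   [3, "A=1;B=3;C=2"]],
                  columns=["ID", "INFO"])

# Each row yields a list of the digit strings that follow an '='.
matches = df["INFO"].str.findall(r"=(\d+)")

# Build the three columns in one go and keep them on df's index.
df[["A", "B", "C"]] = pd.DataFrame(matches.tolist(), index=df.index)
df = df.drop(columns="INFO")
print(df)
```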
Solution 5:
Another solution:
# split on ';'
# explode
# then split on '='
# and pivot
df_INFO = (df.INFO
.str.split(';')
.explode()
.str.split('=',expand=True)
.pivot(columns=0,values=1)
)
pd.concat([df.ID,df_INFO],axis=1)
   ID  A  B  C
0   1  2  2  5
1   2  3  4  1
2   3  1  3  2