
Wrangling A Data Frame In Pandas (python)

I have the following data in a csv file:

from StringIO import StringIO
import pandas as pd

the_data = '''
ABC,2016-6-9 0:00,95,{'//Purple': [115L], '//Yellow': [403L], '//Blue': [

Solution 1:

Consider converting the dictionary column values to Python dictionaries with ast.literal_eval(), then casting each one to its own dataframe for a final merge with the original dataframe:

from io import StringIO
import pandas as pd

import ast
...

df = pd.read_csv(StringIO(the_data), header=None, 
                 names=['Company', 'Date', 'Value', 'Dicts'])

dfList = []
for i in df['Dicts'].tolist():
    result = ast.literal_eval(i.replace('L]', ']'))            
    result = {k.replace('//',''):v for k,v in result.items()}
    temp = pd.DataFrame(result)
    dfList.append(temp)

dictdf = pd.concat(dfList).reset_index(drop=True)
df = pd.merge(df, dictdf, left_index=True, right_index=True).drop(['Dicts'], axis=1)
print(df)

#   Company            Date  Value  Black    Blue  NPO-Green  Pink  Purple  White-XYZ  Yellow
# 0     ABC   2016-6-9 0:00     95    NaN    16.0        NaN   NaN     115        0.0   403.0
# 1     ABC  2016-6-10 0:00      0    NaN    90.0        NaN   NaN     219        0.0   381.0
# 2     ABC  2016-6-11 0:00      0    NaN    31.0        NaN   NaN     817        0.0    21.0
# 3     ABC  2016-6-12 0:00      0    NaN  8888.0        NaN   NaN      80        0.0  2011.0
# 4     ABC  2016-6-13 0:00      0    NaN     4.0        NaN   NaN      32        0.0    15.0
# 5     DEF  2016-6-16 0:00      0   15.0     NaN        3.0   4.0      32        NaN     NaN
# 6     DEF  2016-6-17 0:00      0   15.0     NaN        0.0   4.0      32        NaN     NaN
# 7     DEF  2016-6-18 0:00      0   15.0     NaN        7.0   4.0      32        NaN     NaN
# 8     DEF  2016-6-19 0:00      0   15.0     NaN       14.0   4.0      32        NaN     NaN
# 9     DEF  2016-6-20 0:00      0   15.0     NaN       21.0   4.0      32        NaN     NaN
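The per-row parsing step can also be tried in isolation. The sketch below is not part of the original answer: `parse_py2_dict` is a hypothetical helper, and the sample string is a shortened stand-in for one cell of the Dicts column. It strips the Python 2 long suffix with a regular expression instead of `replace('L]', ']')`, on the assumption that a list might hold more than one value. A key that itself ends in digits followed by `L` would be altered too, so treat it as a starting point only.

```python
import ast
import re


def parse_py2_dict(text):
    """Parse a Python 2 style dict literal such as "{'//Purple': [115L]}".

    Hypothetical helper for illustration only.
    """
    # Drop the trailing 'L' of Python 2 long integers (115L -> 115).
    cleaned = re.sub(r'(\d+)L\b', r'\1', text)
    parsed = ast.literal_eval(cleaned)
    # Remove the leading '//' from each key.
    return {key.replace('//', ''): value for key, value in parsed.items()}


# Shortened stand-in for one cell of the 'Dicts' column.
sample = "{'//Purple': [115L], '//Yellow': [403L]}"
row = parse_py2_dict(sample)
print(row)  # {'Purple': [115], 'Yellow': [403]}
```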

Solution 2:

I really don't think pandas can do much for you here. Your data is very obtuse, and it seems to me best dealt with using regular expressions. Here's my solution:

import re
import pandas as pd

# the_data is the multi-line string defined in the question
static_cols = []
dynamic_cols = []
for line in the_data.splitlines():
    if line == '':
        continue

    # deal with static columns
    x = line.split(',')
    company, date, other = x[0:3]
    keys = ['Company', 'Date', 'Other']
    values = [company, date, other]
    d = {i: j for i, j in zip(keys, values)}
    static_cols.append(d)

    # deal with dynamic columns
    keys = re.findall(r'(?<=//)[^\']*', line)
    values = re.findall(r'\d+(?=L)', line)
    d = {i: j for i, j in zip(keys, values)}
    dynamic_cols.append(d)

df1 = pd.DataFrame(static_cols)
df2 = pd.DataFrame(dynamic_cols)
df = pd.concat([df1, df2], axis=1)

And the output: (posted as an image in the original answer)

Also, your data had an extra column after the date that I wasn't sure how to deal with, so I just called it 'Other'. It wasn't included in your output, so you can easily remove it if you want.
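Dropping that column, and converting the regex matches from strings to numbers, could look like the sketch below. The two sample rows are made up for illustration and only mimic the shape of the question's data. The regular expressions are the ones used above, while the `pd.to_numeric` conversion is an addition that the original answer does not perform.

```python
import re
import pandas as pd

# Made-up sample rows in the shape of the question's data.
sample_lines = [
    "ABC,2016-6-9 0:00,95,{'//Purple': [115L], '//Yellow': [403L]}",
    "DEF,2016-6-16 0:00,0,{'//Purple': [32L], '//Pink': [4L]}",
]

static_rows, dynamic_rows = [], []
for line in sample_lines:
    # The first three comma-separated fields are the static columns.
    company, date, other = line.split(',')[0:3]
    static_rows.append({'Company': company, 'Date': date, 'Other': other})
    # Keys follow '//' and values precede the Python 2 'L' suffix.
    keys = re.findall(r"(?<=//)[^']*", line)
    values = re.findall(r'\d+(?=L)', line)
    dynamic_rows.append(dict(zip(keys, values)))

static_df = pd.DataFrame(static_rows)
dynamic_df = pd.DataFrame(dynamic_rows)

# re.findall returns strings, so convert the dynamic columns to numbers.
dynamic_df = dynamic_df.apply(pd.to_numeric)

# Drop the 'Other' column if it is not wanted in the result.
result = pd.concat([static_df, dynamic_df], axis=1).drop(columns=['Other'])
print(result)
```

Colors missing from a row come out as NaN, which is why the columns that contain gaps end up as floats.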
