Python Pandas To Ensure Each Row Based On Column Value Has A Set Of Data Present, If Not Add Row
I am organising AWS resources for tagging, and have captured data into a CSV file. A sample output of the CSV file is as follows. I am trying to make sure that for each resource_id
Solution 1:
One way to do this is to build a MultiIndex with pd.MultiIndex.from_product and then reindex against it:
import pandas as pd

taglist = ['Application', 'Client', 'Environment', 'Name',
           'Owner', 'Project', 'Purpose']

# Reindex on every (resource_id, tag_key) pair so missing tags appear as NaN rows
df_out = (df.set_index(['resource_id', 'tag_key'])
            .reindex(pd.MultiIndex.from_product(
                [df['resource_id'].unique(), taglist],
                names=['resource_id', 'tag_key'])))

# Fill resource_type on the newly added rows from the same resource_id
df_out.assign(resource_type=df_out.groupby('resource_id')['resource_type']
                                  .ffill().bfill()).reset_index()
Output:
              resource_id      tag_key resource_type                tag_value
0   vol-00441b671ca48ba41  Application        volume                      NaN
1   vol-00441b671ca48ba41       Client        volume                      NaN
2   vol-00441b671ca48ba41  Environment        volume              Development
3   vol-00441b671ca48ba41         Name        volume           Database Files
4   vol-00441b671ca48ba41        Owner        volume                      NaN
5   vol-00441b671ca48ba41      Project        volume  Application Development
6   vol-00441b671ca48ba41      Purpose        volume               Web Server
7     i-1234567890abcdef0  Application      instance                      NaN
8     i-1234567890abcdef0       Client      instance                      NaN
9     i-1234567890abcdef0  Environment      instance               Production
10    i-1234567890abcdef0         Name      instance                      NaN
11    i-1234567890abcdef0        Owner      instance             Fast Company
12    i-1234567890abcdef0      Project      instance                      NaN
13    i-1234567890abcdef0      Purpose      instance                      NaN
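For anyone who wants to run this end to end, here is a minimal self-contained sketch. The input frame is an assumption: the question's sample CSV is not shown here, so it is reconstructed from the non-null rows of the output above. It also swaps the ffill/bfill step for groupby(...).transform('first'), which takes the first non-null resource_type per resource_id; that is a substitution on my part, not the answer's original method.

```python
import pandas as pd

# Assumed input, reconstructed from the non-null rows of the output above.
df = pd.DataFrame({
    "resource_id": ["vol-00441b671ca48ba41"] * 4 + ["i-1234567890abcdef0"] * 2,
    "resource_type": ["volume"] * 4 + ["instance"] * 2,
    "tag_key": ["Environment", "Name", "Project", "Purpose",
                "Environment", "Owner"],
    "tag_value": ["Development", "Database Files", "Application Development",
                  "Web Server", "Production", "Fast Company"],
})

taglist = ["Application", "Client", "Environment", "Name",
           "Owner", "Project", "Purpose"]

# Every (resource_id, tag_key) combination that should exist.
full_index = pd.MultiIndex.from_product(
    [df["resource_id"].unique(), taglist],
    names=["resource_id", "tag_key"],
)

# Reindexing adds the missing combinations as rows of NaN.
# Note: this requires (resource_id, tag_key) to be unique in the input.
df_out = df.set_index(["resource_id", "tag_key"]).reindex(full_index)

# Copy each resource's type onto its newly added rows.
df_out["resource_type"] = (
    df_out.groupby(level="resource_id")["resource_type"].transform("first")
)

df_out = df_out.reset_index()
print(df_out)
```

If the CSV can contain the same tag_key twice for one resource_id, the reindex step will raise on the duplicate index, so those rows would need to be deduplicated first.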
Solution 2:
To work with a slightly simpler example, suppose I have this DataFrame df:
df = pd.DataFrame(data={'a': [1, 1, 2, 2], 'b': [[1, 2], [3, 5], [1, 2], [5]]})
Returning:
   a       b
0  1  [1, 2]
1  1  [3, 5]
2  2  [1, 2]
3  2     [5]
The required values of b are 1, 2, 3, 4 and 5.
First we need to find out which values each a 'already has'. We do this by flattening the lists per group:
def flatten(lsts):
    return [j for i in lsts for j in i]

df_new = df.groupby(by=['a'])['b'].apply(flatten)
Returns:
a
1    [1, 2, 3, 5]
2       [1, 2, 5]
Now we need to work out which values are missing for each a and add those as new rows:
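df_new = df_new.reset_index()
lst_wanted = [1, 2, 3, 4, 5]
# Collect one new row per missing value (DataFrame.append was removed in pandas 2.0)
missing = [{'a': row.a, 'b': j}
           for row in df_new.itertuples()
           for j in lst_wanted if j not in row.b]
df = pd.concat([df, pd.DataFrame(missing)], ignore_index=True)
print(df)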
Returning:
   a       b
0  1  [1, 2]
1  1  [3, 5]
2  2  [1, 2]
3  2     [5]
4  1       4
5  2       3
6  2       4
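As a possible variation on the same idea, the per-group "already have" step can be done with DataFrame.explode instead of a hand-written flatten helper, and the missing rows can be added in a single pd.concat. This is only a sketch on the toy frame above; the names have, missing and result are mine, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [[1, 2], [3, 5], [1, 2], [5]]})
lst_wanted = [1, 2, 3, 4, 5]

# One row per list element, then the set of values present for each a.
have = df.explode("b").groupby("a")["b"].apply(set)

# One new row for every wanted value that a group does not have yet.
missing = pd.DataFrame(
    [{"a": a, "b": j}
     for a, present in have.items()
     for j in lst_wanted
     if j not in present]
)

# Append all missing rows at once instead of growing df inside a loop.
result = pd.concat([df, missing], ignore_index=True)
print(result)
```

Building the list of missing rows first and concatenating once avoids copying the frame on every iteration, which matters when there are many resources.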