How To Handle Meta Data Associated With A Pandas Dataframe?
Solution 1:
Although building a custom object is not your first choice, it might be your only feasible option, and has the significant advantage of being extremely flexible. Here's a really simple example:
df=pd.DataFrame({'stock': 'AAPL AAPL MSFT MSFT'.split(),
'price':[ 445.,455.,195.,205.]})
col_labels = { 'stock' : 'Ticker Symbol',
'price' : 'Closing Price in USD' }
That's just a dictionary of column labels, but often the majority of metadata is related to specific columns. Here's the sample data, with labels:
df.rename(columns=col_labels)
# Ticker Symbol Closing Price in USD
# 0 AAPL 445.0
# 1 AAPL 455.0
# 2 MSFT 195.0
# 3 MSFT 205.0
The nice thing is that the labels "persist" in the sense that you can basically apply them to any data whose columns are a subset or superset of the original columns:
df.groupby('stock').mean().rename(columns=col_labels)
# Closing Price in USD
# stock
# AAPL 450.0
# MSFT 200.0
You can get some limited persistence if you use the attrs
attribute:
df.attrs = col_labels
But it's fairly limited. It will persist for dataframes derived via .copy()
,loc[]
, or iloc[]
, but not for a groupby()
. You can of course reattach to any derivative dataframe with, for example,
df2.attrs = df.attrs
But as noted in the documentation (or lack thereof), this is an experimental feature and subject to change. Seems slightly better than nothing, and maybe will be expanded in the future. I couldn't find much info at all regarding attrs
, but it appears to be initialized as an empty dictionary, and can only be a dictionary (or similar) although of course lists could be nested below the top level.
Solution 2:
I think that MultiIndexes is the way to go, but this way:
daily_price_data = pd.DataFrame({'Apple': [90, 85, 30], 'MSFT':[20, 30, 25]})
daily_earnings_data = pd.DataFrame({'Apple': [5000, 58000, 5100], 'MSFT':[2000, 2200, 3000]})
data = pd.concat({'price':daily_price_data, 'earnings': daily_earnings_data}, axis=1)
data
earnings price
Apple MSFT Apple MSFT
0 5000 2000 90 20
1 58000 2200 85 30
2 5100 3000 30 25
Then, to divide:
data['price'] / data['earnings']
If you find that your workflow makes more sense to have companies listed on the first level of the index, then pandas.DataFrame.xs will be very helpful:
data2 = data.reorder_levels([1,0], axis=1).sort_index(axis=1)
data2.xs('price', axis=1, level=-1) / data2.xs('earnings', axis=1, level=-1)
Post a Comment for "How To Handle Meta Data Associated With A Pandas Dataframe?"