Is There Any Nicer Way To Aggregate Multiple Columns On Same Grouped Pandas Dataframe?
I am trying to figure out how I should manipulate my data so that I can aggregate multiple columns for the same grouped pandas dataframe. The reason why I am doing this is because I need to
Solution 1:
This answer builds on the one by Andreas, who has already answered the main question of how to produce aggregate variables of multiple columns in a compact way. The goal here is to apply that solution to your specific case and to give an example of how to produce a single figure from the aggregated data. Here are some key points:
- The dates in the original dataset are already on a weekly frequency, so groupby('week') is not needed for df_ret_df1 and dff_ret_df2, which is why these contain identical values for min, max, and mean.
- This example uses pandas and matplotlib, so the variables do not need to be stacked as when using seaborn.
- The aggregation step produces a MultiIndex for the columns. You can access the aggregated variables (min, max, mean) of each high-level variable by using df.xs (see the short sketch after this list).
- The date is set as the index of the aggregated dataframe to use as the x variable. Using the DatetimeIndex as the x variable gives you more flexibility for formatting the tick labels and ensures that the data is always plotted in chronological order.
- It is not clear in the question how the data for separate years should be displayed (in separate figures?), so here the entire time series is shown in a single figure.
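To make the df.xs point concrete, here is a minimal sketch of the pattern on a small made-up frame (the column names mirror the real ones, but the values are invented purely for illustration):
import pandas as pd

# Toy frame standing in for the real data
toy = pd.DataFrame({'date': pd.to_datetime(['2020-01-06', '2020-01-06', '2020-01-13']),
                    'number_of_ads': [3, 5, 4],
                    'price_gap': [1.0, 2.0, 1.5]})

# Aggregating with several functions per column produces MultiIndex columns:
# level 0 = original column name, level 1 = aggregation name
agg = toy.groupby('date').agg({'number_of_ads': ['min', 'max', 'mean'],
                               'price_gap': ['min', 'max', 'mean']})

# .xs on the column axis keeps all aggregates of one variable and returns a
# frame with plain 'min'/'max'/'mean' columns, indexed by date
ads = agg.xs('number_of_ads', axis=1)
print(ads)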
Import dataset and aggregate it as needed
import pandas as pd              # v 1.2.3
import matplotlib.pyplot as plt  # v 3.3.4

# Import dataset
url = 'https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/\
raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv'
df = pd.read_csv(url, parse_dates=['date'])
# Create dataframe containing data for ground beef products, compute
# aggregate variables, and set the date as the index
df_gbeef = df[df['retail_item'].str.contains('GROUND BEEF')].copy()
df_gbeef['price_gap'] = df_gbeef['high_price'] - df_gbeef['low_price']
agg_dict = {'number_of_ads': ['min', 'max', 'mean'],
            'price_gap': ['min', 'max', 'mean']}
df_gbeef_agg = (df_gbeef.groupby(['date', 'retail_item']).agg(agg_dict)
                        .reset_index('retail_item'))
df_gbeef_agg
Plot aggregated variables in single figure containing small multiples
variables = ['number_of_ads', 'price_gap']
colors = ['tab:orange', 'tab:blue']
nrows = len(variables)
ncols = df_gbeef_agg['retail_item'].nunique()
fig, axs = plt.subplots(nrows, ncols, figsize=(10, 5), sharex=True, sharey='row')
for axs_row, var, color in zip(axs, variables, colors):
    for i, (item, df_item) in enumerate(df_gbeef_agg.groupby('retail_item')):
        ax = axs_row[i]
        # Select data and plot it
        data = df_item.xs(var, axis=1)
        ax.fill_between(x=data.index, y1=data['min'], y2=data['max'],
                        color=color, alpha=0.3, label='min/max')
        ax.plot(data.index, data['mean'], color=color, label='mean')
        ax.spines['bottom'].set_position('zero')
        # Format x-axis tick labels
        fmt = plt.matplotlib.dates.DateFormatter('%W')  # is not equal to ISO week
        ax.xaxis.set_major_formatter(fmt)
        # Format subplot according to position within the figure
        if ax.is_first_row():
            ax.set_title(item, pad=10)
        if ax.is_last_row():
            ax.set_xlabel('Week number', size=12, labelpad=5)
        if ax.is_first_col():
            ax.set_ylabel(var, size=12, labelpad=10)
        if ax.is_last_col():
            ax.legend(frameon=False)

fig.suptitle('Cross-regional weekly ads and price gaps of ground beef products',
             size=14, y=1.02)
fig.subplots_adjust(hspace=0.1);
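As the comment in the loop notes, the '%W' directive is not the ISO week number. If you want ISO weeks on the x-axis instead, one option is a FuncFormatter built on isocalendar; this is only a sketch on a tiny made-up series, not part of the original answer:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import num2date
from matplotlib.ticker import FuncFormatter

# Label weekly dates with their ISO week number instead of the '%W' week
dates = pd.date_range('2020-01-06', periods=5, freq='W-MON')
fig, ax = plt.subplots()
ax.plot(dates, list(range(5)))
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: str(num2date(x).isocalendar()[1])))
plt.show()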
Solution 2:
I am not sure if this answers your question fully, but based on your headline I guess it all boils down to:
import pandas as pd
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
# define which columns to group and in which way
dct = {'low_price': ['max', 'min'],
       'high_price': 'min',
       'year': 'mean'}
# actually group the columns
df.groupby(['region']).agg(dct)
Output:
             low_price        high_price         year
                   max   min         min         mean
region
ALASKA           16.99  1.33        1.33  2020.792123
HAWAII           12.99  1.33        1.33  2020.738318
MIDWEST          28.73  0.99        0.99  2020.690159
NORTHEAST        19.99  1.20        1.99  2020.709916
NORTHWEST        16.99  1.33        1.33  2020.736397
SOUTHCENTRAL     28.76  1.20        1.49  2020.700980
SOUTHEAST        21.99  1.33        1.48  2020.699655
SOUTHWEST        16.99  1.29        1.29  2020.704341
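If the MultiIndex columns in this output feel clumsy, named aggregation (available since pandas 0.25) is arguably the "nicer" spelling the question title asks about; the output column names below are just an example choice:
import pandas as pd

url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])

# Named aggregation: keyword = (source column, aggregation function),
# which gives a single flat level of columns named however you like
df.groupby('region').agg(low_price_max=('low_price', 'max'),
                         low_price_min=('low_price', 'min'),
                         high_price_min=('high_price', 'min'),
                         year_mean=('year', 'mean'))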