
Is There Any Nicer Way To Aggregate Multiple Columns On Same Grouped Pandas Dataframe?

I am trying to figure out how I should manipulate my data so that I can aggregate multiple columns on the same grouped pandas data. The reason why I am doing this is because I need to …

Solution 1:

This answer builds on the one by Andreas, who has already answered the main question of how to produce aggregate variables of multiple columns in a compact way. The goal here is to apply that solution to your specific case and to give an example of how to produce a single figure from the aggregated data. Here are some key points:

  • The dates in the original dataset are already on a weekly frequency so groupby('week') is not needed for df_ret_df1 and dff_ret_df2, which is why these contain identical values for min, max, and mean.
  • This example uses pandas and matplotlib so the variables do not need to be stacked as when using seaborn.
  • The aggregation step produces a MultiIndex for the columns. You can access the aggregated variables (min, max, mean) of each top-level variable with df.xs (see the sketch after this list).
  • The date is set as the index of the aggregated dataframe to use as the x variable. Using the DatetimeIndex as the x variable gives you more flexibility for formatting the tick labels and ensures that the data is always plotted in chronological order.
  • It is not clear from the question how the data for separate years should be displayed (in separate figures?), so here the entire time series is shown in a single figure.
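
As a minimal illustration of the df.xs access pattern mentioned in the list above (the toy frame here is made up for demonstration and is not the question's dataset):

import pandas as pd

# Toy frame: two variables, one grouping key
toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [10, 20, 30, 40],
                    'grp': ['x', 'x', 'y', 'y']})

# Aggregating with a dict of lists produces a column MultiIndex:
# ('a', 'min'), ('a', 'max'), ('a', 'mean'), ('b', 'min'), ...
agg = toy.groupby('grp').agg({'a': ['min', 'max', 'mean'],
                              'b': ['min', 'max', 'mean']})

# xs pulls out all aggregates of one top-level variable at once
print(agg.xs('a', axis=1))
#      min  max  mean
# grp
# x      1    2   1.5
# y      3    4   3.5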

Import dataset and aggregate it as needed

import pandas as pd               # v 1.2.3
import matplotlib.pyplot as plt   # v 3.3.4

# Import dataset
url = 'https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/\
raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv'
df = pd.read_csv(url, parse_dates=['date'])

# Create dataframe containing data for ground beef products, compute
# aggregate variables, and set the date as the index
df_gbeef = df[df['retail_item'].str.contains('GROUND BEEF')].copy()
df_gbeef['price_gap'] = df_gbeef['high_price'] - df_gbeef['low_price']
agg_dict = {'number_of_ads': ['min', 'max', 'mean'],
            'price_gap': ['min', 'max', 'mean']}
df_gbeef_agg = (df_gbeef.groupby(['date', 'retail_item']).agg(agg_dict)
                .reset_index('retail_item'))
df_gbeef_agg


Plot aggregated variables in a single figure containing small multiples

variables = ['number_of_ads', 'price_gap']
colors = ['tab:orange', 'tab:blue']
nrows = len(variables)
ncols = df_gbeef_agg['retail_item'].nunique()

fig, axs = plt.subplots(nrows, ncols, figsize=(10, 5), sharex=True, sharey='row')
for axs_row, var, color in zip(axs, variables, colors):
    for i, (item, df_item) in enumerate(df_gbeef_agg.groupby('retail_item')):
        ax = axs_row[i]
        
        # Select data and plot it
        data = df_item.xs(var, axis=1)
        ax.fill_between(x=data.index, y1=data['min'], y2=data['max'],
                        color=color, alpha=0.3, label='min/max')
        ax.plot(data.index, data['mean'], color=color, label='mean')
        ax.spines['bottom'].set_position('zero')
        
        # Format x-axis tick labels
        fmt = plt.matplotlib.dates.DateFormatter('%W') # is not equal to ISO week
        ax.xaxis.set_major_formatter(fmt)
        
        # Format subplot according to position within the figure
        if ax.is_first_row():
            ax.set_title(item, pad=10)
        if ax.is_last_row():
            ax.set_xlabel('Week number', size=12, labelpad=5)
        if ax.is_first_col():
            ax.set_ylabel(var, size=12, labelpad=10)
        if ax.is_last_col():
            ax.legend(frameon=False)

fig.suptitle('Cross-regional weekly ads and price gaps of ground beef products',
             size=14, y=1.02)
fig.subplots_adjust(hspace=0.1);

[figure: small_multiples — grid of small multiples produced by the code above]
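
One caveat: the ax.is_first_row() family of checks used above works with the matplotlib v3.3.4 this answer was written against, but it was deprecated in matplotlib 3.4 and removed in later releases. On a newer matplotlib the same position checks go through the subplot spec; a minimal runnable sketch:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 2)
for ax in axs.flat:
    ss = ax.get_subplotspec()  # replaces ax.is_first_row() and friends
    if ss.is_first_row():
        ax.set_title('top row')
    if ss.is_last_row():
        ax.set_xlabel('Week number')
    if ss.is_first_col():
        ax.set_ylabel('value')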

Solution 2:

I am not sure if this answers your question fully, but based on your headline I guess it all boils down to:

import pandas as pd

url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])

# define which columns to group and in which way
dct = {'low_price': ['max', 'min'],
       'high_price': 'min',
       'year': 'mean'}

# actually group the columns
df.groupby(['region']).agg(dct)

Output:

             low_price       high_price         year
                   max   min        min         mean
region
ALASKA           16.99  1.33       1.33  2020.792123
HAWAII           12.99  1.33       1.33  2020.738318
MIDWEST          28.73  0.99       0.99  2020.690159
NORTHEAST        19.99  1.20       1.99  2020.709916
NORTHWEST        16.99  1.33       1.33  2020.736397
SOUTHCENTRAL     28.76  1.20       1.49  2020.700980
SOUTHEAST        21.99  1.33       1.48  2020.699655
SOUTHWEST        16.99  1.29       1.29  2020.704341
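
If the nested (MultiIndex) column labels in that output get in the way, pandas' named aggregation (available since pandas 0.25) yields flat column names instead. A sketch of the same grouping, where the output names (low_max, etc.) are arbitrary labels chosen here for illustration:

# Same grouping, but with flat, self-describing output columns
df.groupby('region').agg(
    low_max=('low_price', 'max'),
    low_min=('low_price', 'min'),
    high_min=('high_price', 'min'),
    year_mean=('year', 'mean'),
)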
