
Is There Any Nicer Way To Aggregate Multiple Columns On Same Grouped Pandas Dataframe?

I am trying to figure out how I should manipulate my data so that I can aggregate multiple columns on the same grouped pandas data. The reason why I am doing this is because I need to …

Solution 1:

This answer builds on the one by Andreas, who has already answered the main question of how to produce aggregate variables of multiple columns in a compact way. The goal here is to apply that solution to your specific case and to give an example of how to produce a single figure from the aggregated data. Here are some key points:

  • The dates in the original dataset are already on a weekly frequency so groupby('week') is not needed for df_ret_df1 and dff_ret_df2, which is why these contain identical values for min, max, and mean.
  • This example uses pandas and matplotlib so the variables do not need to be stacked as when using seaborn.
  • The aggregation step produces a MultiIndex for the columns. You can access the aggregated variables (min, max, mean) of each top-level variable with df.xs (see the sketch after this list).
  • The date is set as the index of the aggregated dataframe to use as the x variable. Using the DatetimeIndex as the x variable gives you more flexibility for formatting the tick labels and ensures that the data is always plotted in chronological order.
  • It is not clear from the question how the data for separate years should be displayed (in separate figures?), so here the entire time series is shown in a single figure.
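
As a minimal illustration of the df.xs access pattern mentioned in the list above (the toy frame here is made up for demonstration and is not the question's dataset):

import pandas as pd

# Toy frame: two variables, one grouping key
toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [10, 20, 30, 40],
                    'grp': ['x', 'x', 'y', 'y']})

# Aggregating with a dict of lists produces a column MultiIndex:
# ('a', 'min'), ('a', 'max'), ('a', 'mean'), ('b', 'min'), ...
agg = toy.groupby('grp').agg({'a': ['min', 'max', 'mean'],
                              'b': ['min', 'max', 'mean']})

# xs pulls out all aggregates of one top-level variable at once
print(agg.xs('a', axis=1))
#      min  max  mean
# grp
# x      1    2   1.5
# y      3    4   3.5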

Import dataset and aggregate it as needed

import pandas as pd               # v 1.2.3
import matplotlib.pyplot as plt   # v 3.3.4

# Import dataset
url = 'https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/\
raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv'
df = pd.read_csv(url, parse_dates=['date'])

# Create dataframe containing data for ground beef products, compute
# aggregate variables, and set the date as the index
df_gbeef = df[df['retail_item'].str.contains('GROUND BEEF')].copy()
df_gbeef['price_gap'] = df_gbeef['high_price'] - df_gbeef['low_price']
agg_dict = {'number_of_ads': ['min', 'max', 'mean'],
            'price_gap': ['min', 'max', 'mean']}
df_gbeef_agg = (df_gbeef.groupby(['date', 'retail_item']).agg(agg_dict)
                .reset_index('retail_item'))
df_gbeef_agg


Plot aggregated variables in a single figure containing small multiples

variables = ['number_of_ads', 'price_gap']
colors = ['tab:orange', 'tab:blue']
nrows = len(variables)
ncols = df_gbeef_agg['retail_item'].nunique()

fig, axs = plt.subplots(nrows, ncols, figsize=(10, 5), sharex=True, sharey='row')
for axs_row, var, color in zip(axs, variables, colors):
    for i, (item, df_item) in enumerate(df_gbeef_agg.groupby('retail_item')):
        ax = axs_row[i]
        
        # Select data and plot it
        data = df_item.xs(var, axis=1)
        ax.fill_between(x=data.index, y1=data['min'], y2=data['max'],
                        color=color, alpha=0.3, label='min/max')
        ax.plot(data.index, data['mean'], color=color, label='mean')
        ax.spines['bottom'].set_position('zero')
        
        # Format x-axis tick labels
        fmt = plt.matplotlib.dates.DateFormatter('%W') # is not equal to ISO week
        ax.xaxis.set_major_formatter(fmt)
        
        # Format subplot according to position within the figure
        if ax.is_first_row():
            ax.set_title(item, pad=10)
        if ax.is_last_row():
            ax.set_xlabel('Week number', size=12, labelpad=5)
        if ax.is_first_col():
            ax.set_ylabel(var, size=12, labelpad=10)
        if ax.is_last_col():
            ax.legend(frameon=False)

fig.suptitle('Cross-regional weekly ads and price gaps of ground beef products',
             size=14, y=1.02)
fig.subplots_adjust(hspace=0.1);

[figure: small_multiples — grid of small multiples produced by the code above]
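
One caveat: the ax.is_first_row() family of checks used above works with the matplotlib v3.3.4 this answer was written against, but it was deprecated in matplotlib 3.4 and removed in later releases. On a newer matplotlib the same position checks go through the subplot spec; a minimal runnable sketch:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 2)
for ax in axs.flat:
    ss = ax.get_subplotspec()  # replaces ax.is_first_row() and friends
    if ss.is_first_row():
        ax.set_title('top row')
    if ss.is_last_row():
        ax.set_xlabel('Week number')
    if ss.is_first_col():
        ax.set_ylabel('value')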

Solution 2:

I am not sure if this answers your question fully, but based on your headline I guess it all boils down to:

import pandas as pd

url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])

# define which columns to group and in which way
dct = {'low_price': ['max', 'min'],
       'high_price': 'min',
       'year': 'mean'}

# actually group the columns
df.groupby(['region']).agg(dct)

Output:

             low_price       high_price         year
                   max   min        min         mean
region
ALASKA           16.99  1.33       1.33  2020.792123
HAWAII           12.99  1.33       1.33  2020.738318
MIDWEST          28.73  0.99       0.99  2020.690159
NORTHEAST        19.99  1.20       1.99  2020.709916
NORTHWEST        16.99  1.33       1.33  2020.736397
SOUTHCENTRAL     28.76  1.20       1.49  2020.700980
SOUTHEAST        21.99  1.33       1.48  2020.699655
SOUTHWEST        16.99  1.29       1.29  2020.704341
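
If the nested (MultiIndex) column labels in that output get in the way, pandas' named aggregation (available since pandas 0.25) yields flat column names instead. A sketch of the same grouping, where the output names (low_max, etc.) are arbitrary labels chosen here for illustration:

# Same grouping, but with flat, self-describing output columns
df.groupby('region').agg(
    low_max=('low_price', 'max'),
    low_min=('low_price', 'min'),
    high_min=('high_price', 'min'),
    year_mean=('year', 'mean'),
)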
