Repeat Sections Of Dataframe Based On A Column Value

February 28, 2024

I'm collecting data over the course of many days and, rather than filling it in for every day, I can elect to say that the data from one day should really be a repeat of another day.

Solution 1:

I slightly extended your test data:

import numpy as np
import pandas as pd

data = [['1', 51, np.nan], ['1', 52, np.nan],     ['1', 53, np.nan],
        ['2', 61, np.nan], ['2', 62, np.nan],     ['2', 63, np.nan],
        ['3', np.nan, 1],  ['3', np.nan, np.nan], ['3', np.nan, np.nan],
        ['4', np.nan, 2],  ['4', np.nan, np.nan], ['4', np.nan, np.nan]]
df = pd.DataFrame(data, columns = ['Day', 'Data', 'repeat_tag'])

Details:

There are 4 days with observations.
Each observation has a different value (Data).
So that the test covers more than a single copied day, values for day '3' are copied from day '1' and values for day '4' from day '2'.

I assume that a non-null value of repeat_tag is placed in only one observation of the "target" day.

I also added an obsNo column to identify observations within a particular day:

df['obsNo'] = df.groupby('Day').cumcount().add(1)

(it will be necessary later).
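As a small illustration of the numbering step, the sketch below applies the same groupby/cumcount idiom to a made-up frame (the name toy and its values are assumptions for this example only):

```python
import pandas as pd

# Hypothetical toy frame: day '1' has three observations, day '2' has two.
toy = pd.DataFrame({'Day': ['1', '1', '1', '2', '2']})

# cumcount() numbers the rows of each group starting at 0;
# add(1) shifts the numbering so it starts at 1.
toy['obsNo'] = toy.groupby('Day').cumcount().add(1)

print(toy['obsNo'].tolist())  # [1, 2, 3, 1, 2]
```

The pair (Day, obsNo) then identifies each row uniquely, which is what the later update step relies on.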

The first step of the actual processing is to generate the replDays table, where the Day column is the target day and repeat_tag is the source day:

# .copy() avoids assigning into a view of df (SettingWithCopyWarning)
replDays = df.query('repeat_tag.notnull()')[['Day', 'repeat_tag']].copy()
replDays['repeat_tag'] = replDays['repeat_tag'].astype(int).astype(str)

The repeat_tag column needs a bit of type manipulation. Because it contains NaN alongside integer values, pandas stores it as float64. To get strings that can be compared with Day, it must be converted in two steps:

First to int, to drop the decimal part.
Then to str.
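The two conversion steps can be checked on a standalone Series. This is only a sketch; the names tags, as_text and direct are made up for the example:

```python
import numpy as np
import pandas as pd

# A column mixing NaN and whole numbers is stored as float64.
tags = pd.Series([np.nan, 1, np.nan, 2])
print(tags.dtype)  # float64

# Converting the non-null values to int first drops the decimal part.
as_text = tags.dropna().astype(int).astype(str)
print(as_text.tolist())  # ['1', '2']

# Converting straight to str keeps the '.0', which would not match Day.
direct = tags.dropna().astype(str)
print(direct.tolist())  # ['1.0', '2.0']
```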

The result is:

  Day repeat_tag
6   3          1
9   4          2

(fill data for day 3 with data from day 1 and data for day 4 with data from day 2).

The next step is to generate the replData table:

replData = pd.merge(replDays, df, left_on='repeat_tag', right_on='Day',
    suffixes=('_src', ''))[['Day_src', 'Day', 'Data', 'obsNo']]\
    .set_index(['Day_src', 'obsNo']).drop(columns='Day')

The result is:

               Data
Day_src obsNo      
3       1      51.0
        2      52.0
        3      53.0
4       1      61.0
        2      62.0
        3      63.0

As you can see:

There is only one column of replacement data, Data (taken from days 1 and 2).
The MultiIndex contains both the target day and the observation number (both are needed for a proper update).
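To see how the suffixes argument produces the Day_src column, here is a reduced sketch of the same merge on two made-up frames (left, right and joined are names assumed for the example):

```python
import pandas as pd

# Hypothetical inputs: one target day ('3') that repeats source day '1'.
left = pd.DataFrame({'Day': ['3'], 'repeat_tag': ['1']})
right = pd.DataFrame({'Day': ['1', '1'], 'Data': [51, 52]})

# Both frames have a Day column. The left one gets the '_src' suffix,
# the right one keeps its name because its suffix is the empty string.
joined = pd.merge(left, right, left_on='repeat_tag', right_on='Day',
                  suffixes=('_src', ''))

print(sorted(joined.columns))
print(joined[['Day_src', 'Data']])
```

Here Day_src holds the target day for every row pulled from the source day, which is why it can serve as the first index level of replData.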

And the final part includes the following steps:

Copy df to res (result), setting index to Day and obsNo (required for update).
Update this table with data from replData.
Move Day and obsNo from index back to "regular" columns.

The code is:

res = df.copy().set_index(['Day', 'obsNo'])
res.update(replData)
res.reset_index(inplace=True)
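The update step depends on index alignment: DataFrame.update matches rows by index values and overwrites with the non-NaN values of the other frame. A minimal sketch on made-up data (target, patch and result are names assumed for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical target: day '1' is filled in, day '3' is still empty.
idx = pd.MultiIndex.from_tuples(
    [('1', 1), ('1', 2), ('3', 1), ('3', 2)], names=['Day', 'obsNo'])
target = pd.DataFrame({'Data': [51.0, 52.0, np.nan, np.nan]}, index=idx)

# Replacement rows keyed by (target day, observation number).
patch_idx = pd.MultiIndex.from_tuples(
    [('3', 1), ('3', 2)], names=['Day_src', 'obsNo'])
patch = pd.DataFrame({'Data': [51.0, 52.0]}, index=patch_idx)

# update() works in place; rows absent from patch stay untouched.
target.update(patch)
result = target.reset_index()

print(result['Data'].tolist())  # [51.0, 52.0, 51.0, 52.0]
```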

If you want, you can also drop the obsNo column.

A remark concerning the solution by Peter: if the source data contains different values within a single day, his code fails with InvalidIndexError, probably because individual observations within a day are not identified. This confirms that adding the obsNo column is a valid idea.
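The sketch below does not reproduce Peter's code. It only shows, on made-up data, why Day alone cannot identify a row once a day has several observations, while Day together with obsNo can (df_demo is a name assumed for the example):

```python
import pandas as pd

# Hypothetical frame with two observations per day.
df_demo = pd.DataFrame({'Day': ['1', '1', '2', '2'],
                        'Data': [51, 52, 61, 62]})
df_demo['obsNo'] = df_demo.groupby('Day').cumcount().add(1)

# Indexing by Day alone gives duplicate labels.
by_day = df_demo.set_index('Day').index.is_unique
# Adding obsNo makes every label unique.
by_day_obs = df_demo.set_index(['Day', 'obsNo']).index.is_unique

print(by_day, by_day_obs)  # False True
```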

Solution 2:

Setup

# Start with Valdi_Bo's expanded example data
data = [['1', 51, np.nan], ['1', 52, np.nan],     ['1', 53, np.nan],
        ['2', 61, np.nan], ['2', 62, np.nan],     ['2', 63, np.nan],
        ['3', np.nan, 1],  ['3', np.nan, np.nan], ['3', np.nan, np.nan],
        ['4', np.nan, 2],  ['4', np.nan, np.nan], ['4', np.nan, np.nan]]
df = pd.DataFrame(data, columns = ['Day', 'Data', 'repeat_tag'])

# Convert Day to integer data type
df['Day'] = df['Day'].astype(int)

# Spread repeat_tag values into all rows of tagged day
df['repeat_tag'] = df.groupby('Day')['repeat_tag'].ffill()
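As a quick check of the group-wise forward fill used in the setup, this sketch runs it on a made-up frame (the name toy and its values are assumptions). The fill stays inside each day, so a day without any tag remains NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: days 3 and 4 are tagged on their first row,
# day 5 has no tag at all.
toy = pd.DataFrame({'Day': [3, 3, 4, 4, 5],
                    'repeat_tag': [1, np.nan, 2, np.nan, np.nan]})

# Forward fill within each Day group only.
toy['repeat_tag'] = toy.groupby('Day')['repeat_tag'].ffill()

print(toy['repeat_tag'].tolist())  # [1.0, 1.0, 2.0, 2.0, nan]
```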

Solution

# Within each day, assign a number to each row
df['obs'] = df.groupby('Day').cumcount()

# Self-join
filler = (pd.merge(df, df, 
                   left_on=['repeat_tag', 'obs'], 
                   right_on=['Day', 'obs'])
            .set_index(['Day_x', 'obs'])['Data_y'])

# Fill missing data
df = df.set_index(['Day', 'obs'])
df.loc[df['Data'].isnull(), 'Data'] = filler
df = df.reset_index()
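For reuse, the setup and solution steps could be wrapped in one function. The following is only a sketch under the same assumptions as above (one tag per target day, equal numbers of observations in source and target days); the function name fill_repeated_days is hypothetical:

```python
import numpy as np
import pandas as pd

def fill_repeated_days(frame):
    """Hypothetical helper: fill Data of tagged days from the day named in repeat_tag."""
    out = frame.copy()
    out['Day'] = out['Day'].astype(int)
    # Spread each tag over all rows of its day, then number rows within a day.
    out['repeat_tag'] = out.groupby('Day')['repeat_tag'].ffill()
    out['obs'] = out.groupby('Day').cumcount()
    # Self-join: a tagged row meets the row of the source day with the same obs.
    filler = (pd.merge(out, out,
                       left_on=['repeat_tag', 'obs'],
                       right_on=['Day', 'obs'])
                .set_index(['Day_x', 'obs'])['Data_y'])
    # Assignment aligns filler to the missing rows by (Day, obs).
    out = out.set_index(['Day', 'obs'])
    out.loc[out['Data'].isnull(), 'Data'] = filler
    return out.reset_index()

sample = pd.DataFrame(
    [['1', 51, np.nan], ['1', 52, np.nan],
     ['3', np.nan, 1], ['3', np.nan, np.nan]],
    columns=['Day', 'Data', 'repeat_tag'])
filled = fill_repeated_days(sample)
print(filled)
```

Working on a copy leaves the caller's frame unchanged, which makes the helper easier to test than the in-place version.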

Result

df
    Day  obs  Data  repeat_tag
0     1    0  51.0         NaN
1     1    1  52.0         NaN
2     1    2  53.0         NaN
3     2    0  61.0         NaN
4     2    1  62.0         NaN
5     2    2  63.0         NaN
6     3    0  51.0         1.0
7     3    1  52.0         1.0
8     3    2  53.0         1.0
9     4    0  61.0         2.0
10    4    1  62.0         2.0
11    4    2  63.0         2.0
