Python, Pandas: How To Club Rows Of Dataframe Based On Start Time And End Time
Solution 1:
To present a more instructive example, I took the following source DataFrame:
    Start Time  End Time     IDs  C1  C2
0     15:02:13  15:10:24  BAMB30  X9  Y9
1     19:46:19  19:46:29  BHI110  X9  Y9
2     19:47:01  19:57:04  BHI110  D2  F2
3     19:47:01  19:56:58  BHI110  D2  E2
4     19:47:01  19:56:59  BHI110  D2  E2
5     20:00:02  20:20:00  BHI110  G3  H3
6     20:01:03  20:21:16  BHI110  G3  H3
7     20:15:00  20:23:20  BHI110  X9  Y9
8     12:01:46  12:06:30  AKB286  A1  B1
9     12:02:48  12:06:50  AKB286  A1  B1
10    12:02:50  12:06:55  AKB286  A1  C1
I added C1 and C2 columns (to be compared on equality within the current group), according to your comment.
Since both Start Time and End Time columns are of string type, the first step is to convert them to Timedelta:
df['Start Time'] = df['Start Time'].apply(pd.Timedelta)
df['End Time'] = df['End Time'].apply(pd.Timedelta)
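As a side note, the same conversion can probably be done for a whole column at once with pd.to_timedelta. The sketch below is only an illustration: the two-row sample frame is made up, and only the column names come from the example above.

```python
import pandas as pd

# Hypothetical two-row sample; the column names follow the example above.
sample = pd.DataFrame({
    'Start Time': ['15:02:13', '19:46:19'],
    'End Time': ['15:10:24', '19:46:29'],
})

# pd.to_timedelta parses 'HH:MM:SS' strings for a whole column in one call.
for col in ['Start Time', 'End Time']:
    sample[col] = pd.to_timedelta(sample[col])

# Timedelta columns support arithmetic, e.g. the duration of each row in seconds.
duration = (sample['End Time'] - sample['Start Time']).dt.total_seconds()
print(duration.tolist())  # [491.0, 10.0]
```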
Then I defined a size limit for a group to be considered an Event. You mentioned a limit of 5, but since both your data sample and mine contain only smaller groups, I set this limit to 2:
sizeLimit = 2
Of course, when running my code on your real data, change this limit to whatever you need.
Then define a function to check the "time delta" between the current row and the "starting row" and generate "event numbers":
def tDlt(row):
    st, et, c1, c2 = row[['Start Time', 'End Time', 'C1', 'C2']]
    if tDlt.start is None:
        tDlt.start, tDlt.end, tDlt.ev, tDlt.c1, tDlt.c2 = st, et, 0, c1, c2
    else:
        if ((st - tDlt.start).total_seconds() > 120)\
                or ((et - tDlt.end).total_seconds() > 120)\
                or (c1 != tDlt.c1) or (c2 != tDlt.c2):
            tDlt.start, tDlt.end, tDlt.c1, tDlt.c2 = st, et, c1, c2
            tDlt.ev += 1
    return tDlt.ev
Because it uses its own attributes, it is a "function with memory": the start and end attributes keep the respective values from the "starting row" of the current group (c1 and c2 likewise), and the ev attribute keeps the current event number.
This function will be applied to each row of the DataFrame, but beforehand its start attribute must be set to None, so that the first row is handled properly.
Note that the starting row is set:
- on the first row (when tDlt.start is None),
- on each row "too distant in time" from the "starting" row or with C1 or C2 different from the "starting" row.
This function generates consecutive "event numbers":
- starting from 0,
- increased whenever any condition to continue the current group has not been met,
- for all groups, even those below the size limit.
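The grouping rule described above can also be sketched without function attributes. The version below is only an illustration in plain Python: the helper names assign_events and t, the tuple layout and the max_gap parameter are assumptions, not part of the original code.

```python
from datetime import timedelta

def assign_events(rows, max_gap=timedelta(seconds=120)):
    """Return an event number for each (start, end, c1, c2) tuple.

    rows must already be sorted by start and end time.
    """
    events = []
    anchor = None  # the "starting row" of the current group
    ev = -1
    for start, end, c1, c2 in rows:
        new_group = (
            anchor is None
            or start - anchor[0] > max_gap
            or end - anchor[1] > max_gap
            or (c1, c2) != anchor[2:]
        )
        if new_group:
            anchor = (start, end, c1, c2)
            ev += 1
        events.append(ev)
    return events

def t(text):
    """Hypothetical helper: parse 'HH:MM:SS' into a timedelta."""
    h, m, s = map(int, text.split(':'))
    return timedelta(hours=h, minutes=m, seconds=s)

rows = [
    (t('12:01:46'), t('12:06:30'), 'A1', 'B1'),
    (t('12:02:48'), t('12:06:50'), 'A1', 'B1'),
    (t('12:02:50'), t('12:06:55'), 'A1', 'C1'),
    (t('15:02:13'), t('15:10:24'), 'X9', 'Y9'),
]
print(assign_events(rows))  # [0, 0, 1, 2]
```

As in tDlt, each row is compared with the starting row of the current group, not with the immediately preceding row.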
The main processing runs as follows:
Set the "initial value" in start attribute of tDlt function:
tDlt.start = None
Sort df and apply tDlt to each row:
ev = df.sort_values(['Start Time', 'End Time']).apply(tDlt, axis=1)
The result (for my data sample) is:
8     0
9     0
10    1
0     2
1     3
3     4
4     4
2     5
5     6
6     6
7     7
dtype: int64
Of course, the row order is different, due to the sort before the application.
Check e.g. rows with indices 3, 4 and 2. Row 3 is the earliest in this group. Row 4 is within the same group (all conditions met). But row 2 has a different value in the C2 column, so it starts a new group.
The next step is to cancel group numbers for "too small" groups:
ev = ev.groupby(ev).transform(lambda grp: str(grp.iloc[0]) if grp.size >= sizeLimit else '')
Steps:
- take each group (by value) and check its size,
- if it has at least sizeLimit rows, return the original group number, but as a string (for each row),
- otherwise return an empty string (also for each row) - the actual cancellation.
The result is:
8     0
9     0
10
0
1
3     4
4     4
2
5     6
6     6
7
dtype: object
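The same size filter can also be sketched with transform('size') and where. This is an alternative, not the code used in this answer: it marks cancelled groups with NaN instead of an empty string, and the sample Series and the name size_limit are made up for illustration.

```python
import pandas as pd

# Hypothetical raw event numbers, one per row (index = original row labels).
ev = pd.Series([0, 0, 1, 2, 2, 2, 3], index=[8, 9, 10, 0, 1, 3, 4])
size_limit = 2

# Size of the group each row belongs to, aligned with ev.
sizes = ev.groupby(ev).transform('size')

# Keep the event number only where the group is large enough; others become NaN.
kept = ev.where(sizes >= size_limit)
print(sizes.tolist())        # [2, 2, 1, 3, 3, 3, 1]
print(kept.isna().tolist())  # [False, False, True, False, False, False, True]
```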
Now perform "initial filling" of the new column:
df['Event'] = ev[ev != ''].groupby(ev, sort=False).ngroup() + 1
Steps:
- Take non-empty elements from ev.
- Group them (by their value).
- Return the "global" group number, starting from 1. Note that the "initial group numbers" (computed so far) are here changed into consecutive numbers.
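The renumbering done by ngroup can be seen on a small made-up Series. The labels below are hypothetical and deliberately not in ascending order, to show that sort=False numbers the groups in order of first appearance.

```python
import pandas as pd

# Hypothetical surviving group labels (with gaps), indexed by original row labels.
labels = pd.Series(['4', '4', '0', '0', '6', '6'], index=[8, 9, 3, 4, 5, 6])

# sort=False numbers groups in order of first appearance; + 1 makes them 1-based.
event = labels.groupby(labels, sort=False).ngroup() + 1
print(event.tolist())  # [1, 1, 2, 2, 3, 3]
```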
But this is not the final content yet (print df at this stage), because:
- cells for "too short" groups contain NaN,
- elements are of float type.
To get rid of the above deficiencies, run:
df['Event'] = df['Event'].replace(np.nan, '')
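If integer event numbers are preferred over empty strings, one possible alternative (not what the code above does) is the nullable Int64 dtype in pandas. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Event column as it looks after the "initial filling" step.
event = pd.Series([1.0, 1.0, np.nan, 2.0, 2.0])

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>.
event_int = event.astype('Int64')
print(event_int.dtype)               # Int64
print(int(event_int.isna().sum()))   # 1
```

The rest of this answer continues with the empty-string version.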
The final result, sorted by both times, is:
    Start Time  End Time     IDs  C1  C2  Event
8     12:01:46  12:06:30  AKB286  A1  B1      1
9     12:02:48  12:06:50  AKB286  A1  B1      1
10    12:02:50  12:06:55  AKB286  A1  C1
0     15:02:13  15:10:24  BAMB30  X9  Y9
1     19:46:19  19:46:29  BHI110  X9  Y9
3     19:47:01  19:56:58  BHI110  D2  E2      2
4     19:47:01  19:56:59  BHI110  D2  E2      2
2     19:47:01  19:57:04  BHI110  D2  F2
5     20:00:02  20:20:00  BHI110  G3  H3      3
6     20:01:03  20:21:16  BHI110  G3  H3      3
7     20:15:00  20:23:20  BHI110  X9  Y9
As you can see:
- The first 2 rows (indices 8 and 9) are close enough in time and have equal C1 and C2, so they form event 1.
- The third row (index 10) is close in time to them, but has a different value in C2, so it has not been included in the above group.
- The rows with indices 0 and 1 are each too distant in time from their neighbours, so they cannot be grouped into an event.
- Rows 3 and 4 form the next group.
- Row 2 is excluded, due to different value in C2.
- And so on.
Most likely, both of your columns to be checked for equality have other names, so put their actual names in tDlt instead of C1 and C2.
Edit following comment about grouping by IDs
Change the function to:
def tDlt(row):
    id, st, et, c1, c2 = row[['IDs', 'Start Time', 'End Time', 'C1', 'C2']]
    if tDlt.start is None:
        tDlt.id, tDlt.start, tDlt.end, tDlt.ev, tDlt.c1, tDlt.c2 = id, st, et, 0, c1, c2
    else:
        if id != tDlt.id\
                or ((st - tDlt.start).total_seconds() > 120)\
                or ((et - tDlt.end).total_seconds() > 120)\
                or (c1 != tDlt.c1) or (c2 != tDlt.c2):
            tDlt.id, tDlt.start, tDlt.end, tDlt.c1, tDlt.c2 = id, st, et, c1, c2
            tDlt.ev += 1
    return tDlt.ev
After tDlt.start = None, change the next instruction to:
ev = df.sort_values(['IDs', 'Start Time', 'End Time']).apply(tDlt, axis=1)
I suppose that you added an internal attribute for IDs, but forgot to sort the source DataFrame on IDs.
To test this code, I added 2 rows:
11    20:00:02  20:20:00  XXX110  G3  H3
12    20:01:03  20:21:16  XXX110  G3  H3
Note that they are just like the rows with indices 5 and 6, but with a different value in IDs.
The result, for such extended data, sorted on IDs, Start Time and End Time is:
    Start Time  End Time     IDs  C1  C2  Event
8     12:01:46  12:06:30  AKB286  A1  B1      1
9     12:02:48  12:06:50  AKB286  A1  B1      1
10    12:02:50  12:06:55  AKB286  A1  C1
0     15:02:13  15:10:24  BAMB30  X9  Y9
1     19:46:19  19:46:29  BHI110  X9  Y9
3     19:47:01  19:56:58  BHI110  D2  E2      2
4     19:47:01  19:56:59  BHI110  D2  E2      2
2     19:47:01  19:57:04  BHI110  D2  F2
5     20:00:02  20:20:00  BHI110  G3  H3      3
6     20:01:03  20:21:16  BHI110  G3  H3      3
7     20:15:00  20:23:20  BHI110  X9  Y9
11    20:00:02  20:20:00  XXX110  G3  H3      4
12    20:01:03  20:21:16  XXX110  G3  H3      4
so rows from XXX110 are members of a separate event.
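Why sorting on IDs first matters can be seen on just the four rows with indices 5, 6, 11 and 12. The sketch below only compares the two sort orders; it rebuilds those rows by hand and does not run tDlt.

```python
import pandas as pd

# Rows 5, 6, 11 and 12 from the example: identical times, different IDs.
rows = pd.DataFrame({
    'IDs': ['BHI110', 'BHI110', 'XXX110', 'XXX110'],
    'Start Time': ['20:00:02', '20:01:03', '20:00:02', '20:01:03'],
    'End Time': ['20:20:00', '20:21:16', '20:20:00', '20:21:16'],
}, index=[5, 6, 11, 12])
for col in ['Start Time', 'End Time']:
    rows[col] = pd.to_timedelta(rows[col])

# Sorting on time only interleaves the two IDs ...
by_time = rows.sort_values(['Start Time', 'End Time'])
print(by_time['IDs'].tolist())

# ... while sorting on IDs first keeps the rows of each ID next to each other.
by_id = rows.sort_values(['IDs', 'Start Time', 'End Time'])
print(by_id.index.tolist())  # [5, 6, 11, 12]
```

With the time-only order, the IDs value changes on every row, so the modified tDlt would start a new group each time and these rows would end up in groups of one.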