Using the groupby function on a Python/pandas DataFrame
I have a DataFrame in Python. The data looks like this:
id  time     test  count
1   01:25.5  1105  1
2   02:25.9  1105  0
3   03:25.5  1105  1
4   04:25.5  1105  1
5   05:25.9  1105  1
6   06:25.5  1105  0
7   07:25.9  1105  1
8   08:25.6  1105  1
9   09:25.9  1106  0
10  10:25.6  1105  1
11  11:26.0  1105  1
12  12:25.6  1105  1
13  14:22.0  1105  0
14  14:25.6  1106  1
15  15:26.0  1105  1
16  16:25.6  1105  0
17  17:22.0  1105  1
18  18:25.7  1105  1
19  19:26.0  1105  1
20  20:25.7  1105  0
21  21:25.1  1105  1
22  22:25.7  1106  1
23  22:33.7  1107  0
24  24:25.7  1105  0
25  25:26.1  1105  0
26  27:25.7  1105  1
27  22:35.7  1106  0
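For anyone who wants to reproduce this, here is a minimal sketch of how the frame might be built. It only types in a few of the rows above, and it assumes the time column is minutes:seconds.fraction and should end up as a real datetime column (the parsing format and the zero-hour prefix are my assumptions, not something stated in the data).

```python
import pandas as pd

# A handful of the sample rows, typed in by hand (minutes:seconds.fraction).
rows = [
    (1, "01:25.5", 1105, 1),
    (2, "02:25.9", 1105, 0),
    (3, "03:25.5", 1105, 1),
    (4, "04:25.5", 1105, 1),
    (9, "09:25.9", 1106, 0),
]
df = pd.DataFrame(rows, columns=["id", "time", "test", "count"])

# Assumption: prefix a zero hour so the strings parse as HH:MM:SS.f.
# With an explicit format and no date part, the date is a placeholder.
df["time"] = pd.to_datetime("00:" + df["time"], format="%H:%M:%S.%f")

print(df.dtypes)
```

With a datetime column in place, time arithmetic such as "start time plus 3 minutes" becomes straightforward.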
Now I want to group the records based on these conditions:
1) If, within a 3-minute window, there are 4 or more records for a particular test, group those records, count the number of records, and update the count value. The time of the grouped record should be the time of the last record that was grouped.
2) If there are fewer than 4 records for a particular test in a given 3-minute window, leave the records as they are.
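To make the two rules concrete, here is a rough plain-Python sketch of one way to read them, for a single test. The function name `group_windows`, the use of seconds, and the choice to treat the window end as inclusive are all my assumptions; the inclusive end is chosen because records 1 to 4 span exactly 3 minutes and are expected to be grouped.

```python
def group_windows(times, window=180.0, min_records=4):
    """Greedy pass over sorted times (in seconds) for a single test.

    Returns (time, n) pairs: n is the group size when a window was
    collapsed, or None when the record was left as it is.
    """
    out = []
    i = 0
    while i < len(times):
        # Advance j past every record within `window` seconds of times[i].
        j = i
        while j < len(times) and times[j] <= times[i] + window:
            j += 1
        if j - i >= min_records:
            # Rule 1: collapse the window, keeping the last record's time.
            out.append((times[j - 1], j - i))
        else:
            # Rule 2: too few records, leave them as they are.
            out.extend((t, None) for t in times[i:j])
        i = j  # the next window starts after the last record seen
    return out


# Test 1105, input records 1-8 and 10-12, converted to seconds.
times_1105 = [85.5, 145.9, 205.5, 265.5, 325.9, 385.5,
              445.9, 505.6, 625.6, 686.0, 745.6]
print(group_windows(times_1105))
```

Note that when a window has fewer than 4 records, this sketch skips past the whole window rather than sliding forward by one record; that is one possible interpretation of the rules, not the only one.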
Expected output:
id  time     test  count
1   04:25.5  1105  04
2   08:25.6  1105  04
3   10:25.6  1105  1
4   11:26.0  1105  1
5   12:25.6  1105  1
6   17:22.0  1105  04
7   21:25.1  1105  04
8   09:25.9  1106  0
9   14:25.6  1106  1
10  22:25.7  1106  1
11  22:33.7  1107  0
12  24:25.7  1105  0
13  25:26.1  1105  0
14  27:25.7  1105  1
15  22:35.7  1106  0
Explanation of the output:
1) The 1st record is the group of input records 1-4; count is the number of records, and time is the last time in the group.
2) The 2nd record is the group of input records 5-8; count is the number of records, and time is the last time in the group.
3) Records 3, 4, and 5 are left as they are because, for test 1105, the 3-minute window from 10:25 to 13:25 contains only 3 records, not the minimum of 4 needed for them to be grouped.
4) The 6th record is the group of input records 13-17, except record 14 (which belongs to test 1106); count is the number of records, and time is the last time in the group.
5) The 7th record is the group of input records 18-21; count is the number of records, and time is the last time in the group.
6) Records 9-15 are left as they are because, for tests 1105/1106/1107, the given 3-minute windows do not contain the minimum of 4 records.
How can I achieve this result?
Okay, I can't think of a "pandas" way to do this. Because the 3-minute window is relative, not absolute (i.e. not 0-3 minutes, 3-6 minutes, etc.), I don't see how you can use the resample() or rolling() type functionality that is available.
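As a small illustration of that point, here is a sketch using the first four sample times with resample() at its default settings. The series of ones is just a stand-in for "one record per timestamp", and the date part of the timestamps is a placeholder.

```python
import pandas as pd

# First four records of test 1105; they span exactly 3 minutes.
times = pd.to_datetime(
    ["00:01:25.5", "00:02:25.9", "00:03:25.5", "00:04:25.5"],
    format="%H:%M:%S.%f",
)
s = pd.Series(1, index=times)

# By default the 3-minute bins are anchored to fixed clock boundaries
# (00:00, 00:03, ...), not to the time of the first record.
binned = s.resample("3min").sum()
print(binned)
```

The four records that should form one group end up split across two fixed bins, which is why a window that starts at each record's own time seems to need custom logic.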
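I'm not sure this is the most elegant way to accomplish what you're wanting. There is a main function for groupby().apply(), a function to collapse 4 or more tests into 1 row, and a function to create the 3-minute filter. The rest of the code is cleaning up the columns and indexes to get the output you wish. This assumes your starting DataFrame is called df, and it will give you an output DataFrame called dfg.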
    import pandas as pd


    def x_min_filter(df, start_idx, mins=3):
        '''
        Return a boolean filter that starts at the time in row start_idx
        and ends at that time + mins.

        Parameters
        ----------
        df : pandas.DataFrame
            DataFrame to filter
        start_idx : int
            df index number to start at
        mins : int, default=3
            number of minutes of the filtering window
            (start + mins = total window time)
        '''
        start_time = df['time'].iloc[start_idx]
        plus_x_min = start_time + pd.Timedelta(minutes=mins)
        time_fil = (df['time'] >= start_time) & (df['time'] <= plus_x_min)
        return time_fil


    def collapse_gte_x(df, number_of_tests=4):
        '''
        Collapse the DataFrame to 1 row if the number of entries is at
        least number_of_tests; otherwise return the DataFrame as is.

        Parameters
        ----------
        df : pandas.DataFrame
            DataFrame to collapse or return
        number_of_tests : int, default=4
            minimum number of tests required to collapse the DataFrame
            to 1 entry
        '''
        test_count = len(df.index)
        if test_count >= number_of_tests:
            # keep only the last row and store the group size in 'count'
            last = df.tail(1).copy()
            last['count'] = test_count
            return last
        else:
            return df


    def grp_x_mins(grp):
        '''
        Return a DataFrame with the custom grouping applied.
        Intended to be used with groupby.apply.
        '''
        dfs = []  # collapsed (or not) DataFrames to be combined
        next_index = 0

        # clean up the DataFrame; errors='ignore' because, depending on the
        # pandas version, 'test' may not be passed through to apply
        grp = grp.reset_index().drop(['index', 'test', 'id'], axis=1,
                                     errors='ignore')

        # loop through the tests in x-minute chunks
        while next_index < len(grp.index):
            # use the plus-x-minutes filter
            pxm_df = grp[x_min_filter(grp, next_index)]
            # add a copy of the collapsed DataFrame
            dfs.append(collapse_gte_x(pxm_df).copy())
            # next entry in the group that was sent
            next_index = pxm_df.iloc[-1].name + 1

        return pd.concat(dfs)


    # df['time'] must already be a datetime column at this point
    dfg = df.groupby('test').apply(grp_x_mins)

    # clean up the DataFrame
    dfg = dfg.reset_index(level=1, drop=True).reset_index()

    # renumber id as desired
    dfg['id'] = dfg.index + 1

    # drop the date from the datetime; the datetime was required for indexing
    dfg['time'] = dfg.time.dt.time

    dfg
This is the output I get:
    test  time             count  id
0   1105  00:04:25.500000  4      1
1   1105  00:08:25.600000  4      2
2   1105  00:10:25.600000  1      3
3   1105  00:11:26         1      4
4   1105  00:12:25.600000  1      5
5   1105  00:17:22         4      6
6   1105  00:21:25.100000  4      7
7   1105  00:24:25.700000  0      8
8   1105  00:25:26.100000  0      9
9   1105  00:27:25.700000  1      10
10  1106  00:09:25.900000  0      11
11  1106  00:14:25.600000  1      12
12  1106  00:22:25.700000  1      13
13  1106  00:22:35.700000  0      14
14  1107  00:22:33.700000  0      15