python - Missing samples of a dataframe in pandas
My df:

    In [163]: df.head()
    Out[163]:
                           x-axis    y-axis    z-axis
    time
    2017-07-27 06:23:08 -0.107666 -0.068848  0.963623
    2017-07-27 06:23:08 -0.105225 -0.070068  0.963867
    .....

I set the index to the datetime. Since the sampling rate (10 Hz) is not constant in the dataframe, some seconds have only 8 or 9 samples. I would like to:

- specify the milliseconds on the datetime (06:23:08**.100**, 06:23:08**.200**, etc.)
- interpolate the missing samples.

Any ideas on how to do this in pandas?
First, let's create some sample data that may resemble your data.

    import pandas as pd
    from datetime import timedelta
    from datetime import datetime

    base = datetime.now()
    # Two timestamps exactly one day apart
    date_list = [base - timedelta(days=x) for x in range(0, 2)]
    values = [v for v in range(2)]
    df = pd.DataFrame.from_dict({'date': date_list, 'values': values})
    df = df.set_index('date')
    df

                                values
    date
    2017-08-18 20:42:08.563878       0
    2017-08-17 20:42:08.563878       1

Now let's create a data frame that has a data point every 100 milliseconds.

    min_val = df.index.min()
    max_val = df.index.max()
    all_val = []
    # Step from the earliest to the latest timestamp in 100 ms increments
    while min_val <= max_val:
        all_val.append(min_val)
        min_val += timedelta(milliseconds=100)
    # len(all_val) 864001
    df_new = pd.DataFrame.from_dict({'date': all_val})
    df_new = df_new.set_index('date')
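The same 100 ms grid can probably be built without the explicit loop. The following is only a sketch using `pd.date_range`; the fixed `start` timestamp and the two-row `df` are stand-ins for the sample data above, not part of the original answer.

```python
import pandas as pd

# Hypothetical two-point frame standing in for the sample `df` above.
start = pd.Timestamp("2017-08-17 20:42:08.563878")
df = pd.DataFrame(
    {"values": [1, 0]},
    index=pd.DatetimeIndex([start, start + pd.Timedelta(days=1)], name="date"),
)

# One timestamp every 100 ms from the first to the last sample, inclusive.
grid = pd.date_range(df.index.min(), df.index.max(), freq="100ms", name="date")

# Empty frame that carries only the regular index, like df_new above.
df_new = pd.DataFrame(index=grid)
```

With one day between the two samples this should give the same 864001 rows as the loop, since both endpoints fall on the grid.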
Let's join both data frames, so that the missing rows have an index but no values.

    final_df = df_new.join(df)
    final_df

                                values
    date
    2017-08-17 20:42:08.563878     1.0
    2017-08-17 20:42:08.663878     NaN
    2017-08-17 20:42:08.763878     NaN
    2017-08-17 20:42:08.863878     NaN
    2017-08-17 20:42:08.963878     NaN
    2017-08-17 20:42:09.063878     NaN
    2017-08-17 20:42:09.163878     NaN
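As an alternative to the join, `DataFrame.reindex` against the regular grid should produce the same kind of frame, with `NaN` wherever no sample exists. This is a small sketch on made-up data (two samples 400 ms apart), not the answer's own method.

```python
import pandas as pd

# Hypothetical frame: two samples 400 ms apart.
start = pd.Timestamp("2017-08-17 20:42:08")
df = pd.DataFrame(
    {"values": [1.0, 0.0]},
    index=pd.DatetimeIndex(
        [start, start + pd.Timedelta(milliseconds=400)], name="date"
    ),
)

# Regular 100 ms grid spanning the data.
grid = pd.date_range(df.index.min(), df.index.max(), freq="100ms", name="date")

# Rows on the grid that have no sample are filled with NaN.
final_df = df.reindex(grid)
```

Note that `reindex` keeps only timestamps that are on the grid, so samples that fall between grid points would be dropped rather than aligned.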
Now interpolate the data:

    final_df.interpolate()

                                  values
    date
    2017-08-17 20:42:08.563878  1.000000
    2017-08-17 20:42:08.663878  0.999999
    2017-08-17 20:42:08.763878  0.999998
    2017-08-17 20:42:08.863878  0.999997
    2017-08-17 20:42:08.963878  0.999995
    2017-08-17 20:42:09.063878  0.999994
    2017-08-17 20:42:09.163878  0.999993
    2017-08-17 20:42:09.263878  0.999992

Some interpolation strategies are described here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.dataframe.interpolate.html
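To illustrate how the choice of strategy can matter, here is a minimal sketch comparing the default `method="linear"` with `method="time"` on a made-up series whose timestamps are unevenly spaced. The values and timestamps are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical samples: the gap after the missing value is three times longer
# than the gap before it.
idx = pd.to_datetime([
    "2017-07-27 06:23:08.000",
    "2017-07-27 06:23:08.100",
    "2017-07-27 06:23:08.400",
])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

# "linear" ignores the index and treats rows as equally spaced.
linear = s.interpolate(method="linear")

# "time" weights the estimate by the actual gaps between timestamps.
by_time = s.interpolate(method="time")
```

On a perfectly regular 100 ms grid the two methods should agree, so the difference only shows up when the index is irregular.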
Update: as per the discussion in the comments:

Say our initial data does not have millisecond information.

    # Assumption: df_new here has a plain 'date' column with
    # second-resolution timestamps, not the index built above.
    df_new_date_without_miliseconds = df_new['date']
    df_new_date_without_miliseconds[0]
    # Timestamp('2017-08-17 21:45:49')

    max_value_date = df_new_date_without_miliseconds[0]
    max_value_miliseconds = df_new_date_without_miliseconds[0]
    updated_dates = []
    for val in df_new_date_without_miliseconds:
        if val == max_value_date:
            # Same second as before: move 100 ms past the last assigned value
            val = max_value_miliseconds + timedelta(milliseconds=100)
            max_value_miliseconds = val
        elif val > max_value_date:
            # New second: restart from the whole-second timestamp
            max_value_date = val + timedelta(milliseconds=0)
            max_value_miliseconds = val
        updated_dates.append(val)

Output:

    [Timestamp('2017-08-17 21:45:49.100000'),
     Timestamp('2017-08-17 21:45:49.200000'),
     Timestamp('2017-08-17 21:45:49.300000'),
     Timestamp('2017-08-17 21:45:50'),
     Timestamp('2017-08-17 21:45:50.100000'),

Assign the new values to the dataframe:

    df_new['date'] = updated_dates
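A loop-free variant of the same idea might use `groupby(...).cumcount()` to number the rows within each second. This is only a sketch on made-up timestamps, and it differs from the loop above in one respect: every second, including the first, starts at `.000` instead of `.100`.

```python
import pandas as pd

# Hypothetical second-resolution timestamps with repeats, as in the question.
dates = pd.Series(pd.to_datetime([
    "2017-08-17 21:45:49",
    "2017-08-17 21:45:49",
    "2017-08-17 21:45:49",
    "2017-08-17 21:45:50",
    "2017-08-17 21:45:50",
]))

# Position of each row within its second: 0, 1, 2, 0, 1
position = dates.groupby(dates).cumcount()

# Shift each row by position * 100 ms.
updated = dates + pd.to_timedelta(position * 100, unit="ms")
```

This assumes the samples inside each second are already in order and evenly spaced at 100 ms, which the question says is only approximately true (8 or 9 samples per second), so the assigned offsets are a guess at the real sampling instants.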