python - Speeding up pandas pivot across large dataframe with mixed datatypes -
i trying pivot long-form dataframe multi-column dataframe, resolving mixed datatypes present in dataframe.
here example data:
data = {'alignmentid': {0: u'ensmust00000000001.4-1', 1: u'ensmust00000000001.4-1', 2: u'ensmust00000000001.4-1', 3: u'ensmust00000000001.4-1', 4: u'ensmust00000000001.4-1'}, 'classifier': {0: u'alncoverage', 1: u'alngoodness', 2: u'alnidentity', 3: u'originalintrons', 4: u'percentunknownbases'}, 'value': {0: u'89.853', 1: u'93.357', 2: u'85.595', 3: u'1,1,1,1,1,1,1,1', 4: u'0.0'}} df = pd.dataframe.from_dict(data)
as can see, unfortunately came out of sqlite database strings, due mixed datatypes in value
column.
right now, approach this:
def aggfunc(s): """used aggregate columns. attempts convert each cell float if possible""" try: return float(s) except valueerror: return s.iloc[0] except typeerror: assert s.iloc[0] none return none result = pd.pivot_table(metrics_df, index='alignmentid', columns='classifier', values='value', fill_value=none, aggfunc=aggfunc).reset_index()
which works, slow. across 500 lines of data, takes 43 seconds. used %lprun
on it:
%lprun -f aggfunc wrapper(metrics_df) timer unit: 1e-06 s total time: 43.4913 s file: <ipython-input-120-7015d1dd4f54> function: aggfunc @ line 1 line # hits time per hit % time line contents ============================================================== 1 def aggfunc(s): 2 """used aggregate columns. attempts convert each cell float if possible""" 3 250695 492657 2.0 1.1 try: 4 250695 37849440 151.0 87.0 return float(s) 5 46548 158232 3.4 0.4 except valueerror: 6 44206 4743152 107.3 10.9 return s.iloc[0] 7 2342 4661 2.0 0.0 except typeerror: 8 2342 239844 102.4 0.6 assert s.iloc[0] none 9 2342 3274 1.4 0.0 return none
i ran lprun
on pd.pivot_table
itself. output truncated 1 relevant line shows aggfunc takes of time:
%lprun -f pd.pivot_table wrapper(metrics_df) total time: 123.496 s <lines removed> 121 1 123153706 123153706.0 99.7 agged = grouped.agg(aggfunc)
so pivoting takes 80 seconds, slow. how can make faster? have hope?
Comments
Post a Comment