python - Speeding up pandas pivot across large dataframe with mixed datatypes -


i trying pivot long-form dataframe multi-column dataframe, resolving mixed datatypes present in dataframe.

here example data:

data = {'alignmentid': {0: u'ensmust00000000001.4-1',   1: u'ensmust00000000001.4-1',   2: u'ensmust00000000001.4-1',   3: u'ensmust00000000001.4-1',   4: u'ensmust00000000001.4-1'},  'classifier': {0: u'alncoverage',   1: u'alngoodness',   2: u'alnidentity',   3: u'originalintrons',   4: u'percentunknownbases'},  'value': {0: u'89.853',   1: u'93.357',   2: u'85.595',   3: u'1,1,1,1,1,1,1,1',   4: u'0.0'}} df = pd.dataframe.from_dict(data) 

as can see, unfortunately came out of sqlite database strings, due mixed datatypes in value column.

right now, approach this:

def aggfunc(s):     """used aggregate columns. attempts convert each cell float if possible"""     try:         return float(s)     except valueerror:         return s.iloc[0]     except typeerror:         assert s.iloc[0] none         return none  result = pd.pivot_table(metrics_df, index='alignmentid', columns='classifier', values='value', fill_value=none, aggfunc=aggfunc).reset_index() 

which works, slow. across 500 lines of data, takes 43 seconds. used %lprun on it:

%lprun -f aggfunc wrapper(metrics_df)  timer unit: 1e-06 s  total time: 43.4913 s file: <ipython-input-120-7015d1dd4f54> function: aggfunc @ line 1  line #      hits         time  per hit   % time  line contents ==============================================================      1                                           def aggfunc(s):      2                                               """used aggregate columns. attempts convert each cell float if possible"""      3    250695       492657      2.0      1.1      try:      4    250695     37849440    151.0     87.0          return float(s)      5     46548       158232      3.4      0.4      except valueerror:      6     44206      4743152    107.3     10.9          return s.iloc[0]      7      2342         4661      2.0      0.0      except typeerror:      8      2342       239844    102.4      0.6          assert s.iloc[0] none      9      2342         3274      1.4      0.0          return none 

i ran lprun on pd.pivot_table itself. output truncated 1 relevant line shows aggfunc takes of time:

%lprun -f pd.pivot_table wrapper(metrics_df)  total time: 123.496 s <lines removed> 121         1    123153706 123153706.0     99.7      agged = grouped.agg(aggfunc) 

so pivoting takes 80 seconds, slow. how can make faster? have hope?


Comments

Popular posts from this blog

Is there a better way to structure post methods in Class Based Views -

performance - Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? -

jquery - Responsive Navbar with Sub Navbar -