python - enumerate groups in a dataframe -


i have following table

date        ui  mw  maxw    tc  hl    msurp 01/03/2004    10   10     ec  0.25   0.1 01/04/2004    10   10     ec  0.25   -0.1 01/03/2004  b   20   20     bc  0.5    0.3 01/03/2004  b   20   20     bc  0.25    0.3 

what looking add column table enumerates unique combinations of ui, mw, maxw, tc , hl , enumerates

so example in above table

unique combinations of ui, mw, maxw, tc , hl are

 a,10, 10, ec, 0.25  b,20, 20, bc, 0.5  b,20, 20, bc, 0.5 

there total 3 output should like

date        ui  mw  maxw    tc  hl    msurp  counter 01/03/2004    10   10     ec  0.25   0.1    1 01/04/2004    10   10     ec  0.25   -0.1   1 01/03/2004  b   20   20     bc  0.5    0.3    2 01/03/2004  b   20   20     bc  0.25    0.3   3 

option 1
pd.series.factorize

df.assign(    counter=df[['ui', 'mw', 'maxw', 'tc', 'hl']].apply(tuple, 1).factorize()[0] + 1)           date ui  mw  maxw  tc    hl  msurp  counter 0  01/03/2004   10    10  ec  0.25    0.1        1 1  01/04/2004   10    10  ec  0.25   -0.1        1 2  01/03/2004  b  20    20  bc  0.50    0.3        2 3  01/03/2004  b  20    20  bc  0.25    0.3        3 

option 1.5
more obnoxious version of option 1 should faster

df.assign(     counter=pd.factorize(list(zip(         *[df[c].values.tolist() c in ['ui', 'mw', 'maxw', 'tc', 'hl']]     )))[0] + 1 )           date ui  mw  maxw  tc    hl  msurp  counter 0  01/03/2004   10    10  ec  0.25    0.1        1 1  01/04/2004   10    10  ec  0.25   -0.1        1 2  01/03/2004  b  20    20  bc  0.50    0.3        2 3  01/03/2004  b  20    20  bc  0.25    0.3        3 

option 2
@ayhan's answer (will delete if posts it)

df.assign(     counter=df.groupby(['ui', 'mw', 'maxw', 'tc', 'hl']).ngroup() + 1)           date ui  mw  maxw  tc    hl  msurp  counter 0  01/03/2004   10    10  ec  0.25    0.1        1 1  01/04/2004   10    10  ec  0.25   -0.1        1 2  01/03/2004  b  20    20  bc  0.50    0.3        3 3  01/03/2004  b  20    20  bc  0.25    0.3        2 

timing
code below

(lambda r: r.div(r.min(1), 0).assign(best=lambda x: x.idxmin(1)))(results)               pir1      pir2     ayhan   best 100     17.260639  1.000000  3.438354   pir2 300     30.550010  1.000000  2.598456   pir2 1000    43.201163  1.000000  1.236190   pir2 3000    61.593932  1.000000  1.025420   pir2 10000  127.003138  2.177171  1.000000  ayhan 

enter image description here

pir1 = lambda d: d.assign(counter=d[['ui', 'mw', 'maxw', 'tc', 'hl']].apply(tuple, 1).factorize()[0] + 1) pir2 = lambda d: d.assign(counter=pd.factorize(list(zip(*[d[c].values.tolist() c in ['ui', 'mw', 'maxw', 'tc', 'hl']])))[0] + 1) ayhan = lambda d: d.assign(counter=d.groupby(['ui', 'mw', 'maxw', 'tc', 'hl']).ngroup() + 1)  results = pd.dataframe(     index=[100, 300, 1000, 3000, 10000],     columns='pir1 pir2 ayhan'.split(),     dtype=float )  in results.index:     d = pd.concat([df] * i, ignore_index=true)     j in results.columns:         stmt = '{}(d)'.format(j)         setp = 'from __main__ import d, {}'.format(j)         results.set_value(i, j, timeit(stmt, setp, number=10))  results.plot(loglog=true) 

Comments