python - using groupby/aggregate to return multiple columns -
i have example dataset want groupby 1 column , produce 4 new columns based on of values of existing columns.
here sample data:
data = {'alignmentid': {0: u'ensmust00000000001.4-1', 1: u'ensmust00000000001.4-1', 2: u'ensmust00000000003.13-0', 3: u'ensmust00000000003.13-0', 4: u'ensmust00000000003.13-0'}, 'name': {0: u'noncodingdeletion', 1: u'noncodinginsertion', 2: u'codingdeletion', 3: u'codinginsertion', 4: u'noncodingdeletion'}, 'value_cds': {0: nan, 1: nan, 2: 1.0, 3: 1.0, 4: nan}, 'value_mrna': {0: 21.0, 1: 26.0, 2: 1.0, 3: 1.0, 4: 2.0}} df = pd.dataframe.from_dict(data)
which looks this:
alignmentid name value_mrna value_cds 0 ensmust00000000001.4-1 noncodingdeletion 21.0 nan 1 ensmust00000000001.4-1 noncodinginsertion 26.0 nan 2 ensmust00000000003.13-0 codingdeletion 1.0 1.0 3 ensmust00000000003.13-0 codinginsertion 1.0 1.0 4 ensmust00000000003.13-0 noncodingdeletion 2.0 nan
i want return booleans based on presence/absence of values in name
column depending on whether value_cds
contains null values. produced function so:
def aggfunc(s): if s.value_cds.any(): c = set(s.name) else: c = set(s.name) return ('codingdeletion' in c or 'codinginsertion' in c, 'codinginsertion' in c, 'codingdeletion' in c, 'codingmult3deletion' in c or 'codingmult3insertion' in c)
and did this:
merged = df.groupby('alignmentid').aggregate(aggfunc)
which gives me error valueerror: shape of passed values (318, 4), indices imply (318, 3)
.
how can return multiple new columns groupby-aggregate?
the output looking is:
ensmust00000000001.4-1 (false, false, false, false) ensmust00000000003.13-0 (true, true, true, false)
which ideally put 5-column dataframe.
if use .apply
, output incorrect:
ensmust00000000001.4-1 (false, false, false, false) ensmust00000000003.13-0 (false, false, false, false)
but if grab groups 1 @ time, correct:
in [380]: aln_id, d in df.groupby('alignmentid'): .....: print aggfunc(d) .....: (false, false, false, false) (true, true, true, false)
you need change name
['name']
, because .name
return name of group (value of column grouping by):
def aggfunc(s): if s.value_cds.any(): c = set(s['name']) else: c = set(s['name']) return ('codingdeletion' in c or 'codinginsertion' in c, 'codinginsertion' in c, 'codingdeletion' in c, 'codingmult3deletion' in c or 'codingmult3insertion' in c) merged = df.groupby('alignmentid').apply(aggfunc) print (merged) alignmentid ensmust00000000001.4-1 (false, false, false, false) ensmust00000000003.13-0 (true, true, true, false) dtype: object
def aggfunc(s): print ('name of group is: {}'.format((s.name))) print ('column name is:\n {}'.format(s['name'])) merged = df.groupby('alignmentid').apply(aggfunc) print (merged) name of group is: ensmust00000000001.4-1 column name is: 0 noncodingdeletion 1 noncodinginsertion name: name, dtype: object name of group is: ensmust00000000001.4-1 column name is: 0 noncodingdeletion 1 noncodinginsertion name: name, dtype: object name of group is: ensmust00000000003.13-0 column name is: 2 codingdeletion 3 codinginsertion 4 noncodingdeletion name: name, dtype: object
improved code:
def aggfunc(s): #if , else return same c, omitted c = set(s['name']) #added series return columns instead tuples cols = ['col1','col2','col3','col4'] return pd.series(('codingdeletion' in c or 'codinginsertion' in c, 'codinginsertion' in c, 'codingdeletion' in c, 'codingmult3deletion' in c or 'codingmult3insertion' in c), index=cols) merged = df.groupby('alignmentid').apply(aggfunc) print (merged) col1 col2 col3 col4 alignmentid ensmust00000000001.4-1 false false false false ensmust00000000003.13-0 true true true false
Comments
Post a Comment