python - How to spread a column in a Pandas data frame
I have the following pandas data frame:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'fc': [100, 100, 112, 1.3, 14, 125],
        'sample_id': ['s1', 's1', 's1', 's2', 's2', 's2'],
        'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
    })

    # fix the column order
    df = df[['gene_symbol', 'sample_id', 'fc']]
    df

which produces this:

    Out[11]:
      gene_symbol sample_id     fc
    0           a        s1  100.0
    1           b        s1  100.0
    2           c        s1  112.0
    3           a        s2    1.3
    4           b        s2   14.0
    5           c        s2  125.0

How can I spread sample_id so that I end up with this:

    gene_symbol    s1     s2
    a             100    1.3
    b             100   14.0
    c             112  125.0

Use pivot:

    #df = df[['gene_symbol', 'sample_id', 'fc']]
    df = df.pivot(index='gene_symbol', columns='sample_id', values='fc')
    print(df)
    sample_id       s1     s2
    gene_symbol
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  125.0

or, starting again from the original df, set_index with unstack:

    df = df.set_index(['gene_symbol', 'sample_id'])['fc'].unstack(fill_value=0)
    print(df)
    sample_id       s1     s2
    gene_symbol
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  125.0

But if there are duplicates, you need pivot_table, or groupby with an aggregation. Here mean is used, but it can be changed to sum, median, ...:

    df = pd.DataFrame({
        'fc': [100, 100, 112, 1.3, 14, 125, 100],
        'sample_id': ['s1', 's1', 's1', 's2', 's2', 's2', 's2'],
        'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
    })
    print(df)
          fc gene_symbol sample_id
    0  100.0           a        s1
    1  100.0           b        s1
    2  112.0           c        s1
    3    1.3           a        s2
    4   14.0           b        s2
    5  125.0           c        s2 <- same c, s2, different fc
    6  100.0           c        s2 <- same c, s2, different fc

    df = df.pivot(index='gene_symbol', columns='sample_id', values='fc')
    ValueError: Index contains duplicate entries, cannot reshape

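To see which rows cause this error, one option is to list every row whose (gene_symbol, sample_id) pair occurs more than once. This is a minimal sketch on the same data; the variable names keys and dupes are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['s1', 's1', 's1', 's2', 's2', 's2', 's2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

# the two columns that pivot would use as row and column labels
keys = ['gene_symbol', 'sample_id']

# keep=False marks every row of a repeated key pair, not only the later ones
dupes = df[df.duplicated(subset=keys, keep=False)]
print(dupes)  # rows 5 and 6, both (c, s2)
```

If dupes is empty, pivot should work on these columns; otherwise the values have to be aggregated first.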
With pivot_table:

    df = df.pivot_table(index='gene_symbol', columns='sample_id', values='fc', aggfunc='mean')
    print(df)
    sample_id       s1     s2
    gene_symbol
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  112.5

or, again starting from the original df, with groupby and unstack:

    df = df.groupby(['gene_symbol', 'sample_id'])['fc'].mean().unstack(fill_value=0)
    print(df)
    sample_id       s1     s2
    gene_symbol
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  112.5

EDIT:
To clean up the result, set the columns name to None and call reset_index:

    df.columns.name = None
    df = df.reset_index()
    print(df)
      gene_symbol     s1     s2
    0           a  100.0    1.3
    1           b  100.0   14.0
    2           c  112.0  112.5

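The steps above (aggregate duplicates, spread, then clean up the column axis) can be bundled into one function. This is a minimal sketch; the helper name spread and its parameters are hypothetical and not part of pandas:

```python
import pandas as pd


def spread(frame, index, columns, values, aggfunc='mean'):
    """Hypothetical helper: reshape long to wide, aggregating duplicate keys."""
    wide = frame.pivot_table(index=index, columns=columns,
                             values=values, aggfunc=aggfunc)
    # remove the 'sample_id' label that pivot_table leaves on the column axis
    wide.columns.name = None
    # turn the index (gene_symbol) back into an ordinary column
    return wide.reset_index()


df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['s1', 's1', 's1', 's2', 's2', 's2', 's2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

by_mean = spread(df, 'gene_symbol', 'sample_id', 'fc')
by_sum = spread(df, 'gene_symbol', 'sample_id', 'fc', aggfunc='sum')
print(by_mean)
print(by_sum)
```

Only the duplicated (c, s2) cell differs between the two results: 112.5 with mean and 225.0 with sum.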