python - How to spread a column in a Pandas data frame -


I have the following pandas data frame:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'fc': [100, 100, 112, 1.3, 14, 125],
        'sample_id': ['s1', 's1', 's1', 's2', 's2', 's2'],
        'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
    })

    # reorder the columns for display
    df = df[['gene_symbol', 'sample_id', 'fc']]
    df

Which produces this:

    Out[11]:
      gene_symbol sample_id     fc
    0           a        s1  100.0
    1           b        s1  100.0
    2           c        s1  112.0
    3           a        s2    1.3
    4           b        s2   14.0
    5           c        s2  125.0

How can I spread sample_id so that I end up with this:

    gene_symbol    s1   s2
    a             100   1.3
    b             100   14.0
    c             112   125.0

Use pivot or unstack:

    #df = df[['gene_symbol', 'sample_id', 'fc']]
    df = df.pivot(index='gene_symbol', columns='sample_id', values='fc')
    print(df)

    sample_id       s1     s2
    gene_symbol
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  125.0

    # alternative to pivot, applied to the original long-format df
    df = df.set_index(['gene_symbol', 'sample_id'])['fc'].unstack(fill_value=0)
    print(df)

    sample_id       s1     s2
    gene_symbol
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  125.0
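As a minimal, self-contained sketch of the two approaches side by side (the names `wide_pivot` and `wide_unstack` are illustrative and not part of the original answer), both can be built from the same long-format frame and compared:

```python
import pandas as pd

df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125],
    'sample_id': ['s1', 's1', 's1', 's2', 's2', 's2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
})

# Approach 1: pivot, one row per gene_symbol and one column per sample_id
wide_pivot = df.pivot(index='gene_symbol', columns='sample_id', values='fc')

# Approach 2: move both keys into the index, then unstack sample_id
wide_unstack = df.set_index(['gene_symbol', 'sample_id'])['fc'].unstack(fill_value=0)

print(wide_pivot)
print(wide_unstack)
```

The `fill_value=0` argument only matters when some gene_symbol/sample_id combination is absent from the data; with `pivot` such cells would be NaN instead.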

But if there are duplicates, you need pivot_table or an aggregation with groupby. The mean can be changed to sum, median, ...:

    df = pd.DataFrame({
        'fc': [100, 100, 112, 1.3, 14, 125, 100],
        'sample_id': ['s1', 's1', 's1', 's2', 's2', 's2', 's2'],
        'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
    })
    print(df)

          fc gene_symbol sample_id
    0  100.0           a        s1
    1  100.0           b        s1
    2  112.0           c        s1
    3    1.3           a        s2
    4   14.0           b        s2
    5  125.0           c        s2 <- same c, s2, different fc
    6  100.0           c        s2 <- same c, s2, different fc

    # pivot fails because the (c, s2) pair occurs twice
    df = df.pivot(index='gene_symbol', columns='sample_id', values='fc')

    ValueError: Index contains duplicate entries, cannot reshape
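To see which rows trigger this error before reshaping, one option (a sketch, not part of the original answer; the name `dupes` is illustrative) is to flag repeated key pairs with `DataFrame.duplicated`:

```python
import pandas as pd

df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['s1', 's1', 's1', 's2', 's2', 's2', 's2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

# keep=False marks every row of a repeated (gene_symbol, sample_id) pair,
# not only the second and later occurrences
dupes = df[df.duplicated(subset=['gene_symbol', 'sample_id'], keep=False)]
print(dupes)
```

If `dupes` is empty, plain `pivot` should work; otherwise the duplicated pairs have to be aggregated in some way.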

    df = df.pivot_table(index='gene_symbol', columns='sample_id', values='fc', aggfunc='mean')
    print(df)

    sample_id       s1     s2
    gene_symbol
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  112.5

    # alternative to pivot_table, applied to the original long-format df
    df = df.groupby(['gene_symbol', 'sample_id'])['fc'].mean().unstack(fill_value=0)
    print(df)

    sample_id       s1     s2
    gene_symbol
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  112.5
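A short sketch of swapping the aggregation, assuming the same duplicated data as above (the names `by_mean`, `by_sum` and `by_max` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['s1', 's1', 's1', 's2', 's2', 's2', 's2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

# pivot_table aggregates duplicated (gene_symbol, sample_id) pairs with aggfunc
by_mean = df.pivot_table(index='gene_symbol', columns='sample_id',
                         values='fc', aggfunc='mean')
by_max = df.pivot_table(index='gene_symbol', columns='sample_id',
                        values='fc', aggfunc='max')

# the groupby route: aggregate first, then unstack sample_id into columns
by_sum = df.groupby(['gene_symbol', 'sample_id'])['fc'].sum().unstack(fill_value=0)

# the duplicated (c, s2) cell holds 125 and 100 in the input
print(by_mean.loc['c', 's2'])
print(by_max.loc['c', 's2'])
print(by_sum.loc['c', 's2'])
```

Only the duplicated cell differs between the three results; every other cell has a single input value, so mean, max and sum agree there.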

EDIT:

To clean up the result, set the columns name to None and call reset_index:

    df.columns.name = None
    df = df.reset_index()
    print(df)

      gene_symbol     s1     s2
    0           a  100.0    1.3
    1           b  100.0   14.0
    2           c  112.0  112.5
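The same cleanup can also be written as one chained expression. This is a sketch using `rename_axis` instead of assigning to `columns.name`, which is a substitution on my part rather than what the answer uses; the name `wide` is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125],
    'sample_id': ['s1', 's1', 's1', 's2', 's2', 's2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
})

wide = (
    df.pivot(index='gene_symbol', columns='sample_id', values='fc')
      .rename_axis(None, axis=1)  # drop the 'sample_id' label on the columns
      .reset_index()              # turn gene_symbol back into a regular column
)
print(wide.columns.tolist())
```

The result is a flat frame with a default integer index, which is usually easier to export or merge than the pivoted form with named axes.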
