bash - Calculating mean from values in columns specified on the first line using awk
I have a huge file (hundreds of lines, ca. 4,000 columns) structured like this:
    locus 1 1 1 2 2 3 3 3
    exon 1 2 3 1 2 1 2 3
    data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
    data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
and I need to calculate the mean of the values (on each data line separately) that share the same locus number (i.e., the same number in the first line), i.e.
for data1: the mean of the first 3 values (the three columns with locus '1': 17.07, 7.11, 10.58), of the next 2 values (10.21, 19.34), and of the next 3 values (14.69, 3.32, 21.07).
I would like to have output like this:
    data1 mean1 mean2 mean3
    data2 mean1 mean2 mean3
I was thinking about using bash or awk... Thanks for any advice.
If it were me, I would use R, not awk:
    library(data.table)
    x = fread('data.txt')
    #> x
    #       V1    V2    V3    V4    V5    V6    V7   V8    V9
    #1: locus  1.00  1.00  1.00  2.00  2.00  3.00 3.00  3.00
    #2:  exon  1.00  2.00  3.00  1.00  2.00  1.00 2.00  3.00
    #3: data1 17.07  7.11 10.58 10.21 19.34 14.69 3.32 21.07
    #4: data2 21.42 11.46  7.88  9.89 27.24 12.40 0.58 19.82

    # save the first column of names for later
    cnames = x$V1
    # remove the first column
    x[,V1:=NULL]
    # matrix transpose: makes rows into columns
    x = t(x)
    # convert the matrix back to a data.table
    x = data.table(x,keep.rownames=FALSE)
    # set the column names
    colnames(x) = cnames
    #> x
    #   locus exon data1 data2
    #1:     1    1 17.07 21.42
    #...

    # ditch the useless column
    x[,exon:=NULL]
    #> x
    #   locus data1 data2
    #1:     1 17.07 21.42

    # apply the mean() function to each column, grouped by locus
    x[,lapply(.SD,mean),locus]
    #   locus    data1    data2
    #1:     1 11.58667 13.58667
    #2:     2 14.77500 18.56500
    #3:     3 13.02667 10.93333
For convenience, here's the whole thing again without comments:
    library(data.table)
    x = fread('data.txt')
    cnames = x$V1
    x[,V1:=NULL]
    x = t(x)
    x = data.table(x,keep.rownames=FALSE)
    colnames(x) = cnames
    x[,exon:=NULL]
    x[,lapply(.SD,mean),locus]
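For anyone who would rather stay with awk, as the question asks, the same grouping can probably be done in one pass. The following is only a minimal sketch under some assumptions: the function name `locus_means` is made up for illustration, the input is whitespace-separated and laid out exactly as in the question (first line `locus`, a second line starting with `exon`, then the data lines), and the three-decimal output format is an arbitrary choice. Unlike the R version, it keeps the row-wise layout that the question asked for.

```shell
# Sample input in the layout from the question, written to data.txt for the demo.
cat > data.txt <<'EOF'
locus 1 1 1 2 2 3 3 3
exon 1 2 3 1 2 1 2 3
data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
EOF

# locus_means FILE: hypothetical helper, prints one line per data row
# with the mean of the values belonging to each locus.
locus_means() {
  LC_ALL=C awk '
    NR == 1 {
      # First line: remember which locus each column belongs to,
      # and the order in which the loci first appear.
      for (i = 2; i <= NF; i++) {
        locus[i] = $i
        if (!($i in seen)) { seen[$i] = 1; order[++n] = $i }
      }
      next
    }
    $1 == "exon" { next }   # assumption: the exon line is not needed
    {
      # Reset the per-line accumulators (portable way to clear an array).
      split("", sum); split("", cnt)
      for (i = 2; i <= NF; i++) {
        sum[locus[i]] += $i
        cnt[locus[i]]++
      }
      # Print the row label followed by one mean per locus.
      printf "%s", $1
      for (k = 1; k <= n; k++)
        printf " %.3f", sum[order[k]] / cnt[order[k]]
      printf "\n"
    }
  ' "$1"
}

locus_means data.txt
```

For the sample input this should print `data1 11.587 14.775 13.027` and `data2 13.587 18.565 10.933`, which agrees with the R result above once it is read row-wise. Because each column is looked up by its locus rather than by position, columns of the same locus do not have to be adjacent.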