perl - Add a new column that corresponds to the codons that are in the protein domain positions -


i have 2 files thousands of proteins: (1) file: protein id + sequence of amino acids; (2) file: protein id + sequence of nucleotides. , third file file position of domains in (these proteins) related amino acid's sequence , nucleotide's sequence files. related these 3 files code:

acids.txt file contains:

enst00000274849|q9ulw3 meaeesekaateqeplegteqtldaeeeqeeseeaacgskkrvvpgivylghipprfrpl hvrnllsaygevgrvffqaedrfvrrkkkaaaaaggkkrsytkdytegwvefrdkriakr vaaslhntpmgarrrspfrydlwnlkylhrftwshlsehlaferqvrrqrlraevaqakr etdfylqsvergqrflaadgdparpdgswtfaqrpteqelrarkaarpggrerarlataq dkarsnkgllarifgapppsesmegpslvrds*

nucleotides.txt file contains:

enst00000274849|q9ulw3 atggaggcagaggaatcggagaaggccgcaacggagcaagagccgctggaagggacagaa cagacactagatgcggaggaggagcaggaggaatccgaagaagcggcctgtggcagcaag aaacgggtagtgccaggtattgtgtacctgggccatatcccgccgcgcttccggcccctg cacgtccgcaaccttctcagcgcctatggcgaggtcggacgcgtcttctttcaggctgag gaccggttcgtgagacgcaagaagaaggcagcagcagctgccggaggaaaaaagcggtcc tacaccaaggactacaccgagggatgggtggagttccgtgacaagcgcatagccaagcgc gtggcggccagtctacacaacacgcctatgggtgcccgcaggcgcagccccttccgttat gatctttggaacctcaagtacttgcaccgtttcacctggtcccacctcagcgagcacctc gcctttgagcgccaggtgcgcaggcagcgcttgagagcggaggttgctcaagccaagcgt gagaccgacttctatcttcaaagtgtggaacggggacaacgctttcttgcggccgatggg gaccctgctcgcccagatggctcctggacatttgcccagcgtcctactgagcaggaactg agggcccgtaaagcagcacggccagggggacgtgaacgggctcgcctggcaactgcccag gacaaggcccgctccaacaaagggctcctggccaggatctttggagccccgccaccctca gagagcatggagggaccttcccttgtcagggactcctga

domain.txt file contains:

 q9ulw3;    46 142 

note: numbers mean position domain in sequence

script:

use strict; use bio::seqio; #################################################### #module 1: read protein file, , save in hash# #################################################### %hash1; $sequence = "acid.txt"; $multifasta = bio::seqio ->new (-file => "<$sequence",-format=> "fasta"); while (my $seq= $multifasta->next_seq()) { $na= $seq->display_id;   #saves id in $na $ss = $seq->seq;  $hash1{$na} = $ss;   } ############################################################# #module 2: read nucleotide file, , save in hash# ############################################################# %hash2; $genes = "nucleotides.txt"; $multifasta = bio::seqio ->new (-file => "<$genes",-format=> "fasta"); while (my $seq= $multifasta->next_seq()) { $na= $seq->display_id;   #saves id in $na $des=$seq->description; $ss = $seq->seq;  $hash2{$na} = $ss;   } ##################### #module 3: $name;# ##################### $name;               # read standard input chomp $name; ############################################################################## #module 4: domain annotation + related amino acids , nucleotides in columns# ############################################################################## foreach $name (keys %hash1) { $ac = (split(/\s*\|/, $name))[1]; #print "$ac\n" ;  #################################################### #module 4.1: domain annotation: position of domains# #################################################### open(file, "<" ,"domain.txt"); @array = (<file>); @lines = grep (/$ac/, @array); print @lines; close (file); ############################################################ #module 4.2: related amino acids , nucleotides in columns# ############################################################ @array1 = split(//, $hash1{$name}, $hash2{$name});  #cut sequence @array2 = unpack("a3" x (length($hash1{$name})),$hash2{$name}); #cut  nucleotide sequence $number = "$#array1+1"; foreach (my $count = 0; $count <= $number; $count++) {         print "$count\t@array1[$count]\t@array2[$count]\n";    }      } 

and here file got after running script:

 q9ulw3; 46    142      0 m   atg   1 e   gag   2   gca   3 e   gag   4 e   gaa   5 s   tcg   6 e   gag   7 k   aag   8   gcc   9   gca  10 t   acg  11 e   gag  12 q   caa  13 e   gag  14 p   ccg  15 l   ctg  16 e   gaa  17 g   ggg  18 t   aca  19 e   gaa  20 q   cag  21 t   aca  22 l   cta  23 d   gat  24   gcg  25 e   gag  26 e   gag  27 e   gag  28 q   cag  29 e   gag  30 e   gaa  31 s   tcc  32 e   gaa  33 e   gaa  34   gcg  35   gcc  36 c   tgt  37 g   ggc  38 s   agc  39 k   aag  40 k   aaa  41 r   cgg  42 v   gta  43 v   gtg  44 p   cca  45 g   ggt  46   att  47 v   gtg  48 y   tac  49 l   ctg  50 g   ggc  51 h   cat  52   atc  53 p   ccg  54 p   ccg  55 r   cgc  56 f   ttc  57 r   cgg  58 p   ccc  59 l   ctg  60 h   cac  61 v   gtc  62 r   cgc  63 n   aac  64 l   ctt  65 l   ctc  66 s   agc  67   gcc  68 y   tat  69 g   ggc  70 e   gag  71 v   gtc  72 g   gga  73 r   cgc  74 v   gtc  75 f   ttc  76 f   ttt  77 q   cag  78   gct  79 e   gag  80 d   gac  81 r   cgg  82 f   ttc  83 v   gtg  84 r   aga  85 r   cgc  86 k   aag  87 k   aag  88 k   aag  89   gca  90   gca  91   gca  92   gct  93   gcc  94 g   gga  95 g   gga  96 k   aaa  97 k   aag   98 r   cgg  99 s   tcc 100 y   tac 101 t   acc 102 k   aag  103 d   gac 104 y   tac 105 t   acc 106 e   gag 107 g   gga 108 w   tgg 109 v   gtg 110 e   gag 111 f   ttc 112 r   cgt 113 d   gac 114 k   aag 115 r   cgc 116   ata 117   gcc 118 k   aag 119 r   cgc 120 v   gtg 121   gcg 122   gcc 123 s   agt 124 l   cta 125 h   cac 126 n   aac 127 t   acg 128 p   cct 129 m   atg 130 g   ggt 131   gcc 132 r   cgc 133 r   agg 134 r   cgc 135 s   agc 136 p   ccc 137 f   ttc 138 r   cgt 139 y   tat 140 d   gat 141 l   ctt 142 w   tgg 143 n   aac 144 l   ctc 145 k   aag 146 y   tac 147 l   ttg 148 h   cac 149 r   cgt 150 f   ttc 151 t   acc 152 w   tgg 153 s   tcc 154 h   cac 155 *   tga 

now should add new fourth column contain 'yes' or 'not' - depends on codons in domain - yes, not in domain - not. so, here domains in positions 46 till 142. output file:

 q9ulw3;    46    142            0    m    atg    not   1    e    gag    not   2       gca    not   3    e    gag    not   4    e    gaa    not   5    s    tcg    not   6    e    gag    not   7    k    aag    not   8       gcc    not   9       gca    not  10    t    acg    not  11    e    gag    not  12    q    caa    not  13    e    gag    not  14    p    ccg    not  15    l    ctg    not  16    e    gaa    not  17    g    ggg    not  18    t    aca    not  19    e    gaa    not  20    q    cag    not  21    t    aca    not  22    l    cta    not  23    d    gat    not  24       gcg    not  25    e    gag    not  26    e    gag    not  27    e    gag    not  28    q    cag    not  29    e    gag    not  30    e    gaa    not  31    s    tcc    not  32    e    gaa    not  33    e    gaa    not  34       gcg    not  35       gcc    not  36    c    tgt    not  37    g    ggc    not  38    s    agc    not  39    k    aag    not  40    k    aaa    not  41    r    cgg    not  42    v    gta    not  43    v    gtg    not  44    p    cca    not  45    g    ggt    not  46       att    yes  47    v    gtg    yes  48    y    tac    yes  49    l    ctg    yes  50    g    ggc    yes  51    h    cat    yes  52       atc    yes  53    p    ccg    yes  54    p    ccg    yes  55    r    cgc    yes  56    f    ttc    yes  57    r    cgg    yes  58    p    ccc    yes  59    l    ctg    yes  60    h    cac    yes  61    v    gtc    yes  62    r    cgc    yes  63    n    aac    yes  64    l    ctt    yes  65    l    ctc    yes  66    s    agc    yes  67       gcc    yes  68    y    tat    yes  69    g    ggc    yes  70    e    gag    yes  71    v    gtc    yes  72    g    gga    yes  73    r    cgc    yes  74    v    gtc    yes  75    f    ttc    yes  76    f    ttt    yes  77    q    cag    yes  78       gct    yes  79    e    gag    yes  80    d    gac    yes  81    r    cgg    yes  82    f    ttc    yes  83    v    gtg    yes  84    r    aga    yes  85    r    cgc    yes  86    k    aag    yes  87    k    aag    yes  88    k    aag    yes  89       gca    yes  90       gca    yes  91       gca    yes  92       gct    yes  93       gcc    yes  94    g    gga    yes  95    g    gga    yes  96    k    aaa    yes  97    k    aag    yes  98    r    cgg    yes  99    s    tcc    yes 100    y    tac    yes 101    t    acc    yes 102    k    aag    yes 103    d    gac    yes 104    y    tac    yes 105    t    acc    yes 106    e    gag    yes 107    g    gga    yes 108    w    tgg    yes 109    v    gtg    yes 110    e    gag    yes 111    f    ttc    yes 112    r    cgt    yes 113    d    gac    yes 114    k    aag    yes 115    r    cgc    yes 116       ata    yes 117       gcc    yes 118    k    aag    yes 119    r    cgc    yes 120    v    gtg    yes 121       gcg    yes 122       gcc    yes 123    s    agt    yes 124    l    cta    yes 125    h    cac    yes 126    n    aac    yes 127    t    acg    yes 128    p    cct    yes 129    m    atg    yes 130    g    ggt    yes 131       gcc    yes 132    r    cgc    yes 133    r    agg    yes 134    r    cgc    yes 135    s    agc    yes 136    p    ccc    yes 137    f    ttc    yes 138    r    cgt    yes   139    y    tat    yes 140    d    gat    yes 141    l    ctt    yes 142    w    tgg    yes 143    n    aac    not 144    l    ctc    not 145    k    aag    not 146    y    tac    not 147    l    ttg    not 148    h    cac    not 149    r    cgt    not 150    f    ttc    not 151    t    acc    not 152    w    tgg    not 153    s    tcc    not 154    h    cac    not  155    *    tga    not 

this example 1 protein, have thousands proteins. please, have suggestions?

thank you!


Comments

Popular posts from this blog

Is there a better way to structure post methods in Class Based Views -

performance - Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? -

jquery - Responsive Navbar with Sub Navbar -