perl - Add a new column that corresponds to the codons that are in the protein domain positions -
i have 2 files thousands of proteins: (1) file: protein id + sequence of amino acids; (2) file: protein id + sequence of nucleotides. , third file file position of domains in (these proteins) related amino acid's sequence , nucleotide's sequence files. related these 3 files code:
acids.txt file contains:
enst00000274849|q9ulw3 meaeesekaateqeplegteqtldaeeeqeeseeaacgskkrvvpgivylghipprfrpl hvrnllsaygevgrvffqaedrfvrrkkkaaaaaggkkrsytkdytegwvefrdkriakr vaaslhntpmgarrrspfrydlwnlkylhrftwshlsehlaferqvrrqrlraevaqakr etdfylqsvergqrflaadgdparpdgswtfaqrpteqelrarkaarpggrerarlataq dkarsnkgllarifgapppsesmegpslvrds*
nucleotides.txt file contains:
enst00000274849|q9ulw3 atggaggcagaggaatcggagaaggccgcaacggagcaagagccgctggaagggacagaa cagacactagatgcggaggaggagcaggaggaatccgaagaagcggcctgtggcagcaag aaacgggtagtgccaggtattgtgtacctgggccatatcccgccgcgcttccggcccctg cacgtccgcaaccttctcagcgcctatggcgaggtcggacgcgtcttctttcaggctgag gaccggttcgtgagacgcaagaagaaggcagcagcagctgccggaggaaaaaagcggtcc tacaccaaggactacaccgagggatgggtggagttccgtgacaagcgcatagccaagcgc gtggcggccagtctacacaacacgcctatgggtgcccgcaggcgcagccccttccgttat gatctttggaacctcaagtacttgcaccgtttcacctggtcccacctcagcgagcacctc gcctttgagcgccaggtgcgcaggcagcgcttgagagcggaggttgctcaagccaagcgt gagaccgacttctatcttcaaagtgtggaacggggacaacgctttcttgcggccgatggg gaccctgctcgcccagatggctcctggacatttgcccagcgtcctactgagcaggaactg agggcccgtaaagcagcacggccagggggacgtgaacgggctcgcctggcaactgcccag gacaaggcccgctccaacaaagggctcctggccaggatctttggagccccgccaccctca gagagcatggagggaccttcccttgtcagggactcctga
domain.txt file contains:
q9ulw3; 46 142
note: numbers mean position domain in sequence
script:
use strict; use bio::seqio; #################################################### #module 1: read protein file, , save in hash# #################################################### %hash1; $sequence = "acid.txt"; $multifasta = bio::seqio ->new (-file => "<$sequence",-format=> "fasta"); while (my $seq= $multifasta->next_seq()) { $na= $seq->display_id; #saves id in $na $ss = $seq->seq; $hash1{$na} = $ss; } ############################################################# #module 2: read nucleotide file, , save in hash# ############################################################# %hash2; $genes = "nucleotides.txt"; $multifasta = bio::seqio ->new (-file => "<$genes",-format=> "fasta"); while (my $seq= $multifasta->next_seq()) { $na= $seq->display_id; #saves id in $na $des=$seq->description; $ss = $seq->seq; $hash2{$na} = $ss; } ##################### #module 3: $name;# ##################### $name; # read standard input chomp $name; ############################################################################## #module 4: domain annotation + related amino acids , nucleotides in columns# ############################################################################## foreach $name (keys %hash1) { $ac = (split(/\s*\|/, $name))[1]; #print "$ac\n" ; #################################################### #module 4.1: domain annotation: position of domains# #################################################### open(file, "<" ,"domain.txt"); @array = (<file>); @lines = grep (/$ac/, @array); print @lines; close (file); ############################################################ #module 4.2: related amino acids , nucleotides in columns# ############################################################ @array1 = split(//, $hash1{$name}, $hash2{$name}); #cut sequence @array2 = unpack("a3" x (length($hash1{$name})),$hash2{$name}); #cut nucleotide sequence $number = "$#array1+1"; foreach (my $count = 0; $count <= $number; $count++) { print "$count\t@array1[$count]\t@array2[$count]\n"; } }
and here file got after running script:
q9ulw3; 46 142 0 m atg 1 e gag 2 gca 3 e gag 4 e gaa 5 s tcg 6 e gag 7 k aag 8 gcc 9 gca 10 t acg 11 e gag 12 q caa 13 e gag 14 p ccg 15 l ctg 16 e gaa 17 g ggg 18 t aca 19 e gaa 20 q cag 21 t aca 22 l cta 23 d gat 24 gcg 25 e gag 26 e gag 27 e gag 28 q cag 29 e gag 30 e gaa 31 s tcc 32 e gaa 33 e gaa 34 gcg 35 gcc 36 c tgt 37 g ggc 38 s agc 39 k aag 40 k aaa 41 r cgg 42 v gta 43 v gtg 44 p cca 45 g ggt 46 att 47 v gtg 48 y tac 49 l ctg 50 g ggc 51 h cat 52 atc 53 p ccg 54 p ccg 55 r cgc 56 f ttc 57 r cgg 58 p ccc 59 l ctg 60 h cac 61 v gtc 62 r cgc 63 n aac 64 l ctt 65 l ctc 66 s agc 67 gcc 68 y tat 69 g ggc 70 e gag 71 v gtc 72 g gga 73 r cgc 74 v gtc 75 f ttc 76 f ttt 77 q cag 78 gct 79 e gag 80 d gac 81 r cgg 82 f ttc 83 v gtg 84 r aga 85 r cgc 86 k aag 87 k aag 88 k aag 89 gca 90 gca 91 gca 92 gct 93 gcc 94 g gga 95 g gga 96 k aaa 97 k aag 98 r cgg 99 s tcc 100 y tac 101 t acc 102 k aag 103 d gac 104 y tac 105 t acc 106 e gag 107 g gga 108 w tgg 109 v gtg 110 e gag 111 f ttc 112 r cgt 113 d gac 114 k aag 115 r cgc 116 ata 117 gcc 118 k aag 119 r cgc 120 v gtg 121 gcg 122 gcc 123 s agt 124 l cta 125 h cac 126 n aac 127 t acg 128 p cct 129 m atg 130 g ggt 131 gcc 132 r cgc 133 r agg 134 r cgc 135 s agc 136 p ccc 137 f ttc 138 r cgt 139 y tat 140 d gat 141 l ctt 142 w tgg 143 n aac 144 l ctc 145 k aag 146 y tac 147 l ttg 148 h cac 149 r cgt 150 f ttc 151 t acc 152 w tgg 153 s tcc 154 h cac 155 * tga
now should add new fourth column contain 'yes' or 'not' - depends on codons in domain - yes, not in domain - not. so, here domains in positions 46 till 142. output file:
q9ulw3; 46 142 0 m atg not 1 e gag not 2 gca not 3 e gag not 4 e gaa not 5 s tcg not 6 e gag not 7 k aag not 8 gcc not 9 gca not 10 t acg not 11 e gag not 12 q caa not 13 e gag not 14 p ccg not 15 l ctg not 16 e gaa not 17 g ggg not 18 t aca not 19 e gaa not 20 q cag not 21 t aca not 22 l cta not 23 d gat not 24 gcg not 25 e gag not 26 e gag not 27 e gag not 28 q cag not 29 e gag not 30 e gaa not 31 s tcc not 32 e gaa not 33 e gaa not 34 gcg not 35 gcc not 36 c tgt not 37 g ggc not 38 s agc not 39 k aag not 40 k aaa not 41 r cgg not 42 v gta not 43 v gtg not 44 p cca not 45 g ggt not 46 att yes 47 v gtg yes 48 y tac yes 49 l ctg yes 50 g ggc yes 51 h cat yes 52 atc yes 53 p ccg yes 54 p ccg yes 55 r cgc yes 56 f ttc yes 57 r cgg yes 58 p ccc yes 59 l ctg yes 60 h cac yes 61 v gtc yes 62 r cgc yes 63 n aac yes 64 l ctt yes 65 l ctc yes 66 s agc yes 67 gcc yes 68 y tat yes 69 g ggc yes 70 e gag yes 71 v gtc yes 72 g gga yes 73 r cgc yes 74 v gtc yes 75 f ttc yes 76 f ttt yes 77 q cag yes 78 gct yes 79 e gag yes 80 d gac yes 81 r cgg yes 82 f ttc yes 83 v gtg yes 84 r aga yes 85 r cgc yes 86 k aag yes 87 k aag yes 88 k aag yes 89 gca yes 90 gca yes 91 gca yes 92 gct yes 93 gcc yes 94 g gga yes 95 g gga yes 96 k aaa yes 97 k aag yes 98 r cgg yes 99 s tcc yes 100 y tac yes 101 t acc yes 102 k aag yes 103 d gac yes 104 y tac yes 105 t acc yes 106 e gag yes 107 g gga yes 108 w tgg yes 109 v gtg yes 110 e gag yes 111 f ttc yes 112 r cgt yes 113 d gac yes 114 k aag yes 115 r cgc yes 116 ata yes 117 gcc yes 118 k aag yes 119 r cgc yes 120 v gtg yes 121 gcg yes 122 gcc yes 123 s agt yes 124 l cta yes 125 h cac yes 126 n aac yes 127 t acg yes 128 p cct yes 129 m atg yes 130 g ggt yes 131 gcc yes 132 r cgc yes 133 r agg yes 134 r cgc yes 135 s agc yes 136 p ccc yes 137 f ttc yes 138 r cgt yes 139 y tat yes 140 d gat yes 141 l ctt yes 142 w tgg yes 143 n aac not 144 l ctc not 145 k aag not 146 y tac not 147 l ttg not 148 h cac not 149 r cgt not 150 f ttc not 151 t acc not 152 w tgg not 153 s tcc not 154 h cac not 155 * tga not
this example 1 protein, have thousands proteins. please, have suggestions?
thank you!
Comments
Post a Comment