次のコマンドを実行すると、次の文字列全体が印刷されます。Note="Peptidase S59%2C nucleoporin"
awk '$3=="mRNA"' Nitab-v4.5_gene_models_Chr_Edwards2017.gff | head
Nt01 maker mRNA 143295 155540 . + . ID=Nitab4.5_0006317g0010.1;Parent=Nitab4.5_0006317g0010;Name=Nitab4.5_0006317g0010.1;_AED=0.08;_eAED=0.08;_QI=0|0.45|0.25|1|0.90|0.75|12|0|1011;Note="Peptidase S59%2C nucleoporin"
Nt01 maker mRNA 170633 173860 . + . ID=Nitab4.5_0006317g0020.1;Parent=Nitab4.5_0006317g0020;Name=Nitab4.5_0006317g0020.1;_AED=0.26;_eAED=0.26;_QI=15|0|0|0.83|0.6|0.33|6|0|424;Note="Putative S-adenosyl-L-methionine-dependent methyltransferase"
Nt01 maker mRNA 156516 160996 . - . ID=Nitab4.5_0006317g0030.1;Parent=Nitab4.5_0006317g0030;Name=Nitab4.5_0006317g0030.1;_AED=0.01;_eAED=0.01;_QI=161|1|1|1|0|0.5|2|358|141;Note="Unknown"
Nt01 maker mRNA 78554 80638 . - . ID=Nitab4.5_0006317g0040.1;Parent=Nitab4.5_0006317g0040;Name=Nitab4.5_0006317g0040.1;_AED=0.02;_eAED=0.02;_QI=0|0|0|1|1|1|3|0|187;Note="Heavy metal-associated domain%2C HMA"
Nt01 maker mRNA 111288 129916 . - . ID=Nitab4.5_0006317g0050.1;Parent=Nitab4.5_0006317g0050;Name=Nitab4.5_0006317g0050.1;_AED=0.24;_eAED=0.24;_QI=0|0|0|0.5|1|1|2|0|72;Note="Unknown"
Nt01 maker mRNA 470560 474346 . + . ID=Nitab4.5_0002367g0010.1;Parent=Nitab4.5_0002367g0010;Name=Nitab4.5_0002367g0010.1;_AED=0.11;_eAED=0.11;_QI=0|0|0|1|1|1|14|0|668;Note="Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation"
Nt01 maker mRNA 499946 502182 . + . ID=Nitab4.5_0002367g0020.1;Parent=Nitab4.5_0002367g0020;Name=Nitab4.5_0002367g0020.1;_AED=0.26;_eAED=0.26;_QI=0|0.5|0|0.66|0|0|3|0|258;Note="Cellulose synthase"
Nt01 maker mRNA 496891 497596 . + . ID=Nitab4.5_0002367g0030.1;Parent=Nitab4.5_0002367g0030;Name=Nitab4.5_0002367g0030.1;_AED=0.33;_eAED=0.33;_QI=0|0|0|0.5|0|0.5|2|0|213;Note="Cellulose synthase"
Nt01 maker mRNA 505125 506853 . - . ID=Nitab4.5_0002367g0040.1;Parent=Nitab4.5_0002367g0040;Name=Nitab4.5_0002367g0040.1;_AED=0.09;_eAED=0.09;_QI=0|0|0|1|0.5|0.66|3|0|230;Note="Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type"
Nt01 maker mRNA 564383 570328 . + . ID=Nitab4.5_0002367g0050.1;Parent=Nitab4.5_0002367g0050;Name=Nitab4.5_0002367g0050.1;_AED=0.08;_eAED=0.08;_QI=75|1|1|1|1|1|6|146|267;Note="SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12"
ただし、次のコマンドを使用すると、文字列は次のように短縮されます。Note = "ペプチダーゼ
awk '$3=="mRNA"' Nitab-v4.5_gene_models_Chr_Edwards2017.gff | awk '{print $9}' | head
ID=Nitab4.5_0006317g0010.1;Parent=Nitab4.5_0006317g0010;Name=Nitab4.5_0006317g0010.1;_AED=0.08;_eAED=0.08;_QI=0|0.45|0.25|1|0.90|0.75|12|0|1011;Note="Peptidase
ID=Nitab4.5_0006317g0020.1;Parent=Nitab4.5_0006317g0020;Name=Nitab4.5_0006317g0020.1;_AED=0.26;_eAED=0.26;_QI=15|0|0|0.83|0.6|0.33|6|0|424;Note="Putative
ID=Nitab4.5_0006317g0030.1;Parent=Nitab4.5_0006317g0030;Name=Nitab4.5_0006317g0030.1;_AED=0.01;_eAED=0.01;_QI=161|1|1|1|0|0.5|2|358|141;Note="Unknown"
ID=Nitab4.5_0006317g0040.1;Parent=Nitab4.5_0006317g0040;Name=Nitab4.5_0006317g0040.1;_AED=0.02;_eAED=0.02;_QI=0|0|0|1|1|1|3|0|187;Note="Heavy
ID=Nitab4.5_0006317g0050.1;Parent=Nitab4.5_0006317g0050;Name=Nitab4.5_0006317g0050.1;_AED=0.24;_eAED=0.24;_QI=0|0|0|0.5|1|1|2|0|72;Note="Unknown"
ID=Nitab4.5_0002367g0010.1;Parent=Nitab4.5_0002367g0010;Name=Nitab4.5_0002367g0010.1;_AED=0.11;_eAED=0.11;_QI=0|0|0|1|1|1|14|0|668;Note="Auxin
ID=Nitab4.5_0002367g0020.1;Parent=Nitab4.5_0002367g0020;Name=Nitab4.5_0002367g0020.1;_AED=0.26;_eAED=0.26;_QI=0|0.5|0|0.66|0|0|3|0|258;Note="Cellulose
ID=Nitab4.5_0002367g0030.1;Parent=Nitab4.5_0002367g0030;Name=Nitab4.5_0002367g0030.1;_AED=0.33;_eAED=0.33;_QI=0|0|0|0.5|0|0.5|2|0|213;Note="Cellulose
ID=Nitab4.5_0002367g0040.1;Parent=Nitab4.5_0002367g0040;Name=Nitab4.5_0002367g0040.1;_AED=0.09;_eAED=0.09;_QI=0|0|0|1|0.5|0.66|3|0|230;Note="Zinc
ID=Nitab4.5_0002367g0050.1;Parent=Nitab4.5_0002367g0050;Name=Nitab4.5_0002367g0050.1;_AED=0.08;_eAED=0.08;_QI=75|1|1|1|1|1|6|146|267;Note="SAC3/GANP/Nin1/mts3/eIF-3
最終結果としてNitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
。
私が逃したものは何ですか?
事前にありがとう
答え1
GFFはタブ区切り形式ですが、タブは使用しません。-F'\t'
orを使用しない限り、BEGIN{FS="\t"}
awkはスペースを含むすべてのスペースをフィールド区切り文字として使用します。スペースを切っているので、$9
最初のスペースで終わります。 2つのコマンドも必要ありません。実行する必要がある作業は次のとおりです。
$ awk -F'\t' '$3=="mRNA"{print $9}' file.gff
ID=Nitab4.5_0006317g0010.1;Parent=Nitab4.5_0006317g0010;Name=Nitab4.5_0006317g0010.1;_AED=0.08;_eAED=0.08;_QI=0|0.45|0.25|1|0.90|0.75|12|0|1011;Note="Peptidase S59%2C nucleoporin"
ID=Nitab4.5_0006317g0020.1;Parent=Nitab4.5_0006317g0020;Name=Nitab4.5_0006317g0020.1;_AED=0.26;_eAED=0.26;_QI=15|0|0|0.83|0.6|0.33|6|0|424;Note="Putative S-adenosyl-L-methionine-dependent methyltransferase"
ID=Nitab4.5_0006317g0030.1;Parent=Nitab4.5_0006317g0030;Name=Nitab4.5_0006317g0030.1;_AED=0.01;_eAED=0.01;_QI=161|1|1|1|0|0.5|2|358|141;Note="Unknown"
ID=Nitab4.5_0006317g0040.1;Parent=Nitab4.5_0006317g0040;Name=Nitab4.5_0006317g0040.1;_AED=0.02;_eAED=0.02;_QI=0|0|0|1|1|1|3|0|187;Note="Heavy metal-associated domain%2C HMA"
ID=Nitab4.5_0006317g0050.1;Parent=Nitab4.5_0006317g0050;Name=Nitab4.5_0006317g0050.1;_AED=0.24;_eAED=0.24;_QI=0|0|0|0.5|1|1|2|0|72;Note="Unknown"
ID=Nitab4.5_0002367g0010.1;Parent=Nitab4.5_0002367g0010;Name=Nitab4.5_0002367g0010.1;_AED=0.11;_eAED=0.11;_QI=0|0|0|1|1|1|14|0|668;Note="Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation"
ID=Nitab4.5_0002367g0020.1;Parent=Nitab4.5_0002367g0020;Name=Nitab4.5_0002367g0020.1;_AED=0.26;_eAED=0.26;_QI=0|0.5|0|0.66|0|0|3|0|258;Note="Cellulose synthase"
ID=Nitab4.5_0002367g0030.1;Parent=Nitab4.5_0002367g0030;Name=Nitab4.5_0002367g0030.1;_AED=0.33;_eAED=0.33;_QI=0|0|0|0.5|0|0.5|2|0|213;Note="Cellulose synthase"
ID=Nitab4.5_0002367g0040.1;Parent=Nitab4.5_0002367g0040;Name=Nitab4.5_0002367g0040.1;_AED=0.09;_eAED=0.09;_QI=0|0|0|1|0.5|0.66|3|0|230;Note="Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type"
ID=Nitab4.5_0002367g0050.1;Parent=Nitab4.5_0002367g0050;Name=Nitab4.5_0002367g0050.1;_AED=0.08;_eAED=0.08;_QI=75|1|1|1|1|1|6|146|267;Note="SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12"
値のみを取得するには、Note=
次のようにします。
$ awk -F"\t" '$3=="mRNA"{sub(/.*Note=/,"",$9); print $9}' file.gff
"Peptidase S59%2C nucleoporin"
"Putative S-adenosyl-L-methionine-dependent methyltransferase"
"Unknown"
"Heavy metal-associated domain%2C HMA"
"Unknown"
"Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation"
"Cellulose synthase"
"Cellulose synthase"
"Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type"
"SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12"
そして、コメント自体に属することができる部分を維持しながら、コメントの先頭と末尾の引用符を削除するには、次の手順を実行します。
$ awk -F"\t" '$3=="mRNA"{sub(/.*Note=/,"",$9); print $9}' file.gff | sed 's/^"//; s/"$//'
Peptidase S59%2C nucleoporin
Putative S-adenosyl-L-methionine-dependent methyltransferase
Unknown
Heavy metal-associated domain%2C HMA
Unknown
Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Cellulose synthase
Cellulose synthase
Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12
Note
最後に sum 値を取得するには、ID
次のようにします。
$ awk -F"\t" '$3=="mRNA"{n=$9; sub(/.*Note="/,"",n); sub(/"$/,"",n); sub(/.*ID=/,"",$9); sub(/;.*/,"",$9); print $9","n}' file.gff
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
Nitab4.5_0006317g0020.1,Putative S-adenosyl-L-methionine-dependent methyltransferase
Nitab4.5_0006317g0030.1,Unknown
Nitab4.5_0006317g0040.1,Heavy metal-associated domain%2C HMA
Nitab4.5_0006317g0050.1,Unknown
Nitab4.5_0002367g0010.1,Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Nitab4.5_0002367g0020.1,Cellulose synthase
Nitab4.5_0002367g0030.1,Cellulose synthase
Nitab4.5_0002367g0040.1,Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
Nitab4.5_0002367g0050.1,SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12
しかし、個人的に私はPerlでこれを行います。
$ perl -F'\t' -lane 'if($F[2] eq "mRNA"){/ID=([^\;]+).*Note="([^"]+)/; print "$1,$2"}' file.gff
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
Nitab4.5_0006317g0020.1,Putative S-adenosyl-L-methionine-dependent methyltransferase
Nitab4.5_0006317g0030.1,Unknown
Nitab4.5_0006317g0040.1,Heavy metal-associated domain%2C HMA
Nitab4.5_0006317g0050.1,Unknown
Nitab4.5_0002367g0010.1,Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Nitab4.5_0002367g0020.1,Cellulose synthase
Nitab4.5_0002367g0030.1,Cellulose synthase
Nitab4.5_0002367g0040.1,Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
Nitab4.5_0002367g0050.1,SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome
答え2
各行の末尾にある引用文字列にタブやaを含めることができないと仮定すると、おそらく次;
のようなものが必要になります。
$ awk -F'\t' -v OFS=',' '$3=="mRNA"{ gsub(/;/,FS); gsub(/[^\t=]+=|"/,""); print $9, $15 }' file
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
Nitab4.5_0006317g0020.1,Putative S-adenosyl-L-methionine-dependent methyltransferase
Nitab4.5_0006317g0030.1,Unknown
Nitab4.5_0006317g0040.1,Heavy metal-associated domain%2C HMA
Nitab4.5_0006317g0050.1,Unknown
Nitab4.5_0002367g0010.1,Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimeriion
Nitab4.5_0002367g0020.1,Cellulose synthase
Nitab4.5_0002367g0030.1,Cellulose synthase
Nitab4.5_0002367g0040.1,Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
Nitab4.5_0002367g0050.1,SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12
または(元のアイデア):
$ cat tst.awk
BEGIN { FS="\t"; OFS="," }
$3 == "mRNA" {
split($NF,f,/;/)
for (i in f) {
gsub(/^[^=]+="?|"$/,"",f[i])
}
print f[1], f[7]
}
。
$ awk -f tst.awk file
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
Nitab4.5_0006317g0020.1,Putative S-adenosyl-L-methionine-dependent methyltransferase
Nitab4.5_0006317g0030.1,Unknown
Nitab4.5_0006317g0040.1,Heavy metal-associated domain%2C HMA
Nitab4.5_0006317g0050.1,Unknown
Nitab4.5_0002367g0010.1,Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Nitab4.5_0002367g0020.1,Cellulose synthase
Nitab4.5_0002367g0030.1,Cellulose synthase
Nitab4.5_0002367g0040.1,Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
Nitab4.5_0002367g0050.1,SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12
含めることができる場合は、OFSの,
使用を再考し、タブまたは代わりを使用する必要があります。,
;
答え3
ID=
「最終結果」出力形式がフィールドから派生すると仮定すると、Note=
次のことが機能します。
$ awk -F'\t' '$3=="mRNA"{split($9,f,";"); split(f[1],i,"="); split(f[7],n,"\""); printf("%s,%s\n",i[2],n[2])}' file.gff
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin
Nitab4.5_0006317g0020.1,Putative S-adenosyl-L-methionine-dependent methyltransferase
Nitab4.5_0006317g0030.1,Unknown
Nitab4.5_0006317g0040.1,Heavy metal-associated domain%2C HMA
Nitab4.5_0006317g0050.1,Unknown
Nitab4.5_0002367g0010.1,Auxin response factor%2C B3 DNA binding domain%2C DNA-binding pseudobarrel domain%2C AUX/IAA protein%2C Aux/IAA-ARF-dimerisation
Nitab4.5_0002367g0020.1,Cellulose synthase
Nitab4.5_0002367g0030.1,Cellulose synthase
Nitab4.5_0002367g0040.1,Zinc finger%2C RING-type%2C Zinc finger%2C RING/FYVE/PHD-type
Nitab4.5_0002367g0050.1,SAC3/GANP/Nin1/mts3/eIF-3 p25%2C 26S proteasome non-ATPase regulatory subunit Rpn12
これは、タブで区切られた9番目のフィールドを配列;
に格納されている区切られたサブフィールドに分割しますf
。ここで、最初のサブフィールド(ID=...
)はで分割され、=
7番目のサブフィールド(Note=" ... "
)はで分割されて意味が分離されます。関心のある部分はそれぞれ補助配列"
と区切ります。 。その後、これらの関連部分を印刷します。i
n
これはかなり簡潔なコードを可能にしますが、必ずしも最も効率的なアプローチではありません。
答え4
GNU awkを使う:
awk -F'\t' -vOFS=, 'NF>=9&&$3=="mRNA"{
N = split($9, a, /[=;]/)
for ( i=1; i<=N; i+=2 )
h[a[i]] = a[i+1]
print h["ID"], gensub(/"(.*)"/, "\\1", "g", h["Note"])
}' bioinformatic.file
Nitab4.5_0006317g0010.1,Peptidase S59%2C nucleoporin