次の3つのファイルがあります。
ファイル1:
ko00980 Metabolism of xenobiotics by cytochrome P450 (5)
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00982 Drug metabolism - cytochrome P450 (5)
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00485 FMO; dimethylaniline monooxygenase (N-oxide forming) [EC:1.14.13.8]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00983 Drug metabolism - other enzymes (4)
ko:K00088 guaB; IMP dehydrogenase [EC:1.1.1.205]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00857 tdk; thymidine kinase [EC:2.7.1.21]
ko:K00876 udk; uridine kinase [EC:2.7.1.48]
ファイル2:
ko00980 Metabolism of xenobiotics by cytochrome P450 (6)
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00079 CBR1; carbonyl reductase 1 [EC:1.1.1.184 1.1.1.189 1.1.1.197]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00982 Drug metabolism - cytochrome P450 (4)
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00983 Drug metabolism - other enzymes (8)
ko:K00088 guaB; IMP dehydrogenase [EC:1.1.1.205]
ko:K00106 XDH; xanthine dehydrogenase/oxidase [EC:1.17.1.4 1.17.3.2]
ko:K00760 hprT; hypoxanthine phosphoribosyltransferase [EC:2.4.2.8]
ko:K00876 udk; uridine kinase [EC:2.7.1.48]
ko:K01431 UPB1; beta-ureidopropionase [EC:3.5.1.6]
ko:K01464 DPYS; dihydropyrimidinase [EC:3.5.2.2]
ko:K01519 ITPA; inosine triphosphate pyrophosphatase [EC:3.6.1.19]
ko:K13421 UMPS; uridine monophosphate synthetase [EC:2.4.2.10 4.1.1.23]
ファイル3:
ko00980 Metabolism of xenobiotics by cytochrome P450 (7)
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00079 CBR1; carbonyl reductase 1 [EC:1.1.1.184 1.1.1.189 1.1.1.197]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00982 Drug metabolism - cytochrome P450 (6)
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00485 FMO; dimethylaniline monooxygenase (N-oxide forming) [EC:1.14.13.8]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00983 Drug metabolism - other enzymes (8)
ko:K00088 guaB; IMP dehydrogenase [EC:1.1.1.205]
ko:K00207 DPYD; dihydropyrimidine dehydrogenase (NADP+) [EC:1.3.1.2]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00857 tdk; thymidine kinase [EC:2.7.1.21]
ko:K00876 udk; uridine kinase [EC:2.7.1.48]
ko:K01431 UPB1; beta-ureidopropionase [EC:3.5.1.6]
ko:K01489 cdd; cytidine deaminase [EC:3.5.4.5]
ko:K01951 guaA; GMP synthase (glutamine-hydrolysing) [EC:6.3.5.2]
各ファイルには括弧で始まるヘッダ行がありko*****
、括弧内に字幕行の名前と数があります。たとえば、次のようになります。
ko00980 Metabolism of xenobiotics by cytochrome P450 (5)
字幕行は次から始まります。ko:K*****
各ヘッダー行のサブヘッダー行を3つのファイルにマージして、次のようなuniq
結果を実行したいと思います。
ko00980:
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00079 CBR1; carbonyl reductase 1 [EC:1.1.1.184 1.1.1.189 1.1.1.197]
ko00982
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00485 FMO; dimethylaniline monooxygenase (N-oxide forming) [EC:1.14.13.8]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00088 guaB; IMP dehydrogenase [EC:1.1.1.205]
ko:K00207 DPYD; dihydropyrimidine dehydrogenase (NADP+) [EC:1.3.1.2]
ko:K00857 tdk; thymidine kinase [EC:2.7.1.21]
ko:K00876 udk; uridine kinase [EC:2.7.1.48]
ko:K01431 UPB1; beta-ureidopropionase [EC:3.5.1.6]
ko:K01489 cdd; cytidine deaminase [EC:3.5.4.5]
ko:K01951 guaA; GMP synthase (glutamine-hydrolysing) [EC:6.3.5.2]
ko00983
ko:K00088 guaB; IMP dehydrogenase [EC:1.1.1.205]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00857 tdk; thymidine kinase [EC:2.7.1.21]
ko:K00876 udk; uridine kinase [EC:2.7.1.48]
ko:K00106 XDH; xanthine dehydrogenase/oxidase [EC:1.17.1.4 1.17.3.2]
ko:K00760 hprT; hypoxanthine phosphoribosyltransferase [EC:2.4.2.8]
ko:K01431 UPB1; beta-ureidopropionase [EC:3.5.1.6]
ko:K01464 DPYS; dihydropyrimidinase [EC:3.5.2.2]
ko:K01519 ITPA; inosine triphosphate pyrophosphatase [EC:3.6.1.19]
ko:K13421 UMPS; uridine monophosphate synthetase [EC:2.4.2.10 4.1.1.23]
ko:K00207 DPYD; dihydropyrimidine dehydrogenase (NADP+) [EC:1.3.1.2]
ko:K01489 cdd; cytidine deaminase [EC:3.5.4.5]
ko:K01951 guaA; GMP synthase (glutamine-hydrolysing) [EC:6.3.5.2]
答え1
これにより、awk
以下を実行できます。
awk '/^ko[^:]/{fn=$1;next};/./{id=fn$1;if (!(seen[id]++)){print > fn}}' file[123]
各ヘッダー行では識別子ko*****
をとして保存し、サブヘッダー行では1を配列のインデックスとしてfn
保存し、最初に表示される場合はその行を書き込みます。fn$1
id
seen
id
fn
1: また使用できますfn$0
答え2
いくつかの魔法のスーパーマッシュアップコマンドがあるかもしれませんが、時には「線形」が理解し維持するのが最も簡単です。
したがって、ヘッダー行に基づいてファイル名を追跡してデータを追加するだけです。その後、sort -u
結果を渡して一意の行を取得できます。
#!/bin/bash
# Clean out old results from previous runs
/bin/rm -f ko*
for file in $@
do
filename=UNKNOWN
echo Processing $file
while read -r line
do
case $line in
ko:*) printf "%s\n" "$line" >> $filename ;;
ko*) filename=${line%% *} ; echo Switching to $filename ;;
"") # Do nothing
;;
*) echo Ignoring unknown line: $line
esac
done < $file
done
for file in ko*
do
echo Making unique: $file
sort -u -o $file $file
done
3つのソースファイルを使用して実行できます。
$ ./pattern_split file1 file2 file3
Processing file1
Switching to ko00980
Switching to ko00982
Switching to ko00983
Processing file2
Switching to ko00980
Switching to ko00982
Switching to ko00983
Processing file3
Switching to ko00980
Switching to ko00982
Switching to ko00983
Making unique: ko00980
Making unique: ko00982
Making unique: ko00983
3つの一意のファイルが作成されていることがわかります。最初を見てください:
$ cat ko00980
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00079 CBR1; carbonyl reductase 1 [EC:1.1.1.184 1.1.1.189 1.1.1.197]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
これで解決策は次のとおりです。固まったデータファイルの悪意のあるデータをターゲットにした場合(ファイルが存在する場合はどうなりますかko123/456
?破損する可能性がありますが、これは問題を解決する方法の概要です)。
答え3
それでは、ファイルの行をヘッダーに基づいて別々のファイルに移動しますか?
私はこのようなことがトリックを実行すると思います。
#!/usr/bin/env perl
use strict;
use warnings 'all';
#hash of output filehandles.
my %output_files;
#detect dupes
my %seen;
my $ko_num = 'NULL';
#<> is the 'magic' filehandle. You can either use it to iterate STDIN
#or take a list of file names on the command line (just like sed/grep etc.)
while ( my $line = <> ) {
#see if the line starts with 'ko':
if ( $line =~ m/(^ko\d+)/) {
$ko_num = $1;
#open a new file - for overwriting (so we only do this once)
open ( $output_files{$ko_num}, '>', $ko_num ) or die $! unless $output_files{$ko_num};
#skip printing - could write a header here instead.
next;
}
#look for a 'K' number.
if ( my ($K_id) = $line =~ m/ko:(K\d+)/ ) {
#skip it if we've already seen this combination of 'ko' number
#and k number.
next if $seen{$ko_num}{$K_id}++;
#print the output to this particular output file.
print {$output_files{$ko_num}} $line;
}
}
#close the filehandles.
close ( $_ ) for values %output_files;
したがって、この方法で "myscript.pl file1.txt file2.txt file3.txt" を実行でき、拡張可能な方法で正しい操作を実行する必要があります。別々のファイルなのか単一ストリームなのかは関係ありません。