Compare fileB against fileA and extract the matching data from fileA using awk, sed, or grep.

Answer 1

You can use awk (a multi-character RS, as used here, requires an awk that supports it, such as GNU awk):

awk 'NR==FNR{         # On the first file (fileB),
       a[$0];         # store each line as a key of the array a
       next
     }
     {                        # On the second file (fileA),
         for(i in a)          # for each element of the array a,
            if(index($0,i)) { # check whether it occurs in the current record
               # restore the "C" consumed by RS; the first record lost none
               print (FNR==1 ? "" : "C") $0
               next
            }
     }' fileB RS='\nC' fileA
C    02030 *Bacterial chemotaxis* [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein 
C    03070 *Bacterial secretion system* [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD 
D      NT05HA_1310 protein-export membrane protein SecF 
D      NT05HA_1819 preprotein translocase subunit SecE
D      NT05HA_1287 protein-export membrane protein

Answer 2

If you want an exact match on the part between C <word> and [PATH:...] (and assuming the * in your sample are there for emphasis and are not part of the actual data), you could do:

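awk '
  # start is unset while reading fileB: record each line as an exact key
  !start {all_strings[$0]; next}
  /^C/ {
    key = $0

    # strip the leading C <word>:
    sub(/^C[[:blank:]]+[^[:blank:]]+[[:blank:]]*/, "", key)

    # strip the trailing [...]:
    sub(/[[:blank:]]*\[[^]]*][[:blank:]]*$/, "", key)

    # hash lookup: select this record only if its title is a line of fileB
    selected = key in all_strings
  }
  # print every line (C and D alike) while the current record is selected
  selected' fileB start=1 fileA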

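Besides being more reliable (for example, Bacterial secretion matches only a record with exactly that title, and does not match Bacterial secretion system), this is also very efficient: each file is read only once, and matching is a hash table lookup rather than a series of substring searches or regular expression matches.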
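
To see the exact-match behaviour on a small case, here is a sketch with made-up sample files. The real fileA and fileB are not shown in this post, so the two records and the two query lines below are assumptions. It wraps the same awk program in a hypothetical helper function, exact_match, and runs it with one query that is only a prefix of a record title:

```shell
#!/bin/sh
# Hypothetical sample data; the real fileA and fileB are not part of this post.
dir=$(mktemp -d)
cat > "$dir/fileA" <<'EOF'
C    02030 Bacterial chemotaxis [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
C    03070 Bacterial secretion system [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
EOF
# "Bacterial secretion" is only a prefix of a title, so it should select nothing.
printf '%s\n' 'Bacterial secretion' 'Bacterial chemotaxis' > "$dir/fileB"

# exact_match QUERIES DATA: hypothetical wrapper around the awk program above.
exact_match() {
  awk '
    !start {all_strings[$0]; next}
    /^C/ {
      key = $0
      sub(/^C[[:blank:]]+[^[:blank:]]+[[:blank:]]*/, "", key)
      sub(/[[:blank:]]*\[[^]]*][[:blank:]]*$/, "", key)
      selected = key in all_strings
    }
    selected' "$1" start=1 "$2"
}

result=$(exact_match "$dir/fileB" "$dir/fileA")
printf '%s\n' "$result"
rm -rf "$dir"
```

With this data only the chemotaxis record (its C line and its D line) should be printed. The query Bacterial secretion selects nothing, because the key extracted from fileA is Bacterial secretion system.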

Answer 3

I'm sure I'll get shot down for using a loop, but still... here is one approach.

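#!/bin/bash

# For each query line in fileB, print from the matching line in fileA
# through the next line starting with C, then drop that closing C line.
# Note: $line is interpolated into the sed script as a regular expression.
while read -r line; do
    sed -n "/$line/,/^C/p" fileA | sed '$d'
done < fileB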

Example:

./bacteria.sh 
C    02030 *Bacterial chemotaxis* [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein 
C    03070 *Bacterial secretion system* [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD 
D      NT05HA_1310 protein-export membrane protein SecF 
D      NT05HA_1819 preprotein translocase subunit SecE
D      NT05HA_1287 protein-export membrane protein

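Here fileA and fileB are your sample files.

Regex breakdown:

sed -n "/$line/,/^C/p" fileA | sed '$d'

This prints the lines from the one matching $line through the next line that starts with C. That last line only serves as the stop marker, so it is removed again with sed '$d'.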
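
As a rough illustration of how the range and the trailing sed '$d' interact, the sketch below runs the same pipeline on a made-up fileA (the sample data and the helper name extract are assumptions, not the original files). It also shows an edge case worth knowing: when the matching record is the last one in the file, there is no following C line to act as the stop marker, so sed '$d' removes a real D line instead.

```shell
#!/bin/sh
# Hypothetical sample data; the real fileA is not part of this post.
dir=$(mktemp -d)
cat > "$dir/fileA" <<'EOF'
C    02030 Bacterial chemotaxis [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein
C    03070 Bacterial secretion system [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
D      NT05HA_1310 protein-export membrane protein SecF
EOF

# extract QUERY FILE: the same sed pipeline as in the loop, for one query.
extract() { sed -n "/$1/,/^C/p" "$2" | sed '$d'; }

# A record followed by another C line: the stop marker is dropped as intended.
middle=$(extract 'Bacterial chemotaxis' "$dir/fileA")

# The last record in the file: the range runs to end of file,
# so the line deleted by sed '$d' is the final D line.
last=$(extract 'Bacterial secretion system' "$dir/fileA")

printf '%s\n---\n%s\n' "$middle" "$last"
rm -rf "$dir"
```

Here middle holds the complete chemotaxis record, while last is missing its final D line (the SecF entry).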

sed --version
sed (GNU sed) 4.2.2

bash --version
GNU bash, version 4.2.46(1)-release (x86_64-redhat-linux-gnu)

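Answer 4

The data in fileA is divided into records, each starting with a C at the beginning of a line. Each record is divided into fields, each starting with a D at the beginning of a line.

We need to read the lines of fileB and use each one to query the first field of every record in fileA.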

while read -r query; do
    awk -vq="$query" 'BEGIN { RS="^C|\nC"; FS=OFS="\nD" } $1 ~ q {print "C" $0}' fileA
done <fileB

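I set the record separator (RS) to match a C at the start of the file or right after a newline. If it matched only a C that follows a newline, the first record might not be matched correctly. The awk variable q holds the value read from fileB, and the first field of each record is matched against it.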
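
One detail of $1 ~ q worth keeping in mind: the right-hand side is treated as a regular expression, so a query line containing characters such as . or [ is not matched literally, unlike a substring test with index(). A minimal sketch of the difference, using a made-up query and a hypothetical helper regex_vs_literal:

```shell
#!/bin/sh
# regex_vs_literal QUERY TEXT: report whether QUERY matches TEXT
# as a regular expression (~) and as a literal substring (index).
regex_vs_literal() {
  awk -v q="$1" -v text="$2" 'BEGIN {
    printf "regex=%d literal=%d\n", (text ~ q), (index(text, q) > 0)
  }'
}

# Hypothetical query: the "." matches the space when used as a regex,
# but the literal string "secretion.system" does not occur in the text.
out=$(regex_vs_literal 'secretion.system' 'Bacterial secretion system')
printf '%s\n' "$out"
```

This prints regex=1 literal=0. For plain titles such as Bacterial chemotaxis the two tests agree, so this only matters if the lines in fileB can contain regex metacharacters.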

Result:

C    02030 *Bacterial chemotaxis* [PATH:aap02030]
D      NT05HA_0919 maltose-binding periplasmic protein
D      NT05HA_0918 maltose-binding periplasmic protein
C    03070 *Bacterial secretion system* [PATH:aap03070]
D      NT05HA_1309 protein-export membrane protein SecD
D      NT05HA_1310 protein-export membrane protein SecF
D      NT05HA_1819 preprotein translocase subunit SecE
D      NT05HA_1287 protein-export membrane protein
