Merging multiple files using a common ID

Question 1

This may not be possible with a single tool. Here is a script-based suggestion that relies on a call to sort and two temporary external files.

#!/bin/bash

# The number of columns is equal to the number of input files, which is
# equal to the number of command-line arguments.
NUMCOLS=$#


# Use associative container to record all "IDs" and associated fields
declare -A entries
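col=0

# Start with empty work and output files, since the script appends to both
: > "tmp.ids"
: > "outfile.txt"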


# Read the fields from all files and store them so that the field values can be
# associated with the file they came from (= the column they belong to).
for FILE in "$@"
do
    # -r keeps backslashes in the data intact
    while read -r id value
    do
        SORTKEY="$id"__"$col"
        entries[$SORTKEY]="$value"
        echo "$id" >> "tmp.ids"
    done < "$FILE"
    col=$((col + 1))
done

# Sort the IDs
sort -u "tmp.ids" > "tmp.ids.sorted"


# Read the sorted IDs back in and generate output lines, where the
# column fields are taken from the associative container "entries" and
# tab-separated.
# If "entries" doesn't contain a value for a given key, output "-NA-" instead.

while read -r id
do
    LINE="$id"
    for (( col=0; col<NUMCOLS; col++ ))
    do
        SORTKEY="$id"__"$col"
        if [[ -z "${entries[$SORTKEY]}" ]]
        then
            LINE=$(printf "%s\t-NA-" "$LINE")
        else
            LINE=$(printf "%s\t%s" "$LINE" "${entries[$SORTKEY]}")
        fi
    done
    echo "$LINE" >> "outfile.txt"
done < "tmp.ids.sorted"

rm tmp.ids tmp.ids.sorted

You can call it as ./sortscript.sh <file1> <file2> ... <fileN>.

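This creates an associative container, entries, in which every field read from the input files is stored under a key built from the "ID" field and the column number. The IDs are written to an external file, tmp.ids, as they are encountered, so that they can be sorted afterwards.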
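The key scheme can be tried in isolation. The following is a minimal sketch, assuming bash 4 or later (needed for declare -A); the IDs and values are made up for illustration, and the IDs are recovered from the keys rather than from tmp.ids.

```shell
#!/bin/bash
# Hypothetical sample data: (id, value) pairs as if read from two files.
declare -A entries

entries["geneA__0"]="kinase"      # geneA appears in file 0
entries["geneA__1"]="expressed"   # ... and in file 1
entries["geneB__1"]="silent"      # geneB appears only in file 1

# Recover the distinct IDs by stripping the "__<col>" suffix from each key,
# then sort them and drop duplicates, as the script does with sort -u.
ids=$(for key in "${!entries[@]}"; do echo "${key%__*}"; done | sort -u)
echo "$ids"
```

Because the column number is part of the key, the same ID can carry one value per input file without the entries overwriting each other.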

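After sorting, the IDs are read back in. Then, for each ID, all available fields belonging to that key are read from the container entries and placed in the output line (variable LINE). If no value is available for a particular ID/column combination, -NA- is written instead.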
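The fallback for missing values can also be written with bash's default-value expansion instead of an if/else. This is only a sketch of that alternative, with made-up data and the hypothetical names numcols, key and line:

```shell
#!/bin/bash
# Hypothetical data: geneA has a value in columns 0 and 2, but not in column 1.
declare -A entries=( ["geneA__0"]="kinase" ["geneA__2"]="OK" )
numcols=3

line="geneA"
for (( col = 0; col < numcols; col++ )); do
    key="geneA__${col}"
    # ${var:-default} expands to the default when the entry is unset or empty,
    # which matches the -z test used in the script.
    line+=$'\t'"${entries[$key]:--NA-}"
done
printf '%s\n' "$line"
```

This also avoids one command substitution per column, which can matter for large inputs.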

Each output line is then written to the file outfile.txt.

Answer

Question 2

You can use the join utility twice to perform two "outer joins" on the three files. Assuming all three files are tab-delimited, start with the first two files:

$ join -a 1 -a 2 -o 0,1.2,2.2 -e '-NA-' -t $'\t' <( sort File1 ) <( sort File2 )
MYORGANISM_I_05140.t1   Atypical/PIKK/FRAP      VALUES to be taken
MYORGANISM_I_06518.t1   CAMK/MLCK       -NA-
MYORGANISM_I_00854.t1   TK-assoc/SH2/SH2-R      -NA-
MYORGANISM_I_12755.t1   TK-assoc/SH2/Unique     -NA-
MYORGANISM_I_12766.t1   -NA-    what

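For this to work, the join utility needs both files to be sorted on the join field (the first field, by default). With -a 1 -a 2 we explicitly request all lines from both files, even those without a match, and with -o 0,1.2,2.2 we ask for the join field (the first column) and the second column of each file to be included in the output. The -e '-NA-' option specifies the string used to fill empty fields.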
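To see the effect of these options on something small, here is a sketch with two made-up input files (left.tsv and right.tsv are hypothetical names). It fixes the locale to C so that sort and join agree on the ordering.

```shell
#!/bin/bash
# Hypothetical tab-separated inputs, created in a scratch directory.
workdir=$(mktemp -d)
printf 'id2\tbeta\nid1\talpha\n'  > "$workdir/left.tsv"
printf 'id3\tgamma\nid1\tdelta\n' > "$workdir/right.tsv"

# Sort and join under the same locale so that join sees the order it expects.
export LC_ALL=C
joined=$(join -a 1 -a 2 -o 0,1.2,2.2 -e '-NA-' -t $'\t' \
    <(sort "$workdir/left.tsv") <(sort "$workdir/right.tsv"))
printf '%s\n' "$joined"
rm -r "$workdir"
```

id1 is present in both files and gets both values; id2 and id3 each exist in only one file, so the missing side is filled with -NA-.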

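The above gives us a new data set that can be used in a second join against the third file. For simplicity, assume that the result above is available in tmpdata (after a redirection):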

$ join -a 1 -a 2 -o 0,1.2,1.3,2.2 -e '-NA-' -t $'\t' tmpdata <( sort FILE3 )
MYORGANISM_I_00854.t1   TK-assoc/SH2/SH2-R      -NA-    -NA-
MYORGANISM_I_05140.t1   Atypical/PIKK/FRAP      VALUES to be taken      -NA-
MYORGANISM_I_06518.t1   CAMK/MLCK       -NA-    -NA-
MYORGANISM_I_12755.t1   TK-assoc/SH2/Unique     -NA-    -NA-
MYORGANISM_I_12766.t1   -NA-    what    -NA-
MYORGANISM_I_16941.t1   -NA-    -NA-    OK
MYORGANISM_I_93484.t1   -NA-    -NA-    LET IT BE

This more or less repeats the earlier "outer join", but with an additional column in the -o option.
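If the intermediate file is not wanted, the two joins can probably be chained in a pipeline, since join accepts - to mean standard input. A sketch with made-up inputs (f1.tsv, f2.tsv and f3.tsv are hypothetical names):

```shell
#!/bin/bash
# Hypothetical tab-separated inputs, created in a scratch directory.
workdir=$(mktemp -d)
printf 'id1\ta1\nid2\ta2\n' > "$workdir/f1.tsv"
printf 'id2\tb2\n'          > "$workdir/f2.tsv"
printf 'id3\tc3\n'          > "$workdir/f3.tsv"

export LC_ALL=C
merged=$(
    # First outer join: files 1 and 2, giving three columns.
    join -a 1 -a 2 -o 0,1.2,2.2 -e '-NA-' -t $'\t' \
        <(sort "$workdir/f1.tsv") <(sort "$workdir/f2.tsv") |
    # Second outer join: the result on stdin ("-") against file 3.
    join -a 1 -a 2 -o 0,1.2,1.3,2.2 -e '-NA-' -t $'\t' \
        - <(sort "$workdir/f3.tsv")
)
printf '%s\n' "$merged"
rm -r "$workdir"
```

The output of the first join is already ordered by ID because both of its inputs were sorted, so it can feed the second join directly.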

Answer