How to compare two files on common columns and get an output file containing columns from each file

Question 1

Bioinformatics looks interesting. If you are open to a solution that doesn't use awk, this is easy with Miller:

mlr --itsv join -u -j chrom,pos --lp tr_ --rp untr_ -f treated.bam.tsv untreated.bam.tsv | # join data from treated and untreated files by fields chrom and pos
mlr put '$tr_pct=($tr_mismatches+$tr_deletions+$tr_insertions)/$tr_reads_all' | # calculate pct for treated data
mlr put '$untr_pct=($untr_mismatches+$untr_deletions+$untr_insertions)/$untr_reads_all' | # calculate pct for untreated data
mlr cut -o -f chrom,pos,tr_ref,tr_reads_all,tr_mismatches,tr_deletions,tr_insertions,tr_pct,untr_ref,untr_reads_all,untr_mismatches,untr_deletions,untr_insertions,untr_pct | # remove superfluous fields
mlr --otsv put '$pct_sub=$tr_pct-$untr_pct' # append pct subtraction field

chrom  pos       tr_ref  tr_reads_all  tr_mismatches  tr_deletions  tr_insertions  tr_pct    untr_ref  untr_reads_all  untr_mismatches  untr_deletions  untr_insertions  untr_pct  pct_sub
chrY   59363551  G       8             0              1             5              0.750000  G         2               0                0               1                0.500000  0.250000
chrY   59363552  G       7             0              0             0              0         G         1               0                0               0                0         0
chrY   59363553  T       7             0              0             0              0         T         1               0                0               0                0         0
chrY   59363554  G       7             0              0             0              0         G         1               0                0               0                0         0
chrY   59363555  T       7             0              0             0              0         T         1               0                0               0                0         0

It looks scarier than it really is. Honest.
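As a quick sanity check of the percentage formula used in the pipeline, the first output row can be recomputed by hand. This is only a sketch: the numbers are copied from the table above, and the variable name pct_check is made up for illustration.

```shell
# Recompute the first output row (chrY 59363551) by hand.
# Treated:   mismatches=0 deletions=1 insertions=5 reads_all=8
# Untreated: mismatches=0 deletions=0 insertions=1 reads_all=2
pct_check=$(awk 'BEGIN {
    tr_pct   = (0 + 1 + 5) / 8
    untr_pct = (0 + 0 + 1) / 2
    # Same three columns as tr_pct, untr_pct and pct_sub in the table.
    printf "%.6f %.6f %.6f\n", tr_pct, untr_pct, tr_pct - untr_pct
}')
echo "$pct_check"
```

The three values should line up with the tr_pct, untr_pct and pct_sub columns of that row.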

Question 2

if ( $1 $2 in array ) won't work. You have to write if (($1,$2) in array).
You can't use array[$3] and array[$4]. Your array looks like this:
```
array[chrY,59363551]="chrY 59363551 G 8 0 7 0 0 0 1 0 5 0 0 0 0 0 0 0 7 0 0 0"
array[chrY,59363552]="chrY 59363552 G 7 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0"
             ︙
```
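array[$3] and array[$4] would mean things like array[G] and array[2], which don't exist.
The ability to specify > "filename" inside awk code is handy when you want to write to multiple files. It isn't much use when there is only one output file. Why not just redirect the output of the awk command?
Long lines are bad. Break long commands into shorter lines. Reuse variables to reduce duplication.
Don't use an array called array. That's like having a variable called variable, a file called file, a person called Person, and so on. Use descriptive names.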
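To make the first point concrete, here is a minimal sketch of the difference between the two membership tests. The one-line input and the names seen and key_check are invented for the demonstration; only standard awk behaviour is assumed (a comma-separated subscript is joined with SUBSEP, while $1 $2 is plain string concatenation).

```shell
key_check=$(printf 'chrY 59363551 G 8\n' | awk '{
    seen[$1, $2] = $0                 # stored under the key $1 SUBSEP $2
    # Parenthesised, comma-separated subscript: tests the two-part key.
    paired = (($1, $2) in seen) ? "found" : "missing"
    # Concatenation builds "chrY59363551", a key that was never stored.
    joined = (($1 $2) in seen) ? "found" : "missing"
    print paired, joined
}')
echo "$key_check"
```

The concatenated form is what if ( $1 $2 in array ) actually tests, which is why it never matches.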

So:

awk 'FNR==NR {file1data[$1,$2]=$0; next}
        {       if (($1,$2) in file1data) {
                        # Save desired values from file2.
                        file2arg03=$3
                        file2arg04=$4
                        file2arg08=$8
                        file2arg10=$10
                        file2arg12=$12
                        pct_file2=($8+$10+$12)/$4
                        # Get data from file1.
                        $0=file1data[$1,$2]
                        pct_file1=($8+$10+$12)/$4
                        print $1, $2, $3, $4, $8, $10, $12, pct_file1, \
                                file2arg03, file2arg04, file2arg08, file2arg10, file2arg12, \
                                pct_file2, pct_file1-pct_file2
                } else printf "(%s,%s) in file2 but not file1.%s", $1, $2, ORS
        }' treated.bam.tsv untreated.bam.tsv > awkoutput.bam.tsv

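Like your version, this saves the file1 data in an array and does all the processing and output while reading file2. When it gets a line from file2, it saves the needed fields from that line in named variables (it could also use another array, five elements long). Then it retrieves the data from the corresponding line of file1. Assigning the whole line to $0 sets $1, $2, $3, $4, etc. back to their original file1 values.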
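The re-splitting behaviour can be seen in isolation with a small sketch. The input line and the replacement string are shortened, made-up stand-ins for a file2 line and a stored file1 line; saved_reads and resplit_check are illustrative names.

```shell
resplit_check=$(printf 'chrY 59363551 G 2\n' | awk '{
    saved_reads = $4                  # keep the current value before overwriting
    $0 = "chrY 59363551 G 8"          # stand-in for a stored file1 line
    # awk re-splits the record on assignment to $0, so $4 now comes
    # from the new line while saved_reads still holds the old value.
    print $4, saved_reads, NF
}')
echo "$resplit_check"
```

This is why the file2 fields have to be copied into variables before $0 is overwritten.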

Are you really having trouble writing a header line to the output? Try:

        {       if (FNR == 1) {
                        print "chrom pos ref reads_all mismatches deletions insertions pct_file1 …"
                } else if (($1,$2) in file1data ) {
                        file2arg03=$3
                              ︙

OK. Here is a version that is closer to your attempt and handles the header line:

awk 'FNR==NR {file1line[$1,$2]=$0; next}
        {       if (FNR == 1) {
                        print "chrom pos ref reads_all mismatches deletions insertions pct_file1 ref reads_all mismatches deletions insertions pct_file2 pct_sub …"
                } else if (($1,$2) in file1line ) {
                        # Get data from file1.
                        split(file1line[$1,$2], file1arg)
                        pct_file1=(file1arg[8]+file1arg[10]+file1arg[12])/file1arg[4]
                        pct_file2=($8+$10+$12)/$4
                        print $1, $2, file1arg[3], file1arg[4], file1arg[8], \
                                file1arg[10], file1arg[12], pct_file1, \
                                $3, $4, $8, $10, $12, pct_file2, pct_file1-pct_file2
                } else printf "(%s,%s) in file2 but not file1.%s", $1, $2, ORS
        }' treated.bam.tsv untreated.bam.tsv > awkoutput.bam.tsv

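This retrieves the line from file1 (from file1line) and passes it to split, which breaks it into its 23 component values and stores them in the array file1arg. You can then use file1arg[3], file1arg[4], ... the way you were trying to use array[$3] and array[$4].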
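As a small standalone illustration of split, the sketch below feeds it the first sample line shown earlier and recomputes the percentage from the resulting array. The variable names line, n, pct and split_check are only for the example.

```shell
split_check=$(awk 'BEGIN {
    line = "chrY 59363551 G 8 0 7 0 0 0 1 0 5 0 0 0 0 0 0 0 7 0 0 0"
    n = split(line, file1arg)         # two-argument form splits on FS (blanks by default)
    # Same fields as the script: mismatches, deletions, insertions over reads_all.
    pct = (file1arg[8] + file1arg[10] + file1arg[12]) / file1arg[4]
    printf "%d %s %.2f\n", n, file1arg[3], pct
}')
echo "$split_check"
```

split returns the number of pieces, which is a handy way to confirm the line really has the expected 23 fields.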
