Finding lines common to six files

Question 1

With awk you can do the following:

# skip the line if it was already seen in the current file
# (exact match on the pair line + file name)
seen[$0, FILENAME]++ { next }
# append the file name to the line's file list and bump its file counter
{ seenin[$0] = seenin[$0] " " FILENAME; nseen[$0]++ }

# print every line that was seen in more than one file
END {
  for (line in nseen)
    if (nseen[line] > 1)
      printf "line \"%s\" seen in %d files: %s\n", line, nseen[line], seenin[line]
}

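Limitation: memory. Every distinct line is held in RAM until the END block runs.

To sort by number of occurrences, adjust the print step accordingly (for example, sort by the value of nseen). This is easy with gawk: add the following to the END block, before the for loop: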

PROCINFO["sorted_in"]="@val_num_desc"
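If gawk is not available, the same ordering can be obtained by letting sort(1) do the work. The following is only a sketch under that assumption: the function name common_sorted and the tab-separated "count, line, files" output layout are hypothetical choices and not part of the script above.

```shell
# Sketch: list lines found in more than one file, most frequent first,
# without relying on gawk's PROCINFO["sorted_in"].
# Output columns (tab-separated): file count, line, list of files.
common_sorted() {
  awk '
    # count each line at most once per file (exact-match key)
    !seen[$0, FILENAME]++ {
      files[$0] = (n[$0]++ ? files[$0] " " : "") FILENAME
    }
    END {
      for (l in n)
        if (n[l] > 1) printf "%d\t%s\t%s\n", n[l], l, files[l]
    }
  ' "$@" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Called as common_sorted file1 file2 file3, it prints one row per shared line, with the highest file count at the top and ties ordered by the line text.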

Input files:

$ cat file1
a
a
b
b
c
d
e

$ cat file2
c
c
x
z
e
y
z
f

$ cat file3
f
i
a
c
z
i
k

Output (using gawk's PROCINFO array-traversal feature):

$ awk -f compare_lines_multifiles.awk file1 file2 file3
line "c" seen in 3 files:  file1 file2 file3
line "z" seen in 2 files:  file2 file3
line "a" seen in 2 files:  file1 file3
line "e" seen in 2 files:  file1 file2
line "f" seen in 2 files:  file2 file3

Edit:

The files you provided are in MS-DOS format. Either convert them with

 dos2unix file1.txt file2.txt ....

or adjust the record separator in awk by adding the following as the first item in the code:

 BEGIN { RS="\r\n" }
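A multi-character RS is handled by gawk, but POSIX leaves its behaviour unspecified, so other awk implementations may treat it differently. As a hedged alternative, the carriage return can be removed from each record instead. The helper name strip_cr below is hypothetical.

```shell
# Sketch: drop a trailing carriage return from every line,
# so CRLF input compares equal to LF input.
strip_cr() {
  awk '{ sub(/\r$/, "") } 1' "$@"
}
```

The same sub(/\r$/, "") call could also be placed as the first rule of the comparison script, which keeps everything in a single awk invocation.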

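Edit 2: The files have irregular delimiters. The problem is that a<tab>b and a<tab>b<tab> are treated as different lines, although they can be considered the same.

In your special case of two fields of interest per file, I suggest comparing the contents of those two fields rather than the whole line. The script below also takes the MS-DOS format into account.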

BEGIN { RS = "\r\n" }
# build the comparison key from the first two fields only
{ key = $1 "\t" $2 }
# skip the key if it was already seen in the current file
seen[key, FILENAME]++ { next }
# append the file name to the key's file list and bump its file counter
{ seenin[key] = seenin[key] " " FILENAME; nseen[key]++ }

# print every key that was seen in more than one file
END {
  for (line in nseen)
    if (nseen[line] > 1)
      printf "line \"%s\" seen in %d files: %s\n", line, nseen[line], seenin[line]
}

In the end, your six files do overlap: when only the two tab-delimited fields are compared, one line of output is printed.
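To see why keying on the two fields helps, here is a small sketch showing that a<tab>b and a<tab>b<tab> collapse to the same key once the carriage return is removed and only $1 and $2 are kept. The function name two_field_keys is hypothetical, and stripping \r with sub() is used here in place of RS="\r\n" so that the sketch does not depend on a particular awk.

```shell
# Sketch: reduce each record to its first two fields, joined by a tab,
# and list the distinct keys.
two_field_keys() {
  awk '{ sub(/\r$/, ""); print $1 "\t" $2 }' "$@" | sort -u
}
```

Feeding it the two variants, for example with printf 'a\tb\r\na\tb\t\r\n' | two_field_keys, yields a single a<tab>b row.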

Question 2

I would like to suggest a different approach: sort all the files together, then use uniq -c to count how many times each line appears.

sort 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt | uniq -c

This prints each line once, together with the number of times it was seen. For example, suppose you have these three files:

$ cat file1 
dog
cat
bird

$ cat file2
fly
bird
moose

$ cat file3
bird
dog
flea

This produces the following output:

$ sort file1 file2 file3 | uniq -c
      3 bird
      1 cat
      2 dog
      1 flea
      1 fly
      1 moose

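So, to separate the lines by the number of times they were found, filter on the count. For example, you can show only the lines that appear in all three (or, in your case, six) files like this. Note that uniq -c counts occurrences, not files, so this assumes that no line repeats within a single file.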

$ sort file1 file2 file3 | uniq -c | awk '$1==3'
      3 bird
$ sort file1 file2 file3 | uniq -c | awk '$1==2'
      2 dog
$ sort file1 file2 file3 | uniq -c | awk '$1==1'
      1 cat
      1 flea
      1 fly
      1 moose
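When a line can occur more than once inside one file, the counts above no longer equal the number of files. A possible workaround, shown here only as a sketch, is to deduplicate each file with sort -u before counting. The function name in_all_files is hypothetical.

```shell
# Sketch: print the lines that occur in every file given as an argument,
# counting each line at most once per file.
in_all_files() {
  n=$#
  for f in "$@"; do
    sort -u "$f"            # one copy of each line per file
  done | sort | uniq -c |
    awk -v n="$n" '$1 == n { sub(/^ *[0-9]+ /, ""); print }'
}
```

With the three sample files, in_all_files file1 file2 file3 prints bird, and it would still do so if dog were listed twice in file1.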

Question 3

Your first attempt is the right approach:

comm -12 2.txt 3.txt | comm -12 - 4.txt | comm -12 - 5.txt | comm -12 - 6.txt | comm -12 - 7.txt

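This works like a stream, with the individual jobs running in parallel. In principle, files with millions of lines can be processed this way.

The problems you ran into with comm(1) appear to be caused by the input, namely whitespace and line endings. If you clean those up first, you will find that your original method is fast and convenient.

Here is an example that demonstrates it. It finds the numbers that are divisible by a few small primes: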

$ for D in 2 3 5 7 11 13 
> do seq 1 1000 | 
> awk -v D=$D '$0 % D == 0 { print $0 }' | 
> sort > $D
> done

$ comm -12 2 3 | comm -12 - 5 | comm -12 - 7 
210
420
630
840

It turns out that no number from 1 to 1000 is divisible by all of 2, 3, 5, 7, and 11 (their least common multiple is 2310).
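The clean-up step mentioned earlier could look roughly like the sketch below. It assumes that trailing whitespace (including the carriage return of MS-DOS line endings) is the only irregularity, and it uses temporary files rather than a pipeline so that each input can be normalized and sorted first. The function name common_clean is hypothetical.

```shell
# Sketch: intersect any number of files with comm -12 after normalizing them.
# Each file is stripped of trailing whitespace and sorted in the C locale,
# because comm expects both inputs in the same sort order.
common_clean() {
  dir=$(mktemp -d) || return 1
  first=1
  for f in "$@"; do
    sed 's/[[:space:]]*$//' "$f" | LC_ALL=C sort -u > "$dir/cur"
    if [ "$first" = 1 ]; then
      mv -f "$dir/cur" "$dir/acc"       # first file seeds the running result
      first=0
    else
      LC_ALL=C comm -12 "$dir/acc" "$dir/cur" > "$dir/new"
      mv -f "$dir/new" "$dir/acc"       # keep only lines still common
    fi
  done
  cat "$dir/acc"
  rm -rf "$dir"
}
```

For the files in the question this would be called as common_clean 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt.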
