How do I compare all the files in a directory with bash?

Question 1

Yes, you need two loops. But since you are discarding diff's output by redirecting it to /dev/null, you don't need diff at all; you can use cmp instead.

For example:

#!/bin/bash

# Read the list of files into an array called $files - we're iterating over
# the same list of files twice, but there's no need to run find twice.
#
# This uses NUL as the separator between filenames, so will work with
# filenames containing ANY valid character. Requires GNU find and GNU
# sort or non-GNU versions that support the `-print0` and `-z` options.
#
# The `-maxdepth 1` predicate prevents recursion into subdirs, remove it
# if that's not what you want.
mapfile -d '' -t files < <(find ./ -maxdepth 1 -type f -print0 | sort -z -u)

for a in "${files[@]}" ; do
  for b in "${files[@]}" ; do
    if [ ! "$a" = "$b" ] ; then
      if cmp --quiet "$a" "$b" ; then
        echo "Yes. $a is equal to $b"
      else
        echo "No. $a is not equal to $b"
      fi
    fi
  done
done

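However, this produces a lot of output: n × (n-1) lines, where n is the number of files. Personally, I would delete or comment out the else and echo "No...." lines, because files that are identical to another file are likely to be far less common than files that are unique.

Also, if file abc and file xyz are identical, this compares them twice and prints Yes twice: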

Yes. ./abc is equal to ./xyz
Yes. ./xyz is equal to ./abc

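There are several ways to prevent this. The simplest is to use an associative array to keep track of which files have already been compared with each other. For example: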

#!/bin/bash

# Read the list of files into an array called $files
mapfile -d '' -t files < <(find ./ -maxdepth 1 -type f -print0 | sort -z -u)

# declare that $seen is an associative array
declare -A seen

for a in "${files[@]}" ; do
  for b in "${files[@]}" ; do
    if [ ! "$a" = "$b" ] && [ -z "${seen[$a$b]}" ] && [ -z "${seen[$b$a]}" ] ; then
      seen[$a$b]=1
      seen[$b$a]=1
      if cmp --quiet "$a" "$b" ; then
        echo "Yes. $a is equal to $b"
      #else
      #  echo "No. $a is not equal to $b"
      fi
    fi
  done
done
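
Another of the "several ways" is to loop over array indexes and start the inner loop one position after the outer one, so each unordered pair is visited exactly once and no bookkeeping array is needed. The following is only a sketch under the same assumptions as the scripts above (bash 4.4+ for `mapfile -d`, GNU find, sort and cmp); the function name `compare_once` is hypothetical and not part of the original answer.

```shell
#!/bin/bash

# compare_once is a hypothetical helper (not from the original answer).
# It prints each pair of identical files in directory $1 exactly once.
compare_once() {
  local dir=${1:-.} i j
  local -a files
  mapfile -d '' -t files < <(find "$dir" -maxdepth 1 -type f -print0 | sort -z -u)

  for (( i = 0; i < ${#files[@]}; i++ )); do
    # Starting at i+1 means a file is never compared with itself and the
    # pair (a, b) is never revisited as (b, a).
    for (( j = i + 1; j < ${#files[@]}; j++ )); do
      if cmp --quiet "${files[i]}" "${files[j]}"; then
        echo "Yes. ${files[i]} is equal to ${files[j]}"
      fi
    done
  done
}

# Default to the current directory when no argument is given.
compare_once "${1:-.}"
```

This halves the number of cmp calls compared with the first script, from n × (n-1) to n × (n-1) / 2.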

Question 2

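Comparing many files pairwise quickly becomes tedious. 10 files need 45 comparisons. 100 files need almost 5,000. My test set is 595 files (10 GB in total), which needs over 175,000 pairwise comparisons. (The set is nine old archive directories containing metadata taken from full and partial backups of various partitions.)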
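
The counts above follow from the number of unordered pairs among n files, n × (n-1) / 2. A minimal sketch of that arithmetic, where `pair_count` is a hypothetical helper name used only for illustration:

```shell
#!/bin/bash

# pair_count is a hypothetical helper: the number of unordered pairs
# that can be formed from n files, i.e. n * (n - 1) / 2.
pair_count() {
  local n=$1
  echo $(( n * (n - 1) / 2 ))
}

for n in 10 100 595; do
  printf '%d files need %d comparisons\n' "$n" "$(pair_count "$n")"
done
```

This prints 45, 4950 and 176715 comparisons respectively.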

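My approach is to compute a checksum for each file (a little over 2 minutes in total on my notebook), and then use awk to group the files by checksum (under 1 second).

The checksum step is the following bash fragment: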

#.. Checksum all the files, and show some statistics.
#.. `[ x ]` is always true; changing it to `[ ]` skips this block.
[ x ] && {
    time ( cd "${BackUpDir}" && cksum */* ) > "${CkSums}"
    du -s -h "${BackUpDir}"
    head -n 3 "${CkSums}"
    awk '{ Bytes += $2; }
        END { printf ("Files %d, %.2f MB\n", FNR, Bytes / (1024 * 1024)); }
        ' "${CkSums}"
}

Here is the log:

$ ./fileGroup
real    2m5.139s
user    1m3.141s
sys     0m24.685s
9.8G    /media/paul/WinData/tarMETADATA
2288228966 156844 20220107_002000/02_History.tld
1812380507 156992 20220107_002000/02_History.toc
3028427874 1000411 20220107_002000/06_TechHist.tld
Files 565, 10001.10 MB

real    0m0.024s  #.. (Runtime of the awk component)
user    0m0.018s
sys     0m0.008s

An excerpt of the results:

Group of 5 files for cksum 1459775330
    20220319_114500/lib64.tld
    20220401_182500/lib64.tld
    20220407_192000/lib64.tld
    20220503_190500/lib64.tld
    20220503_232500/lib64.tld

Group of 3 files for cksum 2937156162
    20220407_192000/sbin.tld
    20220503_190500/sbin.tld
    20220503_232500/sbin.tld

Group of 2 files for cksum 3291901599
    20220503_190500/30_Photos.tld
    20220503_232500/30_Photos.tld

Counted 304 non-grouped files.

The bash script is about 60 lines long, 30 of which are the embedded awk script (as far as I know, it needs no GNU-specific syntax).

#! /bin/bash --

#.. Determine groups of identical files.

BackUpDir="/media/paul/WinData/tarMETADATA"
CkSums="Cksum.txt"
Groups="Groups.txt"

#.. Group all the files by checksum, and report them.

fileGroup () {

    local Awk='
BEGIN { Db = 0; reCut2 = "^[ ]*[^ ]+[ ]+[^ ]+[ ]+"; }
{ if (Db) printf ("\n%s\n", $0); }

#.. Add a new cksum value.
! (($1,0) in Fname) {
    Cksum[++Cksum[0]] = $1;
    if (Db) printf ("Added Cksum %d value %s.\n", 
        Cksum[0], Cksum[Cksum[0]]);
    Fname[$1,0] = 0;
}
#.. Add a filename.
{
    Fname[$1,++Fname[$1,0]] = $0;
    sub (reCut2, "", Fname[$1,Fname[$1,0]]);
    if (Db) printf ("Fname [%s,%s] is \047%s\047\n",
        $1, Fname[$1,0], Fname[$1, Fname[$1,0]]);
}
#.. Report the identical files, grouped by checksum.
function Report (Local, k, ke, cs, j, je, Single) {
    ke = Cksum[0];
    for (k = 1; k <= ke; ++k) {
        cs = Cksum[k];
        je = Fname[cs,0];
        if (je < 2) { ++Single; continue; }
        printf ("\nGroup of %d files for cksum %s\n", je, cs);
        for (j = 1; j <= je; ++j) printf ("    %s\n", Fname[cs,j]);
    }
    printf ("\nCounted %d non-grouped files.\n", Single);
}
END { Report( ); }
'
    awk -f <( printf '%s' "${Awk}" )
}

#### Script Body Starts Here.

    #.. Checksum all the files, and show some statistics.
    [ x ] && {
        time ( cd "${BackUpDir}" && cksum */* ) > "${CkSums}"
        du -s -h "${BackUpDir}"
        head -n 3 "${CkSums}"
        awk '{ Bytes += $2; }
            END { printf ("Files %d, %.2f MB\n", FNR, Bytes / (1024 * 1024)); }
            ' "${CkSums}"
    }

    #.. Analyse the cksum data.
    time fileGroup < "${CkSums}" > "${Groups}"
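
Because cksum produces a 32-bit CRC, two different files can in principle share a checksum, so the reported groups are best treated as candidates. The sketch below is one way to confirm them with cmp; it is not part of the original script. `confirm_groups` is a hypothetical helper, and it assumes the checksum file has the "checksum size name" layout shown in the log above and that no filename contains a newline. It only compares each file with the first file of its group, which is enough to flag a collision but not to re-partition the group.

```shell
#!/bin/bash

# confirm_groups is a hypothetical helper, not part of the original script.
#   $1: the directory the cksum command was run in
#   $2: a file holding cksum output ("checksum size name" per line)
# It prints a line for every file whose checksum and size match the first
# file of its group but whose bytes differ.
confirm_groups() {
  local dir=$1 sums=$2 crc size name prev_key='' first=''
  while read -r crc size name; do
    if [[ "$crc $size" != "$prev_key" ]]; then
      # A new checksum/size group starts; remember its first file.
      prev_key="$crc $size"
      first=$name
    elif ! cmp --quiet "$dir/$first" "$dir/$name"; then
      echo "Checksum collision: $first and $name differ"
    fi
  # Sorting by checksum, then size, makes each group's lines adjacent.
  done < <(sort -k1,1n -k2,2n "$sums")
}
```

With the variables from the script above, it could be called as `confirm_groups "${BackUpDir}" "${CkSums}"`; no output means every group survived the byte-for-byte check.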

Answer