Given a file path, how do I find all copies of that file?

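Answer 1

A simple approach:

Compute the md5sum of the target file and store it in a variable.
Get the file size and store it in a variable.
Use find to run md5sum on every file of the same size.
grep the find output for the target MD5 hash.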

target_hash=$(md5sum needle.file | awk '{ print $1 }')
target_size=$(du -b needle.file | awk '{ print $1 }')
find haystack/ -type f -size "$target_size"c -exec md5sum {} \; | grep "^$target_hash"

Answer 2

Czkawka (Flathub and GitHub).

It is an excellent GUI tool with advanced features such as checksum comparison.

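Answer 3

If the number of files is not large (say, fewer than 1000), a bash script may be good enough. Otherwise, running binaries (md5sum, stat) in a loop adds considerable overhead.

With 1000 files of 1G each, the cost of launching the binaries is relatively small and file size is what matters. With 1,000,000 files of 1K each, it is a different story.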
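To make the overhead point concrete, here is a minimal sketch that collects the size of every file with a single find process instead of one stat call per file. The function name list_sizes is made up for illustration, and the sketch assumes GNU find, since -printf is not available in every find implementation.

```shell
#!/bin/bash
# Hypothetical helper (not part of the scripts in this answer):
# print "SIZE<TAB>PATH" for every regular file under DIR.
# One find process does all the work, so no stat binary is started per file.
# Assumes GNU find, which provides -printf.
list_sizes() {
    local dir=$1
    find "$dir" -type f -printf '%s\t%p\n'
}

# Example use: keep only the paths that are exactly 102400 bytes long.
#   list_sizes haystack/ | awk -F'\t' '$1 == 102400 { print $2 }'
```

With a million small files this avoids a million process launches for the size check alone; the content check still has to read each same-size candidate.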

Variant 1

Using md5sum.

find_dups_by_md5.sh

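#!/bin/bash
# Usage: ./find_dups_by_md5.sh NEEDLE
# Prints every file under the current directory with the same content as NEEDLE.

# Print the size of a file in bytes (GNU stat).
get_size() {
    stat -c"%s" "$1"
}

# Print only the MD5 hash of a file.
get_hash() {
    md5sum "$1" | cut -d' ' -f1
}

needle=$1
needle_size=$(get_size "$needle")
needle_hash=$(get_hash "$needle")

shopt -s globstar    # let ** match recursively
GLOBIGNORE=$needle   # keep the needle itself out of the glob results

for f in **; do
    # ** also matches directories; only regular files can be copies
    [[ -f "$f" ]] || continue

    # Cheap check first: a different size means different content
    cur_file_size=$(get_size "$f")
    if [[ "$needle_size" != "$cur_file_size" ]]; then
        continue
    fi

    cur_file_hash=$(get_hash "$f")
    if [[ "$needle_hash" != "$cur_file_hash" ]]; then
        continue
    fi

    printf 'duplicate:\t%s\n' "$f"
done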

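Variant 2

Using cmp.

A plain byte-by-byte comparison is much better: less code, the same result, and slightly faster. The hash is used only once here, so computing it is wasted work. md5sum is run on every file (including the needle), and by definition it processes the whole file. So with 100 files of 1GB each, md5sum processes all 100G even if the files already differ within the first kilobyte.

Therefore, when each file is compared against the target only once, byte-by-byte comparison takes the same time in the worst case (all files have identical content) and less time when the contents differ (assuming md5 processes data at about the same speed as a plain comparison).

find_dups_by_cmp.sh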

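#!/bin/bash
# Usage: ./find_dups_by_cmp.sh NEEDLE
# Prints every file under the current directory with the same content as NEEDLE.

# Print the size of a file in bytes (GNU stat).
get_size() {
    stat -c"%s" "$1"
}

needle=$1
needle_size=$(get_size "$needle")

shopt -s globstar    # let ** match recursively
GLOBIGNORE=$needle   # keep the needle itself out of the glob results

for f in **; do
    # ** also matches directories; only regular files can be copies
    [[ -f "$f" ]] || continue

    # Cheap check first: a different size means different content
    cur_file_size=$(get_size "$f")
    if [[ "$needle_size" != "$cur_file_size" ]]; then
        continue
    fi

    # cmp -s prints nothing and stops at the first differing byte
    if ! cmp -s "$needle" "$f"; then
        continue
    fi

    printf 'duplicate:\t%s\n' "$f"
done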
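For larger trees, the same size-then-cmp idea can probably be expressed with find alone, which drops the bash loop and the per-file stat call. The following is only a sketch: the function name find_copies_by_cmp is invented here, and it assumes GNU stat plus a find that supports -samefile (GNU and BSD find do).

```shell
#!/bin/bash
# Hypothetical helper (not one of the scripts above):
# print every regular file under DIR that is byte-for-byte identical to NEEDLE.
find_copies_by_cmp() {
    local needle=$1 dir=$2 size
    size=$(stat -c '%s' -- "$needle") || return 1   # size in bytes (GNU stat)

    # -size Nc            only files of exactly N bytes are compared at all
    # ! -samefile NEEDLE  skip the needle itself (and its hard links)
    # -exec cmp -s ... \; used as a test: -print runs only when cmp exits 0
    find "$dir" -type f -size "${size}c" ! -samefile "$needle" \
        -exec cmp -s -- "$needle" {} \; -print
}

# Example use:
#   find_copies_by_cmp needle_file.txt .
```

cmp is still started once per same-size candidate, so this helps most when the size filter eliminates the bulk of the files.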

Test

Generating the test files

###Generate test files
echo_random_bytes () {
    openssl rand -base64 100000;
}

shopt -s globstar

mkdir -p {a..d}/{e..g}/{m..o}

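#Fill the directories with files of random content
touch {a..d}/{e..g}/{m..o}/file_{1..5}.txt
#Only walk a/ b/ c/ d/ so that other files in the current directory
#(such as the scripts themselves) are not overwritten
for f in {a..d}/**; do
    [ -f "$f" ] && echo_random_bytes > "$f"
done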

#Creation of duplicates
same_string=$(echo_random_bytes)

touch {a..d}/{e..g}/m/dup_file.txt
for f in {a..d}/{e..g}/m/dup_file.txt; do
    echo "$same_string" > "$f"
done

#Target file creation
echo "$same_string" > needle_file.txt

Finding the duplicates

$ ./find_dups_by_md5.sh needle_file.txt
duplicate:  a/e/m/dup_file.txt
duplicate:  a/f/m/dup_file.txt
duplicate:  a/g/m/dup_file.txt
duplicate:  b/e/m/dup_file.txt
duplicate:  b/f/m/dup_file.txt
duplicate:  b/g/m/dup_file.txt
duplicate:  c/e/m/dup_file.txt
duplicate:  c/f/m/dup_file.txt
duplicate:  c/g/m/dup_file.txt
duplicate:  d/e/m/dup_file.txt
duplicate:  d/f/m/dup_file.txt
duplicate:  d/g/m/dup_file.txt

Performance comparison

$ time ./find_dups_by_md5.sh needle_file.txt > /dev/null

real    0m0,761s
user    0m0,809s
sys     0m0,169s

$ time ./find_dups_by_cmp.sh needle_file.txt > /dev/null

real    0m0,645s
user    0m0,526s
sys     0m0,162s

Answer 4

Building on Panki's answer, this version reduces the number of md5sum invocations, which improves performance when there are thousands of files to check.

target_hash="$(md5sum needle.file | awk '{ print $1 }')"
target_size="$(du -b needle.file | awk '{ print $1 }')"
find haystack/ -type f -size "$target_size"c -print0 | xargs -0 md5sum | grep "^$target_hash"

Note: as with the original command, file names that contain newlines may cause display problems.
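One possible way around the newline problem is to keep the whole pipeline NUL-delimited. The sketch below is an assumption-laden variant, not part of the original answer: the function name find_copies_md5_z is invented, and it relies on GNU tools (md5sum -z and cut -z from coreutils 8.25 or later, grep -z, xargs -r, stat -c).

```shell
#!/bin/bash
# Hypothetical helper: print the NUL-terminated paths of all files under DIR
# whose MD5 hash matches that of NEEDLE. Assumes GNU coreutils/findutils/grep.
find_copies_md5_z() {
    local needle=$1 dir=$2 hash size
    # Hash from stdin so md5sum never has to escape an awkward file name
    hash=$(md5sum < "$needle" | awk '{ print $1 }')
    size=$(stat -c '%s' -- "$needle") || return 1

    find "$dir" -type f -size "${size}c" -print0 |
        xargs -0 -r md5sum -z -- |   # -z: NUL-terminated records, no name escaping
        grep -z "^$hash" |           # keep records that start with the target hash
        cut -z -c35-                 # drop the 32-char hash and 2 separator chars
}

# Example use: show one path per line (newlines inside names stay visible)
#   find_copies_md5_z needle.file haystack/ | tr '\0' '\n'
```

The output stays NUL-terminated so it can be fed to xargs -0 or read with a `while IFS= read -r -d ''` loop without misparsing odd file names.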
