How can I collect statistics on byte occurrences in a binary file?

Answer 1

Using GNU od:

od -vtu1 -An -w1 my.file | sort -n | uniq -c

Or, more efficiently, with perl (this also prints a count of 0 for byte values that do not occur):

perl -ne 'BEGIN{$/ = \4096};
          $c[$_]++ for unpack("C*");
          END{for ($i=0;$i<256;$i++) {
              printf "%3d: %d\n", $i, $c[$i]}}' my.file

Answer 2

For large files, using sort will be slow. I wrote a short C program to solve the equivalent problem (see this gist for a Makefile with tests):

#include <stdio.h>

#define BUFFERLEN 4096

int main(void) {
    // This program reads standard input, counts how often each byte value
    // occurs, and prints the relative frequency of each byte value on exit.
    //
    // Example:
    //
    //     $ echo "Hello world" | ./a.out
    //
    // Copyright (c) 2015 Björn Dahlgren
    // Open source: MIT License

    unsigned long long tot = 0;      // at least 64 bits, i.e. up to 16 exabytes
    unsigned long long n[256] = {0}; // one byte == 8 bits => 256 possible values

    unsigned char buffer[BUFFERLEN];
    size_t i, nread;

    do {
        nread = fread(buffer, 1, BUFFERLEN, stdin);
        for (i = 0; i < nread; ++i)
            ++n[buffer[i]];
        tot += nread;
    } while (nread == BUFFERLEN);

    // A short read means either end of file or a read error.
    if (ferror(stdin)) {
        perror("fread");
        return 1;
    }

    for (i = 0; i < 256; ++i) {
        // Print 0 for every value on empty input instead of dividing by zero.
        printf("%zu %f\n", i, tot ? (double)n[i] / (double)tot : 0.0);
    }
    return 0;
}

Usage:

gcc main.c
cat my.file | ./a.out

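Answer 3

Since the mean, sigma (standard deviation), and CV (coefficient of variation) are often important when judging the statistics of a binary file's contents, I wrote a command-line program that graphs all of this data as an ASCII circle of byte deviations from sigma.
http://wp.me/p2FmmK-96
It can be combined with grep, xargs, and other tools to extract statistics.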
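
The answer does not say exactly which quantities the linked program computes, so the following is only a minimal sketch under one assumption: that the mean, sigma, and CV are taken over the 256 per-byte-value occurrence counts (one bin per byte value, zeros included). The function names `histogram`, `count_stats`, and `read_chunks` are made up for this illustration and are not taken from that program.

```python
import math
import sys


def histogram(chunks):
    """Count occurrences of each byte value over an iterable of bytes objects."""
    counts = [0] * 256
    for chunk in chunks:
        for value in chunk:  # iterating over bytes yields ints 0..255
            counts[value] += 1
    return counts


def count_stats(counts):
    """Return (mean, sigma, cv) of the per-byte-value occurrence counts.

    Assumption: the statistics are computed over the bins of the histogram,
    not over the byte values themselves.
    """
    mean = sum(counts) / len(counts)
    # Population standard deviation: every bin is included, even empty ones.
    sigma = math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))
    # CV is sigma relative to the mean; report 0.0 for empty input.
    cv = sigma / mean if mean else 0.0
    return mean, sigma, cv


def read_chunks(path, size=65536):
    """Yield the contents of a file in fixed-size binary chunks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(size)
            if not chunk:
                return
            yield chunk


if __name__ == "__main__" and len(sys.argv) > 1:
    mean, sigma, cv = count_stats(histogram(read_chunks(sys.argv[1])))
    print(f"mean={mean:.3f} sigma={sigma:.3f} cv={cv:.3f}")
```

Saved under a name of your choice (say `bytestats.py`), it would be run as `python3 bytestats.py my.file`. Under this assumption a CV near 0 means the byte values occur about equally often, as in compressed or encrypted data, while a large CV means a few values dominate.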

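Answer 4

This is similar to Stephane's od answer, but it shows each byte's hex value alongside its ASCII character. It is also sorted by frequency (number of occurrences).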

xxd -c1 my.file|cut -c10-|sort|uniq -c|sort -nr

I don't think it is very efficient, since it starts many processes, but it works well for a single file, especially a small one.
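
If the number of processes is a concern, roughly the same frequency-sorted listing can be produced in a single process. This is only a sketch, not a drop-in replacement for the pipeline: `frequency_table` is a name made up for this illustration, the output columns are my own choice, and the whole file is read into memory, so it suits small to medium files.

```python
import sys
from collections import Counter


def frequency_table(data):
    """Return (count, byte value, shown character) rows, most frequent first."""
    rows = []
    # most_common() orders by descending count, like `sort -nr` after `uniq -c`.
    for value, count in Counter(data).most_common():
        # Show printable ASCII as-is and every other byte as '.', as xxd does.
        shown = chr(value) if 0x20 <= value <= 0x7E else "."
        rows.append((count, value, shown))
    return rows


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        for count, value, shown in frequency_table(f.read()):
            print(f"{count:7d} {value:02x}  {shown}")
```

Saved under a name of your choice (say `bytefreq.py`), it would be run as `python3 bytefreq.py my.file`.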
