Find the most common letters/letter combinations in a file

Answer 1

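The task calls for listing every repeated string, from monograms upward...

...so I had the script go through every possible length, from a single character up to the full line length (that is, the word length, since the sample data provides one word per line).

File ssf.mawk: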

#!/usr/bin/mawk -f
BEGIN {
        FS=""
}
{
        _=tolower($0)
        for(i=1;i<=NF;i++)
                for(j=i;j<=NF;j++)
                        print substr(_,i,j-i+1) | "sort|uniq -c|sort -n"
}

Sample run with the example input (output abridged):

$ printf '%s\n' Stack Exchange Internet Web Question Find Frequent Words Combination Letters .... | ./ssf.mawk
      1 ....
      1 ac
      1 ack
      1 an
      1 ang

(((many lines omitted here)))

I tested this on Debian 8 with mawk-1.3.3 and gawk-4.1.1.
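The script above relies on FS="" splitting a line into single characters, which mawk and gawk do but which POSIX leaves unspecified. As a minimal sketch of the same enumeration that avoids that dependency, the variant below uses only length() and substr() and counts in an awk array before sorting once at the end. The function name count_substrings is hypothetical and not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): count every substring
# of every input line, ignoring case, without relying on FS="".
count_substrings() {
    awk '
    {
        s = tolower($0)
        n = length(s)
        # i is the start position, len the substring length
        for (i = 1; i <= n; i++)
            for (len = 1; len <= n - i + 1; len++)
                count[substr(s, i, len)]++
    }
    END {
        for (k in count)
            print count[k], k
    }' "$@" | LC_ALL=C sort -k1,1n -k2
}

# Example: words on standard input, one per line; show the most frequent entries.
printf '%s\n' Stack Exchange Internet Web | count_substrings | tail -n 3
```

Counting inside awk keeps the sort input to one line per distinct substring, at the cost of holding all distinct substrings in memory.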

Answer 2

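Assuming a minimum length of two (for a different smallest N, change {2,$l} to {N,$l}), ignoring case, and counting any combination within a line, you can do something like the following: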

% < examplelist 
Stack
Exchange
Internet
Web
Question
Find
Frequent
Words
Combination
Letters
% < examplelist perl -nlE '$_=lc; $l=length; next if $l < 2; m/(.{2,$l})(?{ $freq{$1}++ })^/; END { say "$freq{$_} $_" for keys %freq }' | sort -rg | head -4
3 in
2 ue
2 tion
2 tio
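To spot-check a single line of that output, one option is to count one fixed substring directly. The sketch below assumes a grep that supports the -o option (GNU and BSD grep do); the function name count_one is hypothetical. Because grep -o reports non-overlapping matches, it can undercount substrings that overlap themselves, such as "aa" in "aaa", so treat it as a sanity check only.

```shell
#!/bin/sh
# Hypothetical spot check (not from the original answer): count how often
# one fixed, lowercase substring occurs in standard input, ignoring case.
# Assumes grep supports -o; matches are non-overlapping.
count_one() {
    # $1 = substring to look for
    tr '[:upper:]' '[:lower:]' | grep -o -F -- "$1" | wc -l | tr -d ' '
}

# Example with the sample word list from above.
printf '%s\n' Stack Exchange Internet Web Question Find Frequent Words \
    Combination Letters | count_one in
```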

Answer 3

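Here is a Perl script that sorts its output by number of occurrences. The minimum string length is configurable, and a debug option is included so you can see what it is doing.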

#!/usr/bin/perl
# Usage: perl script_file input_file

use strict;
my $min_str_len = 2;
my $debug = 0;

my %uniq_substrings;

while(<>)
{
    chomp;
    my $s = lc $_; # assign to $s for clarity

    printf STDERR qq|#- String: [%s]\n|, $s if $debug;
    my $line_len = length($s);
    for my $len ($min_str_len .. $line_len)
    {
        printf STDERR qq|# Length: %u\n|, $len if $debug;
        # break string into characters
        my @c  = split(//,$s);
        # iterate over list while large enough to provide strings of $len characters
        while(@c>=$len)
        {
            my $substring = join('', @c[0..$len-1]);
            my $curr_count = ++$uniq_substrings{$substring};
            printf STDERR qq|%s (%s)\n|, $substring, $curr_count if $debug;
            shift @c;
        }
    }
}

sub mysort
{
    # sort by count, subsort by alphabetic
    my $retval =
        ($uniq_substrings{$b} <=> $uniq_substrings{$a})
        || ($a cmp $b);
    return $retval;
}

for my $str (sort(mysort keys %uniq_substrings))
{
    printf qq|%s = %u\n|, $str, $uniq_substrings{$str};
}
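The inner loop of the script slides a window of $len characters along the line. As a rough illustration of just that step, the sketch below prints every window of one fixed length for a single word. The function name windows is hypothetical, and it assumes the word contains no backslashes, since awk -v interprets escape sequences.

```shell
#!/bin/sh
# Hypothetical illustration (not from the original answer): print every
# substring of length $2 taken from the lowercased word $1.
# Assumes $1 contains no backslashes (awk -v processes escapes).
windows() {
    awk -v s="$1" -v n="$2" 'BEGIN {
        s = tolower(s)
        # i is the 1-based start of the window; stop when it no longer fits
        for (i = 1; i + n - 1 <= length(s); i++)
            print substr(s, i, n)
    }'
}

# Example: all two-letter windows of one word.
windows Web 2
```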

Answer 4

Script:

MIN=2
MAX=5
while read A; do
    [ ${MAX} -lt ${#A} ] && max=${MAX} || max=${#A}
    for LEN in $(seq ${MIN} ${max}); do
        for k in $(seq 0 $((${#A}-${LEN}))); do
            echo "${A:$k:${LEN}}"
        done
    done
done <<< "$(cat file1|tr 'A-Z' 'a-z')" |sort|uniq -c|sort -k1,7rn -k9

The same script with some explanation:

# minimum length of a letter combination
MIN=2
# maximum length of a letter combination
MAX=5
# read the input line by line
while read A; do
    # longest combination to try for this line: MAX, or the line length
    # when the line is shorter than MAX
    [ ${MAX} -lt ${#A} ] && max=${MAX} || max=${#A}
    # loop over every combination length from MIN to max
    for LEN in $(seq ${MIN} ${max}); do
        # loop over every start offset that fits a combination of length LEN
        for k in $(seq 0 $((${#A}-${LEN}))); do
            # print the letter combination
            echo "${A:$k:${LEN}}"
        done
    done
done <<< "$(cat file1|tr 'A-Z' 'a-z')" |sort|uniq -c|sort -k1,7rn -k9
# the data is read from the file "file1" and converted to lowercase;
# the combinations are sorted and identical lines are counted;
# the result is then sorted by count, highest first,
# and lines with the same count are put in alphabetical order
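The final sort stage orders lines by count and then alphabetically. As a sketch of a more explicit way to spell those two keys, assuming input in the "count text" layout that uniq -c prints, the helper below restricts the numeric key to the first field and uses the rest of the line as the tie-breaker. The name rank_counts is hypothetical, and LC_ALL=C is an added assumption to make the tie-break order byte-wise and reproducible.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): order "count text"
# lines, as printed by uniq -c, by count (highest first), then by text.
rank_counts() {
    # -k1,1nr : first field only, numeric, reversed
    # -k2     : from the second field to the end of the line, as the tie-breaker
    LC_ALL=C sort -k1,1nr -k2
}

# Example with a few lines in uniq -c format.
printf '%s\n' '      2 er' '      3 in' '      1 ac' '      2 et' | rank_counts
```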

With parameters MIN=2 and MAX=5, the first 30 lines of output (the total output is 152 lines):

      3 in
      2 er
      2 et
      2 io
      2 ion
      2 nt
      2 on
      2 qu
      2 que
      2 st
      2 te
      2 ter
      2 ti
      2 tio
      2 tion
      2 ue
      1 ac
      1 ack
      1 an
      1 ang
      1 ange
      1 at
      1 ati
      1 atio
      1 ation
      1 bi
      1 bin
      1 bina
      1 binat
      1 ch
      ...

With parameters MIN=1 and MAX=3, the first 20 lines of output (the total output is 109 lines):
