How to find the N most common words in a file and handle hyphens?

Answer 1

This should do the trick:

sed ':1;/-$/{N;b1};s/-\n//g;y/ /\n/' file | sort | uniq -c
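The pipeline above only counts the words; to get the N most common ones, the counts still have to be ranked and truncated. Here is a minimal sketch of that extra step. The function name `top_words`, its arguments, and the sample input are my own assumptions (the original input file is not shown here), and I have used awk and tr for the hyphen joining so that it does not depend on GNU sed extensions.

```shell
# top_words FILE N: print the N most frequent words in FILE as "word:count".
# Hypothetical helper, not part of the original answer.
top_words() {
    # 1. A line ending in "-" is printed without the hyphen and without a
    #    newline, so the next line is glued onto it.
    # 2. Spaces and tabs become newlines (one word per line).
    # 3. Empty lines are dropped, then words are counted.
    # 4. Sort by count (descending), then by word, and keep the first N.
    awk '{ if (sub(/-$/, "")) printf "%s", $0; else print }' "$1" |
        tr -s ' \t' '\n\n' |
        sed '/^$/d' |
        LC_ALL=C sort | uniq -c |
        LC_ALL=C sort -k1,1nr -k2,2 |
        head -n "$2" |
        awk '{ print $2 ":" $1 }'
}

# Usage with an assumed sample file:
sample=$(mktemp)
printf 'hello world hel-\nlo test hello\nworld words\n' > "$sample"
top_words "$sample" 2
rm -f "$sample"
```

Note that this treats any line-final hyphen as a word break, so a line that legitimately ends in a dash would be joined to the next one as well.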

Answer 2

Perl is handy for this. The -0777 switch makes Perl read the whole file as a single string.

perl -0777 -ne '
   s/-\n//g;                  # join the hyphenated words
   $count{$_}++ for split;    # count all the words
   while (($k,$v) = each %count) {print "$k:$v\n"}
' file

world:2
helo:1
hello:2
words:2
test:2

The output is in no particular order.
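Since the counts come out unordered, they can be ranked afterwards with standard tools. A small sketch, assuming the `word:count` format printed above and that no word contains a colon; `rank_counts` is a hypothetical helper name, not something from the answer:

```shell
# rank_counts N: read "word:count" lines on stdin and print the N lines
# with the highest counts (ties are broken alphabetically by word).
rank_counts() {
    LC_ALL=C sort -t: -k2,2nr -k1,1 | head -n "$1"
}

# Usage with the counts shown above:
printf '%s\n' world:2 helo:1 hello:2 words:2 test:2 | rank_counts 3
```

The same filter can be appended to any command in this thread that prints `word:count` lines.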

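There is also a more obscure option: Tcl. tclsh does not have a nice selection of command-line options like other languages, so a one-liner takes more work. On the other hand, this approach has the benefit of preserving the order in which the words appear in the file.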

echo '
    set fh [open [lindex $argv 1] r]
    set data [read -nonewline $fh]
    close $fh
    foreach word [split [string map {"-\n" ""} $data]] {
        dict incr count $word
    }
    dict for {k v} $count {puts "$k:$v"}
' | tclsh -- file

hello:2
world:2
test:2
helo:1
words:2

Answer 3

Using a tr + sed + datamash pipeline:

$ tr ' ' '\n' <file | sed '/-/N;s/-\n//' | datamash -s -g1 --output-delimiter=':' count 1
hello:2
helo:1
test:2
words:2
world:2
