Command-line method to find repeated-word typos, with line numbers

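Answer 1

Edit: added installation and a demo.

You need to handle at least the following edge cases:

Words repeated across the end of one line and the start of the next.
The search should be case-insensitive, because typos such as The the apple are common.
You probably want to restrict the search to word constituents, to avoid matching something like ( ( a + b) + c ) (repeated opening parentheses).
Only whole words should match, which rules out the thesis.
For human-language text, Unicode characters inside words should be interpreted correctly.

All in all, I recommend the following pcregrep solution: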

pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' file

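Here -M lets a match span more than one line and -i ignores case. Obviously, color and line numbers (the n option) are optional, but usually nice to have.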
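
To try the edge cases listed above, a small fixture file helps. The following is only a sketch: make_edge_cases and the default file name edge_cases.txt are hypothetical names made up for this illustration, and the sample sentences are mine, not from the question.

```shell
# make_edge_cases is a hypothetical helper; it writes one line per edge case.
make_edge_cases() {
    cat > "${1:-edge_cases.txt}" <<'EOF'
The the apple fell.
This line ends with and
and the next one starts with it.
( ( a + b) + c )
I finished the thesis.
EOF
}
```

After make_edge_cases, running the pcregrep command above on edge_cases.txt should, if the pattern behaves as intended, flag line 1 and the pair spanning lines 2 and 3, and leave lines 4 and 5 alone.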

Install

On Debian-based distributions, you can install it with:

$ sudo apt-get install pcregrep

Example

Run the command on jefferson_typo.txt to check:

$ pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' jefferson_typo.txt
1:He has has refused his Assent to Laws, the most wholesome and necessary
3:He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
5:Assent should be be obtained; and when so suspended, he has utterly

The above is only a text capture; on a terminal that supports color, the matches are highlighted:

has has
and
and
be be

Answer 2

This prints the lines that contain repeated words, along with the file name and line number:

for f in *.txt; do
    # the trailing \b keeps "the thesis" from matching as a repeated "the"
    perl -ne 'print "$ARGV: $.: $_" if /\b(\w+)\W+\1\b/' "$f"
done

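For multi-line matches there is the following, but you lose the line numbers because the file is slurped a paragraph at a time (that is the effect of the -00 option). The \W+ between the two words stands for "non-word" characters, including newlines.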

perl -00 -nE '
    # each hit yields two captures: the full "word word" text and the word itself
    @matches = /\b((\w+)\W+\2\b)/g;
    while (@matches) {
        ($match,$word) = splice @matches, 0, 2;
        say "dup: $match";
    }
' jefferson_typo.txt

dup: has has
dup: and
and
dup: be be

Answer 3

You should meet the venerable diction(1) and style(1) commands. They catch all sorts of wild things. Newer versions are around (GPLv3, on Fedora 23).

Install

On Debian-based distributions, for example, install the diction package, which includes style:

$ sudo apt-get install diction

On Fedora, at least, it is:

$ dnf install diction

On Red Hat Enterprise Linux (and clones), you may need:

$ yum install diction

In any case, this comes from the upstream GNU package diction, so it should be called the same almost everywhere.

Example

$ diction jefferson_typo.txt
jefferson_typo.txt:1: He has [has] refused his Assent to Laws, the [most] wholesome and necessary for the public good.

jefferson_typo.txt:3: He has forbidden his Governors to pass Laws of immediate and [and] pressing importance, unless suspended in their operation till his Assent should be [be] obtained; and when [so] suspended, he has utterly neglected to attend to them.

2 phrases in 2 sentences found.

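Pros

Catches repeated words, among other things.

Cons

It puts [] markers around items that have nothing to do with repeated words. For example, [so] is flagged, presumably because it can be considered superfluous per Strunk's "The Elements of Style". See man diction for more.
The number shown is not always the line number in the original input; it is the line on which the sentence starts. For example, [be] is on line 5 of the original input, but here it appears under 3 because it belongs to a sentence that starts on line 3. So this is slightly different from what you asked for.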

Answer 4

Since you tagged the question awk, how about just using awk?

$ awk '
    BEGIN{RS=FS="\\W+"}
    $0==t{printf("%s:%s\t%s %s\n", FILENAME, FNR, t, $0)}
    {t=$0}
' *.txt
highlander_typo.txt:6   one one
jefferson_typo.txt:3    has has
jefferson_typo.txt:29   and and
jefferson_typo.txt:42   be be
kylie_minogue.txt:3 la la

I did not honor the line breaks in jefferson_typo.txt because that was not visually helpful, but adjust to taste.
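
One thing to keep in mind with the command above: because RS is set to \W+, each record is a single word, so as far as I can tell the number printed from FNR is a word index rather than a line number. A regular-expression RS is also not something every awk supports. If real line numbers matter, a variant along these lines might do; it is a sketch under assumptions, with dup_awk being a hypothetical helper name and words taken to be ASCII letters, digits and underscores only.

```shell
# dup_awk is a hypothetical helper name.
dup_awk() {
    awk '
        FNR == 1 { prev = "" }    # never pair words across two files
        {
            # Default RS is kept, so FNR is a real line number.
            # Split on runs of non-word characters (ASCII-only assumption).
            n = split($0, w, /[^A-Za-z0-9_]+/)
            for (i = 1; i <= n; i++) {
                if (w[i] == "") continue    # empty piece at a line edge
                if (tolower(w[i]) == tolower(prev))
                    printf "%s:%d\t%s %s\n", FILENAME, FNR, prev, w[i]
                prev = w[i]
            }
        }
    ' "$@"
}
```

Call it as dup_awk *.txt. The previous word is carried over from one line to the next, so a pair split across a line break is reported on the line of the second word. It is also carried across blank lines, which may or may not be what you want.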
