Using grep to extract n words around a token when punctuation gets in the way

Question

GNU grep -o cannot output the same text (for example, "meaning n words before the" or "and n words after the") twice. However, you can do that with pcregrep and its -o<n> option, by capturing what matches inside a look-ahead operator (which does not move the cursor for the next match).

$ pcregrep -o0 -o2  '(\w+\W+){0,5}token(?=((\W+\w+){0,5}))' file
This is a token, but when any punctuation is
n words around a specific token, meaning n words before the
meaning n words before the token and n words after the
and n words after the token. There is no fix pattern

-o0 is the whole text that matched; -o2 is what was matched by the (....) inside the look-ahead operator, as in (?=(here)).

Note that on an input like:

6 5 4 3 2 1 token token 1 2 3 4 5 6

it gives:

5 4 3 2 1 token token 1 2 3 4
token 1 2 3 4 5

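That is because it starts looking for the second match after the first "token", so it finds only 0 words before the second "token". Putting the whole pattern inside the look-ahead instead gives: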
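The same behaviour can be reproduced outside pcregrep. The following is a minimal Python sketch, assuming Python's re module behaves like PCRE for this pattern; the helper name context_matches is made up for illustration. It applies the same pattern and reports the offset at which each match starts. The second match can only begin after the first consumed "token", which is why it has no words in front of it.

```python
import re

# Same pattern as the pcregrep example: up to 5 words before "token" are
# consumed, while up to 5 words after it are only looked at (group 2).
PATTERN = re.compile(r"(\w+\W+){0,5}token(?=((\W+\w+){0,5}))")


def context_matches(text):
    """Return (start_offset, context) for each non-overlapping match."""
    # group(0) is the consumed text, group(2) the text seen by the look-ahead.
    return [(m.start(), m.group(0) + m.group(2)) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    line = "6 5 4 3 2 1 token token 1 2 3 4 5 6"
    for start, context in context_matches(line):
        print(start, context)
```

The first match starts at offset 2 (the word "5"), the second at offset 18, which is the second "token" itself.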

$ echo 6 5 4 3 2 1 token token 1 2 3 4 5 6 |
   pcregrep -o1  '(?=((\w+\W+){0,5}token(\W+\w+){0,5}))\w*'
5 4 3 2 1 token token 1 2 3 4
4 3 2 1 token token 1 2 3 4 5
3 2 1 token token 1 2 3 4 5
2 1 token token 1 2 3 4 5
1 token token 1 2 3 4 5
token token 1 2 3 4 5
token 1 2 3 4 5

That is probably not what you want either (even though each "token" now gets up to 5 words before it).

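Generating one line for each occurrence of "token", with up to 5 words on either side, does not seem to be easy with pcregrep alone.

You would need to record the position of each "token" word, and then match up-to-5-words<that-position>"token"up-to-5-words for each of those positions.

Something like: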

$ echo 6 5 4 3 2 1 token token 1 2 3 4 5 6 | perl -lne '
    my @positions; push @positions, $-[0] while /\btoken\b/g;
    for $o (@positions) {
      print $& if /(\w+\W+){0,5}(?<=^.{$o})token(\W+\w+){0,5}/
    }'
5 4 3 2 1 token token 1 2 3 4
4 3 2 1 token token 1 2 3 4 5

Or, to make it clearer which "token" is matched in each case:

$ echo 6 5 4 3 2 1 token token 1 2 3 4 5 6 | perl -lne '
    my @positions; push @positions, $-[0] while /\btoken\b/g;
    for $o (@positions) {
      print "$1<token>$3" if /((\w+\W+){0,5})(?<=^.{$o})token((\W+\w+){0,5})/
    }'
5 4 3 2 1 <token> token 1 2 3 4
4 3 2 1 token <token> 1 2 3 4 5

(I expect this can be simplified/optimised.)
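One possible simplification, sketched in Python rather than Perl: instead of anchoring a regex at each recorded offset, list the spans of all words once and take a window of word indices around each occurrence. This is a different technique from the look-behind used above, not a translation of it, and the function name contexts and its parameters are hypothetical.

```python
import re


def contexts(line, token="token", n=5, mark=False):
    """Return one context string per whole-word occurrence of `token`.

    Each context spans up to `n` words before and after the occurrence;
    the original separators (spaces, punctuation) between words are kept.
    """
    words = list(re.finditer(r"\w+", line))  # spans of every word in the line
    results = []
    for i, word in enumerate(words):
        if word.group() != token:
            continue
        start = words[max(i - n, 0)].start()           # first word of the window
        end = words[min(i + n, len(words) - 1)].end()  # last word of the window
        if mark:
            # Highlight the occurrence this context was built around.
            results.append(
                line[start:word.start()] + "<" + token + ">" + line[word.end():end]
            )
        else:
            results.append(line[start:end])
    return results


if __name__ == "__main__":
    sample = "6 5 4 3 2 1 token token 1 2 3 4 5 6"
    print("\n".join(contexts(sample)))
    print("\n".join(contexts(sample, mark=True)))
```

On the sample line this yields the same two contexts as the Perl versions, because both approaches start and end each context on a word and leave the separators in between untouched.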

Answer 1