Fuzzy search for words near each other

Answer 1

You might consider the following:

1) glark, which has an option:
   ( expr1 --and=NUM expr2 )
   Match both of the two expressions, within NUM lines of each other.

2) bool, with expressions like:
   bool -O0 -C0 -D5 -b "two near three"

3) peg, which accepts options like:
   peg "/x/ and near(sub { /y/ or /Y/ }, 5)"

The code for glark is at https://github.com/jpace/glark, and it is probably also available in some distribution repositories.
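If glark is not installed, the "within NUM lines of each other" idea from item 1 can be approximated in plain awk. This is only a rough sketch and not glark's actual implementation: the function name near_lines and its argument order are made up for illustration. It remembers the most recent line on which each expression matched and prints the line range whenever the two are at most NUM lines apart.

```shell
# near_lines NUM RE1 RE2 FILE   (hypothetical helper, not part of glark)
# Print "first-last" line ranges where RE1 and RE2 both match
# within NUM lines of each other.
near_lines() {
    awk -v n="$1" -v re1="$2" -v re2="$3" '
    {
        hit = 0
        if ($0 ~ re1) { last1 = NR; hit = 1 }   # latest line matching RE1
        if ($0 ~ re2) { last2 = NR; hit = 1 }   # latest line matching RE2
        if (hit && last1 && last2) {
            d = last1 - last2
            if (d < 0) d = -d
            if (d <= n) {
                lo = (last1 < last2) ? last1 : last2
                hi = (last1 < last2) ? last2 : last1
                print lo "-" hi
            }
        }
    }' "$4"
}
```

For example, `near_lines 5 x '[yY]' somefile` reports where x and y (or Y) occur within 5 lines of each other.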

Some details for bool and peg:

bool    print context matching a boolean expression (man)
Path    : ~/executable/bool
Version : 0.2.1
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYS ...)
Help    : probably available with -h,--help
Home    : https://www.gnu.org/software/bool/ (doc)

peg     Perl version of grep, q.v. (what)
Path    : ~/bin/peg
Version : 3.10
Length  : 4749 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Repo    : Debian 8.9 (jessie) 
Home    : http://piumarta.com/software/peg/ (pm)
Home    : http://www.cpan.org/authors/id/A/AD/ADAVIES/peg-3.10 (doc)

Best wishes ... cheers, drl

Answer 2

Start with a morphological analysis tool such as hunspell (https://linux.die.net/man/1/hunspell), then use regular expressions with grep (https://linux.die.net/man/1/grep), and finally use wc, sort, and uniq to rank the results by word proximity.

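#!/bin/bash
# Usage: pass the search words as one quoted argument, e.g. "word1 word2".
WORDS=$1
HAYSTACK=/var/mail

# hunspell -s reads words on stdin; the output is assumed to be
# "word stem" per line, so keep the last field of each non-empty line.
STEMS=$(printf '%s\n' $WORDS | hunspell -s | awk 'NF { print $NF }')
REGEX=$(echo $STEMS | perl -pe 's/ /.*/g')
while IFS= read -r MATCH ; do
    FILE=$(echo "$MATCH" | cut -d: -f1)
    # Byte length of the span from the first stem to the last stem.
    COUNT=$(echo "$MATCH" | cut -d: -f2- | perl -pe 's/.*('"$REGEX"').*/$1/' | wc -c)
    printf '%s\t%s\n' "$COUNT" "$FILE"
done < <(grep -rP "$REGEX" "$HAYSTACK") |
sort -n    # smallest span (closest words) first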

If you want it to be faster, use locate (https://linux.die.net/man/1/locate). Use a regular expression to limit the distance between the words:

a.{1,50}b
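A bounded-gap pattern like this can be wrapped in a small helper. The sketch below is illustrative only (near_grep and its argument order are hypothetical names). It builds the pattern for both word orders and passes it to grep -E. Note that it only finds words on the same line, and `.{1,N}` requires at least one character between them.

```shell
# near_grep MAX WORD1 WORD2 [FILE...]   (hypothetical helper)
# Print lines where WORD1 and WORD2 occur, in either order,
# with 1 to MAX characters between them.
near_grep() (
    max=$1 a=$2 b=$3
    shift 3
    grep -E "$a.{1,$max}$b|$b.{1,$max}$a" "$@"
)
```

For example, `near_grep 50 a b file.txt` applies the pattern above in both directions; with no file arguments it reads standard input.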

Answer 3

I like the idea of grepping mail (at our shop we created a utility named rapgrep, though it needs to cover all the modes, i.e. the general case).

This snippet finds the words country, men, and time, and gives a more specific answer in terms of characters (byte offsets).

# Utility functions: print-as-echo, print-line-with-visual-space.
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
pl " Input data file $FILE:"
head $FILE

pl " Results, egrep:"
egrep 'time|men|country' $FILE

pl " Results, egrep, with byte offset:"
egrep -b 'time|men|country' $FILE

pl " Results, egrep, with byte offset, matches only:"
egrep -o -b 'time|men|country' $FILE |
tee t1

pl " Looking for minimum distance between all pairs:"

awk -F":" '
  { a[$2] = $1  # Compare every item to the new item
    for ( b in a ) {
      for ( c in a ) {
        # print " Working on b = ",b," c = ",c
        if ( b != c ) {
          v0 = a[c]-a[b]
          v1 = v0 < 0 ? -v0 : v0  # convert to > 0
          v2 = (b < c) ? b " " c : c " " b  # trivial sort of names
          print v1, v2
        }
      }
    }
  }
' t1 |
datamash -t" " -s --group 2,3 min 1

producing:

-----
 Input data file data1:
Now is the time
for all good men
to come to the aid
of their country.

-----
 Results, egrep:
Now is the time
for all good men
of their country.

-----
 Results, egrep, with byte offset:
0:Now is the time
16:for all good men
52:of their country.

-----
 Results, egrep, with byte offset, matches only:
11:time
29:men
61:country

-----
 Looking for minimum distance between all pairs:
country men 32
country time 50
men time 18

And for a slightly more complex file, in which a given word appears more than once:

-----
 Input data file data2:
Now is the time men
for all good men
to come to the aid
of their men country.

-----
 Results, egrep:
Now is the time men
for all good men
of their men country.

-----
 Results, egrep, with byte offset:
0:Now is the time men
20:for all good men
56:of their men country.

-----
 Results, egrep, with byte offset, matches only:
11:time
16:men
33:men
65:men
69:country

-----
 Looking for minimum distance between all pairs:
country men 4
country time 58
men time 5

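This uses the byte-offset option of GNU grep; the awk program computes the distances between all pairs of words; and finally datamash sorts, groups, and selects the minimum distance for each pair.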

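This could be parameterized quite easily to accept the words and a distance on the command line. See file t1 for the form of the data passed into the awk program, whose output in turn feeds datamash.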
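One possible shape for that parameterization is sketched below. It is an assumption-laden variant, not the author's script: the function name min_dist and its argument order are hypothetical, and the datamash step is replaced by an awk END block plus sort so that only GNU grep and awk are needed. It relies on grep -o -b reporting matches in ascending offset order.

```shell
# min_dist FILE MAXDIST WORD...   (hypothetical helper)
# Print "word1 word2 distance" for each pair of words whose closest
# occurrences start at most MAXDIST bytes apart.
min_dist() (
    file=$1 max=$2
    shift 2
    pattern=$(printf '%s|' "$@")   # word1|word2|...|
    pattern=${pattern%|}           # drop the trailing |
    grep -E -o -b "$pattern" "$file" |
    awk -F: -v max="$max" '
    {
        pos[$2] = $1                       # latest offset of this word
        for (w in pos) {
            if (w == $2) continue
            d = $1 - pos[w]                # offsets arrive in ascending order
            key = (w < $2) ? w " " $2 : $2 " " w
            if (!(key in best) || d < best[key]) best[key] = d
        }
    }
    END { for (key in best) if (best[key] <= max) print key, best[key] }
    ' | LC_ALL=C sort
)
```

Run as `min_dist data2 100 time men country`, this should reproduce the three minimum distances shown for data2 above; lowering MAXDIST filters out the pairs that are farther apart.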

Run on a system like:

OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.9 (jessie) 
bash GNU bash 4.3.30
grep (GNU grep) 2.20
awk GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2-p3, GNU MP 6.0.0)
datamash (GNU datamash) 1.2

Best wishes ... cheers, drl
