awk: Parsing a text file using layout keywords

Question 1

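Assuming that some tags (names such as "ALBERT") can be missing from the first line, just as they can be missing from other lines, you need a two-pass approach: first identify all of the tags, then print, for every line, the value of every tag, whether or not that tag appears on the line.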

$ cat tst.awk
BEGIN { OFS=";" }
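# Pass 1 (first read of the file): record each distinct tag in the order it
# is first seen. Records are assumed to be "TAG word word" triples, hence i+=3.
NR==FNR {
    for (i=1; i<NF; i+=3 ) {
        if ( !seen[$i]++ ) {
            tags[++numTags] = $i
        }
    }
    next
}
# Pass 2 (second read): map the tags on this line to their values, then
# print one field per known tag, leaving it empty if the tag is absent.
{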
    delete tag2val
    for (i=1; i<NF; i+=3) {
        tag = $i
        val = $(i+1) FS $(i+2)
        tag2val[tag] = val
    }

    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        val = tag2val[tag]
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
}

$ awk -f tst.awk example.txt example.txt | column -t -s';' -o'; '
some a; some b; some c; some d; some e
some a; some b;       ;       ; some e
some a; some b;       ; some d;

The above prints the values for all tags on every line, in the order the tags first appear across the whole input.
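The two-pass run can be tried end to end with the sketch below. The contents of example.txt are not shown here, so the input is an assumption reconstructed from the output; the awk body is the same logic as tst.awk, inlined, with split("", tag2val) swapped in for delete tag2val as a portable way to clear the array. The raw output is shown without column.

```shell
# Hypothetical input, reconstructed from the output shown above.
workdir=$(mktemp -d)
cat > "$workdir/example.txt" <<'EOF'
ALBERT some a BRYAN some b CLAUDIA some c DAVID some d ERIK some e
ALBERT some a BRYAN some b ERIK some e
ALBERT some a BRYAN some b DAVID some d
EOF

# The file is named twice so that awk reads it twice: NR==FNR is only
# true while the first copy is being read.
out=$(awk '
BEGIN { OFS=";" }
NR==FNR {                       # pass 1: collect tags in first-seen order
    for (i=1; i<NF; i+=3)
        if ( !seen[$i]++ ) tags[++numTags] = $i
    next
}
{                               # pass 2: print one field per known tag
    split("", tag2val)          # portable way to empty the array
    for (i=1; i<NF; i+=3) tag2val[$i] = $(i+1) FS $(i+2)
    for (t=1; t<=numTags; t++)
        printf "%s%s", tag2val[tags[t]], (t<numTags ? OFS : ORS)
}' "$workdir/example.txt" "$workdir/example.txt")
printf '%s\n' "$out"
# prints:
# some a;some b;some c;some d;some e
# some a;some b;;;some e
# some a;some b;;some d;
```

Because the input has to be read twice, this approach needs a regular file; it would not work as written on data arriving through a pipe.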

To also print the tags as a column header line:

$ cat tst.awk
BEGIN { OFS=";" }
NR==FNR {
    for (i=1; i<NF; i+=3 ) {
        if ( !seen[$i]++ ) {
            tags[++numTags] = $i
        }
    }
    next
}
# At the start of the second pass, print the tags themselves as a header line.
FNR==1 {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
    }
}
{
    delete tag2val
    for (i=1; i<NF; i+=3) {
        tag = $i
        val = $(i+1) FS $(i+2)
        tag2val[tag] = val
    }

    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        val = tag2val[tag]
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
}

$ awk -f tst.awk example.txt example.txt | column -t -s';' -o'; '
ALBERT; BRYAN ; CLAUDIA; DAVID ; ERIK
some a; some b; some c ; some d; some e
some a; some b;        ;       ; some e
some a; some b;        ; some d;
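The reason for the separate first pass shows up when the first line does not contain every tag. The sketch below uses a made-up file, gaps.txt (a hypothetical name and hypothetical contents), in which ALBERT only appears on the second line; the header still lists it, and the column order follows first appearance rather than alphabetical order.

```shell
workdir=$(mktemp -d)
# Hypothetical input: ALBERT is absent from line 1 and first appears on line 2.
cat > "$workdir/gaps.txt" <<'EOF'
BRYAN some b
ALBERT some a BRYAN some b
EOF

out=$(awk '
BEGIN { OFS=";" }
NR==FNR {                       # pass 1: collect tags in first-seen order
    for (i=1; i<NF; i+=3)
        if ( !seen[$i]++ ) tags[++numTags] = $i
    next
}
FNR==1 {                        # start of pass 2: print the header once
    for (t=1; t<=numTags; t++)
        printf "%s%s", tags[t], (t<numTags ? OFS : ORS)
}
{                               # pass 2: print one field per known tag
    split("", tag2val)
    for (i=1; i<NF; i+=3) tag2val[$i] = $(i+1) FS $(i+2)
    for (t=1; t<=numTags; t++)
        printf "%s%s", tag2val[tags[t]], (t<numTags ? OFS : ORS)
}' "$workdir/gaps.txt" "$workdir/gaps.txt")
printf '%s\n' "$out"
# prints:
# BRYAN;ALBERT
# some b;
# some b;some a
```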

Question 2

Using Perl:

#!/usr/bin/perl

use strict;
use warnings;

# @keys is an array containing the keywords. It also determines
# the field output order.  This can be read from a file if needed,
# but here it's hard-coded.
my @keys = qw(ALBERT BRYAN CLAUDIA DAVID ERIK);

# create and pre-compile a regex matching all the keywords
my $keys = join("|",@keys);
my $keys_re = qr/$keys/;

# make an empty hash containing elements for all the keys so that
# we can start processing each input record afresh, with a fully
# populated list of keys.
my %empty = map +( $_ => '' ), @keys;


# main loop, process stdin and/or filename args
while(<>) {
  # clean up the input a little.
  chomp;            # trim newlines at EOL
  s/^\s*|\s*$//g;   # trim leading and trailing whitespace

  # ignore empty lines.
  next if (m/^$/);

  # NUL can't be in text input, so insert it as a marker around
  # the keywords. i.e. insert NULs before and after each keyword
  s/$keys_re/\000$&\000/g;

  # split the input record on NUL, trimming spaces and discarding
  # the first element (a bogus artificial field which only exists
  # as a side-effect of inserting a NUL before the first keyword.)
  my (undef,@record) = split /\s*\000\s*/;

  # pre-populate the fields hash for each record.
  my %fields = %empty;

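  # now insert the real values for each keyword if they exist.
  # @record alternates keyword, value, keyword, value, ... so step by two.
  # (Incrementing the loop variable inside a foreach over a range does
  # not skip elements, so an explicit C-style loop is used.)
  for (my $i = 0; $i < $#record; $i += 2) {
    $fields{$record[$i]} = $record[$i+1];
  }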

  print join(";", map +( $fields{$_} ), @keys),"\n";
}

To add a space after each semicolon, change the print join(";",...) line above to join on "; " instead.

To read the keywords from a file, replace the my @keys = qw(...) line above with:

# slurp in the keywords file and split it on any whitespace.
my @keys = split /\s+/, do {
  local $/;   # read entire file at once - slurp
  my $fname = 'keywords.txt';
  open(my $fh, '<', $fname) or die "Error opening $fname: $!";
  <$fh>
};

keywords.txt can contain keys separated by any combination of vertical or horizontal whitespace: spaces, tabs, newlines, CR/LF, and so on.

$ cat keywords.txt 
ALBERT
BRYAN   CLAUDIA
DAVID ERIK

Save it as, for example, iterate.pl and make it executable with chmod +x iterate.pl.

$ ./iterate.pl input.txt 
some a;some b;some c;some d;some e
some a;some b;;;some e
some a;some b;;some d;

For prettier, column-aligned output, you can use column:

$ ./iterate.pl input.txt | column -s';' -o'; ' -t
some a; some b; some c; some d; some e
some a; some b;       ;       ; some e
some a; some b;       ; some d;

Answer