awk: parse and write to another file

Question 1

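What you posted is not valid XML, so I am going to assume it is just an example. If that assumption is wrong, this answer does not hold. In that case you need to go back to whoever gave you the XML, with a copy of the XML specification in hand, and ask them to fix it.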

But really, awk and regular expressions are not the right tool for this job. An XML parser is. With a parser, what you want is quite simple:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig; 

#parse your file - this will error if it's invalid. 
my $twig = XML::Twig -> new -> parsefile ( 'your_xml' );
#set output format. Optional. 
$twig -> set_pretty_print('indented_a');

#iterate all the 'record' nodes off the root. 
foreach my $record ( $twig -> get_xpath ( './record' ) ) {
   #if - beneath this record - we have a node anywhere (that's what // means)
   #with a tag of 'keyword' and content of 'SEARCH' 
   #print the whole record. 
   if ( $record -> get_xpath ( './/keyword[string()="SEARCH"]' ) ) {
       $record -> print;
   }
}

xpath is a bit like a regular expression in some ways, but it is closer to a directory path. That means it is context aware and can cope with XML structure.

In the above, ./ means "directly below the current node", so:

$twig -> get_xpath ( './record' )

means the "top level" <record> tags.

But .// means "at any level below the current node", so it searches recursively:

$twig -> get_xpath ( './/search' )

would get every <search> node at any level.
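
To see the difference between ./ and .// in practice, here is a minimal sketch. The sample document is made up for illustration (it is not the asker's data), and it assumes XML::Twig is installed:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document, invented for this illustration.
my $xml = <<'END_XML';
<root>
  <record>
    <keywords><keyword>SEARCH</keyword><keyword>OTHER</keyword></keywords>
  </record>
  <record>
    <keywords><keyword>DONTSEARCH</keyword></keywords>
  </record>
</root>
END_XML

my $twig = XML::Twig->new;
$twig->parse($xml);
my $root = $twig->root;

# './' only looks at direct children of the context node (here: the root).
my @records = $root->get_xpath('./record');
my @direct  = $root->get_xpath('./keyword');

# './/' descends through every level below the context node.
my @nested  = $root->get_xpath('.//keyword');

printf "./record   matched %d\n", scalar @records;   # 2
printf "./keyword  matched %d\n", scalar @direct;    # 0, keyword is never a child of root
printf ".//keyword matched %d\n", scalar @nested;    # 3
```

<keyword> is nested two levels below each <record>, so only the .// form finds it when starting from the root.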

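Square brackets denote a condition. It can use a function (for example text(), to get the text of the node) or an attribute. So //category[@name] finds every category that has a name attribute, and //category[@name="xyz"] filters those further.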
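
As a small illustration of the bracket conditions, the sketch below counts <category> elements with and without a name attribute. The document is invented for the example, and I am relying on the XPath subset that XML::Twig's get_xpath supports, not full XPath:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document, invented for this illustration.
my $xml = <<'END_XML';
<root>
  <category name="xyz">first</category>
  <category name="abc">second</category>
  <category>third</category>
</root>
END_XML

my $twig = XML::Twig->new;
$twig->parse($xml);

# No condition: every <category> in the document.
my @all   = $twig->get_xpath('//category');

# [@name]: only elements that carry a name attribute, whatever its value.
my @named = $twig->get_xpath('//category[@name]');

# [@name="xyz"]: the attribute must exist and equal "xyz".
my @xyz   = $twig->get_xpath('//category[@name="xyz"]');

printf "all: %d, with name: %d, name=xyz: %d\n",
    scalar @all, scalar @named, scalar @xyz;    # all: 3, with name: 2, name=xyz: 1
print "matched text: ", $xyz[0]->text, "\n";    # matched text: first
```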

XML used for testing:

<XML>
<record category="xyz">
<person ssn="" e-i="E">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<details>
<names>
<first_name/>
<last_name></last_name>
</names>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>SEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is present in abc for xyz reason</detail>
</external_sources>
</details>
</person>
</record>
<record category="abc">
<person ssn="" e-i="F">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<details>
<names>
<first_name/>
<last_name></last_name>
</names>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>DONTSEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is not present in abc for xyz reason</detail>
</external_sources>
</details>
</person>
</record>
</XML>

Output:

 <record category="xyz">
    <person
        e-i="E"
        ssn="">
      <title xsi:nil="true" />
      <position xsi:nil="true" />
      <details>
        <names>
          <first_name/>
          <last_name></last_name>
        </names>
        <aliases>
          <alias>CDP</alias>
        </aliases>
        <keywords>
          <keyword xsi:nil="true" />
          <keyword>SEARCH</keyword>
        </keywords>
        <external_sources>
          <uri>http://www.google.com</uri>
          <detail>SEARCH is present in abc for xyz reason</detail>
        </external_sources>
      </details>
    </person>
  </record>

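Note: the above just prints the matching records to STDOUT. In my opinion that is not such a great idea, not least because it does not print the enclosing XML structure. If more than one record matches, the output has no "root" node, so it is not actually "valid" XML.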

So instead, to do exactly what you are asking, I would use:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig; 

my $twig = XML::Twig -> new -> parsefile ('your_file.xml'); 
$twig -> set_pretty_print('indented_a');

foreach my $record ( $twig -> get_xpath ( './record' ) ) {
   if ( not $record -> findnodes ( './/keyword[string()="SEARCH"]' ) ) {
       $record -> delete;
   }
}

open ( my $output, '>', "output.txt" ) or die $!;
print {$output} $twig -> sprint;
close ( $output );

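This inverts the logic instead: it deletes the records you do not want from the parsed data structure in memory, then prints the whole remaining structure (including the XML header) to a new file called "output.txt".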

Answer