Remove duplicates from a file based on the given matches

The file I want to deduplicate contains the sample text below. The end goal is to remove every duplicate instance from this file (one of them is web:webapi).

The file is over 600 MB.

"nirmal" -> ["app:am","app:am","app:identity_gateway","app:identity_gateway","app:loginsvc","app:loginsvc","app:loginui","app:loginui","app:ticket","app:ticket","app:webapi","app:webapi","ds:config_store","ds:config_store","ds:cts_store","ds:cts_store","ds:user_store","ds:user_store","web:am","web:am","web:identity_gateway","web:identity_gateway","web:loginsvc","web:loginsvc","web:loginui","web:loginui","web:ticket","web:ticket","web:webapi","web:webapi"];
"mbl" -> ["app:phx","web:phx","app:vas","development:mobile","s2:detsvc","s2core:detsvc","txn:detsvc","web:detsvc","app:fidoproxy","app:landing","app:mobile","app:noknok","app:optchart","app:redis","app:sentinel","app:spring","cws:mesg","cws3:wsproxy","s2:billpay","s2:services","s2core:billpay","s2core:services","web:fidoproxy","web:spring","at:admin","at:eqsroll","at:oqsroll","batch:admin","cws:ctnt","cws:risk","cws:user","cws3:acctaggtr","cws3:content","cws3:risk","cws3:rtao","cws3:rtmm","ets:ord","fhs:eqs","fhs:oqs","s2:aarcomm","s2:acctcomm","s2:espsvc","s2:ibsvc","s2core:aarcomm","s2core:espsvc","s2core:ibsvc","txb:b2bsvc","txn:acct","txn:ibank2","txn:olsvc","txn:rtmm","txn:services","txn:wtools","web:aempros_mpublish","web:b2b","web:etsecxml","web:ibxml","web:olxml","web:prospect","web:tablet","web:ticket","web:wtxml","web:xmlacct","web:xmlrtmm","s2:asset","s2core:asset","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg"];

How can I do this on Linux?

The whole file contains text in the same format on every line. For each line, I want to look at the first string (the part before the -> separator) and find the duplicate comma-separated entries in its value; any duplicated content that is found should be removed.

Answer 1

With sed:

sed -e :1 -e 's/\("[^",]*"\)\(.*\),\1/\1\2/;t1'
  • :1 is a jump label marking the start of the loop
  • "[^",]*" matches one field. Excluding the comma from the bracket expression ensures that the match never spans a "," between fields. Wrapping the field in \(\) lets us back-reference the same field as \1
  • the s command deletes the second occurrence of that field, together with the comma in front of it
  • whenever a substitution was made, the t command jumps back to the :1 label, so the loop runs until no duplicates remain
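The loop can be checked on a one-line sample; the input line below is made up for illustration:

```shell
# Each pass of the loop deletes one later duplicate of a quoted field;
# the t command repeats until no substitution happens.
printf '%s\n' '"x" -> ["a","b","a","c","b"];' |
  sed -e :1 -e 's/\("[^",]*"\)\(.*\),\1/\1\2/;t1'
# prints: "x" -> ["a","b","c"];
```

The first occurrence of each field is kept, so the original order survives. Since sed processes one line at a time, memory use stays bounded by the longest line even on a 600 MB file, though the repeated backtracking can be slow on lines with very many fields.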

Answer 2

One way to do it:

$ perl -lne '%k = (); /^(.*?->\s*\[)(.*)(\].*)/; $k{$_}++ for split(/,/,$2);
             print $1, join(",", keys %k), $3' file
"nirmal" -> ["web:webapi","web:identity_gateway","ds:config_store","app:ticket","ds:user_store","web:loginsvc","web:loginui","web:am","ds:cts_store","app:loginui","app:am","app:identity_gateway","app:webapi","web:ticket","app:loginsvc"];
"mbl" -> ["s2:acctcomm","cws:mesg","txn:olsvc","app:loginsvc","web:b2b","app:loginui","app:optchart","app:phxcfgsvr","cws3:risk","s2core:billpay","s2:detsvc","app:spring","app:phxdshbrd","ds:user_store","web:ticket","batch:admin","at:eqsroll","s2:asset","s2core:mblsvc","txn:acct","app:am","s2:espsvc","development:mobile","web:fidoproxy","app:webapi","txn:rtmm","s2:mblsvc","app:redis","cws:user","cws3:acctaggtr","ds:cts_store","txn:detsvc","web:mobile","app:webapiagg","txb:b2bsvc","fhs:oqs","cws3:wsproxy","web:landing","web:olxml","fhs:eqs","web:prospect","s2core:ibsvc","cws:risk","web:phx","s2:ibsvc","s2core:espsvc","txn:services","web:ibxml","web:tablet","at:admin","web:identity_gateway","web:spring","web:phxdshbrd","web:phxcfgsvr","s2core:snapquotes","app:sentinel","s2core:asset","ets:ord","cws3:rtmm","web:loginui","txn:wtools","web:loginsvc","s2:snapquotes","app:fidoproxy","web:etsecxml","s2:aarcomm","web:am","web:wtxml","app:noknok","ds:config_store","app:ticket","txn:ibank2","s2core:services","s2:billpay","web:detsvc","app:landing","cws3:content","web:aempros_mpublish","s2core:aarcomm","app:mobile","web:webapiagg","s2core:detsvc","web:webapi","cws3:rtao","app:identity_gateway","web:xmlrtmm","web:xmlacct","s2:services","at:oqsroll","app:vas","app:phx","cws:ctnt"];

Explanation

  • perl -lne: run the script given with -e on each line of the input file (-n); -l removes the trailing newline from the input and adds one to each print
  • /^(.*?->\s*\[)(.*)(\].*)/: match three sets of data on each input line. .*?->\s*\[ matches from the start of the line up to the first ->, then zero or more whitespace characters, then a [. Since the pattern is in parentheses, we can refer to it as $1. Next, (.*) matches everything up to the last ], which becomes $2. Finally, (\].*) matches the rest of the line as $3
  • $k{$_}++ for split(/,/,$2);: this is where the duplicates are removed. $2 (the data containing the duplicates) is split on , into a list, and each element of that list is used as a key of the hash %k. Since hash keys are always unique, the duplicates in $2 disappear. Note that %k has to be emptied at the start of each line, otherwise keys from earlier lines leak into later ones
  • print: finally, print the start of the line ($1), the keys of %k joined with commas, and the rest of the line ($3). This does not preserve the original input order, but the duplicated content is removed
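If the original order of the entries matters, the same per-line dedupe can be sketched with a first-seen-wins pass in awk (the dedupe_line wrapper name is made up; it assumes exactly one [...] list per line, as in the sample file):

```shell
# Order-preserving dedupe: keep the first occurrence of each
# comma-separated entry inside the [...] list of every line.
dedupe_line() {
  awk '{
    if (match($0, /\[.*\]/)) {
      pre  = substr($0, 1, RSTART)                 # up to and including "["
      body = substr($0, RSTART + 1, RLENGTH - 2)   # the comma-separated list
      post = substr($0, RSTART + RLENGTH - 1)      # "]" and whatever follows
      split("", seen)                              # reset the seen-set per line
      n = split(body, a, ",")
      out = ""
      for (i = 1; i <= n; i++)
        if (!(a[i] in seen)) {
          seen[a[i]] = 1
          out = out (out == "" ? "" : ",") a[i]
        }
      print pre out post
    } else print                                   # pass non-matching lines through
  }' "$@"
}
# usage: dedupe_line < file > deduped
```

Like sed and perl -n, awk streams the file line by line, so this also stays within memory on a 600 MB input.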
