重複を削除するファイルには、次のサンプルテキストがあります。最後の目標は、このファイル(そのうちの1つはweb:webapi)からすべての重複インスタンスを削除することです。
このファイルは600MB以上のファイルです。
"nirmal" -> ["app:am","app:am","app:identity_gateway","app:identity_gateway","app:loginsvc","app:loginsvc","app:loginui","app:loginui","app:ticket","app:ticket","app:webapi","app:webapi","ds:config_store","ds:config_store","ds:cts_store","ds:cts_store","ds:user_store","ds:user_store","web:am","web:am","web:identity_gateway","web:identity_gateway","web:loginsvc","web:loginsvc","web:loginui","web:loginui","web:ticket","web:ticket","web:webapi","web:webapi"];
"mbl" -> ["app:phx","web:phx","app:vas","development:mobile","s2:detsvc","s2core:detsvc","txn:detsvc","web:detsvc","app:fidoproxy","app:landing","app:mobile","app:noknok","app:optchart","app:redis","app:sentinel","app:spring","cws:mesg","cws3:wsproxy","s2:billpay","s2:services","s2core:billpay","s2core:services","web:fidoproxy","web:spring","at:admin","at:eqsroll","at:oqsroll","batch:admin","cws:ctnt","cws:risk","cws:user","cws3:acctaggtr","cws3:content","cws3:risk","cws3:rtao","cws3:rtmm","ets:ord","fhs:eqs","fhs:oqs","s2:aarcomm","s2:acctcomm","s2:espsvc","s2:ibsvc","s2core:aarcomm","s2core:espsvc","s2core:ibsvc","txb:b2bsvc","txn:acct","txn:ibank2","txn:olsvc","txn:rtmm","txn:services","txn:wtools","web:aempros_mpublish","web:b2b","web:etsecxml","web:ibxml","web:olxml","web:prospect","web:tablet","web:ticket","web:wtxml","web:xmlacct","web:xmlrtmm","s2:asset","s2core:asset","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg","app:phxcfgsvr","app:phxdshbrd","app:webapiagg","s2core:mblsvc","s2core:snapquotes","s2:mblsvc","s2:snapquotes","web:landing","web:mobile","web:phxcfgsvr","web:phxdshbrd","web:webapiagg"];
Linuxではこれをどのように実行しますか?
各行に同じ形式のテキストを含む完全なファイル。各ファイルで最初の文字列(「->」で区切られた)を検索し、その値でコンマで区切られた重複エントリを見つけようとします。重複したコンテンツが見つかった場合は削除する必要があります。
答え1
そしてsed
:
sed -e :1 -e 's/\("[^",]*"\)\(.*\),\1/\1\2/;t1'
:1
ループを表示するジャンプマーカー"[^",]*"
フィールドです。パターンからカンマを除外すると、パターンは","
フィールドとして扱われません。このフィールドを入力すると、\(\)
同じフィールドを再参照できます。\1
- この
s
コマンドは、カンマで同じフィールドの2番目の発生を削除します。 - 置換が行われると、
t
コマンドは最初のジャンプマーカーにジャンプします。
答え2
1つの方法は次のとおりです。
$ perl -lne '/^(.*?->\s*\[)(.*)(\].*)/; $k{$_}++ for split(/,/,$2);
print "$1", join ",", keys(%k), "$3"' file
"nirmal" -> ["web:webapi","web:identity_gateway","ds:config_store","app:ticket","ds:user_store","web:loginsvc","ds:cts _store","web:loginui","web:am","ds:cts_store","app:loginui","app:am","app:identity_gateway","app:webapi","web:ticket","app:loginsvc",];
"mbl" -> ["s2:acctcomm","cws:mesg","txn:olsvc","app:loginsvc","web:b2b","app:loginui","app:optchart","app:phxcfgsvr","cws3:risk","s2core:billpay","s2:detsvc","app:spring","app:phxdshbrd","ds:user_store","web:ticket","batch:admin","at:eqsroll","s2:asset","s2core:mblsvc","txn:acct","app:am","s2:espsvc","development:mobile","web:fidoproxy","app:webapi","txn:rtmm","s2:mblsvc","app:redis","cws:user","cws3:acctaggtr","ds:cts_store","txn:detsvc","web:mobile","app:webapiagg","txb:b2bsvc","fhs:oqs","cws3:wsproxy","web:landing","web:olxml","fhs:eqs","web:prospect","s2core:ibsvc","cws:risk","web:phx","s2:ibsvc","s2core:espsvc","txn:services","web:ibxml","web:tablet","at:admin","web:identity_gateway","web:spring","web:phxdshbrd","web:phxcfgsvr","s2core:snapquotes","app:sentinel","s2core:asset","ets:ord","cws3:rtmm","web:loginui","txn:wtools","web:loginsvc","s2:snapquotes","app:fidoproxy","web:etsecxml","s2:aarcomm","web:am","web:wtxml","app:noknok","ds:config_store","app:ticket","txn:ibank2","s2core:services","s2:billpay","web:detsvc","app:landing","cws3:content","web:aempros_mpublish","s2core:aarcomm","app:mobile","web:webapiagg","s2core:detsvc","web:webapi","cws3:rtao","app:identity_gateway","web:xmlrtmm","web:xmlacct","ds:cts _store","s2:services","at:oqsroll","app:vas","app:phx","cws:ctnt",];
説明する
perl -lne
-e
:入力ファイルの各行で()で指定されたスクリプトを実行します-n
。-l
入力から末尾の改行を削除し、各呼び出しに改行を追加しますprint
。/^(.*?->\s*\[)(.*)(\].*)/
:各入力ラインで3つのデータセットを一致させます。.*?->\s*\[
最初のファイルの先頭から最初の、次の->
ゼロ個以上の空白文字、その後\[
。パターンがかっこ内にあるので、と呼ぶことができます$1
。次に、最後の項目]
(.*)
)まですべて一致させます。これは〜になります$2
。最後に、残りの行(\].*
)を一致させます$3
。$k{$_}++ for split(/,/,$2);
:重複を削除する方法は次のとおりです。分割して$2
(冗長データ)を配列に配置し、その,
配列の各要素をハッシュのキーとして使用します%k
。ハッシュキーは常に一意であるため、削除されます$2
。print "$1", join ",", keys(%k), "$3"'
:今、ソース、行の先頭、コンマで連結されたハッシュキー$1
、最後に行の残りの部分を印刷します。これにより元の入力順序は維持されませんが、重複した内容は削除されます。%k
$3