Curl a txt file of URLs, but grep each URL separately from a single file

Question 1

This can be done in two lines.

sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' /tmp/curl.conf|xargs -I {} curl -O "{}"
sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' /tmp/curl.conf|xargs -I {} basename "{}"|xargs -I {} sed '/mortgage/q' "{}"

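The first sed command on each line extracts the URLs from the URL file (/tmp/curl.conf in the example). The first line uses curl's -O option to save each page's output to a file named after the page. The second line then goes back over each file and prints it up to and including the first line that contains the text you are interested in. Of course, if a file does not contain the word "mortgage", the whole file is printed.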
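To make the two sed steps easier to try without downloading anything, here is a minimal sketch that wraps them in two helper functions. The names extract_urls and up_to_match, the temporary config file, and the example.com URLs are all made up for illustration; the substitution itself is the one from the commands above, written with POSIX [[:space:]] instead of GNU sed's \s.

```shell
#!/bin/bash
# extract_urls: print the value of every "URL = ..." line in the file $1.
# Same substitution as above, but with POSIX [[:space:]] instead of \s.
extract_urls() {
    sed -n 's/[[:space:]]*URL[[:space:]]*=[[:space:]]*\(.*\)/\1/p' "$1"
}

# up_to_match: copy stdin to stdout and quit right after the first line
# matching the basic regular expression $1.
up_to_match() {
    sed "/$1/q"
}

# Small demonstration on made-up data (file name and URLs are placeholders).
demo_conf=$(mktemp)
printf '%s\n' 'URL = http://example.com/a.html' '  URL=http://example.com/b.html' > "$demo_conf"
extract_urls "$demo_conf"
printf '%s\n' 'intro' 'mortgage rates' 'footer' | up_to_match mortgage
rm -f "$demo_conf"
```

The last pipeline prints the "intro" and "mortgage rates" lines but not "footer", which is the behaviour the second command relies on.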

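This leaves a temporary file for each URL in the current directory.

Edit:

Here is a short script that avoids the leftover files and writes the result to stdout, from where you can redirect it as needed.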

#!/bin/bash
TMPF=$(mktemp)
# sed command extracts URLs line by line
sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' /tmp/curl.conf >"$TMPF"
while IFS= read -r URL; do
    # retrieve each web page and drop every line after the first one containing 'mortgage' (substitute whatever test you like)
    curl "$URL" 2>/dev/null | sed '/mortgage/q'
done <"$TMPF"
rm "$TMPF"
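As a variation on the script above, the sed output can be piped straight into the loop, so no temporary file is needed, and a header line can mark which URL each chunk of output came from. This is only a sketch: fetch_page and scan_urls are hypothetical names, the "== url ==" header format is an arbitrary choice, and curl -s is used in place of redirecting stderr.

```shell
#!/bin/bash
# fetch_page: hypothetical wrapper so the download command is easy to swap out.
fetch_page() {
    curl -s "$1"
}

# scan_urls: for every URL found in config file $1, print a header naming the
# URL, then the page up to and including the first line matching $2.
scan_urls() {
    local conf=$1 pattern=$2 url
    sed -n 's/[[:space:]]*URL[[:space:]]*=[[:space:]]*\(.*\)/\1/p' "$conf" |
    while IFS= read -r url; do
        printf '== %s ==\n' "$url"
        fetch_page "$url" | sed "/$pattern/q"
    done
}

# Usage (assumes /tmp/curl.conf exists): scan_urls /tmp/curl.conf mortgage
```

Because the per-URL header goes to the same stream as the page text, the combined output can be redirected into a single file and still be searched one URL at a time.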

Answer

Question 2

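This general trick works even if the curl config file contains other options (user-agent, referer, etc.).

As a first step, assume your config file is named curl_config. Then use awk '/^[Uu][Rr][Ll]/{print;print "output = dummy/"++k;next}1' curl_config > curl_config2 to create a new curl config file that adds a different, incrementally numbered output file name beneath each url/URL line.

Example: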

[xiaobai@xiaobai curl]$ cat curl_config
URL = "www.google.com"
user-agent = "holeagent/5.0"

url = "m12345.google.com"
user-agent = "holeagent/5.0"

URL = "googlevideo.com"
user-agent = "holeagent/5.0"
[xiaobai@xiaobai curl]$ awk '/^[Uu][Rr][Ll]/{print;print "output = dummy/"++k;next}1' curl_config  > curl_config2 
[xiaobai@xiaobai curl]$ cat curl_config2
URL = "www.google.com"
output = dummy/1
user-agent = "holeagent/5.0"

url = "m12345.google.com"
output = dummy/2
user-agent = "holeagent/5.0"

URL = "googlevideo.com"
output = dummy/3
user-agent = "holeagent/5.0"
[xiaobai@xiaobai curl]$
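For readers less used to awk, the one-liner can be spelled out rule by rule. The sketch below is meant to behave the same way; the function name add_outputs is made up, and the increment is parenthesised as (++k) only to make the string concatenation unambiguous.

```shell
#!/bin/bash
# add_outputs: print config file $1 with an "output = dummy/N" line inserted
# after every line that starts with url/URL (any letter case).
add_outputs() {
    awk '
        /^[Uu][Rr][Ll]/ {                  # a URL line
            print                          # keep the URL line itself
            print "output = dummy/" (++k)  # add the next numbered output file
            next                           # skip the pass-through rule below
        }
        1                                  # every other line is printed unchanged
    ' "$1"
}

# Usage (assumes curl_config exists): add_outputs curl_config > curl_config2
```

Since the counter k starts out empty and is pre-incremented, the first URL gets dummy/1, the second dummy/2, and so on, independent of any other options between the URL lines.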

Next, run mkdir dummy to create a directory to hold these temporary files. Then start an inotifywait session (replace sed '/google/q' with your sed '/mortgage/q'):

[xiaobai@xiaobai curl]$ rm -r dummy; mkdir dummy;
[xiaobai@xiaobai curl]$ rm final 
[xiaobai@xiaobai curl]$ inotifywait -m dummy -e close_write | while read path action file; do echo "[$file]">> final ; sed '/google/q' "$path$file" >> final; echo "$path$file"; rm "$path$file"; done;
Setting up watches.
Watches established.
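The watcher loop above does three things per finished file: it writes a delimiter, appends the truncated page to final, and deletes the download. A sketch of the same idea as a small script follows. The function names process_download and watch_dir are hypothetical, and it asks inotifywait for one full path per event with --format '%w%f' rather than parsing the default three-column output.

```shell
#!/bin/bash
# process_download: handle one finished download.
#   $1 = downloaded file, $2 = sed pattern, $3 = combined output file
process_download() {
    local file=$1 pattern=$2 out=$3
    printf '[%s]\n' "$(basename "$file")" >> "$out"   # delimiter, e.g. [1]
    sed "/$pattern/q" "$file" >> "$out"               # page up to the first match
    rm -f -- "$file"                                  # free the disk space
}

# watch_dir: run process_download on every file that is closed after writing
# in directory $1, then print the path that was handled. Runs until interrupted.
watch_dir() {
    local dir=$1 pattern=$2 out=$3 file
    inotifywait -m -q -e close_write --format '%w%f' "$dir" |
    while IFS= read -r file; do
        process_download "$file" "$pattern" "$out"
        printf '%s\n' "$file"
    done
}

# Usage (assumes the dummy directory exists): watch_dir dummy mortgage final
```

Keeping the per-file work in its own function makes it possible to test that part on an ordinary file without inotify-tools installed.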

Open another bash/terminal session. rm the final file if it exists, then run curl with the curl_config2 file created in the first step above:

[xiaobai@xiaobai curl]$ curl -vLK curl_config2
...processing

Now look at the inotifywait session: as soon as each file has finished being written (close_write), the loop prints its path, runs sed on it, and deletes it:

[xiaobai@xiaobai curl]$ inotifywait -m dummy -e close_write | while read path action file; do echo "[$file]">> final ; sed '/google/q' "$path$file" >> final; echo "$path$file"; rm "$path$file"; done;
Setting up watches.
Watches established.
dummy/1
dummy/3

Finally, look at the output file named final. The [1] and [3] delimiters in it are produced by the echo "[$file]">> final above.
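Since every page's text sits under its own [N] delimiter in final, a single URL's portion can be pulled back out and grepped on its own. The following is a rough sketch; section_of is a made-up helper, and it assumes that no line of page content consists solely of a number in square brackets, because such a line would be mistaken for a delimiter.

```shell
#!/bin/bash
# section_of: print the lines that follow the "[N]" delimiter in file $1,
# stopping at the next delimiter line. $2 is the number N.
section_of() {
    awk -v tag="[$2]" '
        $0 == tag        { on = 1; next }   # start after the requested delimiter
        /^\[[0-9]+\]$/   { on = 0 }         # any other delimiter ends the section
        on                                  # print while inside the section
    ' "$1"
}

# Usage (assumes the final file described above): section_of final 3 | grep -i google
```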

The reason for deleting each file immediately is that I assume the output files are large and there are many URLs still to process, so deleting them right away saves disk space.

Answer