wget - How to download recursively and only specific MIME types/extensions (e.g. text only)

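Answer 1

You can specify a list of allowed or disallowed filename patterns.

Allowed:

-A LIST
--accept LIST

Disallowed:

-R LIST
--reject LIST

LIST is a comma-separated list of filename patterns/extensions.

You can use the following reserved characters to specify patterns:

*
?
[
]

Examples:

Only download PNG files: -A png
Do not download CSS files: -R css
Do not download PNG files whose names start with "avatar": -R "avatar*.png" (quoted so the shell does not expand the wildcard)

If a file has no extension, or its name has no pattern you could make use of, you would need MIME type checking, I guess (see Lars Kothoff's answer).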
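
To make the accept/reject rules above more concrete, here is a minimal JavaScript sketch of how such lists could be evaluated. It is a simplified model, not wget's own code: it assumes an element containing a wildcard character is matched as a pattern against the whole filename, any other element is matched as a suffix, and matching is case-sensitive. The function names (`globToRegExp`, `matchesList`, `shouldDownload`) are made up for this illustration.

```javascript
// Illustrative model of -A/-R list matching; not wget's actual implementation.

// Convert a shell-style wildcard pattern (* ? [...]) into an anchored RegExp.
function globToRegExp(pattern) {
    let re = '';
    for (let i = 0; i < pattern.length; i++) {
        const ch = pattern[i];
        if (ch === '*') {
            re += '.*';
        } else if (ch === '?') {
            re += '.';
        } else if (ch === '[') {
            const close = pattern.indexOf(']', i + 1);
            if (close === -1) {
                re += '\\[';                       // unmatched '[' is taken literally
            } else {
                let body = pattern.slice(i + 1, close).replace(/\\/g, '\\\\');
                if (body.startsWith('!')) {
                    body = '^' + body.slice(1);    // [!abc] means "none of a, b, c"
                }
                re += '[' + body + ']';
                i = close;
            }
        } else {
            re += ch.replace(/[.+^${}()|\\\]]/g, '\\$&');  // escape regex metacharacters
        }
    }
    return new RegExp('^' + re + '$');
}

// A list element with a wildcard is a pattern; otherwise it is treated as a suffix.
function matchesElement(element, filename) {
    if (/[*?\[\]]/.test(element)) {
        return globToRegExp(element).test(filename);
    }
    return filename.endsWith(element);
}

function matchesList(list, filename) {
    return list.split(',').some(function (el) {
        return matchesElement(el, filename);
    });
}

// accept / reject are the comma-separated LIST strings; pass null to omit one.
function shouldDownload(filename, accept, reject) {
    if (accept && !matchesList(accept, filename)) return false;
    if (reject && matchesList(reject, filename)) return false;
    return true;
}
```

With this model, `shouldDownload('avatar42.png', null, 'avatar*.png')` is rejected while `shouldDownload('photo.png', null, 'avatar*.png')` is kept. A file named `README` can never match `-A txt`, which is the limitation described above.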

Answer 2

You could try patching wget with this patch (also here) to filter by MIME type. The patch is quite old, though, so it may no longer work.

Answer 3

The new Wget (Wget2) already has this feature:

--filter-mime-type    Specify a list of mime types to be saved or ignored

### `--filter-mime-type=list`

Specify a comma-separated list of MIME types that will be downloaded.  Elements of list may contain wildcards.
If a MIME type starts with the character '!' it won't be downloaded, this is useful when trying to download
something with exceptions. For example, download everything except images:

  wget2 -r https://<site>/<document> --filter-mime-type=*,\!image/*

It is also useful to download files that are compatible with an application of your system. For instance,
download every file that is compatible with LibreOffice Writer from a website using the recursive mode:

  wget2 -r https://<site>/<document> --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)
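
As a rough illustration of the list semantics described above, the following JavaScript sketch evaluates a filter string such as `*,!image/*` against a `Content-Type` value. It is not taken from Wget2: the function names are hypothetical, and it assumes that entries are checked in order, that the last matching entry decides, and that matching is case-insensitive. Only the `*` and `?` wildcards are handled.

```javascript
// Illustrative model of a MIME-type filter list; evaluation order is an assumption.

// Turn a pattern such as "image/*" into an anchored, case-insensitive RegExp.
function mimePatternToRegExp(pattern) {
    const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
    const re = escaped.replace(/\*/g, '.*').replace(/\?/g, '.');
    return new RegExp('^' + re + '$', 'i');
}

// filterList:  the comma-separated option value, e.g. "*,!image/*"
// contentType: the raw Content-Type header value of a response
function mimeTypeAllowed(filterList, contentType) {
    // Drop parameters such as "; charset=utf-8" before matching.
    const mime = contentType.split(';')[0].trim();
    let allowed = false;
    for (const rawEntry of filterList.split(',')) {
        const entry = rawEntry.trim();
        if (entry === '') continue;
        const exclude = entry.startsWith('!');
        const pattern = exclude ? entry.slice(1) : entry;
        if (mimePatternToRegExp(pattern).test(mime)) {
            allowed = !exclude;   // assumption: the last matching entry wins
        }
    }
    return allowed;
}
```

Under these assumptions, `mimeTypeAllowed('*,!image/*', 'image/png')` is `false` and `mimeTypeAllowed('*,!image/*', 'text/html; charset=utf-8')` is `true`, which mirrors the "everything except images" example.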

At the time of writing, Wget2 has not been released yet, but it should be soon. Debian unstable already ships an alpha version.

See https://gitlab.com/gnuwget/wget2 for more information. You can post questions/comments directly to [email protected].

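Answer 4

A completely different approach I tried was to use Scrapy, but it has the same problem! Here is how I solved it: Python Scrapy - mimetype based filter to avoid non-text file downloads?

The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.

What the proxy should do:

Take HTTP requests from Scrapy and send them to the server being crawled, then pass the server's response back to Scrapy; in other words, intercept all HTTP traffic.

For binary files (based on a heuristic you implement), send a 403 Forbidden error to Scrapy and immediately close the request/response. This saves time and traffic, and Scrapy won't crash.

Sample proxy code that actually works!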

var http = require('http');

// Forward proxy: relays each request and rejects non-text responses.
http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Binary response: drop the upstream body and answer 403 instead.
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.write('Binary download is disabled.');
            clientRes.end();
            return; // do not fall through and send headers a second time
        }

        // Text response: relay status, headers and body unchanged.
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with clientReq: ' + e.message);
    });

    proxyReq.end();

}).listen(8080);
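
The `startsWith('text/')` test in the proxy is the simplest possible heuristic. If it turns out to be too strict (JSON or XML feeds, for example, are usually not served as `text/*`), a slightly broader check could look like the sketch below. The `isTextual` function and the `TEXTUAL_TYPES` list are hypothetical additions, not part of the original proxy, and the set of extra types is only an assumption to be adjusted for the site being crawled.

```javascript
// Sketch of a broader "is this response text?" heuristic for the proxy.
// The extra types listed here are an assumption, not an exhaustive list.
var TEXTUAL_TYPES = [
    'application/json',
    'application/xml',
    'application/xhtml+xml',
    'application/javascript'
];

function isTextual(contentTypeHeader) {
    // No Content-Type header: treat as binary, as the startsWith check does.
    if (!contentTypeHeader) return false;

    // Strip parameters such as "; charset=UTF-8" and normalise case.
    var mime = contentTypeHeader.split(';')[0].trim().toLowerCase();

    if (mime.startsWith('text/')) return true;
    // Structured-syntax suffixes, e.g. application/atom+xml, application/ld+json.
    if (mime.endsWith('+xml') || mime.endsWith('+json')) return true;
    return TEXTUAL_TYPES.indexOf(mime) !== -1;
}
```

To use it, the condition in the proxy would become `if (!isTextual(proxyRes.headers['content-type']))`. Note that this also lets `image/svg+xml` through, since SVG is XML text; drop the suffix rule if that is not wanted.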
