Problem accessing robots.txt from a search script

2024-5-22

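I have been using the Perlfect Search script (http://www.perlfect.com/freescripts/search/) on my site for years. A few months ago it stopped working properly, for reasons I can't identify. When I run the indexing script, I get the following errors: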

Loading http://emetnews.org/robots.txt...
Error: Couldn't get 'http://emetnews.org/robots.txt': response code 403
Not using any robots.txt.
Error: Couldn't get 'http://emetnews.org/': response code 403

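The robots.txt file is readily accessible to Google as well as to site visitors. Its permissions are set to 644. I didn't change anything before the script stopped working. I can't contact the script's developers (they haven't updated the script or their website in years). My web host doesn't support "outside" scripts.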

Does anyone know what is going on? I like the layout of the script. It looks very professional when you try it (and it's free).

Result of running curl --user-agent libwww-perl/6.08 http://emetnews.org/robots.txt:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /robots.txt
on this server.</p>
<p>Additionally, a 500 Internal Server Error
error was encountered while trying to use an ErrorDocument to handle the request.</p>
</body></html>
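
To see whether the response depends on the User-Agent header, the same request can be scripted with only that header varied. This is a minimal sketch using Python's standard urllib; the function name status_for and the second agent string curl/8.0 are assumptions for illustration, not part of the original setup.

```python
import urllib.error
import urllib.request


def status_for(url, user_agent, opener=urllib.request.urlopen):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is what we compare.
        return err.code


# Example (needs network access, so it is left commented out).
# "curl/8.0" is a stand-in for a non-libwww agent string (an assumption).
# for agent in ("libwww-perl/6.08", "curl/8.0"):
#     print(agent, status_for("http://emetnews.org/robots.txt", agent))
```

If the two agents get different status codes for the same URL, the block is being applied by User-Agent rather than by file permissions.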

Result of running curl http://emetnews.org/robots.txt:

User-agent: Mediapartners-Google
Disallow:

Sitemap: http://emetnews.org/sitemap.xml

# User-agent: Browsershots
# Disallow:

User-agent: NinjaBot
Allow: /

User-agent: *
Disallow: /_lee/
Disallow: /blosxom/flavours/
Disallow: /blosxom/plugins/
Disallow: /contact/
Disallow: /cgi-bin/
Disallow: /feedback/
Disallow: /img/
Disallow: /includes/
# Disallow: /javascript/
Disallow: /lastrss/
# Disallow: /media/
Disallow: /mp3s/
Disallow: /print/
Disallow: /r/
Disallow: /sendPage/
# Disallow: /style/
Disallow: /talkback/
Disallow: /trip/
# block any URL that includes a ?
Disallow: /*?

# Disallowing the robot from Alexa from listing files in the Internet Archive
User-agent: ia_archiver
Disallow: /
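
One way to rule out the robots.txt rules themselves is to feed them to a parser and ask whether a given agent may fetch a given path. The sketch below uses Python's standard urllib.robotparser on an excerpt of the file above (not the full file). The helper name allowed is hypothetical, and the agent string is taken from the curl test, on the assumption that the indexer identifies itself the same way. Python's parser may not interpret every rule exactly as the Perl indexer does.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the robots.txt shown above (not the full file).
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /contact/
Disallow: /cgi-bin/

User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if the parsed rules let `agent` fetch `path` on the site."""
    return parser.can_fetch(agent, "http://emetnews.org" + path)


print(allowed("libwww-perl/6.08", "/"))          # falls under "User-agent: *"
print(allowed("libwww-perl/6.08", "/cgi-bin/"))  # listed under Disallow
print(allowed("ia_archiver", "/"))               # has its own "Disallow: /"
```

Under these rules only ia_archiver is barred from the site root, so the robots.txt content by itself would not explain a failure to fetch http://emetnews.org/.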

Thanks for your help. My .htaccess file contains the following:

# Blocks access from libwww-perl user-agents and URLS which include the command "=http:"
RewriteCond %{HTTP_USER_AGENT} libwww [NC,OR]
RewriteCond %{QUERY_STRING} ^(.*)=http [NC]
RewriteRule ^(.*)$ - [F,L]
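
The effect of these three lines can be sketched in Python to show which requests they reject. This is an illustration of the matching logic only, not how Apache implements it: [NC] is modelled as a case-insensitive match, [OR] as a logical or, and [F] as a 403 response. The function name is_forbidden and the sample inputs other than libwww-perl/6.08 are assumptions.

```python
import re

# The two RewriteCond patterns from the .htaccess block; [NC] -> IGNORECASE.
UA_PATTERN = re.compile(r"libwww", re.IGNORECASE)
QUERY_PATTERN = re.compile(r"^(.*)=http", re.IGNORECASE)


def is_forbidden(user_agent, query_string=""):
    """Mimic the rule: if either condition matches ([OR]), [F] returns 403."""
    return bool(
        UA_PATTERN.search(user_agent) or QUERY_PATTERN.search(query_string)
    )


print(is_forbidden("libwww-perl/6.08"))                       # True
print(is_forbidden("curl/8.0"))                               # False
print(is_forbidden("Mozilla/5.0", "url=http://example.com"))  # True
```

Any client whose User-Agent contains "libwww" is rejected for every URL, robots.txt included, which is consistent with the 403 the curl test received when it sent libwww-perl/6.08.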

I commented that out, and now I get the text of the robots.txt file when I run:

curl --user-agent libwww-perl/6.08 http://emetnews.org/robots.txt

However, when I run the indexer, the "Couldn't get" message for the robots.txt file no longer appears, but it now reports that it is ignoring the documents. ???
