Extracting all file names from an index using wget

Question

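The following solution works only with the unformatted, standard directory index generated by apache2. You can download the index file with wget and parse it with grep and cut, as follows.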

# this will download the directory listing index.html file for /folder/
wget the.server.ip.address/folder/

# this will grep for the table rows of the files, remove the top line (parent folder)
# and cut out the necessary fields
grep '</a></td>' index.html | tail -n +2 | cut -d'>' -f7 | cut -d'<' -f1
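For readers who prefer to do the same parsing step in Python, here is a minimal sketch of the pipeline above using only the standard library. The `SAMPLE_INDEX` text and the `list_names` helper are hypothetical and exist only for illustration; the sample assumes the table-style rows that the grep pattern expects, and a real index.html may differ.

```python
import re

# Hypothetical sample of a table-style Apache listing (an assumption, not real server output).
SAMPLE_INDEX = """\
<tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td><td><a href="/">Parent Directory</a></td><td>&nbsp;</td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/text.gif" alt="[TXT]"></td><td><a href="notes.txt">notes.txt</a></td><td align="right">2016-04-01 12:00  </td><td align="right"> 12 </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="image.iso">image.iso</a></td><td align="right">2016-04-01 12:05  </td><td align="right">1.0G</td><td>&nbsp;</td></tr>
"""


def list_names(index_html):
    """Return the link text of each file row, skipping the first row (parent folder)."""
    # Same filter as: grep '</a></td>'
    rows = [line for line in index_html.splitlines() if "</a></td>" in line]
    names = []
    for row in rows[1:]:  # same effect as: tail -n +2
        # Take the anchor text instead of counting '>' fields as cut does.
        match = re.search(r"<a [^>]*>([^<]*)</a></td>", row)
        if match:
            names.append(match.group(1))
    return names


print(list_names(SAMPLE_INDEX))
```

Matching the anchor text directly is slightly less tied to the exact column order than `cut -d'>' -f7`, but it still relies on each file row sitting on a single line.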

As noted above, this works only when the directory listing is generated by an apache2 server configured with the default options, like this:

<Directory /var/www/html/folder>
 Options +Indexes 
 AllowOverride None
 Allow from all
</Directory>

With this configuration, wget saves the directory listing as index.html with no particular formatting applied, but of course the listing can also be customized with options such as:

IndexOptions +option1 -option2 ...

To give a more precise answer (one that fits your situation), I would need a sample index.html file.

Here is a Python version as well.

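from bs4 import BeautifulSoup
import requests

def get_listing():
    # The trailing slash matters: listFD() adds another '/', so each result
    # contains a second '//' right before the file name.
    url = 'http://cdimage.debian.org/debian-cd/8.4.0-live/amd64/iso-hybrid/'
    for file in listFD(url):
        print(file.split("//")[2])

def listFD(url, ext=''):
    # Download the directory index and return one URL per link ending in ext.
    page = requests.get(url).text
    print(page)
    soup = BeautifulSoup(page, 'html.parser')
    # href=True skips <a> tags without an href, which would otherwise fail on None.
    return [url + '/' + node.get('href') for node in soup.find_all('a', href=True) if node.get('href').endswith(ext)]

def main():
    get_listing()


if __name__ == '__main__':
    main()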

I used this page as a guide.
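The Python version above returns every link on the page, which on a typical Apache index also includes the column-sort links and the parent directory. Below is a rough sketch of one way to filter those out, using only the standard library so that it runs without a network connection. The `LinkCollector` class, the `list_files` function, the sample HTML, and the `example.com` host are all hypothetical; the filtering rules are assumptions about a typical listing, not a general solution.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def list_files(base_url, index_html, ext=""):
    """Return absolute URLs of the files linked from a listing, filtered by extension."""
    parser = LinkCollector()
    parser.feed(index_html)
    files = []
    for href in parser.hrefs:
        # Assumed rules: skip sort links ("?C=N;O=D"), absolute paths such as
        # the parent directory link, and subdirectories (trailing slash).
        if href.startswith(("?", "/")) or href.endswith("/"):
            continue
        if href.endswith(ext):
            files.append(urljoin(base_url, href))
    return files


# Hypothetical listing fragment, for illustration only.
sample_html = (
    '<a href="?C=N;O=D">Name</a>'
    '<a href="/">Parent Directory</a>'
    '<a href="a.iso">a.iso</a>'
    '<a href="b.txt">b.txt</a>'
    '<a href="sub/">sub/</a>'
)
print(list_files("http://example.com/folder/", sample_html, ext=".iso"))
```

To use it against a live server, the text returned by `requests.get(url).text` could be passed in as `index_html`. `urljoin` builds each URL from the base, so no string splitting is needed afterwards to recover the file name.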

Answer 1