An easy way to extract data from HTML

Question 1

You can use sed for this:

$ cat test

<td><a href="http://help.domain.com " target="_blank">help.domain.com</a></td>
<td><a href="http://hello.domain.com " target="_blank">hello.domain.com</a></td>
<td><a href="http://test.domain.com " target="_blank">test.domain.com</a></td>

$ sed 's/^.*">//;s/<.*//' test

help.domain.com
hello.domain.com
test.domain.com
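
The two substitutions read as: delete everything up to the last `">` on the line, then delete everything from the first remaining `<`. A minimal Python sketch of the same idea, assuming one link per line as in the sample (the function name `link_text` is made up for illustration):

```python
import re

def link_text(line):
    """Mimic sed 's/^.*">//;s/<.*//' on a single line of input."""
    # Greedy match: drop everything up to and including the last '">'.
    line = re.sub(r'^.*">', "", line)
    # Drop everything from the first remaining '<' to the end of the line.
    return re.sub(r"<.*", "", line)

# Sample rows copied from the test file above.
sample = [
    '<td><a href="http://help.domain.com " target="_blank">help.domain.com</a></td>',
    '<td><a href="http://hello.domain.com " target="_blank">hello.domain.com</a></td>',
    '<td><a href="http://test.domain.com " target="_blank">test.domain.com</a></td>',
]

for row in sample:
    print(link_text(row))
```

Like the sed one-liner, this depends on the exact layout of the markup and will likely break if a line holds more than one link.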

Question 2

You can use awk:

awk -F'">|</' '{ print $2 }' file

Output:

help.domain.com
hello.domain.com
test.domain.com

Question 3

You could also try lynx:

lynx -dump -listonly -nonumbers  http://example.com/data/123 | awk -F'[/:]+' '{print $2}'

cat file.html

<td><a href="http://help.example.com " target="_blank">help.example.com</a></td>
<td><a href="http://hello.example.com " target="_blank">hello.example.com</a></td>
<td><a href="http://test.example.com " target="_blank">test.example.com</a></td>

lynx -dump -listonly -nonumbers  file.html | awk -F'[/:]+' '{print $2}'

Output:

help.example.com
hello.example.com
test.example.com
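
The lynx pipeline lists every link target and then keeps only the host part. As a rough illustration of that same idea (not the answer's method), the sketch below does it in Python with only the standard library. The names `HrefCollector` and `hosts` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # The sample hrefs carry a trailing space; strip it.
                    self.hrefs.append(value.strip())

def hosts(html):
    """Return the host name of each link target found in the HTML text."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    names = [urlsplit(href).hostname for href in parser.hrefs]
    # Relative links have no host; skip them.
    return [name for name in names if name]

# Sample markup copied from file.html above.
sample_html = """
<td><a href="http://help.example.com " target="_blank">help.example.com</a></td>
<td><a href="http://hello.example.com " target="_blank">hello.example.com</a></td>
<td><a href="http://test.example.com " target="_blank">test.example.com</a></td>
"""

for host in hosts(sample_html):
    print(host)
```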

Question 4

If this is a one-time task, the other answers are fine.

Otherwise, use a proper XML or HTML parser.

For example, with BeautifulSoup:

curl -X POST http://example.com/data/123 | python -c '
from bs4 import BeautifulSoup
import sys
soup=BeautifulSoup(sys.stdin,"lxml")
for a in soup.find_all("a"):
  print(a.string)
'

Output:

help.example.com
hello.example.com
test.example.com

You may need to install bs4 with pip first.

Of course, you don't need curl at all; you can request the page directly from Python.
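
As a rough sketch of fetching from Python directly, the snippet below uses `urllib.request` from the standard library. To stay dependency-free it swaps BeautifulSoup for the standard `html.parser` module, so it is an alternative to the code above rather than a copy of it. The helper names (`LinkTextParser`, `link_texts`, `fetch`) are illustrative only.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkTextParser(HTMLParser):
    """Collect the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._in_link = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._buf = []

    def handle_data(self, data):
        if self._in_link:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.texts.append("".join(self._buf).strip())
            self._in_link = False

def link_texts(html):
    """Return the text of every link in an HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts

def fetch(url):
    """Download a page and decode it (assumes UTF-8 if no charset is sent)."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `link_texts(fetch("http://example.com/data/123"))` would then return the link texts as a list. Note that `urlopen(url)` without a data argument sends a GET request, whereas the curl command above sends a POST.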

Answer