How to convert HTML to a table using the shell

Question 1

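The following should more or less do the trick. Mind you:

I wrote it without testing. EDIT: I have now tested it and fixed a few bugs, so it seems to work.
I am ignoring edge cases (multiple <h1> elements, <tbody> inside table fields, and so on).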

Put it in "scriptname.pl", change the file names on lines 2 and 3, and run it with: perl scriptname.pl

#!/usr/bin/perl
open my $ifh, '<', 'inputfilename.html' or die "Cannot read input file: $!";
open my $ofh, '>', 'outputfilename.html' or die "Cannot write output file: $!";
while(<$ifh>) {
  if(/<h1>(.*)<\/h1>/) {
    my $header = << "END";
  <table>
    <caption>$1</caption>
    <thead>
        <tr>
            <th>Hard Code</th>
            <th>Hard Code</th>
            <th>Hard Code</th>
            <th>Hard Code</th>
            <th>Hard Code</th>
            <th>Hard Code</th>
        </tr>
    </thead>
    <tbody>
END
    print $ofh $header;
  } elsif(/<div class="row">/) {
    print $ofh "<tr>\n";
  } elsif(/<\/div>/) {
    print $ofh "</tr>\n";
  } elsif(/<p class=".*?">(.*)<\/p>/) {
    print $ofh "<td>$1</td>\n";
  } elsif(/<\/body>/) {
    print $ofh "</tbody>\n</table>\n</body>\n";
  } else {
    print $ofh $_;
  }
}
close $ofh;
close $ifh;
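
For readers who want to see the per-line mapping in isolation, here is a minimal sketch of the three row-level rules from the Perl script, written with sed instead of Perl. The function name to_rows and the sample input are assumptions made for illustration. It leaves out the <h1> header and the closing </body> handling, and unlike the Perl script it substitutes inside the line instead of replacing the whole line.

```shell
#!/bin/bash
# Hypothetical helper: applies the same three tag rewrites as the Perl script.
to_rows() {
    sed -E \
        -e 's#<div class="row">#<tr>#' \
        -e 's#</div>#</tr>#' \
        -e 's#<p class="[^"]*">(.*)</p>#<td>\1</td>#'
}

# Sample input with the structure the Perl script expects (one tag per line).
out=$(printf '%s\n' \
    '<div class="row">' \
    '<p class="a">Text 1</p>' \
    '<p class="b">Text 2</p>' \
    '</div>' | to_rows)

printf '%s\n' "$out"
```

Like the Perl version, this only works when each tag sits on its own line.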

Question 2

Rebuilding the table is harder if you try to extract the cells one at a time.

It is easy using only bash and pup, as follows:

#!/bin/bash

count=$(grep '<div ' demo.html | wc -l)
page_title=$(cat demo.html | pup 'body h1 text{}')

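tbody() {
    for ((i=1;i<count+1;++i)); do
        # Join the non-blank text lines of row $i with commas; IFS=, makes the
        # unquoted $row below split on commas into one argument per cell.
        IFS=, row=$(cat demo.html | pup "body div.row:nth-of-type($i) text{}" | grep '\S' | paste -s -d, -)
        printf "\t\t<tr>\n"
        # printf repeats its format for every argument, giving one <td> per cell.
        printf '\t\t\t<td>%s</td>\n' $row
        printf "\t\t</tr>\n"
    done
}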

cat <<EOF
<table>
    <caption>$page_title</caption>
    <thead>
        <tr>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
        </tr>
    </thead>
    <tbody>
`tbody`
    </tbody>
</table>
EOF

Output

<table>
    <caption>Page Title</caption>
    <thead>
        <tr>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
            <th>Hard Coded</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Text 1</td>
            <td>Text 2</td>
            <td>Text 3</td>
            <td>Text 4</td>
            <td>Text 5</td>
            <td>Text 6</td>
        </tr>
        <tr>
            <td>Text 1</td>
            <td>Text 2</td>
            <td>Text 3</td>
            <td>Text 4</td>
            <td>Text 5</td>
            <td>Text 6</td>
        </tr>
        <tr>
            <td>Text 1</td>
            <td>Text 2</td>
            <td>Text 3</td>
            <td>Text 4</td>
            <td>Text 5</td>
            <td>Text 6</td>
        </tr>
    </tbody>
</table>

Explanation

The idea is to extract the data row by row, looping until the last row. This snippet gives the number of rows:

grep '<div ' demo.html | wc -l
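
As a quick check of what that count means, here is a small sketch run against a made-up input file. The temporary file and its contents are assumptions, since the original demo.html is not shown. Note that the pattern counts every line containing '<div ', not only the row divs, so it matches the row count only when the page has no other divs.

```shell
#!/bin/bash
# Hypothetical stand-in for demo.html with two rows.
tmp=$(mktemp)
cat > "$tmp" <<'HTML'
<body>
<h1>Page Title</h1>
<div class="row">
<p class="a">Text 1</p>
<p class="b">Text 2</p>
</div>
<div class="row">
<p class="a">Text 1</p>
<p class="b">Text 2</p>
</div>
</body>
HTML

# Same pipeline as in the answer.
count=$(grep '<div ' "$tmp" | wc -l)
# grep -c counts matching lines directly and gives the same number.
count_c=$(grep -c '<div ' "$tmp")

echo "rows: $((count))"
rm -f "$tmp"
```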

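Then, using nth-of-type(n) as the selector, you get a whole row instead of individual columns. The result has to be piped through grep '\S' to remove blank lines. Piping that into paste -s -d, - then produces a comma-separated result.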

IFS=, row=$(cat demo.html | pup "body div.row:nth-of-type($i) text{}" | grep '\S' | paste -s -d, -)
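
The filtering and joining stages can be tried without pup. The sketch below feeds them a made-up stand-in for the text pup prints for one row. The sample text is an assumption, and '\S' relies on a grep that supports that escape, as GNU grep does.

```shell
#!/bin/bash
# Hypothetical stand-in for one row of pup output: cell texts separated by
# empty and whitespace-only lines.
sample() {
    printf 'Text 1\n\n   \nText 2\n\nText 3\n'
}

# grep '\S' keeps only lines that contain a non-whitespace character;
# paste -s -d, - joins the remaining lines into a single comma-separated line.
joined=$(sample | grep '\S' | paste -s -d, -)

printf '%s\n' "$joined"
```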

printf '\t\t\t<td>%s</td>\n' $row expands into separate arguments, as in printf '\t\t\t<td>%s</td>\n' 'Text 1' 'Text 2' ..., and each argument is wrapped in <td>...</td>.
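
The expansion step can be seen on its own in the following sketch. The function name wrap_cells and the sample row are assumptions for illustration. It uses local IFS so the comma splitting stays inside the function, which differs slightly from the script, where IFS is set in the subshell that runs tbody.

```shell
#!/bin/bash
# Hypothetical sample row; in the script this string comes from pup.
row='Text 1,Text 2,Text 3'

wrap_cells() {
    local IFS=,   # split on commas only, and only inside this function
    # $1 is deliberately unquoted so the shell splits it into one word per
    # cell; printf reuses its format string for each extra argument.
    printf '<td>%s</td>\n' $1
}

cells=$(wrap_cells "$row")
printf '%s\n' "$cells"
```

Quoting the argument as "$1" inside the function would pass the whole row as a single word and produce only one <td>. Because the expansion is unquoted, a cell containing glob characters such as * could also be expanded against file names.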

You can remove the \t parts altogether; they are only there to print an indented result.

Answer