Reconstructing a CSV file with missing columns

Question 1

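The question as now rewritten is conceptually easy to answer. You have a set of tags that may or may not appear in each row of data. You want to read each row and go through its columns in order, checking whether the tag in the current column is the expected one. If it is not, insert an empty cell and move on to the next expected tag. When you reach the end of the list of expected tags, output the reconstructed row.

Here is some pseudocode that you can implement in the language of your choice: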

read the first row
split the text on commas to create the array of expected tags
for each remaining row
    split the text on commas to create a row data array
    start at the first column of the row data
    for each expected tag
        check the current column in the row's data
        if the column's tag matches the expected tag
            write the column data to the output
            advance the current column in the row data
        else
            write a blank column to the output
    terminate the output line
exit when there is no more data
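
As a rough illustration, the pseudocode above could be sketched in Python as follows. This is only one possible reading: it assumes that a cell "matches" when it begins with the tag name followed by whitespace (for example `Name Alex`), and the function name `rebuild_rows` and the sample rows, including the e-mail addresses, are invented for the example.

```python
def rebuild_rows(lines):
    """Yield CSV lines with a blank cell wherever an expected tag is missing."""
    rows = iter(lines)
    # The first row lists the expected tags, in order.
    expected = [tag.strip() for tag in next(rows).split(",")]
    yield ",".join(expected)

    for line in rows:
        cells = [cell.strip() for cell in line.split(",")]
        current = 0  # index of the next unconsumed cell in this row
        out = []
        for tag in expected:
            # Assumption: a cell matches when it starts with the tag name
            # followed by a space, e.g. "Name Alex".
            if current < len(cells) and cells[current].startswith(tag + " "):
                out.append(cells[current][len(tag):].strip())
                current += 1
            else:
                out.append("")  # tag missing from this row: blank column
        yield ",".join(out)


# Hypothetical sample data in the "tag value" layout.
sample = [
    "Name, Date, Address, Email",
    "Name Alex, Date Sept 3, Address 123 Madeup, Email alex@example.com",
    "Name Jenn, Date Sept 4, Email jenn@example.com",
]
for row in rebuild_rows(sample):
    print(row)
```

The second data row has no Address cell, so the sketch emits an empty field in that position and still places the Email value in the last column.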

Answer

Question 2

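I've just noticed that every column in your data actually begins with its column name. I seem to have missed that when I first read your question. That makes it not only possible to reformat the data, but quite easy.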

#!/usr/bin/perl

use strict;
my @headers; # array to hold the headers in the order they were seen.
my @search;  # array to hold a copy of @headers sorted by string length

while(<>) {
  chomp;    # remove newline character at end-of-line

  if ($. == 1) {
    next if (scalar @headers); # only process headers for first file
    # Split the first line into @headers array, removing any
    # leading or trailing spaces from each column
    @headers = split '\s*,\s*';

    # In case one key might be a substring of another key, copy the
    # @headers array, sorted by length, so we can compare the data
    # with the longest header names first.
    @search = sort { length($b) <=> length($a) } @headers;

    print join(",", @headers), "\n";

  } else {
    my %columns = ();
    # Loop over each column of the input line (row), inserting it into
    # the %columns hash, using the appropriate column name as the key.
    foreach my $c (split '\s*,\s*') {
      my $found = 0;
      foreach my $h (@search) {
        # If the current column ($c) begins with a header
        # name ($h), we've found the right key for it.
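        # \Q...\E quotes any regex metacharacters in the header name.
        if ($c =~ s/^\Q$h\E\s+//i) { # match and remove header from column
        #if ($c =~ m/^\Q$h\E\s+/i) { # or just match without removing header
          $columns{$h} = $c;
          $found = 1;
          last; # stop at the first (longest) matching header
        };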
      };
      warn "Unknown column '$c' in line $. of $ARGV\n" if
        ($c ne '' && ! $found);
    };

    # Output every column in the same order as in the header line.
    # Columns not actually present in a row are output as an empty field
    print join(",", @columns{@headers}), "\n";
  };

  # Reset the line counter at the end of each input file if
  # there's more than one
  close(ARGV) if eof;
}

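The regular expression that matches each column is case-insensitive. If your data contains columns whose names differ only in case (upper-case or mixed-case versions of the same name), remove the /i modifier from the regex.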
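
To make the matching rules concrete, here is a small Python sketch (not the Perl script itself) of the same idea: try the longest header names first and optionally ignore case. The helper name `find_header` and the overlapping header names are hypothetical, chosen only to show why the order and the case flag matter.

```python
import re


def find_header(cell, headers, ignore_case=True):
    """Return the header that prefixes `cell`, trying longer names first."""
    flags = re.IGNORECASE if ignore_case else 0
    for header in sorted(headers, key=len, reverse=True):
        # re.escape keeps punctuation in a header from acting as regex syntax.
        if re.match(re.escape(header) + r"\s+", cell, flags):
            return header
    return None


headers = ["Date", "Date of birth"]  # hypothetical overlapping names
print(find_header("Date of birth 1 Jan", headers))
print(find_header("date Sept 3", headers))
print(find_header("date Sept 3", headers, ignore_case=False))
```

With these inputs the first call picks the longer name `Date of birth` rather than `Date`, the second matches `Date` despite the lower-case spelling, and the third finds nothing because case is no longer ignored.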

Save it under a suitable name, e.g. ./fix-data.pl, and make it executable with chmod +x ./fix-data.pl.

Sample output:

$ ./fix-data.pl datafile.txt 
Name,Date,Address,Email
Alex,Sept 3,123 Madeup,[email protected]
Jenn,Sept 4,,[email protected]

Or, using the commented-out alternative if statement:

$ ./fix-data.pl datafile.txt 
Name,Date,Address,Email
Name Alex,Date Sept 3,Address 123 Madeup,Email [email protected]
Name Jenn,Date Sept 4,,Email [email protected]

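I don't know why anyone would want the second form, since the column names are already in the header line and every column of each output line is in the correct order. But if that is what you want, it is easy to do.

You can also pipe the output through column to format it as a table with equal-width columns: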

$ ./fix-data.pl datafile.txt | column -t -s , -o ', '
Name, Date  , Address   , Email
Alex, Sept 3, 123 Madeup, [email protected]
Jenn, Sept 4,           , [email protected]

In my opinion it is more readable with ' | ' as column's output separator (and it can still be imported into a spreadsheet or parsed by other programs easily):

$ ./fix-data.pl datafile.txt | column -t -s , -o ' | '
Name | Date   | Address    | Email
Alex | Sept 3 | 123 Madeup | [email protected]
Jenn | Sept 4 |            | [email protected]

column can also output the data as valid JSON, for example:

$ ./fix-data.pl datafile.txt |
  tail -n +2 |
  column --json -s , \
      --table-columns "$(sed -n -e '1s/ *, */,/gp' datafile.txt)"
{
   "table": [
      {
         "name": "Alex",
         "date": "Sept 3",
         "address": "123 Madeup",
         "email": "[email protected]"
      },{
         "name": "Jenn",
         "date": "Sept 4",
         "address": null,
         "email": "[email protected]"
      }
   ]
}

(On Debian, at least, column is in the bsdextrautils package. On other distributions it is in the util-linux package.)
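
If column with JSON support is not available, a similar conversion can be sketched with Python's standard csv and json modules. This is an alternative, not what column does internally: the function name `csv_to_json` and the sample row are made up for the example, the keys keep the capitalisation of the header line, and the result is a plain array without the "table" wrapper.

```python
import csv
import io
import json


def csv_to_json(text):
    """Convert header-plus-rows CSV text to a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(text))
    # Empty cells become None, which json.dumps writes as null.
    records = [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in reader
    ]
    return json.dumps(records, indent=2)


# Hypothetical output of the fix-up step, with an empty Address field.
fixed = "Name,Date,Address,Email\nJenn,Sept 4,,jenn@example.com\n"
print(csv_to_json(fixed))
```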

Miller and datamash are also useful command-line tools once you have converted your data into a proper format.

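Note: this script assumes that the data is in a simple comma-delimited format. It is not properly formed CSV (see, for example, RFC 4180, "Common Format and MIME Type for Comma-Separated Values (CSV) Files"), which allows quoted string fields and commas embedded inside those quoted fields. If your lines contain quoted columns, you will need to use a CSV parser, for example Perl's Text::CSV module, instead of splitting each input line on commas. I don't think that is necessary here, because your data is in a bizarre non-CSV format evidently invented by whoever created it. (If they had known about CSV they would probably have used it... or would have messed up the data even worse than it is now.)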

This caveat applies to any implementation in any language, because the problem is caused by the messed-up data, not by the code.

column does not work with CSV fields that contain commas either.
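
The difference between splitting on commas and using a real CSV parser can be seen in a few lines of Python (shown here instead of Perl's Text::CSV purely for illustration). The input line is invented and assumes a quoted address field with an embedded comma.

```python
import csv
import io

# Hypothetical line with a quoted field that contains a comma.
line = 'Alex,"123 Madeup St, Apt 4",alex@example.com'

naive = line.split(",")                       # splits inside the quotes too
parsed = next(csv.reader(io.StringIO(line)))  # honours the quoting

print(len(naive))   # 4
print(len(parsed))  # 3
print(parsed[1])    # 123 Madeup St, Apt 4
```

The naive split produces four pieces because it cuts the quoted address in two, while the parser returns the three intended fields.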

Answer