Formatting a date field in a .CSV file whose string fields contain multiple commas

Answer 1

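You are splitting on commas, yet some of your strings contain commas, so the 9th column you are referencing is not the date you think it is. Add a print of m right after the line that sets it:

m=substr($9,4,3)
print "MY M: " m

Result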

MY M: lum
column1,column2,column3,column4,column5,column6, column7, Column8,00/00/2009, Column10
MY M: me"
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,00/00/2000,"890","88","11-OCT-11","12"
MY M: tho
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455",00/00/2002, name","12","455","12-OCT-11","55"
MY M: me"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,00/00/2000,"333","22","13-OCT-11","232"

I think you need to rethink your approach, or escape the commas contained in the strings.
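To make the mismatch concrete, here is a minimal Ruby sketch that compares a naive comma split with a quote-aware parse. The input row is reconstructed from the outputs shown in this answer, so treat it as an assumption about the original file.

```ruby
require "csv"

# First data row, reconstructed from the outputs shown in this answer.
line = '"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"'

naive  = line.split(",")       # what a plain comma separator does: it also splits inside quotes
parsed = CSV.parse_line(line)  # a CSV parser honours the quotes

puts naive.length    # 13 pieces instead of 10 fields
puts naive[8]        # ' name"' (the tail of "author1, name", not the date)
puts parsed.length   # 10
puts parsed[8]       # 11-OCT-11
```

Characters 4 to 6 of the 9th naive piece are me", which is exactly what the debugging print reports for this row.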

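Fixing it

awk has an odd but useful ability to split on a group of characters. One approach is to split on the three-character sequence "," instead of on a bare comma.

Example (improvement #1)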

$ awk -F'","' '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month[i]]=i
 }
 {
  if(NR==1){print}
  else{ m=substr($9,4,3); print "MY M: " m;
   $9 = sprintf("%02d/%02d/20%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
  print
 } }' OFS="," file.csv

Output

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
MY M: OCT
"12,B000QRIGJ4,4432,string with quotes, and with a comma, and colon: in between,4432,author1, name,890,88,10/11/2011,12"
MY M: OCT
"4432,B000QRIGJ4,890,another, string with quotes, and with more than, two commas: in between,455,author2, name,12,455,10/12/2011,55"
MY M: OCT
"11,B000QRIGJ4,77,string with, commas and (paranthesis) and : colans, in between,12,author3, name,333,22,10/13/2011,232"

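But this is not quite right either. To restore the quotes you have to do some additional cleanup, and then remove the duplicated quotes from the beginning and end of each line.

Example (improvement #2)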

$ awk -F'","' '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month[i]]=i
 }
 { m=substr($9,4,3); print "MY M: " m;
 $9 = sprintf("\"%02d/%02d/20%02d\"",mdigit[m],substr($9,1,2),substr($9,8,20))
 for (i=1; i<=10; i++) printf("\"%s\",",$i); printf("%s\n","")
 }' OFS="," file.csv

Output

MY M: 
"column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10","","","","","","","",""00/00/2000"","",
MY M: OCT
""12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88",""10/11/2011"","12"",
MY M: OCT
""4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455",""10/12/2011"","55"",
MY M: OCT
""11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22",""10/13/2011"","232"",

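I won't take this approach any further, but I hope you can see that it is not a good way to solve the problem: it is a maintenance burden and very fragile if the input changes over time.

Example (improvement #3)

OK, I couldn't just leave it like that. Here is a working example.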

awk -F'","' '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month[i]]=i
 }

 { if (NR==1){print; next} }
 { m=substr($9,4,3)
 $9 = sprintf("%02d/%02d/20%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
 for (i=1; i<=10; i++) printf("\"%s\",",$i); printf("%s\n","")
 }' OFS="," file.csv | sed -e 's/""/"/g' -e 's/,$//'

Output

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","10/11/2011","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","10/12/2011","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","10/13/2011","232"
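The fragility mentioned above is easy to trigger. The Ruby sketch below mimics only the split, re-wrap and cleanup steps of improvement #3 (not the date conversion) on a hypothetical row that has an empty field. The row is made up for illustration and is not part of the question's data.

```ruby
# Hypothetical row with an empty second field (not from the question's data).
line = '"a","","11-OCT-11"'

fields    = line.split('","', -1)                     # same separator as awk -F'","'
rewrapped = fields.map { |f| "\"#{f}\"," }.join       # the printf loop
cleaned   = rewrapped.gsub('""', '"').sub(/,\z/, "")  # the two sed expressions

puts rewrapped  # ""a","","11-OCT-11"",
puts cleaned    # "a",","11-OCT-11"
```

The cleanup cannot tell a doubled quote it created from a legitimately empty field, so the empty field is destroyed and the line no longer matches the input's structure.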

Answer 2

Use a tool that includes a proper CSV parser. For example, with Ruby:

ruby -rcsv -rdate -pe '
  if $. > 1
    row = CSV.parse_line($_)
    row[8] = Date.parse(row[8]).strftime("%Y/%m/%d")
    $_ = row.to_csv(:force_quotes=>true)
  end
' file.csv

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","2011/10/13","232"
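If you would rather not let Date.parse guess the layout, the same idea can be sketched with an explicit format string. The helper name reformat_line is hypothetical, and the format "%d-%b-%y" assumes every date looks like 13-OCT-11.

```ruby
require "csv"
require "date"

# Hypothetical helper: reformat the 9th field of one CSV line.
# Assumes the date is always DD-MMM-YY, e.g. 13-OCT-11.
def reformat_line(line)
  row = CSV.parse_line(line)
  row[8] = Date.strptime(row[8], "%d-%b-%y").strftime("%Y/%m/%d")
  row.to_csv(force_quotes: true)  # quote every field, as in the input
end

line = '"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","13-OCT-11","232"'
puts reformat_line(line)
```

With an explicit format, a value that does not match raises an error instead of being silently misread.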

Answer 3

The simple way

This changes every occurrence of DD-MMM-YYYY to YYYY/MM/DD, wherever it is found.

$ perl -pe 'BEGIN{ @month=qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);
                # @month is indexed from 0, so JAN (index 0) must map to 1
                for ($i=0; $i<12; $i++) {$mdigit{$month[$i]}=$i+1;}
               }
          s#(\d{1,2})-(\w{3})-(\d{2,4})#20$3/$mdigit{$2}/$1#;' foo.csv

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","2011/10/13","232"

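The precise way

This changes the format of field 9 only. With Perl's -a flag, each line is split into fields (as awk does, except that the fields are $F[0],$F[1]...$F[N-1]), and -F sets the field separator, here to ",", so you can do: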

perl -F'\",\"' -lane 'BEGIN{
               @month=qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);
               # @month is indexed from 0, so JAN (index 0) must map to 1
               for ($i=0; $i<12; $i++) {$mdigit{$month[$i]}=$i+1;}
              }
              $F[8]=~s#(\d{1,2})-(\w{3})-(\d{2,4})#20$3/$mdigit{$2}/$1# if $.>1;
              print join("\",\"",@F)' foo.csv

This prints the date as YYYY/MM/DD and assumes (as you did in your question) that all years begin with 20.
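The substitution above interpolates the month number and the day as they are, so a date such as 2-JAN-09 would not come out zero-padded. A minimal Ruby sketch of the same lookup-table idea with padding follows. The names MONTHS, MDIGIT and to_ymd are illustrative, and it assumes two-digit years that all belong to the 2000s.

```ruby
MONTHS = %w[JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC].freeze
# Map each abbreviation to its 1-based month number (JAN => 1 ... DEC => 12).
MDIGIT = MONTHS.each_with_index.map { |name, i| [name, i + 1] }.to_h.freeze
# Only match real month abbreviations, followed by a two-digit year.
DATE_RE = /(\d{1,2})-(#{MONTHS.join("|")})-(\d{2})\b/

# Rewrite the first DD-MMM-YY date in text as YYYY/MM/DD (assumes 20YY).
def to_ymd(text)
  text.sub(DATE_RE) do
    format("20%s/%02d/%02d", $3, MDIGIT.fetch($2), $1.to_i)
  end
end

puts to_ymd("11-OCT-11")  # 2011/10/11
puts to_ymd("2-JAN-09")   # 2009/01/02
```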

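Answer 4

Use Miller (mlr) to clean up the whitespace (some of the headers appear to contain spaces) and to convert the dates in the Column9 field to the correct format. The date conversion is done by first converting the given date to a Unix time with strptime() and then immediately reformatting it to the desired format with strftime().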

mlr --csv \
    clean-whitespace then \
    put '$Column9 = strftime(strptime($Column9, "%d-%b-%y"), "%Y/%m/%d")' file

Given the data in the question, this produces:

column1,column2,column3,column4,column5,column6,column7,Column8,Column9,Column10
12,B000QRIGJ4,4432,"string with quotes, and with a comma, and colon: in between",4432,"author1, name",890,88,2011/10/11,12
4432,B000QRIGJ4,890,"another, string with quotes, and with more than, two commas: in between",455,"author2, name",12,455,2011/10/12,55
11,B000QRIGJ4,77,"string with, commas and (paranthesis) and : colans, in between",12,"author3, name",333,22,2011/10/13,232
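The same two-step conversion (text to Unix time, Unix time to text) can be sketched in Ruby to show what happens to a single value. This is only an illustration of the idea and not how Miller implements it. The function name via_epoch is made up, and UTC is assumed throughout so the day cannot shift.

```ruby
require "date"

# Step 1: parse DD-MMM-YY into seconds since the epoch (midnight UTC).
# Step 2: format those seconds as YYYY/MM/DD.
def via_epoch(text)
  d = Date.strptime(text, "%d-%b-%y")
  seconds = Time.utc(d.year, d.month, d.day).to_i
  Time.at(seconds).utc.strftime("%Y/%m/%d")
end

puts via_epoch("11-OCT-11")  # 2011/10/11
```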

To quote all fields, add --quote-all right after --csv on the command line. By default, Miller quotes only the fields that actually need quoting.

In a prettier format:

+---------+------------+---------+-------------------------------------------------------------------------+---------+---------------+---------+---------+------------+----------+
| column1 | column2    | column3 | column4                                                                 | column5 | column6       | column7 | Column8 | Column9    | Column10 |
+---------+------------+---------+-------------------------------------------------------------------------+---------+---------------+---------+---------+------------+----------+
| 12      | B000QRIGJ4 | 4432    | string with quotes, and with a comma, and colon: in between             | 4432    | author1, name | 890     | 88      | 2011/10/11 | 12       |
| 4432    | B000QRIGJ4 | 890     | another, string with quotes, and with more than, two commas: in between | 455     | author2, name | 12      | 455     | 2011/10/12 | 55       |
| 11      | B000QRIGJ4 | 77      | string with, commas and (paranthesis) and : colans, in between          | 12      | author3, name | 333     | 22      | 2011/10/13 | 232      |
+---------+------------+---------+-------------------------------------------------------------------------+---------+---------------+---------+---------+------------+----------+
