If condition in an awk script

Question 1

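The following assumes that when you say "any of the size/date/repo-name/repo-path has no value" you mean a line such as `repo-name=` with nothing after the `=`, not that some blocks have no `repo-name=` line at all.

Here is how to really do what you want using awk, plus `column` to set the final column spacing: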

$ cat tst.sh
#!/usr/bin/env bash

awk '
BEGIN { OFS="\t" }
{
    sub(/^@/,"")                  # instead of `| tr -d @`
    ++numTags
    tag = val = $0
    sub(/ *=.*/,"",tag)
    sub(/[^=]+= */,"",val)
    tags[numTags] = tag
    vals[numTags] = val
}
numTags == 4 {
    if ( !doneHdr++ ) {
        for ( i=1; i<=numTags; i++ ) {
            tag = ( tags[i] == "date" ? "creationTime" : tags[i] )  # instead of `| sed s/date/creationTime/`
            printf "%s%s", tag, (i<numTags ? OFS : ORS)
        }
    }
    vals[3] = substr(vals[3],1,10)     # instead of `| awk {$3=substr($3,0,10)}1`
    for ( i=1; i<=numTags; i++ ) {
        val = ( vals[i] == "" ? 0 : vals[i] )
        printf "%s%s", val, (i<numTags ? OFS : ORS)
    }
    numTags = 0
}
' "${@:--}" |
column -s$'\t' -t

$ cat file
size=190000
date=1603278566981
repo-name=testupload
repo-path=
size=140000
date=1603278566981
repo-name=
repo-path=/home/test/testupload2
size=
date=1603278566981
repo-name=testupload3
repo-path=/home/test/testupload3

$ ./tst.sh file
size    creationTime   repo-name   repo-path
190000  1603278566981  testupload  0
140000  1603278566981  0           /home/test/testupload2
0       1603278566981  testupload  /home/test/testupload3
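
For comparison, the alignment could also be done inside awk instead of piping to `column`, at the cost of holding all rows in memory. The following is only a minimal sketch of that 2-pass idea, assuming tab-separated input that fits in memory; the function name `align_tsv` is hypothetical and not part of the script above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: align tab-separated input using awk alone.
# Pass 1 (main block) stores every cell and tracks the widest cell per column.
# Pass 2 (END block) prints each cell padded to its column width.
align_tsv() {
    awk -F'\t' '
    {
        nf[NR] = NF
        for ( i=1; i<=NF; i++ ) {
            cell[NR,i] = $i
            if ( length($i) > width[i] ) width[i] = length($i)
        }
    }
    END {
        for ( r=1; r<=NR; r++ ) {
            for ( i=1; i<=nf[r]; i++ ) {
                if ( i < nf[r] ) {
                    fmt = "%-" width[i] "s  "   # left-justify, two-space gap
                    printf fmt, cell[r,i]
                } else {
                    printf "%s\n", cell[r,i]    # last column is not padded
                }
            }
        }
    }'
}

printf 'size\trepo-name\n190000\ttestupload\n0\tx\n' | align_tsv
```

Using `column` keeps the awk script streaming, which is why the script above delegates the spacing to it.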

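Changes from your existing code:

awk no longer reads the whole file into memory at once. I expect `column` has to do that to work out the spacing; the alternative would be to read all of the input into memory in awk and use a 2-pass approach to find the maximum field length in each column and then `printf` with that width.
It no longer relies on the values of the data (other than the mapping of `date` to `creationTime` for the header line, which you currently do with a `sed` pipe), only on there being 4 data lines at a time. If it is more useful, this is easy to change to trigger on hitting a specific tag line, e.g. change `numTags == 4` to `tag == "repo-path"`.
It no longer pipes to `sed` to change `date` to `creationTime` in the column headers. Besides needing an extra pipe and command, that would break if the input contained the string `date`, e.g. `repo-path=/home/date/uploadX`.
It no longer uses `=` as the FS value, as that would fail if the input contained `=` in a value, e.g. `repo-path=/home/foo=bar/uploadX`.
To remove all `@`s from your data you would use `gsub(/@/,"")` instead of piping the output to `tr -d @`, but it seems you really only want to do this for the header names (tags), otherwise either would do. Removing every `@` could corrupt data that contains `@`s, e.g. `repo-path=/home/foo@bar/uploadX`, so I included `sub(/^@/,"")` to remove only an `@` at the start of the tag.
To truncate the 3rd field to 10 chars, rather than adding a pipe to a second awk script, I included `substr(vals[3],1,10)` before the loop that prints `vals[]`. Note that the 2nd arg to `substr()` starts at `1`, not `0`.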
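
Two of the points above are easy to check in isolation. The snippet below is only an illustrative sketch: the helper name `split_first_eq` and the `|` used to separate tag and value in its output are made up for the demonstration, and the sample lines are the ones quoted above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: split each tag=value line on the FIRST "=" only,
# using the same two sub() calls as tst.sh, and print "tag|value".
split_first_eq() {
    awk '{
        tag = val = $0
        sub(/ *=.*/, "", tag)       # drop everything from the first "=" on
        sub(/[^=]+= */, "", val)    # drop the tag and the first "="
        print tag "|" val
    }'
}

# An "=" inside the value survives:
printf 'repo-path=/home/foo=bar/uploadX\n' | split_first_eq
# prints: repo-path|/home/foo=bar/uploadX

# With FS set to "=", the value is cut at the second "=":
printf 'repo-path=/home/foo=bar/uploadX\n' | awk -F= '{ print $1 "|" $2 }'
# prints: repo-path|/home/foo

# substr() positions are 1-based, so start at 1 to get the first 10 chars:
printf '1603278566981\n' | awk '{ print substr($0, 1, 10) }'
# prints: 1603278566
```

What `substr()` returns for a start position of 0 can differ between awk implementations, which is another reason to start at 1.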

Question 2

If the last field is empty, you can set it to 0 with

if ($NF == "") $NF = 0

so you would get something like

/^@repo-name/ {
  if (++count2 == 1) header = header OFS $1 ","
  if ($NF == "") $NF = 0

  repoNameArr[count] = $NF
  next
}

or, to avoid duplicating code,

$NF == "" { $NF = 0 }

# ...

/^@repo-name/ {
  if (++count2 == 1) header = header OFS $1 ","
  repoNameArr[count] = $NF
  next
}

(There are no lines matching `^@repo-name` in your data, though.)
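
To see the `$NF == "" { $NF = 0 }` rule on its own, here is a small sketch. It assumes the field separator is `=` and that every line has the form tag=value; both are assumptions on my part, and the function name `zero_fill_last` is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: replace an empty last field with 0.
# Assumes FS is "=" and every input line looks like tag=value.
zero_fill_last() {
    awk -F= -v OFS== '
    $NF == "" { $NF = 0 }   # assigning to a field rebuilds $0 using OFS
    { print }
    '
}

printf 'size=\nrepo-name=testupload\n' | zero_fill_last
# prints:
# size=0
# repo-name=testupload
```

Because the rule has no tag-specific pattern, it runs for every line before any later rule such as the `/^@repo-name/` block sees the record.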

In this case, I would probably opt for a simpler approach. Assuming each record is always four lines, you can rearrange the data into four tab-delimited columns with `paste`:

$ cat file
size=
date=1603278566981
repo-name=testupload
repo-path=/home/test/testupload
size=140000
date=
repo-name=testupload2
repo-path=/home/test/testupload2
size=170000
date=1603278566981
repo-name=
repo-path=/home/test/testupload3
size=170000
date=1603278566981
repo-name=testupload3
repo-path=/home/test/testupload3

$ paste - - - - <file
size=   date=1603278566981      repo-name=testupload    repo-path=/home/test/testupload
size=140000     date=   repo-name=testupload2   repo-path=/home/test/testupload2
size=170000     date=1603278566981      repo-name=      repo-path=/home/test/testupload3
size=170000     date=1603278566981      repo-name=testupload3   repo-path=/home/test/testupload3

You can then convert this to CSV using `mlr` (Miller):

$ paste - - - - <file | mlr --ifs tab --ocsv cat
size,date,repo-name,repo-path
,1603278566981,testupload,/home/test/testupload
140000,,testupload2,/home/test/testupload2
170000,1603278566981,,/home/test/testupload3
170000,1603278566981,testupload3,/home/test/testupload3

`mlr` can also replace the missing values with zeros:

$ paste - - - - <file | mlr --ifs tab --ocsv put 'for (k,v in $*) { is_null(v) { $[k] = 0 } }'
size,date,repo-name,repo-path
0,1603278566981,testupload,/home/test/testupload
140000,0,testupload2,/home/test/testupload2
170000,1603278566981,0,/home/test/testupload3
170000,1603278566981,testupload3,/home/test/testupload3
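
If Miller is not installed, roughly the same result can be had with `paste` and awk. This is only a sketch under the same four-lines-per-record assumption; it does no CSV quoting, so it also assumes the values contain no commas, quotes or tabs. The function name `records_to_csv` is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: turn 4-line tag=value records into CSV,
# writing 0 for empty values. No CSV quoting is attempted.
records_to_csv() {
    paste - - - - | awk -F'\t' -v OFS=, '
    {
        hdr = ""
        for ( i=1; i<=NF; i++ ) {
            tag = val = $i
            sub(/=.*/, "", tag)        # text before the first "="
            sub(/[^=]*=/, "", val)     # text after the first "="
            hdr = hdr (i>1 ? OFS : "") tag
            $i = ( val == "" ? 0 : val )
        }
        if ( NR == 1 ) print hdr       # header comes from the first record
        print
    }'
}

printf '%s\n' 'size=' 'date=1603278566981' \
    'repo-name=testupload' 'repo-path=/home/test/testupload' | records_to_csv
# prints:
# size,date,repo-name,repo-path
# 0,1603278566981,testupload,/home/test/testupload
```

Miller remains the more robust choice, since it handles quoting and the other output formats for you.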

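To get tab-separated values (TSV) instead of CSV, use `--otsv` in place of `--ocsv`. You can likewise use `--ojson` or `--opprint` to get JSON or "pretty-printed" tabular output.

The above assumes that your input data is similar to the data in the question. If the data in the question is a processed variant of data in some structured format (such as XML or JSON), it would be better to work with the original data directly.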
