Need to merge multiple .csv files side by side with varying numbers of rows

2024-5-24

awk cut csv-simple paste

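I have 3 to 5 .csv files and need to merge them side by side, keeping each file's data in its own columns. Below is a simple example in which the files have different numbers of rows: File 1 File 2 File 3 File 4 File 5 > final file.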

File 1

1 1  
1 1  
1 1

File 2

2 2 2     
2 2 2

File 3

3
3
3
3

File 4

4  
4

File 5

5
5
5
5
5

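To merge all the files while keeping each one in its own columns, I need the resulting .csv file to look like the one below. In my example, 0 stands for an empty cell/column.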

Final file

1 1 2 2 2 3 4 5       
1 1 2 2 2 3 4 5   
1 1 0 0 0 3 0 5    
0 0 0 0 0 3 0 5  
0 0 0 0 0 0 0 5

Everything I have tried so far slides all the items to the left whenever a cell/column has no data.

Final file

1 1 2 2 2 3 4 5  
1 1 2 2 2 3 4 5   
1 1 3 5  
3 5  
5

Answer 1

With csvkit, use csvjoin with no header row (-H) and with type inference disabled (-I); otherwise a value such as 1 is interpreted as TRUE.

csvjoin -H -I file*

The result so far (it complains with a message about the missing header and adds one):

a,b,a2,b2,c,a2_2,a2_3,a2_4
1,1,2,2,2,3,4,5
1,1,2,2,2,3,4,5
1,1,,,,3,,5
,,,,,3,,5
,,,,,,,5

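How you replace the missing values and drop the delimiter is a matter of taste, but you could loop over the fields in awk, setting the delimiter with -F",", skipping the header with NR>1, and adding 0 to convert the NULL fields.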

csvjoin -H -I file* | awk -F"," 'NR>1{for (i=1; i<=NF; i++) printf ("%s ", $i+0); print""}'

This gets you where you want to be:

1 1 2 2 2 3 4 5 
1 1 2 2 2 3 4 5 
1 1 0 0 0 3 0 5 
0 0 0 0 0 3 0 5 
0 0 0 0 0 0 0 5
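
One caveat with the $i+0 trick: it coerces every field to a number, so non-numeric text would also come out as 0. The following is a minimal sketch, not part of the original answer, that writes 0 only where a field is actually empty. The function name fill_empty and the sample input are made up for illustration.

```shell
# Hypothetical helper: read comma-separated lines on stdin and print them
# space-separated, writing 0 only where a field is empty.
fill_empty() {
  awk -F',' '{
    for (i = 1; i <= NF; i++)
      printf "%s%s", ($i == "" ? "0" : $i), (i < NF ? " " : "\n")
  }'
}

# Made-up sample input: one text field and several empty fields.
printf '1,abc,,5\n,,,5\n' | fill_empty
```

With this input the text field abc passes through unchanged and the empty fields become 0. Choosing the separator per field also avoids the trailing space that printf ("%s ", ...) leaves at the end of each line.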

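Alternatively, a pure awk version, which assumes that the number of fields within each file is constant. It relies on GNU awk (gawk) for ENDFILE and for true multidimensional arrays such as mx[i][j].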

awk '{for (i=1; i<=NF; i++) {mx[FNR][nf+i]=$i}}
  ENDFILE{nf+=NF; nor=(nor<FNR)?FNR:nor}
  END{for (i=1;i<=nor;i++) {for (j=1;j<=nf;j++) printf ("%s ", mx[i][j]+0); print ""}}' file*

Simply load the values from each file into the array mx[][]:

{for (i=1; i<=NF; i++) {mx[FNR][nf+i]=$i}}

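At the end of each file, shift the column offset nf to the right by the number of fields in that file, NF, and set nor to the larger of the current file's record count, FNR, and the matrix height so far, nor: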

ENDFILE{nf+=NF; nor=(nor<FNR)?FNR:nor}

Finally, loop over the matrix dimensions, converting the NULL values:

END{for (i=1;i<=nor;i++) {for (j=1;j<=nf;j++) printf ("%s ", mx[i][j]+0); print ""}}
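
Since ENDFILE and the mx[i][j] syntax are gawk extensions, the same idea can be sketched with POSIX awk features only: FNR == 1 marks the start of each file, and the classic cell[row, col] subscript form stands in for the array of arrays. This is a sketch under those assumptions rather than the answer's own method. The names merge_columns, cell, off, and width are mine, and the single-column sample files are inferred from the expected output in the question.

```shell
# Hypothetical helper: merge whitespace-separated files side by side,
# printing 0 for every cell that a shorter file does not provide.
merge_columns() {
  awk '
    FNR == 1 { off += width; width = 0 }   # first line of a new file: move the column offset
    {
      if (NF > width) width = NF           # widest line seen in the current file
      if (FNR > rows) rows = FNR           # longest file seen so far
      for (i = 1; i <= NF; i++) cell[FNR, off + i] = $i
    }
    END {
      cols = off + width
      for (r = 1; r <= rows; r++) {
        for (c = 1; c <= cols; c++) {
          v = ((r, c) in cell) ? cell[r, c] : "0"
          printf "%s%s", v, (c < cols ? " " : "\n")
        }
      }
    }
  ' "$@"
}

# Sample files shaped like the example in the question.
dir=$(mktemp -d)
printf '1 1\n1 1\n1 1\n'    > "$dir/file1"
printf '2 2 2\n2 2 2\n'     > "$dir/file2"
printf '3\n3\n3\n3\n'       > "$dir/file3"
printf '4\n4\n'             > "$dir/file4"
printf '5\n5\n5\n5\n5\n'    > "$dir/file5"
merge_columns "$dir/file1" "$dir/file2" "$dir/file3" "$dir/file4" "$dir/file5"
rm -r "$dir"
```

Unlike the $i+0 conversion, this writes 0 only for cells that were never set, so existing values are printed as they were read. A file with no lines at all produces no records and therefore contributes no columns.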

Answer 2

% stitch --autocol --ofs="\\t" one two three four five
1       1       2       2       2       3       4       5
1       1       2       2       2       3       4       5
1       1                               3               5
                                        3               5
                                                        5

paste comes close but does not quite get there. For real CSV data, set --ofs=, and --ifs=, but note that splitting on commas makes for a very poor CSV parser.

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long qw(GetOptions);
GetOptions(
  'autocol|ac' => \my $Flag_AutoCol,
  'ifs=s'      => \my $Flag_IFS,
  'ofs=s'      => \my $Flag_OFS,
  'rs=s'       => \my $Flag_RS,
) or exit 64;

$Flag_IFS //= ' ';
$Flag_OFS //= ' ';
$Flag_RS  //= '\n';

$Flag_IFS =~ s/(\\t)/qq!"$1"!/eeg;
$Flag_OFS =~ s/(\\[nrt])/qq!"$1"!/eeg;
$Flag_RS =~ s/(\\[nrt])/qq!"$1"!/eeg;

my @fhs;
my $seen_stdin = 0;

for my $arg (@ARGV) {
  # "file" (no spec) or "file:" (no spec but colon) or "file:spec"
  # where no spec means "print all columns and do not preserve column
  # positions as will not try to guess that"
  my ( $file, $spec );
  if ( $arg =~ m/^([^:]+)$/ ) {
    $file = $1;
  } elsif ( $arg =~ m/^(.+):([^:]*)$/ ) {
    $file = $1;
    $spec = $2;
  }
  die "could not parse file from '$arg'\n" if !defined $file;

  my $fh;
  if ( $file eq '-' and !$seen_stdin ) {
    $fh         = \*STDIN;
    $seen_stdin = 1;
  } else {
    open $fh, '<', $file or die "could not open $file: $!\n";
  }
  push @fhs, [ $fh, defined $spec ? specify($spec) : undef ];
}

my $have_fhs = @fhs;
while ($have_fhs) {
  my $pad_col = 0;
  for my $i ( 0 .. $#fhs ) {
    if ( defined $fhs[$i]->[0] ) {
      my $line = readline $fhs[$i]->[0];
      if ( !defined $line ) {
        # EOF on an input file
        $fhs[$i]->[0] = undef;
        $have_fhs--;
        $pad_col += @{ $fhs[$i]->[1] } if defined $fhs[$i]->[1];
        next;
      }

      # Complicated due to not wanting to print the empty columns if
      # there's nothing else on the line to print (works around getting
      # an ultimate blank line that messes up the shell prompt)
      if ($pad_col) {
        print( ($Flag_OFS) x $pad_col );
        $pad_col = 0;
      }

      chomp $line;
      my @fields = split /$Flag_IFS/, $line;

      # Set field count from the first line of input (may cause
      # subsequent uninit warnings if the number of columns then drops)
      if ( $Flag_AutoCol and !defined $fhs[$i]->[1] ) {
        $fhs[$i]->[1] = [ 0 .. $#fields ];
      }

      if ( defined $fhs[$i]->[1] ) {
        print join( $Flag_OFS, @fields[ @{ $fhs[$i]->[1] } ] );
      } else {
        print join( $Flag_OFS, @fields );
      }
      print $Flag_OFS if $i != $#fhs;

    } elsif ( defined $fhs[$i]->[1] ) {
      $pad_col += @{ $fhs[$i]->[1] };
    }
  }
  print $Flag_RS if $have_fhs;
}

exit 0;

# Parse 1,2,3,5..9 type input into Perl array indices
sub specify {
  my $spec = shift;
  my @indices;

SPEC: {
    if ( $spec =~ m/\G(\d+)\.\.(\d+),?/cg ) {
      push @indices, $1 .. $2;
      redo SPEC;
    }
    if ( $spec =~ m/\G(\d+),?/cg ) {
      push @indices, $1;
      redo SPEC;
    }
    if ( $spec =~ m/\G(.)/cg ) {
      warn "unknown character '$1' in spec '$spec'\n";
      exit 65;
    }
  }

  # Assume user will use awk- or cut-like column numbers from 1, shift
  # these to perl count-from-zero internally.
  $_-- for @indices;

  return \@indices;
}

__END__
=head1 NAME

stitch - joins columns from multiple input files

=head1 SYNOPSIS

   $ cat a
   a b c
   $ cat b
   1 2 3
   4 5 6
   7 8 9
   $ stitch --ofs=\\t a:2 b:1,3
   b       1       3
           4       6
           7       9

That is, column two from the first file, and columns one and three from
the second. The range operator C<..> may also be used to select a range
of columns, e.g. C<1,4..6,8>.

=head1 DESCRIPTION

This program joins columns by line number from multiple input files.

=head1 USAGE

  $ stitch [--ac] [--ifs=s] [--ofs=s] [--rs=s] file[:spec] [file[:spec] ..]

Use C<-> to select columns from standard input; otherwise, specify files
to read input from, along with the optional column specification (by
default, all columns will be selected).

This program supports the following command line switches:

=over 4

=item B<--autocol> | B<--ac>

Set the number of columns from the first line of input seen from a
C<file> if a column specification was not provided for said C<file>.
Influences empty field padding (which only happens with a column
specification should a file run short before the others).

=item B<--ifs>=I<s>

Specify the input field separator (space by default). A C<\t> will be
expanded to the actual character:

  $ perl -E 'say join("\t", qw/a b c/)' | stitch --ifs=\\t -- -:2

Or, use a regex:

  $ perl -E 'say join("\t", qw/a b c/)' | stitch --ifs='\s+' -- -:2

=item B<--ofs>=I<s>

Output field separator (space by default). Similar expansion done as
for B<--ifs>, though also C<\n> and C<\r> are allowed.

=item B<--rs>=I<s>

Output record separator (newline by default). Expansion done as
for B<--ofs>.

=back

=head1 SECURITY

Probably should not be run under elevated privs due to user-supplied
input to the L<perlfunc/"split"> function.

Passing a user-supplied regex to L<perlfunc/"split"> might be a bit
sketchy especially if L<sudo(1)> or the like is involved. It might be
nice to have per-file IFS (so one could split on spaces on stdin, and
C<:> from C<passwd>), but that would add complications.

=head1 SEE ALSO

awk(1), comm(1), cut(1), join(1), perl(1)

=cut
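
The caveat about splitting on commas, mentioned before the script, shows up as soon as a quoted field contains a comma. A small sketch, with a made-up sample record and a made-up function name, that counts the fields a naive comma split finds:

```shell
# Made-up CSV record with two fields; the first is quoted and contains a comma.
record='"Smith, John",42'

# Hypothetical helper: count fields the way a plain comma split sees them.
naive_field_count() {
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

naive_field_count "$record"   # prints 3, although the record holds 2 CSV fields
```

For data like this, a CSV-aware tool such as csvkit from Answer 1 is the safer choice.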
