Split a line into multiple lines, where the new lines carry the unique values as well as the repeated values

2024-6-6

text-processing awk


I have a data file with many lines, and the list of fields can differ from line to line. Below is a sample line. The fields are separated by @@@.

runAs="X094174"@@@format="excel2007"@@@path="/Path1"@@@name="X143122"@@@name="X182881"@@@name="X094174"@@@address="[email protected]"@@@address="[email protected]"@@@AgentLoc="/loc1"

I want to import the data in the same shape as a database table (columns and rows):

runAs      format       path       AgentLoc    name      address    
X094174    excel2007    /Path1     /loc1       X143122   [email protected]   
X094174    excel2007    /Path1     /loc1       X182881   [email protected]
X094174    excel2007    /Path1     /loc1       X094174

A file-reading loop and awk.

If it is easier to generate the data in the following format, that would work too:

runAs      format       path       AgentLoc    name      address    

X094174    excel2007    /Path1     /loc1       X143122      
X094174    excel2007    /Path1     /loc1       X182881   
X094174    excel2007    /Path1     /loc1       X094174
X094174    excel2007    /Path1     /loc1                 [email protected]
X094174    excel2007    /Path1     /loc1                 [email protected]

Answer 1

This produces the required form.

$ cat dat
runAs="X094174"@@@format="excel2007"@@@path="/Path1"@@@name="X143122"@@@name="X182881"@@@name="X094174"@@@address="[email protected]"@@@address="[email protected]"@@@AgentLoc="/loc1"

First, put the data into a one-item-per-line format.

$ awk -F '@@@' '{ for(i=1;i<=NF;i++){ print $i } }' dat > tmp.dat

Then build the table and clean up the line endings.

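$ awk -F '=' '{
    head[$1]++;                # count the values seen for this key
    dat[$1,head[$1]]=$2        # store the value by (key, occurrence)
  } END{
    max=0;
    # Header row. The order of "for (i in head)" is unspecified in awk.
    for(i in head){
      printf "%s\t", i
    }
    print "";
    # Find the largest number of values held by any key.
    for(i in dat){
      split(i, arr_i, SUBSEP);
      if(arr_i[2]>max){
        max=arr_i[2]
      }
    }
    # One output row per occurrence: keys with a single value repeat on every row.
    for(j=1;j<=max;j++){
      for(i in head){
        if(head[i]==1){
          printf "%s\t", dat[i,1]
        }else{
          printf "%s\t", dat[i,j]
        }
      }
      print ""
    }
  }' tmp.dat | awk '{ sub(/\t$/, ""); print }' > dat.xls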

$ cat dat.xls
runAs   format  address AgentLoc        name    path
"X094174"       "excel2007"     "[email protected]" "/loc1" "X143122"       "/Path1"
"X094174"       "excel2007"     "[email protected]" "/loc1" "X182881"       "/Path1"
"X094174"       "excel2007"             "/loc1" "X094174"       "/Path1"

When importing the file into Excel, for example, choose TAB as the delimiter.
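The values in dat.xls still carry their double quotes, whereas the table requested in the question has none. A minimal sketch of one way to strip them, assuming tab-separated input; strip_quotes is a hypothetical helper name and not part of the original answer:

```shell
# strip_quotes (hypothetical helper): removes one leading and one trailing
# double quote from every tab-separated field read on stdin.
strip_quotes() {
  awk -F '\t' -v OFS='\t' '{
    for (i = 1; i <= NF; i++) {
      sub(/^"/, "", $i)   # drop a leading quote, if present
      sub(/"$/, "", $i)   # drop a trailing quote, if present
    }
    print
  }'
}
```

It could be used as `strip_quotes < dat.xls`, with the result redirected to a file of your choice. Empty columns are kept, because each tab is treated as a separator.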

The order in which the values appear determines how the table rows relate to one another.
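To illustrate that pairing by position, here is a small sketch that prints the n-th name next to the n-th address of each input line. pair_by_position is a hypothetical helper, and it assumes the key names name and address from the sample as well as values that contain no "=" character:

```shell
# pair_by_position (hypothetical helper): for each @@@-separated line on
# stdin, print "name<TAB>address" rows, matching values purely by position.
pair_by_position() {
  awk -F '@@@' '{
    n = 0; a = 0                       # per-line counters
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")               # kv[1] is the key, kv[2] the value
      if (kv[1] == "name") names[++n] = kv[2]
      else if (kv[1] == "address") addrs[++a] = kv[2]
    }
    rows = (n > a) ? n : a
    for (j = 1; j <= rows; j++) {
      left = (j <= n) ? names[j] : ""  # empty cell when the list is shorter
      right = (j <= a) ? addrs[j] : ""
      print left "\t" right
    }
  }'
}
```

Swapping two name fields in the input swaps which address each name ends up beside, so the input order has to be meaningful for the output rows to be meaningful.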

Using a pipe instead of tmp.dat lets you do the above in a single step and avoids the temporary file.
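A sketch of what the piped version might look like. It is a variation on the answer's commands rather than a copy: to_table is a hypothetical helper name, it expects a single record on stdin, and it remembers keys in first-seen order so that the column order is stable (the order of awk's for-in loop is unspecified).

```shell
# to_table (hypothetical helper): the two awk stages joined by a pipe,
# so no temporary file is needed. Expects one @@@-separated record on stdin.
to_table() {
  awk -F '@@@' '{ for (i = 1; i <= NF; i++) print $i }' |
  awk -F '=' '
    {
      # Remember each key the first time it appears.
      if (!($1 in count)) order[++nkeys] = $1
      count[$1]++
      val[$1, count[$1]] = $2
      if (count[$1] > max) max = count[$1]
    }
    END {
      # Header row.
      for (k = 1; k <= nkeys; k++)
        printf "%s%s", order[k], (k < nkeys ? "\t" : "\n")
      # Data rows: single-valued keys repeat, multi-valued keys advance.
      for (j = 1; j <= max; j++)
        for (k = 1; k <= nkeys; k++) {
          key = order[k]
          cell = (count[key] == 1) ? val[key, 1] : val[key, j]
          printf "%s%s", cell, (k < nkeys ? "\t" : "\n")
        }
    }'
}
```

With the sample file this would be run as `to_table < dat > dat.xls`. Because the separator is only printed between columns, no extra step is needed to trim a trailing tab.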

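Answer 2

Since Java 11 there is a neat feature for running Java code directly from a source file, so how about using Java to solve this?

Assuming Java 11 or later is installed, save the code below as ConvertToTable.java and run it in a terminal session like this: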

java ConvertToTable.java < /path/to/input.txt | tee /path/to/output.txt

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * <p>
 * Reads STDIN, converts it to the desired table representation, and writes the result to STDOUT.
 * </p>
 * <p>
 * Assumes that the incoming data uses the system character set and the system line separator.
 * </p>
 * <p>
 * Requires Java 11. Example invocation in Bash:
 * </p>
 *
 * <pre>
 * java ConvertToTable.java < /path/to/input.txt | tee /path/to/output.txt
 * </pre>
 *
 * @author eomanis
 */
public class ConvertToTable {

    private static final Pattern FIELD_SEPARATOR_INPUT = Pattern.compile( "@@@", Pattern.LITERAL );
    private static final Pattern KEY_VALUE_SEPARATOR = Pattern.compile( "=", Pattern.LITERAL );
    private static final Pattern VALUE_IN_DOUBLE_QUOTES = Pattern.compile( "^\"(.*)\"$" ); // Captures the value in capturing group 1

    private static final String FIELD_SEPARATOR_OUTPUT = "\t";

    public static void main( String[] args ) {
        Matcher matcherValueInDoubleQuotes = VALUE_IN_DOUBLE_QUOTES.matcher( "" );
        Map<String, List<String>> keysAndValues = new LinkedHashMap<>(); // A map that maps a key to a list of values
        boolean firstLine = true;
        String line;
        String[] keyAndValue;
        String key;
        String value;
        int outputLinesCount;

        try (BufferedReader reader = new BufferedReader( new InputStreamReader( System.in ) )) {

            // Read and convert the incoming data, one line at a time
            while ((line = reader.readLine()) != null) { // For each line...

                // Discard the previous line's data
                keysAndValues.values().stream().forEach( List::clear );

                // Collect the line's keys and values into the map
                for (String field : FIELD_SEPARATOR_INPUT.split( line )) { // For each key=value in the text line...
                    keyAndValue = KEY_VALUE_SEPARATOR.split( field, 2 ); // Split key=value into key and value
                    if (keyAndValue.length < 2) {
                        continue; // Skip fields without "=", e.g. the empty field produced by a blank line
                    }
                    key = keyAndValue[0];
                    value = keyAndValue[1];

                    // Strip the double quotes from the value
                    if (matcherValueInDoubleQuotes.reset( value ).matches()) {
                        value = matcherValueInDoubleQuotes.group( 1 );
                    }

                    // Add the value to the key's list of values
                    if (!keysAndValues.containsKey( key )) { // If required, create a new empty list in the map for the key
                        keysAndValues.put( key, new ArrayList<>() );
                    }
                    keysAndValues.get( key ).add( value );
                }

                // First line: Generate and write the column headers (assume that the first line contains all possible keys)
                if (firstLine && !keysAndValues.isEmpty()) {
                    firstLine = false;
                    String columnHeaders = keysAndValues.keySet().stream().collect( Collectors.joining( FIELD_SEPARATOR_OUTPUT ) );
                    System.out.println( columnHeaders );
                }

                // Figure out how many output lines we will be writing for the single input line
                outputLinesCount = keysAndValues.values().stream().mapToInt( List::size ).max().orElse( 0 ); // 0 if no keys were found
                // Write the output line(s)
                for (int index = 0; index < outputLinesCount; index++) {
                    int indexFinal = index;
                    String outputLine = keysAndValues.values().stream() //
                            .map( list -> getValue( indexFinal, list ) ) //
                            .collect( Collectors.joining( FIELD_SEPARATOR_OUTPUT ) );
                    System.out.println( outputLine );
                }
            }
        } catch (IOException e) {
            throw new RuntimeException( e );
        }
    }

    /**
     * @return The value for the given index, with certain workarounds
     */
    private static String getValue( int index, List<String> values ) {

        if (values.isEmpty()) {
            // The text line did not contain the key at all
            return "";
        } else if (values.size() == 1) {
            // Value of a key that occurred exactly once in the text line: These are repeated on all output rows
            return values.get( 0 );
        } else {
            // Value of a key that occurred multiple times in the text line: Only print them for their respective output row
            return (index < values.size()) ? values.get( index ) : "";
        }
    }
}
