Split a text file into lines with a fixed number of words

Question 1

Using xargs (17 seconds):

xargs -n1000 <file >output

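This uses the -n flag of xargs, which sets the maximum number of arguments passed per command invocation. Just change 1000 to 500, or to whatever limit you need.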
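
A minimal sketch of what -n does, using ten words and a limit of 3 instead of 1000 (the sample input and the variable name lines are made up for illustration). With no command given, xargs runs echo, so each batch of arguments is printed on its own line:

```shell
# Feed ten space-separated words to xargs, at most 3 arguments per echo call.
lines=$(printf '%s ' 1 2 3 4 5 6 7 8 9 10 | xargs -n3)
printf '%s\n' "$lines"
# prints:
# 1 2 3
# 4 5 6
# 7 8 9
# 10
```

Note that xargs also interprets quotes and backslashes in its input, so this approach is best suited to files of plain words.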

I created a test file containing 10^7 words:

$ wc -w file
10000000 file

Here are the timing statistics:

$ time xargs -n1000 <file >output
real    0m16.677s
user    0m1.084s
sys     0m0.744s

Answer

Question 2

Perl seems to be remarkably good at this:

Create a file with 10,000,000 space-separated words:

for ((i=1; i<=10000000; i++)); do printf "%s " $RANDOM ; done > one.line

Now, use Perl to add a newline after every 1,000 words:

time perl -pe '
    s{ 
        (?:\S+\s+){999} \S+   # 1000 words
        \K                    # then reset start of match
        \s+                   # and the next bit of whitespace
    }
    {\n}gx                    # replace whitespace with newline
' one.line > many.line

Timing

real    0m1.074s
user    0m0.996s
sys     0m0.076s

Verifying the results

$ wc one.line many.line
        0  10000000  56608931 one.line
    10000  10000000  56608931 many.line
    10000  20000000 113217862 total
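
wc confirms the totals but not that every line holds the same number of words. One way to check that, sketched here on a small stand-in file (the sample.line name, the temporary directory, and the group size of 3 are illustrative assumptions), is to tally the field count per line with awk:

```shell
tmpdir=$(mktemp -d)
# Build a small stand-in file: nine words, three per line.
printf '%s ' 1 2 3 4 5 6 7 8 9 | xargs -n3 > "$tmpdir/sample.line"
# NF is awk's count of whitespace-separated fields on the current line.
# The final awk only normalizes the padding that uniq -c adds.
counts=$(awk '{ print NF }' "$tmpdir/sample.line" | sort -n | uniq -c | awk '{ print $1, $2 }')
printf '%s\n' "$counts"   # "<number of lines> <words per line>"
rm -r "$tmpdir"
```

Running the same awk, sort, and uniq pipeline against many.line should report a single distinct words-per-line value if the split worked as intended.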

The accepted awk solution took just over 5 seconds on my input file.

Answer