I want to replace the empty fields between tabs with the character "|". The file can be downloaded here:
wget http://download.cbioportal.org/cancerhotspots/cancerhotspots.v2.maf.gz
gunzip cancerhotspots.v2.maf.gz
cat cancerhotspots.v2.maf | grep -v version | head -3
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POMOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF TUMORTYPE PLATFORM judgement Amino_Acid_Change Amino_Acid_Position Protein_Lenght Reference_Amino_Acid Variant_Amino_Acid allele_freq tm Amino_Acid_Length Ref_Tri oncotree_organtype oncotree_parent oncotree_detailed Master_ID
WARS2 10352 . GRCh37 1 119575617 119575617 + Missense_Mutation SNP C C T novel 000236 NORMAL C C c.1000G>A p.Val334Ile p.V334I ENST00000235521 6/6 0 . . 0 . . WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,; T ENSG00000116874 ENST00000235521 Transcript missense_variant 1027/2800 1000/1083 334/360 V/I Gtt/Att 1 -1 WARS2 HGNC 12730 protein_coding YES CCDS900.1 ENSP00000235521 Q9UGM6 B7Z5X7 UPI000004A002 NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015) 6/6 Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233 MODERATE 1 SNV ACC . . acyc exome RETAIN V334I 334 V I NA WARS2 334 360 ACC headandneck saca acyc 000236
OPN3 23596 . GRCh37 1 241761094 241761094 + Missense_Mutation SNP G G A rs780348058 000236 NORMAL G G c.899C>T p.Ser300Leu p.S300L ENST00000366554 3/4 0 . . 0 . . OPN3,missense_variant,p.Ser300Leu,ENST00000366554,NM_014322.2;OPN3,missense_variant,p.Ser221Leu,ENST00000331838,;KMO,downstream_gene_variant,,ENST00000366559,NM_003679.4;KMO,downstream_gene_variant,,ENST00000366557,;KMO,downstream_gene_variant,,ENST00000366555,;OPN3,non_coding_transcript_exon_variant,,ENST00000469376,;OPN3,non_coding_transcript_exon_variant,,ENST00000490673,;OPN3,non_coding_transcript_exon_variant,,ENST00000478849,;OPN3,non_coding_transcript_exon_variant,,ENST00000463155,;OPN3,non_coding_transcript_exon_variant,,ENST00000462265,; A ENSG00000054277 ENST00000366554 Transcript missense_variant 1006/2620 899/1209 300/402 S/L tCg/tTg rs780348058 1 -1 OPN3 HGNC 14007 protein_coding YES CCDS31072.1 ENSP00000355512 Q9H1Y3 UPI000000165B NM_014322.2 deleterious(0.02) possibly_damaging(0.692) 3/4 Transmembrane_helices:TMhelix,PROSITE_profiles:PS50262,hmmpanther:PTHR24240:SF64,hmmpanther:PTHR24240,PROSITE_patterns:PS00238,Gene3D:1.20.1070.10,Pfam_domain:PF00001,Superfamily_domains:SSF81321,Prints_domain:PR00237 MODERATE 1 SNV 9.415e-06 0 0 0.0001278 0 0 0 0 . CGA . . 9.426e-06 1/106086 1/106208 0/9066 0/11158 1/7822 0/6612 0/54326 0/694 0/16408 PASS acyc exome RETAIN S300L 300 NA OPN3 300 402 TCG headandneck saca acyc 000236
When a column has no value, there is nothing between the two tabs around it. Counting the fields on each line shows this:
cat cancerhotspots.v2.maf | grep -v version | head -4 | awk '{ print NF }'
148
80
99
81
Desired output: where a column has no value, the empty field is replaced with the character "|". Here is my attempt:
cat cancerhotspots.v2.maf | grep -v version | head -2 | sed 's/\t\t/\t|\t/g'
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POMOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF TUMORTYPE PLATFORM judgement Amino_Acid_Change Amino_Acid_Position Protein_Lenght Reference_Amino_Acid Variant_Amino_Acid allele_freq tm Amino_Acid_Length Ref_Tri oncotree_organtype oncotree_parent oncotree_detailed Master_ID
WARS2 10352 . GRCh37 1 119575617 119575617 + Missense_Mutation SNP C C T novel | 000236 NORMAL C C | | | | | | | | c.1000G>A p.Val334Ile p.V334I ENST00000235521 6/6 0 . . 0 . . WARS2,missense_variant,p.Val334Ile,ENST00000235521,NM_201263.2,NM_015836.3;WARS2,missense_variant,p.Val240Ile,ENST00000537870,;WARS2,3_prime_UTR_variant,,ENST00000369426,;WARS2,downstream_gene_variant,,ENST00000497402,;WARS2,downstream_gene_variant,,ENST00000495746,; T ENSG00000116874 ENST00000235521 Transcript missense_variant 1027/2800 1000/1083 334/360 V/I Gtt/Att | 1 | -1 WARS2 HGNC 12730 protein_coding YES CCDS900.1 ENSP00000235521 Q9UGM6 B7Z5X7 UPI000004A002 NM_201263.2,NM_015836.3 tolerated(0.31) benign(0.015) 6/6 | Gene3D:1.10.240.10,HAMAP:MF_00140_B,hmmpanther:PTHR10055,Low_complexity_(Seg):seg,Superfamily_domains:SSF52374,TIGRFAM_domain:TIGR00233 | | | | | | | | MODERATE 1 SNV | | | | ACC . . | | | | | | | | | | acyc exome RETAIN V334I 334 | V I NA WARS2 334 360 ACC headandneck saca acyc 000236
cat cancerhotspots.v2.maf | grep -v version | head -4 | sed 's/\t\t/\t|\t/g' | awk '{ print NF }'
148
118
128
118
Every line of the output should have 148 columns, because the header has 148 columns.
How can I fill every empty column with "|" consistently?
Thank you!
Answer 1
This is what you want:
awk 'BEGIN{FS=OFS="\t"} {for (i=1;i<=NF;i++) if ($i == "") $i="|"; print}' file
or:
sed 's/\t\t/\t|\t/g; s/\t\t/\t|\t/g' file
but it's hard to tell from the example you provided.
Using commas instead of tabs for visibility, this shows why the sed version needs two substitutions:
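One way to check the result, sketched here on a made-up line rather than the real file: awk's default field splitting treats any run of tabs and spaces as a single separator, so empty fields are not counted in NF, while setting the separator to a tab with -F'\t' counts them. The variable name sample is only for illustration.

```shell
# Made-up sample line: three tab-separated fields, the middle one empty.
sample=$(printf 'a\t\tb')

# Default splitting: the two adjacent tabs act as one separator.
printf '%s\n' "$sample" | awk '{ print NF }'          # prints 2

# Tab as the explicit separator: the empty field is counted.
printf '%s\n' "$sample" | awk -F'\t' '{ print NF }'   # prints 3
```

With -F'\t' the count includes empty fields, so it can be compared directly against the header's count.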
$ printf 'a,,,,b\n' | sed 's/,,/,|,/g'
a,|,,|,b
$ printf 'a,,,,b\n' | sed 's/,,/,|,/g; s/,,/,|,/g'
a,|,|,|,b
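The regexp ",," consumes both commas of each pair it matches, and matches cannot overlap. So the first pass matches every odd-numbered pair of ",,", and the even-numbered pairs are not matched until the second pass runs. Another example: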
$ printf '12345678\n' | sed 's/\([0-9]\)\([0-9]\)/\1|\2/g'
1|23|45|67|8
$ printf '12345678\n' | sed 's/\([0-9]\)\([0-9]\)/\1|\2/g; s/\([0-9]\)\([0-9]\)/\1|\2/g'
1|2|3|4|5|6|7|8
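As a sketch of an alternative to writing the substitution twice, sed can repeat a single substitution until nothing is left to match, using a label and the t command. The comma-separated line below is made up for visibility. It also has an empty field at the very start and at the very end, which a two-character pattern such as ",," cannot see, whereas the awk loop tests every field.

```shell
# Made-up sample with empty fields at the start, in the middle, and at the end.
line=',a,,,,b,'

# ':a' defines a label; 't a' jumps back to it whenever the preceding
# s command replaced something, so the substitution repeats until no ',,' is left.
printf '%s\n' "$line" | sed -e ':a' -e 's/,,/,|,/' -e 't a'
# prints ,a,|,|,|,b,     (leading and trailing empty fields are untouched)

# The awk loop checks each field, including the first and the last.
printf '%s\n' "$line" | awk 'BEGIN{FS=OFS=","} {for (i=1;i<=NF;i++) if ($i == "") $i="|"; print}'
# prints |,a,|,|,|,b,|
```

If lines in the real file can begin or end with an empty field, the awk version is the safer choice of the two.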