Nesting "awk" in a "while" loop to parse two files line by line and compare column values

Answer 1

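The first problem is that you can't use bash variables inside awk like that. Inside awk, $a evaluates to field number a, but a is empty there because it is defined in bash, not in awk (an unset a counts as 0, so $a is really $0, the whole line). One way around this is to use awk's -v option to define the variable: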

-v var=val
--assign var=val
   Assign the value val to the variable var,  before  execution  of
   the  program  begins.  Such variable values are available to the
   BEGIN rule of an AWK program.

So you could do:

while read chr a b cov; do 
  awk -v a="$a" -v b="$b" '($2<=a && b <= $3) {print NR}' exons.bed > out$a$b 
done < reads.bed
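
To see the difference in isolation, here is a minimal sketch. The values 60010 and 60050 are made up and simply stand in for one line of reads.bed. Inside single quotes the shell never expands anything, so awk only sees its own, unset variables; with -v the shell values are copied into awk variables before the program starts.

```shell
# Hypothetical values, standing in for the start and end of one read.
a=60010
b=60050

# Without -v: awk's own variables a and b are unset, so they print as empty strings.
without=$(awk 'BEGIN { print "a=[" a "] b=[" b "]" }')

# With -v: the shell values are assigned to awk variables of the same name.
with=$(awk -v a="$a" -v b="$b" 'BEGIN { print "a=[" a "] b=[" b "]" }')

echo "$without"   # a=[] b=[]
echo "$with"      # a=[60010] b=[60050]
```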

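But there is another thing to check. For a read to fall within an exon, the read's start position must be greater than or equal to the exon's start, and its end position less than or equal to the exon's end. With $2 and $3 as the read's start and end and a and b as the exon's, $2<=a && b <= $3 selects reads that start outside the exon boundaries; what you want is $2>=a && $3<=b. Keep track of which file supplies which side: in the loop above, a and b come from reads.bed and $2 and $3 from exons.bed, so there the roles are reversed.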
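
The containment test itself can be tried on a few numbers. This is only a sketch: within is a hypothetical helper written for this illustration, and the read coordinates are invented (the exon 60005-60100 is the first exon in the sample output).

```shell
# within READ_START READ_END EXON_START EXON_END
# Hypothetical helper: prints "yes" if the read lies inside the exon, else "no".
within() {
  awk -v rstart="$1" -v rend="$2" -v estart="$3" -v eend="$4" \
    'BEGIN { if (rstart >= estart && rend <= eend) print "yes"; else print "no" }'
}

within 60010 60050 60005 60100   # read inside the exon: yes
within 60000 60050 60005 60100   # read starts before the exon: no
within 60010 60200 60005 60100   # read ends after the exon: no
```

Values passed with -v that look like numbers are compared numerically, which is what the position comparison needs.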

Anyway, doing this kind of task in a bash loop is very inefficient, since the input file has to be read once for each pair of a and b. Why not do the whole thing in awk?

awk 'NR==FNR{a[NR]=$2;b[NR]=$3; next} {
        for (i in a){
           if($2>=a[i] && $3<=b[i]){
            out[i]=out[i]" "FNR 
        }}}
        END{for (i in out){
                   print "Exon",i,"contains reads of line(s)"out[i],\
                   "of reads file" 
        }}' exons.bed reads.bed

Running the script above on your example files produces the following output:

Exon 1 contains reads of line(s) 1 of reads file
Exon 2 contains reads of line(s) 2 3 4 5 of reads file

For clarity, here is the same thing in a less condensed form:

#!/usr/bin/awk -f

## While we're reading the 1st file, exons.bed
NR==FNR{
    ## Save the start position in array a and the end 
    ## in array b. The keys of the arrays are the line numbers.
    a[NR]=$2;
    b[NR]=$3; 
    ## Move to the next line, without continuing
    ## the script.
    next;
}
 ## Once we move on to the 2nd file, reads.bed
 {
     ## For each set of start and end positions
     for (i in a){
         ## If the current line's 2nd field (the read's start) is greater
         ## than or equal to this start position and its 3rd field (the
         ## read's end) is smaller than or equal to this end position,
         ## add this line number (FNR is the current file's line number)
         ## to the list of reads for the current value of i. 
         if($2>=a[i] && $3<=b[i]){
             out[i]=out[i]" "FNR 
         }
     }
 }
 ## After both files have been processed
 END{
     ## For each exon in the out array
     for (i in out){
         ## Print the exon's line number and the reads it contains
         print "Exon",i,"contains reads of line(s)"out[i],
             "of reads file" 
     }
 }
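
To try the approach end to end, the sketch below builds two small files in a scratch directory and runs the condensed program on them. The exon coordinates are the ones that appear in the sample output in this thread; the chromosome name and every read line are invented for illustration, chosen so that the result matches the output shown above.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# Hypothetical sample data (fields separated by spaces; tabs work the same
# way with awk's default field splitting).
cat > "$workdir/exons.bed" <<'EOF'
chr1 60005 60100
chr1 61007 61130
EOF

cat > "$workdir/reads.bed" <<'EOF'
chr1 60010 60050 3
chr1 61010 61040 2
chr1 61020 61060 5
chr1 61050 61100 1
chr1 61100 61130 4
EOF

# Same program as above. The result is piped through sort because awk does
# not define the order in which "for (i in array)" visits the keys.
result=$(awk 'NR==FNR { a[NR]=$2; b[NR]=$3; next }
  { for (i in a) if ($2>=a[i] && $3<=b[i]) out[i]=out[i]" "FNR }
  END { for (i in out)
          print "Exon",i,"contains reads of line(s)"out[i],"of reads file" }
' "$workdir/exons.bed" "$workdir/reads.bed" | sort)

echo "$result"
```

Each file is read exactly once, however many reads there are, which is the efficiency gain over the bash loop.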

Answer 2

I know this isn't quite what you asked for, but personally I'm not an awk person, so I'd suggest trying Perl instead.

Something like this:

#!/usr/bin/perl

#REALLY GOOD IDEA at the start of any perl code
use strict;
use warnings;

#open some files for input
open( my $exons, "<", 'exons.bed' ) or die $!;

#record where our exons start and finish. 
my %start_of;
my %end_of;

#read line by line our exons file. 
#extract the 3 fields and save 'start' and 'end' in a hash table. 
while (<$exons>) {
    my ( $something, $start, $end ) = split;

    my $exon_id = $.;    #line number;
    $start_of{$exon_id} = $start;
    $end_of{$exon_id}   = $end;
}
close ( $exons );

my %exons;
#run through 'reads' line by line, extracting the fields. 

open( my $reads, "<", 'reads.bed' ) or die $!;
while (<$reads>) {
    my ( $thing, $read_start, $read_end, $value ) = split;

    #cycle through each exon. 
    foreach my $exon_id ( keys %start_of ) {

        #check if _this_ 'read' is within the start and end ranges. 
        if (    $read_start >= $start_of{$exon_id}
            and $read_end <= $end_of{$exon_id} )
        {
            #store the line number in our hash %exons. 
            push( @{ $exons{$exon_id} }, $. );
        }
    }
}
close ( $reads ); 

#cycle through %exons - in numeric 'id' order. 
foreach my $exon_id ( sort { $a <=> $b } keys %exons ) {
    #print any matches. 
    print "exon ",$exon_id, " (", $start_of{$exon_id}, " - ", $end_of{$exon_id},
        ") contains reads of line:", join( ",", @{ $exons{$exon_id} } ), "\n";
}

Given your sample data, this prints:

exon 1 (60005 - 60100) contains reads of line:1
exon 2 (61007 - 61130) contains reads of line:2,3,4,5

It should be easy to extend this to do more complex range checks and validation!
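
As one example of such an extension, a read could also be required to lie on the same chromosome as the exon. The sketch below shows the idea in awk rather than Perl. It assumes that column 1 of both files holds the chromosome name, and all sample lines are invented.

```shell
tmp=$(mktemp -d)

# Invented sample data: one exon on chr1, then one read on chr1 and one on
# chr2 with identical coordinates.
printf 'chr1 60005 60100\n' > "$tmp/exons.bed"
printf 'chr1 60010 60050 3\nchr2 60010 60050 7\n' > "$tmp/reads.bed"

# Store the chromosome (column 1) next to the start and end of each exon,
# and only report a read when the chromosome matches as well.
matches=$(awk 'NR==FNR { chrom[NR]=$1; a[NR]=$2; b[NR]=$3; next }
  { for (i in a)
      if ($1==chrom[i] && $2>=a[i] && $3<=b[i])
        print "Exon " i " contains read on line " FNR }
' "$tmp/exons.bed" "$tmp/reads.bed")

echo "$matches"   # Exon 1 contains read on line 1
```

The chr2 read has the same coordinates but is skipped because its first column differs from the exon's.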
