2つのファイル間の交点（ファイル1の値はファイル2の値の範囲に属します）

Question 1

目的のタスクを実行するPythonスクリプトは次のとおりです。

#!/usr/bin/env python2
# -*- coding: ascii -*-
"""intersect_snp.py"""

import sys

# Read data from the SNP file into a list
snp_list = []
with open(sys.argv[1], 'r') as snp_file:
    for line in snp_file:
        snp_row = line.split() 
        snp_list.append(snp_row)

# Read data from the "window" file into a dictionary of lists
win_dict = {} 
with open(sys.argv[2], 'r') as win_file:
    for line in win_file:
        seqnames, chromosome, start, end, width = win_row = line.split()
        if chromosome not in win_dict:
            win_dict[chromosome] = []
        win_dict[chromosome].append(win_row)

# Compare data and compute results
results = []

# Iterate over snp data rows
for snp_row in snp_list:

    # Extract the field values for each snp row
    snp_id, chromosome, position = snp_row

    # Convert the chromosome to match the format in the "window" file
    # i.e. `1` -> `Chr1`
    chromosome_name = "Chr{}".format(chromosome)

    # Iterate over the "window" rows corresponding to that chromosome
    for win_row in win_dict.get(chromosome_name, []):

        # Extract the field values for each "window" row
        seqnames, chromosome, start, end, width = win_row

        # Perform the desired comparison
        if int(start) <= int(position) <= int(end):

            # If the comparison returns true, construct the result row
            result = [snp_id, chromosome, position, seqnames]
            results.append(result)
            break

# Print the output column headers
columns = ["snp_id", "chromosome", "position", "window"]
print(" ".join(columns))

# Print the results
for row in results:
    print(' '.join(row))

このスクリプトでは、すべての行がデータ行であると仮定します。入力ファイルの名前が指定されている場合は、snp_data次windowのように実行できます。

python intersect_snp.py snp_data window

ファイルにヘッダー行がある場合は、tailヘッダーのスキップ/アンインストールを使用して次のように実行できます。

python intersect_snp.py <(tail -n+2 snp_data) <(tail -n+2 window)

これがあなたのファイルであるsnp_dataと仮定します。

snp_id              chromosome  position
Chr01__912          1           912
Chr01__944          1           944
Chr01__1107         1           1107
...
...
...
Chr17__214708367    17          214708367
Chr17__214708424    17          214708424
Chr17__214708451    17          214708451
Chr17__214708484    17          214708484
Chr17__214708508    17          214708508

これはあなたのwindowファイルです：

seqnames    chromosome  start       end         width
1           Chr1        1           15000       15000
2           Chr1        15001       30000       15000
3           Chr1        30001       45000       15000
4           Chr1        45001       60000       15000
5           Chr1        60001       75000       15000
...
...
...
199954      Chr17       214620001   214635000   15000
199955      Chr17       214635001   214650000   15000
199956      Chr17       214650001   214665000   15000
199957      Chr17       214665001   214680000   15000
199958      Chr17       214680001   214695000   15000
199959      Chr17       214695001   214708580   13580

その後、このコマンドを実行すると：

python intersect_snp.py <(tail -n+2 snp_data) <(tail -n+2 window)

私達は次の結果を得ます：

snp_id chromosome position window
Chr01__912 Chr1 912 1
Chr01__944 Chr1 944 1
Chr01__1107 Chr1 1107 1
...
...
...
Chr17__214708367 Chr17 214708367 199959
Chr17__214708424 Chr17 214708424 199959
Chr17__214708451 Chr17 214708451 199959
Chr17__214708484 Chr17 214708484 199959
Chr17__214708508 Chr17 214708508 199959

Answer

目的のタスクを実行するPythonスクリプトは次のとおりです。

#!/usr/bin/env python2
# -*- coding: ascii -*-
"""intersect_snp.py"""

import sys

# Read data from the SNP file into a list
snp_list = []
with open(sys.argv[1], 'r') as snp_file:
    for line in snp_file:
        snp_row = line.split() 
        snp_list.append(snp_row)

# Read data from the "window" file into a dictionary of lists
win_dict = {} 
with open(sys.argv[2], 'r') as win_file:
    for line in win_file:
        seqnames, chromosome, start, end, width = win_row = line.split()
        if chromosome not in win_dict:
            win_dict[chromosome] = []
        win_dict[chromosome].append(win_row)

# Compare data and compute results
results = []

# Iterate over snp data rows
for snp_row in snp_list:

    # Extract the field values for each snp row
    snp_id, chromosome, position = snp_row

    # Convert the chromosome to match the format in the "window" file
    # i.e. `1` -> `Chr1`
    chromosome_name = "Chr{}".format(chromosome)

    # Iterate over the "window" rows corresponding to that chromosome
    for win_row in win_dict.get(chromosome_name, []):

        # Extract the field values for each "window" row
        seqnames, chromosome, start, end, width = win_row

        # Perform the desired comparison
        if int(start) <= int(position) <= int(end):

            # If the comparison returns true, construct the result row
            result = [snp_id, chromosome, position, seqnames]
            results.append(result)
            break

# Print the output column headers
columns = ["snp_id", "chromosome", "position", "window"]
print(" ".join(columns))

# Print the results
for row in results:
    print(' '.join(row))

このスクリプトでは、すべての行がデータ行であると仮定します。入力ファイルの名前が指定されている場合は、snp_data次windowのように実行できます。

python intersect_snp.py snp_data window

ファイルにヘッダー行がある場合は、tailヘッダーのスキップ/アンインストールを使用して次のように実行できます。

python intersect_snp.py <(tail -n+2 snp_data) <(tail -n+2 window)

これがあなたのファイルであるsnp_dataと仮定します。

snp_id              chromosome  position
Chr01__912          1           912
Chr01__944          1           944
Chr01__1107         1           1107
...
...
...
Chr17__214708367    17          214708367
Chr17__214708424    17          214708424
Chr17__214708451    17          214708451
Chr17__214708484    17          214708484
Chr17__214708508    17          214708508

これはあなたのwindowファイルです：

seqnames    chromosome  start       end         width
1           Chr1        1           15000       15000
2           Chr1        15001       30000       15000
3           Chr1        30001       45000       15000
4           Chr1        45001       60000       15000
5           Chr1        60001       75000       15000
...
...
...
199954      Chr17       214620001   214635000   15000
199955      Chr17       214635001   214650000   15000
199956      Chr17       214650001   214665000   15000
199957      Chr17       214665001   214680000   15000
199958      Chr17       214680001   214695000   15000
199959      Chr17       214695001   214708580   13580

その後、このコマンドを実行すると：

python intersect_snp.py <(tail -n+2 snp_data) <(tail -n+2 window)

私達は次の結果を得ます：

snp_id chromosome position window
Chr01__912 Chr1 912 1
Chr01__944 Chr1 944 1
Chr01__1107 Chr1 1107 1
...
...
...
Chr17__214708367 Chr17 214708367 199959
Chr17__214708424 Chr17 214708424 199959
Chr17__214708451 Chr17 214708451 199959
Chr17__214708484 Chr17 214708484 199959
Chr17__214708508 Chr17 214708508 199959

Question 2

長時間待たないようにするには、Linuxにプリインストールされている場合が多い最小SQLエンジンであるSQLiteを使用してこれを実行できます。サーバーを実行せずにファイルに格納されているSQLデータベースを使用します。

snp_dataとウィンドウディレクトリで、次の操作を行います。

cat snp_data | tr -s " " > snp_data.csv
sed 's#Chr##g' window | tr -s " " > window.csv

これにより、フィールド間のスペースが正規化され、インポートする準備が整います。

その後、このデータをSQLにインポートし、クエリを実行して出力を取得します。

cat > task.sql
CREATE TABLE snp_data (snp_id text,chromosome int, position int);
CREATE TABLE window (seqname int,chromosome int, c_start int , c_end int, c_width int);

.mode csv
.separator "  "
.import snp_data.csv snp_data
.import window.csv window

.mode column
.header on
SELECT D.snp_id, D.chromosome, D.position, W.seqname FROM snp_data D, window W WHERE W.chromosome=D.chromosome AND D.position BETWEN W.c_start AND W.c_end;

[ここで入力を停止するには Ctrl+D]

ついに：

cat task.sql | sqlite3 my_database.db

通常、大容量ファイルの場合、この方法はより高速です。

Answer