Querying CSV files like SQL

Question 1

Several tools have been written to do this. Here is one example:

$ csvq 'select * from cities'
+------------+-------------+----------+
|    name    |  population |  country |
+------------+-------------+----------+
| warsaw     |  1700000    |  poland  |
| ciechanowo |  46000      |  poland  |
| berlin     |  3500000    |  germany |
+------------+-------------+----------+

$ csvq 'insert into cities values("dallas", 1, "america")'
1 record inserted on "C:\\cities.csv".
Commit: file "C:\\cities.csv" is updated.

https://github.com/mithrandie/csvq
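
If installing a dedicated tool is not an option, a similar effect can be sketched with Python's standard csv and sqlite3 modules by loading the rows into an in-memory SQLite database. This is not how csvq works internally. The load_csv helper is a hypothetical name, and the data is inlined from the table above so the example is self-contained.

```python
import csv
import io
import sqlite3

# Same rows as the cities table above, inlined so the sketch is self-contained.
CITIES_CSV = """name,population,country
warsaw,1700000,poland
ciechanowo,46000,poland
berlin,3500000,germany
"""

def load_csv(conn, table, text):
    """Create `table` from CSV text; column names come from the header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    # The remaining rows of the reader are inserted as-is (all values are text).
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)

conn = sqlite3.connect(":memory:")
load_csv(conn, "cities", CITIES_CSV)

query = (
    "SELECT name FROM cities WHERE country = 'poland' "
    "ORDER BY CAST(population AS INTEGER) DESC"
)
polish_cities = [row[0] for row in conn.execute(query)]
print(polish_cities)  # ['warsaw', 'ciechanowo']
```

Every column is loaded as text, which is why the query casts population to an integer before sorting.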

Question 2

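You said this is an interview question. If I were asked it in an interview, I would ask about the restrictions: why they exist, what is allowed, what is not, and why. With each question I would try to tie the restrictions back to the business setting, so that I really understand what is going on.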

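I would also want to ask where the animal speed formula comes from, but that is because my background is stronger in the physical sciences than in the life sciences, and I am curious.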

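As an interviewer, I would definitely want to hear that there are standard tools for parsing CSV, such as pandas and csv, rather than using a script or command-line utility to parse and munge the data from scratch.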
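
As a small illustration of why the standard tools matter, here is a sketch using Python's csv module. The sample data and the read_rows helper are hypothetical (the real file contents are not shown in this thread); the quoted name is there only to show a case that a plain split(',') would get wrong.

```python
import csv
import io

# Hypothetical sample data; the first name contains a comma inside quotes.
SAMPLE = (
    "NAME,LEG_LENGTH,DIET\n"
    '"Rex, Tyrannosaurus",2.5,carnivore\n'
    "Velociraptor,0.5,carnivore\n"
)

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_rows(SAMPLE)
print(rows[0]["NAME"])                # Rex, Tyrannosaurus
print(float(rows[1]["LEG_LENGTH"]))   # 0.5
```

A naive split(',') would break the first data row into four fields, while csv.DictReader keeps the quoted name intact.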

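Stack Exchange is not well suited to this kind of back-and-forth Q&A, so I will just post an answer using Python. In an interview, I would give the answer only after truly understanding the business problem.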

# Assume it's OK to import sqrt, otherwise the spirit of the problem isn't understood.
from math import sqrt

# Read data into a dictionary keyed by dinosaur name.
dino_dict = dict()
for filename in ['file1.csv', 'file2.csv']:
    with open(filename) as f:
        # Read the first line as the CSV headers/labels.
        labels = f.readline().strip().split(',')

        # Read the data lines.
        for line in f:
            if not line.strip():
                continue  # Skip blank lines.
            values = line.strip().split(',')

            # Pair each label with its value, then merge the row into the dict.
            # Looking NAME up by label means it does not have to be the first column.
            row = dict(zip(labels, values))
            dino_name = row.pop("NAME")
            dino_dict.setdefault(dino_name, dict()).update(row)

# Calculate speed and insert into dictionary.
for dino_stats in dino_dict.values():
    try:
        stride_length = float(dino_stats['STRIDE_LENGTH'])
        leg_length = float(dino_stats['LEG_LENGTH'])
    except (KeyError, ValueError):
        continue  # Skip dinos with a missing or non-numeric measurement.

    dino_stats["SPEED"] = ((stride_length / leg_length) - 1) * sqrt(leg_length * 9.8)

# Make a list of bipedal dinos with their speeds.
bipedal_dinos_with_speed = list()
for dino_name, dino_stats in dino_dict.items():
    if dino_stats.get('STANCE') == 'bipedal' and 'SPEED' in dino_stats:
        bipedal_dinos_with_speed.append((dino_name, dino_stats['SPEED']))

# Sort the list by speed, fastest first, and print the dino names.
print([dino_name for dino_name, _ in sorted(bipedal_dinos_with_speed, key=lambda x: x[1], reverse=True)])

['Tyrannosaurus Rex', 'Velociraptor', 'Struthiomimus', 'Hadrosaurus']

Question 3

You can use the excellent Miller and run

mlr --csv join -j NAME -f file1.csv \
then put '$speed=($STRIDE_LENGTH/$LEG_LENGTH - 1)*pow(($LEG_LENGTH*9.8),0.5)' \
then sort -nr speed \
then cut -f NAME file2.csv

to get

NAME
Tyrannosaurus Rex
Velociraptor
Euoplocephalus
Stegosaurus
Hadrosaurus
Struthiomimus

Miller is available for almost every operating system, and you can use it from Bash (and other scripting languages) in scripts, just like cut, paste, sed, or awk.

Question 4

A gawk solution: the inputs are sorted first for join, and the final sort is done internally in awk.

join -t, <(sort file1.csv) <(sort file2.csv) | 
    awk -F, -v g=9.8 '/bipedal/{osaur[$1]=($4/$2-1)*sqrt(g*$2)}
        END{PROCINFO["sorted_in"]="@val_num_desc"; for (d in osaur) print d}'

Tyrannosaurus Rex
Velociraptor
Struthiomimus
Hadrosaurus

Edited following @Cbhihe's comment:

A useful resource on controlling how gawk scans arrays.

PROCINFO["sorted_in"] can be set to control the order in which an array is traversed.

In this case I use @val_num_desc, which sorts by value (val), treats the values as numeric (num), and orders them descending (desc).

You could also output the array using @ind_str_asc, which sorts by indices (ind), treats them as strings (str), and orders them ascending (asc).

These and all the other combinations are described in the linked resource.
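
For readers more at home in Python than in gawk, the two traversal orders map roughly onto sorted() calls. This is only an analogy, not what gawk does internally, and the speeds below are hypothetical stand-ins for the osaur array built by the awk script.

```python
# Hypothetical speeds, standing in for the osaur array built by the awk script.
osaur = {"Hadrosaurus": 1.4, "Tyrannosaurus Rex": 9.1, "Velociraptor": 6.2}

# Like "@val_num_desc": order by value, compared numerically, largest first.
by_value_desc = sorted(osaur, key=lambda name: osaur[name], reverse=True)

# Like "@ind_str_asc": order by index (key), compared as strings, ascending.
by_index_asc = sorted(osaur)

print(by_value_desc)  # ['Tyrannosaurus Rex', 'Velociraptor', 'Hadrosaurus']
print(by_index_asc)   # ['Hadrosaurus', 'Tyrannosaurus Rex', 'Velociraptor']
```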

Answer