Benchmarking Technologies for Storing and Querying Genomic Data

Background

Recently we worked with a startup that provides a big data platform for storing and managing genomic data. Using a dashboard, medical professionals can receive notifications about common mutations in their patients’ DNA, search for specific mutations, and perform family analysis queries. Since ~99.5% of the human genome is shared, the DNA sequencing process typically compares sample sequences with a specific well-known reference genome. The result of this comparison is stored in a raw file called VCF which only contains the differences between the sample and reference genome. This file contains ~1M rows, each represents a gene mutation, namely, a gene index and a value which is different than the one in the reference genome for that index. After getting a VCF file, it goes through a normalization process that prepares it to support different scenarios that the platform offers. The normalization process is a pipeline of steps that adds metadata and annotations from external gene-related knowledge-bases. The normalization process produces a large file containing ~1M rows and ~50 columns.

The Problem

Clinicians and researchers need to be able to run dynamic queries at runtime using the dashboard on their patients’ sequencing results. A basic query can be filtering a sample by multiple columns. A more advanced query can result in a join between a sample with a reference DB (that contains 60M rows). An even more advanced scenario is a multiple join operation in the same query to support family analysis scenarios, in addition to a join with an external reference DB repository. Since these queries are triggered via the dashboard, the requirement was to return the result in less than 2 seconds. Returning a result after a longer period of time is considered a bad user experience. Performing such queries at runtime, such that each sample contains so many rows and columns requires powerful computational resources and takes a relatively long time to complete.

The Engagement

We collaborated with this startup with the goal of performing a query performance benchmark to evaluate several scalable storage architectures:

Impala running on a Cloudera cluster

Impala is well-suited for executing SQL queries for interactive exploratory analytics on large data sets, whereas Hive and MapReduce are appropriate for very long running, batch-oriented tasks. Impala also uses columnar storage which makes much more sense for the type of queries that we used because of the way the data is physically stored on disk. When executing SELECT queries and projecting a few of the columns, only the relevant blocks of the required columns will be read from the disk, this is because the data for the same column of different rows sits next to one another. Read more about columnar storage here.

Spark running on HDInsight cluster

Spark is the first technology which makes sense to try when working with big data on top of HDFS. It can provide a good reference to trying other technologies such as Impala. Read more about spark here.

SQL Data Warehouse

SQL Data Warehouse is a massively parallel processing (MPP) distributed database system. By dividing data and processing capability across multiple nodes, SQL Data Warehouse can offer huge scalability. Read more about SQL Data Warehouse here.

The queries

We agreed on a select number of queries that support common scenarios. We used the same sample files in the different storage solutions and tested how long it took for the query to run on each of them. The queries below are written in a “SQL-like” style for the purpose of clarity. They were implemented in each of the storage solutions according to its syntax.

The reference DB table contains 60M rows in the same structure of the VCF file, but in addition, also contains more metadata about each gene. The normalization process uses this data to annotate the rows in the sample VCF file.

The following table describes the queries and the benchmark test results:

.	Technology	Impala	Spark on HDInsight	Spark on Cloudera	SQL DW	SQL DW
.	Cluster Type	3 x D13	3 x D12	3 x DS13 (5xP30 disks each)	400 DWU	1000 DWU
index	Scenario	seconds	seconds	seconds	seconds	seconds
1	select from sample, order by gene position, take first 100 records	2	5	2	1	1
2	select from reference db table, filter by one field, order by another field, take first 100 records	11	96	38	2	6
3	join sample with reference db table on gene id, order by gene position, take first 100 records	775	347	168	15	6
4	join sample with reference db table on gene id, take first 100 records	211	275	121	1	1
5	join sample with reference db table on gene id, filter by one field, order by gene position, take first 100 records	13	111	61	1	1
6	join sample with reference db table on gene id, group by one field, order by another field, take first 100 records	37	41	23	5	2
7	select from reference db table, group by one field, order by another field, take first 100 records	12	25	20	15	7
8	select from sample rows that don’t exist in reference db table, take first 100 records	37	367	432	5	2
9	join 4 samples on gene id, filter by one field, order by another field, take first 100 records	7	16	20	3	2
10	add 1M records into a 60M records table	7	35	33	4	3
11	add 1M records into a new empty table	4.5	35	33	6	21
12	add 60M records into a new empty table	140	500	280	214	124

Conclusions

Based on the results, the above alternatives didn’t satisfy the 2 seconds requirement.

However, we recommended the following:

In the short term, use RDBMS (ie SQL Server, MySql, Postgres) for driving most of the realtime queries issued via the dashboard and that must complete in <2 seconds. Validate that the execution plan doesn’t perform a full-table scan.
Deploy a small Hadoop cluster specifically including Spark, Impala, HDFS and Parquet. Use the cluster to offload some of the more complex web dashboard queries that do not need a sub-second response from RDBMS to Impala. Over time, expand the number of queries that are handled outside of RDBMS by using Impala to do the “heavy lifting”.
Expand ad-hoc query capabilities and optimize data preparation by using Impala and Spark. Pre-process raw data using Spark and Impala and load relevant slices of the data into the RDBMS. Use Impala when building out new ad-hoc query capabilities in the web application and include relevant UI tweaks (i.e. “wait… loading…”) to minimize the user-experience impact of the longer response times.

Benchmarking Technologies for Storing and Querying Genomic Data

Background

The Problem