Background
Recently we worked with a startup that provides a big data platform for storing and managing genomic data. Using a dashboard, medical professionals can receive notifications about common mutations in their patients’ DNA, search for specific mutations, and perform family analysis queries. Since ~99.5% of the human genome is identical across individuals, the DNA sequencing process typically compares a sample sequence with a specific, well-known reference genome. The result of this comparison is stored in a raw file, called a VCF file, which contains only the differences between the sample and the reference genome. The file contains ~1M rows, each representing a gene mutation: a gene index and the value at that index that differs from the reference genome. Each VCF file then goes through a normalization process that prepares it to support the different scenarios the platform offers. Normalization is a pipeline of steps that adds metadata and annotations from external gene-related knowledge bases, and it produces a large file containing ~1M rows and ~50 columns.
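To make the shape of the normalized output concrete, a simplified schema might look like the sketch below. The table and column names are illustrative assumptions, not the startup’s actual schema; the real pipeline produces ~50 annotation columns.

```sql
-- Illustrative schema only: names and types are assumptions,
-- not the actual normalized output.
CREATE TABLE sample_normalized (
    gene_id        BIGINT,        -- gene index from the VCF row
    position       BIGINT,        -- gene position, used for ordering in queries
    sample_value   VARCHAR(64),   -- value that differs from the reference genome
    annotation_1   VARCHAR(256),  -- metadata added by the normalization pipeline
    annotation_2   VARCHAR(256)   -- ... in practice ~50 annotation columns
);
```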
The Problem
Clinicians and researchers need to be able to run dynamic queries at runtime, via the dashboard, on their patients’ sequencing results. A basic query filters a sample by multiple columns. A more advanced query joins a sample with a reference DB that contains 60M rows. An even more advanced scenario performs multiple joins in the same query to support family analysis, in addition to a join with the external reference DB. Since these queries are triggered from the dashboard, the requirement was to return results in less than 2 seconds; anything longer is considered a bad user experience. Performing such queries at runtime, when each sample contains so many rows and columns, requires powerful computational resources and takes a relatively long time to complete.
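For illustration, a family analysis query of the kind described above might look roughly like the following sketch. The table and column names are hypothetical (matching the illustrative schema above), and the real queries differ per storage engine.

```sql
-- Hypothetical family analysis query: join several family members' samples
-- and annotate the shared mutations with the 60M-row reference DB.
SELECT child.gene_id, child.position, ref.annotation_1
FROM   sample_child  AS child
JOIN   sample_mother AS mother ON mother.gene_id = child.gene_id
JOIN   sample_father AS father ON father.gene_id = child.gene_id
JOIN   reference_db  AS ref    ON ref.gene_id    = child.gene_id
WHERE  child.sample_value = mother.sample_value
ORDER  BY child.position
LIMIT  100;
```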
The Engagement
We collaborated with this startup with the goal of performing a query performance benchmark to evaluate several scalable storage architectures:
Impala running on a Cloudera cluster
Impala is well suited for executing interactive, exploratory SQL queries on large data sets, whereas Hive and MapReduce are appropriate for very long running, batch-oriented tasks. Impala also uses columnar storage, which makes much more sense for the type of queries we used because of the way the data is physically stored on disk: when executing SELECT queries that project only a few of the columns, only the relevant blocks of the required columns are read from disk, because the data for the same column across different rows sits next to one another. Read more about columnar storage here.
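As a minimal sketch of this setup, a Parquet-backed Impala table and a projection query might look like the following. The names reuse the hypothetical schema from above and the filter value is illustrative, not taken from the actual data.

```sql
-- Hypothetical Impala DDL: Parquet provides columnar storage on HDFS,
-- so the projection below only reads the gene_id and position column chunks.
CREATE TABLE sample_normalized_parquet (
    gene_id      BIGINT,
    position     BIGINT,
    sample_value STRING,
    annotation_1 STRING
)
STORED AS PARQUET;

SELECT gene_id, position
FROM   sample_normalized_parquet
WHERE  annotation_1 = 'pathogenic'   -- illustrative filter value
ORDER  BY position
LIMIT  100;
```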
Spark running on an HDInsight cluster
Spark is the first technology that makes sense to try when working with big data on top of HDFS, and it provides a good baseline for comparing other technologies such as Impala. Read more about Spark here.
SQL Data Warehouse
SQL Data Warehouse is a massively parallel processing (MPP) distributed database system. By dividing data and processing capability across multiple nodes, SQL Data Warehouse can offer huge scalability. Read more about SQL Data Warehouse here.
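As a rough sketch of how the MPP model applies here (again assuming the hypothetical schema from above), the 60M-row reference table could be hash-distributed on the join key so that joins on gene id are co-located across nodes.

```sql
-- Hypothetical SQL Data Warehouse DDL: hash-distribute on the join key and
-- use a clustered columnstore index, a typical pattern for large tables
-- in an MPP warehouse.
CREATE TABLE reference_db
(
    gene_id      BIGINT,
    position     BIGINT,
    annotation_1 NVARCHAR(256)
)
WITH
(
    DISTRIBUTION = HASH(gene_id),
    CLUSTERED COLUMNSTORE INDEX
);
```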
The queries
We agreed on a set of queries that support common scenarios. We used the same sample files in each of the storage solutions and measured how long each query took to run. The queries below are described in a “SQL-like” style for clarity; they were implemented in each storage solution according to its own syntax.
The reference DB table contains 60M rows with the same structure as the VCF file, plus additional metadata about each gene. The normalization process uses this data to annotate the rows in the sample VCF file.
The following table describes the queries and the benchmark test results:
All query times are in seconds.

Index | Scenario | Impala (3 x D13) | Spark on HDInsight (3 x D12) | Spark on Cloudera (3 x DS13, 5 x P30 disks each) | SQL DW (400 DWU) | SQL DW (1000 DWU) |
---|---|---|---|---|---|---|
1 | select from sample, order by gene position, take first 100 records | 2 | 5 | 2 | 1 | 1 |
2 | select from reference db table, filter by one field, order by another field, take first 100 records | 11 | 96 | 38 | 2 | 6 |
3 | join sample with reference db table on gene id, order by gene position, take first 100 records | 775 | 347 | 168 | 15 | 6 |
4 | join sample with reference db table on gene id, take first 100 records | 211 | 275 | 121 | 1 | 1 |
5 | join sample with reference db table on gene id, filter by one field, order by gene position, take first 100 records | 13 | 111 | 61 | 1 | 1 |
6 | join sample with reference db table on gene id, group by one field, order by another field, take first 100 records | 37 | 41 | 23 | 5 | 2 |
7 | select from reference db table, group by one field, order by another field, take first 100 records | 12 | 25 | 20 | 15 | 7 |
8 | select from sample rows that don’t exist in reference db table, take first 100 records | 37 | 367 | 432 | 5 | 2 |
9 | join 4 samples on gene id, filter by one field, order by another field, take first 100 records | 7 | 16 | 20 | 3 | 2 |
10 | add 1M records into a 60M records table | 7 | 35 | 33 | 4 | 3 |
11 | add 1M records into a new empty table | 4.5 | 35 | 33 | 6 | 21 |
12 | add 60M records into a new empty table | 140 | 500 | 280 | 214 | 124 |
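For illustration, scenario 3 from the table above might be expressed roughly as follows in the “SQL-like” style used throughout; the table and column names are the same hypothetical ones as before.

```sql
-- Hypothetical form of scenario 3: join the sample with the 60M-row
-- reference DB on gene id, order by gene position, take the first 100 rows.
SELECT s.gene_id, s.position, s.sample_value, r.annotation_1
FROM   sample_normalized AS s
JOIN   reference_db      AS r ON r.gene_id = s.gene_id
ORDER  BY s.position
LIMIT  100;
```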
Conclusions
Based on the results, none of the alternatives above consistently satisfied the 2-second requirement across all scenarios.
However, we recommended the following:
- In the short term, use an RDBMS (e.g. SQL Server, MySQL, PostgreSQL) to drive most of the real-time queries issued via the dashboard that must complete in under 2 seconds. Validate that the execution plan doesn’t perform a full-table scan (see the indexing sketch after this list).
- Deploy a small Hadoop cluster that specifically includes Spark, Impala, HDFS and Parquet. Use the cluster to offload from the RDBMS to Impala some of the more complex dashboard queries that do not need a sub-second response. Over time, expand the number of queries handled outside of the RDBMS by using Impala to do the “heavy lifting”.
- Expand ad-hoc query capabilities and optimize data preparation by using Impala and Spark. Pre-process raw data using Spark and Impala and load the relevant slices of the data into the RDBMS. Use Impala when building out new ad-hoc query capabilities in the web application, and include relevant UI tweaks (e.g. “wait… loading…”) to minimize the user-experience impact of the longer response times.
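As a minimal sketch of the short-term RDBMS recommendation (using the same hypothetical names as above), covering the join and ordering columns with indexes is what keeps the dashboard queries from falling back to full-table scans.

```sql
-- Hypothetical indexes for the RDBMS serving the dashboard: they let the
-- engine seek on the join key and avoid a full-table scan when ordering
-- by gene position.
CREATE INDEX ix_sample_gene_id    ON sample_normalized (gene_id);
CREATE INDEX ix_sample_position   ON sample_normalized (position);
CREATE INDEX ix_reference_gene_id ON reference_db (gene_id);
```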