Benchmarking Data Migration from Cassandra to Azure Cosmos DB Cassandra API

Alp Kaya

Akash Shankaran

 

Appendix

 

Environment Details

[Figure: component architecture (v1)]

  1. Cassandra IaaS on Azure (single VM)
  • Azure VM SKU: DS14-8 v2
  • 8 vCPUs
  • 112 GB RAM
  • Single-node cluster

 

Sample of the data that was migrated to the Cosmos DB Cassandra API:

[Figure: sample rows from the Cassandra IaaS VM]

Keyspace/table schema:

 

stresscql.typestest10m (
    name text,
    choice boolean,
    date timestamp,
    address inet,
    dbl double,
    lval bigint,
    ival int,
    uid timeuuid,
    value blob,
    col1 text,
    col2 text,
    col3 text,
    col4 text,
    col5 text
)
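
The listing above omits the primary key, so any DDL for the target table has to assume one. If the table needs to be created on the target ahead of the load, the Scala sketch below uses the Spark Cassandra connector's CassandraConnector helper; the PRIMARY KEY and the replication settings are illustrative assumptions (they are not shown in the original post), and the sketch assumes the active Spark conf already points at the target endpoint.

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.SparkSession

// Minimal sketch: create the keyspace and table on the target before the load.
val spark = SparkSession.builder().getOrCreate()

CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
  // Replication settings are required by CQL syntax; Cosmos DB ignores them.
  session.execute(
    "CREATE KEYSPACE IF NOT EXISTS stresscql " +
      "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
  session.execute(
    """CREATE TABLE IF NOT EXISTS stresscql.typestest10m (
      |    name text, choice boolean, date timestamp, address inet,
      |    dbl double, lval bigint, ival int, uid timeuuid, value blob,
      |    col1 text, col2 text, col3 text, col4 text, col5 text,
      |    PRIMARY KEY ((name, choice), date)  -- assumed key, not in the post
      |)""".stripMargin)
}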

 

  2. Azure Databricks
    • Standard cluster / Runtime 9.1 LTS (Scala 2.12, Spark 3.1.2)
    • Standard DS3_v2
    • Autoscaling from a minimum of 1 to a maximum of 4 workers (14 GB RAM, 4 cores each)

 

Azure Databricks provides a baseline Scala notebook that only needs to be configured for your environment in order to function. Once configured, the notebook reads from your Cassandra environment and writes to your Cosmos DB account. The baseline Azure Databricks configuration file can be found here.

Sample of the Azure Databricks notebook and parameters that can be modified:

keyspace -> stresscql, table -> typestest10m

cosmosCassandra: scala.collection.immutable.Map[String,String] = Map(
    spark.cassandra.output.concurrent.writes -> 25,
    spark.cassandra.concurrent.reads -> 512,
    spark.cassandra.connection.ssl.enabled -> true,
    spark.cassandra.connection.keep_alive_ms -> 600000000,
    spark.cassandra.output.batch.size.rows -> 1,
    spark.cassandra.output.batch.grouping.buffer.size -> 512, ...

Note: the Spark Cassandra connector settings listed above are the configurations that can be tuned.
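
To make the flow concrete, here is a minimal sketch of the read-from-Cassandra, write-to-Cosmos step the notebook performs once these maps are defined. The host names and credentials are placeholders, and the tuning values simply mirror the sample output above; none of them are the exact values used in the benchmark runs.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Source: the Cassandra IaaS VM (placeholder address).
val nativeCassandra = Map(
  "spark.cassandra.connection.host" -> "<cassandra-vm-ip>",
  "spark.cassandra.connection.port" -> "9042"
)

// Target: the Cosmos DB Cassandra API endpoint (placeholder account/key),
// carrying the tuning settings from the sample output above.
val cosmosCassandra = Map(
  "spark.cassandra.connection.host" -> "<account-name>.cassandra.cosmos.azure.com",
  "spark.cassandra.connection.port" -> "10350",
  "spark.cassandra.connection.ssl.enabled" -> "true",
  "spark.cassandra.auth.username" -> "<account-name>",
  "spark.cassandra.auth.password" -> "<account-key>",
  "spark.cassandra.connection.keep_alive_ms" -> "600000000",
  "spark.cassandra.output.concurrent.writes" -> "25",
  "spark.cassandra.concurrent.reads" -> "512",
  "spark.cassandra.output.batch.size.rows" -> "1",
  "spark.cassandra.output.batch.grouping.buffer.size" -> "512"
)

// Read the source table from the Cassandra IaaS VM...
val sourceDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(nativeCassandra ++ Map("keyspace" -> "stresscql", "table" -> "typestest10m"))
  .load()

// ...and append it into the identically named table on Cosmos DB.
sourceDf.write
  .format("org.apache.spark.sql.cassandra")
  .options(cosmosCassandra ++ Map("keyspace" -> "stresscql", "table" -> "typestest10m"))
  .mode("append")
  .save()

In this setup, write throughput is governed mainly by the RUs provisioned on the target and the connector's concurrent-write settings, which is why those are the knobs varied across runs.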

  3. Azure Cosmos DB
    • Single-region deployment
    • Provisioned throughput varied across runs, from 24,000 RUs in run #1 to 80,000 RUs in run #5 (a sketch of adjusting throughput follows below)
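
Provisioned throughput on the target can be changed between runs without recreating the table. A minimal sketch, assuming the Cosmos DB Cassandra API's cosmosdb_provisioned_throughput CQL option and a Spark conf already pointed at the Cosmos endpoint:

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.SparkSession

// Example: raise the table's throughput to the run #5 level (80,000 RUs).
val spark = SparkSession.builder().getOrCreate()

CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
  session.execute(
    "ALTER TABLE stresscql.typestest10m WITH cosmosdb_provisioned_throughput = 80000")
}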

Sample of the data from Cassandra after it was inserted into Cosmos DB:

[Figure: sample data in Azure Cosmos DB after the load]
