Benchmarking Data Migration from Cassandra to Azure Cosmos DB Cassandra API

Alp Kaya

Akash Shankaran

 

Appendix

 

Environment Details

[Figure: component architecture (v1)]

  1. Cassandra IaaS on Azure (single VM)
  • Azure VM SKU: DS14-8 v2
  • 8 vCPUs
  • 112 GB RAM
  • Single-node cluster

 

Sample of the data that was migrated to the Cosmos DB Cassandra API:

[Figure: sample rows from the Cassandra IaaS VM]

Keyspace/table schema:

 

stresscql.typestest10m (
    name text,
    choice boolean,
    date timestamp,
    address inet,
    dbl double,
    lval bigint,
    ival int,
    uid timeuuid,
    value blob,
    col1 text,
    col2 text,
    col3 text,
    col4 text,
    col5 text
)
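
The listing above omits the primary key, so any DDL for the target table has to assume one. If the table needs to be created on the target ahead of the load, the Scala sketch below uses the Spark Cassandra connector's CassandraConnector helper; the PRIMARY KEY and the replication settings are illustrative assumptions (they are not shown in the original post), and the sketch assumes the active Spark conf already points at the target endpoint.

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.SparkSession

// Minimal sketch: create the keyspace and table on the target before the load.
val spark = SparkSession.builder().getOrCreate()

CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
  // Replication settings are required by CQL syntax; Cosmos DB ignores them.
  session.execute(
    "CREATE KEYSPACE IF NOT EXISTS stresscql " +
      "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
  session.execute(
    """CREATE TABLE IF NOT EXISTS stresscql.typestest10m (
      |    name text, choice boolean, date timestamp, address inet,
      |    dbl double, lval bigint, ival int, uid timeuuid, value blob,
      |    col1 text, col2 text, col3 text, col4 text, col5 text,
      |    PRIMARY KEY ((name, choice), date)  -- assumed key, not in the post
      |)""".stripMargin)
}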

 

  2. Azure Databricks
    • Standard cluster / Runtime 9.1 LTS (Scala 2.12, Spark 3.1.2)
    • Standard DS3_v2
    • Autoscaling from a minimum of 1 to a maximum of 4 workers (14 GB RAM, 4 cores each)

 

Azure Databricks provides a baseline Scala notebook that only needs to be configured for your environment in order to function. Once configured, the notebook reads from your Cassandra environment and writes to your Cosmos DB account. The baseline Azure Databricks configuration file can be found here.

Sample of the Azure Databricks notebook and parameters that can be modified:

keyspace -> stresscql, table -> typestest10m

cosmosCassandra: scala.collection.immutable.Map[String,String] = Map(
    spark.cassandra.output.concurrent.writes -> 25,
    spark.cassandra.concurrent.reads -> 512,
    spark.cassandra.connection.ssl.enabled -> true,
    spark.cassandra.connection.keep_alive_ms -> 600000000,
    spark.cassandra.output.batch.size.rows -> 1,
    spark.cassandra.output.batch.grouping.buffer.size -> 512, ...

Note: the Spark Cassandra connector settings listed above are the configurations that can be tuned.
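
To make the flow concrete, here is a minimal sketch of the read-from-Cassandra, write-to-Cosmos step the notebook performs once these maps are defined. The host names and credentials are placeholders, and the tuning values simply mirror the sample output above; none of them are the exact values used in the benchmark runs.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Source: the Cassandra IaaS VM (placeholder address).
val nativeCassandra = Map(
  "spark.cassandra.connection.host" -> "<cassandra-vm-ip>",
  "spark.cassandra.connection.port" -> "9042"
)

// Target: the Cosmos DB Cassandra API endpoint (placeholder account/key),
// carrying the tuning settings from the sample output above.
val cosmosCassandra = Map(
  "spark.cassandra.connection.host" -> "<account-name>.cassandra.cosmos.azure.com",
  "spark.cassandra.connection.port" -> "10350",
  "spark.cassandra.connection.ssl.enabled" -> "true",
  "spark.cassandra.auth.username" -> "<account-name>",
  "spark.cassandra.auth.password" -> "<account-key>",
  "spark.cassandra.connection.keep_alive_ms" -> "600000000",
  "spark.cassandra.output.concurrent.writes" -> "25",
  "spark.cassandra.concurrent.reads" -> "512",
  "spark.cassandra.output.batch.size.rows" -> "1",
  "spark.cassandra.output.batch.grouping.buffer.size" -> "512"
)

// Read the source table from the Cassandra IaaS VM...
val sourceDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(nativeCassandra ++ Map("keyspace" -> "stresscql", "table" -> "typestest10m"))
  .load()

// ...and append it into the identically named table on Cosmos DB.
sourceDf.write
  .format("org.apache.spark.sql.cassandra")
  .options(cosmosCassandra ++ Map("keyspace" -> "stresscql", "table" -> "typestest10m"))
  .mode("append")
  .save()

In this setup, write throughput is governed mainly by the RUs provisioned on the target and the connector's concurrent-write settings, which is why those are the knobs varied across runs.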

  3. Azure Cosmos DB
    • Single-region deployment
    • Provisioned throughput varied across runs, from 24,000 RUs in run #1 to 80,000 RUs in run #5 (a sketch of adjusting throughput follows below)
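
Provisioned throughput on the target can be changed between runs without recreating the table. A minimal sketch, assuming the Cosmos DB Cassandra API's cosmosdb_provisioned_throughput CQL option and a Spark conf already pointed at the Cosmos endpoint:

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.SparkSession

// Example: raise the table's throughput to the run #5 level (80,000 RUs).
val spark = SparkSession.builder().getOrCreate()

CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
  session.execute(
    "ALTER TABLE stresscql.typestest10m WITH cosmosdb_provisioned_throughput = 80000")
}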

Sample of the data from Cassandra after it was inserted into Cosmos DB:

[Figure: sample data in Azure Cosmos DB after the load]
