Co-authors: RK Iyer, Mangal Dutta
Migrating stateful systems, such as databases, is a complex process. A frequent requirement for customers is to transfer data from DynamoDB to Azure Cosmos DB for NoSQL. This process involves several stages, including exporting data from DynamoDB, performing necessary transformations, and importing the data into Azure Cosmos DB.
Common migration techniques: Offline and Online
Migration strategies include offline and online approaches, which can be used independently or combined based on your needs. Online migration is ideal for applications requiring real-time data transfer with zero downtime, while offline migration suits scenarios where applications can be paused during a maintenance window, allowing data to be exported from DynamoDB to an intermediate location before importing it into Azure Cosmos DB. You could also use a hybrid approach where bulk data migration occurs offline, followed by real-time synchronization to maintain consistency if you need to (temporarily) continue using DynamoDB in parallel with Azure Cosmos DB.
Offline data migration from DynamoDB to Azure Cosmos DB for NoSQL
One approach uses a combination of Azure Data Factory, Azure Storage (with Azure Data Lake Storage Gen2), and Apache Spark on Azure Databricks.
First, data from the DynamoDB table is exported to S3 (in DynamoDB JSON format) using the native DynamoDB export capability. The exported data in S3 is then copied to Azure Data Lake Storage Gen2 (ADLS Gen2) using an Azure Data Factory (ADF) pipeline. Finally, the data in Azure storage is processed with Spark on Azure Databricks and written to Azure Cosmos DB using the Azure Cosmos DB Spark connector for the NoSQL API.
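For example, the export step can be kicked off programmatically. Here is a minimal sketch using boto3; the table ARN, bucket, and prefix are illustrative placeholders, and note that the table must have point-in-time recovery enabled for the export to work:

```python
# Minimal sketch: trigger a native DynamoDB export to S3 with boto3.
# TableArn, S3Bucket, and S3Prefix are illustrative placeholders.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/orders",
    S3Bucket="my-migration-bucket",
    S3Prefix="dynamodb-export/",
    ExportFormat="DYNAMODB_JSON",  # the format the downstream Spark step expects
)

# The export runs asynchronously; poll its status before starting the ADF copy.
print(response["ExportDescription"]["ExportStatus"])  # e.g. IN_PROGRESS
```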
This approach decouples storage and processing, which can be beneficial when dealing with large datasets. Apache Spark lets you scale data processing across multiple worker nodes and offers a lot of flexibility for data transformations. The downside is that the multi-stage process increases complexity and overall latency. It also requires knowledge of Apache Spark and could introduce a learning curve depending on the skillset of your team.
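To make the Databricks step concrete, here is a minimal PySpark sketch, assuming the ADF pipeline has landed the DynamoDB JSON export in an ADLS Gen2 container. The storage path, field names, and Cosmos DB account details are hypothetical and would need to match your environment:

```python
# Minimal PySpark sketch of the Databricks step. The `spark` session is the
# one Databricks notebooks provide. Paths, schema fields, and Cosmos DB
# account details below are illustrative assumptions.
from pyspark.sql import functions as F

# DynamoDB exports are JSON lines of the form:
#   {"Item": {"pk": {"S": "..."}, "orderTotal": {"N": "12.5"}, ...}}
raw = spark.read.json(
    "abfss://migration@<storage-account>.dfs.core.windows.net/dynamodb-export/"
)

# Unwrap the DynamoDB attribute-value encoding for the fields you need.
# Cosmos DB items require an "id" field, so derive one here.
items = raw.select(
    F.col("Item.pk.S").alias("id"),
    F.col("Item.orderTotal.N").cast("double").alias("orderTotal"),
)

(items.write
    .format("cosmos.oltp")  # Azure Cosmos DB Spark connector (NoSQL API)
    .option("spark.cosmos.accountEndpoint", "https://<cosmos-account>.documents.azure.com:443/")
    .option("spark.cosmos.accountKey", "<cosmos-account-key>")
    .option("spark.cosmos.database", "migration")
    .option("spark.cosmos.container", "orders")
    .mode("APPEND")
    .save())
```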
Alternative options
You can also explore other approaches, such as exporting from DynamoDB to S3 and using ADF to read directly from S3 and write to Azure Cosmos DB, or leveraging Spark on Azure Databricks to read directly from DynamoDB and write to Azure Cosmos DB. Both of these options have their pros and cons.
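As an illustration of the second option, reading directly from DynamoDB in Spark typically relies on a third-party connector. The sketch below assumes the community com.audienceproject:spark-dynamodb library is installed on the cluster and that AWS credentials are available to it; treat it as a starting point rather than a recommendation:

```python
# Illustrative sketch only: assumes the community spark-dynamodb connector
# (com.audienceproject:spark-dynamodb) is installed on the Databricks cluster
# and AWS credentials are configured for it.
orders = (spark.read
    .format("dynamodb")
    .option("tableName", "orders")
    .option("region", "us-east-1")
    .load())

# Cosmos DB items need an "id" field; derive one from the table's key.
orders = orders.withColumnRenamed("pk", "id")  # assumes a "pk" key attribute

# Write straight to Azure Cosmos DB with the Cosmos DB Spark connector.
(orders.write
    .format("cosmos.oltp")
    .option("spark.cosmos.accountEndpoint", "https://<cosmos-account>.documents.azure.com:443/")
    .option("spark.cosmos.accountKey", "<cosmos-account-key>")
    .option("spark.cosmos.database", "migration")
    .option("spark.cosmos.container", "orders")
    .mode("APPEND")
    .save())
```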
Online migration approaches
Online migration from DynamoDB typically employs a change data capture (CDC) mechanism to stream data changes from DynamoDB. Although this is a near real-time process, you will need to build an additional component to process the streaming data and write it to Azure Cosmos DB. This could be an AWS Lambda function triggered by DynamoDB Streams, or you could use Kinesis Data Streams and process the data with a Kinesis consumer application or Apache Flink.
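To make the Lambda option concrete, here is a minimal sketch of a DynamoDB Streams-triggered function that replays changes into Azure Cosmos DB. It assumes the azure-cosmos Python SDK is packaged with the function, that the stream is configured to emit new images, that connection details come from environment variables, and that the table's key attribute is named pk; all of these names are illustrative:

```python
# Minimal sketch of a DynamoDB Streams-triggered Lambda that replays changes
# into Azure Cosmos DB. Assumes the azure-cosmos package is bundled with the
# function and COSMOS_ENDPOINT/COSMOS_KEY are set as environment variables.
import os
from decimal import Decimal

from boto3.dynamodb.types import TypeDeserializer
from azure.cosmos import CosmosClient

deserializer = TypeDeserializer()
client = CosmosClient(os.environ["COSMOS_ENDPOINT"], credential=os.environ["COSMOS_KEY"])
container = client.get_database_client("migration").get_container_client("orders")

def _clean(value):
    # DynamoDB numbers deserialize to Decimal and string/number sets to set,
    # neither of which is JSON-serializable; normalize them for Cosmos DB.
    if isinstance(value, Decimal):
        return int(value) if value % 1 == 0 else float(value)
    if isinstance(value, set):
        return [_clean(v) for v in value]
    if isinstance(value, dict):
        return {k: _clean(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_clean(v) for v in value]
    return value

def handler(event, context):
    for record in event["Records"]:
        change = record["dynamodb"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            item = {k: _clean(deserializer.deserialize(v))
                    for k, v in change["NewImage"].items()}
            item["id"] = str(item["pk"])  # Cosmos DB requires "id"; assumes a "pk" key attribute
            container.upsert_item(item)
        elif record["eventName"] == "REMOVE":
            keys = {k: _clean(deserializer.deserialize(v))
                    for k, v in change["Keys"].items()}
            container.delete_item(item=str(keys["pk"]), partition_key=str(keys["pk"]))
```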
As always, each approach has its strengths and weaknesses. For example, DynamoDB Streams provides ordering guarantees but retains data for only 24 hours, which may or may not be suitable if you are migrating large data volumes to Azure Cosmos DB. On the other hand, you can write Flink job(s) to consume from Kinesis Data Streams and perform complex aggregations on the data before writing it to Azure Cosmos DB. However, note that Kinesis Data Streams does not provide the same ordering guarantees.
Want to dive in deeper?
If you are interested in exploring this further, I would encourage you to check out the documentation page Data migration from DynamoDB to Azure Cosmos DB for NoSQL, which covers a complete walkthrough of an offline migration approach and explores the pros and cons of some of the offline/online migration options. You can also access the Spark notebook and the ADF pipeline from the migration-dynamodb-to-cosmosdb-nosql GitHub repository and tweak them based on your requirements.
Let us know how that goes!
About Azure Cosmos DB
Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.
Try Azure Cosmos DB for free here. To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn.