Long-term backup with Azure Synapse Link for Azure Cosmos DB

Emmanuel Deletang

Azure Cosmos DB offers automatic and integrated backups, and recently introduced new ways to modify the number and frequency of those backups. Most customers find these options sufficient, because their database backups aren't used for high availability (HA) but to protect against data corruption, accidental deletions, and other human errors. For legacy reasons, however, some clients need to keep long-term backups, and others would like the ability to query their backups. This post shows how Azure Synapse Link can help in these scenarios.

Azure Synapse Link for Azure Cosmos DB is a cloud-native hybrid transactional and analytical processing (HTAP) capability that enables you to run near-real-time analytics over operational data in Azure Cosmos DB. It creates a tight, seamless integration between Azure Cosmos DB and Azure Synapse Analytics. Azure Synapse Link can be used with both the Azure Cosmos DB Core (SQL) API and the API for MongoDB, and the sample below works with both.

Azure Synapse Link – integration between Azure Cosmos DB and Azure Synapse Analytics
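Synapse Link reads from the analytical store, so before following the steps below, make sure the feature is enabled on your Azure Cosmos DB account and that the analytical store is turned on for each container you want to back up. As a minimal sketch, assuming the azure-cosmos Python SDK and hypothetical account credentials and names, a Core (SQL) API container can be created with the analytical store enabled like this:

from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical endpoint, key, and names - replace with your own.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists("backupdemo")

# analytical_storage_ttl=-1 retains data in the analytical store
# indefinitely, which is what a long-term backup scenario needs.
container = database.create_container_if_not_exists(
    id="ede",
    partition_key=PartitionKey(path="/id"),
    analytical_storage_ttl=-1,
)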

 

Azure Blob Storage offers a solution for storing business-critical data. Immutable storage for Azure Blob Storage enables users to store data objects in a WORM (Write Once, Read Many) state, making the data non-erasable and non-modifiable for a user-specified interval. For the duration of the retention interval, blobs can be created and read, but cannot be modified or deleted.
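As a sketch of how such a retention interval might be configured, assuming the azure-mgmt-storage management SDK and hypothetical resource names:

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Hypothetical subscription, resource group, account, and container names.
client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Apply a 365-day time-based retention policy: blobs in the container
# can be created and read, but not modified or deleted, for 365 days.
client.blob_containers.create_or_update_immutability_policy(
    resource_group_name="my-rg",
    account_name="edeadlsgen2",
    container_name="ede",
    parameters={"immutability_period_since_creation_in_days": 365},
)

Until the policy is locked it can still be removed, so lock it once it has been tested to make the container truly WORM.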

 

How to create a long-term backup for Azure Cosmos DB

In this example, we will use Azure Synapse Link for Azure Cosmos DB to read the Azure Cosmos DB data and write it to immutable storage. Let me show you how to do it.

 

Step 1: In Azure Synapse Analytics, start by creating a linked service to your Azure Cosmos DB account and another to your Azure Blob Storage.


 

Step 2: Read and restore data with Spark notebooks. 

  1. Open a Spark notebook in Azure Synapse Analytics.
  2. Read the data from Azure Cosmos DB through the analytical store (cosmos.olap), which doesn't consume the container's provisioned throughput (request units, RU/s):

# Read from the Azure Cosmos DB analytical store into a Spark DataFrame and display 10 rows
# To select a preferred list of regions in a multi-region Azure Cosmos DB account,
# add .option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")
tosave = spark.read\
    .format("cosmos.olap")\
    .option("spark.synapse.linkedService", "mongoAPI")\
    .load()

display(tosave.limit(10))
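It's also worth capturing how many documents the snapshot contains, so a later restore can be checked against it:

# Record the document count of the backup for later verification.
print(tosave.count())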

3. Write your data to your immutable storage. You'll need to create the destination folder before writing the data; see the sketch after the write call below.

tosave.write.json('abfss://ede@edeadlsgen2.dfs.core.windows.net/ede/bakup/ede.json')
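A minimal sketch for creating that folder from the same notebook, assuming the Synapse mssparkutils helper and the storage path used above (run it before the write call):

from notebookutils import mssparkutils

# Create the backup folder ahead of time; mkdirs succeeds even if the
# path already exists.
mssparkutils.fs.mkdirs('abfss://ede@edeadlsgen2.dfs.core.windows.net/ede/bakup')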

4. If a restore is needed, you just have to load the data into a DataFrame by reading it back in Azure Synapse:

%%pyspark
restore = spark.read.load('abfss://ede@edeadlsgen2.dfs.core.windows.net/ede/bakup/ede.json', format='json')

display(restore.limit(10))

5. Restore your backup by writing the DataFrame into Azure Cosmos DB. This operation will consume request units (RU/s).

restore.write\
    .format("cosmos.oltp")\
    .option("spark.synapse.linkedService", "mongoAPI")\
    .option("spark.cosmos.container", "ede")\
    .option("spark.cosmos.write.upsertEnabled", "true")\
    .mode('append')\
    .save()

 

6. To restore only a subset of the data, query just the applicable documents in the DataFrame.

 a) Filter on the _ts column (the epoch-seconds last-modified timestamp), for example:

restorequery = restore[(restore._ts <= 1605517579)]

display(restorequery)
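Because _ts is expressed in epoch seconds, a cutoff such as 1605517579 can be derived from a human-readable UTC date. A small sketch using Python's standard library (the date below reproduces the cutoff used above):

from datetime import datetime, timezone

# 2020-11-16 09:06:19 UTC expressed as epoch seconds, i.e. 1605517579.
cutoff = int(datetime(2020, 11, 16, 9, 6, 19, tzinfo=timezone.utc).timestamp())
restorequery = restore[(restore._ts <= cutoff)]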

 

b) Restore by writing the new DataFrame into the existing Azure Cosmos DB database:

restorequery.write\
    .format("cosmos.oltp")\
    .option("spark.synapse.linkedService", "mongoAPI")\
    .option("spark.cosmos.container", "ede")\
    .option("spark.cosmos.write.upsertEnabled", "true")\
    .mode('append')\
    .save()

 

 

Get Started 

 
