Long-term backup with Azure Synapse Link for Azure Cosmos DB

Azure Cosmos DB offers automatic and integrated backups, and recently introduced new ways to modify the number and frequency of backups. Most customers find these options sufficient, because their database backups aren't used for high availability (HA) but to protect against data corruption, accidental deletions, and other human errors. For legacy reasons, however, some customers need to keep long-term backups, and others would like the ability to query their backups. This post shows how Azure Synapse Link can help in these scenarios.
Azure Synapse Link for Azure Cosmos DB is a cloud-native hybrid transactional and analytical processing (HTAP) capability that enables you to run near real-time analytics over operational data in Azure Cosmos DB. It creates a tight, seamless integration between Azure Cosmos DB and Azure Synapse Analytics. Azure Synapse Link can be used with both the Azure Cosmos DB Core (SQL) API and the API for MongoDB, and the sample below works with both.
Azure Blob Storage offers a solution to store business-critical data. Immutable storage for Azure Blob storage enables users to store data objects in a WORM state (Write Once, Read Many) and makes the data non-erasable and non-modifiable for a user-specified interval. For the duration of the retention interval, blobs can be created and read, but cannot be modified or deleted.
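For reference, here is a minimal sketch of applying such a time-based retention policy with the azure-mgmt-storage management SDK. This is an illustration rather than part of the walkthrough: the resource names are placeholders, and the exact model surface may differ between SDK versions.
# Sketch (assumed SDK surface, placeholder names): make a blob container WORM
# by applying a time-based immutability policy with azure-mgmt-storage.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import ImmutabilityPolicy

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Blobs in this container can be created and read, but not modified or
# deleted, for 365 days after each blob's creation.
client.blob_containers.create_or_update_immutability_policy(
    resource_group_name="<resource-group>",
    account_name="<storage-account>",
    container_name="<container>",
    parameters=ImmutabilityPolicy(immutability_period_since_creation_in_days=365),
)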
How to create a long-term backup for Azure Cosmos DB
In this example, we will use Azure Synapse Link for Azure Cosmos DB to read the Azure Cosmos DB data, and Azure Blob immutable storage to store the backup. Let me show you how to do it.
Step 1: Create Azure Synapse Link for Azure Cosmos DB
In Azure Synapse Analytics, start by creating linked services to your Azure Cosmos DB account and to your Azure Blob Storage account.
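The linked services themselves are created in Synapse Studio (Manage > Linked services). One prerequisite is that the analytical store is enabled on the Azure Cosmos DB account; as a rough sketch, assuming the azure-mgmt-cosmosdb SDK and placeholder resource names, that can also be done programmatically:
# Sketch (assumed SDK surface, placeholder names): enable the analytical
# store (Synapse Link) on an existing Azure Cosmos DB account.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.mgmt.cosmosdb.models import DatabaseAccountUpdateParameters

client = CosmosDBManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.database_accounts.begin_update(
    resource_group_name="<resource-group>",
    account_name="<cosmos-account>",
    update_parameters=DatabaseAccountUpdateParameters(enable_analytical_storage=True),
)
poller.result()  # block until the account update completes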
Step 2: Read and restore data with Spark notebooks
Open a Spark notebook in Azure Synapse Analytics, then:
1. Read the data from Azure Cosmos DB using the analytical store, to avoid consuming throughput request units (RU/s):
# Read from the Azure Cosmos DB analytical store into a Spark DataFrame
# and display 10 rows from the DataFrame.
# To select a preferred list of regions in a multi-region Azure Cosmos DB account,
# add .option("spark.cosmos.preferredRegions", "<Region1>,<Region2>")
tosave = spark.read\
    .format("cosmos.olap")\
    .option("spark.synapse.linkedService", "mongoAPI")\
    .load()
display(tosave.limit(10))
2. Write your data to your immutable storage. Note that you'll need to create the target folder before writing your data:
tosave.write.json("abfss://ede@edeadlsgen2.dfs.core.windows.net/ede/backup/ede.json")
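If you take these backups on a schedule, one variation (a suggestion, not part of the original walkthrough) is to write each run to a UTC date-stamped folder, so a new backup never collides with an earlier, now-immutable blob:
# Sketch: write each backup under a date-stamped folder, using the same
# storage account and container as the step above.
from datetime import datetime, timezone

stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
tosave.write.mode("error").json(
    f"abfss://ede@edeadlsgen2.dfs.core.windows.net/ede/backup/{stamp}/ede.json"
)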
3. If a restore is needed, just load your data into a DataFrame by reading it back in Azure Synapse:
%%pyspark
restore = spark.read.load("abfss://ede@edeadlsgen2.dfs.core.windows.net/ede/backup/ede.json", format="json")
display(restore.limit(10))
4. Restore your backup by writing the DataFrame into Azure Cosmos DB. Note that this operation will consume request units (RUs):
restore.write\
    .format("cosmos.oltp")\
    .option("spark.synapse.linkedService", "mongoAPI")\
    .option("spark.cosmos.container", "ede")\
    .option("spark.cosmos.write.upsertEnabled", "true")\
    .mode("append")\
    .save()
5. To restore only a subset of the data, filter the DataFrame down to the applicable records.
a) Filter on the _ts column, which holds each item's last-modified time as Unix epoch seconds, for example (a sketch for deriving such a cutoff from a calendar date follows these steps):
restorequery = restore[(restore._ts <= 1605517579)]
display(restorequery.limit(10))
b) Restore by writing the filtered DataFrame to the existing Azure Cosmos DB database:
restorequery.write\
    .format("cosmos.oltp")\
    .option("spark.synapse.linkedService", "mongoAPI")\
    .option("spark.cosmos.container", "ede")\
    .option("spark.cosmos.write.upsertEnabled", "true")\
    .mode("append")\
    .save()
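Since _ts holds Unix epoch seconds, a small conversion makes the cutoff in step 5a easier to read. This sketch is an addition for illustration; it reuses the restore DataFrame from step 3:
# Sketch: derive the epoch-seconds cutoff used in step 5a from a calendar date.
from datetime import datetime, timezone

cutoff = int(datetime(2020, 11, 16, tzinfo=timezone.utc).timestamp())  # 1605484800
restorequery = restore.filter(restore._ts <= cutoff)
display(restorequery.limit(10))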
Get Started
- Try Azure Cosmos DB for free
- Online backup and on-demand data restore in Azure Cosmos DB documentation
- Get started with Azure Synapse Analytics free
- Find Azure Synapse Link documentation
- Visit the Azure Synapse Link for Azure Cosmos DB sample repo on GitHub