Unlocking the power of Time-Travel in Azure Synapse Link for Azure Cosmos DB

Revin Chalil

Rodrigo Souza

Azure Synapse Link now offers time-travel capabilities for Azure Cosmos DB data supporting NoSQL and MongoDB APIs. This exciting feature, currently available in public preview through Azure Synapse Spark, enables you to access Azure Cosmos DB data in the analytical store, precisely as it appeared at specific points in time in history (down to the millisecond level). With time-travel, you can effortlessly query past data that has been updated or deleted, analyze trends, and compare differences between two points of interest.

The diagram below illustrates how the BasePrice in the Product container gets updated over time and how the time-travel query, using the “spark.cosmos.timetravel.timestampAsOf” configuration, can retrieve the BasePrice for products at any given historical moment. Additionally, it demonstrates the use of the “spark.cosmos.timetravel.fullFidelity” configuration to access all versions of records at a specific point in history.

TimeTravel Azure SynapseLink for Azure CosmosDB

Please note that if an Analytical Time-To-Live (ATTL) is defined on the Azure Cosmos DB container, it serves as the maximum retention period for time-travel operations. However, if ATTL is not defined or set as -1, there is no maximum retention period.

Scenarios

  • Data Audit:

Auditing data changes is crucial for data compliance and understanding how data has evolved over time. The time-travel feature empowers you to track changes, access all versions of updates, and perform data analysis at any desired point in time.

  • Trend Analysis:

By specifying the combination of “spark.cosmos.timetravel.startTimestamp” and “spark.cosmos.timetravel.timestampAsOf” configurations, you can compare and analyze differences between two specific points in time. For instance, you can compare the product inventory quantity from three months ago with that from six months ago.

  • Repairing Accidental Data Changes:

The time-travel feature is invaluable for rectifying individual records to their last known good state, making it efficient to perform repairs without resorting to backups and restores. Once you access the desired data as it existed in the last known good state using the “timestampAsOf” value, you can either update the Azure Cosmos DB container with that data or ingest the records into a new container.

  • Azure Cosmos DB Container as a Slowly Changing Dimension:

Slowly changing dimensions are used to keep track of changes in attribute values and to report historical data at any given point of time. Time-travel queries, along with the “fullFidelity” option, provide the functionality of Type 2 slowly changing dimensions by keeping track of attribute value changes represented as separate rows with validity period.

Using Time Travel with Azure Synapse Spark

To perform time-travel operations on Azure Cosmos DB data, ensure that your Azure Cosmos DB account has been enabled for Synapse Link, and analytical store has been enabled on either a new container or an existing container.

The Scala snippet below demonstrates how to load a Spark DataFrame with records from the product container as of January 1, 2022 and can be executed using Azure Synapse Spark Notebook.

import com.microsoft.azure.cosmos.analytics.spark.connector.datasource.CosmosOlapTimeTravel

val configuration = Map(
    "spark.synapse.linkedService" -> "CosmosDBLS",
    "spark.cosmos.container" ->  "product",
    "spark.cosmos.timetravel.timestampAsOf" -> "2022-01-01 00:00:00"
)

val df = CosmosOlapTimeTravel.load(configuration)
display(df)

 

The equivalent syntax in Python would be the following and can be executed using Azure Synapse Spark Notebook:

from pyspark.sql import DataFrame, SQLContext
sqlCtx = SQLContext(sc)

configuration = {
  "spark.synapse.linkedService": "CosmosDBLS",
  "spark.cosmos.container": "product",
  "spark.cosmos.timetravel.timestampAsOf": "2022-01-01 00:00:00"
  }
                                                                                           
cosmosOlapTimeTravel = sc._jvm.com.microsoft.azure.cosmos.analytics.spark.connector.datasource.CosmosOlapTimeTravel
df = DataFrame(cosmosOlapTimeTravel.load(configuration), sqlCtx)
display(df)

Get Started with Azure Cosmos DB for free

Azure Cosmos DB is a fully managed NoSQL and relational database for modern app development with SLA-backed speed and availability, automatic and instant scalability, and support for open source PostgreSQL, MongoDB and Apache Cassandra. Try Azure Cosmos DB for free here. To stay in the loop on Azure Cosmos DB updates, follow us on TwitterYouTube, and LinkedIn.