Change Data Capture (CDC) is a common fixture in database products, used for retrieving and handling updates to the underlying data. In this blog post, we detail the many ways in which the Azure Cosmos DB Cassandra API, through Change Feed, provides a much simpler interface for consuming data mutations than Apache Cassandra’s CDC functionality.
The Azure Cosmos DB Cassandra API is a fully managed service which can be used as a backing data store for applications using Apache Cassandra. Built on top of Azure Cosmos DB, the Cassandra API provides scale, performance and availability guarantees while eliminating the operational overhead needed to manage Cassandra.
Data Mutations are available for all tables by default
Apache Cassandra requires configuration at multiple levels to enable Change Data Capture. First, the cassandra.yaml file must be modified to set cdc_enabled to true, and this must be repeated on every node in each data center. Second, cdc_raw_directory should be specified so that data is moved from the Commit Log folder into the CDC folder once it is flushed to the SSTables. Lastly, each table needs to be individually created or altered to enable CDC functionality.
These configurations are not required when using the Azure Cosmos DB Cassandra API. Change Feed is available by default without the need to explicitly opt in and without the need to coordinate enabling CDC across each node in the cluster.
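For comparison, the per-table step that Apache Cassandra requires can be sketched as below. This is a minimal illustration only: the CdcEnableStatements class and its methods are hypothetical helpers written for this post, not part of any Cassandra driver, and the statements they produce would still need to be executed against the cluster after the cassandra.yaml changes have been applied on every node.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only: builds the per-table CQL that
// Apache Cassandra needs before a table's mutations appear in the CDC folder.
// It assumes cdc_enabled and cdc_raw_directory are already set in
// cassandra.yaml on every node.
final class CdcEnableStatements {
    private CdcEnableStatements() {}

    // Returns the ALTER TABLE statement that turns CDC on for one table.
    static String enableCdc(String keyspace, String table) {
        return "ALTER TABLE " + keyspace + "." + table + " WITH cdc = true;";
    }

    // One statement per table: every table has to be opted in individually.
    static List<String> enableCdcForAll(String keyspace, List<String> tables) {
        return tables.stream()
                .map(table -> enableCdc(keyspace, table))
                .collect(Collectors.toList());
    }
}
```

With the Cosmos DB Cassandra API none of these statements are needed, since Change Feed is on by default.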
Change Logs are available immediately
In Apache Cassandra, CommitLogSegments are moved into the CDC folder only after the writes are flushed to SSTables on disk. This means that clusters with a low write volume, particularly during time windows when utilization is low, will experience longer wait times before the changes can be consumed. In some cases, a periodic manual flush of Memtables into SSTables may also be required to make the availability of CDC data predictable.
When using the Cosmos DB Cassandra API, CDC data is available immediately regardless of the rate of ingestion. Since there are no flushes that need to be triggered to explicitly make mutations available, there is no operational overhead required to make the availability of consumable changes predictable.
The need for Capacity Planning is eliminated
In Apache Cassandra, the configured CDC folder consumes physical disk space on each Cassandra node. This folder can become full if the mutations are not consumed and cleaned up on time. This poses the risk of ingestion into the source table being blocked until disk space is freed up in the CDC directory.
In Azure Cosmos DB Cassandra API, the Change Feed functionality does not require capacity management of any kind. The underlying storage system that persists the data and makes changes consumable does not have capacity limitations that need to be monitored and cleaned up periodically. The database scales out under the covers without any impact to the application using the service.
The Entire Row is Available in the Cosmos DB Cassandra API
Apache Cassandra’s CDC functionality does not provide visibility into the entire row that was changed. Only the table name, partition key, mutations and associated timestamps are provided. This adds operational overhead if incremental changes need to be maintained.
With the Cosmos DB Cassandra API, the entire row is returned by Change Feed. Change Feed does not yet return the previous snapshot of the row prior to the change; stay tuned for another blog post on soon-to-be-available functionality that will also include the previous image prior to the mutation.
The Need for De-Duplication is Eliminated
Apache Cassandra replicates data across multiple nodes through a configurable replication factor. The replication factor determines the number of nodes within a data center that will contain each row of data. This also means that the mutations for the same row will exist on multiple nodes in a data center. Thus, a de-duplication mechanism will need to be crafted within the consuming application to account for the same row’s mutation being captured multiple times.
While Cosmos DB also replicates data for high availability, Change Feed only returns the mutations of a row from a single copy of the data. This avoids the complexity of having to de-duplicate the changes and makes consumption of change logs far simpler.
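To give a sense of the de-duplication work a consumer of Apache Cassandra's CDC logs has to take on, here is a minimal sketch. The Mutation class is a hypothetical, simplified stand-in for a decoded commit log mutation, and treating (table, partition key, write timestamp) as the identity of a mutation is an assumption made for illustration; a real handler would need an identity that fits its data model, and a bounded or persistent "seen" store.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Illustrative sketch only: drops mutations that were already read from
// another replica's CDC folder.
final class MutationDeduplicator {

    // Hypothetical, simplified view of a decoded CDC mutation.
    static final class Mutation {
        final String table;
        final String partitionKey;
        final long writeTimestampMicros;

        Mutation(String table, String partitionKey, long writeTimestampMicros) {
            this.table = table;
            this.partitionKey = partitionKey;
            this.writeTimestampMicros = writeTimestampMicros;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof Mutation)) return false;
            Mutation m = (Mutation) other;
            return writeTimestampMicros == m.writeTimestampMicros
                    && table.equals(m.table)
                    && partitionKey.equals(m.partitionKey);
        }

        @Override
        public int hashCode() {
            return Objects.hash(table, partitionKey, writeTimestampMicros);
        }
    }

    // In-memory and unbounded: acceptable for a sketch, not for production.
    private final Set<Mutation> seen = new HashSet<>();

    // Returns true only the first time a given mutation is observed.
    boolean firstSighting(Mutation mutation) {
        return seen.add(mutation);
    }

    // Keeps the first copy of each mutation, preserving arrival order.
    static List<Mutation> dedupe(List<Mutation> fromAllReplicas) {
        MutationDeduplicator deduplicator = new MutationDeduplicator();
        List<Mutation> unique = new ArrayList<>();
        for (Mutation mutation : fromAllReplicas) {
            if (deduplicator.firstSighting(mutation)) {
                unique.add(mutation);
            }
        }
        return unique;
    }
}
```

None of this bookkeeping is needed when reading from Change Feed, because each change is returned from a single copy of the data.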
Changes Can be Retrieved from a Chosen Start Time
In Apache Cassandra, once the Commit Logs have been consumed from the CDC folder, the files are deleted, and the mutations are lost. These mutations would need to be moved into another data store if the use case mandates multiple views of these changes.
This too is much simpler on the Cosmos DB Cassandra API. Changes can be retrieved by specifying a chosen start time, and only mutations from that start time onward are returned by Cosmos DB. This gives application teams the flexibility to iterate quickly without having to ensure that the handler is fine-tuned before use. Most importantly, the mutations live on for the lifetime of the table and do not get archived after they are consumed, so the same changes can be retrieved repeatedly.
DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
LocalDateTime now = LocalDateTime.now().minusHours(6).minusMinutes(30);
String query = "SELECT * FROM uprofile.user where COSMOS_CHANGEFEED_START_TIME()='" + dtf.format(now) + "'";
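The same query can be wrapped in a small helper so that the start time is passed in explicitly, which makes it easier to re-read the same window of changes. This is a sketch rather than an official client: the ChangeFeedQuery class and its method names are hypothetical, and only the query text and the timestamp pattern come from the snippet above. The resulting string would still need to be executed through a Cassandra driver session.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, for illustration only: builds the Change Feed query
// text for a given table and start time.
final class ChangeFeedQuery {
    // Same timestamp pattern as the snippet above.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    private ChangeFeedQuery() {}

    // Query for all changes from startTime onward.
    static String since(String keyspace, String table, LocalDateTime startTime) {
        return "SELECT * FROM " + keyspace + "." + table
                + " where COSMOS_CHANGEFEED_START_TIME()='"
                + FORMAT.format(startTime) + "'";
    }

    // Query for changes in the trailing window that ends at 'now'.
    static String lookingBack(String keyspace, String table,
                              LocalDateTime now, Duration window) {
        return since(keyspace, table, now.minus(window));
    }
}
```

For example, ChangeFeedQuery.lookingBack("uprofile", "user", LocalDateTime.now(), Duration.ofHours(6).plusMinutes(30)) builds the same query as the snippet above, and calling it again later with an earlier start time replays changes that were already consumed.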
Next Steps
To learn more, see Change feed in the Azure Cosmos DB API for Cassandra.
Stay tuned for another blog on parallelizing the Change Feed handler and new additions to Change Feed for the Azure Cosmos DB API for Cassandra.
I’m still surprised that to this day, Microsoft hasn’t provided support for every single data mutation (i.e. full fidelity) for SQL API customers. This should not be a feature request sitting in the backlog. It should be provided out of the box by every modern database.
Hang in there. Coming soon 🙂