{"id":4084,"date":"2022-04-05T07:00:49","date_gmt":"2022-04-05T14:00:49","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cosmosdb\/?p=4084"},"modified":"2022-04-04T15:16:03","modified_gmt":"2022-04-04T22:16:03","slug":"simplified-cassandra-cdc-with-change-feed","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cosmosdb\/simplified-cassandra-cdc-with-change-feed\/","title":{"rendered":"Simplified CDC with Azure Cosmos DB Cassandra API"},"content":{"rendered":"<p><span data-contrast=\"auto\">Change Data Capture (CDC) is a common fixture in database products, used for retrieving and handling updates to the underlying data. In this blog post, we detail the ways in which the Azure Cosmos DB Cassandra API, through<\/span> <a href=\"https:\/\/aka.ms\/AAgajig\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"auto\">Change Feed<\/span><\/a><span data-contrast=\"auto\">, provides a much simpler interface for consuming data mutations than Apache Cassandra\u2019s CDC functionality.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The <\/span><a href=\"https:\/\/aka.ms\/AAgkyvg\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"auto\">Azure Cosmos DB Cassandra API<\/span><\/a><span data-contrast=\"auto\">\u00a0is a fully managed service that can be used as a backing data store for applications using Apache Cassandra. 
Built on top of <\/span><a href=\"https:\/\/aka.ms\/AAgl65g\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"auto\">Azure Cosmos DB<\/span><\/a><span data-contrast=\"auto\">, the Cassandra API provides scale, performance and availability guarantees while eliminating the operational overhead needed to manage Cassandra.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h3><span data-contrast=\"none\">Data Mutations are available for all tables by default<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559738&quot;:40,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">Apache Cassandra requires multiple levels of configurations to enable Change Data Capture. Firstly, the cassandra.yaml file should be modified to set <\/span><b><i><span data-contrast=\"auto\">cdc_enabled<\/span><\/i><\/b><span data-contrast=\"auto\"> to true. This should then be repeated for every node in each data center. Secondly, the <\/span><b><i><span data-contrast=\"auto\">cdc_raw_directory<\/span><\/i><\/b><span data-contrast=\"auto\"> should be specified to move data from the Commit Log folder into the CDC folder once data is flushed to the SSTables. Lastly, each table needs to be individually created or altered to enable CDC functionality.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">These configurations are not required when using the Azure Cosmos DB Cassandra API. 
Change Feed is available by default without the need to explicitly opt in and without the need to coordinate enabling CDC across each node in the cluster.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">Change Logs are available immediately<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559738&quot;:40,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">In Apache Cassandra, CommitLogSegments are moved into the CDC folder only after the writes are flushed to SSTables on disk. This means that clusters with a low write volume, particularly during time windows when utilization is low, will experience longer wait times before the changes can be consumed. Sometimes, this may also require a manual periodic flush of Memtables into SSTables to make the availability of CDC data predictable.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">When using the Cosmos DB Cassandra API, CDC data is available immediately regardless of the rate of ingestion. Since there are no flushes that need to be triggered to explicitly make mutations available, there is no operational overhead required to make the availability of consumable changes predictable.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">The need for Capacity Planning is eliminated<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559738&quot;:40,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">In Apache Cassandra, the configured CDC folder consumes physical disk space on each Cassandra node. 
This folder can become full if the mutations are not consumed and cleaned up on time. This poses the risk of ingestion into the source table being blocked until disk space is freed up in the CDC directory.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">In Azure Cosmos DB Cassandra API, the Change Feed functionality does not require capacity management of any kind. The underlying storage system that persists the data and makes changes consumable does not have capacity limitations that need to be monitored and cleaned up periodically. The database scales out under the covers without any impact to the application using the service.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">The Entire Row is Available in the Cosmos DB Cassandra API<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559738&quot;:40,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">Apache Cassandra\u2019s CDC functionality does not provide visibility into the entire row that was changed. Only the table name, partition key, mutations and associated timestamps are provided. This adds operational overhead if incremental changes need to be maintained.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">With the Cosmos DB Cassandra API, the entire row is returned by Change Feed. 
Change Feed does not yet return the snapshot of the row as it was before the change. Stay tuned for another blog post on soon-to-be-available functionality that will also include the previous image prior to the mutation.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">The Need for De-Duplication is Eliminated<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559738&quot;:40,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">Apache Cassandra replicates data across multiple nodes through a configurable replication factor. The replication factor determines the number of nodes within a data center that will contain each row of data. This also means that the mutations for the same row will exist on multiple nodes in a data center. Thus, a de-duplication mechanism must be built into the consuming application to account for the same row\u2019s mutation being captured multiple times.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">While Cosmos DB also replicates data for high availability, Change Feed only returns the mutations of a row from a single copy of the data. 
This avoids the complexity of having to de-duplicate the changes and makes consumption of change logs far simpler.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">Changes Can be Retrieved from a Chosen Start time<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559738&quot;:40,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">In Apache Cassandra, once the Commit Logs have been consumed from the CDC folder, the files are deleted, and the mutations are lost. These mutations would need to be moved into another data store if the use case mandates multiple views of these changes.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">This too is highly simplified and much easier to achieve on the Cosmos DB Cassandra API. Changes can be retrieved by specifying a chosen start time and only mutations from the specified start time will be returned by Cosmos DB. This provides application teams flexibility to iterate quickly without having to ensure that the handler is fine-tuned before use. 
Most importantly, the mutations live on for the lifetime of the table and do not get archived after they are consumed, facilitating repeated retrieval of the same changes.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint\">import java.time.LocalDateTime;\r\nimport java.time.format.DateTimeFormatter;\r\n\/\/ Retrieve changes starting from 6 hours 30 minutes ago.\r\nDateTimeFormatter dtf = DateTimeFormatter.ofPattern(\"yyyy-MM-dd HH:mm:ss\");\r\nLocalDateTime startTime = LocalDateTime.now().minusHours(6).minusMinutes(30);\r\nString query = \"SELECT * FROM uprofile.user where COSMOS_CHANGEFEED_START_TIME()='\" + dtf.format(startTime) + \"'\";<\/pre>\n<h3><span data-contrast=\"none\">Next<\/span> <span data-contrast=\"none\">Steps<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">To learn more, see <\/span><a href=\"https:\/\/aka.ms\/AAgajig\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Change feed in the Azure Cosmos DB API for Cassandra<\/span><\/a><span data-contrast=\"auto\">.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Stay tuned for another blog post on parallelizing the Change Feed handler and new additions to Change Feed for the Azure Cosmos DB API for Cassandra.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post details the several ways in which Change Feed in the Azure Cosmos DB API for Cassandra makes consumption of row mutations significantly easier to use with much more flexibility, than Apache Cassandra&#8217;s CDC (Change Data Capture) functionality. 
<\/p>\n","protected":false},"author":64774,"featured_media":1094,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[16,644],"tags":[],"class_list":["post-4084","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cassandra-api","category-change-feed"],"acf":[],"blog_post_summary":"<p>This blog post details the several ways in which Change Feed in the Azure Cosmos DB API for Cassandra makes consumption of row mutations significantly easier to use with much more flexibility, than Apache Cassandra&#8217;s CDC (Change Data Capture) functionality. <\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/4084","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/users\/64774"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/comments?post=4084"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/4084\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media\/1094"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media?parent=4084"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/categories?post=4084"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/tags?post=4084"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}