{"id":10574,"date":"2025-05-30T12:00:01","date_gmt":"2025-05-30T19:00:01","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cosmosdb\/?p=10574"},"modified":"2025-05-30T12:00:01","modified_gmt":"2025-05-30T19:00:01","slug":"powering-real-time-messaging-at-scale-with-azure-cosmos-db","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cosmosdb\/powering-real-time-messaging-at-scale-with-azure-cosmos-db\/","title":{"rendered":"Powering Real-Time Messaging at Scale with Azure Cosmos DB"},"content":{"rendered":"<p><span data-contrast=\"auto\">Microsoft Teams, Copilot, Azure Communication Services, and many other Microsoft products rely on a unified messaging platform that powers real-time communication and collaboration at an unprecedented scale. This messaging platform has become critical for enabling boundary-less collaboration, supporting hundreds of millions of users worldwide. To provide the global discovery, durable storage, and performance needed for real-time communication, the messaging platform relies on Azure Cosmos DB as one of its data stores. It has data distributed across most Azure regions, stores several petabytes of data, and performs trillions of database transactions per day to power mission-critical messaging scenarios. In this article, we share why we chose Azure Cosmos DB and some of the lessons we have learned from running it at scale.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span style=\"font-size: 18pt;\"><b>Why we chose Azure\u00a0Cosmos\u00a0DB<\/b>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">To support our mission of enabling real-time communication and collaboration for hundreds of millions of users globally, we needed a data store that could meet our stringent requirements. 
Some of the most critical requirements were:<\/span><span data-ccp-props=\"{&quot;134233117&quot;:true,&quot;134233118&quot;:true}\">\u00a0<\/span><\/p>\n<ul>\n<li><strong>Global distribution<\/strong> with seamless replication across regions in public and sovereign clouds<\/li>\n<li><strong>Fully managed<\/strong>\u00a0with automatic scale-out to reduce operational overhead<\/li>\n<li><strong>Multi-region reads and writes<\/strong> for effective global user and group discovery and routing<\/li>\n<li><strong>Built-in resiliency and automatic backups<\/strong> for better fault tolerance and disaster recovery<\/li>\n<li><strong>Ultra-low latency<\/strong> for both reads and writes to meet real-time needs<\/li>\n<li><strong>Planet-scale throughput<\/strong>\u00a0to handle massive and spiky traffic patterns<\/li>\n<\/ul>\n<p><span data-contrast=\"auto\">Azure Cosmos DB meets all of these needs and more. It powers several core components in our pipeline. We use partitioned collections to store user and group metadata and messages, denormalized to serve our queries efficiently. The change feed drives our downstream subscriber pipeline, ensuring reliable delivery and supporting fan-out to multiple processing layers. During the early days of the COVID-19 pandemic, our storage infrastructure scaled seamlessly to meet the sudden surge in traffic, ensuring uninterrupted service during a time of unprecedented digital demand. 
<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span style=\"font-size: 18pt;\"><b>Scaling Lessons &amp; Optimizations<\/b>\u00a0<\/span><\/p>\n<ol>\n<li><strong>Partitioning strategy<\/strong> &#8211; We&#8217;ve learned that thoughtful partition design is critical: suboptimal choices can lead to hot partitions, throttling, and degraded performance. To avoid cross-partition queries, we use fine-grained logical partitions. For example:\n<ol>\n<li style=\"list-style-type: none;\">\n<ul>\n<li>We use <strong>User IDs<\/strong>\u00a0and\u00a0<strong>Group IDs<\/strong>\u00a0as partition keys for storing metadata and messages, which provide sticky partitions ideal for user- and group-centric access patterns.<\/li>\n<li>For our delivery pipeline, we use\u00a0<strong>Event IDs<\/strong> to create non-sticky partitions that support high-throughput fan-out.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>To further optimize for diverse query patterns, we store some data in a denormalized and duplicated form across containers, each configured with a different partition key. This approach allows us to tailor data access for specific scenarios, such as message rendering, roster lookups, or user-centric left-rail rendering, while minimizing latency and avoiding expensive cross-partition operations.<\/p><\/li>\n<li><strong>Indexing policies<\/strong> &#8211; We apply tailored indexing strategies to support low-latency queries across various user experiences. To optimize both performance and storage efficiency, we disable the default indexing on all properties and selectively enable indexes only on fields required by our query patterns. 
Additionally, we leverage <strong>composite indexes<\/strong>\u00a0where appropriate, which significantly enhance query performance as data volume grows within partitions.<\/li>\n<li><strong>Multi-write support<\/strong> &#8211; The multi-write capability is crucial for applications that require low-latency reads and writes from multiple regions and geographies, all while operating on the same globally distributed data store. We use this capability to store user and group routing information for effective global routing, enabling users across the globe to instantly discover chats, meetings, and other groups and start messaging. When using multi-region writes, we recommend:\n<ul>\n<li>Giving the application regional affinity for writes as much as possible, and<\/li>\n<li>Avoiding rapid, repeated writes to the same documents, which complicate conflict resolution.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Resilience by design<\/strong> &#8211; Building applications that can withstand regional outages is essential for mission-critical systems. Azure Cosmos DB offers several building blocks that support this goal, including:\n<ul>\n<li>Automatic read hedging to route reads to healthy replicas even across regions<\/li>\n<li>Write redirection for multi-write accounts to alternate regions in case of failures<\/li>\n<li><strong>Per-partition automatic failover\u00a0<\/strong>(currently in preview) for more granular resilience<\/li>\n<\/ul>\n<p>By adopting these capabilities, your application can continue to operate reliably even during regional disruptions. However, it\u2019s important to design these mechanisms with your application&#8217;s consistency requirements in mind.<\/p><\/li>\n<li><strong>Autoscaling for spiky traffic<\/strong> &#8211; We observe noticeable traffic spikes at the top and bottom of each hour, along with uneven load distribution across geographies throughout the day. 
If your application experiences similar patterns, enabling\u00a0<strong>autoscaling<\/strong>\u2014especially\u00a0<strong>dynamic autoscaling<\/strong> (also known as per-region and per-partition autoscaling)\u2014can help manage capacity more efficiently and cost-effectively. This approach ensures that your system continues to serve requests reliably, while Azure Cosmos DB automatically scales resources up or down based on actual demand. Importantly, it targets only the specific regions and partitions that require scaling. After rolling out dynamic autoscaling across several microservices, we observed cost savings ranging from <strong>10% to 45%<\/strong>, depending on the workload.<\/li>\n<\/ol>\n<p><span data-contrast=\"auto\">Overall,\u00a0leveraging\u00a0<\/span><span data-contrast=\"auto\">Azure <\/span><span data-contrast=\"auto\">Cosmos\u00a0DB\u00a0has enabled us to deliver reliable, real-time messaging at a global scale, meeting the dynamic needs of users worldwide. As we continue to evolve, the lessons learned from\u00a0operating\u00a0at scale guide our ongoing optimizations for performance and resilience.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<h2 id=\"about-azure-cosmos-db\"><strong>About Azure Cosmos DB<\/strong><\/h2>\n<p>Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. 
With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.<\/p>\n<p>To stay in the loop on Azure Cosmos DB updates, follow us on\u00a0<a id=\"menurosb\" class=\"fui-Link ___1q1shib f2hkw1w f3rmtva f1ewtqcl fyind8e f1k6fduh f1w7gpdv fk6fouc fjoy568 figsok6 f1s184ao f1mk8lai fnbmjn9 f1o700av f13mvf36 f1cmlufx f9n3di6 f1ids18y f1tx3yz7 f1deo86v f1eh06m1 f1iescvh fhgqx19 f1olyrje f1p93eir f1nev41a f1h8hb77 f1lqvz6u f10aw75t fsle3fq f17ae5zn\" title=\"https:\/\/twitter.com\/azurecosmosdb\" href=\"https:\/\/twitter.com\/AzureCosmosDB\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Link X\">X<\/a>,\u00a0<a id=\"menurose\" class=\"fui-Link ___1q1shib f2hkw1w f3rmtva f1ewtqcl fyind8e f1k6fduh f1w7gpdv fk6fouc fjoy568 figsok6 f1s184ao f1mk8lai fnbmjn9 f1o700av f13mvf36 f1cmlufx f9n3di6 f1ids18y f1tx3yz7 f1deo86v f1eh06m1 f1iescvh fhgqx19 f1olyrje f1p93eir f1nev41a f1h8hb77 f1lqvz6u f10aw75t fsle3fq f17ae5zn\" title=\"https:\/\/aka.ms\/azurecosmosdbyoutube\" href=\"https:\/\/aka.ms\/AzureCosmosDBYouTube\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Link YouTube\">YouTube<\/a>, and\u00a0<a id=\"menurosh\" class=\"fui-Link ___1q1shib f2hkw1w f3rmtva f1ewtqcl fyind8e f1k6fduh f1w7gpdv fk6fouc fjoy568 figsok6 f1s184ao f1mk8lai fnbmjn9 f1o700av f13mvf36 f1cmlufx f9n3di6 f1ids18y f1tx3yz7 f1deo86v f1eh06m1 f1iescvh fhgqx19 f1olyrje f1p93eir f1nev41a f1h8hb77 f1lqvz6u f10aw75t fsle3fq f17ae5zn\" title=\"https:\/\/www.linkedin.com\/company\/azure-cosmos-db\/\" href=\"https:\/\/www.linkedin.com\/company\/azure-cosmos-db\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Link LinkedIn\">LinkedIn<\/a>.<\/p>\n<p>To quickly build your first database, watch our\u00a0<a href=\"https:\/\/youtube.com\/playlist?list=PLmamF3YkHLoLLGUtSoxmUkORcWaTyHlXp\" target=\"_blank\" 
rel=\"noopener\">Get Started videos<\/a>\u00a0on YouTube and explore ways to\u00a0<a href=\"https:\/\/docs.microsoft.com\/azure\/cosmos-db\/optimize-dev-test\" target=\"_blank\" rel=\"noopener\">dev\/test free.<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Microsoft Teams, Copilot, Azure Communication Services and many other product offerings from Microsoft, rely on a unified messaging platform that powers real-time communication and collaboration at an unprecedented scale. This messaging platform has become critical for enabling boundary-less collaboration, supporting hundreds of millions of users worldwide. To ensure global discovery, durable storage and performance needed [&hellip;]<\/p>\n","protected":false},"author":190886,"featured_media":10589,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[14],"tags":[499,1953,1952],"class_list":["post-10574","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-core-sql-api","tag-azure-cosmos-db","tag-messaging","tag-microsoft-teams"],"acf":[],"blog_post_summary":"<p>Microsoft Teams, Copilot, Azure Communication Services and many other product offerings from Microsoft, rely on a unified messaging platform that powers real-time communication and collaboration at an unprecedented scale. This messaging platform has become critical for enabling boundary-less collaboration, supporting hundreds of millions of users worldwide. 
To ensure global discovery, durable storage and performance needed [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/10574","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/users\/190886"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/comments?post=10574"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/posts\/10574\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media\/10589"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/media?parent=10574"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/categories?post=10574"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cosmosdb\/wp-json\/wp\/v2\/tags?post=10574"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}