January 6th, 2026

How Azure Cosmos DB Powers ARM’s Federated Future: Scaling for the Next Billion Requests

Alex Dubinkov
Principal Software Engineering Manager

The Cloud at Hyperscale: ARM’s Mission and Growth

Azure Resource Manager (ARM) is the backbone of Azure’s resource provisioning and management, orchestrating billions of daily requests from customers around the globe. ARM manages all resources for Azure: VMs, Storage, Databases, etc. As Azure’s reach expands and customer expectations rise, ARM’s architecture must not only keep pace—it must set the pace for cloud-scale reliability, agility, and innovation.

In recent years, ARM has seen its request volume surge at an exponential rate, reaching unprecedented levels that continually redefine the boundaries of cloud-scale operations. Meeting this demand requires more than incremental improvements; it calls for a fundamental reimagining of how ARM stores, replicates, and serves data at planetary scale.

The Challenge: Global Scale Meets Regional Demands

Operating at this scale brings several challenges:

  • High replication latency across distant regions
  • Single points of failure that could impact global availability
  • Difficulty scaling to meet explosive growth in request rates
  • Complexities in meeting regional compliance and data sovereignty requirements

The solution? A federated, regionally isolated architecture—powered by Azure Cosmos DB.

Enter Azure Cosmos DB: The Engine of Federated Evolution

Azure Cosmos DB is uniquely suited to meet ARM’s evolving needs. Its multi-region distribution, tunable consistency levels, seamless sharding, and elasticity make it the ideal foundation for a federated architecture.

  • Multi-region support: Azure Cosmos DB allows ARM to deploy data stores across strategic “hero” regions in a self-backup architecture, where each region serves as a backup for the others, ensuring high availability and disaster recovery.
  • Cross-region replication: Azure Cosmos DB’s built-in cross-region replication provides fault tolerance and failover in the event of regional outages, ensuring data remains available and consistent in backup regions.
  • Automatic failover (PPAF): Per-partition automatic failover ensures that even if a region or partition experiences issues, requests are seamlessly routed to healthy replicas in other regions.
  • Request hedging: ARM can route requests to multiple stores or regions, minimizing latency and avoiding bottlenecks.
  • Flexible sharding: Both horizontal (across accounts) and vertical (across containers) sharding allow ARM to scale out and fine-tune performance for different workloads.
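
To make request hedging concrete, here is a minimal sketch of the general technique, not ARM's routing code or the Azure Cosmos DB SDK's implementation. It assumes each region is represented by a hypothetical async callable; the function name `hedged_read` and the `hedge_delay` parameter are illustrative only. The first region gets the request immediately, and the next region is tried only if no answer arrives within the hedge delay or the earlier attempt fails.

```python
import asyncio


async def hedged_read(fetchers, hedge_delay):
    """Return the first successful result from a list of async region callables.

    fetchers[0] is started immediately. Each further fetcher is started when
    nothing has completed within hedge_delay seconds, or right away when an
    earlier attempt has failed.
    """
    remaining = list(fetchers)
    pending = set()
    errors = []
    try:
        while remaining or pending:
            if remaining:
                pending.add(asyncio.ensure_future(remaining.pop(0)()))
            # Only apply the hedge timeout while there is another region to try.
            timeout = hedge_delay if remaining else None
            done, pending = await asyncio.wait(
                pending, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
                errors.append(task.exception())
        raise RuntimeError(f"all regions failed: {errors}")
    finally:
        # Cancel any slower in-flight requests once a result is chosen.
        for task in pending:
            task.cancel()
```

The hedge delay is the main tuning knob in this sketch: a short delay lowers tail latency at the cost of extra requests, while a long delay behaves more like plain failover.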

ARM’s Unique Use of Azure Cosmos DB

At ARM’s scale, the challenge isn’t just storing data; it’s ensuring global consistency, minimizing replication latency, and reducing the blast radius of outages. To achieve this, ARM uses Azure Cosmos DB in ways that go far beyond typical customer scenarios, combining its native capabilities with custom-built solutions: inline durable replication (for intra-regional durability and follower container updates), advanced routing strategies, and follower containers partitioned for workload optimization. These custom additions allow ARM to:

  • Handle hyperscale workloads with predictable performance
  • Reduce the impact of regional failures through layered failover strategies
  • Optimize query and point-get operations across billions of resources

This unique approach underscores the flexibility of Azure Cosmos DB and the engineering ingenuity required to operate at Azure’s scale.

From Monolith to Federation: A New Architectural Paradigm

1. Regional Segregation & Global Federation

Each Azure region now maintains its own dedicated storage for resources, while a global store layer manages cross-region data and routing. This hybrid model delivers both regional autonomy and global consistency.

2. Sharding for Scale

To achieve massive scalability and performance, ARM employs a dual sharding strategy:

  • Horizontal sharding: Data is distributed across multiple Azure Cosmos DB accounts (stores) within a region or pool. ARM uses a consistent hashing algorithm on a routing key (such as subscription or tenant ID) to determine which store holds each piece of data. This approach allows ARM to scale out seamlessly by adding more stores as needed, balancing load, and minimizing the risk of “hot partitions” or bottlenecks.
  • Vertical sharding: Within each store, data is further partitioned into multiple containers, each optimized for a specific entity type or workload. This enables fine-tuning of throughput and partitioning strategies for different data shapes and access patterns. Vertical sharding also allows ARM to add containers and repartition data as requirements evolve, ensuring flexibility and efficiency at every layer.

This dual sharding approach empowers ARM to scale horizontally and vertically, supporting explosive growth and diverse workloads without sacrificing performance or manageability.
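
The routing step described above can be sketched as a consistent-hash ring. This is an illustration under stated assumptions rather than ARM's actual algorithm: the store names, the container names, the virtual-node count, and the helpers `StoreRing` and `route` are all hypothetical, and SHA-256 is simply a convenient stable hash.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class StoreRing:
    """Consistent-hash ring mapping routing keys to store (account) names."""

    def __init__(self, stores, vnodes=64):
        self._vnodes = vnodes
        self._points = []  # sorted (hash, store) pairs
        self._hashes = []  # hashes only, kept parallel to _points for bisect
        for store in stores:
            self.add_store(store)

    def add_store(self, store):
        # Each store owns several points on the ring to even out the load.
        for i in range(self._vnodes):
            self._points.append((_hash(f"{store}#{i}"), store))
        self._points.sort()
        self._hashes = [h for h, _ in self._points]

    def store_for(self, routing_key):
        # The first point clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._hashes, _hash(routing_key)) % len(self._hashes)
        return self._points[idx][1]


# Vertical sharding: one container per entity type (names are hypothetical).
CONTAINER_BY_ENTITY = {"resource": "resources", "deployment": "deployments"}


def route(ring, subscription_id, entity_type):
    """Return the (store, container) pair that should hold this item."""
    return ring.store_for(subscription_id), CONTAINER_BY_ENTITY[entity_type]
```

The property that matters for scaling out is that adding a store only moves the keys that now land on the new store's points; every other key keeps its existing store, so growth does not trigger a full reshuffle.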

Figure: Regional store unit

3. Robust Replication & Reliability

  • Inline Durable Replication (ARM Innovation): Within each region, ARM’s data layer implements inline durable replication—a custom-built mechanism that keeps primary and secondary stores in sync and ensures that follower containers (which may use different partitioning schemes) are always updated. This approach provides strong intra-regional durability and supports workload-optimized data access.
  • Azure Cosmos DB Cross-Regional Replication: For fault tolerance and disaster recovery, ARM leverages Azure Cosmos DB’s built-in cross-regional replication. This ensures that, in the event of a regional outage, data remains available and consistent in backup regions, supporting seamless failover and business continuity.
  • Background Repair: Any failed inline replications are handled by background processes, guaranteeing data durability and consistency.
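
As a rough mental model of how inline replication and background repair fit together, consider the toy sketch below. It is not ARM's implementation: the `Container` and `ReplicatedWriter` classes are hypothetical in-memory stand-ins, the `available` flag simulates a transient failure, and real systems also need versioning so that a late repair never overwrites a newer write.

```python
from collections import deque


class Container:
    """In-memory stand-in for a container with its own partition key."""

    def __init__(self, name, partition_key):
        self.name = name
        self.partition_key = partition_key
        self.items = {}        # (partition value, id) -> document
        self.available = True  # flip to False to simulate an outage

    def upsert(self, doc):
        if not self.available:
            raise ConnectionError(f"{self.name} unavailable")
        self.items[(doc[self.partition_key], doc["id"])] = doc


class ReplicatedWriter:
    """Writes to the primary, then inline to each replica or follower."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.repair_queue = deque()  # (replica, doc) pairs awaiting repair

    def write(self, doc):
        # A primary failure fails the whole write.
        self.primary.upsert(doc)
        for replica in self.replicas:
            try:
                replica.upsert(doc)
            except ConnectionError:
                # Inline replication failed: hand off to background repair.
                self.repair_queue.append((replica, doc))

    def repair_once(self):
        """Run one background repair pass; return how many items were repaired."""
        repaired = 0
        for _ in range(len(self.repair_queue)):
            replica, doc = self.repair_queue.popleft()
            try:
                replica.upsert(doc)
                repaired += 1
            except ConnectionError:
                self.repair_queue.append((replica, doc))  # retry next pass
        return repaired
```

Note that the follower in this sketch can use a different partition key from the primary, which mirrors the idea of follower containers partitioned for a different access pattern.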

4. Regional Isolation for Performance and Compliance

With regional isolation, reads and writes are served locally whenever possible, minimizing latency and supporting data sovereignty. In the event of a regional outage, traffic managers and circuit breakers ensure seamless failover to backup regions.
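
The circuit-breaker side of this failover can be sketched as follows. This is a generic illustration of the pattern, not ARM's traffic management: the thresholds, the region names, and the helpers `CircuitBreaker` and `call_with_failover` are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures, then allows a probe after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let a probe request through once the cool-down has passed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_failover(regions, breakers, operation):
    """Try regions in preference order, skipping any whose breaker is open."""
    last_error = None
    for region in regions:
        breaker = breakers[region]
        if not breaker.allow():
            continue
        try:
            result = operation(region)
        except ConnectionError as exc:
            breaker.record_failure()
            last_error = exc
            continue
        breaker.record_success()
        return region, result
    raise RuntimeError("no healthy region available") from last_error
```

Listing the local region first keeps reads and writes local in the healthy case, and the breaker stops an unhealthy region from adding a failed attempt to every request while it recovers.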

Value Delivered: Scalability, Reliability, Performance, and Cost

  • Scalability: ARM can now handle surging request rates with ease, scaling horizontally and vertically as demand grows.
  • Reliability: Multiple data copies, custom intra-regional replication and Azure Cosmos DB’s cross-regional failover mechanisms (using PPAF) ensure business continuity—even during outages or network partitions.
  • Performance: Localized data access means faster, more predictable operations for customers worldwide.
  • Security & Compliance: Data remains within regional boundaries, supporting strict compliance and sovereignty requirements.
  • Cost Optimization: By leveraging Azure Cosmos DB’s Per-Partition Per-Region Dynamic Autoscale, ARM achieved a 75% cost reduction in recent months—demonstrating that hyperscale reliability and performance can go hand-in-hand with operational efficiency.

Voices from the Front Lines

“With Azure Cosmos DB’s federated architecture, we’ve reduced replication latency and improved reliability for millions of customers worldwide.”

— ARM Engineering Team

“The ability to scale out by simply adding more stores or containers means we’re always ready for the next wave of growth.”

— ARM Product Management

The Road Ahead: Toward Full Regional Isolation

ARM’s journey toward full regional isolation is more than a technical upgrade—it’s a leap toward a more resilient, scalable, and customer-centric Azure. As we continue to innovate atop Azure Cosmos DB, we’re building the foundation for the next generation of cloud applications—where performance, reliability, and compliance are never compromised.

Get Involved / Learn More

Curious how Azure Cosmos DB enables globally distributed, high-throughput, and resilient architectures at scale? Explore the Azure Cosmos DB documentation and engineering blogs to learn how features like multi-region distribution, flexible consistency models, autoscale throughput, and partition-level failover can help you design cloud-scale systems.

Whether you’re building globally available applications or modernizing large, distributed platforms, Azure Cosmos DB provides foundational primitives to support growth, resiliency, and operational simplicity.

About Azure Cosmos DB

Azure Cosmos DB is a fully managed and serverless NoSQL and vector database for modern app development, including AI applications. With its SLA-backed speed and availability as well as instant dynamic scalability, it is ideal for real-time NoSQL and MongoDB applications that require high performance and distributed computing over massive volumes of NoSQL and vector data.

To stay in the loop on Azure Cosmos DB updates, follow us on X, YouTube, and LinkedIn.

Author

Alex Dubinkov is a Principal Software Engineering Manager on the Azure Resource Manager team at Microsoft, leading ARM’s federated, multi-stamp architecture to support hyperscale throughput, resiliency, and regional isolation across regional and global stores.
