Running 1M databases on Azure SQL for a large SaaS provider: Microsoft Dynamics 365 and Power Platform.
Thriving on Azure SQL
Many thanks to our colleagues Mahesh Sreenivas, Pranab Mazumdar, Karthick Pakirisamy Krishnamurthy, Mayank Mehta and Shovan Kar from Microsoft Dynamics team for their contribution to this article.
Dynamics 365 is a set of intelligent SaaS business applications that helps companies of all sizes, from all industries, to run their entire business and deliver greater results through predictive, AI-driven insights. Dynamics 365 applications are built on top of a Microsoft Power Platform that provides a scalable foundation for running not just Dynamics apps themselves, but also to let customers, system integrators and ISVs to create customizations for specific verticals, and connect their business processes to other solutions and systems leveraging hundreds of pre-existing connectors with a no-code / low-code approach.
Microsoft Power Platform (also offering services like Power Apps, Power Automate, Power Virtual Agents and PowerBI) has been built on top of Microsoft Azure services, like Azure SQL Database, which offer scalable and reliable underlying compute, data, management and security services that power the entire stack represented in the picture above.
A bit of history
Microsoft Dynamics 365 has its roots in a suite of packaged business solutions, like Dynamics AX, NAV and CRM, running on several releases of Windows Server and SQL Server on customers’ datacenters around the world.
When Software as a Service paradigm emerged to dominate business applications’ industry, Dynamics CRM led the pack becoming one of the first Microsoft’s online services. At the beginning of the SaaS journey, Dynamics services ran on of bare-metal servers in Microsoft’s on-premises datacenters. With usage growing every day, into millions of active users, the effort required to deploy and operate all those servers, manage capacity demands, and respond promptly to issues of continuously growing customer data volumes (with database size distribution from 100 MB up to more than 4 TB for the largest tenants) would eventually become unmanageable.
Dynamics was one of the first adopters of Microsoft SQL Server 2012 AlwaysOn to achieve business continuity, but also to provide a flexible way to move customer databases to new clusters by creating additional replicas to rebalance utilization.
Managing so many databases at scale was clearly a complex task, involving the entire database lifecycle from initial provisioning to monitoring, patching and operating this large fleet while guaranteeing availability, and team learned how to deal with issues like quorum losses and replicas not in a failover-ready state. From a performance perspective, database instances running on shared underlying nodes, made it hard to isolate performance issues and provided limited options to scale up or accommodate workload burst other than moving individual instances to new nodes.
As end customers can run multiple versions of (highly-customized) applications in their environments, generating significantly different workloads, it is no surprise to hear from Mahesh Sreenivas, Partner Group Engineering Manager on the Common Data Service team, that to manage and maintain all this on a traditional platform was “painful for both engineers and end customers”.
The move to Azure and Azure SQL Database
Dynamics 365 team decided to move their platform to Microsoft Azure to solve these management and operational challenges while meeting customer requirements, ensuring platform fundamentals like availability and reliability, and letting the engineering team to focus on adding innovative new features.
Engineering effort from initial designs to production took a couple of years, including migration of the customers to the new Azure based platform, moving from a monolithic code base running on-premises into a world-class planet scale service running on Azure.
Common Data Service on Azure SQL Database
The first key decision was to transition from a suite of heterogeneous applications, each one with its own history and technical peculiarities, to a common underlying platform where Dynamics applications were going to be regular applications just like what other ISV companies could build and run: hence Microsoft Power Platform and its Common Data Service layer was introduced. Essentially, a new no-code, low-code platform built on top of underlying Azure capabilities like Compute, Storage, Network and other specialized services like Azure SQL Database was a way to abstract Dynamics applications from underlying platform, letting Dynamics developers to focus on transitioning to the platform without managing individual resources like database instances.
The same platform today is also hosting other services like PowerApps, Power Automate, Power Virtual Agents or PowerBI, and it is available for other companies to build their own SaaS applications on top of, from no-code simple solutions to full-code specialized ISV apps that don’t need to worry about how to manage underlying resources like compute and various storage facilities.
By moving to Azure a platform that is managing around 1M database instances (as of July 2020), Dynamics team learned a lot about how underlying services are working, but also provided enormous feedback to other Microsoft teams to make their services better in a mutually beneficial relationship.
From an architectural perspective, Common Data Service is organized in logical stamps (or scale groups) that have two tiers, compute and data, where the relational data store is built on Azure SQL Database given team’s previous familiarity with SQL Server 2012 and 2016 on premises. This provides out of the box, pre-configured high availability with a 3 (or more) nines SLA, depending on selected service tiers. Business Continuity is also guaranteed through features like Geo-restore, Active Geo-replication and Auto Failover Groups.
Azure SQL Database also helped the team significantly by reducing database corruption episodes at table, index or backup level compared to running many databases on a shared, single SQL Server instance on-premises. Similarly, the several man hours required for patching at the firmware, operating system and SQL Server on physical machines have been reduced to only managing the application layer and its data.
Azure SQL Database Elastic Pools
Once landed on Azure SQL Database, the second key decision was to adopt Elastic Pools to host their database fleet. Azure SQL Database Elastic Pools are a simple and cost-effective solution for managing and scaling multiple databases that may have varying and unpredictable usage demands. Databases in an elastic pool are on a single logical server, and share a given amount of resources at a set price. SaaS application developers can optimize the price-performance for hundreds of databases within a prescribed resource budget, while delivering performance elasticity for each database and control multi-tenancy by setting min-max utilization thresholds for each tenant. At the same time, they enforce security and isolation by providing separate access control policies for each database. “By moving to Azure SQL Database Elastic Pools, our team doesn’t need to over-invest in this aspect of managing replication because it’s handled by the Azure SQL Database service,” explains Mahesh.
Microsoft Power Platform is using a separate for each tenant using a given service within its portfolio.
The “Spartan” resource management layer
Given the wide spectrum of customer industries, sizes and (customized) workloads, one of the key requirements is the ability to allocate and manage these databases across a fleet of elastic pools efficiently, maximizing resource utilization and management. To achieve this goal, 3 aspects are critical to master:
- Flexibility in sizing and capacity planning
- Agility in scaling resources for individual tenants
- Optimized price-performance
While Azure SQL Database platform provides the foundations to fully manage these aspects (for example, online service tier scaling, ability to move databases between pools, ability to move from single databases to pools, and viceversa, multitude of price-performance choices, etc.), Dynamics team created a dedicated management layer to automate these operations based on application-defined conditions and policies. “Spartan” is that management layer, designed to reduce the amount of manual efforts to the bare minimum, it has a scalable micro-services platform (implemented as an ARM Resource Provider) which is taking care the entire lifecycle of their database fleet.
Spartan is an API layer taking care of database CRUD operations (create, read, update, delete), but also all other operations like moving a database between elastic pools to maximize resource utilization, load balancing customer workloads across pools and logical servers, managing backup retention and restoring databases to a previous point in time. Underlying storage assigned to databases and pools is also managed automatically by the platform to avoid inefficiencies and maximize density. What may appear as a rare operation like shrinking a database in production, is a common task for a platform that needs to operate more than 1M databases and optimize costs.
Elastic Pools are organized in “tiers”, where each tier provides different configurations in terms of underlying service tiers utilized (General Purpose, Business Critical) and compute size allocated so that end customer databases will always run at the optimal price-performance level. Each tier also controls min-max settings for associated databases and define optimal density of databases per pool, in addition to other details like firewall settings and others.
Figure 1 Azure SQL Database layout per Scale Group
The picture above represents this logical organization of Elastic Pools into tiers and shows the combination of DTU and vCore purchasing models that Dynamics team is using to find the best tradeoff between granular resource allocation and cost optimization.
For very large customers, the platform can also move these databases out of a shared pool into a dedicated Azure SQL Single Database with the ability to scale to the largest compute sizes (e.g. M-Series with up to 128 vCore instances).
The level of efficiency of this platform is incredible, if you think that Dynamics team is managing the entire platform with 2 dedicated engineers, that are focused on operating and improving the Spartan platform rather than managing individual databases.
Dynamics 365 and Azure SQL Database: better together!
As mentioned, engineers from Dynamics and Azure teams have worked hard together to improve underlying platforms during this journey. Some platform-wide improvements like Accelerated Networking, introduced to significantly reduce network latency between compute nodes and other Azure services, has been heavily advocated by Dynamics team who highly benefitted from this for their data-centric, data-intensive applications.
In Azure SQL Database, Dynamics team has influenced the introduction of the vCore model to find the right ratio between compute cores and database storage that now can scale independently and optimize costs and performance.
To help get even more out of the relational database resources in Common Data Service, the team implemented Read Scale-Out, which helps boost performance by offloading part of the workload placed on primary read-write database replicas. Like most data-centric services, the workload in Common Data Service is read-heavy—meaning that it gets a lot more reads than writes. And with Read Scale-out, if a call into Common Data Service is going to result in a select versus an update, we can route that request to a read-only replica and scale-out the read workload.
Given the vast variety of customer workloads and scale, Dynamics 365 apps have been also a great sparring partner to tune and improve features like Automatic Tuning with automatic plan correction, automatic index management and intelligent query processing set of features.
Imagine having queries timing out in a 1 million database fleet: is it for lack of the right indexing strategy? A query plan regression? Something else?
To help Dynamics engineers and support organization during troubleshooting and maintenance events, another micro-service called Data Administration and Management Service (DAMS) has been developed to schedule and execute maintenance tasks and jobs like creating or rebuilding indexes to dynamically optimize changes in customer workloads. These tasks can span areas like performance improvements, transaction management, diagnostic data collection and query plan management.
Figure 2 DAMS Architecture
With help from Microsoft Research (MSR), Dynamics team has ported SQL Server’s Database Tuning Advisor (DTA) to Azure SQL and integrated it into the microservice. DTA is a platform for evaluating queries and generating index and statistic recommendations to tune query performance of the most critical database workloads.
Like any other customer database in Azure SQL Database, Dynamics 365 databases have features like Query Store turned on by default. This feature provides insights on query plan choices and performance and simplifies performance troubleshooting by helping them in quickly find performance differences caused by query plan changes.
On top of these capabilities, Dynamics team also created an optimization tool that they share with their end customers to validate if their own customizations are implemented properly, detects things like how many data controls are placed in their visualization forms and provide recommendations that adhere to their best practices.
They also proactively monitor customer workloads to understand critical use cases and detect new patterns that customers may introduce, and make sure that the platform can run them efficiently.
Working side by side with Azure SQL Database engineers, the Dynamic team helped improving many areas of the database engine. One example is related to extremely large query plan caches (100k+ plans), a common problem for complex OLTP workloads, where spinlock contention on plan recompilations was creating high CPU utilization and inefficiencies. Solving this issue helped thousands of other Azure SQL Database customers running on the same platform.
Other areas they helped to improve were Constant Time Recovery, making failover process much more efficient for millions of databases, or managed lock priority for reducing blocking during automatic index creation.
In addition to what Azure SQL Database provides out of the box, Dynamics team also developed specific troubleshooting workflows for customer escalating performance issues. As an example, Dynamics support engineers can run Database Tuning Advisor on a problematic customer workload and understand specific recommendations that could be applied to mitigate customer issues.
A look into the future
Dynamics 365 was one of the prime influencers for increasing Azure SQL Database max instance size from 1TB to 4TB, given the scale of some of the largest end customers. That said, even 4TB now is a limit in their ability to scale, so Dynamics team is looking into Azure SQL Database Hyperscale as their next-gen relational storing tier for their service. Virtually unlimited database size, in conjunction with the separation between compute and storage sizing and the ability to leverage read replicas to scale out customer workloads are the most critical characteristics that the team is looking into.
Dynamics team is working side by side with Azure SQL team in testing and validating Azure SQL Database Hyperscale on all the challenging scenarios mentioned previously, and this collaboration will continue to be successful not only for the two teams respectively, but also for all the other customers running on the platform.