Background
Before the emergence of cloud computing, many teams across Microsoft had functional verification systems that were tied to aging, unreliable and unscalable lab infrastructure. Due to unique requirements in terms of validation scenarios and scale, these teams could not use any other existing solutions. This resulted in high human and machine cost and slowed down engineering productivity. Engineers ended up cutting quality corners while trying to bypass this infrastructure.
We needed a fast and reliable verification infrastructure that could leverage cloud scale to make our engineers more productive. As teams had diverse requirements with respect to test configuration, deployment, and test frameworks, we needed this verification system to be extensible at the right places, e.g., the ability to leverage existing test collateral when moving to the cloud.
To address this need, in 2014 we built CloudTest, a multi-tenant, scalable, performant and extensible One Engineering System (1ES) verification service that provides test submission, scheduling, resourcing, execution, and result reporting capabilities. It’s currently used by more than 10,000 Microsoft developers and runs validations on more than one million Azure VMs every day.
Built from the ground up in the cloud, CloudTest brings the agility of the cloud to functional and integration testing. It runs tests outside the context of a build, provides incremental execution, distributes tests across a large set of machines for maximum performance, and aggregates the results.
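To make the "distribute, then aggregate" idea concrete, here is a minimal sketch in Python. It is not CloudTest's implementation: local threads stand in for machines, and the names `shard`, `run_distributed`, and `summarize` are hypothetical helpers invented for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def shard(tests, num_workers):
    """Split the test list round-robin into num_workers groups."""
    return [tests[i::num_workers] for i in range(num_workers)]


def run_shard(tests, run_test):
    """Run one shard sequentially and return {test_name: outcome}."""
    return {name: run_test(name) for name in tests}


def run_distributed(tests, run_test, num_workers=4):
    """Run all shards in parallel and merge the per-shard results.

    Assumption: threads stand in for separate machines here.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        shards = shard(tests, num_workers)
        for partial in pool.map(lambda s: run_shard(s, run_test), shards):
            results.update(partial)
    return results


def summarize(results):
    """Aggregate outcomes into counts, e.g. how many passed or failed."""
    return Counter(results.values())
```

Calling `run_distributed(test_names, run_test)` with any callable that maps a test name to an outcome string returns one merged result dictionary, which `summarize` reduces to per-outcome counts.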
With a well-defined extensibility model, we can extend CloudTest to meet the diverse needs of Microsoft teams. It is used for new testing as well as migrating from legacy systems. Test definitions are version controlled with the source code. It supports a wide variety of test runners (MSTest, NUnit, Boost, TAEF, Exe) and is extensible to add support for many more.
In this post, I’ll discuss key characteristics of the CloudTest infrastructure. We believe similar characteristics should be considered in all large-scale test infrastructures to improve engineers’ productivity and help engineers to ship high-quality software.
Key Characteristics
We categorized the key characteristics of CloudTest into three areas: ease of use, feedback on test content, and fundamentals.
Ease of use
Developers shouldn’t need to spend a lot of time on onboarding, setting up, or maintaining their test environments. In this section, we’ll go over some of the main features that CloudTest provides to Microsoft developers to help them be more agile.
- Ease of onboarding: CloudTest has an Azure portal plugin that internal Microsoft developers can use to onboard. The onboarding experience is similar to creating any other Azure resource.
- Image creation and maintenance: Teams across Microsoft may require different OS versions or applications to be available to run their tests. CloudTest lets customers choose which image is used in their VM pools, and provides an automated image management system in which customers define their requirements and base image.
- Ease of authoring: All the CloudTest configurations reside in the teams’ repos next to their code. This allows validations to differ from branch to branch, and developers can try their tests before checking in the change to the main branch.
- Azure DevOps result reporting and logs: CloudTest has an Azure DevOps plugin that customers can use to view and debug their test results. The experience is close to debugging any other failure that happens on CI/CD pipelines on Azure DevOps; we don’t want developers to have to learn another tool for their validations.
- Azure DevOps Pipeline integration: CloudTest provides tasks that can be used in any CI/CD pipeline on Azure DevOps, as different teams may need to run CloudTest tasks in different stages of their pipelines.
- Hold VMs for debugging: In addition to providing logs, customers can hold any VMs through Azure DevOps UI for debugging.
Feedback on Test Content
CloudTest provides several features that help developers understand how well their tests are executing.
- Code coverage: Customers can turn on code coverage for their tests on Azure DevOps. CloudTest uses public and internal tools for collecting code coverage data.
- Flaky tests: Based on multiple executions of a single test, CloudTest determines if a test is flaky or not, and teams can set different policies on how to react to these flaky tests.
- Test execution telemetry: CloudTest collects test execution data (e.g., result, time to set up, time to execute, log size) and makes it available to all developers. Teams can set up their own dashboards on their test execution data and work on any improvements they deem necessary.
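As an illustration of the flaky-test idea above, the sketch below classifies a test from repeated executions on identical code. It assumes the simplest possible definition (a test is flaky if it both passes and fails across those runs); the post does not describe CloudTest's actual criteria, and `classify`, `blocks_pipeline`, and the `fail_on_flaky` policy flag are hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    PASSED = "passed"
    FAILED = "failed"
    FLAKY = "flaky"


def classify(outcomes):
    """Classify one test from repeated runs on the same code.

    outcomes: list of booleans, True meaning the run passed.
    Assumption: mixed pass/fail outcomes mean the test is flaky.
    """
    if not outcomes:
        raise ValueError("need at least one execution")
    if all(outcomes):
        return Verdict.PASSED
    if not any(outcomes):
        return Verdict.FAILED
    return Verdict.FLAKY


def blocks_pipeline(verdict, fail_on_flaky=False):
    """Hypothetical team policy: failures always block; flaky tests
    block only when the team opts in with fail_on_flaky."""
    if verdict is Verdict.FAILED:
        return True
    return verdict is Verdict.FLAKY and fail_on_flaky
```

A team that tolerates flakiness would keep the default policy, while a stricter team would pass `fail_on_flaky=True` so that a flaky verdict fails the run.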
Fundamentals
To help Microsoft teams to be more agile, each service in the CI/CD pipeline needs to be performant and reliable. Here are some of the areas that the CloudTest team has focused on in the last six years:
- Scalability: As more teams onboarded to CloudTest, we needed to make sure the service could handle the scale. We moved to Azure Kubernetes and made some fundamental changes to ensure the service is reliable and resilient against any internal or external failure.
- Multi-tenancy: We’ve made multiple changes to our services to make sure an issue affecting one team cannot affect other teams. Each team’s tests run on separate VM pools, and the pools are isolated from each other.
- Cost-effectiveness: The major cost of running tests is the compute cost on the VMs. To improve productivity, we don’t want tests to wait for VMs to become available. On the other hand, if CloudTest over-provisions VMs, this can increase cost. We’ve added support for dynamic pre-provisioning of VMs to optimize cost.
- Result caching: The intuition behind result caching is that not all the tests need to be executed for every change. We collect the file dependencies of each test and compute a single hash over the content hashes of all those files. If a result is already stored for that hash, we reuse it and skip executing the test. More than 20% of the tests can be cached by using this approach.
- Clean environments: We want to make sure each test runs on a clean VM so no artifacts from previous test executions can affect the result of the current test. For this purpose, CloudTest always either deletes and recreates the VM or reimages it after each job.
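The result-caching bullet above can be sketched as follows. This is an illustrative approximation rather than CloudTest's code: it assumes SHA-256 as the hash function, an in-memory dictionary as the cache, and hypothetical helper names (`content_hash`, `dependency_key`, `run_with_cache`).

```python
import hashlib
from pathlib import Path


def content_hash(path):
    """Hash the bytes of a single dependency file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def dependency_key(paths):
    """Combine the content hashes of all dependencies into one hash.

    Paths are sorted so the key does not depend on listing order.
    """
    digest = hashlib.sha256()
    for path in sorted(str(p) for p in paths):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(content_hash(path).encode("ascii"))
        digest.update(b"\n")
    return digest.hexdigest()


def run_with_cache(test_name, dependencies, run_test, cache):
    """Return (result, was_cached).

    The test runs only if no result is stored for the current
    dependency hash. Assumption: cache is a plain dict.
    """
    key = (test_name, dependency_key(dependencies))
    if key in cache:
        return cache[key], True
    result = run_test()
    cache[key] = result
    return result, False
```

With this scheme, a change that touches none of a test's dependency files produces the same key, so the stored result is reused; editing any dependency changes the key and forces a fresh execution.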
CloudTest Architecture
Figure: CloudTest high-level architecture and its interactions with other Azure services
CloudTest is built on Azure using modern cloud services:
- Azure Kubernetes as the platform to run its middle-tier services
- Cosmos DB and Azure Blob Storage for storing recent results and internal state
- Azure Data Explorer and Azure Data Lake for storing telemetry and test results
- REST APIs for communication between services
- Azure Queue Storage and Azure Service Bus for event-based service communication
Summary
CloudTest is a multi-tenant, scalable, performant and extensible 1ES verification service that is used by many teams across Microsoft. It allows Microsoft developers to validate their products faster and more reliably before shipping any code. We will continue to invest in it and bring more value to all Microsoft developers.
Sounds interesting. Is there documentation or a user guide that explains how to use CloudTest?
Also, how is this different from the normal VS test task we use in CI/CD pipelines?
CloudTest is currently an internal tool for Microsoft engineering teams, and the internal guide/documentation for this service is located at aka.ms/cug. It differs from the normal VS test task in many ways. The main differences that I think should be emphasized from the above post are scalability, easier setup for functional/integration testing, and more features. Scalability gives our engineering teams the ability to run multiple high-value tests across a range of environments/setups all at once and get results back much faster. Our setup enables engineering teams to customize their environment (whether it be image, setup scripts to run,...