Large-scale distributed builds with Microsoft Build Accelerator
Microsoft has many products and a vast amount of source code to compile and build. For some of the larger products, performing clean single-computer builds has grown to be impractical for the pace of change we want to allow for our developers. In the early 2010’s we started applying distributed computing to builds. This effort started with various medium-sized codebases of Microsoft’s online services with CloudBuild.
CloudBuild focused not only on distributing builds, but also on adding reliability and trustworthy incrementality via caching. Incrementality refers to allowing builds to reuse the results of previous builds for parts of a codebase that had not changed. Incrementality was nothing revolutionary at the time, but it was often untrusted prior to CloudBuild due to discrepancies with build systems’ understanding of which parts of the codebase mapped to potentially reusable outputs of previous builds. The result would be that parts of a codebase wouldn’t get rebuilt even though they should have based on code changes. This led to the behavior of not trusting incremental builds to the point that all official builds were clean.
CloudBuild’s implementation of incrementality was trustworthy because it utilized runtime file access monitoring to make sure that no files other than those that were predicted for cache lookup were consumed. If an extra file were read, runtime enforcement would fail the build. It also allowed other protections, like breaking the build if different versions of a file were written to the same path in an uncoordinated way. These protections ensured that the result of the build was deterministic and caching was trustworthy, even when distributed across machines which changed timing. In fact, even single machine builds were more reliable when using the new build toolchain because implicit dependencies which usually worked due to lucky timing became build breaks until explicit dependencies were added.
BuildXL and an evolution of our caching algorithm
Microsoft Build Accelerator, or BuildXL, is an evolution of distributed builds for Microsoft’s largest codebases. It runs on the CloudBuild infrastructure as well as on single machine builds. It builds on the same principles of reliable distribution and caching verified by runtime file access observation. It increases scalability for larger and more detailed build graphs on the order of millions of process invocations. This allows for smaller units of execution and thus more granular incrementality. BuildXL has been an integral piece in allowing CloudBuild to now scale to Microsoft’s largest codebases.
BuildXL introduced an evolution of the caching algorithm. In the original version of the acceleration engine used in CloudBuild, a set of input files was predicted based on statically analyzing the codebase. Any inputs that could not be predicted needed to be manually added by the developer as annotations to build specifications. The cache lookup algorithm would then hash those files, create a fingerprint, and the retrieve the result from the cache.
This approach worked well in many cases but faced challenges in others, such as header directories in large C++ codebases. There, the compiler was presented with an include directory containing hundreds or more header files. The compiler then read a subset of those files based on what was referenced in the code it compiles. The original cache lookup algorithm required either maintaining precise specification of those header inputs as code changed or overapproximating the inputs specifying all headers in the include directory. Over approximation would have resulted in overly sensitive caching and thus a low cache hit rate. Maintaining precise annotations to describe the dependencies would have been burdensome on developers. A natural solution for that would be to write tooling to determine which headers are consumed based on source code. But that turns out to be a challenging problem because this pattern doesn’t only exist with the C++ compiler. Many other tools being executed in builds have similar challenges for predicting precise dependencies.
BuildXL took the novel approach of encoding this input uncertainty into the cache algorithm itself with a “two-phase” cache lookup. Instead of requiring the precise list of all files that would be consumed as inputs up front, the algorithm allows for both a static set of inputs processed in the first phase of the lookup and a dynamic set of inputs automatically discovered at build time in the second phase. This ensures that all consumed inputs are included in the cache fingerprint even if they could not be statically predicted, thus maintaining the correctness of reusing cached results. It also prevents the need for burdensome specification of a precise set of inputs. We’ll do a future blog post to describe this algorithm and its intricacies in more detail.
BuildXL and CloudBuild have allowed us to build codebases in minutes that previously took over 24 hours. They have significantly improved the reliability of those builds. Build time file access monitoring has given us the confidence to ship products from incremental builds. We continue to invest in this and other build tooling to make our engineering more productive and efficient. You can check out BuildXL on GitHub.