Beyond GVFS: more details on optimizing Git for large repositories

Over the last few years, Microsoft has been moving the entire company to a modern engineering system built on Visual Studio Team Services and using Git as our version control system. For many of the projects within Microsoft, this is no problem since, as the Git homepage tells us:

Git was built to work on the Linux kernel, meaning that it has had to effectively handle large repositories from day one.

And it’s true that Git does deal very effectively with the Linux kernel, which is indeed quite large for an open source project. It contains 60,000 files in HEAD and its history spans 12 years.

But as far as enterprise projects go, 60,000 files isn’t really that many.

The repository that I work in contains the Visual Studio Team Services code base and is just a bit over twice as large, with 125,000 files. And at Microsoft, this is simply “medium sized”. When we talk about a large source tree, we talk about one thing: Windows, which weighs in at a whopping 3.5 million files. That’s the source, tests and build tools needed to create a build of Windows and create an ISO of the entire operating system.

This sounds crazy at first but it really shouldn’t be too surprising. When you think about the Linux kernel, those 60,000 files go towards building just a kernel (and its modules) that has to fit into RAM on your machine. My kernel – a stock Ubuntu 16.04 image – is 7 megabytes, and comes along with another 200 megabytes of modules.

Linus has joked that the kernel has become “bloated and huge”. Certainly this 200 megabytes is much, much larger than it was in the early 90’s, when a kernel had to fit on a floppy disk and had to power a machine with only 4 megabytes of RAM, all while leaving enough room to actually run the system.

But Windows 10? That’s a 4GB ISO image.

Since all of Windows – the kernel, the libraries, the applications – is released together, it’s also versioned together, in a large “monorepo”. When planning the Windows move to Git, we looked at breaking the codebase down into many smaller repositories and layering them together with Git submodules or a system like Android’s repo. But sometimes monorepos are just the easiest way to collaborate.

@xjoeduffyx Even if componentization worked, I’d still have gone for monorepo. Collaboration benefits are paramount: https://t.co/xt03PCGh3D

— Joe Duffy (@xjoeduffyx) February 3, 2017

Unfortunately, one of the problems with a giant monorepo like Windows is that Git has not historically coped very well with a repository that size.

GVFS

Over the last few years, we’ve been working on adapting Git to scale to handle truly large monorepos like the Windows repository. The biggest part of this work – by far – is GVFS, the Git Virtual Filesystem. GVFS allows our developers to avoid downloading most of those 3.5 million files during a clone, and instead page in the (comparatively) small portion of the source tree that a developer is working with.

Saeed Noursalehi is writing a series of articles about GVFS, and how it allows us to scale Git. It’s incredibly advanced work and absolutely necessary to be able to work with a source tree the size of Windows, but it’s not the only work we’ve had to do to handle the Windows repository. Putting this many files in a single repository stresses a lot of Git’s data structures and storage mechanisms, even when not all the files are actually present in the working directory.

While GVFS is an important solution for giant repositories like the Windows repository, this additional work we’ve done will help regular Git users with more standard repository sizes.

The Index

The index – also known as the “staging area” or the “cache” – is one of the core data structures of the Git repository. It contains a list of every file in the repository, and it’s consulted on almost every operation that touches the working directory. The index is populated with a list of paths being checked out when you clone a repository and when you switch branches. It’s examined when you run status to decide which files are staged and modified. And when you do a merge, the new tree (and all of the conflicts) are stored in the index.

Since it’s used for so many operations, accessing the index has to be fast, even when it contains 3.5 million files. Part of the way Git keeps index accesses fast is by keeping the list of paths sorted so that you can just binary search through them to find what you’re looking for.

But there’s overhead in keeping this list sorted. One of the first pain points we noticed in our large repository was switching branches: this common operation could take anywhere from 30 seconds to a minute and a half. Obviously the mere act of putting files on disk is the slowest part of a checkout, but when we dug a little bit deeper, we were surprised to find out that we were also spending a lot of time creating the new index so that it contained the files list in the new branch. For each file that we were inserting into the index, we would try to figure out where we should insert it. This meant a binary search through the index to find the position of the new path.

Logical enough, except that the list of files we were inserting was itself already sorted. So we were busy doing an O(log n) lookup on each path… just to discover that we needed to append that path at the end of the index. So we changed that to skip the binary lookup and just do the append instead.

That seemingly little optimization shaved 15-20% of the time off of a git checkout invocation. It turns out that O(n log n) gets rather slow when n is 3.5 million files.
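The append fast path described above can be sketched in a few lines. This is only an illustration, not Git's actual C code: the function names `insert_sorted` and `build_index` are hypothetical, the index is modeled as a plain Python list of path strings, and ordinary string comparison stands in for Git's path ordering.

```python
import bisect


def insert_sorted(index, path):
    """Insert `path` into the sorted list `index`, keeping it sorted.

    Hypothetical helper: models the index as a plain list of path strings.
    """
    # Fast path: the new path sorts after the current last entry,
    # so no search is needed and we can simply append.
    if not index or path > index[-1]:
        index.append(path)
        return
    # Slow path: binary search for the insertion position.
    pos = bisect.bisect_left(index, path)
    if pos < len(index) and index[pos] == path:
        return  # path is already present
    index.insert(pos, path)


def build_index(paths):
    """Build a sorted index from an iterable of paths."""
    index = []
    for path in paths:
        insert_sorted(index, path)
    return index
```

When the input is already sorted, as it is during a checkout, every call takes the fast path and the binary search never runs.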

While we were looking at the index, we observed another minor-seeming operation: the file’s checksum validation. When Git clients write the index, they calculate the SHA-1 hash of its contents and append that to the end of the file. This allows Git to compare that hash when re-reading the index to ensure that it was not damaged by subtle disk corruption.

For small repositories and for ones up to the size of the Linux kernel, this computation is basically a non-issue: calculating the SHA-1 hash while reading the index is very inexpensive. But for a large repository like Windows, hashing the contents of the index is almost as expensive as parsing it in the first place.

We first split off the hash calculation work into a background thread, with excellent results. But ultimately, validating the hash on every single operation is mostly unnecessary: it’s exceptionally rare to see the sort of subtle file corruption that gets detected by the checksum (though it’s not completely unheard of).

So we were able to simplify this to skip the hash calculation entirely when reading the index. Now you can still validate the index checksum using git fsck, but every other operation that reads the index will get a speedup.
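To make the trade-off concrete, here is a minimal sketch of a file with a trailing SHA-1 checksum whose verification is optional on read. It does not reproduce the real index file format: `write_with_trailer`, `read_body`, and the `verify` flag are hypothetical names chosen for illustration, and the body is treated as opaque bytes.

```python
import hashlib

HASH_LEN = hashlib.sha1().digest_size  # 20 bytes for SHA-1


def write_with_trailer(body: bytes) -> bytes:
    """Return the body followed by the SHA-1 hash of the body."""
    return body + hashlib.sha1(body).digest()


def read_body(data: bytes, verify: bool = False) -> bytes:
    """Split off the trailing hash; only recompute it when asked to.

    With verify=False (the everyday case in this sketch) the cost of
    hashing the whole body is skipped entirely.
    """
    if len(data) < HASH_LEN:
        raise ValueError("data too short to contain a checksum")
    body, trailer = data[:-HASH_LEN], data[-HASH_LEN:]
    if verify and hashlib.sha1(body).digest() != trailer:
        raise ValueError("checksum mismatch")
    return body
```

In this model, an everyday read would call `read_body(data)`, while an integrity check in the spirit of git fsck would call `read_body(data, verify=True)`.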

Renames

git itself – the command line application – is certainly the most obvious way that we work with Git repositories, but it’s not the only way. In Visual Studio Team Services, where we host all our Git repositories (including Windows), we use the libgit2 project to work with Git repositories.

libgit2 is an Open Source project that is now primarily maintained by GitHub and Microsoft employees, and it’s architected to support custom database drivers for repository access. This gives Visual Studio Team Services the ability to store repositories very efficiently in Azure blob storage instead of merely dumping bare repositories on a filesystem.

When Microsoft added the merge functionality to libgit2 – so that we could efficiently merge pull requests for Azure-hosted repositories in VSTS – we knew that we would want to handle pull requests from large repositories. But even with our best planning, there were still places where we faced performance issues on projects the size of the Windows repository.

When Git stores revisions, it doesn’t store the list of files that were changed between two revisions, or how they were changed. Instead, it stores a snapshot of the entire tree at each version. This means that when git shows you that you’ve renamed a file, it’s actually gone through all the files in two different revisions, comparing each file that was deleted to each file that was added. If a deleted file is suitably similar to a newly added file, git decides that you must have actually done a rename from the old file to the new.

This rename detection is especially important during a merge – if one developer has renamed a file from foo.txt to bar.txt, and another developer has made changes to foo.txt, you would like to make sure to include those changes in the new filename. With rename detection, the changes will be included in bar.txt, like you expected. Without rename detection, you’ll have a conflict on foo.txt (it was edited in one branch and deleted in another) and you’ll get a new file called bar.txt. This isn’t at all what you want.

Unfortunately, rename detection is inherently quadratic: you compare every deleted file to every added file to determine which has the best similarity match. To keep this from becoming terribly expensive, git has a setting called merge.renameLimit that skips this expensive O(n^2) comparison when n is too large.
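A rough sketch of why this is quadratic, and how a limit bounds the cost, follows. It is not Git's algorithm: the similarity score here comes from Python's `difflib`, which is a stand-in for Git's own scoring, and `detect_renames`, `threshold`, and `rename_limit` are hypothetical names. The way the limit is applied (giving up when either list is longer than the limit) is also an assumption made for illustration.

```python
from difflib import SequenceMatcher


def detect_renames(deleted, added, threshold=0.5, rename_limit=1000):
    """Pair deleted files with similar added files.

    `deleted` and `added` map path -> file content (str).
    Returns a dict mapping old path -> new path.
    """
    # Assumed limit check: skip the expensive pass for large inputs.
    if len(deleted) > rename_limit or len(added) > rename_limit:
        return {}

    # The quadratic part: score every deleted file against every added file.
    candidates = []
    for old_path, old_text in deleted.items():
        for new_path, new_text in added.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= threshold:
                candidates.append((score, old_path, new_path))

    # Take the best-scoring pairs first; each file is used at most once.
    candidates.sort(reverse=True)
    renames = {}
    used_new = set()
    for _score, old_path, new_path in candidates:
        if old_path in renames or new_path in used_new:
            continue
        renames[old_path] = new_path
        used_new.add(new_path)
    return renames
```

With d deleted files and a added files, the nested loop performs d × a content comparisons, which is the cost the limit exists to cap.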

Like git, libgit2 obeys merge.renameLimit for the expensive similarity detection. And like git, libgit2 doesn’t bother with merge.renameLimit for exact rename detection. Instead of comparing the contents of two files to determine how similar they are, exact rename detection simply looks at the file IDs, which are the SHA-1 hashes of their contents. An identical hash means identical contents, so you can easily decide that a file was renamed just by comparing the IDs.

Unfortunately, libgit2 used the same O(n^2) algorithm for its exact rename detection pass, comparing the ID of every deleted file to the ID of every added file. When the Windows team pushed up a very large refactoring that renamed a directory full of files, the exact rename detection went crazy: the quadratic-time comparison of the IDs of the thousands of files that had been touched caused the pull request to time out.

Dealing with this seemingly simple refactoring change caused us to go back and look at libgit2’s rename detection functionality. Instead of comparing every deleted file to every added file, we walk the list of deleted files and build a hash table that maps each ID to its old filename. Then we walk the list of added files, looking up each ID in that hash table: if it’s found, then we know we have an exact rename.

This straightforward change collapsed an O(n^2) operation to a simple detection in linear time and Windows was able once again to create pull requests with these incredibly large refactorings.
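The two-pass approach can be sketched as follows. This is an illustration of the idea rather than libgit2's C implementation: `exact_renames` is a hypothetical name, object IDs are treated as opaque strings, and each file is represented as a `(path, object_id)` pair.

```python
from collections import defaultdict, deque


def exact_renames(deleted, added):
    """Find exact renames by matching object IDs.

    `deleted` and `added` are lists of (path, object_id) pairs.
    Returns a list of (old_path, new_path) pairs.
    """
    # Pass 1: map each object ID to the deleted paths that had it.
    # A queue per ID handles several deleted files with identical content.
    by_id = defaultdict(deque)
    for old_path, oid in deleted:
        by_id[oid].append(old_path)

    # Pass 2: look up each added file's ID; a hit is an exact rename.
    renames = []
    for new_path, oid in added:
        old_paths = by_id.get(oid)
        if old_paths:
            renames.append((old_paths.popleft(), new_path))
    return renames
```

Each file is visited once and each lookup takes constant time on average, so the total work grows linearly with the number of deleted and added files instead of with their product.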

Impact

In many ways, this work is simply the next step of the evolution of Git to handle ever-larger repositories. We’re swapping out less efficient data structures and access patterns for more efficient ones, but this work has been done before. Many of these O(n log n) operations began life as O(n^2) algorithms and were improved once already to help Git scale to where it is today.

But this work is tedious and time consuming. Performance work requires a different set of debugging skills than tracking down most bugs; stepping through in a debugger isn’t usually all that helpful. Good profiling tools help – we updated Git for Windows to be able to compile under Visual Studio specifically to allow us to take advantage of the awesome profiler built into Visual Studio.

But generally it requires setting up a reproduction environment and running the same slow operations over and over again, trying to identify the root cause – or worse, causes – of the performance problem. This is often combined with a lot of staring at the same code, looking for clever insights that come to you slowly (but seem so obvious once you finally see them).

Once we get to the source of the problem, the fix for these performance issues is nearly always a trade-off. Sometimes we throw more memory at the problem, caching some values so that we don’t have to recompute them a second time. Sometimes we throw logic at a problem, recognizing a pattern and exploiting it to reduce the amount of work we have to do. And sometimes we throw CPU at the problem, parallelizing work across multiple threads.

But every performance problem requires us to throw our scarcest and most valuable asset at the problem: our developers.

Although we’re doing this performance work for the Windows team, we’re contributing these changes back to Git to improve its performance for everyone. This impacts the entire software development industry, from Microsoft to the development of the Linux kernel to the next disruptive startup. If you want to help us improve software development everywhere, we’re hiring.
