{"id":32455,"date":"2017-05-30T06:19:46","date_gmt":"2017-05-30T14:19:46","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/visualstudioalm\/?p=32455"},"modified":"2019-02-14T15:51:36","modified_gmt":"2019-02-14T23:51:36","slug":"optimizing-git-beyond-gvfs","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/devops\/optimizing-git-beyond-gvfs\/","title":{"rendered":"Beyond GVFS: more details on optimizing Git for large repositories"},"content":{"rendered":"<p>Over the last few years, Microsoft has been <a href=\"https:\/\/devblogs.microsoft.com\/bharry\/scaling-git-and-some-back-story\/\">moving the entire company to a modern engineering system<\/a> built on <a href=\"https:\/\/www.visualstudio.com\/team-services\/\">Visual Studio Team Services<\/a>\u00a0and using Git as our version control system.\u00a0 For many of the projects within Microsoft, this is no problem, since, as the <a href=\"https:\/\/git-scm.com\/about\/small-and-fast\">Git homepage tells us<\/a>:<\/p>\n<blockquote><p>Git was built to work on the Linux kernel, meaning that it has had to effectively handle large repositories from day one.<\/p><\/blockquote>\n<p>And it&#8217;s true that Git <em>does<\/em> deal very effectively with the Linux kernel, which is indeed quite large for an open source project.\u00a0 It contains 60,000 files in HEAD and its history spans 12 years.<\/p>\n<p>But as far as enterprise projects go: 60,000 files isn&#8217;t really that many.<\/p>\n<p>The repository that I work in contains the Visual Studio Team Services code base and is just a bit over twice as large, with 125,000 files.\u00a0 And at Microsoft, this is simply &#8220;medium sized&#8221;.\u00a0 When we talk about a <em>large\u00a0<\/em>source tree, we talk about one thing:\u00a0 Windows, which weighs in at a whopping 3.5 million files.\u00a0 That&#8217;s the source, tests and build tools needed to create a build of Windows and create an ISO of the entire operating system.<\/p>\n<p>This 
sounds crazy at first but it really shouldn&#8217;t be <em>too<\/em> surprising.\u00a0 When you think about the Linux kernel, those 60,000 files go towards building just a kernel (and the modules) that have to fit into RAM on your machine.\u00a0 My kernel &#8211; a stock Ubuntu 16.04 image &#8211; is 7 megabytes, and comes along with another 200 megabytes of modules.<\/p>\n<p>Linus has joked that the kernel has become &#8220;<a href=\"https:\/\/www.theregister.co.uk\/2009\/09\/22\/linus_torvalds_linux_bloated_huge\/\">bloated and huge<\/a>&#8221;. Certainly this 200 megabytes is much, much larger than it was in the early 90&#8217;s, when a kernel had to fit on a floppy disk and power a machine with only <a href=\"https:\/\/github.com\/rdebath\/SLS-1.02\/blob\/master\/install\/a3\/readme\">4 megabytes of RAM<\/a>, all while leaving enough memory free to actually run the system.<\/p>\n<p>But Windows 10?\u00a0 That&#8217;s a 4GB ISO image.<\/p>\n<p>Since all of Windows &#8211; the kernel, the libraries, the applications &#8211; are released together, they&#8217;re also versioned together, in a large &#8220;<a href=\"https:\/\/www.wired.com\/2015\/09\/google-2-billion-lines-codeand-one-place\/\">monorepo<\/a>&#8221;.\u00a0 When planning the Windows move to Git, we looked at breaking the codebase down into many smaller repositories and layering them together with Git submodules or a system like Android&#8217;s repo.\u00a0 But sometimes monorepos are just the easiest way to collaborate.<\/p>\n<blockquote class=\"twitter-tweet\">\n<p lang=\"en\" dir=\"ltr\"><a href=\"https:\/\/twitter.com\/xjoeduffyx\">@xjoeduffyx<\/a> Even if componentization worked, I&#8217;d still have gone for monorepo. 
Collaboration benefits are paramount: <a href=\"https:\/\/t.co\/xt03PCGh3D\">https:\/\/t.co\/xt03PCGh3D<\/a><\/p>\n<p>\u2014 Joe Duffy (@xjoeduffyx) <a href=\"https:\/\/twitter.com\/xjoeduffyx\/status\/827655274672435201\">February 3, 2017<\/a><\/p><\/blockquote>\n<p>Unfortunately, one of the problems with a giant monorepo like Windows is that Git has not historically coped very well with a repository of that size.<\/p>\n<h3>GVFS<\/h3>\n<p>Over the last few years, we&#8217;ve been working on adapting Git to scale to handle <em>truly<\/em> large monorepos like the Windows repository.\u00a0 The biggest part of this work &#8211; by far &#8211; is GVFS, the Git Virtual Filesystem.\u00a0 GVFS allows our developers to simply <em>not download<\/em> most of those 3.5 million files during a clone, and instead page in the (comparatively) small portion of the source tree that a developer is working with.<\/p>\n<p>Saeed Noursalehi is writing a <a href=\"http:\/\/www.visualstudio.com\/learn\/git-at-scale\">series of articles about GVFS<\/a>, and how it allows us to scale Git.\u00a0 It&#8217;s incredibly advanced work and absolutely necessary to be able to work with a source tree the size of Windows, but it&#8217;s not the only work we&#8217;ve had to do to handle the Windows repository.\u00a0 Putting this many files in a single repository stresses a <em>lot<\/em> of Git&#8217;s data structures and storage mechanisms, even when not all the files are actually present in the working directory.<\/p>\n<p>While GVFS is an important solution for giant repositories like the Windows team&#8217;s, this additional work we\u2019ve done will help regular Git users with more standard repository sizes.<\/p>\n<h3>The Index<\/h3>\n<p>The index &#8211; also known as the &#8220;staging area&#8221; or the &#8220;cache&#8221; &#8211; is one of the core data structures of the Git repository.\u00a0 It contains a list of every file in the repository, and it&#8217;s consulted on almost every operation that 
touches the working directory.\u00a0 The index is populated with a list of paths being checked out when you clone a repository and when you switch branches.\u00a0 It&#8217;s examined when you run status to decide which files are staged and modified.\u00a0 And when you do a merge, the new tree (and all of the conflicts) are stored in the index.<\/p>\n<p>Since it&#8217;s used for so many operations, accessing the index has to be <em>fast<\/em>, even when it contains 3.5 million files.\u00a0 Part of the way Git keeps index accesses fast is by keeping the list of paths sorted so that you can just binary search through them to find what you&#8217;re looking for.<\/p>\n<p>But there&#8217;s overhead in keeping this list sorted.\u00a0 One of the first pain points we noticed in our large repository was switching branches:\u00a0 this common operation could take anywhere from 30 seconds to a minute and a half.\u00a0 Obviously the mere act of putting files on disk is the slowest part of a checkout, but when we dug a little bit deeper, we were surprised to find out that we were also spending a lot of time creating the new index so that it contained the list of files in the new branch.\u00a0 For each file that we were inserting into the index, we would try to figure out where we should insert it.\u00a0 This meant a binary search through the index to find the position of the new path.<\/p>\n<p>Logical enough, except that the list of files we were inserting was itself <em>already<\/em> sorted.\u00a0 So we were busy doing an <code>O(log n)<\/code> lookup on <em>each path\u2026<\/em> just to discover that we needed to append that path at the end of the index.\u00a0 <a href=\"https:\/\/public-inbox.org\/git\/20170418214025.GA4989@hank\/t\/\">So we changed that to skip the binary lookup and just do the append instead<\/a>.<\/p>\n<p>That seemingly little optimization shaved 15-20% of the time off of a <code>git checkout<\/code> invocation.\u00a0 It turns out that <code>O(n log 
n)<\/code> gets rather slow when n is 3.5 million files.<\/p>\n<p>While we were looking at the index, we observed another minor-seeming operation:\u00a0 the file&#8217;s checksum validation.\u00a0 When Git clients write the index, they calculate the SHA-1 hash of its contents and append that to the end of the file.\u00a0 This allows Git to compare that hash when re-reading the index to ensure that it was not damaged by subtle disk corruption.<\/p>\n<p>For small repositories and for ones up to the size of the Linux kernel, this computation is basically a non-issue:\u00a0 calculating the SHA-1 hash while reading the index is very inexpensive.\u00a0 But for a large repository like Windows, hashing the contents of the index is almost as expensive as parsing it in the first place.<\/p>\n<p>We first split off the hash calculation work into a background thread, with excellent results.\u00a0 But ultimately, validating the hash on every single operation is mostly unnecessary: it&#8217;s exceptionally rare to see this sort of subtle file corruption that gets detected by the checksum.\u00a0 (<a href=\"https:\/\/public-inbox.org\/git\/20131016083400.GA31266@sigill.intra.peff.net\/\">Though not completely unheard of<\/a>).<\/p>\n<p>So we were able to simplify this to <a href=\"https:\/\/public-inbox.org\/git\/20170414203221.43015-2-git@jeffhostetler.com\/\">skip the hash calculation entirely when reading the index<\/a>. 
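A minimal sketch of that trailing-checksum idea, in Python. This is a hypothetical illustration, not Git's actual on-disk index format or API: `write_index` and `read_index` are made-up names, and the newline-separated body is a simplification.

```python
import hashlib

# Hypothetical sketch of a trailing SHA-1 checksum, as described above:
# the writer appends a hash of the body, and the reader can either verify
# it or skip the verification pass entirely for speed.

def write_index(entries):
    """Serialize entries and append a 20-byte SHA-1 trailer."""
    body = b"\n".join(entries)
    return body + hashlib.sha1(body).digest()

def read_index(data, verify=True):
    """Parse entries; optionally validate the trailing checksum."""
    body, trailer = data[:-20], data[-20:]
    if verify and hashlib.sha1(body).digest() != trailer:
        raise ValueError("index checksum mismatch")
    return body.split(b"\n")
```

Reading with `verify=False` never hashes the body at all, which is where the savings come from on a multi-million-entry file; an explicit integrity check can still pass `verify=True`.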
\u00a0Now you can still validate the index checksum using <code>git fsck<\/code>, but every other operation that reads the index will get a speedup.<\/p>\n<h3>Renames<\/h3>\n<p>git itself &#8211; the command line application &#8211; is certainly the most obvious way that we work with Git repositories, but it&#8217;s not the only way.\u00a0 In Visual Studio Team Services, where we host all our Git repositories (including Windows), we use the libgit2 project to work with Git repositories.<\/p>\n<p>libgit2 is an Open Source project that is now primarily maintained by GitHub and Microsoft employees, and it&#8217;s architected to support custom database drivers for repository access.\u00a0 This gives Visual Studio Team Services the ability to store repositories very efficiently in Azure blob storage instead of merely dumping bare repositories on a filesystem.<\/p>\n<p>When Microsoft added the merge functionality to libgit2 &#8211; so that we could efficiently merge pull requests for Azure-hosted repositories in VSTS &#8211; we knew that we would want to handle pull requests from large repositories.\u00a0 But even with our best planning, there were still places where we faced performance issues on projects the size of the Windows repository.<\/p>\n<p>When Git stores revisions, it doesn&#8217;t store the list of files that were changed between two revisions, or how they were changed.\u00a0 Instead, it stores a snapshot of the entire tree at each version. 
This means that when git shows you that you&#8217;ve renamed a file, it&#8217;s actually gone through all the files in two different revisions, comparing each file that was deleted to each file that was added.\u00a0 If a deleted file is suitably similar to a newly added file, git decides that you must have actually done a rename from the old file to the new.<\/p>\n<p>This rename detection is especially important during a merge &#8211; if one developer has renamed a file from <code>foo.txt<\/code> to <code>bar.txt<\/code>, and another developer has made changes to <code>foo.txt<\/code>, you would like to make sure to include those changes in the new filename.\u00a0 With rename detection, the changes will be included in <code>bar.txt<\/code>, like you expected.\u00a0 Without rename detection, you&#8217;ll have a conflict on <code>foo.txt<\/code> (it was edited in one branch and deleted in another) and you&#8217;ll get a new file called <code>bar.txt<\/code>.\u00a0 This isn&#8217;t at all what you want.<\/p>\n<p>Unfortunately, rename detection is inherently quadratic:\u00a0 you compare <em>every<\/em> deleted file to <em>every<\/em> added file to determine which has the best similarity match.\u00a0 To keep this from becoming terribly expensive, git has a setting, <code>merge.renameLimit<\/code>, that skips this expensive <code>O(n^2)<\/code> comparison when <code>n<\/code> grows too large.<\/p>\n<p>Like git, libgit2 obeys <code>merge.renameLimit<\/code> for the expensive similarity detection.\u00a0 And like git, libgit2 doesn&#8217;t bother with <code>merge.renameLimit<\/code> for <em>exact<\/em> rename detection.\u00a0 Instead of comparing the contents of two files to determine how similar they are, exact rename detection simply looks at the file IDs, which are the SHA-1 hashes of their contents.\u00a0 An identical hash means identical contents, so you can easily decide that a file was renamed just by comparing the ID.<\/p>\n<p>Unfortunately, libgit2 used the same 
<code>O(n^2)<\/code> algorithm that compared the ID of every deleted file to the ID of every added file when doing an exact rename detection pass.\u00a0 When Windows pushed up a very large refactoring, where they renamed a directory full of files, the exact rename detection went crazy: the quadratic-time comparison of the IDs of the thousands of files that had been touched caused this pull request to time out.<\/p>\n<p>Dealing with this seemingly simple refactoring change caused us to go back and look at libgit2&#8217;s rename detection functionality.\u00a0 Instead of comparing every deleted file to every added file, we walk the list of deleted files and build a hash, mapping their IDs to their old filenames.\u00a0 Then we walk the list of added files, looking for each ID in that hash:\u00a0 if it&#8217;s found then we know we have an exact rename.<\/p>\n<p><a href=\"https:\/\/github.com\/libgit2\/libgit2\/pull\/4202\">This straightforward change<\/a> collapsed an <code>O(n^2)<\/code> operation into a simple linear-time detection, and Windows was able once again to create pull requests containing these incredibly large refactorings.<\/p>\n<h3>Impact<\/h3>\n<p>In many ways, this work is simply the next step of the evolution of Git to handle ever-larger repositories.\u00a0 We&#8217;re swapping out less efficient data structures and access patterns for more efficient ones, but this kind of work has been done before.\u00a0 Many of these <a href=\"https:\/\/public-inbox.org\/git\/20170420183455.y5oz3hspz3v2g6yr@sigill.intra.peff.net\/\"><code>O(n log n)<\/code> operations began life as <code>O(n^2)<\/code><\/a> algorithms and were improved once to help Git scale to where it is today.<\/p>\n<p>But this work is tedious and time consuming.\u00a0 Performance work requires a different set of debugging skills than tracking down most bugs; stepping through in a debugger isn&#8217;t usually all that helpful.\u00a0 Good profiling tools help &#8211; <a 
href=\"https:\/\/github.com\/git-for-windows\/git\/pull\/773\">we updated Git for Windows to be able to compile under Visual Studio<\/a> specifically to allow us to take advantage of the awesome profiler built in to Visual Studio.<\/p>\n<p>But generally it requires setting up a reproduction environment, and running the same slow operations over and over again, trying to identify the root cause &#8211; or worse, <em>causes<\/em> &#8211; of the performance problem.\u00a0 This is often combined with a lot of staring at the same code, looking for clever insights that come to you slowly (but seem so obvious once you finally see them).<\/p>\n<p>Once we get to the source of the problem, the fix for these performance issues is nearly always a trade-off.\u00a0 Sometimes we throw more memory at the problem, caching some values so that we don&#8217;t have to recompute them a second time.\u00a0 Sometimes we throw logic at a problem, recognizing a pattern and exploiting it to reduce the amount of work we have to do.\u00a0 And sometimes we throw CPU at the problem, parallelizing work across multiple threads.<\/p>\n<p>But every performance problem requires us to throw our scarcest and most valuable asset at the problem: our developers.<\/p>\n<p>Although we\u2019re doing this performance work for the Windows team, we\u2019re contributing these changes back to Git to improve its performance for <em>everyone<\/em>.\u00a0 This impacts the entire software development industry, from Microsoft to the development of the Linux kernel to the next disruptive startup.\u00a0 If you want to help us improve software development everywhere:\u00a0 <a href=\"http:\/\/aka.ms\/vstsjobs\">we\u2019re hiring<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Over the last few years, Microsoft has been moving the entire company to a modern engineering system built on Visual Studio Team Services\u00a0and using Git as our version control system.\u00a0 For many of the projects within Microsoft, this is no 
problem, since, as the Git homepage tells us: Git was built to work on the Linux [&hellip;]<\/p>\n","protected":false},"author":233,"featured_media":45953,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,225],"tags":[],"class_list":["post-32455","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops","category-git"],"acf":[],"blog_post_summary":"<p>Over the last few years, Microsoft has been moving the entire company to a modern engineering system built on Visual Studio Team Services\u00a0and using Git as our version control system.\u00a0 For many of the projects within Microsoft, this is no problem, since, as the Git homepage tells us: Git was built to work on the Linux [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/posts\/32455","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/users\/233"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/comments?post=32455"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/posts\/32455\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/media\/45953"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/media?parent=32455"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/categories?post=32455"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/tags
?post=32455"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
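The exact-rename pass described in the post can be sketched in a few lines. This is a hypothetical Python illustration, not libgit2's actual code: `detect_exact_renames` is a made-up name, the plain dicts stand in for diff lists, and the short strings play the role of SHA-1 blob IDs.

```python
# Hypothetical sketch of exact rename detection in linear time: instead of
# comparing every deleted file's ID against every added file's ID (O(n^2)),
# build a hash table mapping content ID -> old path, then probe it once per
# added file. Identical IDs imply identical contents, hence an exact rename.

def detect_exact_renames(deleted, added):
    """deleted/added map path -> content ID; return (old_path, new_path) pairs."""
    old_path_by_id = {blob_id: path for path, blob_id in deleted.items()}
    renames = []
    for new_path, blob_id in added.items():
        old_path = old_path_by_id.get(blob_id)
        if old_path is not None:
            renames.append((old_path, new_path))
    return renames
```

A rename of `foo.txt` to `bar.txt` with unchanged contents is found with a single hash-table probe rather than a scan over every deleted file.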