Supercharging the Git Commit Graph
Have you ever run gitk and waited a few seconds before the window appears? Have you struggled to visualize your commit history as a sane order of contributions instead of a stream of parallel work? Have you ever run a force-push and waited seconds for Git to give any output? You may be having performance issues due to the number of commits in your repository.
If you have a large repository, then you may notice that git log --graph takes a few seconds to write any output, while Visual Studio Team Services (VSTS) returns these results very quickly.
This is due to some really cool algorithms we built and tested on the server side. We recently took the first steps in bringing those algorithms to the whole open source Git community by submitting the code to the core Git project.
This week marks the release of Git 2.18.0 and Git for Windows 2.18.0. There are a lot of cool features and performance enhancements in this one, so I hope you upgrade and enjoy! One new feature in 2.18 is a serialized commit-graph. I think a lot of users will benefit from this feature, especially if you are working in a large repository with tens of thousands of commits (or more). The feature is optional, so right now you’ll need to enable it manually.
How to Enable the Commit-Graph Feature
Currently, the commit-graph feature requires a bit of self-maintenance, but we hope to improve this experience in future versions.
This is an experimental feature! Please use with caution. You can always turn off the feature using git config core.commitGraph false. There are a few Git features that don’t work well with the commit-graph, such as shallow clones, replace-objects, and commit grafts. If you never use any of those features, then you should have no problems!
To enable the commit-graph feature in your repository, run git config core.commitGraph true. Then, you can update your commit-graph file by running:
git show-ref -s | git commit-graph write --stdin-commits
You are good to go! That last command created a file at .git/objects/info/commit-graph relative to your repository root. This file contains a compact description of your commit history that is faster to parse than decompressing the commit objects stored in your packfiles and loose objects.
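Because the file is not yet updated automatically, it can help to wrap these steps in a small script. The following is a minimal sketch, not an official Git tool: the function name refresh_commit_graph is hypothetical, and it assumes Git 2.18 or later is on your PATH. It enables the setting, rewrites the file from the tips of all refs, and confirms that the file exists.

```shell
# Hypothetical helper (not part of Git): enable the commit-graph feature in a
# repository and rebuild the commit-graph file from all ref tips.
# Usage: refresh_commit_graph [path-to-repo]
refresh_commit_graph() (
    # The function body runs in a subshell, so this cd does not leak out.
    cd "${1:-.}" || exit 1

    # Turn the feature on for this repository.
    git config core.commitGraph true || exit 1

    # Feed the commit at the tip of every ref to the writer.
    git show-ref -s | git commit-graph write --stdin-commits || exit 1

    # Ask Git where the file lives instead of hard-coding the .git directory.
    graph_file="$(git rev-parse --git-path objects/info/commit-graph)"
    if [ -f "$graph_file" ]; then
        echo "commit-graph written: $graph_file"
    else
        echo "commit-graph file not found" >&2
        exit 1
    fi
)
```

Calling refresh_commit_graph after each fetch is one way to keep the file close to current until Git maintains it for you.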
Go and test your favorite commands and see how long they take. You can compare commands before and after the commit-graph feature using something like the following:
time git -c core.commitGraph=false log --graph --oneline -10
time git -c core.commitGraph=true log --graph --oneline -10
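Since the feature is experimental, it may also be worth confirming that a command prints the same result with the commit-graph on and off before trusting the faster timing. The following is a minimal sketch under that assumption; the function name check_commit_graph_output is hypothetical, and it should be run from inside the repository you are testing.

```shell
# Hypothetical helper (not part of Git): run one Git command with the
# commit-graph disabled and then enabled, and report whether both runs
# printed the same output.
# Usage: check_commit_graph_output <git arguments...>
check_commit_graph_output() {
    without_graph="$(git -c core.commitGraph=false "$@")" || return 1
    with_graph="$(git -c core.commitGraph=true "$@")" || return 1

    if [ "$without_graph" = "$with_graph" ]; then
        echo "identical output: git $*"
    else
        echo "output differs: git $*" >&2
        return 1
    fi
}
```

For example, check_commit_graph_output log --graph --oneline -10 checks the same command that is timed above.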
I’d love to hear from you if you’ve had success with certain commands because of this feature!
If you don’t feel like testing this yourself without proof of the benefit, here are some performance numbers for a few important repos: Linux, Git, and Windows. In the case of Linux and Git, I include the exact commits I use so you can reproduce a similar experiment.
The Linux kernel repository is the gold standard for Git performance. It has a large number of files and many commits (over 750,000), and it is publicly available for anyone to clone and test.
For this test, I had the following branch values:
This version of master can reach 722,849 commits and is 30,986 commits behind
The Git repository is also publicly available, but is much smaller than the Linux repository. However, it is large enough to see benefits with the commit-graph feature.
For this test, I had the following branch values:
This version of master can reach 49,361 commits and is 2,032 commits behind
The Windows Repository
The developers making Microsoft Windows use Git, enhanced by the Git Virtual File System (GVFS). We deployed the commit-graph feature to the Windows developers with a recent (private) release of GVFS. In that version, GVFS handles the maintenance of the commit-graph file, so it is updated with every fetch.
My local version of master has 2,214,796 reachable commits. The git status command improves because my local version of master is 81,776 commits behind origin/master, and git status walks commits to compute this count. With 4,000+ developers working in the repo, the branches move very quickly, so this is a realistic difference between a local and remote branch.
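The ahead/behind count that git status reports can be reproduced directly with git rev-list, which makes it possible to examine that commit walk on its own. The following is a minimal sketch: the function name ahead_behind is hypothetical, and the default refs (HEAD and origin/master) are assumptions that you should replace with your own branch names.

```shell
# Hypothetical helper (not part of Git): print how many commits a local ref is
# ahead of and behind another ref.
# Usage: ahead_behind [local-ref] [upstream-ref]
ahead_behind() {
    local_ref="${1:-HEAD}"
    upstream_ref="${2:-origin/master}"

    # With --left-right --count, rev-list prints "<left><TAB><right>":
    # commits reachable only from the left ref, then only from the right ref.
    counts="$(git rev-list --count --left-right "$local_ref...$upstream_ref")" || return 1

    # Split the two tab-separated numbers into $1 and $2.
    set -- $counts
    echo "ahead=$1 behind=$2"
}
```

Git has to walk every commit that separates the two refs to produce these numbers, so the walk gets longer the further behind a branch falls.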
The above performance numbers are nice, but they are also isolated tests that I ran on my machine. It’s much better to have real-life examples of this helping users in their actual workflows.
For example, one user complained that a force-push command was slow. We found that the amount of data being sent to the server was not the problem. Instead, we found that the logic for deciding if a force-push is necessary walks the entire commit history from the new ref location. This meant that Git was walking over two million commits! The improved parse speed of the commit-graph feature was enough to improve the force-push time in this example from 90 seconds to 30 seconds. We are working to modify this logic so it doesn’t require walking all of those commits.
My History with the Git Commit Graph
Before I joined Microsoft, I was a mathematician working in computational graph theory. I spent years thinking about graphs every day, so it was a habit that was hard to break. Good thing Git stores its data as a directed acyclic graph, so everything we do in Git involves graphs in one way or another.
A few years ago, I left academia and joined the VSTS Git server team. My first year was spent mainly on implementing a commit-graph feature that accelerated commit walks on the service. While my contributions were only on the back-end server code, a fantastic team created a way to visualize the commit history as a graph on the web. This means that whenever you view the history of your repo, you’ll see the same output as if you ran git log --graph, complete with a visualization of commit parents. Also, Matt talked a bit about the commit-graph in a performance blog post.
The above pictures show the commit history page on VSTS for the GitForWindows repository and a related git log --topo-order call. The --topo-order flag tells Git to order the commits the same as a git log --graph call, but doesn’t render the commit-to-parent edges. In this case, there are so many merges that the git log --graph output becomes a huge mess. VSTS uses the same graph rendering as Team Explorer in Visual Studio.
One problem with launching this feature was that the corresponding Git command is slow. For the command above, git log --topo-order took 2.8 seconds. It takes even longer for larger repositories that have millions of commits! Today, the web request in VSTS takes around 0.22 seconds including a round trip to the server. Trying to do similar commands with the Linux kernel (750K commits) or the Windows repository (2 million commits) becomes quite painful on the command line, but the web view stays around 200-400 milliseconds for most queries.
After being on the Git server team for VSTS, I chose to switch to the client team that works on Git, GVFS, and other version control clients. The primary reason I wanted to switch was so I could provide the same performance benefits we implemented on our servers to the Git community. The commit-graph feature in Git 2.18 is a major step in this direction. The current state of the commit-graph feature is almost exactly as I described in a talk at Git Merge 2018.
You can continue reading the next article in this series, Part II: File Format. In the coming weeks, I’ll post more articles that give more details about the commit-graph feature in Git 2.18, some powerful algorithms we have in VSTS, and how we are bringing those algorithms to Git soon.