Mundane git tricks: Combining two files into one while preserving line history

Suppose you have two files that you want to combine into one. Let’s set up a scratch repo to demonstrate. I’ve omitted the command prompts so you can copy-paste this into your shell of choice and play along at home. (The timestamps and commit hashes will naturally be different.)

git init

>fruits echo apple
git add fruits
git commit --author="Alice <alice>" -m "create fruits"
>>fruits echo grape
git commit --author="Bob <bob>"     -am "add grape"
>>fruits echo orange
git commit --author="Carol <carol>" -am "add orange"

>veggies echo celery
git add veggies
git commit --author="David <david>" -m "create veggies"
>>veggies echo lettuce
git commit --author="Eve <eve>"     -am "add lettuce"
>>veggies echo peas
git commit --author="Frank <frank>" -am "add peas"

git tag ready

We now have two files, one with fruits and one with vegetables. Each has its own history, and the git blame command can attribute each line to the commit that introduced it.

git blame fruits

^adbef3a (Alice 2019-05-14 07:00:00 -0700 1) apple
8312990f (Bob   2019-05-14 07:00:01 -0700 2) grape
2259ff53 (Carol 2019-05-14 07:00:02 -0700 3) orange

git blame veggies

2f11bacc (David 2019-05-14 07:00:03 -0700 1) celery
2d7b11e8 (Eve   2019-05-14 07:00:04 -0700 2) lettuce
8c8cf113 (Frank 2019-05-14 07:00:05 -0700 3) peas

Now you decide that fruits and veggies should be combined into a single file called produce. How do you do this while still preserving the commit histories of the contributing files?

The naïve way of combining the files would be to do it in a single commit:

cat fruits veggies > produce
git rm fruits veggies
git add produce
git commit --author="Greg <greg>" -m "combine"

The resulting file gets blamed like this:

eefddfb1 produce (Greg  2019-05-14 07:01:00 -0700 1) apple
eefddfb1 produce (Greg  2019-05-14 07:01:00 -0700 2) grape
eefddfb1 produce (Greg  2019-05-14 07:01:00 -0700 3) orange
7a542f13 veggies (David 2019-05-14 07:00:03 -0700 4) celery
2c258db0 veggies (Eve   2019-05-14 07:00:04 -0700 5) lettuce
87296161 veggies (Frank 2019-05-14 07:00:05 -0700 6) peas

The history from veggies was preserved, but the history from fruits was not. What git saw in the commit was that one file appeared and two files vanished. The rename detection machinery kicked in and decided that, since the majority of the produce file matches the veggies file, what you did was delete the fruits file, rename the veggies file to produce, and then add three new lines to the top of produce.

You can tweak the git blame algorithms with options like -M and -C to get it to try harder, but in practice, you often don’t have control over those options: The git blame may be performed on a server, and the results reported back to you on a web page. Or the git blame is performed by a developer sitting at another desk (whose command line options you don’t get to control), and poor Greg has to deal with all the tickets that get assigned to him from people who used the git blame output to figure out who introduced the line that’s causing problems.
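If you want to see for yourself what git concluded about the naïve commit, here is a minimal sketch that replays it in a throwaway repo. It is condensed to one commit per file, and the temporary directory and the "Scratch" identity are my own placeholders, not part of the walkthrough above. The -M and -C options accept an optional threshold on how much text has to match, so with files this tiny the default settings may well find nothing; I make no promises about the output.

```shell
#!/bin/sh
# Sketch: replay the single-commit combine, then ask git what it thinks happened.
set -e
repo=$(mktemp -d)     # hypothetical scratch location
cd "$repo"
git init -q
# Placeholder identity so the commits succeed on a machine with no git config.
export GIT_AUTHOR_NAME=Scratch GIT_AUTHOR_EMAIL=scratch@example.invalid
export GIT_COMMITTER_NAME=Scratch GIT_COMMITTER_EMAIL=scratch@example.invalid

printf 'apple\ngrape\norange\n' >fruits
git add fruits
git commit -q -m "create fruits"
printf 'celery\nlettuce\npeas\n' >veggies
git add veggies
git commit -q -m "create veggies"

# The naive combine: one commit that adds produce and removes the originals.
cat fruits veggies >produce
git rm -q fruits veggies
git add produce
git commit -q -m "combine"

# How did rename detection classify the combining commit?
git diff --name-status -M HEAD~1 HEAD

# Default blame, then blame asked to hunt for lines copied from other files.
git blame produce
git blame -C -C -C produce
```

Even if the second blame does better on your machine, that only helps the people who remember to pass the extra options.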

What we want is a way to get git blame to report the correct histories for both the fruits and the vegetables.

The trick is to use a merge. Let’s reset back to the original state.

git reset --hard ready

We set up two branches. In one branch, we rename veggies to produce. In the other branch, we rename fruits to produce.

git checkout -b rename-veggies
git mv veggies produce
git commit --author="Greg <greg>" -m "rename veggies to produce"

git checkout -
git mv fruits produce
git commit --author="Greg <greg>" -m "rename fruits to produce"

git merge -m "combine fruits and veggies" rename-veggies

The merge fails with a rename-rename conflict:

CONFLICT (rename/rename):
Rename fruits->produce in HEAD.
Rename veggies->produce in rename-veggies

Renaming fruits to produce~HEAD
and veggies to produce~rename-veggies instead

Automatic merge failed; fix conflicts and then commit the result.

Update: Version 2.25.1 changed what happens in the case of a rename/rename conflict.

CONFLICT (rename/rename):
Rename fruits->produce in HEAD.
Rename veggies->produce in rename-veggies

Auto-merging produce

Automatic merge failed; fix conflicts and then commit the result.

At this point, you create the resulting produce file by combining the two originals.

If running a version before 2.25.1:

cat "produce~HEAD" "produce~rename-veggies" >produce

If running version 2.25.1 or later:

git cat-file --filters HEAD:produce >produce
git cat-file --filters rename-veggies:produce >>produce

Once you’ve generated the combined file, you can treat the file as resolved.

git add produce
git merge --continue

The resulting produce file was created by a merge, so git knows to look in both parents of the merge to learn what happened. And that’s where it sees that each parent contributed half of the file, and it also sees that the files in each branch were themselves created via renames of other files, so it can chase the history back into both of the original files.

^fa19403 fruits  (Alice 2019-05-14 07:00:00 -0700 1) apple
00ef7240 fruits  (Bob   2019-05-14 07:00:01 -0700 2) grape
10e90730 fruits  (Carol 2019-05-14 07:00:02 -0700 3) orange
7a542f13 veggies (David 2019-05-14 07:00:03 -0700 4) celery
2c258db0 veggies (Eve   2019-05-14 07:00:04 -0700 5) lettuce
87296161 veggies (Frank 2019-05-14 07:00:05 -0700 6) peas

Magic! Greg is nowhere to be found in the blame history. Each line is correctly attributed to the person who introduced it in the original file, whether it’s fruits or veggies. People investigating the produce file get a more accurate history of who last touched each line of the file.
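For the skeptical, here is a sketch that replays the whole technique non-interactively in a throwaway repo and then looks at the result. It is condensed to one commit per file. The temporary directory is a placeholder of mine, and I finish the merge with git commit --no-edit rather than git merge --continue so that no editor pops up in a script. The resolution uses the git cat-file form, which reads from the two commits rather than from the working directory.

```shell
#!/bin/sh
# Sketch: replay the merge technique and inspect the resulting history.
set -e
repo=$(mktemp -d)     # hypothetical scratch location
cd "$repo"
git init -q
# Commits made without --author (the renames and the merge) belong to Greg.
export GIT_AUTHOR_NAME=Greg GIT_AUTHOR_EMAIL=greg@example.invalid
export GIT_COMMITTER_NAME=Greg GIT_COMMITTER_EMAIL=greg@example.invalid

printf 'apple\ngrape\norange\n' >fruits
git add fruits
git commit -q --author="Alice <alice>" -m "create fruits"
printf 'celery\nlettuce\npeas\n' >veggies
git add veggies
git commit -q --author="David <david>" -m "create veggies"

# One pure rename on each side.
git checkout -q -b rename-veggies
git mv veggies produce
git commit -q -m "rename veggies to produce"
git checkout -q -
git mv fruits produce
git commit -q -m "rename fruits to produce"

# The merge is expected to stop with a rename/rename conflict.
git merge -m "combine fruits and veggies" rename-veggies || true

# Resolve by concatenating each side's version of produce.
git cat-file --filters HEAD:produce >produce
git cat-file --filters rename-veggies:produce >>produce
git add produce
git commit -q --no-edit     # concludes the merge without opening an editor

git rev-list --parents -n 1 HEAD    # the merge commit, then its two parents
git log --graph --oneline
git blame produce
```

The rev-list line is the interesting one: a merge commit lists two parents, and those are the two trails that git blame follows.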

Greg might need to do some editing to the two files before committing. Maybe the results need to be sorted, and maybe Greg figures he should add a header to remind people to keep it sorted.

>produce echo "# keep sorted"
git cat-file --filters HEAD:produce >>produce
git cat-file --filters rename-veggies:produce >>produce
sort -o produce produce
git add produce
git merge --continue

git blame produce

057507c7 produce (Greg  2019-05-14 07:01:00 -0700 1) # keep sorted
^943c65d fruits  (Alice 2019-05-14 07:00:00 -0700 2) apple
cfce62ae veggies (David 2019-05-14 07:00:03 -0700 3) celery
43c9aeb6 fruits  (Bob   2019-05-14 07:00:01 -0700 4) grape
5f60490e veggies (Eve   2019-05-14 07:00:04 -0700 5) lettuce
143eb20f fruits  (Carol 2019-05-14 07:00:02 -0700 6) orange
75a1ad0c veggies (Frank 2019-05-14 07:00:05 -0700 7) peas

For best results, your rename commit should be a pure rename. Resist the temptation to edit the file’s contents at the same time you rename it. A pure rename ensures that git’s rename detection will find the match. If you edit the file in the same commit as the rename, then whether the rename is detected as such will depend on git’s “similar files” heuristic.¹ If you need to edit the file as well as rename it, do it in two separate commits: One for the rename, another for the edit.
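Here is a small sketch of the two-commit approach, with a check that git sees the rename commit as an exact match. The throwaway repo, the "Scratch" identity, and the spinach are all made up for illustration.

```shell
#!/bin/sh
# Sketch: rename in one commit, edit in the next, then check what git detected.
set -e
repo=$(mktemp -d)     # hypothetical scratch location
cd "$repo"
git init -q
# Placeholder identity so the commits succeed on a machine with no git config.
export GIT_AUTHOR_NAME=Scratch GIT_AUTHOR_EMAIL=scratch@example.invalid
export GIT_COMMITTER_NAME=Scratch GIT_COMMITTER_EMAIL=scratch@example.invalid

printf 'celery\nlettuce\npeas\n' >veggies
git add veggies
git commit -q -m "create veggies"

# First commit: the rename and nothing else.
git mv veggies produce
git commit -q -m "rename veggies to produce"

# Second commit: the edit, on its own.
echo spinach >>produce
git commit -q -am "add spinach"

# The rename commit is reported as a 100% match (R100).
git diff --name-status -M HEAD~2 HEAD~1

# The file's history can be followed back through the rename.
git log --follow --oneline -- produce
```

Because the contents are byte-for-byte identical across the rename commit, the match does not depend on the similarity heuristic at all.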

Wait, we didn’t use git commit-tree yet. What’s this doing in the Mundane git commit-tree tricks series?

We’ll add commit-tree to the mix next time. Today was groundwork, but this is a handy technique to keep in your bag of tricks, even if you never get around to the commit-tree part.

¹ If you cross the merge.renameLimit, then git won’t look for similar files; it requires exact matches. The Windows repo is so large that the rename limit is easily exceeded. The “similar files” detector is O(m × n) in the number of files changed in the two parents, and when your repo has 3 million files, that quadratic growth becomes a problem.
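The limit is an ordinary config setting, so if you suspect you are hitting it in a repo of more modest size, you can inspect or raise it. This is only a sketch: the value below is an arbitrary number picked for illustration, not a recommendation, and as far as I know git falls back to diff.renameLimit when merge.renameLimit is not set.

```shell
#!/bin/sh
# Sketch: set and read back the rename limit used during merges.
set -e
repo=$(mktemp -d)     # hypothetical scratch location
cd "$repo"
git init -q

# 20000 is an illustrative value only.
git config merge.renameLimit 20000
git config --get merge.renameLimit
```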

7 comments


Jon Wiswall July 22, 2019

For merging lots of files you might need this:
git merge -X find-renames=2
Found via this stackoverflow q&a. The key to look for is merge printing “CONFLICT (rename/rename)” instead of “CONFLICT (rename/add)” or similar.
- Raymond Chen Author July 22, 2019
  
  Part 7 of the series shows how to merge multiple files in one go.
tbodt May 16, 2019

You can put the shell redirect before the command??
- Raymond Chen Author May 17, 2019
  
  Not only can you do it, in many cases, you must do it to avoid unwanted spaces, or to avoid conflicts with lines that end with a digit.
Neil Rashbrook May 15, 2019

It’s O(n²) in the number of files changed? Unless you’re asking it to detect copies, I would have thought it would only be O(n²) in the number of files deleted, or am I missing something?
- Raymond Chen Author May 15, 2019
  
  Oops, you’re right. It’s O(m×n), where m is the number of added files, and n is the number of deleted files. But in the Windows repo, that’s still a very large number when, say, an aggregation branch merges into the trunk.
Alexis Ryan May 15, 2019

This is a nice trick to know