September 16th, 2019

How do I split a file into two while preserving git line history?

Some time ago, I showed how to combine two files into one while preserving line history. Today, we’re going to do the opposite: Split a file into two smaller files, while preserving line history.

Let’s set up a scratch repo to demonstrate. I’ve omitted the command prompts so you can copy-paste this into your shell of choice and play along at home. (The timestamps and commit hashes will naturally be different.)

git init

>foods echo apple
>>foods echo celery
>>foods echo cheese
git add foods
git commit --author="Alice <alice>" -m created

>>foods echo eggs
>>foods echo grape
>>foods echo lettuce
git commit --author="Bob <bob>"   -am middle

>>foods echo milk
>>foods echo orange
>>foods echo peas
git commit --author="Carol <carol>" -am last

git tag ready

With this starting point, the git blame output says

^e7a114d (Alice 2019-09-16 07:00:00 -0700 1) apple
^e7a114d (Alice 2019-09-16 07:00:00 -0700 2) celery
^e7a114d (Alice 2019-09-16 07:00:00 -0700 3) cheese
86348be4 (Bob   2019-09-16 07:00:01 -0700 4) eggs
86348be4 (Bob   2019-09-16 07:00:01 -0700 5) grape
86348be4 (Bob   2019-09-16 07:00:01 -0700 6) lettuce
34eb5bd1 (Carol 2019-09-16 07:00:02 -0700 7) milk
34eb5bd1 (Carol 2019-09-16 07:00:02 -0700 8) orange
34eb5bd1 (Carol 2019-09-16 07:00:02 -0700 9) peas

As we noted when we learned how to combine two files, the naïve way of splitting the file will treat the larger file as a continuation of the original (assuming you haven’t hit the rename limit), and the smaller file will be treated as a brand new file. The blame of the smaller file will blame you, the person who split them, instead of blaming the person who introduced each line.
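To see the problem concretely, here is a minimal sketch of the naïve split. It assumes a throwaway repo created with mktemp and a placeholder identity for Greg, neither of which is part of the original setup. In this particular example each piece is only about a third of the original, so none of them should look similar enough to foods for git to treat it as a rename at its default 50% similarity threshold, and blame should pin every line on Greg.

```shell
# Throwaway repo; the path and the identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Greg"
git config user.email "greg@example.invalid"

# Rebuild the three-author foods file from the setup above.
printf 'apple\ncelery\ncheese\n' > foods
git add foods
git commit -q --author="Alice <alice>" -m created
printf 'eggs\ngrape\nlettuce\n' >> foods
git commit -q --author="Bob <bob>" -am middle
printf 'milk\norange\npeas\n' >> foods
git commit -q --author="Carol <carol>" -am last

# Naive split: carve all three files out of foods in a single commit.
grep -E '^(apple|grape|orange)$' foods > fruits
grep -E '^(celery|lettuce|peas)$' foods > veggies
grep -E '^(cheese|eggs|milk)$' foods > dairy
git rm -q foods
git add fruits veggies dairy
git commit -q -m "naive split"

# Count how many lines of fruits are blamed on the person who did the split.
naive_greg_lines=$(git blame fruits | grep -c Greg)
echo "lines of fruits blamed on Greg: $naive_greg_lines"
```

Running git blame on veggies or dairy in the same repo is a quick way to check whether the other pieces fare any better.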

To get git to follow the line attributions, we have to make each of the result files look like a rename of the original. We can do this by creating each piece in a different branch, then merging them all together.

In a new fruits branch, the first step is to do a pure rename, so that git will recognize that the fruits file is a continuation of the foods file.

git checkout -b fruits
git mv foods fruits
git commit --author="Greg <greg>" -m "split foods to fruits"
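If you want to confirm that git really sees this commit as a pure rename, you can ask diff for the rename status. The sketch below is self-contained so it can be run outside the scratch repo; the temporary directory and the identity are assumptions made for the example.

```shell
# Throwaway repo; the path and the identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Greg"
git config user.email "greg@example.invalid"

printf 'apple\ncelery\ncheese\n' > foods
git add foods
git commit -q -m created

# The pure rename, with no content changes in the same commit.
git checkout -q -b fruits
git mv foods fruits
git commit -q -m "split foods to fruits"

# -M turns on rename detection; --name-status prints one status line per path.
rename_status=$(git diff -M --name-status HEAD~1 HEAD)
echo "$rename_status"
```

A status beginning with R100 means git considers the new file a 100% match for the old one, which is what lets blame follow the lines back into foods.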

Now you can edit the fruits file to contain just the part you want to split out. In this case, we want the fruits (duh).

>fruits echo apple
>>fruits echo grape
>>fruits echo orange
git commit --author="Greg <greg>" -am "split foods to fruits"

git checkout -

Repeat for the veggies.

git checkout -b veggies
git mv foods veggies
git commit --author="Greg <greg>" -m "split foods to veggies"

>veggies echo celery
>>veggies echo lettuce
>>veggies echo peas
git commit --author="Greg <greg>" -am "split foods to veggies"

git checkout -

The last file (dairy) can be done directly in the original branch.

git mv foods dairy
git commit --author="Greg <greg>" -m "split foods to dairy"

>dairy echo cheese
>>dairy echo eggs
>>dairy echo milk
git commit --author="Greg <greg>" -am "split foods to dairy"

And now we octopus merge all the branches together.

git merge fruits veggies

This time, the octopus merge succeeds. All branches agree that the foods file should be deleted, so there are no merge conflicts.

Trying simple merge with fruits
Trying simple merge with veggies
Merge made by the 'octopus' strategy.
 fruits  | 3 +++
 veggies | 3 +++
 2 files changed, 6 insertions(+)
 create mode 100644 fruits
 create mode 100644 veggies

And lo and behold, all three resulting files preserved the original line histories. Greg doesn’t show up anywhere.

git blame fruits

^e7a114d foods (Alice 2019-09-16 07:00:00 -0700 1) apple
86348be4 foods (Bob   2019-09-16 07:00:01 -0700 2) grape
34eb5bd1 foods (Carol 2019-09-16 07:00:02 -0700 3) orange

git blame veggies

^e7a114d foods (Alice 2019-09-16 07:00:00 -0700 1) celery
86348be4 foods (Bob   2019-09-16 07:00:01 -0700 2) lettuce
34eb5bd1 foods (Carol 2019-09-16 07:00:02 -0700 3) peas

git blame dairy

^e7a114d foods (Alice 2019-09-16 07:00:00 -0700 1) cheese
86348be4 foods (Bob   2019-09-16 07:00:01 -0700 2) eggs
34eb5bd1 foods (Carol 2019-09-16 07:00:02 -0700 3) milk
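The rename-then-trim step is the same for every piece, so it can be wrapped in a small helper. The following is only a sketch: split_piece is a hypothetical function name, the pieces are selected with grep patterns rather than by hand-editing, and the throwaway repo and Greg's identity are assumptions made for the example.

```shell
# split_piece SRC DEST PATTERN [BRANCH]
# Hypothetical helper: renames SRC to DEST in one commit, then trims DEST to
# the lines matching the extended regex PATTERN in a second commit. If BRANCH
# is given, the work happens on a new branch and the previous branch is
# checked out again afterward.
split_piece() {
    src=$1 dest=$2 pattern=$3 branch=${4:-}
    if [ -n "$branch" ]; then git checkout -q -b "$branch" || return 1; fi
    git mv "$src" "$dest" &&
    git commit -q -m "split $src to $dest" &&
    grep -E "$pattern" "$dest" > "$dest.tmp" &&
    mv "$dest.tmp" "$dest" &&
    git commit -q -am "split $src to $dest" || return 1
    if [ -n "$branch" ]; then git checkout -q - || return 1; fi
}

# Throwaway repo; the path and the identity are placeholders for this sketch.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Greg"
git config user.email "greg@example.invalid"

# Rebuild the three-author foods file from the setup above.
printf 'apple\ncelery\ncheese\n' > foods
git add foods
git commit -q --author="Alice <alice>" -m created
printf 'eggs\ngrape\nlettuce\n' >> foods
git commit -q --author="Bob <bob>" -am middle
printf 'milk\norange\npeas\n' >> foods
git commit -q --author="Carol <carol>" -am last

# Two pieces on side branches, the last one directly on the original branch.
split_piece foods fruits  '^(apple|grape|orange)$' fruits
split_piece foods veggies '^(celery|lettuce|peas)$' veggies
split_piece foods dairy   '^(cheese|eggs|milk)$'
git merge -q --no-edit fruits veggies

# Tally the lines that blame attributes to Greg across all three files.
greg_lines=0
for f in fruits veggies dairy; do
    n=$(git blame "$f" | grep -c Greg)
    greg_lines=$((greg_lines + n))
done
echo "lines blamed on Greg after the merge: $greg_lines"
```

Selecting lines with grep keeps the sketch short; for a real file you would more likely edit each piece by hand between the two commits, as in the walkthrough above.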

 


Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

8 comments


  • Alexey Badalov

    I wonder why git doesn’t just do the right thing: recognize that these new files are all excerpts from an existing one and copy the histories. Is there a reason this feature should not exist, or is it just that no one has taken the time to implement it yet?

    • Gunnar Dalsnes

      Or a specific command to tell git what has happened: git mvr (record move) that takes one source and multiple dests (or vice versa)

      • Raymond Chen (Microsoft employee, Author)

        But where would it record that information? The only place git records history information is in the commit graph, and all it can do is compare trees. You’d have to invent some auxiliary “bonus history” database to keep track of this information.

      • Gunnar Dalsnes

        I was thinking it could do the same tricks as you did, but under the hood. Not sure if that would fit with the staging/commit model (don’t know git that well). Maybe it could work like a macro or something. Doing it manually seems like a pita though :-)

      • Neil Rashbrook

        By comparison, the Mercurial VCS does have some sort of bonus history database which can be used to record sources of copies. In this case, you can do all of the above work in a single commit:

        hg cp foods fruits

        hg cp foods veggies

        hg mv foods dairy

        (edit files)

        hg commit

        (On the other hand, there will be tasks which are harder in Mercurial.)

    • Raymond Chen (Microsoft employee, Author)

      That would be very computationally expensive. “Hey, here’s a new file. Let me see if any of its lines came from existing files in the repo.” But maybe you can come up with a computationally cheap way of doing it. Git is open source, so feel free to submit a PR.

      • Chris DaMour

        But git IS doing this automatically, with the computationally expensive thing. AFAIK git doesn't track renames; it detects them when you run git log. See https://git-scm.com/docs/git-diff#Documentation/git-diff.txt--Mltngt This split is just tricking git into detecting 2 renames instead of the normal one, because the merge commit has two parent commits and the new files are BOTH being related to the original file within their respective parent commits.

        So IMO git is automatically...

      • Raymond Chen (Microsoft employee, Author)

        Yes, it does the computationally expensive thing to identify renames, but not for copies. The expense is reduced by searching only for matches between deletes and creates. As you noted, you can ask it to find copies, but that won’t detect the case where lines are moved from one file to another, since neither is a copy of the other.