December 30th, 2024

How various git diff viewers represent file encoding changes in pull requests

In addition to the git command line tool, there are other tools or services that let you view changes in git history. The most interesting cases are those which present changes as part of a pull request, since those are changes you are reviewing and approving. But a common problem is that what they show you might not be what actually changed.

I’ll limit my discussion to services and tools I have experience with, which means that it’s the git command line, Azure DevOps, GitHub, and Visual Studio. You are welcome to share details for other services that you use, particularly those used for code reviews.

First, let’s consider a commit that changes the encoding of a file. For concreteness, let’s say that the file is this:

I just checked.
It costs A31.

where A3 represents a single byte with hex value 0xA3. This is the representation of £ in the Windows 1252 code page.

Suppose you change the encoding of this file to UTF-8:

It costs C2A31.
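To see those byte values for yourself, here is a minimal Python sketch. Python and the variable names are choices made for illustration only; nothing in the scenario depends on them.

```python
# Encode the same line of text under both encodings and compare the bytes.
line = "It costs £1."

cp1252_bytes = line.encode("cp1252")  # Windows 1252: £ is the single byte A3
utf8_bytes = line.encode("utf-8")     # UTF-8: £ is the two bytes C2 A3

# Print each version as space-separated hex so the difference is visible.
print(cp1252_bytes.hex(" "))
print(utf8_bytes.hex(" "))
```

The text is identical in both cases; only the bytes on disk differ, and the UTF-8 version is one byte longer.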

If you view this in the command line with git show you get

  I just checked.

- It costs <A3>1.
+ It costs £1.

The command line version shows you that there used to be a byte 0xA3 but now there is a £ character.

Next up is GitHub. Its diff says

  I just checked.
- It costs �1.
+ It costs £1.

GitHub assumes that all files are in UTF-8, so it interprets the A3 as an illegal UTF-8 code unit sequence and represents it with U+FFFD REPLACEMENT CHARACTER.
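You can reproduce the same effect with any decoder that assumes UTF-8. The following Python sketch is only an imitation of that behavior, not GitHub's actual implementation, and the variable names are hypothetical.

```python
old_bytes = b"It costs \xa31."  # the Windows 1252 version of the line

# A3 is a UTF-8 continuation byte with no lead byte before it,
# so strict UTF-8 decoding rejects it.
try:
    old_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes U+FFFD REPLACEMENT CHARACTER instead.
shown = old_bytes.decode("utf-8", errors="replace")
print(strict_ok, shown)
```

The replacement character at least tells the reviewer that the old bytes were not valid UTF-8, even though it cannot say what they were meant to be.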

Next up is Team Foundation Services, no wait, Visual Studio Online, no wait, Visual Studio Team Services, no wait, Azure DevOps. Azure DevOps. That’s the name. Azure DevOps.

Here’s what Azure DevOps shows:

⚠ The file differs only in whitespace.

And if you expand the file and enable “Show whitespace changes”, it shows you no changes, not even whitespace changes!

I just checked.
It costs £1.

This is quite concerning, because it means that if you made a change to the text of a file and also changed the encoding, Azure DevOps highlights the text changes, but does not give any indication that the encoding changed!

For example, maybe somebody changed the first line of text and accidentally changed the encoding from 1252 to UTF-8. Azure DevOps shows this as

- I just checked.
+ I just looked.

It costs £1.

It happily shows you the text change, but completely ignores the encoding change.

That encoding change might have caused you to inadvertently change a bunch of strings in a Resource Script, resulting in mojibake.
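Here is a small Python sketch of how that mojibake arises, assuming that whatever consumes the file still reads it as Windows 1252. That assumption, and the variable names, are for illustration only.

```python
# The file was re-saved as UTF-8...
utf8_bytes = "It costs £1.".encode("utf-8")

# ...but a consumer that still assumes Windows 1252 decodes each byte
# separately: C2 becomes Â and A3 becomes £.
misread = utf8_bytes.decode("cp1252")
print(misread)  # It costs Â£1.
```

The string in the source file looks unchanged in an editor that understands UTF-8, yet the consumer sees an extra character.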

If you ask Visual Studio to view the diff, it indicates that the file has been modified (M), but when you ask to see the diff, it says “0 changes”, and nothing is highlighted.

Now let’s consider a commit that inserted a UTF-8 BOM at the start of a file.

From the command line with git, you get this:

- I just checked.
+  I just checked.

The BOM displays as a space. Not great, but at least there is a +/− to show you that something changed, and if the first line is not otherwise blank, the shifted contents tell you that something got inserted at the start of the file.
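For reference, this is what inserting a UTF-8 BOM does at the byte level. It is a minimal Python sketch using the standard `codecs` module; the variable names are hypothetical.

```python
import codecs

original = "I just checked.\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + original  # the three bytes EF BB BF are prepended

# Decoding as plain "utf-8" keeps the BOM as an invisible U+FEFF character.
decoded = with_bom.decode("utf-8")

# Decoding as "utf-8-sig" strips a leading BOM if one is present.
stripped = with_bom.decode("utf-8-sig")

print(with_bom[:3].hex(" "), repr(decoded[0]), repr(stripped))
```

The two decoded strings look the same on screen, but they are not equal, which is why a viewer can flag the line as changed without showing any visible difference.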

For GitHub, the diff shows up like this:

- I just checked.
+ I just checked.

The highlights tell you that something changed on that line, but squint all you want, you don’t see any change. The change must be invisible, but at least you’re told that there’s a change somewhere on that line; you just can’t see it.

And finally, we have Azure DevOps:

⚠ The file differs only in whitespace.

As before, even if you expand the file and enable “Show whitespace changes”, you get no changes.

I just checked.
It costs £1.

So Azure DevOps tells you that the file changed in whitespace, but when you ask to see it, you are shown no changes.

If you ask Visual Studio to view the diff, it once again indicates that the file has been modified (M), but when you ask to see the diff, it says “0 changes”, and nothing is highlighted.

I suspect that in the cases where GitHub, Azure DevOps, or Visual Studio show no visible changes, most users will just conclude, “Must be a bug,” and not realize that no really, there’s a change in there that you can’t see.

So let’s summarize these results in a table.

                  git command line  GitHub         Azure DevOps     Visual Studio
Code page         UTF-8             UTF-8          Guess            Guess
Encoding changes  Shown in diff     Shown in diff  No change shown  No change shown
BOM change        Shown as space    Invisible      No change shown  No change shown

My take-away from this table is that if you do your work with any of these systems, you need to pay close attention when dealing with files that contain characters outside the 7-bit ASCII set. Changes to the encoding or to the presence of a BOM can be hard to spot, or even outright invisible, even though they drastically change what the contents of the file mean.
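If you don't trust the diff viewer, one option is to compare the raw bytes of the two versions yourself. The Python sketch below is a rough heuristic under stated assumptions, not a real encoding detector: `classify_encoding` and `encoding_change` are hypothetical helper names, and anything that fails to decode as UTF-8 is simply labeled as not UTF-8 rather than identified.

```python
import codecs
from typing import Optional, Tuple


def classify_encoding(data: bytes) -> str:
    """Give a rough label for a file's bytes (heuristic only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.isascii():
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "not utf-8 (possibly a legacy code page)"


def encoding_change(before: bytes, after: bytes) -> Optional[Tuple[str, str]]:
    """Return (old, new) labels if they differ, or None if they match."""
    old, new = classify_encoding(before), classify_encoding(after)
    return None if old == new else (old, new)


# The two scenarios from this article:
print(encoding_change(b"It costs \xa31.", b"It costs \xc2\xa31."))
print(encoding_change(b"I just checked.", codecs.BOM_UTF8 + b"I just checked."))
```

The byte strings could come from something like `git show <commit>:<path>` for the old and new versions of a file. How you obtain them is up to you.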

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

3 comments

  • Kevin Norris 4 days ago

    IMHO, the proper solution is not to be afraid of non-ASCII characters (we should not be treating non-English text as second class), but to fix an encoding and mandate that all text is in that encoding.

    It then immediately follows that the only good candidate is UTF-8 without BOM, for the following reasons:

    * UTF-8 is maximally compatible with standard Git tools.
    * Tooling that needs the BOM is broken - such tooling is obviously encoding-aware, so...

  • Chris Warrick

    Azure DevOps sometimes shows the encoding change by listing the old and new encodings in the header above the file contents.

  • Adam Rosenfield 5 days ago

    Another edge case where diff tools frequently stumble is the "newline at end of file" getting added or deleted. Sometimes they'll show "\newline at end of file" on the left or right side of the diff, sometimes they'll show an unspecified whitespace change, or sometimes they won't show anything at all.

    Whenever I need to really confirm the exact contents of the before or after file, I'll run it through `git show commit:filepath | hexdump...
