How TFS Version Control determines a file’s encoding

Buck Hodges

September 10th, 2005

TFS Version Control will automatically detect a file’s encoding based upon the following.

First, a file with a Unicode byte order mark (BOM) is added as that particular type (UTF-8, UTF-16 big endian, UTF-16 little endian, etc.).
If a file doesn’t have a BOM, we check for an unprintable ASCII character in the first 1 kilobyte of the file. If there is no unprintable ASCII character in there, the encoding is set to the current code page being used, which is Windows-1252 on US English Windows systems.
If an unprintable character is found, the file is treated as binary. The unprintable ASCII characters checked are 0x00 through 0x1F and 0x7F, excluding 0x9 (TAB), 0xA (LF), 0xC (FF), 0xD (CR), and 0x1A (^Z).

The only exception to the foregoing is PDF files. Those are always detected as binary because they are so common and can be all text in the first 1 kilobyte with binary streams later in the file. The detection is based on the signature, “%PDF-”, that always appears at the start of a PDF file.
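To make the order of these checks concrete, here is a minimal Python sketch of the rules described above. It is not the TFS implementation: the function name `detect_encoding`, the `code_page` parameter, and the returned labels are all made up for illustration, and only the three BOMs named above are handled (the "etc." cases are left out).

```python
import codecs

# Only the BOMs named in the post; other Unicode BOMs are omitted here.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Control characters 0x00-0x1F and 0x7F, minus TAB, LF, FF, CR, and ^Z.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D, 0x1A}
BINARY_BYTES = frozenset(set(range(0x20)) | {0x7F}) - ALLOWED_CONTROLS


def detect_encoding(data: bytes, code_page: str = "cp1252") -> str:
    """Hypothetical helper: classify file contents using the rules above."""
    # PDF exception: the signature alone makes the file binary.
    if data.startswith(b"%PDF-"):
        return "binary"
    # A BOM decides the encoding outright.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # No BOM: look for an unprintable character in the first kilobyte only.
    if any(byte in BINARY_BYTES for byte in data[:1024]):
        return "binary"
    # Otherwise assume the current code page (an assumed default here).
    return code_page
```

Note that bytes above 0x7F are not in the unprintable set, which is why a BOM-less file in another 8-bit or multibyte encoding falls through to the code page.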

So, if you take a file that is in the euc-jp encoding and add it to source control on a US English Windows system, it will be added as Windows-1252 unless you specify a different encoding with the /type parameter on the add command (e.g., “tf add /type:euc-jp file.txt”). If the file is already in source control, use the edit command’s /type option to change the encoding.

Within Visual Studio 2005, you can change a committed file’s encoding by navigating to it using Source Control Explorer (View -> Other Windows -> Source Control Explorer), right-clicking on the file, and choosing Properties. On the General tab, click on the Set Encoding button and choose the encoding or click on the Detect button and have Version Control detect the encoding using the process described above.

Because changing the encoding requires pending a change on the file, you must have the file in your workspace. Files and folders in Source Control Explorer that are in grey text (rather than black text) are either cloaked or not mapped in your workspace or the workspace does not “have” the file (the server keeps track of what files are in your workspace).

Unfortunately, TFS does not support changing the encoding of a pending add. If you need to do that, you will have to undo the pending add, and then re-add the file using the command line and specify the /type option.

[UPDATE 6/8/2012] The TFS client (Visual Studio/Team Explorer) in 2012 has changed how this is done for file merges (e.g., when resolving conflicts).

1. VS 2012 reads the entirety of the source, target, and base files during automerge. This helps in the case of a UTF-8 file without a BOM where the first non-ASCII character is after the first 1024 bytes of the file.

We detect a file that does not have a BOM as UTF-8 if:

There is at least one non-ASCII character (Unicode code point > 127).
There are no byte sequences in the file that are invalid in UTF-8. If you read http://en.wikipedia.org/wiki/Utf-8, you will see that all code points > 127 in UTF-8 must be represented as multibyte sequences following a very specific pattern. It would be unlikely for a meaningful non-UTF-8 file with non-ASCII characters to meet these criteria.
If one of the above criteria is not met, we use the “fallback” encoding.
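The two criteria can be sketched in a few lines of Python. This is only an approximation of the heuristic: `looks_like_utf8` is a hypothetical name, and Python's strict UTF-8 decoder may accept or reject edge cases differently than the TFS client does.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical check for a BOM-less file: True only if both criteria hold."""
    # Criterion 1: at least one non-ASCII byte (value > 127) somewhere in the file.
    if not any(byte > 0x7F for byte in data):
        return False
    # Criterion 2: the whole file must decode as valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For example, the Windows-1252 bytes for "héllo" contain a lone 0xE9 that is not followed by continuation bytes, so the file fails the second criterion, while a pure-ASCII file fails the first.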

2. In VS 2012, the fallback encoding is the server encoding. This allows you to override our heuristic in the scenario where:

you have a file that you would like to be encoded as UTF-8 without a BOM
but it does not have any non-ASCII characters yet.

In previous versions we would fall back to the default system code page (e.g. Windows-1252). In VS 2012, if you set the server encoding to UTF-8, automerge will use UTF-8 instead.
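As a rough illustration of how the fallback changes the outcome, the sketch below models only the choice described here. The function names, the `vs2012` flag, and the encoding labels are invented for the example and do not correspond to any TFS API.

```python
def fallback_encoding(server_encoding: str, system_code_page: str, vs2012: bool) -> str:
    """Hypothetical: VS 2012 falls back to the server encoding, earlier versions to the code page."""
    return server_encoding if vs2012 else system_code_page


def automerge_encoding(data: bytes, fallback: str) -> str:
    """Hypothetical: encoding chosen for a BOM-less file during automerge."""
    if not data.isascii():
        try:
            data.decode("utf-8")
            return "utf-8"  # non-ASCII content that is valid UTF-8
        except UnicodeDecodeError:
            pass  # invalid UTF-8, so use the fallback
    return fallback
```

With the server encoding set to UTF-8, a file that is still pure ASCII comes out as UTF-8 under the VS 2012 rule, whereas the older rule would give the system code page.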