You can tell the Microsoft Visual Studio compiler what encoding to use when reading source files, by means of the /source-charset compiler switch. If you don’t specify an encoding, then the compiler tries to guess:
- If the file begins with a UTF-16 BOM, then the source file is interpreted as UTF-16.
- If the file begins with a UTF-8 BOM, then the source file is interpreted as UTF-8.
- Otherwise, the source file is interpreted in the default user code page.
These settings are determined by your project configuration, which creates the possibility that they get out of sync with the actual intended file encoding.
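The guessing rules in the list above can be sketched as a small function. This is only an illustration of the fallback order, not the compiler's actual implementation. The names `GuessedEncoding` and `guess_source_encoding` are hypothetical, and treating both UTF-16 byte orders the same way is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical names for illustration; not part of any compiler API.
enum class GuessedEncoding { Utf16, Utf8, DefaultCodePage };

// Inspect the leading bytes of a file and apply the three rules in order.
inline GuessedEncoding guess_source_encoding(const std::string& bytes)
{
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    // UTF-16 BOM: FF FE (little-endian) or FE FF (big-endian).
    // Accepting both byte orders is an assumption of this sketch.
    if (bytes.size() >= 2) {
        if ((at(0) == 0xFF && at(1) == 0xFE) ||
            (at(0) == 0xFE && at(1) == 0xFF)) {
            return GuessedEncoding::Utf16;
        }
    }

    // UTF-8 BOM: EF BB BF.
    if (bytes.size() >= 3 &&
        at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF) {
        return GuessedEncoding::Utf8;
    }

    // No BOM: fall back to the default user code page.
    return GuessedEncoding::DefaultCodePage;
}
```

A file with no BOM always lands in the last branch, which is why a BOM-less UTF-8 file is read in the user's code page unless /source-charset says otherwise.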
For C and C++, you can at least assert that the compiler configuration and file encoding match what you intended. That way, if Visual Studio secretly changes the file encoding from CP 1252 to UTF-8, say,¹ you can force a compiler error to alert you that the file encoding is messed up.
// If you want to make sure it's CP 1252.
static_assert('’' == '\x92', "File encoding appears to be corrupted.");

// If you want to make sure it's UTF-8.
static_assert(L'☃' == L'\x2603', "File encoding appears to be corrupted.");
For code page 1252, I chose the character ’, which is code unit 0x92 in code page 1252 and which is frequently damaged in resource files. If the file is saved as UTF-8, the ’ is encoded as the three bytes 0xE2 0x80 0x99. If the file does not have a UTF-8 BOM (and you didn’t use the /source-charset option), the compiler interprets those bytes as ANSI characters, sees three characters between single quotes, and produces the error “too many characters in constant”.
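To see where the three bytes come from, here is a minimal sketch of UTF-8 encoding for a single code point. The helper `encode_utf8` is a hypothetical name, and the sketch assumes a valid code point (it does not reject surrogates or values above U+10FFFF).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes the caller passes a valid code point; no validation is done.
inline std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // One byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The right single quotation mark is U+2019, which falls in the three-byte range, so `encode_utf8(0x2019)` yields 0xE2 0x80 0x99. Read as an 8-bit code page, that is three separate characters inside the single quotes.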
For Unicode, I use the snowman character because it does not appear in any common 8-bit code page, so if the file is accidentally converted to ANSI, the snowman will probably turn into a question mark and trigger the static assertion failure.
¹ Visual Studio seems to really enjoy secretly converting files to UTF-8 without telling you. And as we saw last time, source control systems often don’t tell you about encoding changes, meaning that it is very easy to corrupt files by accident in a way that never shows up in a code review.
Useful article.
I wrote a streaming text I/O system a while back that worked pretty well. Someday I might dig it up, fix the places where I got sloppy, and continue the discussion then.
It’s the holidays here at the moment, and I’m missing too many things to participate actively in the discussion.
Happy New Year, to those for whom it begins today!
Other than the drudgery of doing it, are there any fundamental blocks to just intentionally forcing everything to UTF-8?
I can see why other encodings were a thing at one point, but none of the reasons I’m aware of still carry much weight.
Not all tools default to UTF-8, so you'd have to remember to put the appropriate directive at the start of every file. And some tools default to producing UTF-16LE files, so you'll have to remember to manually convert them every time you create a new project. Choosing a default different from your toolchain's default creates friction, the Microspeak jargon for which is impedance mismatch.
Yes, for the Resource Compiler the “#pragma code_page(65001)” works fine (thanks for that).
However, sometimes you want to pass localized strings via a -D switch to cl.exe, e.g.
cl -c … -DCUSTOMERNAME=”Åreskutans fjällby” …
For the least amount of pain, I use:
cl -c … -DCUSTOMERNAME=”\xC5reskutans fj\xE4llby” …
(Windows 1252 the hard way)
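In the spirit of the article's static_assert trick, the escaped macro can also be checked at compile time. This is only a sketch: the `#ifndef` fallback stands in for the command-line definition, and `customer_name_byte` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stand-in for the command-line definition; a real build would pass
// -DCUSTOMERNAME=... to cl.exe instead, as in the comment above.
#ifndef CUSTOMERNAME
#define CUSTOMERNAME "\xC5reskutans fj\xE4llby"
#endif

// 18 single-byte characters plus the terminating NUL.
static_assert(sizeof(CUSTOMERNAME) == 19,
              "CUSTOMERNAME has an unexpected length.");

// The escapes must yield the CP 1252 code units for U+00C5 and U+00E4.
static_assert(static_cast<unsigned char>(CUSTOMERNAME[0]) == 0xC5,
              "First byte of CUSTOMERNAME is not 0xC5.");
static_assert(static_cast<unsigned char>(CUSTOMERNAME[13]) == 0xE4,
              "Byte 13 of CUSTOMERNAME is not 0xE4.");

// Hypothetical accessor: return one byte of the name as an unsigned value.
inline unsigned char customer_name_byte(std::size_t index)
{
    return static_cast<unsigned char>(CUSTOMERNAME[index]);
}
```

One caveat with this approach: a hex escape absorbs every hex digit that follows it. It works here only because the letters after the escapes (r and l) are not hex digits. If the next character were a digit or a letter from a to f, the literal would need to be split, as in "\xE4" "b".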