The Resource Compiler defaults to CP_ACP, even in the face of subtle hints that the file is UTF-8

Raymond Chen

Raymond

The Resource Compiler assumes that its input is in the ANSI code page of the system that is doing the compiling. This may not be the same as the ANSI code page of the system that the .rc was authored on, nor may it be the same as the ANSI code page of the system that will consume the resulting resources.

It also completely ignores any clues in the file itself.

The saga begins in 1981.

At this time, code pages roamed the earth. There was no way to know what encoding to use for a file; you just assumed it was the ambient code page for the system that opened the file and hoped for the best.

This is the world the Resource Compiler was born into.

STRINGTABLE BEGIN
IDS_MYSTRING "Hello, world."
END

Some years later, Unicode was invented, and the Resource Compiler let you indicate that you wanted a Unicode string by using the L prefix.

STRINGTABLE BEGIN
IDS_MYSTRING L"Hello, world."
END

In the above case, the L didn’t have any effect since the string itself limits itself to 7-bit ASCII. But let’s say that you used a fancy apostrophe in the Windows-1252 code page.

STRINGTABLE BEGIN
IDS_MYSTRING L"What’s up?"
END

There are two things to note. First is that you need to put the L prefix on the string to get it to be interpreted as Unicode. And second, the apostrophe is encoded as the single byte 92h because the file is in the Windows-1252 code page.

Now, it’s possible that the system doing the compiling isn’t using Windows-1252 as its default code page. For example, you might author the files in Windows-1252 because your main office is in Redmond, Washington, but you then send the file to your Japanese office, and their code page is 932. The byte sequence 92h 73h means “apostrophe, small Latin letter s” in the Windows-1252 code page, but in code page 932, that byte sequence represents the character 痴. When the Japanese office compiles your resource script, they get What痴 up?. This is already embarrassing enough, but it’s compounded by the fact that the character 痴 means gonorrhea.

To avoid this problem, the Resource Compiler lets you declare the code page in which the subsequent lines should be interpreted. This removes any dependency on the execution environment of the compiler.

#pragma code_page(1252)

STRINGTABLE BEGIN
IDS_MYSTRING L"What’s up?"
END

Some years later, UTF-8 was introduced. This created an interesting problem, because you might load a file as Windows-1252, but then when you save it, your text editor “helpfully” converts it to UTF-8. This change often goes undetected because file comparison tools will frequently “helpfully” normalize the two files into a common encoding before comparing them, thereby hiding the encoding change.

And then you get a bug that says “Garbage characters in message. Message is supposed to say What’s up?, but instead it says What’s up?.”

What happened is that the byte 92h in Windows-1252 was re-encoded into UTF-8 as the bytes E2h 80h 99h. Those bytes then were interpreted by the compiler as Windows-1252, resulting in ’. The presence of a UTF-8 BOM at the start of the file was a subtle hint that the file was really UTF-8 encoded, but computers aren’t very good at subtlety. They just follow the rules they were given, and that rule is “Interpret the bytes in the system ANSI code page unless given explicit instructions to the contrary.”

The fix is to give explicit instructions to the contrary. Put this at the top of the file:

#pragma code_page(65001) // UTF-8

Now save the file in UTF-8.

Now you’re all set. Text editors nowadays will happily “help” you out by silently converting to UTF-8, but I don’t know of any that silently convert to Windows-1252.

 

Raymond Chen
Raymond Chen

Follow Raymond   

10 Comments
Avatar
Daniel Sturm 2019-06-07 07:22:52
Ah a classical problem. Not that there's much we can do to improve that. I remember a problem where a .gitignore file got saved as utf-16 which caused it to "ignore" its contents (well it would've ignore files with lots of embedded \0 characters). Someone please invent a time machine.
Avatar
Matthew van Eerde (^_^) 2019-06-07 08:17:30
A more stable fix is to remove the dependency on the charset of the .rc file entirely by using Unicode escape sequence for characters outside of the intersection of the various code pages. STRINGTABLE BEGIN IDS_MYSTRING L"What\x2019s up?" END
Avatar
Роман Гаусс 2019-06-07 20:09:09
https://jisho.org/search/痴 is ‘stupid’ not ‘gonorrhea’, didn’t you mistake it for 痳?
Avatar
Marc Fauser 2019-06-10 01:22:53
I checked my resource files and they are all UTF16 and it's working.They are genereated by a PowerShell script, so I think it was an accident and nobody noticed it.
James Lin
James Lin 2019-06-10 08:58:22
I don't know if it's still an issue, but years ago we would unpredictably get corrupted strings when compiling our translated .rc files even though the files were saved as UTF-8 and even though we specified `pragma code_page(65001)`.  We wanted to keep using UTF-8 because it played much more nicely with our source control system and code review tools, so we ended up converting them to UTF-16 at build time. =/
Avatar
Martin Müller 2019-06-11 07:31:04
It seems to me that UTF-8 with a BOM confuses rc.exe? foo.rc(8) : error RC2135 : file not found: 0foo.rc(9) : error RC2135 : file not found: 0x17Lfoo.rc(11) : error RC2135 : file not found: FILEOS UTF-16 with BOM works, though.