The Resource Compiler defaults to CP_ACP, even in the face of subtle hints that the file is UTF-8

Raymond Chen

Raymond

The Resource Compiler assumes that its input is in the ANSI code page of the system that is doing the compiling. This may not be the same as the ANSI code page of the system that the .rc was authored on, nor may it be the same as the ANSI code page of the system that will consume the resulting resources.

It also completely ignores any clues in the file itself.

The saga begins in 1981.

At this time, code pages roamed the earth. There was no way to know what encoding to use for a file; you just assumed it was the ambient code page for the system that opened the file and hoped for the best.

This is the world the Resource Compiler was born into.

STRINGTABLE BEGIN
IDS_MYSTRING "Hello, world."
END

Some years later, Unicode was invented, and the Resource Compiler let you indicate that you wanted a Unicode string by using the L prefix.

STRINGTABLE BEGIN
IDS_MYSTRING L"Hello, world."
END

In the above case, the L didn’t have any effect since the string itself limits itself to 7-bit ASCII. But let’s say that you used a fancy apostrophe in the Windows-1252 code page.

STRINGTABLE BEGIN
IDS_MYSTRING L"What’s up?"
END

There are two things to note. First is that you need to put the L prefix on the string to get it to be interpreted as Unicode. And second, the apostrophe is encoded as the single byte 92h because the file is in the Windows-1252 code page.

Now, it’s possible that the system doing the compiling isn’t using Windows-1252 as its default code page. For example, you might author the files in Windows-1252 because your main office is in Redmond, Washington, but you then send the file to your Japanese office, and their code page is 932. The byte sequence 92h 73h means “apostrophe, small Latin letter s” in the Windows-1252 code page, but in code page 932, that byte sequence represents the character 痴. When the Japanese office compiles your resource script, they get What痴 up?. This is already embarrassing enough, but it’s compounded by the fact that the character 痴 means gonorrhea.

To avoid this problem, the Resource Compiler lets you declare the code page in which the subsequent lines should be interpreted. This removes any dependency on the execution environment of the compiler.

#pragma code_page(1252)

STRINGTABLE BEGIN
IDS_MYSTRING L"What’s up?"
END

Some years later, UTF-8 was introduced. This created an interesting problem, because you might load a file as Windows-1252, but then when you save it, your text editor “helpfully” converts it to UTF-8. This change often goes undetected because file comparison tools will frequently “helpfully” normalize the two files into a common encoding before comparing them, thereby hiding the encoding change.

And then you get a bug that says “Garbage characters in message. Message is supposed to say What’s up?, but instead it says What’s up?.”

What happened is that the byte 92h in Windows-1252 was re-encoded into UTF-8 as the bytes E2h 80h 99h. Those bytes then were interpreted by the compiler as Windows-1252, resulting in ’. The presence of a UTF-8 BOM at the start of the file was a subtle hint that the file was really UTF-8 encoded, but computers aren’t very good at subtlety. They just follow the rules they were given, and that rule is “Interpret the bytes in the system ANSI code page unless given explicit instructions to the contrary.”

The fix is to give explicit instructions to the contrary. Put this at the top of the file:

#pragma code_page(65001) // UTF-8

Now save the file in UTF-8.

Now you’re all set. Text editors nowadays will happily “help” you out by silently converting to UTF-8, but I don’t know of any that silently convert to Windows-1252.

 

Raymond Chen
Raymond Chen

Follow Raymond   

10 comments

Comments are closed.

  • Avatar
    Daniel Sturm

    Ah a classical problem.
    Not that there’s much we can do to improve that. I remember a problem where a .gitignore file got saved as utf-16 which caused it to “ignore” its contents (well it would’ve ignore files with lots of embedded \0 characters).
    Someone please invent a time machine.

  • Avatar
    Matthew van Eerde (^_^)

    A more stable fix is to remove the dependency on the charset of the .rc file entirely by using Unicode escape sequence for characters outside of the intersection of the various code pages.
    STRINGTABLE BEGIN
    IDS_MYSTRING L”What\x2019s up?”
    END

    • Yuri Khan
      Yuri Khan

      As an English speaker, you will probably tolerate an occasional \xC0D3 in strings.
      In pretty much every other Latin-based language, escaping letters with diacritics gets annoying really fast.
      In languages that use other scripts, such as Greek, Cyrillic, or CJK, escaping makes text unreadable.
      So no. The right way is to use UTF-8 everywhere, enforce it, and enforce the enforcement. That is, make a linter rule that detects any .rc files that do not declare themselves as UTF-8.

      • Avatar
        Matthew van Eerde (^_^)

        I like your idea of enforcing UTF8-ness a lot better than my idea of being codepage-independent 🙂

  • Avatar
    Marc Fauser

    I checked my resource files and they are all UTF16 and it’s working.They are genereated by a PowerShell script, so I think it was an accident and nobody noticed it.

  • James Lin
    James Lin

    I don’t know if it’s still an issue, but years ago we would unpredictably get corrupted strings when compiling our translated .rc files even though the files were saved as UTF-8 and even though we specified `pragma code_page(65001)`.  We wanted to keep using UTF-8 because it played much more nicely with our source control system and code review tools, so we ended up converting them to UTF-16 at build time. =/

  • Avatar
    Martin Müller

    It seems to me that UTF-8 with a BOM confuses rc.exe?
    foo.rc(8) : error RC2135 : file not found: 0foo.rc(9) : error RC2135 : file not found: 0x17Lfoo.rc(11) : error RC2135 : file not found: FILEOS
    UTF-16 with BOM works, though.