The Resource Compiler defaults to CP_ACP, even in the face of subtle hints that the file is UTF-8

The Resource Compiler assumes that its input is in the ANSI code page of the system that is doing the compiling. This may not be the same as the ANSI code page of the system that the .rc was authored on, nor may it be the same as the ANSI code page of the system that will consume the resulting resources.

It also completely ignores any clues in the file itself.

The saga begins in 1981.

At this time, code pages roamed the earth. There was no way to know what encoding to use for a file; you just assumed it was the ambient code page for the system that opened the file and hoped for the best.

This is the world the Resource Compiler was born into.

STRINGTABLE BEGIN
IDS_MYSTRING "Hello, world."
END

Some years later, Unicode was invented, and the Resource Compiler let you indicate that you wanted a Unicode string by using the L prefix.

STRINGTABLE BEGIN
IDS_MYSTRING L"Hello, world."
END

In the above case, the L didn’t have any effect since the string is limited to 7-bit ASCII. But let’s say that you used a fancy apostrophe in the Windows-1252 code page.

STRINGTABLE BEGIN
IDS_MYSTRING L"What’s up?"
END

There are two things to note. First is that you need to put the L prefix on the string to get it to be interpreted as Unicode. And second, the apostrophe is encoded as the single byte 92h because the file is in the Windows-1252 code page.

Now, it’s possible that the system doing the compiling isn’t using Windows-1252 as its default code page. For example, you might author the files in Windows-1252 because your main office is in Redmond, Washington, but you then send the file to your Japanese office, and their code page is 932. The byte sequence 92h 73h means “apostrophe, small Latin letter s” in the Windows-1252 code page, but in code page 932, that byte sequence represents the character 痴. When the Japanese office compiles your resource script, they get What痴 up?. This is already embarrassing enough, but it’s compounded by the fact that the character 痴 means gonorrhea.
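The mismatch is easy to reproduce outside the Resource Compiler. Here is a minimal Python sketch that decodes the same bytes under both code pages. It assumes that Python's built-in `cp1252` and `cp932` codecs agree with the Windows code pages for these particular bytes; the variable names are just for illustration.

```python
# The string as authored, with the fancy apostrophe (U+2019).
authored = "What\u2019s up?"

# The bytes that end up in a file saved as Windows-1252.
# The apostrophe becomes the single byte 92h, followed by 73h for "s".
raw = authored.encode("cp1252")

# How a system whose ANSI code page is 1252 reads those bytes.
as_1252 = raw.decode("cp1252")   # "What’s up?"

# How a system whose ANSI code page is 932 reads the very same bytes.
# 92h is a lead byte there, so it swallows the following 73h.
as_932 = raw.decode("cp932")     # "What痴 up?"
```

Nothing in the bytes themselves says which reading is the intended one. That information has to come from somewhere else.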

To avoid this problem, the Resource Compiler lets you declare the code page in which the subsequent lines should be interpreted. This removes any dependency on the execution environment of the compiler.

#pragma code_page(1252)

STRINGTABLE BEGIN
IDS_MYSTRING L"What’s up?"
END

Some years later, UTF-8 was introduced. This created an interesting problem, because you might load a file as Windows-1252, but then when you save it, your text editor “helpfully” converts it to UTF-8. This change often goes undetected because file comparison tools will frequently “helpfully” normalize the two files into a common encoding before comparing them, thereby hiding the encoding change.

And then you get a bug that says “Garbage characters in message. Message is supposed to say What’s up?, but instead it says Whatâ€™s up?.”

What happened is that the byte 92h in Windows-1252 was re-encoded into UTF-8 as the bytes E2h 80h 99h. Those bytes then were interpreted by the compiler as Windows-1252, resulting in â€™. The presence of a UTF-8 BOM at the start of the file was a subtle hint that the file was really UTF-8 encoded, but computers aren’t very good at subtlety. They just follow the rules they were given, and that rule is “Interpret the bytes in the system ANSI code page unless given explicit instructions to the contrary.”
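The round trip can be sketched in a few lines of Python. This only illustrates the byte-level arithmetic using Python's own codecs; it makes no claim about how the compiler is implemented, and `has_utf8_bom` is a hypothetical helper name.

```python
import codecs

apostrophe = "\u2019"  # the fancy apostrophe

# One byte in Windows-1252, three bytes in UTF-8.
in_1252 = apostrophe.encode("cp1252")  # b'\x92'
in_utf8 = apostrophe.encode("utf-8")   # b'\xe2\x80\x99'

# Reading the UTF-8 bytes as if they were Windows-1252 turns
# one character into three.
misread = in_utf8.decode("cp1252")     # 'â€™'

def has_utf8_bom(data: bytes) -> bool:
    """Report whether data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)
```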

The fix is to give explicit instructions to the contrary. Put this at the top of the file:

#pragma code_page(65001) // UTF-8

Now save the file in UTF-8.

Now you’re all set. Text editors nowadays will happily “help” you out by silently converting to UTF-8, but I don’t know of any that silently convert to Windows-1252.
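If you want to catch the silent conversion before it turns into a bug report, a build-time check is one option. The following Python sketch is a heuristic of my own, not a feature of the Resource Compiler: `looks_like_undeclared_utf8` and `first_non_ascii` are hypothetical names, and the whitespace tolerance in the pattern is an assumption rather than a statement of what the compiler accepts.

```python
import re

# Assumed spelling of the directive; the optional whitespace is a guess
# at leniency, not documented compiler behavior.
UTF8_PRAGMA = re.compile(
    rb"^[ \t]*#[ \t]*pragma[ \t]+code_page[ \t]*\([ \t]*65001[ \t]*\)",
    re.MULTILINE,
)

def first_non_ascii(data: bytes) -> int:
    """Return the offset of the first byte >= 80h, or -1 if there is none."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return -1

def looks_like_undeclared_utf8(data: bytes) -> bool:
    """Return True if data contains non-ASCII text that is valid UTF-8
    but no code_page(65001) pragma appears before that text."""
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]  # skip the UTF-8 BOM so it is not counted as text
    offset = first_non_ascii(data)
    if offset < 0:
        return False  # pure ASCII: nothing in the file to misread
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not UTF-8, so some other code page is in play
    # Only look for the pragma in the bytes before the first non-ASCII byte.
    return UTF8_PRAGMA.search(data, 0, offset) is None
```

You could run it over the raw bytes of each .rc file, for example `looks_like_undeclared_utf8(open("foo.rc", "rb").read())`, and fail the build when it returns True. It does not try to detect files that were deliberately saved in some other code page.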

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

10 comments

Martin Müller June 11, 2019

It seems to me that UTF-8 with a BOM confuses rc.exe?
foo.rc(8) : error RC2135 : file not found: 0
foo.rc(9) : error RC2135 : file not found: 0x17L
foo.rc(11) : error RC2135 : file not found: FILEOS
UTF-16 with BOM works, though.

James Lin June 10, 2019

I don’t know if it’s still an issue, but years ago we would unpredictably get corrupted strings when compiling our translated .rc files even though the files were saved as UTF-8 and even though we specified `pragma code_page(65001)`. We wanted to keep using UTF-8 because it played much more nicely with our source control system and code review tools, so we ended up converting them to UTF-16 at build time. =/

Marc Fauser June 10, 2019

I checked my resource files and they are all UTF16 and it’s working. They are generated by a PowerShell script, so I think it was an accident and nobody noticed it.

Akash Bagh June 7, 2019

https://jisho.org/search/痴 is ‘stupid’ not ‘gonorrhea’, didn’t you mistake it for 痳?

Azarien June 8, 2019

痳 is U+75F3, 痴 is U+75F4
