Maurice Kayser reported an issue with Win32 API loading of PE resources containing lowercase letters. Maurice did some experiments adding resources named MyIcon, MyIcÖn, and MyIcön, then trying to load them using various names, and built up a table of results. I’ve broken it up into three tables depending on the nature of the accented character.
Arg | Can load MyIcon | Can load MyIcÖn | Can load MyIcön |
---|---|---|---|
myicon | Yes | No | No |
MyIcon | Yes | No | No |
mYiCoN | Yes | No | No |
MYICON | Yes | No | No |
This table shouldn’t be surprising. The argument passed to FindResource is compared case-insensitively with the name of the resource, treating accented characters as different from their unaccented versions.
Here’s the next batch.
Arg | Can load MyIcon | Can load MyIcÖn | Can load MyIcön |
---|---|---|---|
myicÖn | No | Yes | No (!) |
MyIcÖn | No | Yes | No (!) |
mYiCÖN | No | Yes | No (!) |
MYICÖN | No | Yes | No (!) |
The first column is consistent with our previous result, namely that unaccented characters are treated as not the same as accented characters.
The second column is not surprising either, since the strings do match according to a case-insensitive comparison.
The third column is surprising. It seems that accented characters are case-sensitive, even though the documentation says that the comparison is case-insensitive.
Okay, here’s the third block.
Arg | Can load MyIcon | Can load MyIcÖn | Can load MyIcön |
---|---|---|---|
myicön | No | Yes (?) | No (!) |
MyIcön | No | Yes (?) | No (!) |
mYiCöN | No | Yes (?) | No (!) |
MYICöN | No | Yes (?) | No (!) |
The PE specification says that the resources are sorted “in ascending order”, and the names are sorted “by case-sensitive string.”¹
That’s all it says. The rest is left to interpretation.
First of all, even though the file format specification says that the resource names can be in any case, the FindResource function converts all names to uppercase before searching, so any names with lowercase characters are effectively unfindable. Fortunately, the Resource Compiler also converts names to uppercase before storing them in the resources, so it all cancels out, right?
Well, it cancels out only if the Resource Compiler and the FindResource function agree on how the names are converted to uppercase.
The Resource Compiler uses _wcsupr to convert the names to uppercase, and _wcsupr uses the default C locale,² which, as we noted before, is not a very interesting locale. It converts Latin unaccented lowercase letters a-z to Latin unaccented uppercase letters A-Z, and that’s all.
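To make that concrete, here is a small Python sketch of what C-locale uppercasing does to the three resource names. The helper `c_locale_upper` is a hypothetical model written for illustration; it is not the Resource Compiler's code and does not call _wcsupr.

```python
def c_locale_upper(name: str) -> str:
    """Model of C-locale uppercasing: map ASCII a-z to A-Z, leave the rest alone."""
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch
        for ch in name
    )

# The accented letters pass through unchanged, whatever their case.
for name in ("MyIcon", "MyIcÖn", "MyIcön"):
    print(f"{name} is stored as {c_locale_upper(name)}")
```

Under this model, the lowercase ö survives into the stored name, which is the detail that matters for the rest of the story.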
Let’s update the top row of the table by converting the names to uppercase according to the C locale.
Arg | Can load MYICON | Can load MYICÖN | Can load MYICöN |
---|---|---|---|
How does the FindResource function convert strings to uppercase? It uses the uppercase table corresponding to the system default language. It is almost certain that Ö and ö are uppercase and lowercase partners in the system default language. That means that the left columns are all effectively MYICON in the first table, and that they are all effectively MYICÖN in the second and third tables.
With these adjustments, the tables make more sense.
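Arg | Loaded as | Can load MyIcon (stored as MYICON) | Can load MyIcÖn (stored as MYICÖN) | Can load MyIcön (stored as MYICöN) |
---|---|---|---|---|
myicon | MYICON | Yes | No | No |
MyIcon | MYICON | Yes | No | No |
mYiCoN | MYICON | Yes | No | No |
MYICON | MYICON | Yes | No | No |
myicÖn | MYICÖN | No | Yes | No |
MyIcÖn | MYICÖN | No | Yes | No |
mYiCÖN | MYICÖN | No | Yes | No |
MYICÖN | MYICÖN | No | Yes | No |
myicön | MYICÖN | No | Yes | No |
MyIcön | MYICÖN | No | Yes | No |
mYiCöN | MYICÖN | No | Yes | No |
MYICöN | MYICÖN | No | Yes | No |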
Okay, so after we have accounted for how the Resource Compiler stores names and how FindResource searches for names, the table looks less bonkers.
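The whole mismatch can be simulated in a few lines of Python. This is only a sketch under stated assumptions: `stored_name`, `lookup_name`, and `can_load` are hypothetical helpers, and Python's `str.upper()` stands in for the system default language's uppercase table, which is assumed to pair ö with Ö.

```python
def stored_name(name: str) -> str:
    """Hypothetical model of the Resource Compiler: uppercase ASCII a-z only."""
    return "".join(chr(ord(ch) - 32) if "a" <= ch <= "z" else ch for ch in name)

def lookup_name(arg: str) -> str:
    """Hypothetical model of FindResource: accented letters are uppercased too.
    str.upper() is an assumed stand-in for the system's uppercase table."""
    return arg.upper()

def can_load(arg: str, resource: str) -> bool:
    """The lookup succeeds only if the two uppercased forms match exactly."""
    return lookup_name(arg) == stored_name(resource)

resources = ("MyIcon", "MyIcÖn", "MyIcön")
for arg in ("myicon", "MYICON", "myicÖn", "MYICÖN", "myicön", "MYICöN"):
    results = ["Yes" if can_load(arg, r) else "No" for r in resources]
    print(f"{arg} -> {lookup_name(arg)}: {results}")
```

In this model, nothing can ever load MyIcön: it is stored with a lowercase ö, but every search string has its ö uppercased before the comparison.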
The moral of the story, I think, is that you should just stick to ASCII characters for resource names. Everybody agrees on that subset.
¹ Note that the specification is incomplete: It doesn’t say what collation to use for sorting. Does it use a locale-sensitive sort, so that Ö comes before P in German, but after P in Swedish?³ Does it use a case-sensitive sort where all punctuation comes before all alphabetics? The FindResource function assumes that the resources are sorted lexicographically by code unit (not code point) numerical value. Which is a good thing, because you don’t want a file compiled on a German system to be considered corrupted by a Swedish system.
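For illustration, here is a Python sketch of sorting by UTF-16 code unit value. The helper `code_unit_key` is a made-up name, and the sample names are invented for the example. Encoding as big-endian UTF-16 works as a sort key because comparing those bytes lexicographically gives the same order as comparing the 16-bit code units numerically.

```python
def code_unit_key(name: str) -> bytes:
    """Big-endian UTF-16 bytes compare in the same order as the code units."""
    return name.encode("utf-16-be")

# Ö (U+00D6) lands after P (U+0050) no matter what any locale's collation says.
names = ["MYICÖN", "MYICON", "MYICP", "MYICöN"]
print(sorted(names, key=code_unit_key))

# Code unit order and code point order disagree outside the Basic Multilingual
# Plane: U+1F600 is encoded as the surrogate pair D83D DE00, which sorts below FF5E.
bmp, astral = "\uFF5E", "\U0001F600"
print(sorted([bmp, astral]) == [bmp, astral])                        # code point order
print(sorted([bmp, astral], key=code_unit_key) == [astral, bmp])     # code unit order
```

The result depends only on the numeric values of the code units, so a German system and a Swedish system would agree on the order.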
² But what about the #pragma code_page() directive? That directive tells the Resource Compiler how to convert quoted strings to Unicode, but it does not affect character mapping or collation.
³ In German dictionary sorting, the letter Ö is sorted as if it had no accent mark. But in German phone book sorting, the letter Ö is sorted as if it were two characters O + e. And in Austrian phone book sorting, the letter Ö is sorted as if it were two characters O + ¨, where the ¨ is treated as a character that comes after Z. And in Swedish, the letter Ö is treated as one of the three accented characters that come after Z.
Could this be, IDK, fixed?
Have FindResource search for the C locale uppercased string first, and after that fall back to the current behavior?
You (Microsoft) have the source code after all. I don’t think it would break compatibility.
Or even better: Give RC a new switch, perhaps /modern, that’d store the resource names unchanged (perhaps alongside an uppercased copy for backward compatibility), and have FindResource try to find the exact string first.
Ah, I miss the era of 2k/XP/7 when these things actually evolved.
If you want to fix it, fix it properly. Unicode specifies multiple algorithms for caseless string matching in section 3.13.5 - for something like this, you probably want an "identifier caseless match," but any of them would entirely solve the problem Raymond describes (albeit, in some cases, the non-default case folds might have backwards compatibility issues due to the use of Unicode normalization forms). None of these algorithms are locale-dependent, although the standard does vaguely gesture at Turkish dotted and dotless i as a potential source of issues.
ICU, as you might expect, implements the default case fold operation (and so...
I get where you’re coming from, but that’d be way too dangerous a change in terms of backwards compatibility. And the desire is to simply load the correct resource, even if there are several whose names differ only in case.