How can CharUpper and CharLower guarantee that the uppercase version of a string is the same length as the lowercase version?

Raymond Chen

The CharUpper function takes a buffer of characters and converts them in place to uppercase. This requires that the uppercase version of any character occupy the same number of code units as the lowercase counterpart. However, there is nothing in the Unicode specification that appears to require this. Did Microsoft come to some sort of special under-the-table deal with the Unicode Consortium to ensure that this property holds for all characters?

No, there is no such special under-the-table deal, probably because there is also no such guarantee. And in fact, there are counterexamples if you look closely enough. We noted earlier that the uppercase version of the ß character for a long time was the two-character combination SS. (This got even more complicated with the adoption of the capital ẞ in 2017.) There’s also U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI “ᾀ” whose uppercase version is the two characters U+1F08 GREEK CAPITAL LETTER ALPHA WITH PSILI and U+0399 GREEK CAPITAL LETTER IOTA “ἈΙ”. But if you ask CharUpper to convert them, it leaves ß unchanged, and it converts “ᾀ” to U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI “ᾈ”.
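To see the length mismatch outside of Windows, here is a minimal Python sketch. It relies on Python's built-in `str.upper()`, which applies the full Unicode case mappings; that is an assumption chosen for illustration and is not how CharUpper behaves. The helper name `upper_lengths` is made up for this example.

```python
# Illustration only: Python's str.upper() uses full Unicode case mappings,
# so the uppercase result can be longer than the input. CharUpper does NOT
# behave this way; it keeps the string the same length.

def upper_lengths(text: str) -> tuple:
    """Return (code points before, code points after) uppercasing."""
    return len(text), len(text.upper())

# The two counterexamples mentioned in the article.
samples = {
    "sharp s": "\u00df",
    "alpha with psili and ypogegrammeni": "\u1f80",
}

for name, ch in samples.items():
    before, after = upper_lengths(ch)
    upper_points = " ".join("U+%04X" % ord(c) for c in ch.upper())
    print("%s: U+%04X -> %s (%d -> %d code points)"
          % (name, ord(ch), upper_points, before, after))
```

Both samples grow from one code point to two, which is exactly the situation an in-place conversion cannot represent.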

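The CharUpper function tries to convert the string in place, but if the uppercase and lowercase versions of a character are not the same length, then it panics and does something strange: it either leaves the character alone, as with ß, or picks a single-character substitute, as with “ᾀ” → “ᾈ”.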

The CharUpper function is a legacy function that remains for compatibility with the AnsiUpper function in 16-bit Windows, and we saw last time that the AnsiUpper function was originally hard-coded to code page 1252. Over time, Windows added support for other code pages, and they happened to enjoy the property that the uppercase and lowercase versions of a string have the same length. (Again, if you ignore the weird ß ↔ SS thing.)

Eventually, that rule broke down, but you can’t go back in time and kill CharUpper’s parents before it was born. You just have to accept that there’s this thing called CharUpper that has some baked-in assumptions that are wrong. If you give it a string that violates those assumptions, then it does what it can, but the results aren’t the best.

I would consider the CharUpper and CharLower family of functions to be deprecated. Instead, use the LCMapStringEx function with the LCMAP_UPPERCASE or LCMAP_LOWERCASE flag, as appropriate.

7 comments

  • Alex Martin

    Can the documentation be updated to indicate that these functions are deprecated? It seems like it would be very easy for a programmer to go looking for a way to case-change a string, not think about LCMapStringEx, and use CharUpper/CharLower, leading to potential incorrect behavior.

  • Elmar

    I was wondering about something related with regard to paths and file names. We all know that file names on Windows are case insensitive (under normal circumstances). But what brand of case insensitive exactly? I didn’t find an answer in Microsoft’s documentation. What do I call if I want to test whether two Unicode strings, which might differ in case, will be considered the same when interpreted as a file name?

    • Jeffrey Tippet

      The brand of case depends on when the filesystem was created. It would be a breaking change for two files on disk to become equal today, if they weren’t considered equal yesterday. So when an NTFS (or exFAT) filesystem is created, the OS writes into the filesystem’s metadata a table of all known uppercase codepoint mappings. From then on, those are the casing tables that everyone uses when contemplating files on that volume. For example, Unicode 1.1 had U+03F3, but it wasn’t until Unicode 7 that its uppercase U+037F was added. If the volume was created by an NTFS implementation that only knows about Unicode 6, then the filesystem will treat “U+037F” and “U+03F3” as separate filenames. But if you write Unicode 7 or later into the NTFS upcase table, then the same filenames would be considered equivalent. This system would also allow you to do something reasonable with the famous example of U+0069, U+0049, and U+0130 (although I don’t know whether current versions of Windows choose to open that can of worms).
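A toy model of the per-volume idea described above, in Python. The two tables and the names `fold` and `same_name` are hypothetical and exist only for this sketch; this is not the actual on-disk format of the NTFS upcase table, just a way to show why the answer depends on which table the volume carries.

```python
# Toy model: each "volume" carries its own uppercase table, frozen at the
# time the volume was created. All names and tables here are hypothetical.

def fold(name: str, upcase: dict) -> str:
    """Map every character through the volume's table (identity if absent)."""
    return "".join(upcase.get(ch, ch) for ch in name)

def same_name(a: str, b: str, upcase: dict) -> bool:
    """Two names are equivalent on a volume if they fold to the same string."""
    return fold(a, upcase) == fold(b, upcase)

# A table written by an implementation that predates Unicode 7:
# it has no uppercase mapping for U+03F3.
OLD_VOLUME_TABLE = {"a": "A", "b": "B"}

# A table written by a newer implementation that knows U+03F3 -> U+037F.
NEW_VOLUME_TABLE = dict(OLD_VOLUME_TABLE)
NEW_VOLUME_TABLE["\u03f3"] = "\u037f"

print(same_name("\u03f3", "\u037f", OLD_VOLUME_TABLE))  # False
print(same_name("\u03f3", "\u037f", NEW_VOLUME_TABLE))  # True
```

The same pair of names compares differently depending on the table, which is why the table has to travel with the volume rather than with the operating system.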

      Your other question is: how do you determine if two filenames would be equivalent? It’s essentially impossible to predict in advance, given all the special cases (junction points, non-NTFS filesystems, goofy filesystem filter drivers, and `fsutil file SetCaseSensitiveInfo c:\some\special\path enable`). So I’d suggest just creating the two files and checking if you get an error. Since you have to handle the TOCTOU race anyway, you might as well make that the main path.
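The "just create both and see" approach could look roughly like the following Python sketch. The function name `names_collide` is made up for this example, and it assumes the directory is writable, sits on the volume you care about, and does not already contain the first name. It uses only standard-library calls (`os.open` with `O_CREAT | O_EXCL`).

```python
import os
import tempfile

def names_collide(directory: str, first: str, second: str) -> bool:
    """Create `first` exclusively, then try to create `second` exclusively.

    Returns True if the second create fails because the file system says the
    name already exists, i.e. it treats both names as the same file.
    Raises FileExistsError if `first` was already present.
    """
    path1 = os.path.join(directory, first)
    path2 = os.path.join(directory, second)
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY

    os.close(os.open(path1, flags))
    try:
        try:
            fd2 = os.open(path2, flags)
        except FileExistsError:
            return True           # the file system considers the names equal
        os.close(fd2)
        os.remove(path2)          # distinct file was created; clean it up
        return False
    finally:
        os.remove(path1)

# The answer depends on the file system (and directory settings) underneath.
with tempfile.TemporaryDirectory() as scratch:
    print(names_collide(scratch, "readme.txt", "README.TXT"))
```

Note that a temporary directory may live on a different volume than the one you are asking about, so in practice the probe would run in the target directory itself.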

      • cheong00

        Agreed. And it’s not just about Unicode uppercase and lowercase when talking about filenames. Systems with an NTFS partition created before WinXP have filenames saved in the system code page (e.g., BIG5 for Traditional Chinese systems). You can have multiple strings pointing to the same file because the filesystem driver will automatically translate them as needed.

        I ended up creating a FileInfo object from the parameter and using its .Name property to check whether the name is the same.

      • Simon Clarkstone

        > So when an NTFS (or exFAT) filesystem is created, the OS writes into the filesystem’s metadata a table of all known uppercase codepoint mappings.

        I had wondered how MS handled that and had assumed that they picked one global standard. Filesystem metadata is an ingenious choice.