August 4th, 2020

How can CharUpper and CharLower guarantee that the uppercase version of a string is the same length as the lowercase version?

The CharUpper function takes a buffer of characters and converts them in place to uppercase. This requires that the uppercase version of any character occupy the same number of code units as the lowercase counterpart. However, there is nothing in the Unicode specification that appears to require this. Did Microsoft come to some sort of special under-the-table deal with the Unicode Consortium to ensure that this property holds for all characters?

No, there is no such special under-the-table deal, probably because there is also no such guarantee. And in fact, there are counterexamples if you look closely enough. We noted earlier that the uppercase version of the ß character for a long time was the two-character combination SS. (This got even more complicated with the adoption of the capital ẞ in 2017.) There’s also U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI “ᾀ” whose uppercase version is the two characters U+1F08 GREEK CAPITAL LETTER ALPHA WITH PSILI and U+0399 GREEK CAPITAL LETTER IOTA “ἈΙ”. But if you ask CharUpper to convert them, it leaves ß unchanged, and it converts “ᾀ” to U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI “ᾈ”.
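If you want to see the length change for yourself, here is a small Python sketch. It leans on the fact that Python's str.upper() applies the full Unicode case mappings, which are allowed to produce more than one character, rather than the one-to-one mappings that an in-place conversion is stuck with. The helper names utf16_units and upper_growth are made up for this illustration; they are not part of any Windows API.

```python
def utf16_units(s):
    """Return the number of UTF-16 code units needed to store s."""
    return len(s.encode("utf-16-le")) // 2


def upper_growth(s):
    """Return (uppercased string, code units before, code units after)."""
    upper = s.upper()  # full Unicode case mapping; may change the length
    return upper, utf16_units(s), utf16_units(upper)


if __name__ == "__main__":
    # "a" is the boring case; the other two are the counterexamples above.
    for ch in ("a", "\u00df", "\u1f80"):
        upper, before, after = upper_growth(ch)
        code_points = " ".join(f"U+{ord(c):04X}" for c in upper)
        print(f"U+{ord(ch):04X}: {before} -> {after} code unit(s), {code_points}")
```

Both ß and U+1F80 go from one code unit to two, which is exactly the case that a fixed-size in-place conversion has no room for.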

The CharUpper function tries to convert the string in place, but if the uppercase and lowercase versions of a character are not the same length, then it panics and does something strange.

The CharUpper function is a legacy function that remains for compatibility with the AnsiUpper function in 16-bit Windows, and we saw last time that the AnsiUpper function was originally hard-coded to code page 1252. Over time, Windows added support for other code pages, and they happened to have enjoyed the property that the uppercase and lowercase versions of a string have the same length. (Again, if you ignore the weird ß ↔ SS thing.)

Eventually, that rule broke down, but you can’t go back in time and kill CharUpper’s parents before it was born. You just have to accept that there’s this thing called CharUpper that has some baked-in assumptions that are wrong. If you give it a string that violates those assumptions, then it does what it can, but the results aren’t the best.

I would consider the CharUpper and CharLower family of functions to be deprecated. Instead, use the LCMapStringEx function with the LCMAP_UPPERCASE or LCMAP_LOWERCASE flag, as appropriate.
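For the curious, here is a rough sketch of what calling LCMapStringEx looks like, written in Python with ctypes so that it stays short. The thing to notice is that the output goes into a separate buffer whose size the function reports first, so the result is free to be a different length from the input. The wrapper name map_case and its parameters are my own invention, and passing None as the locale name (the user default locale) is just one possible choice. I make no claim about what any particular character maps to, since that depends on the flags and the locale.

```python
import ctypes
import sys

# Flag values from the Windows SDK headers.
LCMAP_LOWERCASE = 0x00000100
LCMAP_UPPERCASE = 0x00000200


def map_case(text, flags, locale_name=None):
    """Hypothetical wrapper: case-map text into a freshly sized buffer."""
    if not text:
        return ""  # the API rejects a zero-length source string
    if sys.platform != "win32":
        raise OSError("LCMapStringEx is only available on Windows")

    fn = ctypes.WinDLL("kernel32", use_last_error=True).LCMapStringEx
    fn.argtypes = [
        ctypes.c_wchar_p, ctypes.c_uint32,   # lpLocaleName, dwMapFlags
        ctypes.c_wchar_p, ctypes.c_int,      # lpSrcStr, cchSrc
        ctypes.c_wchar_p, ctypes.c_int,      # lpDestStr, cchDest
        ctypes.c_void_p, ctypes.c_void_p,    # lpVersionInformation, lpReserved
        ctypes.c_ssize_t,                    # sortHandle
    ]
    fn.restype = ctypes.c_int

    src = ctypes.create_unicode_buffer(text)
    src_len = len(src) - 1  # UTF-16 code units, excluding the terminator

    # First call: ask how many code units the result needs.
    needed = fn(locale_name, flags, src, src_len, None, 0, None, None, 0)
    if needed == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    # Second call: convert into a buffer of exactly that size.
    dest = ctypes.create_unicode_buffer(needed)
    written = fn(locale_name, flags, src, src_len, dest, needed, None, None, 0)
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return dest[:written]
```

On Windows, map_case("hello", LCMAP_UPPERCASE) would be the typical call. On any other platform the sketch simply raises OSError.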


Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

7 comments


  • Piotr Siódmak

    What does LC stand for? Long Character? LoCale?

    • Alex Martin

      I would guess locale, as it would align with the abbreviation used in other systems (the Unix locale environment variables are prefixed with LC_).

  • Elmar

    I was wondering about something related with regard to paths and file names. We all know that file names on Windows are case insensitive (under normal circumstances). But what brand of case insensitive exactly? I didn’t find an answer in Microsoft’s documentation. What do I call if I want to test whether two Unicode strings, which might differ in case, will be considered the same when interpreted as a file name?

    • Jeffrey Tippet

      The brand of case depends on when the filesystem was created. It would be a breaking change for two files on disk to become equal today, if they weren't considered equal yesterday. So when an NTFS (or exFAT) filesystem is created, the OS writes into the filesystem's metadata a table of all known uppercase codepoint mappings. From then on, those are the casing tables that everyone uses when contemplating files on that volume. For example, Unicode 1.1 had U+03F3, but it wasn't until Unicode 7 that its uppercase U+037F was added. If the volume was...

      • Simon Clarkstone

        > So when an NTFS (or exFAT) filesystem is created, the OS writes into the filesystem’s metadata a table of all known uppercase codepoint mappings.

        I had wondered how MS handled that and had assumed that they picked one global standard. Filesystem metadata is an ingenious choice.

      • cheong00

        Agreed. And it’s not just about Unicode uppercase and lowercase when talking about filenames. Systems with NTFS partitions created before WinXP have filenames saved in the system code page (e.g., BIG5 for Traditional Chinese systems). You can have multiple strings pointing to the same file because the filesystem driver will automatically translate them as needed.

        I ended up creating a FileInfo object from the parameter and using its .Name property to check whether it’s the same.

  • Alex Martin

    Can the documentation be updated to indicate that these functions are deprecated? It seems like it would be very easy for a programmer to go looking for a way to case-change a string, not think about LCMapStringEx, and use CharUpper/CharLower, leading to potential incorrect behavior.