February 6th, 2025

The default C locale is not a very interesting one

Although the C and C++ languages provide facilities for localization, the default locale is the so-called “C” locale, which barely understands anything.

In the “C” locale, the uppercase characters are “A” through “Z”, the lowercase characters are “a” through “z”, the decimal separator is “.”, and there is no thousands separator.
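As a rough illustration, here is a sketch that asks the runtime about the locale a program starts in. The helper names are invented for this example; std::setlocale and std::localeconv are the standard functions doing the actual work.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: returns the name of the current C runtime locale.
std::string current_locale_name() {
    // Passing nullptr queries the locale without changing it.
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Hypothetical helper: decimal separator reported by the current locale.
std::string decimal_separator() {
    return std::localeconv()->decimal_point;
}

// Hypothetical helper: thousands separator reported by the current locale
// (an empty string means "no thousands separator").
std::string thousands_separator() {
    return std::localeconv()->thousands_sep;
}
```

In a program that has not called setlocale, these should report the “C” locale, a “.” decimal separator, and an empty thousands separator.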

The “C” locale is designed to be minimal. But it also means that unless you’ve taken special efforts to change your process’s locale to something else, functions like towupper and _wcslwr produce only extremely rudimentary results. All they know is the characters in the 7-bit ASCII set. They don’t even know that the uppercase version of ä is Ä.

Support for any locales beyond the “C” locale is implementation-defined, and the standard considers it a quality-of-implementation issue. Microsoft’s Visual C++ compiler uses BCP 47 for locale names, like sr-Cyrl-BA for “Serbian, Cyrillic script, as used in Bosnia and Herzegovina.” The GNU C library that typically accompanies gcc appears to use a POSIX-style format, such as de_AT.iso885915@euro for “German, as used in Austria, using the ISO-8859-15 character set and the Euro as the currency.”
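Since the set of accepted names varies by implementation, one way to find out whether a particular name works is simply to ask the runtime. The following is a minimal sketch, not a recommended pattern: the helper name is made up, and it temporarily changes the process-wide locale, so it assumes no other thread is using locale-sensitive functions at the same time.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: reports whether the C runtime accepts a locale name.
// Caution: setlocale changes process-wide state, so this is not thread-safe.
bool runtime_accepts_locale(const char* name) {
    // Remember the current locale so it can be restored afterward.
    // Copy the string, because a later setlocale call may overwrite it.
    const char* current = std::setlocale(LC_ALL, nullptr);
    std::string saved = current ? current : "C";

    // setlocale returns a null pointer if the request cannot be honored.
    bool accepted = std::setlocale(LC_ALL, name) != nullptr;

    // Put the original locale back.
    std::setlocale(LC_ALL, saved.c_str());
    return accepted;
}
```

The name “C” is always accepted. Whether something like sr-Cyrl-BA or de_AT.iso885915@euro is accepted depends on the runtime and on which locales are installed.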

This means that if you just dive in and call towlower without doing any locale preparation, all you’re going to get support for is characters U+0041 (LATIN CAPITAL LETTER A) through U+005A (LATIN CAPITAL LETTER Z) mapping to U+0061 (LATIN SMALL LETTER A) through U+007A (LATIN SMALL LETTER Z).
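Here is a small portable sketch of that default behavior, using only the standard std::towlower. The helper names are hypothetical. The sketch relies only on the ASCII mapping, which is the part the “C” locale is known to handle; what happens to characters outside that range depends on the library.

```cpp
#include <cwchar>
#include <cwctype>
#include <string>

// Hypothetical helper: lowercases one wide character according to
// whatever the current C runtime locale is.
wchar_t lower_in_current_locale(wchar_t ch) {
    return static_cast<wchar_t>(std::towlower(static_cast<std::wint_t>(ch)));
}

// Hypothetical helper: returns a lowercased copy of a wide string,
// converting one character at a time.
std::wstring lower_copy(std::wstring text) {
    for (wchar_t& ch : text) {
        ch = lower_in_current_locale(ch);
    }
    return text;
}
```

With no locale preparation, lower_copy(L"ABC xyz 123") produces L"abc xyz 123": the letters A through Z are mapped, and everything else in the string is left alone.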

The Microsoft Visual C++ compiler standard library comes with bonus functions like _strlwr and _wcslwr for converting strings to lowercase. By default, these follow the current C runtime locale, so again, if you don’t do any locale preparation, you’re going to get the naïve case mapping.

wchar_t example[] = L"\x00C0" L"BC"; // ÀBC
_wcslwr_s(example); // Result: Àbc

Next time, we’ll look at how to get _wcslwr to operate on more interesting locales than the C locale.


Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

7 comments


  • Jamie Anderson

    The `wcrtomb` function dates back to the C99 standard. That was a much simpler time and applications often only used a single thread, so it wasn't such a big issue to use `setlocale`.

    It seems like `wcrtomb` hasn't been updated because there isn't usually much need to convert just a single character at a time. It's far more common to convert entire strings, and `wcstombs` has a variant where a locale can be passed in, `_wcstombs_l`.

    If you truly do have the need to convert a single character at a time, you can always call a `wcstombs` variant where the input string is...

  • Dmitry 2 weeks ago

    Well, as you might have guessed, the Hungarian notation part is not what worries me, but the “lwr” part. Especially while having towlower et les garçons. After all, those guys do introduce new functions for which backward compatibility is not quite a problem. And macros would do the trick anyway, if anything goes wrong.

    Also, I guess some of them could have a limit of six: strcpy makes me think so.

  • Damir Valiulin 2 weeks ago

    This reminds me of a period of time 10 years ago when some of our customers were reporting mysterious crashes and corrupted files. It took some time to find a common thread: a user would open either the File Open or File Save dialog, and after that things went downhill. In a couple of corrupted files we got back, we noticed that numbers with decimals had a comma as the separator instead of ".".

    Armed with this information, we then tried to figure out with the customers what causes the switch from period to comma. It turned out that iCloud's dll ShellStreams64.dll was being loaded...

    • Kevin Norris 2 weeks ago

      I really wish the C and POSIX people would deprecate all of these APIs that expose global mutable state. We don't need a "current" locale, timezone, etc. We just need read-only interfaces that tell us the OS's best guess as to those values, and functions that let us override the values on a per-call basis.

      Your app is not special, and your library is even less special. You don't need to second-guess the OS on these things, and you certainly don't need to reconfigure a process that you don't even own in the first place. If the values are wrong, tell...

  • Dmitry

    While quite understandable for C in 1970, I’ve always wondered who those guys are who are still trying to economize on vowels today. It's as if they get punished for every single letter whose omission doesn’t prevent one from guessing the full word.

    • Jason Harrison 2 weeks ago

      “Hungarian notation” explains a lot of the absence of vowels/letters. If you start with a short type name, wcs for wide character string, you aren’t adding a significant number of characters per function or type name.

      Plus, the holdover from compilers and coding styles where only the first eight characters were used in the identifier lookup tables.

  • Paul Jackson

    In a recent project, I needed a function like wcrtomb_l, but MSVC only has wcrtomb, which forced me to use setlocale with it, which is uglier and more error-prone.