February 6th, 2025

The default C locale is not a very interesting one

Although the C and C++ languages provide facilities for localization, the default locale is the so-called “C” locale, which barely understands anything.

In the “C” locale, the uppercase characters are “A” through “Z”, the lowercase characters are “a” through “z”, the decimal separator is “.”, and there is no thousands separator.
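As a quick check of those numeric conventions, here is a minimal sketch that uses the standard localeconv function. The helper name uses_c_numeric_conventions is made up for illustration, and the sketch assumes the program has not yet called setlocale, so the “C” locale is still in effect.

```cpp
#include <clocale>
#include <cstring>

// Hypothetical helper: reports whether the current locale formats numbers
// the way the "C" locale does.
bool uses_c_numeric_conventions()
{
    const std::lconv* conv = std::localeconv();
    return std::strcmp(conv->decimal_point, ".") == 0 &&  // decimal separator is "."
           conv->thousands_sep[0] == '\0';                // no thousands separator
}
```

Calling this at the start of main should return true, since a program begins life in the “C” locale.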

The “C” locale is designed to be minimal. But that also means that unless you’ve made a special effort to change your process’s locale to something else, functions like towupper and _wcslwr produce only extremely rudimentary results. All they know about are the characters in the 7-bit ASCII set. They don’t even know that the uppercase version of ä is Ä.

Support for any locales beyond the “C” locale is implementation-defined, and the standard considers it a quality-of-implementation issue. Microsoft’s Visual C++ compiler uses BCP 47 for locale names, like sr-Cyrl-BA for “Serbian, Cyrillic script, as used in Bosnia and Herzegovina.” The GNU C library (glibc) appears to use the POSIX-style format language_TERRITORY.codeset@modifier, such as de_AT.iso885915@euro for “German, as used in Austria, using the ISO-8859-15 character set and the Euro as the currency.”
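Since the set of accepted names varies by implementation, one way to find out whether a given name works is simply to ask. The following is a rough sketch, not a recommended pattern: locale_is_available is a hypothetical helper, and it relies only on the standard behavior that setlocale returns a null pointer when it cannot honor the request.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: checks whether the C runtime accepts a locale name,
// then restores whatever locale was active before the call.
bool locale_is_available(const char* name)
{
    // Copy the current name, because a later setlocale call may
    // invalidate the returned pointer.
    const char* current = std::setlocale(LC_ALL, nullptr);
    std::string saved = current ? current : "C";

    bool accepted = std::setlocale(LC_ALL, name) != nullptr;

    std::setlocale(LC_ALL, saved.c_str());
    return accepted;
}
```

The only names the standard guarantees are "C" and the empty string. Whether locale_is_available("sr-Cyrl-BA") or locale_is_available("de_AT.iso885915@euro") succeeds depends on the runtime and on which locales are installed. Note also that setlocale changes process-wide state, so this probe is not safe to run while other threads are using locale-sensitive functions.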

This means that if you just dive in and call towlower without doing any locale preparation, all you’re going to get support for is characters U+0041 (LATIN CAPITAL LETTER A) through U+005A (LATIN CAPITAL LETTER Z) mapping to U+0061 (LATIN SMALL LETTER A) through U+007A (LATIN SMALL LETTER Z).
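To see that mapping in action, here is a small sketch that walks the range and confirms that towlower moves each character up by 0x20. The function name count_ascii_lowercase_mismatches is invented for this example, and it assumes no setlocale call has been made. What towlower does outside this range in the “C” locale varies by runtime, so the sketch does not check it.

```cpp
#include <cwctype>

// Hypothetical helper: counts the code points in U+0041..U+005A for which
// towlower does not return the code point 0x20 higher.
int count_ascii_lowercase_mismatches()
{
    int mismatches = 0;
    for (std::wint_t upper = 0x0041; upper <= 0x005A; ++upper) {
        if (std::towlower(upper) != upper + 0x20) {
            ++mismatches;
        }
    }
    return mismatches;
}
```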

The Microsoft Visual C++ compiler standard library comes with bonus functions like _strlwr and _wcslwr for converting strings to lowercase. By default, these follow the current C runtime locale, so again, if you don’t do any locale preparation, you’re going to get the naïve case mapping.

wchar_t example[] = L"\x00C0" L"BC"; // ÀBC
_wcslwr_s(example); // Result: Àbc

Next time, we’ll look at how to get _wcslwr to operate on more interesting locales than the C locale.

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

4 comments

  • Damir Valiulin 29 minutes ago

    This reminds me of a period of time 10 years ago when some of our customers were reporting mysterious crashes and corrupted files. It took some time to find a common thread: a user would open either the File Open or File Save dialog, and after that things went downhill. In a couple of the corrupted files we got back, we noticed that numbers with decimals had a comma as the separator instead of ".".

    Armed with this information, we then tried to figure out with the customers what causes the switch from period to comma. It turned out that iCloud's dll ShellStreams64.dll was being loaded...

  • Dmitry

    While quite understandable for C in 1970, I’ve always wondered who those guys are who still try to economize on vowels today. It’s as if they get punished for every single letter whose omission doesn’t prevent one from guessing the full word.

    • Jason Harrison 40 minutes ago

      “Hungarian notation” explains a lot of the absence of vowels/letters. If you start with a short type name, wcs for wide character string, you aren’t adding a significant number of characters per function or type name.

      Plus, the holdover from compilers and coding styles where only the first eight characters were used in the identifier lookup tables.

  • Paul Jackson

    In a recent project, I needed a function like wcrtomb_l, but MSVC only has wcrtomb, which forced me to use setlocale with it, which is uglier and more error-prone.