A popular but wrong way to convert a string to uppercase or lowercase

Raymond Chen

It seems that a popular way of converting a string to uppercase or lowercase is to do it letter by letter.

std::wstring name;

std::transform(name.begin(), name.end(), name.begin(),
    std::tolower);

This is wrong for many reasons.

First of all, std::tolower is not an addressable function. This means, among other things, that you are not allowed to take the function’s address,¹ like we’re doing here when we pass a pointer to the function to std::transform. So we’ll have to use a lambda.

std::wstring name;

std::transform(name.begin(), name.end(), name.begin(),
    [](auto c) { return std::tolower(c); });

The next mistake is a copy-pasta: The code is using std::tolower to convert wide characters (wchar_t) even though std::tolower works only for narrow characters (even more restrictive than that: it works only for unsigned narrow characters, i.e. unsigned char). There is no compile-time error because std::tolower accepts an int, and on most systems, wchar_t is implicitly promotable to int, so the compiler accepts the value without complaint even though over 99% of the potential values are out of range.
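The “over 99%” figure can be checked with a little arithmetic. The sketch below assumes a 16-bit wchar_t (as on Windows) and generously counts the acceptable arguments to std::tolower as every unsigned char value plus EOF. The helper name out_of_range_percent is made up for this illustration.

```cpp
#include <climits>

// Upper bound on the number of arguments std::tolower is defined for:
// every unsigned char value (0..UCHAR_MAX), plus EOF.
constexpr double valid_tolower_inputs = static_cast<double>(UCHAR_MAX) + 1 + 1;

// Percentage of the values of a wchar_bits-wide wchar_t that fall outside
// the range std::tolower accepts. (Hypothetical helper, illustration only.)
constexpr double out_of_range_percent(int wchar_bits)
{
    double total_values = 1.0;
    for (int i = 0; i < wchar_bits; ++i) {
        total_values *= 2.0;    // total_values ends up as 2^wchar_bits
    }
    return 100.0 * (1.0 - valid_tolower_inputs / total_values);
}
```

For a 16-bit wchar_t and an 8-bit char, out_of_range_percent(16) comes to roughly 99.6.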

Even if we fix the code to use std::towlower:

std::wstring name;

std::transform(name.begin(), name.end(), name.begin(),
    [](auto c) { return std::towlower(c); });

it’s still wrong because it assumes that case mapping can be performed char-by-char or wchar_t-by-wchar_t in a context-free manner.

If the wchar_t encoding is UTF-16, then characters outside the basic multilingual plane (BMP) are represented by pairs of wchar_t values. For example, the Unicode character OLD HUNGARIAN CAPITAL LETTER A² (U+10C80) is represented by two UTF-16 code units: D803 followed by DC80.

Passing these two code units to towlower one at a time prevents towlower from understanding how they interact with each other. If you call towlower with DC80, it recognizes that you passed only half of a character, but it doesn’t know what the other half is, so it has to just shrug its shoulders and say, “Um, DC80?” Too bad, because the lowercase version of OLD HUNGARIAN CAPITAL LETTER A (U+10C80) is OLD HUNGARIAN SMALL LETTER A (U+10CC0), so it should have returned DCC0. Of course towlower doesn’t have psychic powers, so you can’t really expect it to have known that the DC80 was the partner of an unseen D803.
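To see where those code units come from, here is a minimal sketch of the standard UTF-16 surrogate-pair calculation. The function name to_surrogate_pair is made up for this illustration, and the sketch assumes its argument really is a code point outside the BMP.

```cpp
#include <array>

// Split a code point in the range U+10000..U+10FFFF into its UTF-16
// high and low surrogate code units. (Hypothetical helper; it does not
// validate that the input is actually outside the BMP.)
inline std::array<char16_t, 2> to_surrogate_pair(char32_t code_point)
{
    const char32_t offset = code_point - 0x10000;   // 20-bit value
    return { static_cast<char16_t>(0xD800 + (offset >> 10)),     // top 10 bits
             static_cast<char16_t>(0xDC00 + (offset & 0x3FF)) }; // bottom 10 bits
}
```

The capital letter U+10C80 comes out as D803 DC80 and the small letter U+10CC0 as D803 DCC0. The two differ only in the second code unit, but nothing about DC80 on its own says which character it belongs to.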

Another problem (which applies even if wchar_t is UTF-32) is that the uppercase and lowercase versions of a character might have different lengths. For example, LATIN SMALL LETTER SHARP S (“ß” U+00DF) uppercases to the two-character sequence “SS”:³ Straße ⇒ STRASSE, and LATIN SMALL LIGATURE FL (“ﬂ” U+FB02) uppercases to the two-character sequence “FL”. In both examples, converting the string to uppercase causes the string to get longer. And in certain forms of the French language, capitalizing an accented character causes the accent to be dropped: à Paris ⇒ A PARIS. If the accented character à were encoded as LATIN SMALL LETTER A (U+0061) followed by COMBINING GRAVE ACCENT (U+0300), then converting to uppercase causes the string to get shorter.
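The length change is easy to demonstrate with a toy. The function below is emphatically not a real case mapper: it is a hypothetical helper that knows only the two one-to-many mappings mentioned above plus the ASCII letters, and it works on UTF-32 so that surrogates don’t muddy the picture. The point is only that the result cannot be written back into the original string one element at a time.

```cpp
#include <string>

// Toy illustration only. Handles ß -> SS, ﬂ -> FL, and ASCII a-z;
// every other character is passed through unchanged.
inline std::u32string toy_full_uppercase(const std::u32string& input)
{
    std::u32string result;
    for (char32_t c : input) {
        if (c == U'\u00DF') {               // LATIN SMALL LETTER SHARP S
            result += U"SS";
        } else if (c == U'\uFB02') {        // LATIN SMALL LIGATURE FL
            result += U"FL";
        } else if (c >= U'a' && c <= U'z') {
            result += static_cast<char32_t>(c - U'a' + U'A');
        } else {
            result += c;
        }
    }
    return result;
}
```

Passing U"Stra\u00DFe" (six code points) produces U"STRASSE" (seven), which an in-place std::transform has no room for.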

Similar issues apply to the std::string version:

std::string name;

std::transform(name.begin(), name.end(), name.begin(),
    [](auto c) { return std::tolower(c); });

If the string potentially contains characters outside the 7-bit ASCII range, then this triggers undefined behavior when those characters are encountered. And for UTF-8 data, you have the same issues discussed before: Multibyte characters will not be converted properly, and it breaks for case mappings that alter string lengths.
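The undefined behavior arises because char may be signed, in which case bytes above 0x7F are negative, and std::tolower accepts only values representable as unsigned char (or EOF). A commonly seen workaround, sketched below with a made-up function name, is to have the lambda take an unsigned char. This avoids the undefined behavior but fixes nothing else: it is still a byte-by-byte conversion.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lowercases byte by byte without undefined behavior: the lambda parameter
// is unsigned char, so std::tolower never receives a negative value.
// This is NOT a correct case mapping for UTF-8 text. (Hypothetical helper.)
inline std::string to_lower_bytes(std::string text)
{
    std::transform(text.begin(), text.end(), text.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}
```

In the default "C" locale, this turns "Hello, WORLD" into "hello, world" but leaves the UTF-8 encoding of “Ä” (the bytes C3 84) untouched, because neither byte is an uppercase letter on its own.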

Okay, so those are the problems. What’s the solution?

If you need to perform a case mapping on a string, you can use LCMapStringEx with LCMAP_LOWERCASE or LCMAP_UPPERCASE, possibly with other flags like LCMAP_LINGUISTIC_CASING. If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.

¹ The standard imposes this limitation because the implementation may need to add default function parameters, template default parameters, or overloads in order to accomplish the various requirements of the standard.

² I find it quaint that Unicode character names are ALL IN CAPITAL LETTERS, in case you need to put them in a Baudot telegram or something.

³ Under the pre-1996 rules, the ß can capitalize under certain conditions to “SZ”: Maßen ⇒ MASZEN. And in 2017, the Council for German Orthography (Rat für deutsche Rechtschreibung) permitted LATIN CAPITAL LETTER SHARP S (“ẞ” U+1E9E) to be used as a capital form of ß.

14 comments

Кирилл Ярин October 13, 2024

I wish I could ask this under “Can INI files be Unicode?” post on 2024-06-06, but discussion is closed over there.

Why can’t batch files (.bat and .cmd) be UTF-16?
Dos Moonen October 11, 2024

The next C standard after C23 will include N3366 – Restartable Functions for Efficient Character Conversions

So now C++ can build upon those functions
Christian Chung October 9, 2024

Does anywhere online have a place for finding these types of things? I’ve never even heard of LCMapStringEx until now
IS4 October 9, 2024

The solution is not to use C++. Use C# which standardizes all these conversions!
alan robinson October 9, 2024

A lot of this discussion misses in my mind the most important use case. 99.9% of the time or maybe more that I use a to lower or to upper function it’s because I want to do case insensitive comparison. Given the potentials for multiple mappings revealed in the comments it seems like rather than combining two functions to do this we really should be using a standard compare case insensitive method that works across all UTF encodings and locales, or even a more permissive fuzzy match method that ignores whitespace.
- Kevin Norris October 10, 2024
  
  The correct way to do this is to use the Unicode standard’s CaseFolding.txt file, or more likely a library that wraps it and makes it ergonomic to use, such as ICU.
  
  If you also want to deal with other problems of “this string looks similar/identical to that string, but they’re not the same characters,” then you’re probably looking for NFD/NFKD normalization, which should be applied after case folding since case folding may reintroduce non-normalized characters. Use NFD if you just want to make combining and precomposed accents equivalent, or NFKD if you also want to eliminate distinctions like circled numerals, fancy mathematical scripts, or non-breaking spaces. NFD is suitable for use on strings that will be displayed (case folding is not). NFKD is (generally) suitable for cases where your users are not trying to do weird things, such as typing formatted text into a plain text field using the mathematical symbols block.
  
  If you’re dealing with “somebody is intentionally trying to construct strings that look like other strings, but are actually different, and that represents a security issue,” then you need to apply a Confusables check, which works differently but resembles a significantly more aggressive variation of NFKD (i.e. it is aggressive enough that the text may be rendered functionally illegible, so you should only use it for comparison purposes and never try to display it to the user). ICU has the class SpoofChecker for this. Its main use case is the URL bar of a web browser (to detect domain names with funny Unicode characters that pretend to be other domains for phishing purposes).
  
Frank Schmitt October 8, 2024

Germans decided in 2017 that converting ß into SS is stupid and officially started to use a capital letter version:

ẞ (U+1E9E) LATIN CAPITAL LETTER SHARP S
Kristof Roomp October 8, 2024

Especially people that lived through the Unicode migration of Windows and thought that now that all characters are 16 bit we never have to worry about that kind of stuff again. Dangers of early adoption I suppose.
Jonathan Wilson October 8, 2024

Is there a reason wcslwr isn’t a solution to this problem?
Or do they not handle all the weird stuff from various languages properly?
- Mike Winterberg October 9, 2024 · Edited
  
  _wcslwr_s_l is implemented with LCMapStringW(LCMAP_LOWERCASE).
Henke37 October 7, 2024

My favorite gotcha here is Greek. There’s a character with one uppercase version and two lowercase versions.
Rob Bernstein October 7, 2024

So there’s no facility built into Standard C++ that handles this correctly even in C++23?
- LB October 10, 2024 · Edited
  
  There is no single correct way to handle it, see the comment above yours about Greek having two lowercase versions of the same uppercase character, and the comment below ours about Turkish. Human languages are significantly more complicated than some standard library functions can hope to deal with, and they’d have to be continually updated as human languages evolve. Dedicated libraries with bindings to multiple languages (such as ICU) are currently the best option.