{"id":110345,"date":"2024-10-07T07:00:00","date_gmt":"2024-10-07T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=110345"},"modified":"2024-10-07T09:31:20","modified_gmt":"2024-10-07T16:31:20","slug":"20241007-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20241007-00\/?p=110345","title":{"rendered":"A popular but wrong way to convert a string to uppercase or lowercase"},"content":{"rendered":"<p>It seems that a popular way of converting a string to uppercase or lowercase is to do it letter by letter.<\/p>\n<pre>std::wstring name;\r\n\r\nstd::transform(name.begin(), name.end(), name.begin(),\r\n    std::tolower);\r\n<\/pre>\n<p>This is wrong for many reasons.<\/p>\n<p>First of all, <code>std::<wbr \/>tolower<\/code> is not an <i>addressable function<\/i>. This means, among other things, that you are not allowed to take the function&#8217;s address,\u00b9 like we&#8217;re doing here when we pass a pointer to the function to <code>std::<wbr \/>transform<\/code>. So we&#8217;ll have to use a lambda.<\/p>\n<pre>std::wstring name;\r\n\r\nstd::transform(name.begin(), name.end(), name.begin(),\r\n    [](auto c) { return std::tolower(c); });\r\n<\/pre>\n<p>The next mistake is a copy-pasta: The code is using <code>std::<wbr \/>tolower<\/code> to convert wide characters (<code>wchar_t<\/code>) even though <code>std::<wbr \/>tolower<\/code> works only for narrow characters (even more restrictive than that: it works only for <i>unsigned<\/i> narrow characters <code>unsigned char<\/code>). 
There is no compile-time error because <code>std::<wbr \/>tolower<\/code> accepts an <code>int<\/code>, and on most systems, <code>wchar_t<\/code> is implicitly promotable to <code>int<\/code>, so the compiler accepts the value without complaint even though over 99% of the potential values are out of range.<\/p>\n<p>Even if we fix the code to use <code>std::<wbr \/>towlower<\/code>:<\/p>\n<pre>std::wstring name;\r\n\r\nstd::transform(name.begin(), name.end(), name.begin(),\r\n    [](auto c) { return std::towlower(c); });\r\n<\/pre>\n<p>it&#8217;s still wrong because it assumes that case mapping can be performed <code>char<\/code>-by-<code>char<\/code> or <code>wchar_t<\/code>-by-<code>wchar_t<\/code> in a context-free manner.<\/p>\n<p>If the <code>wchar_t<\/code> encoding is UTF-16, then characters outside the basic multilingual plane (BMP) are represented by pairs of <code>wchar_t<\/code> values. For example, the Unicode character OLD HUNGARIAN CAPITAL LETTER A\u00b2 (U+10C80) is represented by two UTF-16 code units: <kbd>D803<\/kbd> followed by <kbd>DC80<\/kbd>.<\/p>\n<p>Passing these two code units to <code>towlower<\/code> one at a time prevents <code>towlower<\/code> from understanding how they interact with each other. If you call <code>towlower<\/code> with <kbd>DC80<\/kbd>, it recognizes that you passed only half of a character, but it doesn&#8217;t know what the other half is, so it has to just shrug its shoulders and say, &#8220;Um, <kbd>DC80<\/kbd>?&#8221; Too bad, because the lowercase version of OLD HUNGARIAN CAPITAL LETTER A (U+10C80) is OLD HUNGARIAN SMALL LETTER A (U+10CC0), so it should have returned <kbd>DCC0<\/kbd>. 
Of course <code>towlower<\/code> doesn&#8217;t have psychic powers, so you can&#8217;t really expect it to have known that the <kbd>DC80<\/kbd> was the partner of an unseen <kbd>D803<\/kbd>.<\/p>\n<p>Another problem (which applies even if <code>wchar_t<\/code> is UTF-32) is that the uppercase and lowercase versions of a character might have different lengths. For example, LATIN SMALL LETTER SHARP S (&#8220;\u00df&#8221; U+00DF) uppercases to the two-character sequence &#8220;SS&#8221;:\u00b3 Stra\u00dfe \u21d2 STRASSE, and LATIN SMALL LIGATURE FL (&#8220;\ufb02&#8221; U+FB02) uppercases to the two-character sequence &#8220;FL&#8221;. In both examples, converting the string to uppercase causes the string to get longer. And in certain forms of the French language, capitalizing an accented character causes the accent to be dropped: \u00e0 Paris \u21d2 A PARIS. If the accented character \u00e0 were encoded as LATIN SMALL LETTER A (U+0061) followed by COMBINING GRAVE ACCENT (U+0300), then converting to uppercase causes the string to get shorter.<\/p>\n<p>Similar issues apply to the <code>std::string<\/code> version:<\/p>\n<pre>std::string name;\r\n\r\nstd::transform(name.begin(), name.end(), name.begin(),\r\n    [](auto c) { return std::tolower(c); });\r\n<\/pre>\n<p>If the string potentially contains characters outside the 7-bit ASCII range, then this triggers undefined behavior when those characters are encountered on platforms where plain <code>char<\/code> is signed, because the resulting negative values are not representable as <code>unsigned char<\/code>. And for UTF-8 data, you have the same issues discussed before: Multibyte characters will not be converted properly, and it breaks for case mappings that alter string lengths.<\/p>\n<p>Okay, so those are the problems. What&#8217;s the solution?<\/p>\n<p>If you need to perform a case mapping on a string, you can use <code>LCMap\u00adString\u00adEx<\/code> with <code>LCMAP_<wbr \/>LOWERCASE<\/code> or <code>LCMAP_<wbr \/>UPPERCASE<\/code>, possibly with other flags like <code>LCMAP_<wbr \/>LINGUISTIC_<wbr \/>CASING<\/code>. 
If you use the <a href=\"https:\/\/icu.unicode.org\/\">International Components for Unicode (ICU)<\/a> library, you can use <code>u_strToUpper<\/code> and <code>u_strToLower<\/code>.<\/p>\n<p>\u00b9 The standard imposes this limitation because the implementation may need to add default function parameters, template default parameters, or overloads in order to accomplish the various requirements of the standard.<\/p>\n<p>\u00b2 I find it quaint that Unicode character names are ALL IN CAPITAL LETTERS, in case you need to put them in a Baudot telegram or something.<\/p>\n<p>\u00b3 Under the pre-1996 rules, the \u00df can capitalize under certain conditions to &#8220;SZ&#8221;: Ma\u00dfen \u21d2 MASZEN. And in 2017, the Council for German Orthography (<i lang=\"de\">Rat f\u00fcr deutsche Rechtschreibung<\/i>) permitted LATIN CAPITAL LETTER SHARP S (&#8220;\u1e9e&#8221; U+1E9E) to be used as a capital form of \u00df.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Converting character by character isn&#8217;t good enough any more.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-110345","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Converting character by character isn&#8217;t good enough any 
more.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110345","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=110345"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110345\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=110345"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=110345"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=110345"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}