{"id":104040,"date":"2020-08-04T07:00:00","date_gmt":"2020-08-04T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=104040"},"modified":"2020-08-04T06:52:19","modified_gmt":"2020-08-04T13:52:19","slug":"20200804-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20200804-00\/?p=104040","title":{"rendered":"How can CharUpper and CharLower guarantee that the uppercase version of a string is the same length as the lowercase version?"},"content":{"rendered":"<p>The <code>CharUpper<\/code> function takes a buffer of characters and converts them in place to uppercase. This requires that the uppercase version of any character occupy the same number of code units as the lowercase counterpart. However, there is nothing in the Unicode specification that appears to require this. Did Microsoft come to some sort of special under-the-table deal with the Unicode Consortium to ensure that this property holds for all characters?<\/p>\n<p>No, there is no such special under-the-table deal, probably because there is also no such guarantee. And in fact, there are counterexamples if you look closely enough. We noted earlier that the uppercase version of the \u00df character for a long time was the two-character combination SS. (This got even more complicated with the adoption of the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Capital_%E1%BA%9E\"> capital \u1e9e<\/a> in 2017.) There&#8217;s also U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI &#8220;\u1f80&#8221; whose uppercase version is the two characters U+1F08 GREEK CAPITAL LETTER ALPHA WITH PSILI and U+0399 GREEK CAPITAL LETTER IOTA &#8220;\u1f08\u0399&#8221;. 
But if you ask <code>CharUpper<\/code> to convert them, it leaves \u00df unchanged, and it converts &#8220;\u1f80&#8221; to U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI &#8220;\u1f88&#8221;.<\/p>\n<p>The <code>CharUpper<\/code> function tries to convert the string in place, but if the uppercase and lowercase versions of a character are not the same length, then it panics and does something strange.<\/p>\n<p>The <code>CharUpper<\/code> function is a legacy function that remains for compatibility with the <code>AnsiUpper<\/code> function in 16-bit Windows, and we saw last time that <a title=\"Peeking inside the implementation of &lt;CODE&gt;AnsiUpper&lt;\/CODE&gt; and &lt;CODE&gt;AnsiLower&lt;\/CODE&gt; in Windows 1.0\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20200803-00\/?p=104038\"> the <code>AnsiUpper<\/code> function was originally hard-coded to code page 1252<\/a>. Over time, Windows added support for other code pages, and they happened to enjoy the property that the uppercase and lowercase versions of a string have the same length. (Again, if you ignore the weird \u00df \u2194 SS thing.)<\/p>\n<p>Eventually, that rule broke down, but you can&#8217;t go back in time and kill <code>CharUpper<\/code>&#8217;s parents before it was born. You just have to accept that there&#8217;s this thing called <code>CharUpper<\/code> that has some baked-in assumptions that are wrong. If you give it a string that violates those assumptions, then it does what it can, but the results aren&#8217;t the best.<\/p>\n<p>I would consider the <code>CharUpper<\/code> and <code>CharLower<\/code> family of functions to be deprecated. 
Instead, use the <code>LCMapStringEx<\/code> function with the <code>LCMAP_UPPERCASE<\/code> or <code>LCMAP_LOWERCASE<\/code> flag, as appropriate.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Does it have a special under-the-table deal with the Unicode Consortium?<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-104040","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Does it have a special under-the-table deal with the Unicode Consortium?<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/104040","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=104040"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/104040\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=104040"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=104040"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=104040"}],"c
uries":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}