{"id":110443,"date":"2024-10-31T07:00:00","date_gmt":"2024-10-31T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=110443"},"modified":"2024-10-30T19:10:46","modified_gmt":"2024-10-31T02:10:46","slug":"20241031-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20241031-00\/?p=110443","title":{"rendered":"What has case distinction but is neither uppercase nor lowercase?"},"content":{"rendered":"<p>If you go exploring the Unicode Standard, you may be surprised to find that there are some characters that have case distinction yet are themselves neither uppercase nor lowercase.<\/p>\n<p>Oooooh, spooky.<\/p>\n<p>In other words, it is a character <var>c<\/var> with the properties that<\/p>\n<ul>\n<li>toUpper(<var>c<\/var>) \u2260 toLower(<var>c<\/var>), yet<\/li>\n<li><var>c<\/var> \u2260 toUpper(<var>c<\/var>) and <var>c<\/var> \u2260 toLower(<var>c<\/var>).<\/li>\n<\/ul>\n<p>Congratulations, you found the mysterious third case: Title case.<\/p>\n<p>There are some Unicode characters that occupy a single code point but represent two graphical symbols packed together. For example, the Unicode character \u01f3 (U+01F3 LATIN SMALL LETTER DZ) looks like two Unicode characters placed next to each other: dz (U+0064 LATIN SMALL LETTER D followed by U+007A LATIN SMALL LETTER Z).<\/p>\n<p>These digraphs are characters in the alphabets of some languages, most notably Hungarian. In those languages, the digraph is considered a separate letter of the alphabet. 
For example, the first ten letters of the Hungarian alphabet are\u00b9<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td>a<\/td>\n<td>\u00e1<\/td>\n<td>b<\/td>\n<td>c<\/td>\n<td>cs<\/td>\n<td>d<\/td>\n<td>dz<\/td>\n<td>dzs<\/td>\n<td>e<\/td>\n<td>\u00e9<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>These digraphs (and one trigraph) have three forms.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Form<\/th>\n<th>Result<\/th>\n<\/tr>\n<tr>\n<td>Uppercase<\/td>\n<td>\u01f1<\/td>\n<\/tr>\n<tr>\n<td>Title case<\/td>\n<td>\u01f2<\/td>\n<\/tr>\n<tr>\n<td>Lowercase<\/td>\n<td>\u01f3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Unicode includes four digraphs in its encoding.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Uppercase<\/th>\n<th>Title case<\/th>\n<th>Lowercase<\/th>\n<\/tr>\n<tr>\n<td>\u01c4<\/td>\n<td>\u01c5<\/td>\n<td>\u01c6<\/td>\n<\/tr>\n<tr>\n<td>\u01c7<\/td>\n<td>\u01c8<\/td>\n<td>\u01c9<\/td>\n<\/tr>\n<tr>\n<td>\u01ca<\/td>\n<td>\u01cb<\/td>\n<td>\u01cc<\/td>\n<\/tr>\n<tr>\n<td>\u01f1<\/td>\n<td>\u01f2<\/td>\n<td>\u01f3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>But wait, we have a Unicode code point for the dz digraph, but we don&#8217;t have one for the cs digraph or the dzs trigraph. What&#8217;s so special about dz?<\/p>\n<p>These digraphs owe their existence in Unicode not to Hungarian but to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.\u00b9<\/p>\n<p>Just another situation where the world is more complicated than you think. 
You thought you understood uppercase and lowercase, but there&#8217;s another case in between that you didn&#8217;t know about.<\/p>\n<p><b>Bonus chatter<\/b>: The fact that dz is treated as a single letter in Hungarian means that if you search for &#8220;mad&#8221;, it should not match &#8220;<span lang=\"hu\">madzag<\/span>&#8221; (which means &#8220;string&#8221;) because the &#8220;dz&#8221; in &#8220;<span lang=\"hu\">madzag<\/span>&#8221; is a single letter and not a &#8220;d&#8221; followed by a &#8220;z&#8221;, any more than &#8220;lav&#8221; should match &#8220;law&#8221; just because the first part of the letter &#8220;w&#8221; looks like a &#8220;v&#8221;. Another surprising result if you mistakenly use a literal substring search rather than a locale-sensitive one. We&#8217;ll look at locale-sensitive substring searches next time.<\/p>\n<p>\u00b9 I got this information from the Unicode Standard, Version 15.0, <a href=\"https:\/\/www.unicode.org\/versions\/Unicode15.0.0\/ch07.pdf\">Chapter 7<\/a>: &#8220;Europe I&#8221;, Section 7.1: &#8220;Latin&#8221;, subsection &#8220;Latin Extended-B: U+0180-U+024F&#8221;, sub-subsection &#8220;Croatian Digraphs Matching Serbian Cyrillic Letters.&#8221;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It has one foot in each world but belongs to neither.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[26],"class_list":["post-110443","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-other"],"acf":[],"blog_post_summary":"<p>It has one foot in each world but belongs to 
neither.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110443","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=110443"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110443\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=110443"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=110443"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=110443"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}