October 31st, 2024

What has case distinction but is neither uppercase nor lowercase?

If you go exploring the Unicode Standard, you may be surprised to find that there are some characters that have case distinction yet are themselves neither uppercase nor lowercase.

Oooooh, spooky.

In other words, such a character c has the properties that

  • toUpper(c) ≠ toLower(c), yet
  • c ≠ toUpper(c) and c ≠ toLower(c).

Congratulations, you found the mysterious third case: Title case.
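The conditions above can be checked mechanically. Here is a minimal sketch in Python, assuming that Python's str.upper() and str.lower() stand in for toUpper and toLower (they apply full Unicode case mappings, so the result may be more than one character). The name is_third_case is made up for this illustration.

```python
import sys
import unicodedata

def is_third_case(ch: str) -> bool:
    """True if ch has case distinction but is neither its own upper nor lower form."""
    upper, lower = ch.upper(), ch.lower()
    return upper != lower and ch != upper and ch != lower

# Scan every code point and keep the ones that satisfy all three conditions.
candidates = [chr(cp) for cp in range(sys.maxunicode + 1) if is_third_case(chr(cp))]

for ch in candidates:
    # Print code point and name only, so the output is safe on any console encoding.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

The exact list depends on the Unicode data bundled with your Python, but it should include U+01C5, U+01C8, U+01CB, and U+01F2.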

There are some Unicode characters that occupy a single code point but represent two graphical symbols packed together. For example, the Unicode character ǳ (U+01F3 LATIN SMALL LETTER DZ) looks like two Unicode characters placed next to each other: dz (U+0064 LATIN SMALL LETTER D followed by U+007A LATIN SMALL LETTER Z).

These digraphs are characters in the alphabets of some languages, most notably Hungarian. In those languages, the digraph is considered a separate letter of the alphabet. For example, the first ten letters of the Hungarian alphabet are¹

a á b c cs d dz dzs e é

These digraphs (and one trigraph) have three forms.

Form Result
Uppercase DZ
Title case Dz
Lowercase dz

Unicode includes four digraphs in its encoding.

Uppercase     Title case    Lowercase
Ǆ (U+01C4)    ǅ (U+01C5)    ǆ (U+01C6)
Ǉ (U+01C7)    ǈ (U+01C8)    ǉ (U+01C9)
Ǌ (U+01CA)    ǋ (U+01CB)    ǌ (U+01CC)
Ǳ (U+01F1)    ǲ (U+01F2)    ǳ (U+01F3)
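You can reproduce the three forms of each of these four digraphs from the title case code point alone. This is a sketch using Python's built-in string methods and the unicodedata module; the helper name case_forms is hypothetical.

```python
import unicodedata

# The four title case digraph code points: U+01C5, U+01C8, U+01CB, U+01F2.
TITLECASE_DIGRAPHS = "\u01C5\u01C8\u01CB\u01F2"

def case_forms(title_ch: str) -> tuple[str, str, str]:
    """Return (uppercase, title case, lowercase) for a title case character."""
    return title_ch.upper(), title_ch, title_ch.lower()

for ch in TITLECASE_DIGRAPHS:
    forms = case_forms(ch)
    code_points = " ".join(f"U+{ord(c):04X}" for c in forms)
    # General category "Lt" is Unicode's label for a title case letter.
    print(code_points, unicodedata.category(ch), unicodedata.name(ch))
```

Going the other way, str.title() applies the title case mapping, so "\u01F3".title() gives U+01F2 rather than U+01F1.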

But wait, we have a Unicode code point for the dz digraph, but we don’t have one for the cs digraph or the dzs trigraph. What’s so special about dz?

These digraphs owe their existence in Unicode not to Hungarian but to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹

Just another situation where the world is more complicated than you think. You thought you understood uppercase and lowercase, but there’s another case in between that you didn’t know about.

Bonus chatter: The fact that dz is treated as a single letter in Hungarian means that if you search for “mad”, it should not match “madzag” (which means “string”) because the “dz” in “madzag” is a single letter and not a “d” followed by a “z”, any more than “lav” should match “law” just because the first part of the letter “w” looks like a “v”. This is another surprising result you get if you mistakenly use a literal substring search rather than a locale-sensitive one. We’ll look at locale-sensitive substring searches next time.
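To see the difference, here is a toy sketch in Python. It is not a real locale-sensitive search: the HUNGARIAN_MULTIGRAPHS list and both helper functions are made up for this illustration, and the greedy splitting gets compound words wrong when two letters merely happen to sit next to each other. It only shows why comparing letters gives a different answer than comparing code units.

```python
# Hungarian digraphs and the one trigraph, longest first so "dzs" wins over "dz".
HUNGARIAN_MULTIGRAPHS = ("dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")

def split_letters(word: str) -> list[str]:
    """Greedily split a word into letters, treating multigraphs as single letters."""
    lowered = word.lower()
    letters = []
    i = 0
    while i < len(lowered):
        for multigraph in HUNGARIAN_MULTIGRAPHS:
            if lowered.startswith(multigraph, i):
                letters.append(multigraph)
                i += len(multigraph)
                break
        else:
            # No multigraph starts here, so take a single character.
            letters.append(lowered[i])
            i += 1
    return letters

def contains_letters(haystack: str, needle: str) -> bool:
    """True if the needle's letter sequence occurs in the haystack's letter sequence."""
    h, n = split_letters(haystack), split_letters(needle)
    return any(h[k:k + len(n)] == n for k in range(len(h) - len(n) + 1))

print("mad" in "madzag")                  # literal substring search
print(contains_letters("madzag", "mad"))  # letter-by-letter search
```

The literal search reports a match, while the letter-by-letter search does not, because it sees “madzag” as m, a, dz, a, g.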

¹ I got this information from the Unicode Standard, Version 15.0, Chapter 7: “Europe I”, Section 7.1: “Latin”, subsection “Latin Extended-B: U+0180-U+024F”, sub-subsection “Croatian Digraphs Matching Serbian Cyrillic Letters.”

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

23 comments

  • Miloš Milutinović

    “Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹”

    This is slightly wrong, Serbian also uses Latin – it uses both Latin and Cyrillic, per personal preference. To quote Wikipedia, “Serbian is practically the only European standard language whose speakers are fully functionally digraphic,[18] using both Cyrillic and Latin alphabets.”

  • Tudor Iordăchescu

    The EU law imposes the user’s informed consent for the use of cookies, that’s it.

    Some corporations/people that first complied with that law (I admit I’m a little bit fuzzy on the historical timeline) chose to implement the most annoying version possible as a form of malicious compliance, probably hoping that public outcry would trigger a revision of the law. I bet that 99% of people implementing such popups nowadays are just lazy and follow the herd instead of researching what the law actually requires.

    • Bela Zsir · Edited

      Some first compliers were hoping for a public outcry, since then we have all the herd following, ie. the overall impact is a million times greater, and still no public outcry.
      What does that say about the public? …am I really the only one so annoyed with this?

  • David Faulks · Edited

    The reason these letters exist is because Unicode has a policy of 1-to-1 round trip encoding compatibility with older character sets, and Yugoslavia (keep in mind Unicode came out in 1991) used to have an 8-bit character set (YUSCII) that included these digraph letters. I’m not sure why so many commentators are focused on Hungarian.

  • Michael Chermside

    Your point about how Hungarians actually use the characters is excellent -- and remains so even if it turns out that SOME Hungarian speakers disagree with you on this. In my opinion, the authors of the Unicode standard should generally attempt to support oddities that are unique within a language but when native speakers disagree about an oddity, Unicode should err on the side of simplicity. (Of course dz may have been added to support Serbo-Croatian, not Hungarian.)

    However, your ire about cookie popups is misplaced. Computer technologists did not invent and impose them, an EU law mandated the cookie popups...

    • Bela Zsir · Edited

      Thank you for the reply, sorry for the long rant. I was trying to make the point that after decades of NOT having all our characters, we don't want now to choke on too many.

      PS. You are right about the cookie thing being misplaced here, when I can, I speak out in the right place, but nobody cares. As if it were not a problem for anyone else in the world, in desperation I tried here to give just an example, of what is a million times more disturbing than the made-up problem of the lack of some unneeded...
