October 31st, 2024

What has case distinction but is neither uppercase nor lowercase?

If you go exploring the Unicode Standard, you may be surprised to find that there are some characters that have case distinction yet are themselves neither uppercase nor lowercase.

Oooooh, spooky.

In other words, it is a character c with the properties that

  • toUpper(c) ≠ toLower(c), yet
  • c ≠ toUpper(c) and c ≠ toLower(c).

Congratulations, you found the mysterious third case: Title case.
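
The two conditions above can be checked mechanically. Here is a minimal Python sketch. The helper name is made up for illustration, and it leans on Python's built-in str.upper and str.lower, which apply Unicode's full case mappings, so it may not behave exactly like a toUpper/toLower that maps one code point to one code point.

```python
import unicodedata

def has_case_but_is_neither(c: str) -> bool:
    """Hypothetical helper: True if c has a case distinction but is
    neither its own uppercase form nor its own lowercase form."""
    upper, lower = c.upper(), c.lower()
    return upper != lower and c != upper and c != lower

# U+01F2 is the title case member of the DZ family.
print(has_case_but_is_neither("\u01F2"))  # True
print(has_case_but_is_neither("A"))       # False: it is its own uppercase
print(has_case_but_is_neither("a"))       # False: it is its own lowercase
print(has_case_but_is_neither("1"))       # False: no case distinction at all
print(unicodedata.category("\u01F2"))     # Lt (Letter, titlecase)
```

The last line shows that Unicode records this third case directly in the character's general category.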

There are some Unicode characters that occupy a single code point but represent two graphical symbols packed together. For example, the Unicode character ǳ (U+01F3 LATIN SMALL LETTER DZ) looks like two Unicode characters placed next to each other: dz (U+0064 LATIN SMALL LETTER D followed by U+007A LATIN SMALL LETTER Z).
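
A short Python sketch of the difference between the single code point and the two-letter sequence. The variable names are just for this example. It also shows one reason the two are easy to confuse: compatibility normalization (NFKC) unpacks the digraph into its two letters.

```python
import unicodedata

single = "\u01F3"      # LATIN SMALL LETTER DZ: one code point
pair = "\u0064\u007A"  # "d" followed by "z": two code points

print(len(single), len(pair))    # 1 2
print(unicodedata.name(single))  # LATIN SMALL LETTER DZ
print(single == pair)            # False

# NFKC normalization replaces the digraph with plain d + z.
print(unicodedata.normalize("NFKC", single) == pair)  # True
```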

These digraphs are characters in the alphabets of some languages, most notably Hungarian. In those languages, the digraph is considered a separate letter of the alphabet. For example, the first ten letters of the Hungarian alphabet are¹

a á b c cs d dz dzs e é

These digraphs (and one trigraph) have three forms.

Form Result
Uppercase Ǳ
Title case ǲ
Lowercase ǳ

Unicode includes four digraphs in its encoding.

Uppercase Title case Lowercase
Ǆ ǅ ǆ
Ǉ ǈ ǉ
Ǌ ǋ ǌ
Ǳ ǲ ǳ
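
The table can be regenerated from the four lowercase code points. This is a sketch that assumes Python's str.upper, str.title, and str.lower as stand-ins for the three case conversions. The list name and helper are made up for the example.

```python
# Lowercase members of the four digraph families (DŽ, LJ, NJ, DZ).
LOWERCASE_DIGRAPHS = [0x01C6, 0x01C9, 0x01CC, 0x01F3]

def three_forms(code_point: int) -> tuple:
    """Return the (uppercase, title case, lowercase) forms of a code point."""
    c = chr(code_point)
    return c.upper(), c.title(), c.lower()

for cp in LOWERCASE_DIGRAPHS:
    print(" ".join(f"U+{ord(ch):04X}" for ch in three_forms(cp)))
# U+01C4 U+01C5 U+01C6
# U+01C7 U+01C8 U+01C9
# U+01CA U+01CB U+01CC
# U+01F1 U+01F2 U+01F3
```

Each family occupies three consecutive code points, in the order uppercase, title case, lowercase.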

But wait, we have a Unicode code point for the dz digraph, but we don’t have one for the cs digraph or the dzs trigraph. What’s so special about dz?

These digraphs owe their existence in Unicode not to Hungarian but to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹

Just another situation where the world is more complicated than you think. You thought you understood uppercase and lowercase, but there’s another case in between that you didn’t know about.

Bonus chatter: The fact that dz is treated as a single letter in Hungarian means that if you search for “mad”, it should not match “madzag” (which means “string”) because the “dz” in “madzag” is a single letter and not a “d” followed by a “z”, any more than “lav” should match “law” just because the first part of the letter “w” looks like a “v”. This is another surprising result you get if you mistakenly use a literal substring search rather than a locale-sensitive one. We’ll look at locale-sensitive substring searches next time.
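
To make the “mad” versus “madzag” point concrete, here is a toy Python sketch. It is not a locale-sensitive search API. It is a hand-rolled illustration that assumes a greedy longest-match split into Hungarian letters is good enough, which may not hold for real text (a compound word can place two separate letters side by side so that they merely look like a digraph). The multigraph list and both function names are assumptions made for this example.

```python
# Hungarian letters written with more than one character (assumed list),
# sorted longest first so that "dzs" is tried before "dz".
HUNGARIAN_MULTIGRAPHS = sorted(
    ["cs", "dz", "dzs", "gy", "ly", "ny", "sz", "ty", "zs"],
    key=len,
    reverse=True,
)

def split_letters(word: str) -> list:
    """Greedily split a word into Hungarian letters (toy approximation)."""
    lowered = word.lower()
    letters = []
    i = 0
    while i < len(lowered):
        for m in HUNGARIAN_MULTIGRAPHS:
            if lowered.startswith(m, i):
                letters.append(m)
                i += len(m)
                break
        else:
            # No multigraph starts here, so take a single character.
            letters.append(lowered[i])
            i += 1
    return letters

def letter_search(needle: str, haystack: str) -> bool:
    """True if needle's letters occur as a contiguous run in haystack's."""
    n, h = split_letters(needle), split_letters(haystack)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

print(split_letters("madzag"))          # ['m', 'a', 'dz', 'a', 'g']
print("mad" in "madzag")                # True: literal substring search
print(letter_search("mad", "madzag"))   # False: "d" is not the letter "dz"
print(letter_search("madz", "madzag"))  # True
```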

¹ I got this information from the Unicode Standard, Version 15.0, Chapter 7: “Europe I”, Section 7.1: “Latin”, subsection “Latin Extended-B: U+0180-U+024F”, sub-subsection “Croatian Digraphs Matching Serbian Cyrillic Letters.”

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

23 comments

  • Miloš Milutinović 7 days ago

    “Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹”

    This is slightly wrong, Serbian also uses Latin – it uses both Latin and Cyrillic, per personal preference. To quote Wikipedia, “Serbian is practically the only European standard language whose speakers are fully functionally digraphic,[18] using both Cyrillic and Latin alphabets.”

  • Tudor Iordăchescu 1 week ago

    The EU law imposes the user's informed consent for the use of cookies, that's it.

    Some corporations/people that first complied with that law (I admit I'm a little bit fuzzy on the historical timeline) chose to implement the most annoying version possible as a form of malicious compliance, probably hoping that public outcry would trigger a revision of the law. I bet that 99% of people implementing such popups nowadays are just lazy and...

    • Bela Zsir 1 week ago · Edited

      Some first compliers were hoping for a public outcry, since then we have all the herd following, ie. the overall impact is a million times greater, and still no public outcry.
      What does that say about the public? …am I really the only one so annoyed with this?

  • David Faulks 1 week ago · Edited

    The reason these letters exist is because Unicode has a policy of 1-to-1 round trip encoding compatibility with older character sets, and Yugoslavia (keep in mind Unicode came out in 1991) used to have an 8-bit character set (YUSCII) that included these digraph letters. I’m not sure why so many commentators are focused on Hungarian.

  • Michael Chermside

    Your point about how Hungarians actually use the characters is excellent -- and remains so even if it turns out that SOME Hungarian speakers disagree with you on this. In my opinion, the authors of the Unicode standard should generally attempt to support oddities that are unique within a language but when native speakers disagree about an oddity, Unicode should err on the side of simplicity. (Of course dz may have been added to support...

    • Bela Zsir 1 week ago · Edited

      Thank you for the reply, sorry for the long rant. I was trying to make the point that after decades of NOT having all our characters, we don't want now to choke on too many.

      PS. You are right about the cookie thing being misplaced here, when I can, I speak out in the right place, but nobody cares, as if it were not a problem for anyone else in the world, in desperation I...

  • Kristof Roomp · Microsoft employee · Edited

    Ll and Ch were considered single letters in Spanish until they changed the rules in 1994. Spanish (traditional sort) treats them as single characters vs Modern sort.

  • Bela Zsir 2 weeks ago · Edited

    As a Hungarian, I'd like to add my two cents to the discussion.

    TLDR: I know that lately it's become a "woke" habit to look for 'oppressed victims' who have not the slightest idea that they’re supposed to be victims. But thank you, we Hungarians—and our language—have no need for digraphs; in fact, having them and anybody using them would be actually harmful.

    I began programming in the 80s, we Hungarians spent about two...

  • Jonas Barklund 2 weeks ago

    Raymond, did you try to make a distinction between digraph and diagraph, or was the latter a typo for digraph?

  • Álvaro González 2 weeks ago

    Funny. That same letter also used to exist in Spanish, together with ll (double L). Both were demoted in the mid 1990s so I guess they never made it into Unicode.
    I also think it was for the best. To look up things in a list or dictionary you had to know the language it was written in.

  • Chris Warrick

    > The fact that dz is treated as a single letter in Hungarian means that if you search for “mad”, it should not match “madzag” (which means “string”) because the “dz” in “madzag” is a single letter and not a “d” followed by a “z”

    This sounds mad to me. Polish has a fair share of digraphs and trigraphs, but I expect partially-typed digraphs not to change the search result. It is disorienting if the result...

    • Aram Dulyan

      Polish doesn’t treat those digraphs as letters of the alphabet though.

      In Czech, ch is a letter of the alphabet that comes after h. If a building has a vertical sign for bread, it would look like:
      Ch
      L
      É
      B

    • Daniel Chýlek

      I guess it is weird if you combine multiple languages on your system, but to me it’s entirely reasonable to expect that Czech Windows will not find a file containing ‘ch’ when you search for ‘c’ or ‘h’. That is how it works right now.

  • Jan Ringoš

    In Czech, we have a similar letter ‘ch’ but it never got assigned a single Unicode codepoint.

    It’s probably for the best.