October 31st, 2024

What has case distinction but is neither uppercase nor lowercase?

Raymond Chen

If you go exploring the Unicode Standard, you may be surprised to find that there are some characters that have case distinction yet are themselves neither uppercase nor lowercase.

Oooooh, spooky.

In other words, it is a character c with the properties that

toUpper(c) ≠ toLower(c), yet
c ≠ toUpper(c) and c ≠ toLower(c).

Congratulations, you found the mysterious third case: Title case.
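
The two conditions above can be checked directly. Here is a minimal Python sketch. The helper name `has_third_case` is made up for illustration, and the sketch assumes that Python's built-in `str.upper` and `str.lower` are acceptable stand-ins for toUpper and toLower.

```python
import unicodedata

def has_third_case(c: str) -> bool:
    """True if c has a case distinction yet equals neither of its case mappings."""
    upper, lower = c.upper(), c.lower()
    return upper != lower and c != upper and c != lower

# U+01F2 is the title case form of the dz digraph.
for c in ("D", "d", "\u01F2"):
    print(f"U+{ord(c):04X}", unicodedata.category(c), has_third_case(c))
```

Only the last character reports True, and its general category is Lt (Letter, titlecase) rather than Lu or Ll.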

There are some Unicode characters that occupy a single code point but represent two graphical symbols packed together. For example, the Unicode character ǳ (U+01F3 LATIN SMALL LETTER DZ) looks like two Unicode characters placed next to each other: dz (U+0064 LATIN SMALL LETTER D followed by U+007A LATIN SMALL LETTER Z).

These digraphs are characters in the alphabets of some languages, most notably Hungarian. In those languages, the digraph is considered a separate letter of the alphabet. For example, the first ten letters of the Hungarian alphabet are¹

a á b c cs d dz dzs e é

These digraphs (and one trigraph) have three forms.

Form	Result
Uppercase	Ǳ
Title case	ǲ
Lowercase	ǳ

Unicode includes four digraphs in its encoding.

Uppercase	Title case	Lowercase
Ǆ	ǅ	ǆ
Ǉ	ǈ	ǉ
Ǌ	ǋ	ǌ
Ǳ	ǲ	ǳ
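
For reference, the table can be regenerated from the lowercase forms alone. This is a sketch that assumes Python's `str.upper` and `str.title` apply the Unicode uppercase and title case mappings to these characters. The names `LOWERCASE_DIGRAPHS` and `case_forms` are just for illustration.

```python
import unicodedata

# Lowercase forms of the four digraphs: U+01C6, U+01C9, U+01CC, U+01F3
LOWERCASE_DIGRAPHS = "\u01C6\u01C9\u01CC\u01F3"

def case_forms(lower: str) -> tuple[str, str, str]:
    """Return the (uppercase, title case, lowercase) forms of a lowercase digraph."""
    return lower.upper(), lower.title(), lower

for lower in LOWERCASE_DIGRAPHS:
    for form in case_forms(lower):
        # Each form is a single code point, so ord() is safe here.
        print(f"U+{ord(form):04X} {form} {unicodedata.name(form)}")
```

Printing the character names makes the three forms easier to tell apart than the glyphs do, since many fonts render the uppercase and title case forms almost identically at small sizes.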

But wait, we have a Unicode code point for the dz digraph, but we don’t have one for the cs digraph or the dzs trigraph. What’s so special about dz?

These digraphs owe their existence in Unicode not to Hungarian but to Serbo-Croatian. Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹

Just another situation where the world is more complicated than you think. You thought you understood uppercase and lowercase, but there’s another case in between that you didn’t know about.

Bonus chatter: The fact that dz is treated as a single letter in Hungarian means that if you search for “mad”, it should not match “madzag” (which means “string”) because the “dz” in “madzag” is a single letter and not a “d” followed by a “z”, no more than “lav” should match “law” just because the first part of the letter “w” looks like a “v”. Another surprising result if you mistakenly use a literal substring search rather than a locale-sensitive one. We’ll look at locale-sensitive substring searches next time.
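
To see the pitfall concretely, here is a small Python sketch. A literal substring search compares code points, so it has no idea that the d and z in “madzag” form one Hungarian letter. The variable names are illustrative. This demonstrates only the problem; it is not a locale-sensitive search, and it is not a suggestion that Hungarian text should be written with U+01F3.

```python
import unicodedata

word = "madzag"          # spelled with two code points, d followed by z
print("mad" in word)     # the literal search reports a match

# Spelling the word with the single code point U+01F3 hides the d
# from a literal search...
packed = "ma\u01F3ag"
print("mad" in packed)

# ...but compatibility normalization (NFKC) unpacks U+01F3 into d + z again.
unpacked = unicodedata.normalize("NFKC", packed)
print(unpacked == word, "mad" in unpacked)
```

So a code-point-level search gives different answers depending on how the text happens to be encoded, and neither answer comes from knowing anything about the language.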

¹ I got this information from the Unicode Standard, Version 15.0, Chapter 7: “Europe I”, Section 7.1: “Latin”, subsection “Latin Extended-B: U+0180-U+024F”, sub-subsection “Croatian Digraphs Matching Serbian Cyrillic Letters.”

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

23 comments


Miloš Milutinović November 7, 2024

“Serbo-Croatian is written in both Latin script (Croatian) and Cyrillic script (Serbian), and these digraphs permit one-to-one transliteration between them.¹”

This is slightly wrong, Serbian also uses Latin – it uses both Latin and Cyrillic, per personal preference. To quote Wikipedia, “Serbian is practically the only European standard language whose speakers are fully functionally digraphic,[18] using both Cyrillic and Latin alphabets.”
Tudor Iordăchescu November 5, 2024

The EU law imposes the user’s informed consent for the use of cookies, that’s it.

Some corporations/people that first complied with that law (I admit I’m a little bit fuzzy on the historical timeline) chose to implement the most annoying version possible as a form of malicious compliance, probably hoping that public outcry would trigger a revision of the law. I bet that 99% of people implementing such popups nowadays are just lazy and follow the herd instead of researching what the law actually requires.
- Bela Zsir November 5, 2024 · Edited
  
  Some first compliers were hoping for a public outcry, since then we have all the herd following, ie. the overall impact is a million times greater, and still no public outcry.
  What does that say about the public? …am I really the only one so annoyed with this?
David Faulks November 4, 2024 · Edited

The reason these letters exist is because Unicode has a policy of 1-to-1 round trip encoding compatibility with older character sets, and Yugoslavia (keep in mind Unicode came out in 1991) used to have an 8-bit character set (YUSCII) that included these digraph letters. I’m not sure why so many commentators are focused on Hungarian.
Michael Chermside November 4, 2024

Your point about how Hungarians actually use the characters is excellent, and remains so even if it turns out that SOME Hungarian speakers disagree with you on this. In my opinion, the authors of the Unicode Standard should generally attempt to support oddities that are unique within a language, but when native speakers disagree about an oddity, Unicode should err on the side of simplicity. (Of course ǳ may have been added to support Serbo-Croatian, not Hungarian.)

However, your ire about cookie popups is misplaced. Computer technologists did not invent and impose them; an EU law mandated the cookie popups (and still does). I don’t even live in the EU and I still have to wade through thickets of cookie agreement popups. Perhaps you could persuade your politicians to change that.

- Bela Zsir November 4, 2024 · Edited
  
  Thank you for the reply, sorry for the long rant. I was trying to make the point that after decades of NOT having all our characters, we don’t want now to choke on too many.
  
  PS. You are right about the cookie thing being misplaced here. When I can, I speak out in the right place, but nobody cares, as if it were not a problem for anyone else in the world. In desperation I tried here to give just an example of what is a million times more disturbing than the made-up problem of the lack of some unneeded digraphs. (It reminds me of the story in the news about how the ‘Calculator Team’ (sic! / sick?) in Redmond solved a 20-year-old problem in the Windows calculator. This and all similar waste of human resources, i.e. just having a ‘Calculator Team’, makes me sad.)
  
  I am aware that it is a law, but nowhere in the law it is stated that half the page must be taken up by a cookie prompt graying out the rest of the page making it unusable till you answer a silly question. (I click always on OK, Yes, Agree, or whatever the ‘dont care just go’ button says to avoid the pointless further prompts)
  
  In my spare time, I volunteer at a centre for people with disabilities, doing what I can: electronics and programming, refurbishing the computers they receive as donations. ‘inventing’ alternative pointing devices. These people are blessed with a computer and the internet. Scrolling is easy for them without the help of their hands, but clicking a mouse on a randomly popping up window with a mess of buttons is each time a challenge, and it is growing.
  I am desperately trying to help these people with my browser extension scripts that auto-click away these bs, but they keep on coming, there are not two identically programmed out there. My only wish for you, Computer technologists, please help us, just make the implementation a standard. (for lawyer users you can leave all as is)
  If talking about the law, why is there no option for a legally binding statement built-in a browser that I will allow/deny all cookies for the next 10 years or whatever, just don’t ask me a million times of the same thing. I will go to a notary to sign this life-long statement with a wax seal if needed. What law it is if I can give away a billion-dollar asset at the click of a button, then I am an adult? But this cookie thing is so important that it will be reasked 547 times just over the next week.
  
  I began with ‘sorry for my rant’, now I went on, sorry again. Annoying, isn’t it? those cookie prompts are much more annoying for my friends in that center.
  
