September 5th, 2003

Case mapping on Unicode is hard

Occasionally, I’m asked, “I have to identify strings that are identical, case-insensitively.  How do I do it?”

The answer is, “Well, it depends. Whose case-mapping rules do you want to use?”

Sometimes the reply is, “I want this to be language-independent.”

Now you have a real problem.

Every locale has its own case-mapping rules. Many of them are in conflict with the rules for other locales. For example, which of the the following pairs of words compare case-insensitive equal?

1. gif GIF
2. Maße MASSE
3. Maße Masse
4. même MEME

Answers:

  1. no in Turkey, yes in US
  2. no in US, yes in Germany
  3. no in US, no in Germany, yes in Switzerland! (Though you would likely never see it written as “Maße” in Switzerland.)
  4. yes in France, no in Quebec!

(And I’ve heard that the capitalization rules for German are context-sensitive. Maybe that changed with the most recent spelling reform.) Unicode Technical Report #21 has more examples.

Just because you’re using Unicode doesn’t mean that all your language problems are solved. Indeed, the ability to represent characters in nearly all of the world’s languages means that you have more things to worry about, not less.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

0 comments

Discussion are closed.

Feedback