If you ask Windows to break the Korean string U+1100 U+1161 into graphemes, it will get broken up into two characters. U+1100 is HANGUL CHOSEONG KIYEOK (ᄀ) and U+1161 is HANGUL JUNGSEONG A (ᅡ).
Korean is written in the Hangul alphabet, and characters are composed of units known as jamo. In the above example, the two jamo combine to form the single syllable 가.
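To see that combination concretely, here is a minimal sketch using Python's standard `unicodedata` module. It only demonstrates Unicode normalization (NFC composes the two jamo into one precomposed code point); it says nothing about how any particular platform segments graphemes. The variable names are illustrative.

```python
import unicodedata

# The two conjoining jamo from the example above.
jamo = "\u1100\u1161"  # HANGUL CHOSEONG KIYEOK + HANGUL JUNGSEONG A

# NFC normalization composes them into a single precomposed syllable.
syllable = unicodedata.normalize("NFC", jamo)

print(len(jamo), [f"U+{ord(c):04X}" for c in jamo])
print(len(syllable), f"U+{ord(syllable):04X}", unicodedata.name(syllable))

# NFD goes the other way and recovers the original jamo sequence.
decomposed = unicodedata.normalize("NFD", syllable)
```

The jamo sequence and the precomposed syllable are canonically equivalent, which is part of why readers expect them to behave as one unit.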
If the two code points combine to form a single character, why are they treated as separate graphemes? ICU treats them as a single grapheme. iOS treats them as a single grapheme. Android treats them as a single grapheme. Everybody treats them as a single grapheme, except Windows. Why does Windows do things wrong?
This is another case where Windows adopted a standard before anybody else and ended up suffering from the first-mover curse. In this case, Windows is following the Korean standard KS X 1026 and treating the characters as separate. (Indeed, the case of U+1100 U+1161 is the example used in the specification.) So the question isn’t why Windows is doing things wrong. The question is why everybody else is doing things wrong.
Everybody else does things wrong because everybody else ignores the standard. But if you’re the only one doing things right, then you end up looking wrong.
In practice, therefore, there are two competing standards. You have the de jure standard, which says that the characters are separate, and the de facto standard, which says that the characters form a single grapheme.
If you are interoperating with other systems, you would be best served by following the conventions that those other systems follow when communicating with them. In practice, this will usually mean that you need to ignore what the Unicode and Korean standards committees recommend, and instead do “what everybody else is doing.” Since ICU is one of those “everybody else”s, you can switch to using ICU to decompose your strings.
Today is Hangul Day, a Korean national holiday commemorating the invention of the Hangul alphabet.
Bonus reading: Frequently Asked Questions about Korean and Unicode.
quite interesting
At first glance, I thought of ICU = Intensive Care Unit, especially at a time when counts of available hospital beds are on the national news regularly.
Took me a while to figure out that ICU = International Components for Unicode (http://site.icu-project.org), which I’ve never heard of before.
Raymond, with all due respect, I think you're incorrect here. The Unicode standard (UAX #29) is quite clear on where grapheme cluster boundaries are in Hangul syllables, and there is definitely no boundary between U+1100 (a jamo in the L category) and U+1161 (a jamo in the V category). Section 7.3 (1) of KS X 1026-1 confirms this explicitly. I've researched this a bit and think the most likely explanation is that Microsoft just misinterpreted the standard and shipped an implementation whose grapheme boundary (and text layout) logic violates both Unicode and KS X 1026-1.
Now things get a...
Thanks, Ralph. I am not the subject matter expert here; I’m recapturing information I received from the people who implemented it. From what I can tell, KS X 1026-1 was released in 2008, but UAX 29 did not address Hangul breaking until version 18 (2011). It’s possible that KS X 1026-1 was incorrectly implemented.
Thanks a lot!
Well, that’s the story I got from the globalization team. Maybe they got it wrong?
I think so. I’m happy to continue this thread by email if you’re motivated to get to the bottom of it. It’s relevant to current work in Druid (Rust GUI toolkit) to use platform capabilities to do text layout (unlike browsers, which today all use HarfBuzz), and this is currently an area where DirectWrite is variant.
But what is the logic behind this? Why are some jamo required to behave differently?
It boils down to coding efficiency. You can find a lot more detail in the History section of Hangul syllables on Wikipedia, but the simplified version is this. Modern Hangul has a relatively small number of jamo: 19 leading consonants, 21 vowels, and 27 (optional) trailing consonants. The simplest way to encode Hangul would be a code point for every jamo, which in UTF-8 would be 9 bytes for a typical LVT syllable. There are exactly 11,172 valid combinations of these (19 × 21 × 28, counting “no trailing consonant” as a 28th option), and, as of Unicode 2.0, these are encoded in the Hangul Syllables block at U+AC00..U+D7AF (the assigned syllables end at U+D7A3), so 3 bytes for each syllable.
I...
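The arithmetic in the comment above can be sketched in a few lines of Python. The base values and counts follow the Hangul syllable composition formula in the Unicode Standard; the function name `compose` is made up for this illustration, and the sketch assumes modern jamo only (no range checking, no archaic jamo).

```python
# Constants from the Unicode Standard's Hangul composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"

def compose(l, v, t=None):
    """Map modern L, V, and optional T jamo to the precomposed syllable."""
    l_index = ord(l) - L_BASE
    v_index = ord(v) - V_BASE
    t_index = ord(t) - T_BASE if t else 0  # index 0 means no trailing consonant
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

total = L_COUNT * V_COUNT * T_COUNT  # number of precomposed syllables

jamo_han = "\u1112\u1161\u11ab"      # L + V + T jamo for the syllable "han"
han = compose(*jamo_han)

# Compare UTF-8 sizes: three jamo versus one precomposed syllable.
jamo_bytes = len(jamo_han.encode("utf-8"))
syllable_bytes = len(han.encode("utf-8"))
print(total, f"U+{ord(han):04X}", jamo_bytes, syllable_bytes)
```

Because the mapping is pure arithmetic, converting between the two forms needs no lookup table.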
Sorry, but in short, it is not the standard method even in Korea.
Is my previous comment missing....?
---edited---
As a Korean, I agree that the johab method you mentioned is apparently a simpler way, but it is not the standard, and it never was an official single standard. It was only ever a 'de facto' standard for a brief period (around the 90s) or an 'alternative' method since then.
Korea submitted the following standard to unicode.org:
In addition, we must adhere to the following rules when representing in Johab Hangul syllable blocks.
1) As same as in the rules of representation format of Hangul letters (see 5.1), two or more
code positions...
I once read a story about Apple inventing Firewire. The story said Apple was unwilling to be the first to adopt the standard that it had created.
Are you sure this isn’t a case of competing standards rather than ICU ignoring or following a mere de facto standard? If I look at UAX #29, which contains Unicode’s rules for grapheme clustering, it appears to state pretty clearly: “Do not break Hangul syllable sequences.”
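For reference, the Hangul rules in UAX #29 (GB6, GB7, and GB8) are small enough to sketch directly. The following Python is a rough illustration of those three rules only: it ignores every other grapheme cluster rule (combining marks, CR LF, emoji, and so on), the code point ranges are written from the Hangul_Syllable_Type property as I understand it, and the function names are hypothetical. It is not ICU's implementation, nor anybody else's.

```python
def hangul_type(ch):
    """Approximate Hangul_Syllable_Type for one character (None if not Hangul)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"    # leading consonant jamo
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"    # vowel jamo
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"    # trailing consonant jamo
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllable: LV if it has no trailing consonant, else LVT.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None

# For each type on the left of a boundary, the types that must not be split off.
NO_BREAK = {
    "L": {"L", "V", "LV", "LVT"},  # GB6
    "LV": {"V", "T"},              # GB7
    "V": {"V", "T"},               # GB7
    "LVT": {"T"},                  # GB8
    "T": {"T"},                    # GB8
}

def hangul_clusters(text):
    """Split text into clusters, applying only the Hangul rules GB6 to GB8."""
    clusters = []
    for ch in text:
        prev_type = hangul_type(clusters[-1][-1]) if clusters else None
        if hangul_type(ch) in NO_BREAK.get(prev_type, ()):
            clusters[-1] += ch      # no boundary: extend the current cluster
        else:
            clusters.append(ch)     # boundary: start a new cluster
    return clusters

result = hangul_clusters("\u1100\u1161")  # the L + V pair from the article
print(len(result))
```

Under these rules the L followed by V pair from the article stays in a single cluster, which matches the behavior the article attributes to ICU, iOS, and Android.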
Eight more years until I can manifest everything for UTF-8 character set and drop most of this LPWSTR headache. (Another first-mover penalty Windows devs still pay). https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page