October 9th, 2020

A consequence of being the first to adopt a standard is that you may end up being the only one to adopt it: The sad story of Korean jamo

If you ask Windows to break the Korean string U+1100 U+1161 into graphemes, it will get broken up into two characters. U+1100 is HANGUL CHOSEONG KIYEOK (ᄀ) and U+1161 is HANGUL JUNGSEONG A (ᅡ).

Korean is written in the Hangul alphabet, and characters are composed of units known as jamo. In the above example, the two jamo combine to form the single syllable 가.
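As a small illustration (a sketch in Python, not something from the tooling discussed here), the standard `unicodedata` module can name the two jamo and show the syllable they compose to under NFC normalization. This demonstrates canonical composition, which is related to grapheme breaking but is not the same thing.

```python
import unicodedata

# The two-jamo string from the example above.
jamo = "\u1100\u1161"

# Print each code point with its official Unicode name.
for ch in jamo:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+1100 HANGUL CHOSEONG KIYEOK
# U+1161 HANGUL JUNGSEONG A

# NFC normalization composes the pair into one precomposed syllable.
composed = unicodedata.normalize("NFC", jamo)
print(f"U+{ord(composed):04X} {unicodedata.name(composed)}")
# U+AC00 HANGUL SYLLABLE GA

# NFD goes the other way and recovers the original jamo sequence.
decomposed = unicodedata.normalize("NFD", composed)
```

Both forms are canonically equivalent, so software that compares or renders them is expected to treat them as the same text.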

If the two code points combine to form a single character, why are they treated as separate graphemes? ICU treats them as a single grapheme. iOS treats them as a single grapheme. Android treats them as a single grapheme. Everybody treats them as a single grapheme, except Windows. Why does Windows do things wrong?

This is another case where Windows adopted a standard before anybody else and ended up suffering from the first-mover curse. In this case, Windows is following the Korean standard KS X 1026 and treating the characters as separate. (Indeed, the case of U+1100 U+1161 is the example used in the specification.) So the question isn’t why Windows is doing things wrong. The question is why everybody else is doing things wrong.

Everybody else does things wrong because everybody else ignores the standard. But if you’re the only one doing things right, then you end up looking wrong.

In practice, therefore, there are two competing standards. You have the de jure standard, which says that the characters are separate, and the de facto standard, which says that the characters form a single grapheme.

If you are interoperating with other systems, you are best served by following the conventions that those systems follow. In practice, this will usually mean that you need to ignore what the Unicode and Korean standards committees recommend, and instead do “what everybody else is doing.” Since ICU is one of those “everybody else”s, you can switch to using ICU to break your strings into graphemes.
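To make the "single grapheme" behavior concrete, here is a minimal Python sketch of the Hangul-specific rules in UAX #29 (GB6 through GB8), which are the rules that keep a leading consonant, vowel, and trailing consonant together. This is not ICU's API or implementation. The function names `hangul_type` and `hangul_clusters` are made up for illustration, and the sketch ignores every other grapheme rule (combining marks, ZWJ, emoji, CR LF), so it is only meaningful for plain Hangul input.

```python
def hangul_type(ch):
    """Return the Hangul syllable type of ch (L, V, T, LV, LVT) or None."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"    # leading consonant jamo
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"    # vowel jamo
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"    # trailing consonant jamo
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None


# For each left-hand type, the right-hand types that must NOT be split off.
NO_BREAK = {
    "L": {"L", "V", "LV", "LVT"},   # GB6
    "LV": {"V", "T"},               # GB7
    "V": {"V", "T"},                # GB7
    "LVT": {"T"},                   # GB8
    "T": {"T"},                     # GB8
}


def hangul_clusters(text):
    """Split text into clusters using only the Hangul rules above."""
    clusters = []
    for ch in text:
        if clusters:
            prev = hangul_type(clusters[-1][-1])
            if hangul_type(ch) in NO_BREAK.get(prev, ()):
                clusters[-1] += ch   # no boundary: extend current cluster
                continue
        clusters.append(ch)          # boundary: start a new cluster
    return clusters


print(len(hangul_clusters("\u1100\u1161")))  # 1
```

Under these rules, U+1100 U+1161 comes out as one cluster, which matches the behavior described above for ICU, iOS, and Android. For production use, a full UAX #29 implementation such as ICU's break iterator is the safer choice.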

Today is Hangul Day, a Korean national holiday commemorating the invention of the Hangul alphabet.

Bonus reading: Frequently Asked Questions about Korean and Unicode.

Topics
History

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

13 comments


  • Andrew Dinh

    quite interesting

  • Jonathan Barner (Microsoft employee)

    At first glance, I thought of ICU = Intensive Care Unit, especially at a time when counts of available hospital beds are on the national news regularly.
    Took me a while to figure out that ICU = International Components for Unicode (http://site.icu-project.org), which I’ve never heard of before.

  • Raph Levien

    Raymond, with all due respect, I think you're incorrect here. The Unicode standard (UAX #29) is quite clear on where grapheme cluster boundaries are in Hangul syllables, and there is definitely no boundary between U+1100 (a jamo in the L category) and U+1161 (a jamo in the V category). Section 7.3 (1) of KS X 1026-1 confirms this explicitly. I've researched this a bit and think the most likely explanation is that Microsoft just misinterpreted the standard and shipped an implementation that has grapheme boundary (and text layout) logic in violation of both Unicode and KS X 1026-1.

    Now things get a...

    • Raymond Chen (Microsoft employee, Author)

      Thanks, Raph. I am not the subject matter expert here; I’m recapturing information I received from the people who implemented it. From what I can tell, KS X 1026-1 was released in 2008, but UAX 29 did not address Hangul breaking until version 18 (2011). It’s possible that KS X 1026-1 was incorrectly implemented.

    • أنت أفضل أم في العالم كله

      Thanks a lot!

    • Raymond Chen (Microsoft employee, Author)

      Well, that’s the story I got from the globalization team. Maybe they got it wrong?

      • Raph Levien

        I think so. I’m happy to continue this thread by email if you’re motivated to get to the bottom of it. It’s relevant to current work in Druid (Rust GUI toolkit) to use platform capabilities to do text layout (unlike browsers, which today all use HarfBuzz), and this is currently an area where DirectWrite behaves differently from the other platforms.

    • أنت أفضل أم في العالم كله

      But what is the logic behind this? Why are some jamo required to behave differently?

      • Raph Levien

        It boils down to coding efficiency. You can find a lot more detail in the History section of Hangul syllables on Wikipedia, but the simplified version is this. Modern Hangul has a relatively small number of jamo: 19 leading consonants, 21 vowels, and 27 (optional) trailing consonants. The simplest way to encode Hangul would be a code point for every jamo, which in UTF-8 would be 9 bytes for a typical LVT syllable. There are exactly 11,172 valid combinations of these (19 × 21 × 28, counting "no trailing consonant" as a 28th option), and, as of Unicode 2.0, these are encoded at U+AC00..U+D7A3 within the 11,184-code-point block U+AC00..U+D7AF, so 3 bytes for each syllable.
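The counts in this comment can be checked with a few lines of Python. This is only an arithmetic sketch using the standard library. It shows that the jamo counts multiply out to 11,172 precomposed syllables ending at U+D7A3, that the enclosing block U+AC00..U+D7AF spans 11,184 code points, and that an LVT syllable takes 9 bytes as separate jamo versus 3 bytes precomposed in UTF-8.

```python
import unicodedata

LEADING, VOWELS, TRAILING = 19, 21, 27

# The trailing consonant is optional, so "none" counts as one more choice.
syllables = LEADING * VOWELS * (TRAILING + 1)
print(syllables)                      # 11172

# Precomposed syllables start at U+AC00 and are laid out contiguously.
last_syllable = 0xAC00 + syllables - 1
print(f"U+{last_syllable:04X}")       # U+D7A3

# The Unicode block reserved for them is slightly larger.
block_size = 0xD7AF - 0xAC00 + 1
print(block_size)                     # 11184

# UTF-8 size of one LVT syllable, as jamo and as a precomposed code point.
lvt_jamo = "\u1100\u1161\u11A8"
jamo_bytes = len(lvt_jamo.encode("utf-8"))
composed_bytes = len(unicodedata.normalize("NFC", lvt_jamo).encode("utf-8"))
print(jamo_bytes, composed_bytes)     # 9 3
```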

        I...

      • silversoul silversoul

        Sorry, but in short, it is not the standard method even in Korea.
        Is my previous comment missing?

        ---edited---
        As a Korean, I agree the johab method you mentioned is apparently a simpler way, but it is not standard, and it never was an official single standard. It was always either a 'de facto' standard for a brief period (until the 90s) or an 'alternative' way since then.

        Korea submitted the following standard to unicode.org:

        In addition, we must adhere to the following rules when representing in Johab Hangul syllable blocks.
        1) As same as in the rules of representation format of Hangul letters (see 5.1), two or more
        code positions...

  • Flux

    I once read a story about Apple inventing Firewire. The story said Apple was unwilling to be the first to adopt the standard that it had created.

  • Elmar

    Are you sure this isn’t a case of competing standards rather than ICU ignoring or following a mere de facto standard? If I look at UAX #29, which contains Unicode’s rules for grapheme clustering, it appears to state pretty clearly: “Do not break Hangul syllable sequences.