Use Unicode!

Shawn Steele

Recently, updates to a national encoding standard changed how it maps to and from Unicode.  Those kinds of changes lead to data corruption.  I figured I’d take this opportunity to remind folks of the benefits of “using Unicode!”

Unicode Benefits

  • All modern operating systems are natively Unicode.  Any other encoding requires conversion, and the resulting performance impact.
  • Unicode supports characters for all scripts and languages.
  • Avoids encoding disparities between platforms and libraries.
  • Reduces the number of replacement characters in data, like ?
  • Reduces the amount of “mojibake” or “gobbledygook” encountered from mismatched decoding.
  • Consistent use of Unicode avoids errors from mis-tagged encodings, or missing encoding declarations.
  • Legacy encodings are rarely used nowadays and will only become rarer in the future.
  • Using Unicode increases the reliability of interoperability between systems and services.
  • UTF-8 is a superset of ASCII.

Performance

Modern operating systems are Unicode, usually UTF-16 or UTF-8 internally.  Converting to or from other codepages, particularly table-based ones, consumes resources and takes time.  Even if there’s a difference between UTF-8 data and a UTF-16 platform, that conversion is algorithmic and heavily optimized to be fast.  You can’t get that from other “national” codepages.  Because they require tables, which have to be mapped into memory, or at least into the CPU cache, their conversions are slower.

On Windows, it is much faster to encode or decode data using UTF-8 than with the legacy Windows-1252 encoding.
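
To make “algorithmic” concrete, here is a minimal Python sketch of UTF-8 encoding done with nothing but comparisons and bit shifts, with no lookup table.  The function name utf8_encode is made up for illustration, and this is not how any particular operating system implements the conversion; it only shows why no table is needed.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 using only arithmetic."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")
    if code_point < 0x80:        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Cross-check the sketch against Python's built-in UTF-8 codec.
for cp in (0x41, 0xE9, 0x2122, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

A table-based codepage, by contrast, needs a mapping entry for each byte or byte sequence it supports.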

Data Integrity

Consistent use of Unicode is the most reliable way to protect the integrity of your users’ data.  Every other codepage is subject to potential data corruption in various ways.

Missing Codepoints – The most obvious source of corruption is when data is encountered that doesn’t fit in the encoding’s character set.  The characters that don’t fit then get replaced with ? or other replacement characters.  Even seemingly simple scenarios like customer names can run afoul of this.  A customer prefers Théo to Theo, but the application was only expecting ASCII.  Then we end up with Th?o instead, pleasing no one.
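
The Théo case is easy to reproduce.  The short Python sketch below uses the standard codecs; the variable names are only illustrative.

```python
name = "Th\u00e9o"  # "Théo", with é as the single code point U+00E9

# Asking the ASCII codec to substitute what it cannot represent loses data.
lossy = name.encode("ascii", errors="replace")   # b'Th?o'

# In strict mode (the default) the codec refuses instead of corrupting.
try:
    name.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# UTF-8 can represent the name, so the round trip is lossless.
round_trip = name.encode("utf-8").decode("utf-8")
```

Once the ? has been written, the original character cannot be recovered from the stored bytes.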

Encoding Stability – Over the years, nearly all of the encodings have undergone some churn.  For some it is relatively minor.  Others have added major blocks of codepoints, or corrected errors in the assignment of the codepoints.  In such cases, data encoded in one version of an encoding may not decode the same way in a slightly different version, such as after an OS update.  Especially in the modern world with computers distributed around the world, it’s usually impossible to guarantee consistent versions are being used, even when confined to the same platform.

Platform Differences – As encodings evolved, platforms have adopted the updates at differing paces.  Some may choose the most recent and perceived “correct” encoding to keep up to date, at the expense of data compatibility.  Others may update slowly, or never, choosing data stability for their customers.  Additionally, there are the occasional implementation errors and other factors.  The net result is that most platforms and browsers have minor differences in their character encodings, in particular for the larger character sets.  This leads to the occasional random unexpected glyph or ? in decoded data.

Most applications have to interact with other software libraries, platforms, or operating systems, so these sorts of corruptions are unfortunately common.

Encoding (Mis)tagging – Similarly, to properly exchange encoded data, the sender and recipient must agree on the encoding.  Common errors are using Windows-1252, ISO-8859-1, and/or ASCII nearly interchangeably for Latin.  Perhaps writing a document on a system that uses one encoding and then hosting it on a server that uses a different default.  Since they’re all Latin based and the ASCII range is similar, this can be difficult to notice.  I’m sure we’ve all seen this type of corruption, commonly showing up as errors in “quoted” text, where the quotes are replaced with gibberish characters.  Or a trademark or other symbol being mis-mapped.
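
The gibberish comes from decoding bytes with a different encoding than the one that produced them.  Here is a small Python sketch of that mismatch; mis_decode is a hypothetical helper name, and the codec names are Python’s.

```python
def mis_decode(text: str, actual: str = "utf-8", assumed: str = "cp1252") -> str:
    """Encode with one encoding, then (wrongly) decode with another."""
    return text.encode(actual).decode(assumed)


# UTF-8 bytes mistaken for Windows-1252:
garbled_name = mis_decode("Th\u00e9o")   # 'ThÃ©o'
garbled_tm = mis_decode("\u2122")        # '™' becomes 'â„¢'
garbled_quote = mis_decode("\u201c")     # '“' becomes 'â€œ'

# Windows-1252 and ISO-8859-1 agree on ASCII but not on bytes 0x80-0x9F:
as_cp1252 = b"\x99".decode("cp1252")     # '™'
as_latin1 = b"\x99".decode("latin-1")    # U+0099, an invisible control character
```

Because the ASCII range decodes identically in every case here, plain English text hides the problem until the first accented letter or typographic symbol shows up.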

A related problem is protocols that don’t really provide for the labeling of the encoding at all.  Or applications that don’t provide an encoding even when permitted.

Usage on the Internet

Unicode has become the most common encoding to store data, particularly on the Internet.  Web metrics show that over 98% of web pages are encoded in UTF-8.  Much of the remainder is in one of the common Latin based defaults (ASCII, ISO-8859-1, or Windows-1252).  Even in the Asian regions that had a historical need for national character sets to support their large character repertoire, over 95% of the web data in those regions is in UTF-8.

UTF-8 Versus ASCII

ASCII is a special case, because it’s pretty common.  And sometimes it’s mixed up with Windows-1252 or ISO-8859-1.  There’s no advantage to using ASCII over UTF-8.  UTF-8 is a superset of ASCII, so you get all of the benefits of ASCII, plus the ability to encode any other character that the computer can use.  If you are looking for a single byte character set to support Latin scripts, then pick UTF-8.  It reduces ambiguity, is about the same size, and will still allow encoding every character you might unexpectedly encounter.
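
A quick way to check the superset claim in Python, using only the standard codecs (the sample strings are arbitrary):

```python
ascii_text = "Use Unicode!"

# Pure ASCII text produces exactly the same bytes in both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Bytes written as ASCII can be read back as UTF-8 unchanged.
read_back = ascii_text.encode("ascii").decode("utf-8")

# Only characters outside ASCII cost extra bytes: é takes two, the rest one each.
utf8_bytes = "Th\u00e9o".encode("utf-8")   # b'Th\xc3\xa9o', 5 bytes for 4 characters
```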

Protocols and Data Interchange

Where you really want to Use Unicode! is in protocols.  Anywhere you’re exchanging data with another system or component, you want a clean Unicode interface.  Oftentimes this might be UTF-8 for data on the wire, or UTF-16 for internal API calls on a UTF-16 system.  Specifying Unicode for the interfaces removes any ambiguity about which encoding is necessary.  And it reduces the risk of mis-tagged data sneaking through one or more layers of the system.  It also allows any character that users might need to use, even if you initially intend the software to only be used in a Latin script market.
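
As a rough sketch of what a clean Unicode interface can look like, the Python below fixes the wire format to UTF-8 JSON and rejects anything else at the boundary.  The function names to_wire and from_wire and the message shape are assumptions made up for this example, not part of any real protocol.

```python
import json


def to_wire(message: dict) -> bytes:
    """Serialize a message; the wire format is always UTF-8."""
    return json.dumps(message, ensure_ascii=False).encode("utf-8")


def from_wire(payload: bytes) -> dict:
    """Parse a message; bytes that are not valid UTF-8 are rejected."""
    # Strict decoding raises UnicodeDecodeError rather than guessing
    # at a legacy codepage or inserting replacement characters.
    return json.loads(payload.decode("utf-8", errors="strict"))


message = {"customer": "Th\u00e9o"}
assert from_wire(to_wire(message)) == message
```

Failing loudly on input such as Windows-1252 bytes keeps mis-tagged data from travelling deeper into the system, where it is much harder to trace.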

Legacy “A” APIs

The legacy Windows “ANSI” (a misnomer) APIs have severe limitations compared to their Unicode counterparts.  Historically, these used the system codepage prior to Windows support of Unicode.  And they continue to be used by apps that don’t use Unicode.

These legacy APIs have different behavior, depending on the system codepage settings.  Those differences can even lead to data loss when shared between systems in different regions.  They usually internally convert to Unicode, then back to the system encoding, making them slower than the native “W” Unicode APIs.  And they often don’t provide the functionality of their preferred Unicode counterparts.

Legacy Data

Now that I’ve convinced everyone to Use Unicode!, I do recognize that there is a lot of legacy data that isn’t stored in Unicode.  And that many processes may depend on older systems that were designed for a different encoding.  Eventually nearly all of that data will move to Unicode – because the entire industry is continuing to move in that direction.  In the meantime, try to confine those legacy data structures behind Unicode enabled interfaces.  That way at least the problem is contained, and when it does eventually get updated, the number of additional pieces that need to be updated is smaller.
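
One way to picture confining legacy data is the hypothetical Python wrapper below.  The class name LegacyCustomerStore, the in-memory dictionary standing in for an old database, and the choice of Windows-1252 are all assumptions for illustration.  Callers only ever see Unicode strings; the legacy encoding is named in exactly one place.

```python
class LegacyCustomerStore:
    """Hypothetical store: Unicode strings in and out, legacy bytes inside."""

    def __init__(self, legacy_encoding: str = "cp1252"):
        self._encoding = legacy_encoding
        self._records = {}  # record id -> bytes in the legacy encoding

    def put(self, record_id: int, name: str) -> None:
        # Strict encoding fails loudly (UnicodeEncodeError) instead of
        # silently storing '?' for characters the legacy codepage lacks.
        self._records[record_id] = name.encode(self._encoding)

    def get(self, record_id: int) -> str:
        return self._records[record_id].decode(self._encoding)


store = LegacyCustomerStore()
store.put(1, "Th\u00e9o")
assert store.get(1) == "Th\u00e9o"
```

When the storage finally moves to Unicode, only the inside of this class has to change; everything that calls put and get is already Unicode.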

Use Unicode!

5 comments

  • deadcream

    Is there any difference between using “W” APIs and using “A” APIs with UTF-8 via

    <activeCodePage>UTF-8</activeCodePage>

    in manifest file (assuming that you target Windows 10)?

    • Shawn Steele (Microsoft employee)

      Internally, Windows is primarily UTF-16. Therefore the “A” APIs, even in UTF-8 mode, first do a conversion to UTF-16. Since it’s algorithmic, that conversion is faster with UTF-8, even compared to Windows-1252. The conversion still has to happen though. Of course, there’s less risk of loss or confusion from missing or replacement characters when UTF-8 is used.

      Additionally, sometimes the behavior of the A and W APIs varies for historical reasons. And newer concepts have sometimes been enabled in the W forms of the APIs and not the legacy A APIs.

      In short: If you are converting to UTF-16 and using WCHAR internally anyway, or if it’s fairly easy to convert from UTF-8 to UTF-16, then the W APIs are preferred. When it’s more convenient to use UTF-8 and the A APIs, then that is acceptable. If you happen to be using some of the legacy A APIs (with or without UTF-8), and are running into some of the historical cases where behavior is unexpected, then you may prefer (or need) to convert and use the W APIs. Typically, the documentation will call out differences in behavior between the W and A APIs if there are any.

  • Daniel Smith

    Wouldn’t life be much simpler if we just used UTF-8 everywhere?

    It really bugs me that emerging technologies like WebAssembly have the opportunity of a fresh start, and could mandate strict UTF-8 for all APIs, but instead they’ve come up with the most convoluted design in order to accommodate languages that use UTF-16 strings. If WebAssembly just used strict UTF-8 throughout the ecosystem, things would be so much simpler.

    • Shawn Steele (Microsoft employee)

      It depends?

      Much of the operating system stuff is often UTF-16. And lots of the language processing data is natively UTF-16. (Like ICU and Unicode data). So, you may often have to convert at some layer to get basic linguistic functionality, like sorting a directory.

      Certainly UTF-8 is a lot more portable in many plain-text environments.

      Personally, I’d be thrilled if the other 5-10% or whatnot of the legacy codepage usage went away. That causes most of the random and unexpected pain with encodings. UTF-8 versus UTF-16 may be annoying, but it rarely introduces data corruption due to confusion about which codepage is being used or different interpretations, which is common with other encodings.

  • W L

    If C++ had a good std::string (like Qt does with QString), then we could save time and not think about this problem.
