Recently, updates to a national encoding standard changed its behavior when mapping to and from Unicode. Changes like that lead to data corruption, so I figured I’d take this opportunity to remind folks of the benefits of “using Unicode!”
- All modern operating systems are natively Unicode. Any other encoding requires conversion, with the resulting performance cost.
- Unicode supports characters for all scripts and languages.
- Avoids encoding disparities between platforms and libraries.
- Reduces the number of replacement characters in data, like ?
- Reduces the amount of “mojibake” or “gobbledygook” encountered from mismatched decoding.
- Consistent use of Unicode avoids errors from mis-tagged encodings, or missing encoding declarations.
- Legacy encodings are rarely used nowadays and will only become rarer in the future.
- Using Unicode increases the reliability of interoperability between systems and services.
- UTF-8 is a superset of ASCII.
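The “mojibake” problem in particular is easy to reproduce. A minimal Python sketch (the sample string is illustrative): decoding UTF-8 bytes with the wrong codepage scrambles every non-ASCII character.

```python
text = "résumé"
utf8_bytes = text.encode("utf-8")          # b'r\xc3\xa9sum\xc3\xa9'
# Decoding those same bytes as Windows-1252 produces classic mojibake:
# each two-byte UTF-8 sequence turns into two unrelated characters.
garbled = utf8_bytes.decode("windows-1252")
print(garbled)  # rÃ©sumÃ©
```

Every é becomes Ã©, which is exactly the kind of gibberish you see when an encoding is mis-tagged somewhere in the pipeline.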
Modern operating systems are Unicode, usually UTF-16 or UTF-8 internally. Converting to and from other codepages, particularly table-based ones, consumes resources and takes time. Even when UTF-8 data meets a UTF-16 platform, that conversion is algorithmic and heavily optimized to be fast. You can’t get that from other “national” codepages: because they require lookup tables, which have to be loaded into memory, or at least into the CPU cache, their conversions are slower.
On Windows, it is much faster to encode or decode data using UTF-8 than with the legacy Windows-1252 encoding.
Consistent use of Unicode is the most reliable way to protect the integrity of your users’ data. Every other codepage is subject to potential data corruption in various ways.
Missing Codepoints – The most obvious source of corruption is data containing characters that don’t fit in the encoding’s character set. Those characters get replaced with ? or other replacement characters. Even seemingly simple scenarios like customer names can run afoul of this. A customer prefers Théo to Theo, but the application was only expecting ASCII, so we end up with Th?o instead, pleasing no one.
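The Théo scenario can be sketched in a couple of lines of Python. Forcing the name through ASCII with replacement is effectively what a legacy, ASCII-only code path does:

```python
name = "Théo"
# errors="replace" substitutes '?' for anything outside ASCII,
# mimicking what an ASCII-only pipeline does to the data.
corrupted = name.encode("ascii", errors="replace").decode("ascii")
print(corrupted)  # Th?o
```

Note that the corruption is silent: no exception is raised, and the original é is unrecoverable.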
Encoding Stability – Over the years, nearly all of the encodings have undergone some churn. For some it is relatively minor; others have added major blocks of codepoints or corrected errors in the assignment of codepoints. In such cases, data encoded in one version of an encoding may not decode the same way in a slightly different version, such as after an OS update. Especially in the modern world, with computers distributed around the globe, it’s usually impossible to guarantee that consistent versions are being used, even when confined to a single platform.
Platform Differences – As encodings evolved, platforms adopted the updates at differing paces. Some chose the most recent, perceived-“correct” version of an encoding to keep up to date, at the expense of data compatibility. Others updated slowly, or never, choosing data stability for their customers. Add in the occasional implementation error and other factors, and the net result is that most platforms and browsers have minor differences in their character encodings, particularly for the larger character sets. This leads to the occasional random unexpected glyph or ? in decoded data.
Most applications have to interact with other software libraries, platforms, or operating systems, so these sorts of corruptions are unfortunately common.
Encoding (Mis)tagging – Similarly, to properly exchange encoded data, the sender and recipient must agree on the encoding. A common error is using Windows-1252, ISO-8859-1, and ASCII nearly interchangeably for Latin-script text: perhaps writing a document on a system that uses one encoding and then hosting it on a server that declares a different default. Since they’re all Latin-based and share the ASCII range, this can be difficult to notice. I’m sure we’ve all seen this type of corruption, commonly showing up in “quoted” text, where the quotes are replaced with gibberish characters, or a trademark or other symbol is mis-mapped.
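The “quoted” case is easy to demonstrate in Python. Curly quotes saved in Windows-1252 but read back as UTF-8 (a common mis-tagging) are simply invalid bytes to the UTF-8 decoder:

```python
text = "“quoted”"
cp1252_bytes = text.encode("windows-1252")   # b'\x93quoted\x94'
# A reader that assumes UTF-8 can't interpret the 0x93/0x94 bytes,
# so each curly quote becomes the U+FFFD replacement character.
result = cp1252_bytes.decode("utf-8", errors="replace")
print(result)  # �quoted�
```

Depending on the reader’s error handling, you get either replacement characters like this or a hard decode failure; neither preserves the author’s intent.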
A related problem is protocols that don’t provide for labeling the encoding at all, or applications that don’t declare an encoding even when permitted to.
Usage on the Internet
Unicode has become the most common encoding for stored data, particularly on the Internet. Web metrics show that over 98% of web pages are encoded in UTF-8. Much of the remainder is in one of the common Latin-based defaults (ASCII, ISO-8859-1, or Windows-1252). Even in the Asian regions that had a historical need for national character sets to support their large character repertoires, over 95% of the web data is in UTF-8.
UTF-8 Versus ASCII
ASCII is a special case because it’s pretty common, and it’s sometimes mixed up with Windows-1252 or ISO-8859-1. There’s no advantage to using ASCII over UTF-8: UTF-8 is a superset of ASCII, so you get all of the benefits of ASCII, plus the ability to encode any other character the computer can use. If you’re looking for a single-byte character set to support Latin scripts, pick UTF-8 instead. It reduces ambiguity, is about the same size, and can still encode every character you might unexpectedly encounter.
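The superset relationship is easy to verify in Python: any pure-ASCII string produces byte-for-byte identical output in both encodings, and only UTF-8 survives the first unexpected character.

```python
ascii_text = "plain ASCII text"
# Identical bytes: ASCII data is already valid UTF-8.
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# UTF-8 keeps working where ASCII gives up:
print("café".encode("utf-8"))   # b'caf\xc3\xa9'
try:
    "café".encode("ascii")
except UnicodeEncodeError:
    print("ASCII cannot represent 'é'")
```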
Protocols and Data Interchange
Where you really want to Use Unicode! is in protocols. Anywhere you’re exchanging data with another system or component, you want a clean Unicode interface. Often this means UTF-8 for data on the wire, or UTF-16 for internal API calls on a UTF-16 system. Specifying Unicode for the interfaces removes any ambiguity about which encoding is expected, and it reduces the risk of mis-tagged data sneaking through one or more layers of the system. It also allows any character that users might need, even if you initially intend the software to be used only in a Latin-script market.
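As a sketch of that principle, here is a hypothetical boundary that pins the wire format to UTF-8 on both sides (the payload and field names are made up for illustration):

```python
import json

payload = {"name": "Sören", "city": "Kraków"}

# Sender: the interface contract says "JSON, UTF-8", so there is no ambiguity.
wire = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# Receiver: decodes with the same declared encoding; the round trip is lossless.
received = json.loads(wire.decode("utf-8"))
assert received == payload
```

Because both sides name the encoding explicitly, there is no system default or locale setting that can silently change the interpretation of the bytes.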
Legacy “A” APIs
The legacy Windows “ANSI” (a misnomer) APIs have severe limitations compared to their Unicode counterparts. Historically, they used the system codepage before Windows supported Unicode, and they continue to be used by apps that haven’t adopted Unicode.
These legacy APIs behave differently depending on the system codepage settings, and those differences can even lead to data loss when data is shared between systems in different regions. They usually convert internally to Unicode and back to the system encoding, making them slower than the native “W” Unicode APIs. And they often don’t provide the full functionality of their preferred Unicode counterparts.
Now that I’ve convinced everyone to Use Unicode!, I do recognize that there is a lot of legacy data that isn’t stored in Unicode, and that many processes may depend on older systems that were designed for a different encoding. Eventually nearly all of that data will move to Unicode, because the entire industry is continuing to move in that direction. In the meantime, try to confine those legacy data structures behind Unicode-enabled interfaces. That way at least the problem is contained, and when it does eventually get updated, the number of additional pieces that need to change is smaller.