{"id":190,"date":"2023-09-12T14:56:16","date_gmt":"2023-09-12T21:56:16","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/i18n\/?p=190"},"modified":"2023-09-12T16:43:26","modified_gmt":"2023-09-12T23:43:26","slug":"use-unicode","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/i18n\/use-unicode\/","title":{"rendered":"Use Unicode!"},"content":{"rendered":"<p>Recently there were some updates that changed the behavior of a national encoding standard, changing the behavior when mapping to and from Unicode.\u00a0 Those kinds of changes lead to data corruption.\u00a0 I figured I&#8217;d take this opportunity to remind folks of the benefits of &#8220;using Unicode!&#8221;<\/p>\n<h2>Unicode Benefits<\/h2>\n<ul>\n<li>All modern operating systems are natively Unicode.\u00a0 Any other encoding requires conversion, and the resulting performance impact.<\/li>\n<li>Unicode supports characters for all scripts and languages.<\/li>\n<li>Avoids encoding disparities between platforms and libraries.<\/li>\n<li>Reduces the number of replacement characters in data, like ?<\/li>\n<li>Reduces the amount of &#8220;mojibake&#8221; or &#8220;gobbledygook&#8221; encountered from mismatched decoding.<\/li>\n<li>Consistent use of Unicode avoids errors from mis-tagged encodings, or missing encoding declarations.<\/li>\n<li>Legacy encodings are rarely used nowadays and will only become rarer in the future.<\/li>\n<li>Using Unicode increases the reliability of interoperability between systems and services.<\/li>\n<li>UTF-8 is a superset of ASCII<\/li>\n<\/ul>\n<h2>Performance<\/h2>\n<p>Modern operating systems are Unicode, usually UTF-16 or UTF-8 internally.\u00a0 Converting between other codepages, particularly table-based ones, consumes resources and takes time.\u00a0 Even if there&#8217;s a difference between UTF-8 data and a UTF-16 platform, that conversion is algorithmic and heavily optimized to be fast.\u00a0 You can&#8217;t get that from other 
"national" codepages: because they require lookup tables that must be mapped into memory, or at least into the CPU cache, their conversions are slower.</p>
<p>On Windows, encoding or decoding data with UTF-8 is much faster than with the legacy Windows-1252 encoding.</p>
<h2>Data Integrity</h2>
<p>Consistent use of Unicode is the most reliable way to protect the integrity of your users' data. Every other codepage is subject to potential corruption in various ways.</p>
<p><strong>Missing Codepoints</strong> – The most obvious source of corruption is data that doesn't fit in the encoding's character set: those characters get replaced with ? or other replacement characters. Even seemingly simple scenarios like customer names can run afoul of this. A customer prefers Théo to Theo, but the application was only expecting ASCII, so we end up with Th?o instead, pleasing no one.</p>
<p><strong>Encoding Stability</strong> – Over the years, nearly all encodings have undergone some churn. For some it was relatively minor; others added major blocks of codepoints or corrected errors in codepoint assignments. In such cases, data encoded with one version of an encoding may not decode the same way under a slightly different version, such as after an OS update. In the modern world, with computers distributed around the globe, it's usually impossible to guarantee that consistent versions are in use, even when everything runs on the same platform.</p>
<p><strong>Platform Differences</strong> – As encodings evolved, platforms adopted the updates at differing paces. Some chose the most recent, perceived "correct" encoding to stay up to date, at the expense of data compatibility. Others updated slowly, or never, choosing data stability for their 
customers. Add in the occasional implementation error and other factors, and the net result is that most platforms and browsers have minor differences in their character encodings, particularly for the larger character sets. This leads to the occasional random unexpected glyph or ? in decoded data.</p>
<p>Most applications have to interact with other software libraries, platforms, or operating systems, so these sorts of corruptions are unfortunately common.</p>
<p><strong>Encoding (Mis)tagging</strong> – Similarly, to properly exchange encoded data, the sender and recipient must agree on the encoding. A common error is treating Windows-1252, ISO-8859-1, and/or ASCII as nearly interchangeable for Latin text: perhaps writing a document on a system that uses one encoding and then hosting it on a server that defaults to a different one. Since they're all Latin-based and agree over the ASCII range, this can be difficult to notice. I'm sure we've all seen this type of corruption, commonly showing up in "quoted" text, where the quotes are replaced with gibberish characters, or in a trademark or other symbol being mis-mapped.</p>
<p>A related problem is protocols that don't provide for labeling the encoding at all, or applications that don't declare an encoding even when permitted.</p>
<h2>Usage on the Internet</h2>
<p>Unicode has become the most common encoding for stored data, particularly on the Internet. Web metrics show that over 98% of web pages are encoded in UTF-8. Much of the remainder is in one of the common Latin-based defaults (ASCII, ISO-8859-1, or Windows-1252). Even in the Asian regions that historically needed national character sets to support their large character repertoires, over 95% of web data is in UTF-8.</p>
<h2>UTF-8 Versus ASCII</h2>
<p>ASCII is a special 
case, because it's pretty common, and sometimes it's mixed up with Windows-1252 or ISO-8859-1. There's no advantage to using ASCII over UTF-8. UTF-8 is a superset of ASCII, so you get all of the benefits of ASCII plus the ability to encode any other character. If you are looking for an encoding for Latin scripts, pick UTF-8: it reduces ambiguity, takes about the same space (ASCII-range characters still occupy a single byte), and can still encode every character you might unexpectedly encounter.</p>
<h2>Protocols and Data Interchange</h2>
<p>Where you really want to Use Unicode! is in protocols. Anywhere you're exchanging data with another system or component, you want a clean Unicode interface. Often this means UTF-8 for data on the wire, or UTF-16 for internal API calls on a UTF-16 system. Specifying Unicode for the interfaces removes any ambiguity about which encoding is expected, and it reduces the risk of mis-tagged data sneaking through one or more layers of the system. It also allows any character that users might need, even if you initially intend the software only for a Latin-script market.</p>
<h2>Legacy "A" APIs</h2>
<p>The legacy Windows "ANSI" (a misnomer) APIs have severe limitations compared to their Unicode counterparts. Historically, these used the system codepage, prior to Windows' support for Unicode, and they continue to be used by apps that don't use Unicode.</p>
<p>These legacy APIs behave differently depending on the system codepage settings; those differences can even cause data loss when data is shared between systems in different regions. They usually convert to Unicode internally, then back to the system encoding, making them slower than the native "W" Unicode APIs. And they often don't provide the full functionality of their preferred 
Unicode counterparts.</p>
<h2>Legacy Data</h2>
<p>Now that I've convinced everyone to Use Unicode!, I do recognize that a lot of legacy data isn't stored in Unicode, and that many processes depend on older systems designed for a different encoding. Eventually nearly all of that data will move to Unicode, because the entire industry continues to move in that direction. In the meantime, try to confine those legacy data structures behind Unicode-enabled interfaces. That way the problem is at least contained, and when the data does eventually get updated, fewer additional pieces need to change with it.</p>
<h2>Use Unicode!</h2>
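<p>As a quick illustration of the failure modes described under Data Integrity, here is a minimal Python sketch of the Théo example: forcing the name into a too-small character set loses data outright, while decoding UTF-8 bytes with a mis-tagged codepage produces mojibake.</p>

```python
name = "Théo"

# Missing codepoints: an ASCII-only pipeline replaces é with ?
ascii_lossy = name.encode("ascii", errors="replace").decode("ascii")
print(ascii_lossy)  # Th?o

# Mis-tagged encoding: UTF-8 bytes decoded as Windows-1252 yield mojibake
utf8_bytes = name.encode("utf-8")  # b'Th\xc3\xa9o'
mojibake = utf8_bytes.decode("windows-1252")
print(mojibake)  # ThÃ©o

# When both sides agree on UTF-8, the data round-trips losslessly
assert utf8_bytes.decode("utf-8") == name
```

<p>Note that the mojibake decode "succeeds" without raising an error, which is exactly why this class of corruption is so easy to miss.</p>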
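<p>And to sketch the "confine legacy data behind Unicode-enabled interfaces" advice: in this illustrative Python example (the Shift-JIS legacy store is just an assumption for the demo), the boundary functions always accept and return Unicode strings, and the legacy encoding is named in exactly one place.</p>

```python
LEGACY_ENCODING = "shift_jis"  # illustrative: whatever the old system actually uses

def read_legacy(raw: bytes) -> str:
    """Boundary function: legacy bytes in, Unicode out."""
    return raw.decode(LEGACY_ENCODING)

def write_legacy(text: str) -> bytes:
    """Boundary function: Unicode in, legacy bytes out."""
    return text.encode(LEGACY_ENCODING)

# Everything inside the boundary works purely with Unicode strings.
stored = write_legacy("東京")
assert read_legacy(stored) == "東京"

# On the wire, prefer UTF-8 and label it explicitly.
payload = "東京".encode("utf-8")
assert payload.decode("utf-8") == "東京"
```

<p>When the legacy store finally migrates to Unicode, only the two boundary functions change; the rest of the system never sees the old encoding.</p>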