{"id":27223,"date":"2007-04-17T10:00:00","date_gmt":"2007-04-17T10:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2007\/04\/17\/the-notepad-file-encoding-problem-redux\/"},"modified":"2007-04-17T10:00:00","modified_gmt":"2007-04-17T10:00:00","slug":"the-notepad-file-encoding-problem-redux","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20070417-00\/?p=27223","title":{"rendered":"The Notepad file encoding problem, redux"},"content":{"rendered":"<p>\nAbout every ten months,\nsomebody new discovers\n<a HREF=\"http:\/\/blogs.msdn.com\/oldnewthing\/archive\/2004\/03\/24\/95235.aspx\">\nthe Notepad file encoding problem<\/a>.\nLet&#8217;s see what else there is to say about it.\n<\/p>\n<p>\nFirst of all, can we change Notepad&#8217;s detection algorithm?\nThe problem is that there are a lot of different text files out there.\nLet&#8217;s look just at the ones that Notepad supports.\n<\/p>\n<ul>\n<li>8-bit ANSI (of which 7-bit ASCII is a subset).\nThese have no BOM; they just dive right in with bytes of text.\nThey are also probably the most common type of text file.<\/p>\n<li>UTF-8.\nThese usually begin with a BOM but not always.<\/p>\n<li>Unicode big-endian (UTF-16BE).\nThese usually begin with a BOM but not always.<\/p>\n<li>Unicode little-endian (UTF-16LE).\nThese usually begin with a BOM but not always.\n<\/ul>\n<p>\nIf a BOM is found, then life is easy, since the BOM tells you\nwhat encoding the file uses.\nThe problem is when there is no BOM.\nNow you have to guess, and when you guess, you can guess wrong.\nFor example, consider this file:\n<\/p>\n<pre>\nD0 AE\n<\/pre>\n<p>\nDepending on which encoding you assume, you get very different results.\n<\/p>\n<ul>\n<li>If you assume 8-bit ANSI (with code page 1252),\nthen the file consists of the two characters\n<code>U+00D0 U+00AE<\/code>, or\n&#8220;&#xD0;&#xAE;&#8221;.\nSure this looks strange, but maybe it&#8217;s part of the word\nVATNI&#xD0;&#xAE; which might be the name of an Icelandic hotel.<\/p>\n<li>If you assume UTF-8,\nthen the file consists of the single Cyrillic character\n<code>U+042E<\/code>, or &#8220;&#x42E;&#8221;.<\/p>\n<li>If you assume Unicode big-endian, then the file consists of the\nKorean Hangul syllable\n<code>U+D0AE<\/code>, or\n&#8220;&#xD0AE;&#8221;.<\/p>\n<li>If you assume Unicode little-endian, then the file consists of\nthe Korean Hangul syllable\n<code>U+AED0<\/code>, or\n&#8220;&#xAED0;&#8221;.\n<\/ul>\n<p>\nOkay, so this file can be interpreted in four different ways.\nAre you going to use the &#8220;try to guess&#8221; algorithm from\n<code>IsTextUnicode<\/code>?\n(<a HREF=\"http:\/\/blogs.msdn.com\/michkap\/archive\/2005\/01\/30\/363308.aspx\">Michael Kaplan has some thoughts on this subject<\/a>.)\nIf so, then you are right where Notepad is today.\nNotice that all four interpretations are linguistically plausible.\n<\/p>\n<p>\nSome people might say that the rule should be &#8220;All files without\na BOM are 8-bit ANSI.&#8221;\nIn that case, you&#8217;re going to misinterpret all the files\nthat use UTF-8 or UTF-16 and don&#8217;t have a BOM.\nNote that the Unicode standard even advises <strong>against<\/strong>\nusing a BOM for UTF-8,\nso you&#8217;re already throwing out everybody who follows the\nrecommendation.\n<\/p>\n<p>\nOkay, given that the Unicode folks recommend against using a BOM for\nUTF-8, maybe your rule is &#8220;All files without a BOM are UTF-8.&#8221;\nWell, that messes up all 8-bit ANSI files that use characters\nabove 127.\n<\/p>\n<p>\nMaybe you&#8217;re willing to accept that ambiguity, and use the\nrule, &#8220;If the file looks like valid UTF-8, then use UTF-8;\notherwise use 8-bit ANSI, but under no circumstances should you\ntreat the file as UTF-16LE or UTF-16BE.&#8221;\nIn other words, &#8220;never auto-detect UTF-16&#8221;.\nFirst, you still have ambiguous cases, like the file above,\nwhich could be either 8-bit ANSI or UTF-8.\nAnd second, you are going to be flat-out wrong when\nyou run into a Unicode file that\nlacks a BOM, since you&#8217;re going to misinterpret it as either\nUTF-8 or (more likely) 8-bit ANSI.\nYou might decide that programs that generate UTF-16 files without\na BOM are broken, but that doesn&#8217;t mean that they don&#8217;t exist.\nFor example,\n<\/p>\n<pre>\ncmd \/u \/c dir &gt;results.txt\n<\/pre>\n<p>\nThis generates a UTF-16LE file without a BOM.\nIf you poke around your Windows directory, you&#8217;ll probably\nfind other Unicode files without a BOM.\n(For example, I found <code>COM+.log<\/code>.)\nThese files still &#8220;worked&#8221; under the old <code>IsTextUnicode<\/code>\nalgorithm, but now they are unreadable.\nMaybe you consider that an acceptable loss.\n<\/p>\n<p>\nThe point is that no matter how you decide to resolve the ambiguity,\nsomebody will win and somebody else will lose.\nAnd then people can start experimenting with the &#8220;losers&#8221; to find\none that makes your algorithm look stupid for choosing &#8220;incorrectly&#8221;.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>About every ten months, somebody new discovers the Notepad file encoding problem. Let&#8217;s see what else there is to say about it. First of all, can we change Notepad&#8217;s detection algorithm? The problem is that there are a lot of different text files out there. Let&#8217;s look just at the ones that Notepad supports. 8-bit [&hellip;]<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[26],"class_list":["post-27223","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-other"],"acf":[],"blog_post_summary":"<p>About every ten months, somebody new discovers the Notepad file encoding problem. Let&#8217;s see what else there is to say about it. First of all, can we change Notepad&#8217;s detection algorithm? The problem is that there are a lot of different text files out there. Let&#8217;s look just at the ones that Notepad supports. 8-bit [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/27223","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=27223"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/27223\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=27223"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=27223"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=27223"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}