{"id":36253,"date":"2005-03-08T07:00:00","date_gmt":"2005-03-08T07:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2005\/03\/08\/keep-your-eye-on-the-code-page\/"},"modified":"2022-06-30T10:22:04","modified_gmt":"2022-06-30T17:22:04","slug":"keep-your-eye-on-the-code-page","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20050308-00\/?p=36253","title":{"rendered":"Keep your eye on the code page"},"content":{"rendered":"<p>Remember that there are typically two 8-bit code pages active, the so-called &#8220;ANSI&#8221; code page and the so-called &#8220;OEM&#8221; code page. GUI programs usually use the ANSI code page for 8-bit files (though utf-8 is becoming more popular lately), whereas console programs usually use the OEM code page.<\/p>\n<p>This means, for example, when you open an 8-bit text file in Notepad, it assumes the ANSI code page. But if you use the TYPE command from the command prompt, it will use the OEM code page.<\/p>\n<p>This has interesting consequences if you switch between the GUI and the command line frequently.<\/p>\n<p>The two code pages typically agree on the first 128 characters, but they nearly always disagree on the characters from 128 to 255 (so-called &#8220;extended characters&#8221;). For example, on a US-English machine, character 0x80 in the OEM code page is \u00c7, whereas in the ANSI code page it is \u20ac.<\/p>\n<p>Consider a directory which contains a file named <tt>\u00c7<\/tt>. If you type &#8220;dir&#8221; at a command prompt, you see a happy <tt>\u00c7<\/tt> on the screen. 
On the other hand, if you do &#8220;dir &gt;files.txt&#8221; and open files.txt in a GUI editor like Notepad, you will find that the \u00c7 has changed to a \u20ac, because the 0x80 in the file is being interpreted in the ANSI character set instead of the OEM character set.<\/p>\n<p>Stranger yet, if you mark\/select the file name from the console window and paste it into Notepad, you get a \u00c7. That&#8217;s because the console window&#8217;s mark\/select code saves text on the clipboard as Unicode; the character saved into the clipboard is not 0x80 but rather U+00C7, the Unicode code point for &#8220;Latin Capital Letter C With Cedilla&#8221;. When this is pasted into Notepad, it gets converted from Unicode to the ANSI code page, which on a US-English system encodes the \u00c7 character as 0xC7.<\/p>\n<p>But wait, there&#8217;s more. The command processor has an option (\/U) to generate all piped and redirected output in Unicode rather than the OEM code page.<\/p>\n<p>(Note that the built-in documentation for the command processor says that the \/A switch produces ANSI output; this is incorrect. \/A produces OEM output. This is one of those bugs that you recognize instantly if you are familiar with what is going on. It&#8217;s so obviously OEM that when I see the documentation say &#8220;ANSI&#8221;, my mind just reads it as &#8220;OEM&#8221;. In the same way native English speakers often fail to notice misspellings or doubled words.)<\/p>\n<p>If you run the command<\/p>\n<pre>cmd \/U \/C dir ^&gt;files.txt\r\n<\/pre>\n<p>then the output will be in Unicode and therefore will record the \u00c7 character as U+00C7, which Notepad will then be able to read back.<\/p>\n<p>This has serious consequences for batch files.<\/p>\n<p>Batch files are 8-bit files and are interpreted according to the OEM character set. 
This means that if you write a batch file with Notepad or some other program that uses the ANSI character set for 8-bit files, and your batch file contains extended characters, the results you get will not match what you see in your editor.<\/p>\n<p>Why the discrepancy between GUI programs and console programs over how 8-bit characters should be interpreted?<\/p>\n<p>The reason is, of course, historical.<\/p>\n<p>Back in the days of MS-DOS, the code page was what today is called the OEM code page. For US-English systems, this is the code page with the box-drawing characters and the fragments of the integral signs. It contained accented letters, but not a very big set of them, just enough to cover the German, French, Spanish, and Italian languages. And Swedish. (Why Swedish yet not Danish and Norwegian I don&#8217;t know.)<\/p>\n<p>When Windows came along, it decided that those box-drawing characters were wasting valuable space that could be used for adding still more accented characters, so out went the box-drawing characters and in went characters for Danish, Norwegian, Icelandic, and Canadian French. (Yes, Canadian French uses characters that European French does not.)<\/p>\n<p>Thus began the schism between console programs (MS-DOS) and GUI programs (Windows) over how 8-bit character data should be interpreted.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Remember that there are typically two 8-bit code pages active, the so-called &#8220;ANSI&#8221; code page and the so-called &#8220;OEM&#8221; code page. GUI programs usually use the ANSI code page for 8-bit files (though utf-8 is becoming more popular lately), whereas console programs usually use the OEM code page. 
This means, for example, when you open [&hellip;]<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[26],"class_list":["post-36253","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-other"],"acf":[],"blog_post_summary":"<p>Remember that there are typically two 8-bit code pages active, the so-called &#8220;ANSI&#8221; code page and the so-called &#8220;OEM&#8221; code page. GUI programs usually use the ANSI code page for 8-bit files (though utf-8 is becoming more popular lately), whereas console programs usually use the OEM code page. This means, for example, when you open [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/36253","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=36253"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/36253\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=36253"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=36253"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.mi
crosoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=36253"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}