Keep your eye on the code page

Raymond Chen

Remember that there are typically two 8-bit code pages active, the so-called “ANSI” code page and the so-called “OEM” code page. GUI programs usually use the ANSI code page for 8-bit files (though utf-8 is becoming more popular lately), whereas console programs usually use the OEM code page.

This means, for example, when you open an 8-bit text file in Notepad, it assumes the ANSI code page. But if you use the TYPE command from the command prompt, it will use the OEM code page.

This has interesting consequences if you switch between the GUI and the command line frequently.

The two code pages typically agree on the first 128 characters, but they nearly always disagree on the characters from 128 to 255 (so-called “extended characters”). For example, on a US-English machine, character 0x80 in the OEM code page is Ç, whereas in the ANSI code page it is €.

Consider a directory which contains a file named Ç. If you type “dir” at a command prompt, you see a happy Ç on the screen. On the other hand, if you do “dir >files.txt” and open files.txt in a GUI editor like Notepad, you will find that the Ç has changed to a €, because the 0x80 in the file is being interpreted in the ANSI character set instead of the OEM character set.

Stranger yet, if you mark/select the file name from the console window and paste it into Notepad, you get a Ç. That’s because the console window’s mark/select code saves text on the clipboard as Unicode; the character saved into the clipboard is not 0x80 but rather U+00C7, the Unicode code point for “Latin Capital Letter C With Cedilla”. When this is pasted into Notepad, it gets converted from Unicode to the ANSI code page, which on a US-English system encodes the Ç character as 0xC7.

But wait, there’s more. The command processor has an option (/U) to generate all piped and redirected output in Unicode rather than the OEM code page.

(Note that the built-in documentation for the command processor says that the /A switch produces ANSI output; this is incorrect. /A produces OEM output. This is one of those bugs that you recognize instantly if you are familiar with what is going on. It’s so obviously OEM that when I see the documentation say “ANSI”, my mind just reads it as “OEM”. In the same way native English speakers often fail to notice misspellings or doubled words.)

If you run the command

cmd /U /C dir ^>files.txt

then the output will be in Unicode and therefore will record the Ç character as U+00C7, which Notepad will then be able to read back.

This has serious consequences for batch files.

Batch files are 8-bit files and are interpreted according to the OEM character set. This means that if you write a batch file with Notepad or some other program that uses the ANSI character set for 8-bit files, and your batch file contains extended characters, the results you get will not match the what you see in your editor.

Why the discrepancy between GUI programs and console programs over how 8-bit characters should be interpreted?

The reason is, of course, historical.

Back in the days of MS-DOS, the code page was what today is called the OEM code page. For US-English systems, this is the code page with the box-drawing characters and the fragments of the integral signs. It contained accented letters, but not a very big set of them, just enough to cover the German, French, Spanish, and Italian languages. And Swedish. (Why Swedish yet not Danish and Norwegian I don’t know.)

When Windows came along, it decided that those box-drawing characters were wasting valuable space that could be used for adding still more accented characters, so out went the box-drawing characters and in went characters for Danish, Norwegian, Icelandic, and Canadian French. (Yes, Canadian French uses characters that European French does not.)

Thus began the schism between console programs (MS-DOS) and GUI programs (Windows) over how 8-bit character data should be interpreted.

0 comments

Discussion is closed.

Feedback usabilla icon