Is there a code page that matches ASCII and can round trip arbitrary bytes through Unicode?

Raymond Chen

Is there a code page that matches ASCII for the first 128 values and can round trip arbitrary bytes through Unicode?

You may find yourself looking for such a code page when you have a chunk of binary data with embedded ASCII text. You want to be able to dig out and even manipulate the ASCII text, and treat the non-ASCII parts as mysterious characters that have no meaning, but you need to be able to convert them back into the original bytes.

For example, the format of the binary data might be an ASCII string followed by a 32-bit integer in big-endian format. You want to parse out the ASCII string, then take the next four characters, reverse them, and then convert them back to bytes so you can extract the integer.

The C# language has a lot of handy facilities for manipulating text, but they require C# String objects, which are expressed as a sequence of UTF-16LE code units. So what you want is a way to convert bytes into UTF-16LE code units, with the property that bytes less than 128 map to the corresponding ASCII characters, and bytes 128 and above map to something in a reversible way.

The UTF-8 code page won’t work because there are invalid byte sequences in UTF-8 which will not convert to text and back. Another choice you might go for is code page 1252, but it fails because it has several undefined code units (0x81, 0x8D, 0x8F, 0x90, and 0x9D), which means that a byte with one of those values may get converted to U+FFFD REPLACEMENT CHARACTER. That is a special Unicode character that means “There was a character here, but I can’t express it in Unicode,” and it is commonly used to represent encoding errors.
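To make the UTF-8 failure concrete, here is a minimal sketch. It assumes the default behavior of Encoding.UTF8, which substitutes U+FFFD for bytes it cannot decode; the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

static class Utf8RoundTrip
{
    // Hypothetical helper: returns true if decoding the bytes as UTF-8
    // and encoding the result again reproduces the original bytes.
    public static bool Survives(byte[] original)
    {
        string text = Encoding.UTF8.GetString(original);
        return Encoding.UTF8.GetBytes(text).SequenceEqual(original);
    }

    static void Main()
    {
        // Pure ASCII is valid UTF-8, so it round trips.
        Console.WriteLine(Survives(new byte[] { 0x41, 0x42 }));       // True

        // 0xFF can never appear in valid UTF-8. It decodes to U+FFFD,
        // which encodes back as EF BF BD, so the original byte is lost.
        Console.WriteLine(Survives(new byte[] { 0x41, 0xFF, 0x42 })); // False
    }
}
```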

Even though you are unlikely to consider it, I would just mention that double-byte code pages are also not going to work because they take pairs of bytes and convert them to Unicode. This means that reversing the Unicode code units and then converting back to binary is not going to work.

Okay, I’ll cut to the chase. The code page I use for this sort of thing is code page 437. Every byte is defined and maps to a unique Unicode code point, and it agrees with ASCII for the first 128 values.

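using System;

class Program
{
    static public void Main()
    {
        // Start with every possible byte value, in order.
        var bytes = new byte[256];
        int i;
        for (i = 0; i < 256; i++)
        {
            bytes[i] = (byte)i;
        }
        // On .NET Core and .NET 5+, code page 437 is available only after
        // System.Text.Encoding.RegisterProvider(
        //     System.Text.CodePagesEncodingProvider.Instance);
        var oem = System.Text.Encoding.GetEncoding(437);
        var text = oem.GetChars(bytes);   // bytes -> UTF-16 code units
        Array.Reverse(text);
        bytes = oem.GetBytes(text);       // and back to bytes
        // Find the first byte that did not survive the trip.
        for (i = 0; i < 256; i++)
        {
            if (bytes[i] != (byte)(255 - i)) break;
        }
        System.Console.WriteLine(i); // should print 256
    }
}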

I take the bytes 0 through 255 and convert them to a string via code page 437. I reverse the string, then convert back to bytes and verify that the resulting bytes are also reversed.
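Applied to the earlier example of an ASCII string followed by a 32-bit big-endian integer, the technique might look like the sketch below. The NUL terminator on the string, the sample data, and the names ParseDemo and Parse are all assumptions made for illustration, since the layout above does not say how the string ends. The provider registration is needed on .NET Core and .NET 5+; older .NET Framework versions have code page 437 built in and may not have that class.

```csharp
using System;
using System.Text;

static class ParseDemo
{
    static readonly Encoding Oem;

    static ParseDemo()
    {
        // Needed on .NET Core / .NET 5+ before asking for code page 437.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Oem = Encoding.GetEncoding(437);
    }

    // Assumed layout: NUL-terminated ASCII name, then a 32-bit big-endian integer.
    public static (string Name, int Value) Parse(byte[] data)
    {
        string text = Oem.GetString(data);       // one char per byte
        int nul = text.IndexOf('\0');
        if (nul < 0 || nul + 5 > text.Length)
        {
            throw new FormatException("Expected NUL-terminated name plus 4 bytes.");
        }
        string name = text.Substring(0, nul);

        // Take the next four characters, reverse them, and convert back to bytes.
        char[] chars = text.Substring(nul + 1, 4).ToCharArray();
        Array.Reverse(chars);
        byte[] raw = Oem.GetBytes(chars);        // now least significant byte first

        int value = raw[0] | raw[1] << 8 | raw[2] << 16 | raw[3] << 24;
        return (name, value);
    }

    static void Main()
    {
        byte[] data = { (byte)'H', (byte)'i', 0, 0x12, 0x34, 0xAB, 0xCD };
        var (name, value) = Parse(data);
        Console.WriteLine($"{name} 0x{value:X8}"); // Hi 0x1234ABCD
    }
}
```

Assembling the integer with explicit shifts, rather than BitConverter, keeps the result independent of the endianness of the machine running the code.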

6 comments

  • Jan Kraetzschmar

    Just make sure you’re not making round-trips across implementations which disagree on whether 0xE1 maps to sharp S (U+00DF) or small beta (U+03B2).

  • Joshua Hudson

    We’ve got some code where another developer used Windows-1252 for this. It turns out if the data starts as bytes the actual mapping table in the implementation doesn’t change the few unallocated slots to FFFD but rather C1 control codes, which translate back to the same slots.

    I think there are some compatibility hacks in the conversion table because it didn’t bomb when more characters were added and we turned the 1252 strings back to byte arrays. Yes, the wise guy was turning byte arrays into strings to stuff into a DB VARCHAR column.

  • Bill Godfrey

    The 256 characters from U+0100 to U+01FF are all “normal” characters and translate to and from bytes with simple bit twiddling. These don’t overlap with ASCII, but if you need to carry bytes in a string package then that’s probably not as important a criterion.

    It wouldn’t surprise me if someone’s found a more efficient selection of 2ⁿ code points in the first 64K of code points that excludes all control characters, joining characters, etc., for use as a base-64-esque encoding method.

  • Александр Алексеев

    Perhaps the “best fit” strategies should be mentioned. For example, MultiByteToWideChar silently converts code page 1252’s invalid byte 0x81 to the Unicode control character U+0081 (and vice versa).