A program to detect mojibake that results from a UTF-8-encoded file being misinterpreted as code page 1252

It is not uncommon to see UTF-8-encoded data misinterpreted as code page 1252, due to file content lacking an explicit encoding declaration, such as you might encounter with the Resource Compiler. So let’s write a program to detect that specific kind of mojibake.

The first observation is that code page 1252 agrees with Unicode’s first 256 code points, with the exception of the range U+0080 through U+009F, which code page 1252 assigns to various types of punctuation. Therefore, the conversion is almost trivial.

wchar_t To1252Table[32];

void Init1252Table()
{
    char as8bit[32];
    for (int i = 0; i < 32; i++) {
        as8bit[i] = (char)(i + 0x80);
    }
    MultiByteToWideChar(1252, 0, as8bit, ARRAYSIZE(as8bit),
                        To1252Table, ARRAYSIZE(To1252Table));
}

BYTE To1252(wchar_t ch)
{
    if (ch < 0x100) return (BYTE)ch;
    for (int i = 0; i < ARRAYSIZE(To1252Table); i++) {
        if (To1252Table[i] == ch) return (BYTE)(i + 0x80);
    }
    return 0;
}

We start by using the MultiByteToWideChar function to obtain the reverse conversion table for those troublesome 32 bytes. If the code point is below U+0100, then we just throw away the high-order bits and declare victory. Otherwise, we look inside our table, and if we find a match, we convert the index into the corresponding 1252 character. If that fails, then we declare failure.

There is a bit of a hole here: Unicode code points U+0080 through U+009F are erroneously converted to 1252 characters 0x80 through 0x9F. But for our purposes, this is close enough.

BOOL CALLBACK StringEnumProc(
    HMODULE module,
    LPCWSTR type,
    LPWSTR name,
    LONG_PTR param)
{
    PCWSTR filename = (PCWSTR)param;
    auto hrsrc = FindResource(module, name, type);
    auto res = LoadResource(module, hrsrc);
    int base = (int)((INT_PTR)name * 16);
    auto p = reinterpret_cast<wchar_t*>(LockResource(res));
    for (int i = 0; i < 16; i++) {
        // See if there is anything that looks mis-encoded.
        bool isSuspicious = false;
        for (int j = 0; j < *p; j++) {
            if (p[j + 1] == 0xFFFD) {
                // Ugh, REPLACEMENT CHARACTER.
                // Original *.rc was corrupted.
                isSuspicious = true;
                break;
            } else {
                BYTE ch = To1252(p[j + 1]);
                if (ch > 0x7f) {
                    // Does this look like UTF-8?
                    if ((ch & 0xE0) == 0xC0 &&
                        (To1252(p[j + 2]) & 0xC0) == 0x80) {
                        isSuspicious = true;
                        break;
                    }
                    if ((ch & 0xf0) == 0xE0 &&
                        (To1252(p[j + 2]) & 0xC0) == 0x80 &&
                        (To1252(p[j + 3]) & 0xC0) == 0x80) {
                        isSuspicious = true;
                        break;
                    }
                    if ((ch & 0xf8) == 0xF0 &&
                        (To1252(p[j + 2]) & 0xC0) == 0x80 &&
                        (To1252(p[j + 3]) & 0xC0) == 0x80 &&
                        (To1252(p[j + 4]) & 0xC0) == 0x80) {
                        isSuspicious = true;
                        break;
                    }
                }
            }
        }
        if (isSuspicious) {
            printf("%ls: %d: ", filename, base + i);
            for (int j = 0; j < *p; j++) {
                wchar_t ch = p[j + 1];
                if (ch < 0x80) {
                    putwc(p[j + 1], stdout);
                } else {
                    printf("\\u%04x", ch);
                }
            }
            printf("\n");
        }
        p += *p + 1;
    }
    return true;
}

void ProcessFile(PCWSTR filename)
{
    auto module = LoadLibraryEx(filename,
                      nullptr, LOAD_LIBRARY_AS_DATAFILE);
    if (!module) {
        fprintf(stderr, "Error %d loading %ls\n",
                GetLastError(), filename);
    }
    EnumResourceNames(module, RT_STRING,
                      StringEnumProc, (LONG_PTR)filename);
    FreeLibrary(module);
}

To process a file, we load it as a data file, and then enumerate the RT_STRING resources out of it.

We learned some time ago that string resources are stored in bundles of 16 strings each, with each string in the bundle prefixed by its length in code points. We walk through each string of the bundle, and look for suspicious strings.

If we see a U+FFFD, then that’s a SUBSTITUTION CHARACTER,¹ which means that the original file contained an encoding error, and there as obviously a problem with the original resource file.

Otherwise, we try to reverse-decode the character into code page 1252, and if it looks like the start of a UTF-8 multi-byte sequence, we peek ahead at the next few characters, and if they also reverse-decode into the completion of a UTF-8 multi-byte sequence, then we declare that the string is suspicious.

If we find a suspicious string, we print its resource ID and the string itself, taking care to escape any character outside the ASCII character set so it can be easier to identify in the original resource file.

Feed a file to this function, and it’ll report any strings that it thinks may have been the result of UTF-8-as-1252 mojibake.

¹ I find it oddly quaint that Unicode code point names are written in ALL CAPS. I don’t know why, but I never asked either.

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

6 comments

Dimitriy Ryazantcev July 2, 2019

>I find it oddly quaint that Unicode code point names are written in ALL CAPS. I don’t know why, but I never asked either.

ISO/IEC 10646:2017. Information technology — Universal Coded Character Set (UCS) 26.2 Name formation An entity name shall consist only of the following characters: - LATIN CAPITAL LETTER A through LATIN CAPITAL LETTER Z, - DIGIT ZERO through DIGIT NINE, - SPACE, - HYPHEN-MINUS, and - FULL STOP if the entity being named is a collection The first character in an entity name shall be a Latin capital letter. The last character in an entity name shall be...

David Walker July 5, 2019

Does the > introduce something quoted earlier? Where is the end of the quote?
It seems that > starts a quote, and the quote ends with no indication… the reply just starts at some point. Confusing.

Piotr Siódmak July 1, 2019

What’s the point of the outer “for (int i = 0; i < 16; i++)” loop? You use “i” only in printf at the end. Is it a trap for copy-pasters?

Fabian Giesen July 1, 2019

It’s mentioned in the text!
“We learned some time ago that string resources are stored in bundles of 16 strings each, with each string in the bundle prefixed by its length in code points. We walk through each string of the bundle, and look for suspicious strings.”
- Piotr Siódmak July 2, 2019
  
  Didn’t notice that “p” changes at the end of the loop. Damn C/C++

Yuri Khan July 1, 2019

> I find it oddly quaint that Unicode code point names are written in ALL CAPS. I don’t know why, but I never asked either.
Possibly because the master file that describes Unicode code point names must remain readable if misinterpreted as any mildly ASCII-compatible encoding. Which there exist that do not have lowercase Latin letters (KOI-7 comes to mind).

Stay informed

Get notified when new posts are published.

Discussion is closed. Login to edit/delete existing comments.

Dimitriy Ryazantcev July 2, 2019

>I find it oddly quaint that Unicode code point names are written in ALL CAPS. I don’t know why, but I never asked either.

ISO/IEC 10646:2017. Information technology — Universal Coded Character Set (UCS) 26.2 Name formation An entity name shall consist only of the following characters: - LATIN CAPITAL LETTER A through LATIN CAPITAL LETTER Z, - DIGIT ZERO through DIGIT NINE, - SPACE, - HYPHEN-MINUS, and - FULL STOP if the entity being named is a collection The first character in an entity name shall be a Latin capital letter. The last character in an entity name shall be...
Read more
>I find it oddly quaint that Unicode code point names are written in ALL CAPS. I don’t know why, but I never asked either.

ISO/IEC 10646:2017. Information technology — Universal Coded Character Set (UCS) 26.2 Name formation An entity name shall consist only of the following characters: – LATIN CAPITAL LETTER A through LATIN CAPITAL LETTER Z, – DIGIT ZERO through DIGIT NINE, – SPACE, – HYPHEN-MINUS, and – FULL STOP if the entity being named is a collection The first character in an entity name shall be a Latin capital letter. The last character in an entity name shall be either a Latin capital letter or a Digit.

Read less
- David Walker July 5, 2019
  
  Does the > introduce something quoted earlier? Where is the end of the quote?
  It seems that > starts a quote, and the quote ends with no indication… the reply just starts at some point. Confusing.
Piotr Siódmak July 1, 2019

What’s the point of the outer “for (int i = 0; i < 16; i++)” loop? You use “i” only in printf at the end. Is it a trap for copy-pasters?
- Fabian Giesen July 1, 2019
  
  It’s mentioned in the text!
  “We learned some time ago that string resources are stored in bundles of 16 strings each, with each string in the bundle prefixed by its length in code points. We walk through each string of the bundle, and look for suspicious strings.”
  - Piotr Siódmak July 2, 2019
    
    Didn’t notice that “p” changes at the end of the loop. Damn C/C++
Yuri Khan July 1, 2019

> I find it oddly quaint that Unicode code point names are written in ALL CAPS. I don’t know why, but I never asked either.
Possibly because the master file that describes Unicode code point names must remain readable if misinterpreted as any mildly ASCII-compatible encoding. Which there exist that do not have lowercase Latin letters (KOI-7 comes to mind).

A program to detect mojibake that results from a UTF-8-encoded file being misinterpreted as code page 1252

Category

Topics

Author

6 comments

Read next

Why does my attempt to index a collection with the `x:Bind` markup extension keep telling me I have an invalid binding path due to an unexpected array indexer?

I set the `OFN_NONETWORKBUTTON` option in the `OPENFILENAME` structure, but it has no effect on the network item in the navigation pane

Category

Topics

Share

Author

6 comments

Read next

Why does my attempt to index a collection with the x:Bind markup extension keep telling me I have an invalid binding path due to an unexpected array indexer?

I set the OFN_NONETWORKBUTTON option in the OPENFILENAME structure, but it has no effect on the network item in the navigation pane

Stay informed

Why does my attempt to index a collection with the `x:Bind` markup extension keep telling me I have an invalid binding path due to an unexpected array indexer?

I set the `OFN_NONETWORKBUTTON` option in the `OPENFILENAME` structure, but it has no effect on the network item in the navigation pane