May 4th, 2012

How does the MultiByteToWideChar function treat invalid characters?

The MB_ERR_INVALID_CHARS flag controls how the Multi­Byte­To­Wide­Char function treats invalid characters. Some people claim that the following sentences in the documentation are contradictory:

  • “Starting with Windows Vista, the function does not drop illegal code points if the application does not set the flag.”

  • “Windows XP: If this flag is not set, the function silently drops illegal code points.”

  • “The function fails if MB_ERR_INVALID_CHARS is set and an invalid character is encountered in the source string.”

Actually, the three sentences are talking about different cases. The first two talk about what happens if you omit the flag; the third talks about what happens if you include the flag.

Since people seem to like tables, here’s a description of the MB_ERR_INVALID_CHARS flag in tabular form:

MB_ERR_INVALID_CHARS set? Operating system Treatment of invalid character
Yes Any Function fails
No XP and earlier Character is dropped
Vista and later Character is not dropped

Here’s a sample program that illustrates the possibilities:

#include <windows.h>
#include <ole2.h>
#include <windowsx.h>
#include <commctrl.h>
#include <strsafe.h>
#include <uxtheme.h>
void MB2WCTest(DWORD flags)
{
 WCHAR szOut[256];
 int cch = MultiByteToWideChar(CP_UTF8, flags,
                               "\xC0\x41\x42", 3, szOut, 256);
 printf("Called with flags %d\n", flags);
 printf("Return value is %d\n", cch);
 for (int i = 0; i < cch; i++) {
  printf("value[%d] = %d\n", i, szOut[i]);
 }
 printf("-----\n");
}
int __cdecl main(int argc, char **argv)
{
 MB2WCTest(0);
 MB2WCTest(MB_ERR_INVALID_CHARS);
 return 0;
}

If you run this on Windows XP, you get

Called with flags 0
Return value is 2
Value[0] = 65
Value[1] = 66
-----
Called with flags 8
Return value is 0
-----

This demonstrates that passing the MB_ERR_INVALID_CHARS flag causes the function to fail, and omitting it causes the invalid character \xC0 to be dropped.

If you run this on Windows Vista, you get

Called with flags 0
Return value is 3
Value[0] = 65533
Value[1] = 65
Value[2] = 66
-----
Called with flags 8
Return value is 0
-----

This demonstrates again that passing the MB_ERR_INVALID_CHARS flag causes the function to fail, but this time, if you omit the flag, the invalid character \xC0 is converted to U+FFFD, which is REPLACEMENT CHARACTER. (Note that it does not appear to be documented precisely what happens to invalid characters, aside from the fact that they are not dropped. Perhaps code pages other than CP_UTF8 convert them to some other default character.)

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

0 comments

Discussion are closed.