June 15th, 2011

How do I convert a UTF-8 string to UTF-16 while rejecting illegal sequences?

By default, when you ask MultiByteToWideChar to convert a UTF-8 string that contains illegal sequences (such as overlong sequences) to UTF-16, it will try to muddle through as best it can. If you want it to treat illegal sequences as an error, pass the MB_ERR_INVALID_CHARS flag.
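Here's a sketch of what that looks like in practice, using the usual two-call pattern (one call to measure, one to convert). The helper name Utf8ToUtf16Strict is my own invention, not something from the article; error handling is abbreviated.

```cpp
#include <windows.h>
#include <string>

// Convert UTF-8 to UTF-16, rejecting illegal sequences.
// Returns true and fills 'out' on success; returns false if 'utf8'
// contains an illegal sequence (for example, an overlong encoding).
bool Utf8ToUtf16Strict(const std::string& utf8, std::wstring& out)
{
    if (utf8.empty()) { out.clear(); return true; }

    // First call: measure the required buffer size in WCHARs.
    // MB_ERR_INVALID_CHARS makes the function fail on illegal
    // sequences instead of muddling through.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), (int)utf8.size(),
                                  nullptr, 0);
    if (len == 0) {
        // GetLastError() reports ERROR_NO_UNICODE_TRANSLATION
        // when the input contained an illegal sequence.
        return false;
    }

    out.resize(len);

    // Second call: perform the actual conversion.
    len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                              utf8.data(), (int)utf8.size(),
                              &out[0], len);
    return len != 0;
}
```

If you omit the flag, the function succeeds anyway; in my experience, recent versions of Windows convert each illegal byte to U+FFFD REPLACEMENT CHARACTER, but the documentation does not promise this.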

The MSDN documentation on this subject is, to be honest, kind of hard to follow, and it even includes a double negative: “The function does not drop illegal code points if the application does not set this flag.” Not only is this confusing, it doesn’t even say what happens to illegal code points when you omit this flag; all it says is what the function doesn’t do, namely that it doesn’t drop them. Does it set them on fire? (Presumably, if you omit the flag, then it retains illegal code points, but how do you retain an illegal UTF-8 code point in UTF-16 output? It’s like documentation for atoi saying, “If the value cannot be represented as an integer, it is left unchanged.” Huh? The function still has to return an integer. How do you return an unchanged string as an integer?)

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.
