October 5th, 2023

How can I get WideCharToMultiByte to convert strings encoded in UTF-16BE?

A customer had a Windows program that receives data in UTF-16BE format, and they want to convert it to Shift JIS format. According to the customer liaison:

They convert the characters from UTF-16LE to Shift JIS by calling Wide­Char­To­Multi­Byte, and it works fine. However, trying to convert the characters from UTB-16BE to Shift JIS via Wide­Char­To­Multi­Byte produces garbage. How can we tell Wide­Char­To­Multi­Byte that the string is UTF-16BE? Is there any documentation that explains this?

In Windows, if a string is described as being in Unicode or UTF-16 format, the documentation means UTF-16LE format by default. Similarly, if a sequence of bytes is described as encoding a multi-byte integer, the documentation means little-endian twos-complement format by default.¹

The bias toward little-endian format in Windows is so strong that big-endian format is sometimes called “reverse byte order”, such as in the values returned by the Is­Text­Unicode format.

In this case, it’s not clear how the customer is using the Wide­Char­To­Multi­Byte function to convert UTF-16BE to Shift JIS. The Wide­Char­To­Multi­Byte function does not have any flag to specify the source encoding, so the system assumes the default, which is UTF-16LE. I’m guessing that they are just passing UTF-16BE data directly to the Wide­Char­To­Multi­Byte function and hoping that the function somehow employs psychic powers to realize “Oh, this time, the data should be treated as UTF-16BE.”

The Wide­Char­To­Multi­Byte function does not have psychic powers. It converts from UTF-16LE.

The customer must convert their source data from UTF-16BE to UTF-16LE, and then pass the UTF-16LE data to Wide­Char­To­Multi­Byte function. Fortunately, converting UTF-16BE to UTF-16LE is extremely straightforward.

¹ One example of how the default might not apply is when talking about data encoded in “network byte order”.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

2 comments

Discussion is closed. Login to edit/delete existing comments.

  • Zsigmond LÅ‘rinczy

    MultiByteToWideChar almost can help with parameter CodePage=1201 (unicodeFFFE). It is “almost” as it is described as “Unicode UTF-16, big endian byte order; available only to managed applications”

  • Reinhard Weiss · Edited

    That means Wide­Char­To­Multi­Byte() would not interpret the byte order mark and in the Shift JIS case handle it as invalid character.