How can I get WideCharToMultiByte to convert strings encoded in UTF-16BE?

Raymond Chen

A customer had a Windows program that receives data in UTF-16BE format, and they wanted to convert it to Shift JIS format. According to the customer liaison:

They convert the characters from UTF-16LE to Shift JIS by calling WideCharToMultiByte, and it works fine. However, trying to convert the characters from UTF-16BE to Shift JIS via WideCharToMultiByte produces garbage. How can we tell WideCharToMultiByte that the string is UTF-16BE? Is there any documentation that explains this?

In Windows, if a string is described as being in Unicode or UTF-16 format, the documentation means UTF-16LE format by default. Similarly, if a sequence of bytes is described as encoding a multi-byte integer, the documentation means little-endian twos-complement format by default.¹

The bias toward little-endian format in Windows is so strong that big-endian format is sometimes called “reverse byte order”, such as in the values returned by the IsTextUnicode function.

In this case, it’s not clear how the customer is using the Wide­Char­To­Multi­Byte function to convert UTF-16BE to Shift JIS. The Wide­Char­To­Multi­Byte function does not have any flag to specify the source encoding, so the system assumes the default, which is UTF-16LE. I’m guessing that they are just passing UTF-16BE data directly to the Wide­Char­To­Multi­Byte function and hoping that the function somehow employs psychic powers to realize “Oh, this time, the data should be treated as UTF-16BE.”

The Wide­Char­To­Multi­Byte function does not have psychic powers. It converts from UTF-16LE.
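To see why the result is garbage, it may help to look at what happens when the same two bytes are read with each byte order. The following is only an illustrative sketch: the helper names are made up for this example, and U+3042 (HIRAGANA LETTER A) is just a convenient sample character.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: read one 16-bit code unit from two bytes. */
static uint16_t read_u16_le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));   /* low byte first */
}

static uint16_t read_u16_be(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);   /* high byte first */
}

/* The UTF-16BE encoding of U+3042 is the byte pair 30 42. */
static void demo_misread(void)
{
    const unsigned char bytes[] = { 0x30, 0x42 };

    /* Read as big-endian, the bytes give the intended character. */
    assert(read_u16_be(bytes) == 0x3042);

    /* Read as little-endian, which is what a function expecting
       UTF-16LE does, the same bytes give the unrelated code unit U+4230. */
    assert(read_u16_le(bytes) == 0x4230);
}
```

Every code unit in the string is misread in the same way, so the conversion faithfully converts the wrong characters.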

The customer must convert their source data from UTF-16BE to UTF-16LE, and then pass the UTF-16LE data to the WideCharToMultiByte function. Fortunately, converting UTF-16BE to UTF-16LE is extremely straightforward: swap the two bytes of each 16-bit code unit.
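One way the conversion could look is sketched below. This is not the customer's code: the function names are hypothetical, the wrapper assumes code page 932 is the Shift JIS code page the customer wants, and it relies on Windows being little-endian so that native 16-bit code units are UTF-16LE in memory. The Windows-specific part is guarded so that the byte-order helper can be compiled and tried anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: assemble `count` UTF-16 code units from
   big-endian bytes in `src` into native 16-bit values in `dst`. */
static void utf16be_to_units(const unsigned char *src, size_t count,
                             uint16_t *dst)
{
    assert(count == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < count; i++) {
        dst[i] = (uint16_t)((src[2 * i] << 8) | src[2 * i + 1]);
    }
}

#ifdef _WIN32
/* Hypothetical wrapper: convert `count` UTF-16BE code units to
   Shift JIS (assumed here to be code page 932). Returns the number
   of bytes written to `out`, or 0 on failure. */
static int utf16be_to_shift_jis(const unsigned char *src, int count,
                                char *out, int out_size)
{
    if (count <= 0) return 0;

    uint16_t *units = (uint16_t *)malloc((size_t)count * sizeof *units);
    if (units == NULL) return 0;

    /* Step 1: UTF-16BE bytes to native (little-endian) code units. */
    utf16be_to_units(src, (size_t)count, units);

    /* Step 2: hand the UTF-16LE data to WideCharToMultiByte.
       wchar_t is 16 bits wide on Windows, so the cast is safe there. */
    int written = WideCharToMultiByte(932, 0, (const wchar_t *)units,
                                      count, out, out_size, NULL, NULL);
    free(units);
    return written;
}
#endif
```

Note that this sketch does nothing special with a leading byte order mark. If the incoming data starts with one, the caller would need to skip it before converting.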

¹ One example of how the default might not apply is when talking about data encoded in “network byte order”.

2 comments


  • Reinhard Weiss 1

    That means WideCharToMultiByte() would not interpret the byte order mark and, in the Shift JIS case, would handle it as an invalid character.

  • Zsigmond Lőrinczy 0

    MultiByteToWideChar almost can help with parameter CodePage=1201 (unicodeFFFE). It is “almost” as it is described as “Unicode UTF-16, big endian byte order; available only to managed applications”
