How can I get WideCharToMultiByte to convert strings encoded in UTF-16BE?

A customer had a Windows program that receives data in UTF-16BE format, and they want to convert it to Shift JIS format. According to the customer liaison:

They convert the characters from UTF-16LE to Shift JIS by calling WideCharToMultiByte, and it works fine. However, trying to convert the characters from UTF-16BE to Shift JIS via WideCharToMultiByte produces garbage. How can we tell WideCharToMultiByte that the string is UTF-16BE? Is there any documentation that explains this?

In Windows, if a string is described as being in Unicode or UTF-16 format, the documentation means UTF-16LE format by default. Similarly, if a sequence of bytes is described as encoding a multi-byte integer, the documentation means little-endian twos-complement format by default.¹

The bias toward little-endian format in Windows is so strong that big-endian format is sometimes called “reverse byte order”, such as in the values returned by the IsTextUnicode function.

In this case, it’s not clear how the customer is using the WideCharToMultiByte function to convert UTF-16BE to Shift JIS. The WideCharToMultiByte function does not have any flag to specify the source encoding, so the system assumes the default, which is UTF-16LE. I’m guessing that they are just passing UTF-16BE data directly to the WideCharToMultiByte function and hoping that the function somehow employs psychic powers to realize “Oh, this time, the data should be treated as UTF-16BE.”

The WideCharToMultiByte function does not have psychic powers. It converts from UTF-16LE.
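
To see where the garbage comes from, consider what happens when the same two bytes are read in each byte order. The following is a minimal sketch in portable C. The helper names read_utf16le and read_utf16be are made up for illustration and are not part of any Windows API.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of any Windows API. */

/* Interpret two bytes as a UTF-16LE code unit (low byte first). */
uint16_t read_utf16le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Interpret two bytes as a UTF-16BE code unit (high byte first). */
uint16_t read_utf16be(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

Given the bytes 0x30 0x42, read_utf16be produces 0x3042 (HIRAGANA LETTER A), which is what the sender meant, but read_utf16le produces 0x4230, an unrelated code point. A function that assumes UTF-16LE sees the second value.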

The customer must convert their source data from UTF-16BE to UTF-16LE, and then pass the UTF-16LE data to the WideCharToMultiByte function. Fortunately, converting UTF-16BE to UTF-16LE is extremely straightforward.
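
One way to do the conversion is to swap each pair of bytes in place. This is only a sketch: the function name swap_utf16_byte_order is hypothetical, and it assumes the data is held in a writable byte buffer whose length is known.

```c
#include <stddef.h>

/* Hypothetical helper for illustration; not part of any Windows API.
   Swaps the two bytes of every 16-bit code unit in the buffer, which
   turns UTF-16BE into UTF-16LE (and vice versa). A trailing odd byte,
   if any, is left untouched. */
void swap_utf16_byte_order(unsigned char *buf, size_t byte_count)
{
    for (size_t i = 0; i + 1 < byte_count; i += 2) {
        unsigned char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

After the swap, the buffer holds UTF-16LE and can be passed to WideCharToMultiByte as usual, for example with code page 932 if that is the Shift JIS variant the customer is targeting. Surrogate pairs need no special handling, since each 16-bit code unit is swapped independently.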

¹ One example of how the default might not apply is when talking about data encoded in “network byte order”.

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

2 comments

Zsigmond Lőrinczy October 10, 2023

MultiByteToWideChar almost can help with parameter CodePage=1201 (unicodeFFFE). It is “almost” as it is described as “Unicode UTF-16, big endian byte order; available only to managed applications”

Reinhard Weiss October 5, 2023 · Edited

That means WideCharToMultiByte() would not interpret the byte order mark and in the Shift JIS case handle it as invalid character.
