It’s okay to be contrary, but you need to be consistently contrary: Going against the ambient character set

Raymond Chen

In Windows, you declare your character set preference implicitly by defining or not defining the symbol UNICODE before including the windows.h header file. (Related: TEXT vs. _TEXT vs. _T, and UNICODE vs. _UNICODE.) This determines whether undecorated function names redirect to the ANSI version or the Unicode version, but it doesn’t make the opposite version inaccessible. You just have to call the opposite-version functions by their explicit names. And it’s important that you be consistent about it. If you miss a spot, the characters get all messed up.

// UNICODE not defined
#include <windows.h>

void UpdateTitle(HWND hwnd, PCWSTR title)
{
    SetWindowTextW(hwnd, title);
}

In the above example, we did not define the symbol UNICODE, so the ambient character set is ANSI. Since we want to call the Unicode version of Set­Window­Text, we must use its explicit Unicode name Set­Window­TextW.
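To make the redirection concrete, here is a minimal portable sketch of the pattern. The Demo* names are hypothetical stand-ins that I made up for illustration; they are not part of windows.h. The real header uses the same kind of macro to point an undecorated name at either its A or its W function.

```cpp
// Portable mock of the A/W naming pattern.
// DemoSetTextA, DemoSetTextW and DemoSetText are hypothetical names,
// not real Windows functions.
#include <cassert>
#include <string>

// UNICODE is deliberately left undefined, so the ambient character set is ANSI.

inline std::string DemoSetTextA(const char*)    { return "A version"; }
inline std::string DemoSetTextW(const wchar_t*) { return "W version"; }

// The undecorated name is only a macro that selects one of the two.
#ifdef UNICODE
#define DemoSetText DemoSetTextW
#else
#define DemoSetText DemoSetTextA
#endif

inline void CheckAmbientIsAnsi()
{
    // The undecorated name resolves to the A version.
    assert(DemoSetText("title") == "A version");
    // The opposite version is still reachable by its explicit name.
    assert(DemoSetTextW(L"title") == "W version");
}
```

Defining UNICODE ahead of the macro block would flip DemoSetText to the W version, while both explicit names stay available either way.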

Most of the time, these errors are detected at compile time due to type mismatches. For example, if we forgot to put the trailing W on the function name, we would get the error

error C2664: 'BOOL SetWindowTextA(HWND,const char *)': cannot convert argument 2 from 'const wchar_t *' to 'const char *'
note: Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast

And that’s your clue that you forgot to W-ize the Set­Window­Text call. You should have called the W version explicitly: Set­Window­TextW.

However, there’s a category of functions that elude this compile-time detection: The functions that have separate ANSI and Unicode versions but take only character-set-independent parameters. Common examples are Dispatch­Message, Translate­Message, Translate­Accelerator, Create­Accelerator­Table, and most notably, Def­Window­Proc.
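Here is a small portable sketch of why the compiler cannot help with this category. The Demo* functions and the Handle* wrappers are hypothetical mocks, not real Windows APIs; the point is only that when the A and W versions have identical parameter types, the undecorated name compiles cleanly and quietly picks the ambient version.

```cpp
// Portable mock: A and W versions whose parameters carry no character type.
// All names here are hypothetical and exist only for illustration.
#include <cassert>
#include <cstdint>
#include <string>

// UNICODE is deliberately left undefined, so the ambient character set is ANSI.

inline std::string DemoDefProcA(int, std::intptr_t) { return "A version"; }
inline std::string DemoDefProcW(int, std::intptr_t) { return "W version"; }

#ifdef UNICODE
#define DemoDefProc DemoDefProcW
#else
#define DemoDefProc DemoDefProcA
#endif

// Intended to be Unicode throughout, but the trailing W was forgotten.
// There is no type mismatch, so the compiler has nothing to complain about.
inline std::string HandleMessageForgotW(int message, std::intptr_t lParam)
{
    return DemoDefProc(message, lParam);
}

// The consistent version names the W function explicitly.
inline std::string HandleMessageExplicitW(int message, std::intptr_t lParam)
{
    return DemoDefProcW(message, lParam);
}

inline void CheckSilentMismatch()
{
    assert(HandleMessageForgotW(1, 0) == "A version");
    assert(HandleMessageExplicitW(1, 0) == "W version");
}
```

Both wrappers build without warnings, and only the second one reaches the version the author intended.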

For some reason, when I get called in to investigate this sort of problem, it’s usually the Def­Window­Proc that is the source of the problem.

But I don’t think it’s because people get the others right and miss the Def­Window­Proc. I think it’s because the mistakes in the other functions are much less noticeable. The mistakes are still there, and maybe you’ll get a bug report from a user in Japan when they run into it, but that’s not something that is going to be noticed in English-based testing as much as a string that is truncated down to its first letter.
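The first-letter truncation is what you would expect when UTF-16 text is read as if it were a null-terminated ANSI string. The sketch below is a portable illustration of that effect, not the actual code path inside Windows. It assumes a little-endian machine, and the helper name LengthWhenMisreadAsAnsi is hypothetical.

```cpp
// Shows how UTF-16 text looks to code that expects a char string.
// Assumes little-endian byte order; the helper name is hypothetical.
#include <cassert>
#include <cstring>

// Length the text appears to have when its bytes are read as char.
inline std::size_t LengthWhenMisreadAsAnsi(const char16_t* text)
{
    return std::strlen(reinterpret_cast<const char*>(text));
}

inline void CheckTruncation()
{
    const char16_t title[] = u"Hello";
    // In memory the bytes are 'H' 00 'e' 00 'l' 00 ...
    // The 00 right after 'H' looks like the end of an ANSI string.
    assert(reinterpret_cast<const char*>(title)[0] == 'H');
    assert(LengthWhenMisreadAsAnsi(title) == 1);
}
```

For text in the ASCII range, every second byte is zero, so the misreading stops after the first character. Text outside that range has nonzero high bytes and comes out garbled rather than truncated, which is consistent with the problem showing up differently for users in other languages.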

10 comments


  • Harold H 0

    The ANSI version of a function is called “FunctionNameA”. Why is the Unicode version of a function called “FunctionNameW” and not “FunctionNameU” ?

    • Solomon Ucko 0

      I’m pretty sure it stands for “wide” or “wchar” but I’m not sure why the inconsistency.

    • Michael Taylor 0

      Yes the `W` stands for wide since the character set is known as the `wide character` and hence Windows uses `WCHAR` as the type. Why didn’t they use `U` and `UCHAR` then? Couple of possibilities.

      1) While most people think of Unicode as 16 bits, it is in fact either 8 or 16 bits per code unit depending upon whether you’re using UTF-8 or UTF-16. `WCHAR` is for UTF-16, hence wide. A `FuncU` function could potentially be called with either a UTF-8 or UTF-16 string if it meant “Unicode string”.
      2) `unsigned char` is a valid type in C/C++. `UCHAR` might be misinterpreted as unsigned char.

      • Solomon Ucko 0

        Or 32 bits per code unit for UTF-32.

    • Me Gusta 0

      Having a bit of a prod around the Windows NT SDK (yes, the SDK for the original version of Windows NT), UCHAR is defined as a typedef for unsigned char. So it may have been a bit iffy to use U in that case.
      But IIRC, this kind of naming came from the fact that characters that were made up of single byte units like ASCII (ISO 646), ISO 2022, ISO 8859 and the like were (maybe retroactively) seen as narrow characters, and characters that are made up of two or more bytes are seen as wide characters.
      Now, while it was officially added to the C standard library in 1995, the wchar_t type was being informally used for quite a while. Again that Windows NT SDK from 1992 had the wchar.h and wctype.h headers. Since this is actually getting into the region where I am too young to remember anything more than this, I would guess that wchar_t was informally added prior to NT and it was just repurposed for UCS-2 since it was convenient.
      But I would have to do a bit more digging to see, and that would mean finding and downloading the Microsoft C compiler from the late 80s to see how things were.

    • MNGoldenEagle 0

      The A/W dichotomy existed back in Windows 3.1, with the W functions being failing stubs (for most regions). Given that Windows 3.1 predated Unicode’s existence as a standard, that’s probably part of the reason why. Microsoft knew they wanted “wide character” support, but what charset they were going to use wasn’t defined yet, and wouldn’t be until the release of Windows NT which used UCS-2.

  • Neil Rashbrook 0

    When you invent that time machine, change LPARAM into an LPTSTR. Problem solved. While you’re there, make the header detect C++ and use overloaded functions to automatically select the right A or W function depending on the type of the arguments.

    • Raymond Chen (Microsoft employee) 0

      C++ is not an ABI, however. Different compilers decorate differently. The overloads would have to be inline functions, but that creates function identity problems.

      • Me Gusta 0

        If I remember correctly, changing LPARAM and overloading that wouldn’t really fix the problem anyway because the window expecting ANSI/UNICODE is a property of the HWND. For example, if you register the class using RegisterClass(Ex)A and then use W functions after that, you would still have problems.

  • Paul Topping 0

    I prefer to use all explicit names like FunctionNameA or FunctionNameW. It’s too bad windows.h doesn’t have a macro that forces this by suppressing all the unadorned names like FunctionName. I understand that the unadorned names allow one to switch between Unicode and non-Unicode versions by changing a single macro definition, but that’s rarely practical anyway. Of course, my new macro (#define NO_UNICODE_NAMES?) wouldn’t prevent anyone from doing that if that’s what they wanted.
