Windows adopted Unicode before most other operating systems.[citation needed] As a result, Windows’s solutions to many problems differ from solutions adopted by those who waited for the dust to settle.¹ The most notable example of this is that Windows used UCS-2 as the Unicode encoding. This was the encoding recommended by the Unicode Consortium because Unicode 1.0 supported only 65536 characters.² The Unicode Consortium changed their minds five years later, but by then it was far too late for Windows, which had already shipped Win32s, Windows NT 3.1, Windows NT 3.5, Windows NT 3.51, and Windows 95, all of which used UCS-2.³
But today we’re going to talk about `printf`-style format strings.
Windows adopted Unicode before the C language did. This meant that Windows had to invent Unicode support in the C runtime. The result was functions like `wcscmp`, `wcschr`, and `wprintf`. As for `printf`-style format strings, here’s what we ended up with:
- The `%s` format specifier represents a string in the same width as the format string.
- The `%S` format specifier represents a string in the opposite width from the format string.
- The `%hs` format specifier represents a narrow string, regardless of the width of the format string.
- The `%ws` and `%ls` format specifiers represent a wide string, regardless of the width of the format string.
The idea behind this pattern was so that you could write code like this:
```cpp
TCHAR buffer[256];
GetSomeString(buffer, 256);
_tprintf(TEXT("The string is %s.\n"), buffer);
```
If the code is compiled as ANSI, the result is
```cpp
char buffer[256];
GetSomeStringA(buffer, 256);
printf("The string is %s.\n", buffer);
```
And if the code is compiled as Unicode, the result is⁴
```cpp
wchar_t buffer[256];
GetSomeStringW(buffer, 256);
wprintf(L"The string is %s.\n", buffer);
```
By following the convention that `%s` takes a string in the same width as the format string itself, this code runs properly when compiled either as ANSI or as Unicode. It also makes converting existing ANSI code to Unicode much simpler, since you can keep using `%s`, and it will morph to do what you need.
When Unicode support formally arrived in C99, the C standard committee chose a different model for `printf` format strings:
- The `%s` and `%hs` format specifiers represent a narrow string.
- The `%ls` format specifier represents a wide string.
This created a problem. There were six years and untold billions of lines of code in the Windows ecosystem that used the old model. What should the Visual C and C++ compiler do?
They chose to stick with the existing nonstandard model, so as not to break every Windows program on the planet.
If you want your code to work both on runtimes that use the Windows classic `printf` rules as well as those that use C standard `printf` rules, you can limit yourself to `%hs` for narrow strings and `%ls` for wide strings, and you’ll get consistent results regardless of whether the format string was passed to `sprintf` or `swprintf`.
```cpp
#ifdef UNICODE
#define TSTRINGWIDTH TEXT("l")
#else
#define TSTRINGWIDTH TEXT("h")
#endif

TCHAR buffer[256];
GetSomeString(buffer, 256);
_tprintf(TEXT("The string is %") TSTRINGWIDTH TEXT("s\n"), buffer);
```

```cpp
char buffer[256];
GetSomeStringA(buffer, 256);
printf("The string is %hs\n", buffer);
```

```cpp
wchar_t buffer[256];
GetSomeStringW(buffer, 256);
wprintf(L"The string is %ls\n", buffer);
```
Encoding the `TSTRINGWIDTH` separately lets you do things like
```cpp
_tprintf(TEXT("The string is %10") TSTRINGWIDTH TEXT("s\n"), buffer);
```
Since people like tables, here’s a table.
| Format | Function | Windows classic | C standard | |
|---|---|---|---|---|
| `%s` | `printf` | `char*` | `char*` | ⇐ |
| `%s` | `wprintf` | `wchar_t*` | `char*` | |
| `%S` | `printf` | `wchar_t*` | N/A | |
| `%S` | `wprintf` | `char*` | N/A | |
| `%hs` | `printf` | `char*` | `char*` | ⇐ |
| `%hs` | `wprintf` | `char*` | `char*` | ⇐ |
| `%ls` | `printf` | `wchar_t*` | `wchar_t*` | ⇐ |
| `%ls` | `wprintf` | `wchar_t*` | `wchar_t*` | ⇐ |
| `%ws` | `printf` | `wchar_t*` | N/A | |
| `%ws` | `wprintf` | `wchar_t*` | N/A | |
I highlighted the rows where the C standard agrees with the Windows classic format.⁵ If you want your code to work the same under either format convention, you should stick to those rows.
¹ You’d think that adopting Unicode early would give Windows the first-mover advantage, but at least with respect to Unicode, it ended up being a first-mover disadvantage, because everybody else could sit back and wait for better solutions to emerge (such as UTF-8) before beginning their Unicode adoption efforts.
² I guess they thought that 65536 characters should be enough for anyone.
³ This was later upgraded to UTF-16. Fortunately, UTF-16 is backward compatible with UCS-2 for the code points that are representable in both.
⁴ Technically, the Unicode version was

```cpp
unsigned short buffer[256];
GetSomeStringW(buffer, 256);
wprintf(L"The string is %s.\n", buffer);
```

because there was not yet a `wchar_t` as an independent type. Prior to its introduction to the standard, the `wchar_t` type was just a synonym for `unsigned short`. The changing fate of the `wchar_t` type has its own story.
⁵ The Windows classic format came first, so the question is whether the C standard chose to align with the Windows classic format, rather than vice versa.
Incidentally this came up in some work I was doing recently with the Unreal Engine. Unreal uses wide strings everywhere on all platforms, and it has printf-style formatting functions such as FString::Printf().
They don't seem to document whether that function and others use the Windows-style or C99-style format specifiers for wide vs. narrow strings, but as it turns out they use the Windows-style format specifiers. On all platforms.
On Windows, FString::Printf() just forwards to vswprintf(),...
Thank you for the note. Now the reasons for all this confusion with "%s" are clear. The fact of the matter is that our users were constantly asking why PVS-Studio reacted differently to their, as they thought, "portable" code depending on whether they built their project under Linux or Windows. Therefore, we had to add a special section about this issue to the description of the V576 diagnostic (see the section "Wide character strings"). Now...
ws=wchar/word string, ls=long string?, hs=half string?
Wouldn’t ns=narrow string make more sense than hs? And why both ws and ls? ws should suffice…
Defining _CRT_STDIO_ISO_WIDE_SPECIFIERS works as of Visual Studio 2019 and enables C99-conforming format specifiers. I believe it was introduced in Visual Studio 2015 and was initially intended to be the new default, but that idea was subsequently abandoned: https://devblogs.microsoft.com/cppblog/c-runtime-crt-features-fixes-and-breaking-changes-in-visual-studio-14-ctp1/. Now it seems to be completely undocumented, so I guess don’t use it?
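A build-configuration fragment illustrating the opt-in the comment above describes. This is a sketch for MSVC/UCRT only, and, as noted, the switch appears to be undocumented today, so treat it as at-your-own-risk:

```c
/* MSVC/UCRT build-configuration sketch: opt this translation unit into
   C99-conforming wide specifiers, either on the compiler command line:
       cl /D_CRT_STDIO_ISO_WIDE_SPECIFIERS program.c
   or by defining the macro before any CRT header is included: */
#define _CRT_STDIO_ISO_WIDE_SPECIFIERS
#include <stdio.h>
#include <wchar.h>

/* With the define in effect, wprintf follows the C standard rules:
   %s takes a char* and %ls takes a wchar_t*, even in a wide format
   string, instead of the Windows classic interpretation. */
```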
Who needs those “standard” printf specifiers anyways, given that Windows is perhaps the only major operating system on which wprintf is actually useful (everyone else using UTF-8). I guess there is no point in sticking to a standard which 1) makes no sense, and 2) no one else uses.
Don’t forget that there was also the confusingly named _CRT_STDIO_ARBITRARY_WIDE_SPECIFIERS.
Although it finally disappeared.
If that wasn't confusing enough, Netscape 6 (or possibly earlier) included code, a derived version of which still exists in Firefox today, for a function . Despite its name not including , it takes a format string and prints into a array, however like it uses for a parameter and for a parameter...
Sadly, %hs is not standard. And why should it be, when %s is unambiguous (except for those pesky Microsoft folks who tried to be early adopters of Unicode and now get blamed for having baked the early Unicode stuff into the platform)?
Yikes, so that makes things rather difficult now…
Why not have an off-by-default flag that allows printf and friends to operate in a “standards compliant” mode that follows what the C and C++ standards specify? (if such a thing doesn’t already exist somewhere). Existing code written for Windows doesn’t set the flag and keeps working, but code that wants to be cross platform or standards compliant can set the flag and get what it wants.
And what if two components in the same process disagree on how the flag should be set?
I believe the idea was to choose at compile time
Same question applies. Suppose two DLLs both use the Visual C++ runtime. One is built with "Windows printf" and the other is built with "Standard printf". Which one should the runtime's printf follow? I guess one solution would be to make printf a macro. If compiling with "Windows printf" it redirects to "printf_win"; otherwise it redirects to "printf_std". Not sure what the standard says about that, and it may still have issues with function pointers.
Too complicated!
In the make file, you slam C:\Includes\StdLib\Ours at the head of the System include paths.
In the make file, you remove msvcrtxx.lib (any and all reference), replace it with your replacement library.
Since you're loading c99-stdlib.dll->printf and the DLL is loading msvcrt999u-mt-x-spec.dll->printf, there should be no issues. If you have a malloc from the process and a free in the DLL, that would be an issue - but as that always has the...
A macro is acceptable; this is explicitly allowed by the C standard. (I’m not sure about C++)
Over in the Linux world, the system headers solve a similar compatibility issue between classic GNU scanf and standard C99 scanf with a macro: `# define scanf __isoc99_scanf`
I was expecting the solution to be symbol redirects in .lib files. That is, which .dll function gets called depends on which version of the .lib stub got linked against.
> at least with respect to Unicode, it ended up being a first-mover disadvantage
At least we’re in good company with Java, Javascript, and ICU.