{"id":102823,"date":"2019-08-30T07:00:00","date_gmt":"2019-08-30T14:00:00","guid":{"rendered":"http:\/\/devblogs.microsoft.com\/oldnewthing\/?p=102823"},"modified":"2023-09-07T14:59:18","modified_gmt":"2023-09-07T21:59:18","slug":"20190830-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190830-00\/?p=102823","title":{"rendered":"The sad history of Unicode printf-style format specifiers in Visual C++"},"content":{"rendered":"<p>Windows adopted Unicode before most other operating systems.<sup>[citation needed]<\/sup> As a result, Windows&#8217;s solutions to many problems differ from solutions adopted by those who waited for the dust to settle.\u00b9 The most notable example of this is that Windows used UCS-2 as the Unicode encoding. This was the encoding recommended by the Unicode Consortium because Unicode 1.0 supported only 65536 characters.\u00b2 The Unicode Consortium changed their minds five years later, but by then it was far too late for Windows, which had already shipped Win32s, Windows NT 3.1, Windows NT 3.5, Windows NT 3.51, and Windows 95, all of which used UCS-2.\u00b3<\/p>\n<p>But today we&#8217;re going to talk about <code>printf<\/code>-style format strings.<\/p>\n<p>Windows adopted Unicode before the C language did. This meant that Windows had to invent Unicode support in the C runtime. The result was functions like <code>wcscmp<\/code>, <code>wcschr<\/code>, and <code>wprintf<\/code>. 
As for <code>printf<\/code>-style format strings, here&#8217;s what we ended up with:<\/p>\n<ul>\n<li>The <code>%s<\/code> format specifier represents a string in the same width as the format string.<\/li>\n<li>The <code>%S<\/code> format specifier represents a string in the opposite width as the format string.<\/li>\n<li>The <code>%hs<\/code> format specifier represents a narrow string regardless of the width of the format string.<\/li>\n<li>The <code>%ws<\/code> and <code>%ls<\/code> format specifiers represent a wide string regardless of the width of the format string.<\/li>\n<\/ul>\n<p>The idea behind this pattern was that you could write code like this:<\/p>\n<pre>TCHAR buffer[256];\r\nGetSomeString(buffer, 256);\r\n_tprintf(TEXT(\"The string is %s.\\n\"), buffer);\r\n<\/pre>\n<p>If the code is compiled as ANSI, the result is<\/p>\n<pre>char buffer[256];\r\nGetSomeStringA(buffer, 256);\r\nprintf(\"The string is %s.\\n\", buffer);\r\n<\/pre>\n<p>And if the code is compiled as Unicode, the result is\u2074<\/p>\n<pre>wchar_t buffer[256];\r\nGetSomeStringW(buffer, 256);\r\nwprintf(L\"The string is %s.\\n\", buffer);\r\n<\/pre>\n<p>By following the convention that <code>%s<\/code> takes a string in the same width as the format string itself, this code runs properly when compiled either as ANSI or as Unicode. It also makes converting existing ANSI code to Unicode much simpler, since you can keep using <code>%s<\/code>, and it will morph to do what you need.<\/p>\n<p>When Unicode support formally arrived in C99, the C standard committee chose a different model for <code>printf<\/code> format strings.<\/p>\n<ul>\n<li>The <code>%s<\/code> and <code>%hs<\/code> format specifiers represent a narrow string.<\/li>\n<li>The <code>%ls<\/code> format specifier represents a wide string.<\/li>\n<\/ul>\n<p>This created a problem. There were six years and untold billions of lines of code in the Windows ecosystem that used the old model. 
What should the Visual C and C++ compiler do?<\/p>\n<p>They chose to stick with the existing nonstandard model, so as not to break every Windows program on the planet.<\/p>\n<p>If you want your code to work both on runtimes that use the Windows classic <code>printf<\/code> rules and on those that use C standard <code>printf<\/code> rules, you can limit yourself to <code>%hs<\/code> for narrow strings and <code>%ls<\/code> for wide strings, and you&#8217;ll get consistent results regardless of whether the format string was passed to <code>sprintf<\/code> or <code>wsprintf<\/code>.<\/p>\n<pre>#ifdef UNICODE\r\n#define TSTRINGWIDTH TEXT(\"l\")\r\n#else\r\n#define TSTRINGWIDTH TEXT(\"h\")\r\n#endif\r\n\r\nTCHAR buffer[256];\r\nGetSomeString(buffer, 256);\r\n_tprintf(TEXT(\"The string is %\") TSTRINGWIDTH TEXT(\"s\\n\"), buffer);\r\n\r\n\/\/ Compiled as ANSI, the above becomes:\r\nchar buffer[256];\r\nGetSomeStringA(buffer, 256);\r\nprintf(\"The string is %hs\\n\", buffer);\r\n\r\n\/\/ Compiled as Unicode, the above becomes:\r\nwchar_t buffer[256];\r\nGetSomeStringW(buffer, 256);\r\nwprintf(L\"The string is %ls\\n\", buffer);\r\n<\/pre>\n<p>Encoding the <code>TSTRINGWIDTH<\/code> separately lets you do things like<\/p>\n<pre>_tprintf(TEXT(\"The string is %10\") TSTRINGWIDTH TEXT(\"s\\n\"), buffer);\r\n<\/pre>\n<p>Since people like tables, here&#8217;s a table.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th style=\"border: solid 1px currentcolor;\" colspan=\"2\">Format<\/th>\n<th style=\"border: solid 1px currentcolor;\">Windows classic<\/th>\n<th style=\"border: solid 1px currentcolor;\">C standard<\/th>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor; border-bottom: none; background-color: #bdecb6;\"><code>%s<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>printf<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>char*<\/code><\/td>\n<td style=\"border: 
solid 1px currentcolor; background-color: #bdecb6;\"><code>char*<\/code><\/td>\n<td style=\"border: none;\">\u21d0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor; border-top: none;\"><code>%s<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>wprintf<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>wchar_t*<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>char*<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor; border-bottom: none;\"><code>%S<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>printf<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>wchar_t*<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\">N\/A<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor; border-top: none;\"><code>%S<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>wprintf<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>char*<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\">N\/A<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor; border-bottom: none; background-color: #bdecb6;\"><code>%hs<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>printf<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>char*<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>char*<\/code><\/td>\n<td style=\"border: none;\">\u21d0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor; border-top: none; background-color: #bdecb6;\"><code>%hs<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>wprintf<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>char*<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>char*<\/code><\/td>\n<td style=\"border: 
none;\">\u21d0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor; border-bottom: none; background-color: #bdecb6;\"><code>%ls<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>printf<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>wchar_t*<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>wchar_t*<\/code><\/td>\n<td style=\"border: none;\">\u21d0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor; border-top: none; background-color: #bdecb6;\"><code>%ls<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>wprintf<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>wchar_t*<\/code><\/td>\n<td style=\"border: solid 1px currentcolor; background-color: #bdecb6;\"><code>wchar_t*<\/code><\/td>\n<td style=\"border: none;\">\u21d0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor; border-bottom: none;\"><code>%ws<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>printf<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>wchar_t*<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\">N\/A<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor; border-top: none;\"><code>%ws<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>wprintf<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\"><code>wchar_t*<\/code><\/td>\n<td style=\"border: solid 1px currentcolor;\">N\/A<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>I highlighted the rows where the C standard agrees with the Windows classic format.\u2075 If you want your code to work the same under either format convention, you should stick to those rows.<\/p>\n<p>\u00b9 You&#8217;d think that adopting Unicode early would give Windows the first-mover advantage, but at least with respect to Unicode, it ended up being 
a first-mover disadvantage, because everybody else could sit back and wait for better solutions to emerge (such as UTF-8) before beginning their Unicode adoption efforts.<\/p>\n<p>\u00b2 I guess they thought that 65536 characters <a href=\"https:\/\/groups.google.com\/forum\/#!msg\/alt.folklore.computers\/mpjS-h4jpD8\/9DW_VQVLzpkJ\"> should be enough for anyone<\/a>.<\/p>\n<p>\u00b3 This was later upgraded to UTF-16. Fortunately, UTF-16 is backward compatible with UCS-2 for the code points that are representable in both.<\/p>\n<p>\u2074 Technically, the Unicode version was<\/p>\n<pre><span style=\"border: solid 1px currentcolor;\">unsigned short<\/span> buffer[256];\r\nGetSomeStringW(buffer, 256);\r\nwprintf(L\"The string is %s.\\n\", buffer);\r\n<\/pre>\n<p>because there was not yet a <code>wchar_t<\/code> as an independent type. Prior to the introduction of <code>wchar_t<\/code> to the standard, the <code>wchar_t<\/code> type was just a synonym for <code>unsigned short<\/code>. The changing fate of the <code>wchar_t<\/code> type <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20161201-00\/?p=94836\"> has its own story<\/a>.<\/p>\n<p>\u2075 The Windows classic format came first, so the question is whether the C standard chose to align with the Windows classic format, rather than vice versa.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Getting ahead of the standard, which went a different way.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-102823","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Getting ahead of the standard, which went a different 
way.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/102823","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=102823"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/102823\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=102823"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=102823"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=102823"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}