{"id":105723,"date":"2021-09-23T07:00:00","date_gmt":"2021-09-23T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105723"},"modified":"2021-09-23T06:46:12","modified_gmt":"2021-09-23T13:46:12","slug":"20210923-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210923-00\/?p=105723","title":{"rendered":"Why is there trailing garbage when I try to decode the bytes of a HttpContent object?"},"content":{"rendered":"<p>A customer was having trouble extracting text from an HTTP response.<\/p>\n<pre>winrt::HttpRequest request = ...;\r\n\r\nauto result = co_await request.Content().ReadAsStringAsync();\r\n<\/pre>\n<p>This version produced a string that looked mostly okay, but some parts were corrupted.<\/p>\n<blockquote class=\"q\"><p><tt>{\"name\":\"\u00f0<span style=\"background-color: black; color: white; padding: 0 1px; margin: 0 1px;\">APC<\/span><span style=\"background-color: black; color: white; padding: 0 1px; margin: 0 1px;\">DCS<\/span>\u00b1meow\"}<\/tt><\/p><\/blockquote>\n<p>From inspection, it&#8217;s clear that what we have here is mojibake, wherein a UTF-8 string is being misinterpreted in some other 8-bit character set.<\/p>\n<p>According to <a href=\"https:\/\/tools.ietf.org\/html\/rfc2616#section-3.7.1\">RFC 2616 section 3.7.1<\/a>, if there is no explicit character set for a <tt>text<\/tt> media subtype, the default character set is ISO-8859-1. Evidently, this server returned a string encoded as UTF-8 but failed to indicate this character set when it reported its <tt>Content-Type<\/tt>. As a result, the string defaults to ISO-8859-1.<\/p>\n<p>Oops.<\/p>\n<p>Now, <a href=\"https:\/\/tools.ietf.org\/html\/rfc2616#section-3.4.1\"> section 3.4.1 of RFC 2616<\/a> acknowledges that it is common for HTTP clients to interpret the lack of an explicit character set as an opportunity to make a best guess. 
The Windows Runtime does perform some guessing of the character set if no character set is provided:<\/p>\n<ul>\n<li>If the buffer begins with a UTF-8 BOM, a UTF-16LE BOM, a UTF-16BE BOM, or a GB18030 BOM, then the buffer is decoded according to that character set.<\/li>\n<li>If the content type is <tt>application\/json<\/tt> or is of the form <tt>*+json<\/tt>, then it is decoded as UTF-8.<\/li>\n<li>Otherwise, it is decoded as ISO-8859-1.<\/li>\n<\/ul>\n<p>The fact that we made it all the way to the last step means that the server didn&#8217;t use a UTF-8 BOM, nor did it set the content type to <tt>application\/json<\/tt> even though it was returning JSON.<\/p>\n<p>Oops.<\/p>\n<p>Okay, so let&#8217;s try to work around this (apparently very broken) server by taking the response bytes and explicitly decoding them as UTF-8.<\/p>\n<pre>std::wstring Utf8ToUtf16(char const* str)\r\n{\r\n    std::wstring result;\r\n    if (str) {\r\n        auto resultLen = MultiByteToWideChar(\r\n            CP_UTF8, MB_ERR_INVALID_CHARS, str, -1, nullptr, 0);\r\n        if (resultLen) {\r\n            result.resize(resultLen);\r\n            MultiByteToWideChar(\r\n                CP_UTF8, MB_ERR_INVALID_CHARS, str, -1,\r\n                result.data(), resultLen);\r\n        }\r\n    }\r\n    return result;\r\n}\r\n\r\nwinrt::HttpRequest request = ...;\r\n\r\nauto buffer = co_await request.Content().ReadAsBufferAsync();\r\nauto result = Utf8ToUtf16((char const*)buffer.data());\r\n<\/pre>\n<p>This version worked better, but it had garbage at the end:<\/p>\n<blockquote class=\"q\"><p><tt>{\"name\":\"\ud83d\udc31meow\"}<span style=\"background-color: black; color: white; padding: 0 1px; margin: 0 1px;\">SOH<\/span><\/tt><\/p><\/blockquote>\n<p>In this case, the problem was not in the acquisition of the buffer but rather in the conversion of the buffer to a string. 
The <code>buffer.data()<\/code> method returns a pointer to the start of the buffer, and the code passes this as the source string to <code>Multi\u00adByte\u00adTo\u00adWide\u00adChar<\/code> with <code>-1<\/code> as the string length.<\/p>\n<p>The special value <code>-1<\/code> means that the pointer should be treated as the start of a null-terminated string. But the <code>Buffer<\/code> that is produced by <code>Read\u00adAs\u00adBuffer\u00adAsync<\/code> is just raw bytes returned from the server, and the server isn&#8217;t going to put a null terminator at the end. The server says, &#8220;The response is 19 bytes long,&#8221; and it sends the 19 bytes and that&#8217;s that.<\/p>\n<p>So the extra garbage is a read buffer overflow, where the code just reads past the end of the buffer until it finally runs into a zero byte somewhere.<\/p>\n<p>You want to decode the bytes in the buffer, so you need to specify the number of bytes in the buffer, rather than saying &#8220;Just keep decoding until you hit a zero byte.&#8221;<\/p>\n<pre>std::wstring Utf8ToUtf16(char const* str<span style=\"color: blue;\">, int32_t inputLen<\/span>)\r\n{\r\n    std::wstring result;\r\n    if (str) {\r\n        auto resultLen = MultiByteToWideChar(\r\n            CP_UTF8, MB_ERR_INVALID_CHARS, str, <span style=\"color: blue;\">inputLen<\/span>, nullptr, 0);\r\n        if (resultLen) {\r\n            result.resize(resultLen);\r\n            MultiByteToWideChar(\r\n                CP_UTF8, MB_ERR_INVALID_CHARS, str, <span style=\"color: blue;\">inputLen<\/span>,\r\n                result.data(), resultLen);\r\n        }\r\n    }\r\n    return result;\r\n}\r\n\r\nwinrt::HttpRequest request = ...;\r\n\r\nauto buffer = co_await request.Content().ReadAsBufferAsync();\r\nauto result = Utf8ToUtf16(\r\n    (char const*)buffer.data(),\r\n    <span style=\"color: blue;\">static_cast&lt;int32_t&gt;(buffer.Length())<\/span>);\r\n<\/pre>\n<p>This produces the desired string, decoded as UTF-8, with no 
garbage.<\/p>\n<p>Now, it turns out that you don&#8217;t have to write the code to take a <code>Buffer<\/code> containing a string encoded in UTF-8 and convert it to a UTF-16 string. The Windows Runtime already provides a helper function to do this:<\/p>\n<pre>winrt::HttpRequest request = ...;\r\n\r\nauto buffer = co_await request.Content().ReadAsBufferAsync();\r\nauto result = <span style=\"color: blue;\">CryptographicBuffer::ConvertBinaryToString(\r\n    BinaryStringEncoding::Utf8, buffer);<\/span>\r\n<\/pre>\n<p>But since we&#8217;re using C++\/WinRT, we can avoid all that and use <a title=\"Converting between UTF-8 strings and UTF-16 strings in C++\/WinRT\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210922-00\/?p=105717\">the conversion built into C++\/WinRT we learned last time<\/a>. The hard part is getting a <code>std::string_view<\/code> out of a <code>buffer<\/code>.<\/p>\n<pre>winrt::HttpRequest request = ...;\r\n\r\nauto buffer = co_await request.Content().ReadAsBufferAsync();\r\nauto result = <span style=\"color: blue;\">winrt::to_hstring(\r\n    std::string_view{\r\n        reinterpret_cast&lt;char const*&gt;(buffer.data()),\r\n        buffer.Length() });<\/span>\r\n<\/pre>\n<p>So there you go, reading the raw buffer and converting it from UTF-8 to a UTF-16 string.<\/p>\n<p>In the meantime, go fix your server already.<\/p>\n<p><b>Epilogue<\/b>: The customer found that indeed their server was misconfigured. 
The response was being generated via a loopback server, and it was putting the <code>Content-Type<\/code> header in the <code>ResponseHeaders<\/code> instead of the <code>ContentHeaders<\/code>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>You need to know when to stop.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-105723","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>You need to know when to stop.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105723","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105723"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105723\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105723"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105723"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105723"}],"curies":[{"name":"wp","href":"https:\/
\/api.w.org\/{rel}","templated":true}]}}