{"id":110048,"date":"2024-07-26T07:00:00","date_gmt":"2024-07-26T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=110048"},"modified":"2024-07-26T09:51:50","modified_gmt":"2024-07-26T16:51:50","slug":"20240726-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20240726-00\/?p=110048","title":{"rendered":"What can I do if IMlangConvertCharset is unable to convert from code page 28591 directly to UTF-8?"},"content":{"rendered":"<p>A customer wanted to do a character set conversion from code page 28591 directly to UTF-8. They found that when they ask <code>IMulti\u00adLanguage::<wbr \/>Create\u00adConvert\u00adCharset<\/code> to create such a converter, it returns <code>S_FALSE<\/code>, meaning that no such conversion is available.<\/p>\n<pre>auto mlang = wil::CoCreateInstance&lt;IMultiLanguage&gt;(\r\n        CLSID_CMultiLanguage);\r\n\r\n\/\/ This next call returns S_FALSE, indicating no conversion.\r\nwil::com_ptr&lt;IMLangConvertCharset&gt; convert;\r\nmlang-&gt;CreateConvertCharset(28591, CP_UTF8, 0, &amp;convert);\r\n<\/pre>\n<p>Oh no, what shall we ever do?<\/p>\n<p>Okay, so <code>CMultiLanguage<\/code> can&#8217;t convert from 28591 directly to UTF-8. 
But you can just convert through UTF-16.<\/p>\n<pre>HRESULT ConvertStringFrom28591ToUtf8(\r\n    char const* input,\r\n    int inputLength,\r\n    char * output,\r\n    int outputCapacity,\r\n    int* actualOutput)\r\n{\r\n    *actualOutput = 0;\r\n\r\n    \/\/ Ensure we are not working with negative numbers.\r\n    RETURN_HR_IF(E_INVALIDARG, inputLength &lt; 0 ||\r\n                               outputCapacity &lt; 0);\r\n\r\n    \/\/ Empty string converts to empty string.\r\n    if (inputLength == 0)\r\n    {\r\n        return S_OK;\r\n    }\r\n\r\n    \/\/ Avoid edge cases if outputCapacity = 0.\r\n    \/\/ This also short-circuits cases where we know that the\r\n    \/\/ output buffer isn't big enough to hold the converted input.\r\n    RETURN_HR_IF(HRESULT_FROM_WIN32(ERROR_INSUFFICIENT_BUFFER),\r\n                inputLength &gt; outputCapacity);\r\n\r\n    \/\/ Code page 28591 resides completely in the BMP.\r\n    auto bufferCapacity = std::min(inputLength, outputCapacity);\r\n    auto buffer = wil::make_unique_hlocal_nothrow&lt;wchar_t[]&gt;(\r\n        bufferCapacity);\r\n    RETURN_IF_NULL_ALLOC(buffer);\r\n\r\n    \/\/ Convert from 28591 to UTF-16LE.\r\n    auto result = MultiByteToWideChar(28591, MB_ERR_INVALID_CHARS,\r\n        input, inputLength, buffer.get(), bufferCapacity);\r\n    RETURN_IF_WIN32_BOOL_FALSE(result != 0);\r\n\r\n    \/\/ Convert from UTF-16LE to UTF-8.\r\n    \/\/ Pass the number of code units actually written above.\r\n    *actualOutput = WideCharToMultiByte(CP_UTF8, 0,\r\n        buffer.get(), result,\r\n        output, outputCapacity, nullptr, nullptr);\r\n    RETURN_IF_WIN32_BOOL_FALSE(*actualOutput != 0);\r\n\r\n    return S_OK;\r\n}\r\n<\/pre>\n<p>After dealing with some edge cases, we allocate a temporary UTF-16LE buffer. 
That buffer needs to be big enough to hold the converted input, but doesn&#8217;t need to be so big that the caller-provided output couldn&#8217;t possibly hold the result.<\/p>\n<p>Since all the characters of code page 28591 have code points less than U+10000, they will convert to a single UTF-16LE code unit. Therefore, we will need at most <code>inputLength<\/code> UTF-16LE code units to hold the intermediate UTF-16LE output.<\/p>\n<p>And since all the code points of the intermediate buffer will be less than U+10000, there will never be a need for more UTF-16LE code units than corresponding UTF-8 code units. Therefore, any intermediate buffer bigger than <code>outputCapacity<\/code> wouldn&#8217;t fit in the caller-provided buffer anyway, so we can just return the &#8220;insufficient buffer&#8221; error right away without having to do any work.<\/p>\n<p>The rest is anticlimactic: We convert the input buffer to our temporary buffer, and then we convert the temporary buffer to the output buffer.<\/p>\n<p>In the general case, a single input byte could result in two UTF-16LE code units, if it represents a character outside the BMP. (We assume that no code page has a single input byte that converts to multiple Unicode characters.) And the worst-case expansion from UTF-8 bytes to UTF-16LE code units is just 1:1. So in the general case, the required temporary buffer capacity is <code>std::min(<wbr \/>2 * inputLength, outputCapacity)<\/code>.<\/p>\n<p>The whole <code>IMultiLanguage<\/code> interface was a red herring. You never needed it. 
The conversion was in front of you the whole time.<\/p>\n<p><b>Bonus chatter<\/b>: The entire MultiLanguage API family has been deprecated <a title=\"Converting between LCIDs and RFC 1766 language codes\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20060105-00\/?p=32753\"> since at least 2008, possibly longer<\/a>, so it&#8217;s a good thing we migrated away from it.<\/p>\n<p><b>Bonus bonus chatter<\/b>: The International Components for Unicode (ICU) <a title=\"International Components for Unicode (ICU)\" href=\"https:\/\/learn.microsoft.com\/windows\/win32\/intl\/international-components-for-unicode--icu-\"> have been included with Windows since Windows 10 version 1703<\/a>, so if you don&#8217;t need to support anything older than that, you can just use the copy of ICU built into Windows. <a href=\"https:\/\/unicode-org.github.io\/icu-docs\/apidoc\/released\/icu4c\/ucnv_8h.html#a8c2852929b99ca983ccd1f33a203cc2a\"> The <code>ucnv_convertEx<\/code> function<\/a> lets you convert from one encoding to another. Mind you, it &#8220;pivots&#8221; through UTF-16, so it&#8217;s internally doing the same thing we are, but at least it&#8217;s done for you. 
You can consult the ICU documentation for <a href=\"https:\/\/unicode-org.github.io\/icu\/userguide\/conversion\/converters.html\"> more information about converters<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>You can do the conversion in two steps using things you already have.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-110048","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>You can do the conversion in two steps using things you already have.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110048","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=110048"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110048\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=110048"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=110048"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags
?post=110048"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}