{"id":44613,"date":"2015-02-23T07:00:00","date_gmt":"2015-02-23T22:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2015\/02\/23\/further-adventures-in-trying-to-guess-what-encoding-a-file-is-in\/"},"modified":"2019-03-13T12:13:11","modified_gmt":"2019-03-13T19:13:11","slug":"20150223-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20150223-00\/?p=44613","title":{"rendered":"Further adventures in trying to guess what encoding a file is in"},"content":{"rendered":"<p>The <code>Is&shy;Text&shy;Unicode<\/code> function tries to guess the encoding of a block of memory purporting to contain text, but it can only say &#8220;Looks like Unicode&#8221; or &#8220;Doesn&#8217;t look like Unicode&#8221;, and there are <a>some notorious examples<\/a> of <a HREF=\"http:\/\/blogs.msdn.com\/b\/oldnewthing\/archive\/2007\/04\/17\/2158334.aspx\">where it guesses wrong<\/a>. <\/p>\n<p>A more flexible alternative is <a HREF=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ie\/aa740985%28v=vs.85%29.aspx\"><code>IMulti&shy;Language2::Detect&shy;Code&shy;page&shy;In&shy;IStream<\/code><\/a> and its buffer-based equivalent <a HREF=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ie\/aa740986%28v=vs.85%29.aspx\"><code>IMulti&shy;Language2::Detect&shy;Input&shy;Code&shy;page<\/code><\/a>. Not only can these methods detect a much larger range of code pages, they can also report multiple code pages, each with a corresponding confidence level. <\/p>\n<p>Here&#8217;s a Little Program that takes the function out for a spin. (Remember, Little Programs do little to no error checking.) 
<\/p>\n<pre>\n#define UNICODE\n#define _UNICODE\n#include &lt;windows.h&gt;\n#include &lt;shlwapi.h&gt;\n#include &lt;ole2.h&gt;\n#include &lt;mlang.h&gt;\n#include &lt;atlbase.h&gt;\n#include &lt;stdio.h&gt;\n\nbool IsHtmlFile(PCWSTR pszFile)\n{\n PCWSTR pszExtension = PathFindExtensionW(pszFile);\n return\n  CompareStringOrdinal(pszExtension, -1,\n                       L\".htm\", -1, TRUE) == CSTR_EQUAL ||\n  CompareStringOrdinal(pszExtension, -1,\n                       L\".html\", -1, TRUE) == CSTR_EQUAL;\n}\n\nint __cdecl wmain(int argc, wchar_t **argv)\n{\n if (argc &lt; 2) return 0;\n <a HREF=\"http:\/\/blogs.msdn.com\/b\/oldnewthing\/archive\/2004\/05\/20\/135841.aspx\">CCoInitialize<\/a> init;\n CComPtr&lt;IStream&gt; spstm;\n SHCreateStreamOnFileEx(argv[1], STGM_READ, 0, FALSE, nullptr, &amp;spstm);\n\n CComPtr&lt;IMultiLanguage2&gt; spml;\n CoCreateInstance(CLSID_CMultiLanguage, NULL,\n     CLSCTX_ALL, IID_PPV_ARGS(&amp;spml));\n\n DetectEncodingInfo info[10];\n INT cInfo = ARRAYSIZE(info);\n\n DWORD dwFlag = IsHtmlFile(argv[1]) ? MLDETECTCP_HTML\n                                    : MLDETECTCP_NONE;\n HRESULT hr = spml-&gt;DetectCodepageInIStream(\n     dwFlag, 0, spstm, info, &amp;cInfo);\n if (hr == S_OK) {\n  for (int i = 0; i &lt; cInfo; i++) {\n   wprintf(L\"info[%d].nLangID = %d\\n\", i, info[i].nLangID);\n   wprintf(L\"info[%d].nCodePage = %d\\n\", i, info[i].nCodePage);\n   wprintf(L\"info[%d].nDocPercent = %d\\n\", i, info[i].nDocPercent);\n   wprintf(L\"info[%d].nConfidence = %d\\n\", i, info[i].nConfidence);\n  }\n } else {\n  wprintf(L\"Cannot determine the encoding (error: 0x%08x)\\n\", hr);\n }\n return 0;\n}\n<\/pre>\n<p>Run the program with a file name as the command line argument, and the program will report all the detected code pages. 
<\/p>\n<p>One thing that may not be obvious is that the program passes the <code>MLDETECTCP_HTML<\/code> flag if the file extension is <code>.htm<\/code> or <code>.html<\/code>. That is a hint to the detector that it shouldn&#8217;t get faked out by text like <code>&lt;body&gt;<\/code> and think it found an English word. <\/p>\n<p>Here&#8217;s the output of the program when run on its own source code: <\/p>\n<pre>\ninfo[0].nLangID = 9\ninfo[0].nCodePage = 20127\ninfo[0].nDocPercent = 100\ninfo[0].nConfidence = 83\ninfo[1].nLangID = -1\ninfo[1].nCodePage = 65001\ninfo[1].nDocPercent = -1\ninfo[1].nConfidence = -1\n<\/pre>\n<p>This says that its first guess is that the text is in language 9, which is <a HREF=\"http:\/\/msdn.microsoft.com\/en-us\/library\/windows\/desktop\/dd318693(v=vs.85).aspx\"><code>LANG_ENGLISH<\/code><\/a>, code page 20127, which is <a HREF=\"http:\/\/msdn.microsoft.com\/en-us\/library\/windows\/desktop\/dd317756(v=vs.85).aspx\">US-ASCII<\/a>. That text occupies 100% of the file, and the confidence level is 83. <\/p>\n<p>The second guess is that the text is in code page 65001, which is UTF-8, but the confidence level for that is low. <\/p>\n<p>The language-guessing part of the function is not very sophisticated. For a higher-quality algorithm for guessing what language some text is in, use <a HREF=\"http:\/\/msdn.microsoft.com\/en-us\/library\/windows\/desktop\/dd319066%28v=vs.85%29.aspx\">Extended Linguistic Services<\/a>. I won&#8217;t bother writing a sample application <a HREF=\"http:\/\/msdn.microsoft.com\/en-us\/library\/windows\/desktop\/dd319110%28v=vs.85%29.aspx\">because MSDN already contains one<\/a>. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today&#8217;s candidate: IMultiLanguage2::DetectCodepageInIStream<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-44613","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Today&#8217;s candidate: IMultiLanguage2::DetectCodepageInIStream<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/44613","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=44613"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/44613\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=44613"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=44613"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=44613"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}