{"id":34973,"date":"2005-07-11T10:00:14","date_gmt":"2005-07-11T10:00:14","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2005\/07\/11\/converting-from-traditional-to-simplified-chinese-part-1-loading-the-dictionary\/"},"modified":"2005-07-11T10:00:14","modified_gmt":"2005-07-11T10:00:14","slug":"converting-from-traditional-to-simplified-chinese-part-1-loading-the-dictionary","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20050711-14\/?p=34973","title":{"rendered":"Converting from traditional to simplified Chinese, part 1: Loading the dictionary"},"content":{"rendered":"<p><P>\nOne step we had glossed over in our haste to get something\ninteresting on the screen in our\nChinese\/English dictionary program\nwas the conversion from traditional to simplified Chinese\ncharacters.\n<\/P>\n<P>\nThe format of the <CODE>hcutf8.txt<\/CODE> file is a series of lines,\neach of which is a UTF-8 encoded string consisting of a simplified\nChinese character followed by its traditional equivalents.\nOften, multiple traditional characters map to a single\nsimplified character.\nMuch more rarely&mdash;only twice in our data set&mdash;multiple\nsimplified characters map to a single traditional character.\nUnfortunately, one of the cases is the common syllable\n&#x9ebc;, which has two simplifications, either\n&#x4e48; or &#x9ebd;, the first of which is far more productive.\nWe&#8217;ll have to keep an eye out for that one.\n<\/P>\n<P>\n(Note also that in real life,\n<A HREF=\"http:\/\/weblogs.asp.net\/peterty\/archive\/2005\/03\/04\/385063.aspx\">\nthe mapping is more complicated\nthan a character-for-character substitution<\/A>,\nbut I&#8217;m willing to forego that level of complexity\nbecause this is just for my personal use and people will have\nrealized I&#8217;m not a native speaker long before I get caught up\nin language subtleties like that.)\n<\/P>\n<P>\nOne could try to work out a fancy data structure to 
represent\nthis mapping table compactly, but it turns out that simple is\nbetter here: an array of 65536 <CODE>WCHAR<\/CODE>s, each producing\nthe corresponding simplification.\nMost of the array will lie unused,\nsince the characters we are interested in lie in the range\nU+4E00 to U+9FFF.\nConsequently, the active part of the table is only about 40KB,\nwhich easily fits inside the L2&nbsp;cache.\n<\/P>\n<P>\nIt is important to know when\na simple data structure is better than a complex one.\n<\/P>\n<P>\nThe <CODE>hcutf8.txt<\/CODE> file contains a lot of fluff that we\naren&#8217;t interested in. Let&#8217;s strip that out ahead of time so that\nwe don&#8217;t waste our time parsing it at run-time.\n<\/P>\n<PRE>\n#!perl\n$_ = &lt;&gt; until \/^# Start zi\/; # ignore uninteresting characters\nwhile (&lt;&gt;) {\n s\/\\r\/\/g;\n next if length($_) == 7 &amp;&amp;\n         substr($_, 0, 3) eq substr($_, 3, 3); # ignore NOPs\n print;\n}\n<\/PRE>\n<P>\nRun the <CODE>hcutf8.txt<\/CODE> file through this filter to clean\nit up a bit.\n<\/P>\n<P>\nNow we can write our &#8220;traditional to simplified&#8221; dictionary.\n<\/P>\n<PRE>\nclass Trad2Simp\n{\npublic:\n Trad2Simp();\n WCHAR Map(WCHAR chTrad) const { return _rgwch[chTrad]; }\n\nprivate:\n WCHAR _rgwch[65536]; \/\/ woohoo!\n};\n\nTrad2Simp::Trad2Simp()\n{\n ZeroMemory(_rgwch, sizeof(_rgwch));\n\n MappedTextFile mtf(TEXT(&quot;hcutf8.txt&quot;));\n const CHAR* pchBuf = mtf.Buffer();\n const CHAR* pchEnd = pchBuf + mtf.Length();\n while (pchBuf &lt; pchEnd) {\n  const CHAR* pchCR = std::find(pchBuf, pchEnd, '\\r');\n  int cchBuf = (int)(pchCR - pchBuf);\n  WCHAR szMap[80];\n  DWORD cch = MultiByteToWideChar(CP_UTF8, 0, pchBuf, cchBuf,\n                                  szMap, 80);\n  if (cch &gt; 1) {\n   WCHAR chSimp = szMap[0];\n   for (DWORD i = 1; i &lt; cch; i++) {\n    if (szMap[i] != chSimp) {\n     _rgwch[szMap[i]] = chSimp;\n    }\n   }\n  }\n  \/\/ Advance to the next line even if this one produced no mapping\n  \/\/ (blank or unconvertible); otherwise the loop would never end.\n  pchBuf = std::find(pchCR, 
pchEnd, '\\n');\n  if (pchBuf &lt; pchEnd) pchBuf++; \/\/ step past the newline\n }\n _rgwch[0x9EBC] = 0x4E48;\n}\n<\/PRE>\n<P>\nWe read the file one line at a time, convert it from UTF-8,\nand for each nontrivial mapping, record it in our dictionary.\nAt the end, we do our little &#x4e48; special-case patch-up.\n<\/P>\n<P>\nNext time, we&#8217;ll use this mapping table to add simplified\nChinese characters to our dictionary.\n<\/P><\/p>\n","protected":false},"excerpt":{"rendered":"<p>One step we had glossed over in our haste to get something interesting on the screen in our Chinese\/English dictionary program was the conversion from traditional to simplified Chinese characters. The format of the hcutf8.txt file is a series of lines, each of which is a UTF-8 encoded string consisting of a simplified Chinese character [&hellip;]<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-34973","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>One step we had glossed over in our haste to get something interesting on the screen in our Chinese\/English dictionary program was the conversion from traditional to simplified Chinese characters. 
The format of the hcutf8.txt file is a series of lines, each of which is a UTF-8 encoded string consisting of a simplified Chinese character [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/34973","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=34973"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/34973\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=34973"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=34973"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=34973"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}