{"id":94477,"date":"2016-10-10T07:00:00","date_gmt":"2016-10-10T21:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=94477"},"modified":"2019-03-13T10:32:24","modified_gmt":"2019-03-13T17:32:24","slug":"20161010-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20161010-00\/?p=94477","title":{"rendered":"A Little Program to fix one particular type of mojibake"},"content":{"rendered":"<p><a HREF=\"https:\/\/www.youtube.com\/watch?v=3eMCURWpNAg\">Has this ever happened to you<\/a>? You&#8217;re downloading your daughter&#8217;s Chinese homework assignment, but the file name gets all up in your <a HREF=\"https:\/\/en.wikipedia.org\/wiki\/Mojibake\">mojibake<\/a>, and the results are nonsense. <\/p>\n<p>Time to do some reverse-mojibake. <\/p>\n<p>The first step in reversing mojibake is figuring out which wrong turn the encoding took. I took an educated guess and assumed that the file name was encoded in UTF-8, which was then misinterpreted as ANSI. I suspect this type of error is pretty common, so it was my first stab. <\/p>\n<p>To reverse it, therefore, we need to take the Unicode file name, convert it to ANSI bytes, then reinterpret those bytes as UTF-8. Let&#8217;s try it: <\/p>\n<pre>\nusing System.Text;\n\nclass Program\n{\n  static public void Main(string[] args)\n  {\n    foreach (var file in args)\n    {\n      \/\/ Recover the original bytes via the system ANSI code page\n      \/\/ (Encoding.Default on .NET Framework).\n      var bytes = Encoding.Default.GetBytes(file);\n      \/\/ Decode those bytes as the UTF-8 they really were.\n      var s = Encoding.UTF8.GetString(bytes);\n      \/\/ Rename the file to the repaired name.\n      System.IO.File.Move(file, s);\n    }\n  }\n}\n<\/pre>\n<p>I&#8217;ll take the file name on the command line, convert it via the default system code page into bytes, then take those bytes and convert them back into a string by reinterpreting them as UTF-8. I then rename the file with the &#8220;fixed&#8221; name. <\/p>\n<p>Fortunately, this worked. The file name got unscrambled. 
<\/p>\n<table CLASS=\"cp3\" CELLPADDING=\"3\" BORDER=\"1\" STYLE=\"border-collapse: collapse;text-align: center\">\n<tr STYLE=\"font-size: 7pt\">\n<td>U+00E5<\/td>\n<td>U+00AE<\/td>\n<td>U+00B6<\/td>\n<td>U+00E5<\/td>\n<td>U+00BA<\/td>\n<td>U+00AD<\/td>\n<td>U+00E8<\/td>\n<td>U+0081<\/td>\n<td>U+00AF<\/td>\n<td>U+00E7<\/td>\n<td>U+00B5<\/td>\n<td>U+00A1<\/td>\n<td>U+00E5<\/td>\n<td>U+2013<\/td>\n<td>U+00AE<\/td>\n<td>U+002E<\/td>\n<td>U+0070<\/td>\n<td>U+0064<\/td>\n<td>U+0066<\/td>\n<\/tr>\n<tr>\n<td>&#xE5;<\/td>\n<td>&#xAE;<\/td>\n<td>&#xB6;<\/td>\n<td>&#xE5;<\/td>\n<td>&#xBA;<\/td>\n<td>&#xAD;<\/td>\n<td>&#xE8;<\/td>\n<td>&#x81;<\/td>\n<td>&#xAF;<\/td>\n<td>&#xE7;<\/td>\n<td>&#xB5;<\/td>\n<td>&#xA1;<\/td>\n<td>&#xE5;<\/td>\n<td>&#x2013;<\/td>\n<td>&#xAE;<\/td>\n<td>&#x2E;<\/td>\n<td>&#x70;<\/td>\n<td>&#x64;<\/td>\n<td>&#x66;<\/td>\n<\/tr>\n<\/table>\n<p>Converted to bytes via code page 1252 Windows Western European Latin 1 (which is the default code page for the United States): <\/p>\n<table CLASS=\"cp3\" CELLPADDING=\"3\" BORDER=\"1\" STYLE=\"border-collapse: collapse;text-align: center\">\n<tr>\n<td>E5<\/td>\n<td>AE<\/td>\n<td>B6<\/td>\n<td>E5<\/td>\n<td>BA<\/td>\n<td>AD<\/td>\n<td>E8<\/td>\n<td>81<\/td>\n<td>AF<\/td>\n<td>E7<\/td>\n<td>B5<\/td>\n<td>A1<\/td>\n<td>E5<\/td>\n<td>96<\/td>\n<td>AE<\/td>\n<td>2E<\/td>\n<td>70<\/td>\n<td>64<\/td>\n<td>66<\/td>\n<\/tr>\n<\/table>\n<p>And then converted back to Unicode via UTF-8: <\/p>\n<table CLASS=\"cp3\" CELLPADDING=\"3\" BORDER=\"1\" STYLE=\"border-collapse: collapse;text-align: center\">\n<tr STYLE=\"font-size: 7pt\">\n<td>U+5BB6<\/td>\n<td>U+5EAD<\/td>\n<td>U+806F<\/td>\n<td>U+7D61<\/td>\n<td>U+55AE<\/td>\n<td>U+002E<\/td>\n<td>U+0070<\/td>\n<td>U+0064<\/td>\n<td>U+0066<\/td>\n<\/tr>\n<tr>\n<td>&#x5BB6;<\/td>\n<td>&#x5EAD;<\/td>\n<td>&#x806F;<\/td>\n<td>&#x7D61;<\/td>\n<td>&#x55AE;<\/td>\n<td>&#x2E;<\/td>\n<td>&#x70;<\/td>\n<td>&#x64;<\/td>\n<td>&#x66;<\/td>\n<\/tr>\n<\/table>\n<p>Et voil&agrave;. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Keep your eye on the code page.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-94477","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Keep your eye on the code page.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/94477","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=94477"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/94477\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=94477"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=94477"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=94477"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}