{"id":43553,"date":"2014-11-24T07:00:00","date_gmt":"2014-11-24T07:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2014\/11\/24\/the-crazy-world-of-stripping-diacritics\/"},"modified":"2014-11-24T07:00:00","modified_gmt":"2014-11-24T07:00:00","slug":"the-crazy-world-of-stripping-diacritics","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20141124-00\/?p=43553","title":{"rendered":"The crazy world of stripping diacritics"},"content":{"rendered":"<p>\nToday&#8217;s Little Program strips diacritics from a Unicode string.\nWhy?\nHey, I said that Little Programs require little to no motivation.\nIt might come in handy in a spam filter, since it was popular,\nat least for a time, to put random accent marks on spam subject\nlines in order to sneak past keyword filters.\n(It doesn&#8217;t seem to be popular any more.)\n<\/p>\n<p>\nThis is basically a C-ization of\n<a HREF=\"http:\/\/www.siao2.com\/2005\/02\/19\/376617.aspx\">\nthe C# code originally written by Michael Kaplan<\/a>.\nDon&#8217;t forget to read\n<a HREF=\"http:\/\/www.siao2.com\/2007\/05\/14\/2629747.aspx\">\nthe follow-up discussion that notes that this can result in strange\nresults<\/a>.\n<\/p>\n<p>\nFirst, let&#8217;s create our dialog box.\nNote that I intentionally give it a huge font\nso that the diacritics are easier to see.\n<\/p>\n<pre>\n\/\/ scratch.h\n#define IDD_SCRATCH 1\n#define IDC_SOURCE 100\n#define IDC_SOURCEPOINTS 101\n#define IDC_DEST 102\n#define IDC_DESTPOINTS 103\n\/\/ scratch.rc\n#include &lt;windows.h&gt;\n#include \"scratch.h\"\nIDD_SCRATCH DIALOGEX 0, 0, 320, 88\nSTYLE DS_MODALFRAME | WS_POPUP | WS_CAPTION | WS_SYSMENU\nCaption \"Stripping diacritics\"\nFONT 20, \"MS Shell Dlg\"\nBEGIN\n    LTEXT \"Original:\", -1, 4, 8, 38, 10\n    EDITTEXT IDC_SOURCE, 46, 6, 270, 12, ES_AUTOHSCROLL\n    LTEXT \"\", IDC_SOURCEPOINTS, 46, 22, 270, 12\n    LTEXT \"Modified:\", -1, 4, 40, 38, 10\n    EDITTEXT IDC_DEST, 46, 38, 270, 
12, ES_AUTOHSCROLL\n    LTEXT \"\", IDC_DESTPOINTS, 46, 54, 270, 12\n    DEFPUSHBUTTON \"OK\", IDOK, 266, 70, 50, 14\nEND\n<\/pre>\n<p>\nNow the program that uses the dialog box.\n<\/p>\n<pre>\n\/\/ scratch.cpp\n#define STRICT\n#define UNICODE\n#define _UNICODE\n#include &lt;windows.h&gt;\n#include &lt;windowsx.h&gt;\n#include &lt;strsafe.h&gt;\n#include \"scratch.h\"\n#define MAXSOURCE 64\nvoid SetDlgItemCodePoints(HWND hwnd, int idc, PCWSTR psz)\n{\n  wchar_t szResult[MAXSOURCE * 4 * 5];\n  szResult[0] = 0;\n  PWSTR pszResult = szResult;\n  size_t cchResult = ARRAYSIZE(szResult);\n  HRESULT hr = S_OK;\n  for (; SUCCEEDED(hr) &amp;&amp; *psz; psz++) {\n    wchar_t szPoint[6];\n    hr = StringCchPrintf(szPoint, ARRAYSIZE(szPoint), L\"%04x \", *psz);\n    if (SUCCEEDED(hr)) {\n      hr = StringCchCatEx(pszResult, cchResult, szPoint, &amp;pszResult, &amp;cchResult, 0);\n    }\n  }\n  SetDlgItemText(hwnd, idc, szResult);\n}\n<\/pre>\n<p>\nThe <code>Set&shy;Dlg&shy;Item&shy;Code&shy;Points<\/code>\nfunction takes a UTF-16 string and prints all the code points.\nThis is just to help visualize the result;\nit&#8217;s not part of the actual diacritic-removal algorithm.\n<\/p>\n<pre>\nvoid OnUpdate(HWND hwnd)\n{\n  wchar_t szSource[MAXSOURCE];\n  GetDlgItemText(hwnd, IDC_SOURCE, szSource, ARRAYSIZE(szSource));\n  wchar_t szDest[MAXSOURCE * 4];\n  int cchActual = NormalizeString(NormalizationKD,\n                                  szSource, -1,\n                                  szDest, ARRAYSIZE(szDest));\n  if (cchActual &lt;= 0) szDest[0] = 0;\n  WORD rgType[ARRAYSIZE(szDest)];\n  GetStringTypeW(CT_CTYPE3, szDest, -1, rgType);\n  PWSTR pszWrite = szDest;\n  for (int i = 0; szDest[i]; i++) {\n    if (!(rgType[i] &amp; C3_NONSPACING)) {\n      *pszWrite++ = szDest[i];\n    }\n  }\n  *pszWrite = 0;\n  SetDlgItemText(hwnd, IDC_DEST, szDest);\n  SetDlgItemCodePoints(hwnd, IDC_SOURCEPOINTS, szSource);\n  SetDlgItemCodePoints(hwnd, IDC_DESTPOINTS, 
szDest);\n}\n<\/pre>\n<p>\nOkay, here&#8217;s where the actual work happens.\nWe put the source string into Normalization Form KD.\nThis decomposes the diacritics so that we can identify them\nwith <code>Get&shy;String&shy;TypeW<\/code>\nand then strip them out.\n<\/p>\n<p>\nOf course, in real life, you wouldn&#8217;t hard-code the array sizes\nlike I did here, but this is just a Little Program,\nand Little Programs are allowed to take shortcuts.\n<\/p>\n<p>\nThe rest of the program is just a framework to get into that\nfunction.\n<\/p>\n<pre>\nINT_PTR CALLBACK DlgProc(HWND hwnd, UINT wm,\n                         WPARAM wParam, LPARAM lParam)\n{\n  switch (wm)\n  {\n  case WM_INITDIALOG:\n    return TRUE;\n  case WM_COMMAND:\n    switch (GET_WM_COMMAND_ID(wParam, lParam)) {\n    case IDC_SOURCE:\n      switch (GET_WM_COMMAND_CMD(wParam, lParam)) {\n      case EN_UPDATE:\n        OnUpdate(hwnd);\n        break;\n      }\n      break;\n    case IDOK:\n      EndDialog(hwnd, 0);\n      return TRUE;\n    }\n    break;\n  case WM_CLOSE:\n    EndDialog(hwnd, 0);\n    return TRUE;\n  }\n  return FALSE;\n}\nint WINAPI wWinMain(HINSTANCE hinst, HINSTANCE hinstPrev,\n                   LPWSTR lpCmdLine, int nShowCmd)\n{\n  DialogBox(hinst, MAKEINTRESOURCE(IDD_SCRATCH), nullptr, DlgProc);\n  return 0;\n}\n<\/pre>\n<p>\nOkay, let&#8217;s take this program for a spin.\nHere are some interesting characters to try:\n<\/p>\n<table BORDER=\"1\" STYLE=\"border: solid 1px black;border-collapse: collapse\" CELLPADDING=\"3\" CELLSPACING=\"0\">\n<tr>\n<th COLSPAN=\"3\">Original character<\/th>\n<th COLSPAN=\"3\">Resulting character<\/th>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&ordf;<\/td>\n<td>00AA<\/td>\n<td>Feminine ordinal indicator<\/td>\n<td ALIGN=\"center\">a<\/td>\n<td>0061<\/td>\n<td>Latin small letter a<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&sup1;<\/td>\n<td>00B9<\/td>\n<td>Superscript one<\/td>\n<td ALIGN=\"center\">1<\/td>\n<td>0031<\/td>\n<td>Digit one<\/td>\n<\/tr>\n<tr>\n<td 
ALIGN=\"center\">&frac12;<\/td>\n<td>00BD<\/td>\n<td>Vulgar fraction one half<\/td>\n<td ALIGN=\"center\">1&#x2044;2<\/td>\n<td>0031 2044 0032<\/td>\n<td>Digit one + Fraction slash + Digit two<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x131;<\/td>\n<td>0131<\/td>\n<td>Latin small letter dotless i<\/td>\n<td ALIGN=\"center\">&#x131;<\/td>\n<td>0131<\/td>\n<td>Latin small letter dotless i<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&Oslash;<\/td>\n<td>00D8<\/td>\n<td>Latin capital letter O with stroke<\/td>\n<td ALIGN=\"center\"><\/td>\n<td><\/td>\n<td>Disappears!<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#322;<\/td>\n<td>0142<\/td>\n<td>Latin small letter l with stroke<\/td>\n<td ALIGN=\"center\">&#322;<\/td>\n<td>0142<\/td>\n<td>Latin small letter l with stroke<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#320;<\/td>\n<td>0140<\/td>\n<td>Latin small letter l with middle dot<\/td>\n<td ALIGN=\"center\">l&middot;<\/td>\n<td>006C 00B7<\/td>\n<td>Latin small letter l + middle dot<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&aelig;<\/td>\n<td>00E6<\/td>\n<td>Latin small letter ae<\/td>\n<td ALIGN=\"center\">&aelig;<\/td>\n<td>00E6<\/td>\n<td>Latin small letter ae<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x389;<\/td>\n<td>0389<\/td>\n<td>Greek capital letter Eta with tonos<\/td>\n<td ALIGN=\"center\">&Eta;<\/td>\n<td>0397<\/td>\n<td>Greek capital letter Eta<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x410;<\/td>\n<td>0410<\/td>\n<td>Cyrillic capital letter &#x410;<\/td>\n<td ALIGN=\"center\">&#x410;<\/td>\n<td>0410<\/td>\n<td>Cyrillic capital letter &#x410;<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&Aring;<\/td>\n<td>00C5<\/td>\n<td>Latin capital letter A with ring above<\/td>\n<td ALIGN=\"center\">A<\/td>\n<td>0041<\/td>\n<td>Latin capital letter A<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#xff21;<\/td>\n<td>FF21<\/td>\n<td>Fullwidth Latin capital letter A<\/td>\n<td ALIGN=\"center\">A<\/td>\n<td>0041<\/td>\n<td>Latin capital letter A<\/td>\n<\/tr>\n<tr>\n<td 
ALIGN=\"center\">&#x2460;<\/td>\n<td>2460<\/td>\n<td>Circled digit one<\/td>\n<td ALIGN=\"center\">1<\/td>\n<td>0031<\/td>\n<td>Digit one<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2780;<\/td>\n<td>2780<\/td>\n<td>Dingbat circled sans-serif digit one<\/td>\n<td ALIGN=\"center\">&#x2780;<\/td>\n<td>2780<\/td>\n<td>Dingbat circled sans-serif digit one<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&reg;<\/td>\n<td>00AE<\/td>\n<td>Registered sign<\/td>\n<td ALIGN=\"center\">&reg;<\/td>\n<td>00AE<\/td>\n<td>Registered sign<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x24c7;<\/td>\n<td>24c7<\/td>\n<td>Circled Latin capital letter R<\/td>\n<td ALIGN=\"center\">R<\/td>\n<td>0052<\/td>\n<td>Latin capital letter R<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x1d595;<\/td>\n<td>D835 DD95<\/td>\n<td>Mathematical bold Fraktur small p<\/td>\n<td ALIGN=\"center\">p<\/td>\n<td>0070<\/td>\n<td>Latin small letter p<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#xff6c;<\/td>\n<td>FF6C<\/td>\n<td>Halfwidth Katakana letter small Ya<\/td>\n<td ALIGN=\"center\">&#x30e3;<\/td>\n<td>30E3<\/td>\n<td>Katakana letter small Ya<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x30e3;<\/td>\n<td>30E3<\/td>\n<td>Katakana letter small Ya<\/td>\n<td ALIGN=\"center\">&#x30e3;<\/td>\n<td>30E3<\/td>\n<td>Katakana letter small Ya<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x30b4;<\/td>\n<td>30B4<\/td>\n<td>Katakana letter Go<\/td>\n<td ALIGN=\"center\">&#x30b3;<\/td>\n<td>30B3<\/td>\n<td>Katakana letter Ko<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x201c;<\/td>\n<td>201C<\/td>\n<td>Left double quotation mark<\/td>\n<td ALIGN=\"center\">&#x201c;<\/td>\n<td>201C<\/td>\n<td>Left double quotation mark<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x201d;<\/td>\n<td>201D<\/td>\n<td>Right double quotation mark<\/td>\n<td ALIGN=\"center\">&#x201D;<\/td>\n<td>201D<\/td>\n<td>Right double quotation mark<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x201E;<\/td>\n<td>201E<\/td>\n<td>Double low-9 quotation mark<\/td>\n<td 
ALIGN=\"center\">&#x201E;<\/td>\n<td>201E<\/td>\n<td>Double low-9 quotation mark<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x201F;<\/td>\n<td>201F<\/td>\n<td>Double high-reversed-9 quotation mark<\/td>\n<td ALIGN=\"center\">&#x201F;<\/td>\n<td>201F<\/td>\n<td>Double high-reversed-9 quotation mark<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2033;<\/td>\n<td>2033<\/td>\n<td>Double prime<\/td>\n<td ALIGN=\"center\">&#x2032;&#x2032;<\/td>\n<td>2032 2032<\/td>\n<td>Prime + Prime<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2035;<\/td>\n<td>2035<\/td>\n<td>Reverse prime<\/td>\n<td ALIGN=\"center\">&#x2035;<\/td>\n<td>2035<\/td>\n<td>Reverse prime<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2039;<\/td>\n<td>2039<\/td>\n<td>Single left-pointing angle quotation mark<\/td>\n<td ALIGN=\"center\">&#x2039;<\/td>\n<td>2039<\/td>\n<td>Single left-pointing angle quotation mark<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#xAB;<\/td>\n<td>00AB<\/td>\n<td>Left-pointing double angle quotation mark<\/td>\n<td ALIGN=\"center\">&#xAB;<\/td>\n<td>00AB<\/td>\n<td>Left-pointing double angle quotation mark<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2014;<\/td>\n<td>2014<\/td>\n<td>Em-dash<\/td>\n<td ALIGN=\"center\">&#x2014;<\/td>\n<td>2014<\/td>\n<td>Em-dash<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x203C;<\/td>\n<td>203C<\/td>\n<td>Double exclamation mark<\/td>\n<td ALIGN=\"center\">!!<\/td>\n<td>0021 0021<\/td>\n<td>Exclamation mark + Exclamation mark<\/td>\n<\/tr>\n<\/table>\n<p>\nThere are some interesting quirks here.\nMind you, this is what the Unicode Consortium says,\nso if you think they are wrong,\nyou can take it up with them.\n<\/p>\n<p>\nThe superscript-like characters are converted to their plain\nversions.\nEnclosed alphabetics are also converted,\nbut not the &reg; symbol.\nFullwidth forms of Latin letters\nare converted to their halfwidth equivalents.\nOn the other hand, halfwidth Katakana characters are expanded to their\nfullwidth equivalents.\nBut small Katakana characters do not 
convert to their large equivalents.\n<\/p>\n<p>\nThe &Oslash; disappears completely! What&#8217;s up with that?\nThe character code for &Oslash; is reported as\n<code>C3_ALPHA | C3_NONSPACING | C3_DIACRITIC<\/code>,\nand since we are removing nonspacing characters,\nthis causes it to be removed.\n(Why is &Oslash; nonspacing? It occupies space!)\nFor whatever reason, it does not decompose into\nO + Combining Solidus Overlay.\nOn the other hand, the Polish &#322; remains intact\nbecause it is reported as\n<code>C3_ALPHA | C3_DIACRITIC<\/code>.\nPoland wins and Norway loses?\n<\/p>\n<p>\nThe diacritic removal ignores linguistic rules.\nThe Swedish &Aring; decomposes into a capital A and a combining\nring above,\neven though in Swedish, the character is considered nondecomposable.\n(Just like the capital letter Q in English\ndoes not decompose into an O and a tail.)\nKatakana Go suffers a similar ignoble fate,\n<a HREF=\"http:\/\/www.siao2.com\/2007\/05\/14\/2629747.aspx\">\nconverting to Katakana Ko<\/a>,\nwhich is linguistically nonsensical.\nBut then again, removing diacritics\n<i>is already linguistically nonsensical<\/i>.\nNonsensical operation is nonsensical.\n<\/p>\n<p>\nThere is no attempt to unify look-alike characters from different\nscripts.\nLook-alike characters in the Greek and Cyrillic alphabets\nare not mapped to their Latin doppelg&auml;ngers.\n<\/p>\n<p>\nThe infamous Turkish dotless i does not turn into a dotted i.\n(And the lowercase Latin i does not decompose into a combining dot and a dotless i.)\n<\/p>\n<p>\nFinally, I tried a selection of punctuation marks.\nMost of them pass through unchanged,\nwith the exception of the double prime and double exclamation mark\nwhich each decompose into a pair of singles.\n(But double quotation marks do not decompose into a pair of singles.)\n<\/p>\n<p>\nOkay, but the goal of this exercise was spam detection,\nso we are actually interested in mapping as far as possible\nall the way down to plain 
ASCII.\nWe&#8217;d like to convert, for example,\nthe look-alike characters in the Cyrillic and Greek alphabets\nto the Latin characters they resemble.\n<\/p>\n<p>\nSo let&#8217;s try something else.\nIf we want to convert to ASCII,\nthen just convert to ASCII!\n<\/p>\n<pre>\n#define CP_ASCII 20127\nvoid OnUpdate(HWND hwnd)\n{\n  wchar_t szSource[MAXSOURCE];\n  GetDlgItemText(hwnd, IDC_SOURCE, szSource, ARRAYSIZE(szSource));\n  char szDest[MAXSOURCE * 2];\n  int cchActual = WideCharToMultiByte(CP_ASCII, 0, szSource, -1,\n                              szDest, ARRAYSIZE(szDest), 0, 0);\n  if (cchActual &lt;= 0) szDest[0] = 0;\n  SetDlgItemTextA(hwnd, IDC_DEST, szDest);\n  SetDlgItemCodePoints(hwnd, IDC_SOURCEPOINTS, szSource);\n}\n<\/pre>\n<p>\nWe can extend the table above with a new column.\n<\/p>\n<table BORDER=\"1\" STYLE=\"border: solid 1px black;border-collapse: collapse\" CELLPADDING=\"3\" CELLSPACING=\"0\">\n<tr>\n<th COLSPAN=\"3\">Original character<\/th>\n<th COLSPAN=\"3\">KD character<\/th>\n<th COLSPAN=\"3\">ASCII character<\/th>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&ordf;<\/td>\n<td>00AA<\/td>\n<td>Feminine ordinal indicator<\/td>\n<td ALIGN=\"center\">a<\/td>\n<td>0061<\/td>\n<td>Latin small letter a<\/td>\n<td ALIGN=\"center\">a<\/td>\n<td>0061<\/td>\n<td>Latin small letter a<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&sup1;<\/td>\n<td>00B9<\/td>\n<td>Superscript one<\/td>\n<td ALIGN=\"center\">1<\/td>\n<td>0031<\/td>\n<td>Digit one<\/td>\n<td ALIGN=\"center\">1<\/td>\n<td>0031<\/td>\n<td>Digit one<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&frac12;<\/td>\n<td>00BD<\/td>\n<td>Vulgar fraction one half<\/td>\n<td ALIGN=\"center\">1&#x2044;2<\/td>\n<td>0031 2044 0032<\/td>\n<td>Digit one + Fraction slash + Digit two<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x131;<\/td>\n<td>0131<\/td>\n<td>Latin small letter dotless i<\/td>\n<td ALIGN=\"center\">&#x131;<\/td>\n<td>0131<\/td>\n<td>Latin 
small letter dotless i<\/td>\n<td ALIGN=\"center\">i<\/td>\n<td>0069<\/td>\n<td>Latin small letter i<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&Oslash;<\/td>\n<td>00D8<\/td>\n<td>Latin capital letter O with stroke<\/td>\n<td ALIGN=\"center\"><\/td>\n<td><\/td>\n<td>Disappears!<\/td>\n<td ALIGN=\"center\">O<\/td>\n<td>004F<\/td>\n<td>Latin capital letter O<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#322;<\/td>\n<td>0142<\/td>\n<td>Latin small letter l with stroke<\/td>\n<td ALIGN=\"center\">&#322;<\/td>\n<td>0142<\/td>\n<td>Latin small letter l with stroke<\/td>\n<td ALIGN=\"center\">l<\/td>\n<td>006C<\/td>\n<td>Latin small letter l<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#320;<\/td>\n<td>0140<\/td>\n<td>Latin small letter l with middle dot<\/td>\n<td ALIGN=\"center\">l&middot;<\/td>\n<td>006C 00B7<\/td>\n<td>Latin small letter l + middle dot<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&aelig;<\/td>\n<td>00E6<\/td>\n<td>Latin small letter ae<\/td>\n<td ALIGN=\"center\">&aelig;<\/td>\n<td>00E6<\/td>\n<td>Latin small letter ae<\/td>\n<td ALIGN=\"center\">a<\/td>\n<td>0061<\/td>\n<td>Latin small letter a<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x389;<\/td>\n<td>0389<\/td>\n<td>Greek capital letter Eta with tonos<\/td>\n<td ALIGN=\"center\">&Eta;<\/td>\n<td>0397<\/td>\n<td>Greek capital letter Eta<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x410;<\/td>\n<td>0410<\/td>\n<td>Cyrillic capital letter &#x410;<\/td>\n<td ALIGN=\"center\">&#x410;<\/td>\n<td>0410<\/td>\n<td>Cyrillic capital letter &#x410;<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&Aring;<\/td>\n<td>00C5<\/td>\n<td>Latin capital letter A with ring above<\/td>\n<td ALIGN=\"center\">A<\/td>\n<td>0041<\/td>\n<td>Latin capital letter A<\/td>\n<td ALIGN=\"center\">A<\/td>\n<td>0041<\/td>\n<td>Latin capital letter 
A<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#xff21;<\/td>\n<td>FF21<\/td>\n<td>Fullwidth Latin capital letter A<\/td>\n<td ALIGN=\"center\">A<\/td>\n<td>0041<\/td>\n<td>Latin capital letter A<\/td>\n<td ALIGN=\"center\">A<\/td>\n<td>0041<\/td>\n<td>Latin capital letter A<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2460;<\/td>\n<td>2460<\/td>\n<td>Circled digit one<\/td>\n<td ALIGN=\"center\">1<\/td>\n<td>0031<\/td>\n<td>Digit one<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2780;<\/td>\n<td>2780<\/td>\n<td>Dingbat circled sans-serif digit one<\/td>\n<td ALIGN=\"center\">&#x2780;<\/td>\n<td>2780<\/td>\n<td>Dingbat circled sans-serif digit one<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&reg;<\/td>\n<td>00AE<\/td>\n<td>Registered sign<\/td>\n<td ALIGN=\"center\">&reg;<\/td>\n<td>00AE<\/td>\n<td>Registered sign<\/td>\n<td ALIGN=\"center\">R<\/td>\n<td>0052<\/td>\n<td>Latin capital letter R<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x24c7;<\/td>\n<td>24c7<\/td>\n<td>Circled Latin capital letter R<\/td>\n<td ALIGN=\"center\">R<\/td>\n<td>0052<\/td>\n<td>Latin capital letter R<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x1d595;<\/td>\n<td>D835 DD95<\/td>\n<td>Mathematical bold Fraktur small p<\/td>\n<td ALIGN=\"center\">p<\/td>\n<td>0070<\/td>\n<td>Latin small letter p<\/td>\n<td ALIGN=\"center\">??<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#xff6c;<\/td>\n<td>FF6C<\/td>\n<td>Halfwidth Katakana letter small Ya<\/td>\n<td ALIGN=\"center\">&#x30e3;<\/td>\n<td>30E3<\/td>\n<td>Katakana letter small Ya<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x30e3;<\/td>\n<td>30E3<\/td>\n<td>Katakana letter small Ya<\/td>\n<td ALIGN=\"center\">&#x30e3;<\/td>\n<td>30E3<\/td>\n<td>Katakana 
letter small Ya<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x30b4;<\/td>\n<td>30B4<\/td>\n<td>Katakana letter Go<\/td>\n<td ALIGN=\"center\">&#x30b3;<\/td>\n<td>30B3<\/td>\n<td>Katakana letter Ko<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x201c;<\/td>\n<td>201C<\/td>\n<td>Left double quotation mark<\/td>\n<td ALIGN=\"center\">&#x201c;<\/td>\n<td>201C<\/td>\n<td>Left double quotation mark<\/td>\n<td ALIGN=\"center\">&quot;<\/td>\n<td>0022<\/td>\n<td>Quotation mark<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x201d;<\/td>\n<td>201D<\/td>\n<td>Right double quotation mark<\/td>\n<td ALIGN=\"center\">&#x201D;<\/td>\n<td>201D<\/td>\n<td>Right double quotation mark<\/td>\n<td ALIGN=\"center\">&quot;<\/td>\n<td>0022<\/td>\n<td>Quotation mark<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x201E;<\/td>\n<td>201E<\/td>\n<td>Double low-9 quotation mark<\/td>\n<td ALIGN=\"center\">&#x201E;<\/td>\n<td>201E<\/td>\n<td>Double low-9 quotation mark<\/td>\n<td ALIGN=\"center\">&quot;<\/td>\n<td>0022<\/td>\n<td>Quotation mark<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x201F;<\/td>\n<td>201F<\/td>\n<td>Double high-reversed-9 quotation mark<\/td>\n<td ALIGN=\"center\">&#x201F;<\/td>\n<td>201F<\/td>\n<td>Double high-reversed-9 quotation mark<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2033;<\/td>\n<td>2033<\/td>\n<td>Double prime<\/td>\n<td ALIGN=\"center\">&#x2032;&#x2032;<\/td>\n<td>2032 2032<\/td>\n<td>Prime + Prime<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2032;<\/td>\n<td>2032<\/td>\n<td>Prime<\/td>\n<td ALIGN=\"center\">&#x2032;<\/td>\n<td>2032<\/td>\n<td>Prime<\/td>\n<td ALIGN=\"center\">&#39;<\/td>\n<td>0027<\/td>\n<td>Apostrophe<\/td>\n<\/tr>\n<tr>\n<td 
ALIGN=\"center\">&#x2035;<\/td>\n<td>2035<\/td>\n<td>Reverse prime<\/td>\n<td ALIGN=\"center\">&#x2035;<\/td>\n<td>2035<\/td>\n<td>Reverse prime<\/td>\n<td ALIGN=\"center\">`<\/td>\n<td>0060<\/td>\n<td>Grave accent<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2039;<\/td>\n<td>2039<\/td>\n<td>Single left-pointing angle quotation mark<\/td>\n<td ALIGN=\"center\">&#x2039;<\/td>\n<td>2039<\/td>\n<td>Single left-pointing angle quotation mark<\/td>\n<td ALIGN=\"center\">&lt;<\/td>\n<td>003C<\/td>\n<td>Less-than sign<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#xAB;<\/td>\n<td>00AB<\/td>\n<td>Left-pointing double angle quotation mark<\/td>\n<td ALIGN=\"center\">&#xAB;<\/td>\n<td>00AB<\/td>\n<td>Left-pointing double angle quotation mark<\/td>\n<td ALIGN=\"center\">&lt;<\/td>\n<td>003C<\/td>\n<td>Less-than sign<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x2014;<\/td>\n<td>2014<\/td>\n<td>Em-dash<\/td>\n<td ALIGN=\"center\">&#x2014;<\/td>\n<td>2014<\/td>\n<td>Em-dash<\/td>\n<td ALIGN=\"center\">-<\/td>\n<td>002D<\/td>\n<td>Hyphen-minus<\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"center\">&#x203C;<\/td>\n<td>203C<\/td>\n<td>Double exclamation mark<\/td>\n<td ALIGN=\"center\">!!<\/td>\n<td>0021 0021<\/td>\n<td>Exclamation mark + Exclamation mark<\/td>\n<td ALIGN=\"center\">?<\/td>\n<td><\/td>\n<td>No conversion<\/td>\n<\/tr>\n<\/table>\n<p>\nThere are some interesting differences here.\n<\/p>\n<p>\nSome characters fail to convert to ASCII outright.\nThis is not unexpected for the Japanese characters,\nis mildly unexpected for the\nlook-alikes in the Cyrillic and Greek alphabets,\nand is surprising for some characters like double prime,\ndouble exclamation mark,\nenclosed alphanumerics,\nand vulgar fractions\nbecause they had ASCII decompositions in Normalization Form KD,\nbut converting directly into ASCII refused to use them.\n<\/p>\n<p>\nBut the dotless i gets its dot back.\n<\/p>\n<p>\nAnother weird thing you might notice is that the &aelig; converts\nto just the a.\nThis goes 
contrary to the expectations of\nAmerican English,\nbecause words which historically use the &aelig; and &oelig;\nare largely\nrespelled in American English to use just the e.\n(Encyclop&aelig;dia &rarr; encyclopedia,\nf&oelig;tus &rarr; fetus.)\nMysteries abound.\n<\/p>\n<p>\nIf your real goal is to map every character to its nearest ASCII\nlook-alike,\nthen all these code page games are just beating around the bush.\nThe way to go is to use the Unicode Confusables database.\nThere is a\n<a HREF=\"http:\/\/www.unicode.org\/Public\/security\/revision-05\/confusables.txt\">\nhuge data file<\/a>\nand\n<a HREF=\"http:\/\/www.unicode.org\/reports\/tr39\/#Confusable_Detection\">\ninstructions on how to use it<\/a>.\nThere&#8217;s also\n<a HREF=\"http:\/\/unicode.org\/cldr\/utility\/confusables.jsp\">\na nice Web site<\/a>\nthat lets you explore the confusables database interactively.\n<\/p>\n<p>\nOr you could just take the sledgehammer approach:\nIf there are a significant number of characters outside the Latin alphabet\nand punctuation and you are expecting English text,\nthen just reject it as likely spam.\n<\/p>\n<p>\n&#xca0;_&#xca0;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today&#8217;s Little Program strips diacritics from a Unicode string. Why? Hey, I said that Little Programs require little to no motivation. It might come in handy in a spam filter, since it was popular, at least for a time, to put random accent marks on spam subject lines in order to sneak past keyword filters. 
[&hellip;]<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-43553","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Today&#8217;s Little Program strips diacritics from a Unicode string. Why? Hey, I said that Little Programs require little to no motivation. It might come in handy in a spam filter, since it was popular, at least for a time, to put random accent marks on spam subject lines in order to sneak past keyword filters. [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/43553","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=43553"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/43553\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=43553"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=43553"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=43553"}],"curies":[{"name":"w
p","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}