{"id":110447,"date":"2024-11-01T07:00:00","date_gmt":"2024-11-01T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=110447"},"modified":"2024-11-01T09:57:55","modified_gmt":"2024-11-01T16:57:55","slug":"20241101-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20241101-00\/?p=110447","title":{"rendered":"On locale-aware substring matching, either case-sensitive or case-insensitive"},"content":{"rendered":"<p>Say you want to do a case-insensitive substring search in a locale-aware manner. For example, maybe you have a list of names, and you want the user to be able to search for them by typing any name fragment. A search for &#8220;ber&#8221; could find &#8220;Bert&#8221; as well as &#8220;Roberta&#8221;.<\/p>\n<p>I&#8217;ve seen people solve this problem by converting the string to lowercase, and then doing a code unit-based substring search. This technique doesn&#8217;t work for multiple reasons.<\/p>\n<p>One reason is that some languages (like English) do not consider diacritics significant in collation. The words <i>naive<\/i> and <i>na\u00efve<\/i> are considered equivalent for searching purposes. But a code unit substring search considers them different.<\/p>\n<p>For languages in which diacritics are significant, you have the problem of composed and decomposed characters. For example, the lowercase a with ring in the Swedish word <i>n\u00e5gon<\/i> could be represented either as<\/p>\n<ul>\n<li>Two code points: U+0061 (LATIN SMALL LETTER A) followed by U+030A (COMBINING RING ABOVE), or<\/li>\n<li>A single code point: U+00E5 (LATIN SMALL LETTER A WITH RING ABOVE)<\/li>\n<\/ul>\n<p>The number of possibilities increases if you have characters with multiple diacritics. 
And then you also have ligatures, where the <i>\ufb01<\/i> &#8220;fi&#8221; ligature is equivalent to two separate characters <i>f<\/i> and <i>i<\/i>.<\/p>\n<p>So what&#8217;s the right thing to do?<\/p>\n<p>In Windows, you can use the <code>Find\u00adNLS\u00adString\u00adEx<\/code> function to do a locale-aware substring search. Use the <code>LINGUISTIC_<wbr \/>IGNORE\u00adDIACRITIC<\/code> flag to say that you want to honor diacritics only when they are significant to the locale.\u00b9 (A better name would have been <code>LINGUISTIC_<wbr \/>IGNORE\u00ad<span style=\"border: solid 1px currentcolor;\">INSIGNIFICANT<\/span>\u00adDIACRITICS<\/code>.)<\/p>\n<p>On other platforms, and even on Windows,\u00b2 you can use the ICU library&#8217;s <a href=\"https:\/\/unicode-org.github.io\/icu\/userguide\/collation\/string-search\"> string search service<\/a> and search with primary weight. (Primary weight honors diacritics which are significant to the locale.)<\/p>\n<p><b>Bonus reading<\/b>: <a title=\"A popular but wrong way to convert a string to uppercase or lowercase\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20241007-00\/?p=110345\"> A popular but wrong way to convert a string to uppercase or lowercase<\/a>. <a title=\"What has case distinction but is neither uppercase nor lowercase?\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20241031-00\/?p=110443\"> What has case distinction but is neither uppercase nor lowercase<\/a>?<\/p>\n<p>\u00b9 Throw in one of the <code>IGNORE\u00adCASE<\/code> flags if you want a case-insensitive substring search.<\/p>\n<p>\u00b2 The Windows globalization team now recommends that people use ICU, which has been part of Windows since Windows 10 version 1703 (build 15063). 
<a title=\"How can I convert between IANA time zones and Windows registry-based time zones?\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210527-00\/?p=105255\"> More details and gotchas here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It&#8217;s surprisingly complicated, but fortunately, somebody has done it for you.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-110447","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>It&#8217;s surprisingly complicated, but fortunately, somebody has done it for you.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110447","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=110447"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110447\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=110447"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=110447"},{"taxonomy":"post_tag","embeddable":true,"href":"ht
tps:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=110447"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}