Say you want to do a case-insensitive substring search in a locale-aware manner. For example, maybe you have a list of names, and you want the user to be able to search for them by typing any name fragment. A search for “ber” could find “Bert” as well as “Roberta”.
I’ve seen people solve this problem by converting the string to lowercase, and then doing a code unit-based substring search. This technique doesn’t work for multiple reasons.
One reason is that some languages (like English) do not consider diacritics significant in collation. The words naive and naïve are considered equivalent for searching purposes. But a code unit substring search considers them different.
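To make the failure concrete, here is a minimal Python sketch of the lowercase-then-search technique. The sample strings and variable names are made up for illustration.

```python
# Lowercase both strings, then do a plain substring search.
haystack = "Na\u00efve Bayes"   # "Naïve Bayes", with ï as U+00EF
needle = "naive"

# The search compares code points, so "ï" (U+00EF) never equals "i" (U+0069).
found = needle.lower() in haystack.lower()
print(found)
```

An English-speaking user typing "naive" would expect a hit here, but the search reports none.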
For languages in which diacritics are significant, you have the problem of composed and decomposed characters. For example, the lowercase a with ring in the Swedish word någon could be represented either as
- Two code points: U+0061 (LATIN SMALL LETTER A) followed by U+030A (COMBINING RING ABOVE), or
- A single code point: U+00E5 (LATIN SMALL LETTER A WITH RING ABOVE)
The number of possibilities increases if you have characters with multiple diacritics. And then you also have ligatures, where the ﬁ (“fi”) ligature, U+FB01, is equivalent to two separate characters f and i.
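The composed/decomposed and ligature cases can be seen with Python's standard unicodedata module. This is only an illustration of why code point comparison goes wrong. Normalization by itself is not locale-aware, so it is not a substitute for a linguistic search.

```python
import unicodedata

composed = "n\u00e5gon"       # å as the single code point U+00E5
decomposed = "na\u030agon"    # a (U+0061) followed by U+030A COMBINING RING ABOVE

# Both render as "någon", but they are different code point sequences.
same_code_points = composed == decomposed

# Normalizing both sides to the same form makes them compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

# The ﬁ ligature (U+FB01) is one code point, so a search for "fi" misses it
# unless a compatibility normalization (NFKC or NFKD) expands it first.
ligature_word = "\ufb01nd"
plain_hit = "fi" in ligature_word
nfkc_hit = "fi" in unicodedata.normalize("NFKC", ligature_word)

print(same_code_points, nfc_equal, nfd_equal, plain_hit, nfkc_hit)
```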
So what’s the right thing to do?
In Windows, you can use the FindNLSStringEx function to do a locale-aware substring search. Use the LINGUISTIC_ flag to say that you want to honor diacritics only when they are significant to the locale.¹ (A better name would have been LINGUISTIC_.)
On other platforms, and even on Windows,² you can use the ICU library’s string search service and search with primary weight. (Primary weight honors diacritics which are significant to the locale.)
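To show what ignoring case and diacritics looks like in practice, here is a rough approximation using only Python's standard library. It is not what FindNLSStringEx or ICU does: the helper names `fold` and `rough_contains` are made up for this sketch, and it strips every combining mark regardless of locale, so it wrongly treats å and a as the same letter even for Swedish. That gap is exactly what the locale-aware APIs close.

```python
import unicodedata

def fold(text):
    """Locale-unaware folding: expand ligatures, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def rough_contains(haystack, needle):
    """Report whether needle occurs in haystack after folding both strings."""
    return fold(needle) in fold(haystack)

names = ["Bert", "Roberta", "Alice"]
matches = [name for name in names if rough_contains(name, "ber")]
print(matches)
```

Note that this reports only whether a match exists. Mapping a match position back into the original string takes more bookkeeping, which is another reason to prefer the platform or ICU search services.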
Bonus reading: A popular but wrong way to convert a string to uppercase or lowercase. What has case distinction but is neither uppercase nor lowercase?
¹ Throw in one of the IGNORECASE flags if you want a case-insensitive substring search.
² The Windows globalization team now recommends that people use ICU, which has been part of Windows since Windows 10 version 1703 (build 15063). More details and gotchas here.
Or just use .NET which has standardized methods for any sort of substring matching. No need to burden the application with an external library when the environment can handle it fine.
Surprise! .NET’s culture-sensitive IndexOf calls… FindNLSStringEx!
The Windows filesystem seems to treat multiple representations of the same character (composed vs. decomposed, etc.) as different, which can cause a false impression of “having multiple files under the same name” since they tend to render the same in File Explorer.
It can be argued either way, but macOS seems to normalize strings before hitting the disk. On the other end we also have *nix, which basically treats filesystem paths as opaque byte arrays.
All three behaviors are defensible, and they all have issues as well:
macOS reduces confusion when users enter the same text in multiple different ways (for example, by copying and pasting text from the internet), but there are four different kinds of Unicode normalizations and you would ideally pick one of them for the whole system (Google is giving me conflicting answers as to what macOS actually does, so I'm speaking hypothetically). NFC and NFD are...
So, given a ‘platform-independent’ system which internally passes strings as (possibly-invalid) UTF-16 & accepts user input that is potentially in other encodings, including non-Unicode encodings, how would you deal w/ filenames that needed to be passed in those strings?
Right now I just choose not to think about it… too much. We just use the data directly w/ CreateFileW on windows & convert to the ICU-detected ‘platform encoding’ on other platforms & use fopen
macOS likes NFD, and at the same time, a lot of macOS software (not necessarily first-party, but prominent applications like Microsoft Word’s spellchecker) gets confused by that.
One small correction: Windows (NTFS) paths do not need to be valid UTF-16. Any sequence of 16-bit words (except reserved characters) will work. Unassigned code units, mismatched surrogate pair code units, anything. You can put a 510-character long UTF-8 string in a filename, and the OS will be totally fine with it.