Can INI files be Unicode? Yes, they can, but it has to be your idea

Raymond Chen

INI files were introduced by 16-bit Windows, and 16-bit Windows predates Unicode, so INI files naturally did not support Unicode at the time they were introduced. The relatively simple format of INI files means that many people parse (and sometimes even modify) them directly, without using the INI file manager. This in turn means that the format of INI files is pretty much locked and cannot be extended, since there is no mechanism for extending them in a way that won’t break those manual INI file parsers.

This “locked in” nature of the INI file format means that even if you call the Unicode version, Write­Private­Profile­StringW, the resulting INI file will not be Unicode. It will be a best approximation of your Unicode data in the ambient ANSI code page. The system doesn’t know whether the INI file is going to be processed by somebody’s homemade INI file parser, and writing it out in Unicode would break that parser.

You might think, “Aw, c’mon. If you use the Unicode Write­Private­Profile­StringW function, then clearly the resulting INI file can be Unicode. After all, this is a new function, so there’s no need to preserve legacy behavior.” However, Michael Kaplan noted that this would mean that converting a program from ANSI to Unicode (which was a frequent occurrence back in the day) would invisibly modify file formats, and your program may not be ready for that.

But that doesn’t mean that INI files could never support Unicode.

Because if the INI file was already Unicode, then there would be no harm in keeping it in Unicode. The decision to create a Unicode INI file came from somewhere else, and we’re just following somebody else’s decision.

So the rule is that the INI file functions will preserve Unicode-ness, but will never take it upon themselves to create a Unicode INI file. In particular, if you use a Write function to create an INI file, that INI file will be created as ANSI, for backward compatibility.
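In other words, if you want a Unicode INI file, it has to be your idea: create the file as Unicode yourself before the profile functions ever touch it. As a minimal sketch (not Raymond's code), the portable part is pre-seeding the file with a UTF-16 LE byte order mark; the actual WritePrivateProfileStringW call is real Win32 API, here wrapped in a guard so it only runs on Windows:

```python
# Sketch: opt an INI file into Unicode by creating it as UTF-16 LE first.
# The profile APIs never *create* Unicode INI files, but they preserve
# the Unicode-ness of a file that is already Unicode.
import sys

def seed_unicode_ini(path):
    """Create an empty INI file whose encoding is UTF-16 LE (BOM only)."""
    with open(path, "wb") as f:
        f.write(b"\xff\xfe")  # UTF-16 LE byte order mark

def write_profile_string(section, key, value, path):
    """Call WritePrivateProfileStringW on Windows; no-op elsewhere."""
    if sys.platform != "win32":
        return False
    import ctypes
    return bool(ctypes.windll.kernel32.WritePrivateProfileStringW(
        section, key, value, path))

seed_unicode_ini("settings.ini")
# Subsequent writes through the API now stay Unicode, because the
# decision to make the file Unicode was your idea, not the API's.
write_profile_string("general", "name", "\u03c0", "settings.ini")
```

Had the first step been skipped, the Write call would have created the file as ANSI, and “π” would have been approximated in the ambient code page.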

This behavior is called out in the documentation:

If the file was created using Unicode characters, the function writes Unicode characters to the file. Otherwise, the function writes ANSI characters.

Michael says, “I have almost no idea what this text is trying to say, but I am 100% sure it is wrong.”

What it’s trying to say is what Michael inferred: The function writes Unicode characters to the file if the file is already Unicode.

I think what confused Michael was the phrase “If the file was created”. This is not referring to the creation of the file by the Write­Private­Profile­StringW function itself, but rather to whether the file had already been created as a Unicode file before Write­Private­Profile­StringW was called.

Arguably, the text could be made a little clearer:

If the file already exists and consists of Unicode characters, the function writes Unicode characters to the file. Otherwise, the function writes ANSI characters.



  • IS4 0

    So if I create an “empty” file consisting of just the UTF-8 BOM, is that enough to make all INI operations on the file use Unicode and preserve the BOM?

  • Rutger Ellen 2

RIP Michael Kaplan. I read most of his blog posts back in the day; nice to see him being remembered.

    • Ian Boyd 1

RIP. His blog was, like this one is, an excellent source of knowledge.

      Once you understand the why, the what makes sense. And Michael was great for that. I read it in real time, and reference it all the time still.

  • Marco Comerci 0

What do you think about UTF8? I developed a programmable VST plugin and supported UTF8 for configuration and instrument files, which can store some strings (the language tokens are ANSI characters only). I developed UTF8 helpers like SendMessageU8, which, unsurprisingly, converts to UTF16 and calls SendMessageW.

  • Paul Jackson 1

    How does the API determine that the file “consists of Unicode characters”? I assume the answer is BOM, but you didn’t mention it.

    • Bill Godfrey 0

      (If I may speculate…)

An INI file will always have to start with an ASCII character, either ‘;’ marking a comment or ‘[’ introducing a section. (I’d be interested to learn if that assertion I made is actually correct.)

      Read the first two bytes and look for NUL bytes. If neither is NUL, then it’s ASCII-only. If there’s one, it’s UTF-16 and you know the byte order.

      (Edited. The original stated that ‘#’ marked a comment.)

      • Paul Jackson 0

        I see that Raymond answered it in the linked page.

        > the code determines this is through our favorite dodgy API – IsTextUnicode. (The BOM serves as a big hint.)
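For illustration only, Bill’s two-byte speculation above can be sketched in Python (a hypothetical `sniff_first_two_bytes` helper, not the actual Windows code; note that a file consisting of just a UTF-16 BOM, FF FE, contains no NUL byte, which is one reason the real IsTextUnicode check treats the BOM as a hint):

```python
def sniff_first_two_bytes(first_two: bytes) -> str:
    """Classify an INI file by its first two bytes, per the speculation:
    an INI file starts with an ASCII character (';' or '['), so a NUL
    in either of the first two bytes betrays UTF-16 and its byte order."""
    if len(first_two) < 2 or 0 not in first_two:
        return "ansi"
    if first_two[0] == 0:
        return "utf-16-be"  # NUL first: high byte of an ASCII char
    return "utf-16-le"      # NUL second: little-endian layout

# '[' as ANSI, UTF-16 LE, and UTF-16 BE respectively:
print(sniff_first_two_bytes(b"[s"))      # ansi
print(sniff_first_two_bytes(b"[\x00"))   # utf-16-le
print(sniff_first_two_bytes(b"\x00["))   # utf-16-be
```

Note that the heuristic misclassifies a BOM-only file (`b"\xff\xfe"`) as ANSI, since neither byte is NUL, so a BOM check has to come first in any real detector.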

  • Simon Geard 0

    Java had a similar problem with their “properties” format, which despite being used for localisation since the beginning, was defined as 7-bit ASCII… requiring either a build step to transform sensibly-encoded files into a mess of escape characters, or writing your own frontend to deal with IO and encodings, bypassing the actual ResourceBundle classes.

    Fortunately in later Java versions, they’ve taken the intelligent step of declaring that the encoding is actually UTF8… retaining compatibility, while bringing sanity to the default behaviour.
