We’ve been looking at how Windows performs automatic conversion between the three text formats CF_TEXT, CF_OEMTEXT, and CF_UNICODETEXT. The use of UTF-8 as the 8-bit character set is growing in popularity, and Windows gives you a way to specify that your program wants UTF-8 as both its ANSI and OEM 8-bit character sets.
We saw from our many conversion diagrams and charts that the conversion between UTF-16LE and the 8-bit encodings is mediated by the CF_LOCALE clipboard format, and the conversion between ANSI and OEM is mediated by LOCALE_.
The default for the CF_LOCALE comes from the user’s keyboard layout language, and there is no keyboard whose language is “UTF-8”. So if you are putting UTF-8-encoded text on the clipboard, the default isn’t going to help you.
Even worse: There is no locale whose default ANSI or OEM code page is UTF-8. So even if you could create a custom keyboard layout for UTF-8, there is no locale you could assign it to!
Even if there were, it wouldn’t help because you would set your UTF-8-encoded string as CF_TEXT or perhaps CF_OEMTEXT, and then some other non-UTF-8 program would read the string and interpret it in its own ANSI or OEM code page, which will not be UTF-8.
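To make the mismatch concrete, here is a small sketch in Python that stays away from the clipboard entirely and just reinterprets bytes. The choice of code pages 1252 and 437 for the reading program is an assumption for illustration; the actual ANSI and OEM code pages depend on the reader's configuration.

```python
# A UTF-8 program encodes its text and (hypothetically) places these bytes
# on the clipboard in an 8-bit text format.
original = "caf\u00e9\u2122"          # "café™"
utf8_bytes = original.encode("utf-8")

# A non-UTF-8 reader interprets the same bytes in its own code page.
# 1252 and 437 are assumed here as example ANSI and OEM code pages.
seen_as_ansi = utf8_bytes.decode("cp1252")
seen_as_oem = utf8_bytes.decode("cp437")

print(seen_as_ansi)   # cafÃ©â„¢
print(seen_as_oem)    # a different string again; neither matches the original
```

Nothing was corrupted in transit. The bytes are identical, and the only thing that changed is which code page the reader assumed.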
Originally, the ANSI and OEM code pages were system-wide decisions. Multilingual support shifted them to being per-user. This wasn’t really a problem for the clipboard because different users can’t read each other’s clipboards. But now we have per-process ANSI and OEM code pages thanks to the activeCodePage manifest settings, and that means that any text on the clipboard that uses ANSI or OEM text formats is at risk of creating mismatches because different processes may disagree on what “ANSI” means.
The upshot for UTF-8-based programs is that you should just put your text on the clipboard in CF_UNICODETEXT format. It’s the only format that makes sense. Sorry it’s not the format you would prefer.
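The conversion a UTF-8-based program needs before handing text to the clipboard can be sketched as follows. This only builds and parses the byte payload (null-terminated UTF-16LE, which is what the Unicode text format carries); actually opening the clipboard and calling SetClipboardData is not shown, and the two helper names are made up for this example.

```python
def utf8_to_unicode_text_payload(utf8_bytes: bytes) -> bytes:
    """Convert UTF-8 bytes to a null-terminated UTF-16LE payload.

    Hypothetical helper: raises UnicodeDecodeError if the input is not
    valid UTF-8, so bad data is caught before it reaches the clipboard.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode("utf-16-le") + b"\x00\x00"


def unicode_text_payload_to_utf8(payload: bytes) -> bytes:
    """Convert a UTF-16LE payload back to UTF-8, stopping at the terminator."""
    text = payload.decode("utf-16-le")
    text = text.split("\x00", 1)[0]   # ignore the terminator and anything after it
    return text.encode("utf-8")


payload = utf8_to_unicode_text_payload("h\u00e9llo".encode("utf-8"))
round_trip = unicode_text_payload_to_utf8(payload)
```

Because the payload is Unicode on both ends, no code page is consulted in either direction, which is the whole point of choosing this format.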
Bonus chatter: There are programs that go against advice of counsel and put binary data on the clipboard in a “text” format and expect it to be read back unmodified, so we can’t do any conversions when data is placed in an 8-bit format and read back in the same format.
As far as I can tell, there has never been a 32-bit Windows where one session has only one OEM code page. The SetCP function is very old. This seems like a 30-year-old bug.
When saving into clipboard in OEMCP, save code page with it. When retrieving, if code page doesn’t match, convert. In most cases you have to go through a Unicode encoding to convert. Nobody wants n squared conversion tables.
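The scheme proposed in this comment can be sketched roughly like this. It is an illustration of the commenter's idea, not of anything Windows does: `TaggedText` and `read_tagged` are hypothetical names, and Python codec names stand in for Windows code page numbers.

```python
from dataclasses import dataclass


@dataclass
class TaggedText:
    """8-bit text stored together with the code page it was written in."""
    data: bytes
    code_page: str   # a Python codec name, e.g. "cp437" (stand-in for a code page ID)


def read_tagged(item: TaggedText, reader_code_page: str) -> bytes:
    """Return the text in the reader's code page.

    If the code pages match, the bytes are handed back untouched. Otherwise
    the text is pivoted through Unicode, so only one decode table and one
    encode table per code page are needed rather than one per pair.
    """
    if item.code_page == reader_code_page:
        return item.data
    text = item.data.decode(item.code_page)
    # Characters the reader's code page cannot represent become "?".
    return text.encode(reader_code_page, errors="replace")


stored = TaggedText(b"caf\x82", "cp437")       # "café" in code page 437
for_ansi_reader = read_tagged(stored, "cp1252")
for_utf8_reader = read_tagged(stored, "utf-8")
```

Note that the matching-code-page path returns the bytes unmodified, so data that was never really text still survives a same-format round trip.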
I use my own code editor (yes, I know, but I started developing it before Notepad++ even existed), and have found that copying and pasting text from a popular microcontroller IDE of Italian descent produces UTF-8-flavored mojibake (i.e., UTF-8 text is placed on the clipboard, declared as Windows-1252). I had a bug open for that case, thinking that maybe it was legal and I should detect it on my side. It sounded strange, but hey, I wasn't completely sure how this new thing of UTF-8-as-a-codepage worked, and maybe I was missing something.
Now that I know The Truth™, I'll...