December 17th, 2025

Deducing the consequences of Windows clipboard text formats on UTF-8

We’ve been looking at how Windows performs automatic conversion between the three text formats CF_TEXT, CF_OEMTEXT, and CF_UNICODETEXT. The use of UTF-8 as the 8-bit character set is growing in popularity, and Windows gives you a way to specify that your program wants UTF-8 as both its ANSI and OEM 8-bit character sets.

We saw from our many conversion diagrams and charts that the conversion between UTF-16LE and the 8-bit encodings is mediated by the CF_LOCALE clipboard format, and the conversion between ANSI and OEM is mediated by LOCALE_USER_DEFAULT.

The default for CF_LOCALE comes from the user’s keyboard layout language, and there is no keyboard whose language is “UTF-8”. So if you are putting UTF-8-encoded text on the clipboard, the default isn’t going to help you.

Even worse: There is no locale whose default ANSI or OEM code page is UTF-8. So even if you could create a custom keyboard layout for UTF-8, there is no locale you could assign it to!

Even if there were, it wouldn’t help because you would set your UTF-8-encoded string as CF_TEXT or perhaps CF_OEMTEXT, and then some other non-UTF-8 program would read the string and interpret it in its own ANSI or OEM code page, which will not be UTF-8.
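To make the mismatch concrete, here is a small sketch that simulates it without touching the real clipboard. The sample string and the choice of Windows-1252 and code page 437 for the reader are assumptions for illustration; the actual code pages depend on the reading process.

```python
# Simulation only: nothing here touches the real clipboard.
# The sample text and both reader code pages are illustrative assumptions.
text = "café"

# A UTF-8 program hands these bytes to the clipboard as CF_TEXT.
clipboard_bytes = text.encode("utf-8")

# A reader whose ANSI code page is Windows-1252 decodes the same bytes.
as_ansi = clipboard_bytes.decode("cp1252")

# A reader that treats them as OEM code page 437 gets yet another result.
as_oem = clipboard_bytes.decode("cp437")

print(as_ansi)  # cafÃ©
print(as_oem)
```

Neither reader recovers the original string, and the two readers don’t even agree with each other.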

Originally, the ANSI and OEM code pages were system-wide decisions. Multilingual support shifted them to being per-user. This wasn’t really a problem for the clipboard because different users can’t read each other’s clipboards. But now we have per-process ANSI and OEM code pages thanks to the activeCodePage manifest setting, and that means that any text on the clipboard that uses the ANSI or OEM text formats is at risk of creating mismatches because different processes may disagree on what “ANSI” means.

The upshot for UTF-8-based programs is that you should just put your text on the clipboard in CF_UNICODETEXT format. It’s the only format that makes sense. Sorry it’s not the format you would prefer.
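For a UTF-8-based program, the extra work is a conversion to null-terminated UTF-16LE before the data goes on the clipboard. The sketch below builds only the byte payload. The helper name cf_unicodetext_payload is hypothetical, and the Win32 steps that would consume the payload (copying it into memory from GlobalAlloc and passing the handle to SetClipboardData with CF_UNICODETEXT) are deliberately left out so that the example stays portable.

```python
def cf_unicodetext_payload(utf8_bytes: bytes) -> bytes:
    """Build the bytes a CF_UNICODETEXT buffer would hold (hypothetical helper)."""
    # Decode strictly: invalid UTF-8 raises UnicodeDecodeError rather than
    # silently putting damaged text on the clipboard.
    text = utf8_bytes.decode("utf-8")
    # CF_UNICODETEXT is UTF-16LE, ended by a 16-bit null character.
    return text.encode("utf-16-le") + b"\x00\x00"


# Usage: six UTF-16 code units plus the terminator.
payload = cf_unicodetext_payload("café ☕".encode("utf-8"))
print(len(payload))  # 14
```

A reader that asks for CF_UNICODETEXT gets this data back without any code page entering the picture.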

Bonus chatter: There are programs that go against advice of counsel and put binary data on the clipboard in a “text” format and expect it to be read back unmodified, so we can’t do any conversions when data is placed in an 8-bit format and read back in the same format.


Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

2 comments

  • Joshua Hudson 10 seconds ago

    As far as I can tell, there has never been a 32-bit Windows where one session is one OEM code page. Function SetCP is very old. Seems like this is a 30-year-old bug.

    When saving into clipboard in OEMCP, save code page with it. When retrieving, if code page doesn’t match, convert. In most cases you have to go through a Unicode encoding to convert. Nobody wants n squared conversion tables.

  • Antonio Rodríguez 33 minutes ago · Edited

    I use my own code editor (yes, I know, but I started developing it before Notepad++ even existed), and have found that copying and pasting text from a popular microcontroller IDE of Italian origin produces UTF-8-flavored mojibake (i.e., UTF-8 text is placed in the clipboard, declared as Windows-1252). I had a bug open for that case, thinking that maybe it was legal and I should detect it on my side. It sounded strange, but hey, I wasn't completely sure about how this new thing of UTF-8-as-a-codepage worked, and maybe I was missing something.
    Now that I know The Truth™, I'll...
