Deducing the consequences of Windows clipboard text formats on UTF-8

Raymond Chen

We’ve been looking at how Windows performs automatic conversion between the three text formats CF_TEXT, CF_OEMTEXT, and CF_UNICODETEXT. The use of UTF-8 as the 8-bit character set is growing in popularity, and Windows gives you a way to specify that your program wants UTF-8 as both its ANSI and OEM 8-bit character sets.

We saw from our many conversion diagrams and charts that the conversion between UTF-16LE and the 8-bit encodings is mediated by the CF_LOCALE clipboard format, and the conversion between ANSI and OEM is mediated by LOCALE_USER_DEFAULT.

The default for the CF_LOCALE comes from the user’s keyboard layout language, and there is no keyboard whose language is “UTF-8”. So if you are putting UTF-8-encoded text on the clipboard, the default isn’t going to help you.

Even worse: There is no locale whose default ANSI or OEM code page is UTF-8. So even if you could create a custom keyboard layout for UTF-8, there is no locale you could assign it to!

Even if there were, it wouldn’t help because you would set your UTF-8-encoded string as CF_TEXT or perhaps CF_OEMTEXT, and then some other non-UTF-8 program would read the string and interpret it as their ANSI or OEM code page, which will not be UTF-8.
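To make that mismatch concrete, here is a minimal sketch in Python (not part of the original discussion). It encodes a string as UTF-8 and then decodes the same bytes with a different 8-bit code page, roughly what happens when a non-UTF-8 program reads the bytes as CF_TEXT. The function name, the sample string, and the choice of Windows-1252 as the reader's ANSI code page are all assumptions made for illustration.

```python
def misread_utf8_as_ansi(text: str, ansi_codec: str = "cp1252") -> str:
    """Hypothetical helper: show what a reader sees when UTF-8 bytes
    are interpreted using some other 8-bit code page."""
    # What the UTF-8 program would place on the clipboard as CF_TEXT.
    utf8_bytes = text.encode("utf-8")
    # What a program whose ANSI code page is `ansi_codec` would make of it.
    # Python's cp1252 codec leaves a few byte values undefined, so
    # errors="replace" keeps the demonstration from raising on those.
    return utf8_bytes.decode(ansi_codec, errors="replace")


sample = "café"                          # illustrative input
garbled = misread_utf8_as_ansi(sample)   # the reader's view of the same bytes
```

For the sample "café", the reader ends up with "cafÃ©": the two UTF-8 bytes that encode "é" are each treated as a separate Windows-1252 character. Pure ASCII text survives unchanged, which is why this kind of bug can go unnoticed for a long time.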

Originally, the ANSI and OEM code pages were system-wide decisions. Multilingual support shifted them to being per-user. This wasn’t really a problem for the clipboard because different users can’t read each other’s clipboards. But now we have per-process ANSI and OEM code pages thanks to the activeCodePage manifest settings, and that means that any text on the clipboard that uses ANSI or OEM text formats is at risk of creating mismatches because different processes may disagree on what “ANSI” means.

The upshot for UTF-8-based programs is that you should just put your text on the clipboard in CF_UNICODETEXT format. It’s the only format that makes sense. Sorry it’s not the format you would prefer.
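As a rough sketch of what that looks like in practice, the Python code below converts UTF-8 bytes to the null-terminated UTF-16LE buffer that CF_UNICODETEXT expects and hands it to the clipboard through ctypes. The helper names `utf8_to_clipboard_payload` and `set_clipboard_utf8` are hypothetical, the use of Python and ctypes is a choice made here rather than anything the article prescribes, and the caller is assumed to supply a window handle it owns. The Win32 functions themselves (OpenClipboard, EmptyClipboard, GlobalAlloc, GlobalLock, GlobalUnlock, SetClipboardData, CloseClipboard) are the standard ones.

```python
import ctypes
import sys

CF_UNICODETEXT = 13      # standard clipboard format identifier
GMEM_MOVEABLE = 0x0002   # SetClipboardData requires movable memory


def utf8_to_clipboard_payload(utf8_bytes: bytes) -> bytes:
    """Hypothetical helper: UTF-8 in, null-terminated UTF-16LE out."""
    text = utf8_bytes.decode("utf-8")   # raises on invalid UTF-8
    return text.encode("utf-16-le") + b"\x00\x00"


def set_clipboard_utf8(utf8_bytes: bytes, hwnd: int) -> None:
    """Hypothetical helper (Windows only): publish UTF-8 text as
    CF_UNICODETEXT. `hwnd` is a window owned by the calling program."""
    if sys.platform != "win32":
        raise OSError("the clipboard calls below exist only on Windows")

    from ctypes import wintypes
    user32 = ctypes.WinDLL("user32", use_last_error=True)
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

    # Declare pointer-sized types so handles are not truncated on 64-bit.
    kernel32.GlobalAlloc.argtypes = [wintypes.UINT, ctypes.c_size_t]
    kernel32.GlobalAlloc.restype = wintypes.HGLOBAL
    kernel32.GlobalLock.argtypes = [wintypes.HGLOBAL]
    kernel32.GlobalLock.restype = wintypes.LPVOID
    kernel32.GlobalUnlock.argtypes = [wintypes.HGLOBAL]
    kernel32.GlobalFree.argtypes = [wintypes.HGLOBAL]
    user32.OpenClipboard.argtypes = [wintypes.HWND]
    user32.SetClipboardData.argtypes = [wintypes.UINT, wintypes.HANDLE]
    user32.SetClipboardData.restype = wintypes.HANDLE

    payload = utf8_to_clipboard_payload(utf8_bytes)
    if not user32.OpenClipboard(hwnd):
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        user32.EmptyClipboard()
        handle = kernel32.GlobalAlloc(GMEM_MOVEABLE, len(payload))
        if not handle:
            raise ctypes.WinError(ctypes.get_last_error())
        ptr = kernel32.GlobalLock(handle)
        if not ptr:
            kernel32.GlobalFree(handle)
            raise ctypes.WinError(ctypes.get_last_error())
        ctypes.memmove(ptr, payload, len(payload))
        kernel32.GlobalUnlock(handle)
        if not user32.SetClipboardData(CF_UNICODETEXT, handle):
            kernel32.GlobalFree(handle)   # ownership was not transferred
            raise ctypes.WinError(ctypes.get_last_error())
        # On success the system owns the memory; do not free it here.
    finally:
        user32.CloseClipboard()
```

The conversion is kept separate from the clipboard calls so that it can be checked on any platform. Because only CF_UNICODETEXT is set, a reader that asks for CF_TEXT or CF_OEMTEXT gets the system's synthesized conversion instead of raw UTF-8 bytes.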

Bonus chatter: There are programs that go against advice of counsel and put binary data on the clipboard in a “text” format and expect it to be read back unmodified, so we can’t do any conversions when data is placed in an 8-bit format and read back in the same format.


3 comments


LB December 19, 2025

Thanks for this, it’s the outcome I expected but it’s good to have it clarified by you.

Joshua Hudson December 17, 2025

As far as I can tell, there has never been a 32-bit Windows where one session means one OEM code page. Function SetCP is very old. Seems like this is a 30-year-old bug.

When saving into clipboard in OEMCP, save code page with it. When retrieving, if code page doesn’t match, convert. In most cases you have to go through a Unicode encoding to convert. Nobody wants n squared conversion tables.

Antonio Rodríguez December 17, 2025 · Edited

I use my own code editor (yes, I know, but I started developing it before Notepad++ even existed), and have found that copying and pasting text from a popular microcontroller IDE of Italian origin produces UTF-8-flavored mojibake (i.e., UTF-8 text is placed on the clipboard, declared as Windows-1252). I had a bug open for that case, thinking that maybe it was legal and I should detect it on my side. It sounded strange, but hey, I wasn’t completely sure how this new thing of UTF-8-as-a-codepage worked, and maybe I was missing something.
Now that I know The Truth™, I’ll give the user the option to autodetect UTF-8 in the clipboard and convert it. I know it isn’t the “right” way to solve that, but at least it will work (and as a local fix, it won’t affect anybody else). Case closed.
