December 15th, 2025

The Windows clipboard automatic text conversion algorithm is path-dependent

We closed last time with this table:

To get          First try       Then try                          And then try
CF_TEXT         CF_TEXT         CF_UNICODETEXT + WC2MB(ANSI CP)   CF_OEMTEXT + OemToAnsi
CF_OEMTEXT      CF_OEMTEXT      CF_UNICODETEXT + WC2MB(OEM CP)    CF_TEXT + AnsiToOem
CF_UNICODETEXT  CF_UNICODETEXT  CF_TEXT + MB2WC(ANSI CP)          CF_OEMTEXT + MB2WC(OEM CP)
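
One way to read the table is as an ordered lookup: for the format you want, walk the row from left to right and take the first source that exists. Here is a minimal Python sketch of that reading. It is a model of the table, not the actual Windows implementation; the names FALLBACKS and pick_source are made up for illustration, and "available" is assumed to mean the formats for which data is present at the moment of the lookup.

```python
# Conversion priorities, transcribed from the table above.
# Requested format -> ordered list of (source format, conversion applied).
FALLBACKS = {
    "CF_TEXT": [
        ("CF_TEXT", None),
        ("CF_UNICODETEXT", "WC2MB(ANSI CP)"),
        ("CF_OEMTEXT", "OemToAnsi"),
    ],
    "CF_OEMTEXT": [
        ("CF_OEMTEXT", None),
        ("CF_UNICODETEXT", "WC2MB(OEM CP)"),
        ("CF_TEXT", "AnsiToOem"),
    ],
    "CF_UNICODETEXT": [
        ("CF_UNICODETEXT", None),
        ("CF_TEXT", "MB2WC(ANSI CP)"),
        ("CF_OEMTEXT", "MB2WC(OEM CP)"),
    ],
}


def pick_source(requested, available):
    """Return (source, conversion) for the highest-priority source format
    found in `available`, or None if no text format is present."""
    for source, conversion in FALLBACKS[requested]:
        if source in available:
            return source, conversion
    return None
```

For example, pick_source("CF_OEMTEXT", {"CF_TEXT"}) selects the direct AnsiToOem conversion, but if CF_UNICODETEXT is also in the set, the Unicode source wins because it sits earlier in the row.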

I noted that there is something odd, possibly even disturbing, about this table.

Let’s redraw the table as a diagram.

                  CF_TEXT
   (CF_LOCALE) ↗↙    ↑
CF_UNICODETEXT       |
          ↖↘         | (LOCALE_USER_DEFAULT)
             ↖↘      |
  (CF_LOCALE)   ↖↘   ↓
                  CF_OEMTEXT

Each of the three boxes represents a clipboard format: CF_UNICODE­TEXT, CF_TEXT, or CF_OEM­TEXT.

The lengths of the arrows connecting the boxes represent the priorities: Shorter arrows are preferred over longer arrows. The shortest arrow is the one connecting CF_UNICODE­TEXT to CF_TEXT. In the middle is the arrow connecting CF_UNICODE­TEXT to CF_OEM­TEXT. And the longest arrow is the one connecting CF_TEXT to CF_OEM­TEXT.

Finally, the label on each arrow represents the code page that is used for the conversion. The conversions to and from CF_UNICODE­TEXT use the CF_LOCALE clipboard format to tell them what locale to use, whereas the conversion between CF_TEXT and CF_OEM­TEXT uses LOCALE_USER_DEFAULT.

What’s interesting is that if you want to get from one box to another, say from CF_TEXT to CF_OEM­TEXT, you have two options. You can either use the direct line from CF_TEXT to CF_OEM­TEXT, or you can take the scenic route from CF_TEXT to CF_UNICODE­TEXT to CF_OEM­TEXT. And the two options produce different results! (In category theory, you would say that the diagram is not commutative.)

If you take the direct route from CF_TEXT to CF_OEMTEXT, then the conversion uses LOCALE_USER_DEFAULT, but if you take the scenic route, then the conversion to CF_UNICODETEXT uses the locale specified by CF_LOCALE, as does the conversion from CF_UNICODETEXT to CF_OEMTEXT. If the locale specified by CF_LOCALE is different from LOCALE_USER_DEFAULT, then you could very well get different results!

In my test program, I wrote the string "\xD0" to the clipboard as ANSI, and when I read it back as OEM, I expected to receive "\x44" because my system is running with US-English, and the character D0 in code page 1252 is Ð (U+00D0), whose best fit in code page 437 is D (U+0044).

What I actually received was "\x90". I had set the CF_LOCALE clipboard format to 0x0419, which is the locale ID for ru-ru. Receiving character 90 would make sense if the ANSI and OEM code pages were taken from the ru-ru locale: Character D0 in the ru-ru ANSI code page 1251 is Р (U+0420), and that maps neatly to character 90 in the ru-ru OEM code page 866, which is also Р (U+0420).
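
The two candidate results can be reproduced away from the clipboard with plain code page conversions. The following is a rough Python sketch, not what Windows runs internally. It assumes the user default locale is en-US (ANSI 1252, OEM 437) and that CF_LOCALE is ru-ru (ANSI 1251, OEM 866). Python's codecs do not do Windows-style best-fit mapping, so the one-entry BEST_FIT_437 table is a hypothetical stand-in that covers only the character discussed here; direct_route, scenic_route, and to_oem are likewise illustrative names.

```python
# Hypothetical stand-in for Windows best-fit mapping into code page 437.
# It covers only the single character discussed in the text (Ð -> D).
BEST_FIT_437 = {"\u00D0": "D"}


def to_oem(text, oem_codepage, best_fit=None):
    """Encode text to an OEM code page, consulting an optional best-fit
    table for characters the code page cannot represent ("?" otherwise)."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(oem_codepage)
        except UnicodeEncodeError:
            out += (best_fit or {}).get(ch, "?").encode(oem_codepage)
    return bytes(out)


def direct_route(ansi_bytes):
    # CF_TEXT -> CF_OEMTEXT using the user default locale
    # (assumed en-US: ANSI code page 1252, OEM code page 437).
    return to_oem(ansi_bytes.decode("cp1252"), "cp437", BEST_FIT_437)


def scenic_route(ansi_bytes):
    # CF_TEXT -> CF_UNICODETEXT -> CF_OEMTEXT using the CF_LOCALE locale
    # (assumed ru-ru: ANSI code page 1251, OEM code page 866).
    unicode_text = ansi_bytes.decode("cp1251")
    return to_oem(unicode_text, "cp866")


print(direct_route(b"\xD0").hex())  # 44
print(scenic_route(b"\xD0").hex())  # 90
```

Same input byte, two different answers, depending only on which path through the diagram you take.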

So it seems that Windows is taking the scenic route, and rather than using Ansi­To­Oem, it’s going through CF_UNICODE­TEXT. Is the table wrong?

No, the table is correct.

We’ll study the problem some more next time.


Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

1 comment

  • Tudor Zagreanu

    My guess is Clipboard History is reading the clipboard in between and asking for CF_UNICODETEXT, causing Windows to take the scenic route.