Last time, we encountered a mystery where the synthesis of CF_ from CF_ did not use AnsiToOem. Today we will begin the investigation.
Recall that we have a table showing how Windows synthesizes each of the various text formats from the other two. But in the case where the clipboard has two formats available, and you ask for the third, there are two ways that the third format could be synthesized: It could convert the first, or it could convert the second. How does Windows decide?
The preference table is:
| To get | First try | Then try | And then try |
|---|---|---|---|
| CF_TEXT | CF_TEXT | CF_UNICODETEXT | CF_OEMTEXT |
| CF_OEMTEXT | CF_OEMTEXT | CF_UNICODETEXT | CF_TEXT |
| CF_UNICODETEXT | CF_UNICODETEXT | CF_TEXT | CF_OEMTEXT |
In words, first look for a perfect match. If that’s not available, then try (in order) CF_, then CF_, then CF_. (One of those last three checks is redundant with the perfect match check.)
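As a rough illustration, the preference table can be modeled as a plain ordered lookup. This is only a sketch in Python, not the actual window manager code; the names `PREFERENCE` and `pick_source` are made up for this example, and the clipboard is reduced to a set of format names.

```python
# Preference order copied from the table above.
# PREFERENCE and pick_source are hypothetical names for illustration only.
PREFERENCE = {
    "CF_TEXT": ["CF_TEXT", "CF_UNICODETEXT", "CF_OEMTEXT"],
    "CF_OEMTEXT": ["CF_OEMTEXT", "CF_UNICODETEXT", "CF_TEXT"],
    "CF_UNICODETEXT": ["CF_UNICODETEXT", "CF_TEXT", "CF_OEMTEXT"],
}


def pick_source(requested, available):
    """Return the first format, in preference order, that is on the clipboard.

    `available` is the set of text formats actually present.
    Returns None if no text format is present at all.
    """
    for candidate in PREFERENCE[requested]:
        if candidate in available:
            return candidate
    return None


if __name__ == "__main__":
    # Two formats present, the third requested: the table breaks the tie.
    print(pick_source("CF_UNICODETEXT", {"CF_TEXT", "CF_OEMTEXT"}))
```

In the two-formats-present case from the question above, the tie is broken purely by position in the row for the requested format.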
Combining that with our previous table produces this conversion table with priorities:
| To get | First try | Then try | And then try |
|---|---|---|---|
| CF_TEXT | CF_TEXT | CF_UNICODETEXT + WC2MB(ANSI CP) | CF_OEMTEXT + OemToAnsi |
| CF_OEMTEXT | CF_OEMTEXT | CF_UNICODETEXT + WC2MB(OEM CP) | CF_TEXT + AnsiToOem |
| CF_UNICODETEXT | CF_UNICODETEXT | CF_TEXT + MB2WC(ANSI CP) | CF_OEMTEXT + MB2WC(OEM CP) |
Again, “ANSI CP” means “the code page reported by calling GetLocaleInfo with the LCID in the CF_ clipboard format, and the LOCALE_ locale attribute”. Similarly for “OEM CP”, using LOCALE_ instead of LOCALE_.
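To make the combined table concrete, here is a small Python simulation of it. Everything in it is an assumption for illustration: the code pages are hard-coded to `cp1252` (ANSI) and `cp437` (OEM) instead of being looked up from the clipboard locale, Python codecs stand in for `WC2MB` and `MB2WC`, and the `OemToAnsi` and `AnsiToOem` entries are approximated by a round trip through Unicode, which may not match what the real functions do.

```python
# Assumed code pages; the real ones come from the clipboard locale.
ANSI_CP = "cp1252"
OEM_CP = "cp437"


def wc2mb(text, code_page):
    """Stand-in for WC2MB: Unicode string to bytes in the given code page."""
    return text.encode(code_page, errors="replace")


def mb2wc(data, code_page):
    """Stand-in for MB2WC: bytes in the given code page to a Unicode string."""
    return data.decode(code_page, errors="replace")


# Requested format -> ordered list of (source format, conversion).
# The OemToAnsi/AnsiToOem rows are approximations, not the real functions.
TABLE = {
    "CF_TEXT": [
        ("CF_TEXT", lambda d: d),
        ("CF_UNICODETEXT", lambda d: wc2mb(d, ANSI_CP)),
        ("CF_OEMTEXT", lambda d: wc2mb(mb2wc(d, OEM_CP), ANSI_CP)),
    ],
    "CF_OEMTEXT": [
        ("CF_OEMTEXT", lambda d: d),
        ("CF_UNICODETEXT", lambda d: wc2mb(d, OEM_CP)),
        ("CF_TEXT", lambda d: wc2mb(mb2wc(d, ANSI_CP), OEM_CP)),
    ],
    "CF_UNICODETEXT": [
        ("CF_UNICODETEXT", lambda d: d),
        ("CF_TEXT", lambda d: mb2wc(d, ANSI_CP)),
        ("CF_OEMTEXT", lambda d: mb2wc(d, OEM_CP)),
    ],
}


def synthesize(requested, clipboard):
    """Walk the row for `requested` and convert the first source that exists.

    `clipboard` maps format names to data (bytes for CF_TEXT and CF_OEMTEXT,
    str for CF_UNICODETEXT). Returns None if no text format is present.
    """
    for source, convert in TABLE[requested]:
        if source in clipboard:
            return convert(clipboard[source])
    return None


if __name__ == "__main__":
    print(synthesize("CF_OEMTEXT", {"CF_UNICODETEXT": "caf\u00e9"}))
```

Keeping the source format and its conversion in the same ordered row mirrors the table: the priority and the conversion are chosen together in a single pass.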
If you stare at this table, you might notice something odd, possibly even disturbing. And that is part of the answer to the mystery. We’ll talk about it next time.
@Raymond Chen
Instead of moving the cheese around, maybe, just maybe, Microsoft developers could actually add something new and useful (at least for new programs) like, I don’t know,
?
“Perfect is the enemy of good.” CF_UTF8TEXT support would have been nice, but would you say “You can’t ship UTF-8 support until you support CF_UTF8TEXT”? There are so many corners that if you insisted that all of them be identified and cleaned up, the feature would probably never meet your standards for shipping.
"Microsoft developers had plenty of time from Windows 10 1803 release until 2025 to implement and ship proper UTF-8 support for a major component of user workflow such as Clipboard. Instead, they dicked around making Clipboard History..."
There are three teams involved here. There's the NLS team (which is doing activeCodePage to get per-process CP_ACP). Then there's the window manager team (which doesn't want to change the clipboard behavior because it is heavily used and carries lots of compatibility baggage). And there's the Emoji Panel team (which decided that a Clipboard History feature would be a neat thing to add to...
No I wouldn't have said that.
What I would have said is "Microsoft developers had plenty of time from Windows 10 1803 release until 2025 to implement and ship proper UTF-8 support for a major component of user workflow such as Clipboard. Instead, they dicked around making Clipboard History which increases OS attack surface by adding new background services, compromises user privacy, and as it turns out from your recent post even prevents expected codepage conversion flow while running."
And Clipboard History is just one of a myriad of things that were added in the meantime which should've had lower priority than...
The question that I’d be asking, given the table, is “what guarantees that ‘MB2WC(ANSI CP) followed by WC2MB(OEM CP)’ or ‘MB2WC(OEM CP) followed by WC2MB(ANSI CP)’ is the same as AnsiToOem or OemToAnsi?”.
I’m guessing that the answer is not only “nothing”, but that AnsiToOem/OemToAnsi sometimes does clever things based on the locale in use that MB2WC followed by WC2MB does not.
If the table were accurate as shown, it could easily result in infinite recursion, e.g. if there is (only) `CF_OEMTEXT` on the clipboard and the program wants to get `CF_TEXT`, it would try getting `CF_UNICODETEXT` which would fall back to `CF_TEXT` again.
That WOULD be a problem, if it recursed.
It doesn’t. It’s not a recursive function, it’s a flat lookup table.
If there’s only CF_OEMTEXT on the clipboard, and a program wants CF_TEXT, it will check if each format is available, in the order:
1. CF_TEXT
2. CF_UNICODETEXT
3. CF_OEMTEXT
If a program ONLY put CF_OEMTEXT on the clipboard, then the first two checks will find that neither CF_TEXT nor CF_UNICODETEXT is available, and it will perform the conversion from CF_OEMTEXT.
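The difference between the two readings can be sketched in Python. Both functions below are hypothetical: `flat_lookup` follows the description in this comment, while `recursive_misreading` models the interpretation where each fallback is treated as a fresh request, with an artificial depth limit so the example terminates.

```python
PREFERENCE = {
    "CF_TEXT": ["CF_TEXT", "CF_UNICODETEXT", "CF_OEMTEXT"],
    "CF_OEMTEXT": ["CF_OEMTEXT", "CF_UNICODETEXT", "CF_TEXT"],
    "CF_UNICODETEXT": ["CF_UNICODETEXT", "CF_TEXT", "CF_OEMTEXT"],
}


def flat_lookup(requested, available):
    """Check each format in the row once; return (chosen, formats checked)."""
    checked = []
    for candidate in PREFERENCE[requested]:
        checked.append(candidate)
        if candidate in available:
            return candidate, checked
    return None, checked


def recursive_misreading(requested, available, depth=0, limit=50):
    """Hypothetical misreading: every fallback becomes a nested request."""
    if depth > limit:
        raise RecursionError("fallback chain never terminates")
    if requested in available:
        return requested
    for candidate in PREFERENCE[requested][1:]:
        found = recursive_misreading(candidate, available, depth + 1, limit)
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    only_oem = {"CF_OEMTEXT"}
    print(flat_lookup("CF_TEXT", only_oem))
    try:
        recursive_misreading("CF_TEXT", only_oem)
    except RecursionError as err:
        print("recursive reading:", err)
```

The flat version makes at most three availability checks per request, so it cannot loop no matter which formats are present.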
Different processes can definitely be in different OEM code pages. Not sure if that’s what you are looking for.