We closed last time with this table:
| To get | First try | Then try | And then try |
|---|---|---|---|
| CF_TEXT | CF_TEXT | CF_UNICODETEXT + WC2MB(ANSI CP) | CF_OEMTEXT + OemToAnsi |
| CF_OEMTEXT | CF_OEMTEXT | CF_UNICODETEXT + WC2MB(OEM CP) | CF_TEXT + AnsiToOem |
| CF_UNICODETEXT | CF_UNICODETEXT | CF_TEXT + MB2WC(ANSI CP) | CF_OEMTEXT + MB2WC(OEM CP) |
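One way to read the table is as a priority list per requested format. Here is a minimal sketch of that reading in Python. It is only a model of the table, not how Windows implements synthesis, and the names `SYNTHESIS_ORDER` and `pick_source` are made up for illustration.

```python
# Model of the synthesis table above. For each requested format, the
# candidates are listed in the order the table says they are tried.
# Each candidate is (source format, conversion); "identity" means the
# requested format is already on the clipboard.
SYNTHESIS_ORDER = {
    "CF_TEXT": [
        ("CF_TEXT", "identity"),
        ("CF_UNICODETEXT", "WC2MB(ANSI CP)"),
        ("CF_OEMTEXT", "OemToAnsi"),
    ],
    "CF_OEMTEXT": [
        ("CF_OEMTEXT", "identity"),
        ("CF_UNICODETEXT", "WC2MB(OEM CP)"),
        ("CF_TEXT", "AnsiToOem"),
    ],
    "CF_UNICODETEXT": [
        ("CF_UNICODETEXT", "identity"),
        ("CF_TEXT", "MB2WC(ANSI CP)"),
        ("CF_OEMTEXT", "MB2WC(OEM CP)"),
    ],
}


def pick_source(requested, available):
    """Return the first (source, conversion) whose source is available.

    `available` is the set of text formats currently on the clipboard.
    Returns None if none of the three text formats is present.
    """
    for source, conversion in SYNTHESIS_ORDER[requested]:
        if source in available:
            return source, conversion
    return None
```

Under this model, asking for `CF_OEMTEXT` when only `CF_TEXT` is present selects `AnsiToOem`, but if `CF_UNICODETEXT` is also present, the Unicode source wins.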
I noted that there is something odd, possibly even disturbing, about this table.
Let’s redraw the table as a diagram.
| CF_TEXT | ||||
| (CF_LOCALE) | ⇅ | ↑ | (LOCALE_ | |
| CF_UNICODETEXT | | | USER_ | ||
| ↖↘ | ↓ | DEFAULT) | ||
| (CF_LOCALE) | CF_OEMTEXT | |||
Each of the three boxes represents a clipboard format: CF_TEXT, CF_UNICODETEXT, or CF_OEMTEXT.
The lengths of the arrows connecting the boxes represent the priorities: Shorter arrows are preferred over longer arrows. The shortest arrow is the one connecting CF_TEXT to CF_UNICODETEXT. In the middle is the arrow connecting CF_UNICODETEXT to CF_OEMTEXT. And the longest arrow is the one connecting CF_TEXT to CF_OEMTEXT.
Finally, the label on each arrow represents the locale whose code page is used for the conversion. The conversions to and from CF_UNICODETEXT use the CF_LOCALE clipboard format to tell them what locale to use, whereas the conversion between CF_TEXT and CF_OEMTEXT uses LOCALE_USER_DEFAULT.
What’s interesting is that if you want to get from one box to another, say from CF_TEXT to CF_OEMTEXT, you have two options. You can either use the direct line from CF_TEXT to CF_OEMTEXT, or you can take the scenic route from CF_TEXT to CF_UNICODETEXT to CF_OEMTEXT. And the two options produce different results! (In category theory, you would say that the diagram is not commutative.)
If you take the direct route from CF_TEXT to CF_OEMTEXT, then the conversion uses LOCALE_USER_DEFAULT, but if you take the scenic route, then the conversion to CF_UNICODETEXT uses the locale specified by CF_LOCALE, as does the conversion from CF_UNICODETEXT to CF_OEMTEXT. If the locale specified by CF_LOCALE is different from LOCALE_USER_DEFAULT, then you could very well get different results!
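To make the two routes concrete, here is a small sketch that records which locale source each arrow of the diagram consults and then lists the locale sources along a route. The names `EDGE_LOCALE` and `locales_along` are hypothetical, and the sketch models only the diagram, not the actual clipboard code.

```python
# Which locale each arrow of the diagram consults, per the arrow labels.
# Arrows are bidirectional, so each one is keyed by an unordered pair.
EDGE_LOCALE = {
    frozenset({"CF_TEXT", "CF_UNICODETEXT"}): "CF_LOCALE",
    frozenset({"CF_OEMTEXT", "CF_UNICODETEXT"}): "CF_LOCALE",
    frozenset({"CF_TEXT", "CF_OEMTEXT"}): "LOCALE_USER_DEFAULT",
}


def locales_along(route):
    """Return the locale source used on each hop of a route.

    `route` is a list of clipboard format names, one per box visited.
    """
    return [EDGE_LOCALE[frozenset(hop)] for hop in zip(route, route[1:])]


direct = locales_along(["CF_TEXT", "CF_OEMTEXT"])
scenic = locales_along(["CF_TEXT", "CF_UNICODETEXT", "CF_OEMTEXT"])
```

The direct route consults only `LOCALE_USER_DEFAULT`, whereas both hops of the scenic route consult `CF_LOCALE`, so the two routes agree only when those two locales imply the same code pages.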
In my test program, I wrote the string "\xD0" to the clipboard as ANSI, and when I read it back as OEM, I expected to receive "\x44" because my system is running with US-English, and the character D0 in code page 1252 is Ð (U+00D0), whose best fit in code page 437 is D (U+0044).
But what I received was "\x90". I had set the CF_LOCALE clipboard format to 0x0419, which is the locale ID for ru-ru. Receiving character 90 would make sense if the ANSI and OEM code pages were taken from the ru-ru locale: Character D0 in the ru-ru ANSI code page 1251 is Р (U+0420), and that maps neatly to character 90 in the ru-ru OEM code page 866, which is also Р (U+0420).
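The code page arithmetic can be checked outside of Windows. The sketch below uses Python’s built-in codecs as stand-ins for the Windows code pages, and `scenic_route` is a hypothetical helper name. One caveat: Python’s codecs are strict and do not perform the Windows best-fit mapping, so the sketch can show only that Ð has no exact equivalent in code page 437. It cannot reproduce the best-fit result of D.

```python
def scenic_route(ansi_bytes, ansi_cp, oem_cp):
    """Convert ANSI bytes to OEM bytes by way of Unicode.

    Both code pages are assumed to come from the same locale, as they
    would on the CF_TEXT -> CF_UNICODETEXT -> CF_OEMTEXT route.
    """
    text = ansi_bytes.decode(ansi_cp)  # ANSI bytes to Unicode
    return text.encode(oem_cp)         # Unicode to OEM bytes


# ru-ru: ANSI code page 1251, OEM code page 866.
ru_result = scenic_route(b"\xD0", "cp1251", "cp866")

# US-English: ANSI code page 1252, OEM code page 437.
en_char = b"\xD0".decode("cp1252")
try:
    en_char.encode("cp437")
    exact_in_437 = True
except UnicodeEncodeError:
    # No exact match in code page 437. Windows would fall back to a
    # best-fit character here, which this sketch does not model.
    exact_in_437 = False
```

With the ru-ru code pages, byte D0 decodes to U+0420 and re-encodes as byte 90. With the US-English code pages, byte D0 decodes to U+00D0, which code page 437 cannot represent exactly.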
So it seems that Windows is taking the scenic route, and rather than using AnsiToOem, it’s going through CF_UNICODETEXT. Is the table wrong?
No, the table is correct.
We’ll study the problem some more next time.
My guess is Clipboard History is reading the clipboard in between and asking for CF_UNICODETEXT, causing Windows to take the scenic route.