Studying the various locale mismatch scenarios in Windows clipboard text format synthesis

Raymond Chen

So far, we’ve learned that the conversion between Unicode and the 8-bit ANSI and OEM code pages is performed with the assistance of the CF_LOCALE clipboard format, which itself comes from the active keyboard layout. We left with the question of whether this is the right thing, giving as an example the case of highlighting some text in Hebrew and copying it to the clipboard. Shouldn’t that be set with a Hebrew LCID?

First of all, you have to specify what you mean by “copy it to the clipboard.” Suppose the English-language user selected some Hebrew text and the program placed it on the clipboard as CF_UNICODETEXT with a Hebrew LCID. A program which reads the CF_UNICODETEXT will read the original Unicode text, with Hebrew characters intact. The LCID plays no role since no conversion was performed. So in the case where the string was placed as Unicode and retrieved as Unicode, everything is fine.

If the string were placed as Unicode but read as CF_TEXT, the retrieving program will get the string translated to code page 1252, since that is the ANSI code page used by the US-English LCID. Is this the correct code page? Well, if the retrieving program is using CF_TEXT, then it is a program that uses the 8-bit ANSI character set as its string encoding, and if you’re running on a US-English system, then the 8-bit ANSI character set is code page 1252. So translating the Hebrew text to ANSI via code page 1252 is correct. You need to translate the string into the ANSI code page that the retrieving program is using.

Conversely, if the Hebrew string were placed on the clipboard as 8-bit ANSI in code page 1252, then… wait, that’s a trick question! Code page 1252 doesn’t have any Hebrew characters! If a program uses the US-English 8-bit ANSI character set, it cannot represent Hebrew characters at all, so the scenario itself is flawed: There can’t be any Hebrew text on the screen to be selected since the program has no way of displaying it.

Now, I guess it could be possible if a program internally supported enough Unicode to display Hebrew characters, but still chose to put text on the clipboard in ANSI format. But in that case, it would be putting question marks on the clipboard since there are no Hebrew characters in code page 1252. Any program that does this intentionally is clearly being pathological: Why do all the work to display characters in Unicode, yet copy those characters to the clipboard in 8-bit ANSI?
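To make the question-mark outcome concrete, here is a minimal sketch of what happens when Hebrew code points are squeezed into code page 1252. It is not the system's conversion code: the helper names are hypothetical, and the encoder is a simplified model that handles only ASCII and the U+00A0 to U+00FF range, which code page 1252 shares with Latin-1. Everything else falls back to '?', the usual default character.

```cpp
#include <string>

// Hypothetical helper: encode one code point into a simplified model of
// code page 1252. Only ASCII and U+00A0..U+00FF are modeled; the special
// characters that real code page 1252 keeps at 0x80..0x9F are ignored.
inline char EncodeCp1252Simplified(char32_t cp)
{
    if (cp < 0x80) {
        return static_cast<char>(cp);  // ASCII maps to itself
    }
    if (cp >= 0xA0 && cp <= 0xFF) {
        // Same byte value as the code point in this range.
        return static_cast<char>(static_cast<unsigned char>(cp));
    }
    return '?';  // not representable: use the default character
}

// Hypothetical helper: convert a whole string, one code point at a time.
inline std::string ToCp1252Simplified(const std::u32string& text)
{
    std::string result;
    for (char32_t cp : text) {
        result.push_back(EncodeCp1252Simplified(cp));
    }
    return result;
}
```

Under this model, the four Hebrew letters in U"\u05E9\u05DC\u05D5\u05DD" all lie outside the representable range, so ToCp1252Simplified returns "????".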

But wait, let’s rewind to a simpler scenario where there are no character set conversions at all. A program sets text on the clipboard in 8-bit ANSI, and another program reads it. If we consult our table, we see that the entry for this is “N/A”: There is no conversion. This holds true even if the program that put the text on the clipboard and the program that reads the text from the clipboard disagree on what the 8-bit ANSI code page is.

Prior to the introduction of the activeCodePage manifest declaration, the identity of the 8-bit ANSI code page was the same for all applications running in the same desktop. There was no opportunity for mismatch, so if one program put the text on the clipboard in 8-bit ANSI, and another read it out in 8-bit ANSI, they necessarily agreed on what the 8-bit ANSI code page was, since there was only one. But now that we have the ability for different programs to have a different value for the 8-bit ANSI code page, this nop-transformation will result in mojibake if the reader and writer have different ideas about what the 8-bit ANSI code page is.
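The mojibake from a no-op transfer can be illustrated with a small sketch. The two decoders below are hypothetical, simplified models (not Windows APIs): each one interprets a single byte the way a program would if it believed the 8-bit ANSI code page were 1252 or 1251, respectively. Bytes that the simplified models do not cover decode to U+FFFD.

```cpp
// Hypothetical decoder: simplified model of code page 1252.
// ASCII and 0xA0..0xFF map to the code point with the same value;
// the 0x80..0x9F block is not modeled here.
inline char32_t DecodeAsCp1252(unsigned char b)
{
    if (b < 0x80 || b >= 0xA0) {
        return b;
    }
    return 0xFFFD;
}

// Hypothetical decoder: simplified model of code page 1251.
// ASCII maps to itself, and 0xC0..0xFF hold the basic Cyrillic
// letters U+0410..U+044F; other bytes are not modeled here.
inline char32_t DecodeAsCp1251(unsigned char b)
{
    if (b < 0x80) {
        return b;
    }
    if (b >= 0xC0) {
        return 0x0410 + (b - 0xC0);
    }
    return 0xFFFD;
}
```

If a writer that believes in code page 1251 puts the byte D0 on the clipboard, it means U+0420 (Р). A reader that believes in code page 1252 receives the same byte untouched and decodes it as U+00D0 (Ð). The byte never changed, but the text did.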

You have the same problem with the AnsiToOem conversion: Historically, all programs agreed on what the 8-bit ANSI and 8-bit OEM code pages are, so the system maintains a single “ANSI⇆OEM” conversion table that is shared by all processes. But now that programs can choose (indirectly) their ANSI and OEM code pages, you have a problem if those choices don’t match those the system would have chosen.

The people who added activeCodePage support hooked it up to the GetACP() and GetOEMCP() functions, as well as to the A-suffixed functions which convert their 8-bit ANSI string parameters to Unicode before forwarding the result to the W-suffixed functions. But there are other places that didn’t get updated because doing so would require larger architectural changes, would affect performance of programs that didn’t use the activeCodePage feature, would introduce regression risk, and could lead to compatibility problems. Not saying that they couldn’t have done it, but it would have taken longer, and maybe it’s better to have a good-enough feature than a perfect one.

While doing fact-checking on this series of articles, I wrote some test programs that tried to trigger the CF_TEXT-to-CF_OEMTEXT conversion, and they didn’t behave as I expected.

// Note: Test program doesn't do error-checking.
// hwnd is assumed to be a window handle created elsewhere
// by this program (window creation not shown).

#include <windows.h>

// Put the ANSI string "\xD0" on the clipboard,
// with the locale 1049 (ru-ru).
int main()
{
    if (OpenClipboard(hwnd)) {
        EmptyClipboard();

        // Put an ANSI string on the clipboard:
        // the single byte D0 plus a null terminator.
        HGLOBAL glob = GlobalAlloc(GMEM_MOVEABLE, 2);
        PSTR message = (PSTR)GlobalLock(glob);
        message[0] = (char)0xD0;
        message[1] = 0x00;
        GlobalUnlock(glob);
        SetClipboardData(CF_TEXT, glob);

        // Mark it as locale 0x0419 = 1049 = ru-ru
        glob = GlobalAlloc(GMEM_MOVEABLE, sizeof(LCID));
        *(LCID*)GlobalLock(glob) = 0x0419;
        GlobalUnlock(glob);
        SetClipboardData(CF_LOCALE, glob);

        CloseClipboard();
    }
}

And here’s the program to read the string back out in the OEM code page.

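// hwnd is assumed to be a window handle created elsewhere
// by this program (window creation not shown).

#include <windows.h>
#include <stdio.h>

int main()
{
    if (OpenClipboard(hwnd)) {
        HGLOBAL glob = GetClipboardData(CF_OEMTEXT);
        PSTR message = (PSTR)GlobalLock(glob);
        // Print the first byte as two hex digits. The cast keeps
        // bytes 0x80 and above from being sign-extended.
        printf("%02x\n", (unsigned char)message[0]);
        GlobalUnlock(glob);

        CloseClipboard();
    }
}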

I ran this on a US-English system, so the LCID is 0x0409 = 1033, the ANSI code page is 1252, and the OEM code page is 437. The character D0 in code page 1252 is Ð = U+00D0. This character does not exist in code page 437, so AnsiToOem uses the best-fit character D = U+0044, which is in position 44 in code page 437.
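The expectation above can be written out as a small sketch. This is not how AnsiToOem is implemented; it is a hypothetical model with a tiny hand-picked table, just large enough to follow the reasoning: decode the byte as code page 1252, then look for the resulting code point in code page 437, falling back to a best-fit character.

```cpp
// Hypothetical table entry: a Unicode code point and the byte that a
// simplified code page 437 encoder would produce for it.
struct Cp437Entry
{
    char32_t codePoint;
    unsigned char oemByte;
};

// A few illustrative entries only. The first three are exact matches in
// code page 437; the last is the best-fit substitution described in the
// text (U+00D0 has no exact match, so 'D' = 0x44 stands in for it).
constexpr Cp437Entry kCp437Sample[] = {
    { 0x00C7, 0x80 },  // Ç
    { 0x00FC, 0x81 },  // ü
    { 0x00E9, 0x82 },  // é
    { 0x00D0, 0x44 },  // Ð, best-fit to D
};

// Hypothetical helper: the conversion one would expect on a US-English
// system (ANSI = 1252, OEM = 437), under the simplified model above.
inline unsigned char ExpectedAnsiToOem(unsigned char ansiByte)
{
    if (ansiByte < 0x80) {
        return ansiByte;  // ASCII is the same in both code pages
    }
    // In code page 1252, bytes 0xA0..0xFF decode to the code point with
    // the same value. (0x80..0x9F are not modeled and fall through to '?'.)
    char32_t codePoint = ansiByte;
    for (const Cp437Entry& entry : kCp437Sample) {
        if (entry.codePoint == codePoint) {
            return entry.oemByte;
        }
    }
    return '?';  // not in the sample table
}
```

Under this model, ExpectedAnsiToOem(0xD0) produces 0x44, which is the byte the test program was expected to print.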

When I ran this program, I expected the CF_OEMTEXT string to have the byte 44, but it didn’t. It had the byte 90. We will start unraveling this mystery next time.


11 comments


Solomon Ucko December 13, 2025 · Edited

Decoding 0xD0 in codepage 1251 (Cyrillic) gives U+0420 ‘CYRILLIC CAPITAL LETTER ER’ (“Р”), and then encoding this in codepage 808 (OEM Russian) gives 0x90 (determined using CyberChef)

I’m therefore guessing that the CF_LOCALE causes the ANSI–OEM conversion to use that locale’s codepages, but I’m not sure how.
Victor Khimenko December 12, 2025

When I read “We’ll start studying all the multi-locale issues next time” and, especially, when I heard about the Hebrew keyboard, I hoped to see the fact that “most Hebrew keyboards are bilingual, with Latin characters, usually in a US Qwerty layout” (as Wikipedia tells us) brought into the discussion. What happens if a person types text in Hebrew, selects the English keyboard layout (this is normal when you use Greek, Hebrew or Russian) and THEN copies it? Would the text be mangled or not? What does “current keyboard layout” even MEAN, in these countries?
- Кирилл Ярин December 18, 2025
  
  In that quite common situation the text is mangled as expected.
  
  I just rechecked that on W11 build 26200.7462, my own old VB6 application as a pre-Unicode program in Win-1251, Firefox as a Unicode one, and three keyboard layouts (English UK language with US keyboard layout, Russian language with default Russian keyboard layout, Ukrainian language with Ukrainian Enhanced keyboard layout). If I copy a text with Cyrillic characters from any plain text control in my VB6 app and insert it in Firefox’s address bar, it works normally with Russian and Ukrainian keyboard layouts, but switching to English before copying makes it mangled.
  
  Interestingly enough, if I copy the same text from a Rich Text control (which was from richtx32.ocx), it is not mangled.
  
- Raymond Chen Author December 12, 2025
  
  “Would the text be mangled or not?” I thought I worked out the scenarios in the article.
  - Victor Khimenko December 15, 2025
    
    I understand that what is written on the keyboard is irrelevant. But the question wasn’t about that. In countries where Hebrew, Greek or Russian is used it’s typical to have TWO layouts selected, NOT one. US-English + Hebrew, US-English + Greek, US-English + Russian are so typical that most keyboards are made to be used that way! In fact that’s what Windows, itself, does: not one layout, but two, automatically, just try to install even Windows 11 with these languages!
    
    And from the description in the articles so far the conclusion is that the clipboard would either mangle or not mangle text depending on which layout is selected. There was even an MS Excel bug where one may or may not use the “Ж” letter in the name of a table depending on which layout was active when someone pressed a key the very first time ( https://habr.com/ru/articles/264313/ — use your translator to read about details: 43 tables that Excel creates when you click on keyboard, trouble with two layouts, etc). But if I remember correctly the actual text is never mangled in the clipboard in that case, which means that when users have US-English + Hebrew layouts and copy-paste while in US-English, somehow the Hebrew layout is used.
    
  - Me Gusta December 15, 2025
    
    @Victor Khimenko
    
    It is the one that the user has selected. Right now, my hardware keyboard is a UK keyboard and so my keyboard layout is set to en-GB. I can go into the language settings and add en-US. The face glyphs won’t match the keyboard at all, but Windows will interpret the scan codes as the US keyboard layout. This is then what the clipboard would use.
    
    The hardware keyboard is irrelevant in the equation, it is purely down to the user settings.
  - Victor Khimenko December 15, 2025
    
    You are saying that the locale comes from the keyboard layout, but what keyboard layout is active when someone in countries where two layouts are in use switches to US-English and uses Copy+Paste? They sell keyboards with two sets of glyphs on them for a reason, and it’s not like in CJK countries where English is embedded into a gigantic IME-based Far East locale; people that use Hebrew, Greek or Russian simply install two locales and switch between them (well, Windows does it for them, but the question still remains).
Joshua Hudson December 11, 2025

I find myself a little confused. On Windows 7 I had a program that could launch another program in a different ANSI codepage. It didn’t work on Windows 10 for some reason.
LB December 11, 2025

Really interested to see how this turns out, and if it means applications using the UTF-8 code page support have to still convert clipboard text to UTF-16 first or not. I suppose they do, based on what was said in this article… though there is also that beta option in the Windows language settings to use UTF-8 for applications that don’t opt-in to UTF-8 code page already…
- Antonio Rodríguez December 11, 2025 · Edited
  
  Forcing UTF-8 support in ANSI applications can wreak havoc for non-English users. Most (all?) European languages use characters in the 128-255 range of the local codepage. In particular, in Windows-1252, used for Western European languages, most vowels with diacritics lie in the 192-255 range, which UTF-8 uses for the lead bytes of multi-byte sequences. As soon as a program tries to display a hard-coded ANSI text with one of those characters, it will be discarded as malformed UTF-8.
  
  In particular, in Spanish the Edit menu is called “Edición” (with “ó” being code 243 in Windows-1252), which means that this will usually happen when the main window is displayed. Will the process be terminated, the string discarded (and not displayed), the offending character replaced with a question mark, or will Windows employ some kind of auto-detection (which, of course, will sometimes fail)? Each of those options introduces its own problems.
  
  If it ain’t broken, don’t try to fix it. Native UTF-8 support is the way to go for modern applications, but just can’t be retrofitted in legacy ones.
  
- Me Gusta December 11, 2025
  
  This still falls foul of activeCodePage. On Windows 11, activeCodePage is able to set a legacy Code Page. So it is possible to have the default be UTF-8, but then override some applications to use a specific Code Page.