For the past few articles (starting with the conversion between CF_TEXT and CF_UNICODETEXT), we’ve been looking at how Windows performs text conversion among its three clipboard text formats: CF_TEXT, CF_OEMTEXT, and CF_UNICODETEXT. A lot of the weirdness dates back to adding Unicode support to a system that originally supported only 8-bit, code page-based encodings.
You might take away from this that the clipboard text conversion system is a mess, and that you should simply avoid putting text on the clipboard. But really, all the problems boil down to inconsistent conversions to and from the 8-bit formats. If you stick with CF_UNICODETEXT, then everything works great!
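For concreteness, here is a minimal sketch of that happy path (the function name is illustrative, and error handling is abbreviated): allocate a movable global block, copy in the UTF-16LE text, and hand it to the clipboard as CF_UNICODETEXT, letting the system synthesize the 8-bit formats on demand.

```cpp
#include <windows.h>
#include <string.h>
#include <wchar.h>

// Sketch: place UTF-16LE text on the clipboard as CF_UNICODETEXT.
// The system synthesizes CF_TEXT and CF_OEMTEXT on demand.
void CopyUnicodeText(HWND hwnd, const wchar_t* text)
{
    SIZE_T cb = (wcslen(text) + 1) * sizeof(wchar_t);
    HGLOBAL hmem = GlobalAlloc(GMEM_MOVEABLE, cb);
    if (!hmem) return;

    memcpy(GlobalLock(hmem), text, cb);
    GlobalUnlock(hmem);

    if (OpenClipboard(hwnd)) {
        EmptyClipboard();
        // On success, the clipboard owns hmem; on failure, we must free it.
        if (!SetClipboardData(CF_UNICODETEXT, hmem)) GlobalFree(hmem);
        CloseClipboard();
    } else {
        GlobalFree(hmem);
    }
}
```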
For over two decades, Windows has been pushing application developers to move to Unicode, with support for 8-bit code pages being retained for backward compatibility with old programs that haven’t had a chance to update.
So don’t be an old program. Be a new program that uses Unicode, specifically the UTF-16LE encoding, which is what “Unicode” typically means in the context of Windows.
If you prefer to use UTF-8 internally, that’s fine, but convert to UTF-16LE when interacting with the clipboard. If you try to put 8-bit UTF-8 data on the clipboard as CF_TEXT, you are jumping into the ugly mess that is 8-bit code pages.
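The conversion itself is a pair of calls to MultiByteToWideChar with CP_UTF8. A sketch, with a helper name of my own and error handling abbreviated:

```cpp
#include <windows.h>
#include <string>

// Sketch: convert an internal UTF-8 string to the UTF-16LE that
// CF_UNICODETEXT expects. (Returns an empty string on failure.)
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    // First call sizes the buffer; the -1 length includes the terminator.
    int cch = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    if (cch == 0) return {};

    std::wstring wide(cch, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], cch);
    wide.resize(cch - 1); // drop the terminating null counted by -1
    return wide;
}
```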
Bonus chatter: “But why didn’t they fix this when they added support for UTF-8 as CP_ACP?”
This is a case of perfect being the enemy of good.
The ability to specify a custom activeCodePage was scoped primarily to allowing CP_ACP to be customized on a per-process basis. This magically takes care of functions like MultiByteToWideChar(CP_ACP) and WideCharToMultiByte(CP_ACP), as well as any functions built on top of those functions. In particular, the magic extends to functions that have both A and W versions, since they internally use MultiByteToWideChar to convert the 8-bit string to UTF-16LE before passing it to the W version.
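As an illustration, assuming a manifest that sets activeCodePage to UTF-8, the process’s ANSI code page reports as CP_UTF8 (65001), and the A entry points pick that up automatically:

```cpp
#include <windows.h>
#include <assert.h>

// Illustration (assumes the application manifest sets activeCodePage to
// UTF-8): the process-wide ANSI code page is now UTF-8, so the A entry
// points interpret their 8-bit arguments as UTF-8.
void DemonstrateUtf8Acp(HWND hwnd)
{
    assert(GetACP() == CP_UTF8); // 65001: the manifest setting took effect

    // SetWindowTextA converts via MultiByteToWideChar(CP_ACP) and then
    // forwards to SetWindowTextW, so these UTF-8 bytes ("café") round-trip.
    SetWindowTextA(hwnd, "caf\xC3\xA9");
}
```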
But there are lots of other places with hidden dependencies on weird quirks of the code page system, such as the clipboard. Chasing down every last one of them would have taken a long time, and then the activeCodePage team would also have had to convince all the affected components to add additional code to support a dynamic CP_ACP, which in turn could force a larger redesign of that component, a redesign the team felt was too risky.
At least the current version of activeCodePage is clear about what it does: It lets you customize the value of CP_ACP.
It’s often better to have a simple set of easy-to-remember rules, even if they don’t cover all the cases, rather than to have a complex set of rules that tries to cover more cases but inevitably still fails to get them all. At least with the simple set of rules, you can predict where it will work and where it will fall short.
Given the prevalence of UTF-8 nowadays, it honestly seems like a new CF_UTF8 clipboard format would solve many of the pain points, with implicit conversion to CF_UNICODETEXT, of course. (I wouldn't add additional conversions to CF_TEXT and CF_OEMTEXT — go through CF_UNICODETEXT and let the existing tested code handle the remainder of the journey. Converting out of Unicode is already painful enough as it is without having slight incompatibilities arising from having two ways to do so.)
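(For context: an application can already register a private UTF-8 clipboard format today, but the system synthesizes conversions only among the built-in text trio, which is why the implicit conversion described above would need OS support. The format name below is made up.)

```cpp
#include <windows.h>

// Sketch: a private UTF-8 clipboard format is possible today, but no
// implicit conversion to CF_UNICODETEXT is synthesized for it; only
// CF_TEXT/CF_OEMTEXT/CF_UNICODETEXT convert among each other.
UINT cfUtf8 = RegisterClipboardFormatW(L"Private UTF-8 Text");
```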
UTF-8 is a problem for modern apps, not for legacy software from the early 2000s, so a new API (in the form of a...
Or perhaps the clipboard could just remember which actual code page was used when setting CF_TEXT, and use that when converting to CF_UNICODETEXT, or to CF_TEXT in a different process (with a different ACP).
No, that's not a sensible argument. UTF-16 is bad, everything using it is obsolete, and there's no way around it. I've been around a long time. As soon as UTF-16 broke its promise of constant-time indexing by adding surrogate pairs, it became obsolete, and the W variant of the Windows API surface with it.
The UTF-8 everywhere team is right. I can sit here and measure this. Unless you're literally dealing with a wall of Chinese, it's faster to handle everything internally as UTF-8 and convert to UTF-16 a few hundred codepoints at a time at screen draw...
It is what it is. Windows internals aren’t going to change.
If you store and do a lot of text manipulation, sure, use UTF-8… but if I get a UTF-16 string from one API, append something, and pass it to another UTF-16 API, adding two extra conversions seems like a waste of both performance and engineering time.
> If you store and do a lot of text manipulation...
You mean like:
1. Web services which send and receive UTF-8
2. Databases which store and retrieve UTF-8
3. Text editors which load, edit, and save UTF-8
4. Loggers which log UTF-8 so your logs aren't pointlessly 2x the size and can be read on other platforms
5. Browsers which use UTF-8 to request the URL you typed, fetch pages in UTF-8, and show them to you
6. Email clients which use UTF-8 (not Outlook; Outlook is a mess, especially the new one, which doesn't even register the mailto: protocol properly)
7. Hardware devices...
> If you prefer to use UTF-8 internally, that’s fine, but convert to UTF-16LE when interacting with the clipboard.
This is not just a matter of preference -- it is a matter of cross-platform compatibility.
Try writing code which is supposed to work on Windows and on Linux, where everything is UTF-8, and you will see what I mean.
> This is a case of perfect being the enemy of good. ...
Keep telling yourself that if it makes you sleep better at night, but it is still a lame excuse to keep the status quo instead of actually modernizing the API surface AND...
> actually modernizing the API surface AND underlying behavior.
Oh sure, let's break literally every application that currently exists on Windows just to allow easier cross-platform development. That makes a ton of sense.
"That's not what I'm saying. Make the core of Windows UTF-8, like the rest of the world, and add a compatibility layer for anything that needs UTF-16."
One, that would involve rewriting everything inside of Windows that even touches strings TWICE. ONCE to get the core to UTF-8, and AGAIN to make the MASSIVE compatibility layer just to let UTF-8 Windows pretend to be what it already is today!
Two, compatibility...
Pro-tip: Next time you respond to someone, try addressing what they actually said.
1. I never suggested breaking anything OS-wide—that’s a hallucination on your part.
2. The only change I proposed was adding a CF_UTF8 clipboard format for new programs. Nothing else.
As for rewriting all APIs to UTF-8, I never suggested that either. But if you want to discuss it: the real work would be updating console, file I/O, and CRT APIs first, since they’re the most visible surfaces for most cross-platform apps. Everything else could come later—or never—considering GUI desktop apps are mostly just skinned web browsers today.
So no, it’s not...
It also seems like a minor issue - AFAICT, it would be trivial to write a SetClipboardUtf8 function that takes UTF-8 as input, checks that CP_ACP is set to UTF-8, uses MultiByteToWideChar(CP_ACP) to convert the input, and then calls SetClipboardData(CF_UNICODETEXT, converted). This then works both for delayed render (SetClipboardData(CF_UNICODETEXT, NULL), then call this function in response to a WM_RENDERFORMAT) and for immediate render (just call the function).
The same applies in reverse to a GetClipboardUtf8 function - GetClipboardData(CF_UNICODETEXT) and convert.
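Here's a minimal sketch of the pair as described (SetClipboardUtf8/GetClipboardUtf8 are hypothetical names; error handling is abbreviated, and the caller is assumed to have the clipboard open, with EmptyClipboard already called on the set path for immediate render). It passes CP_UTF8 explicitly rather than relying on CP_ACP:

```cpp
#include <windows.h>
#include <string>
#include <utility>

// Sketch: convert UTF-8 to UTF-16LE and hand it to the clipboard
// as CF_UNICODETEXT.
bool SetClipboardUtf8(const std::string& utf8)
{
    int cch = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    if (cch == 0) return false;

    HGLOBAL hmem = GlobalAlloc(GMEM_MOVEABLE, cch * sizeof(wchar_t));
    if (!hmem) return false;

    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1,
                        static_cast<wchar_t*>(GlobalLock(hmem)), cch);
    GlobalUnlock(hmem);

    if (!SetClipboardData(CF_UNICODETEXT, hmem)) {
        GlobalFree(hmem); // the clipboard takes ownership only on success
        return false;
    }
    return true;
}

// Sketch: fetch CF_UNICODETEXT and convert it to UTF-8.
bool GetClipboardUtf8(std::string& utf8)
{
    HANDLE hmem = GetClipboardData(CF_UNICODETEXT);
    if (!hmem) return false;

    auto* wide = static_cast<const wchar_t*>(GlobalLock(hmem));
    int cb = WideCharToMultiByte(CP_UTF8, 0, wide, -1,
                                 nullptr, 0, nullptr, nullptr);
    if (cb > 0) {
        std::string buf(cb, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wide, -1, &buf[0], cb,
                            nullptr, nullptr);
        buf.resize(cb - 1); // drop the terminating null counted by -1
        utf8 = std::move(buf);
    }
    GlobalUnlock(hmem);
    return cb > 0;
}
```

For delayed render, publish SetClipboardData(CF_UNICODETEXT, NULL) up front and call SetClipboardUtf8 from the WM_RENDERFORMAT handler, as described above.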
The lack of such a function in Win32 implies either that it's not worth Microsoft supplying because it's so rarely needed (e.g....
Indeed, the helper doesn’t even need to check whether CP_ACP is UTF-8. It can just call MultiByteToWideChar(CP_UTF8). The point about this being reducible to an open-source helper function is important, though. Making something more convenient is nice, but it’s more important to make it possible.