Suppose a user pastes some plain text into a document. In principle, the text can contain any Unicode character. That includes virtually all characters used in the current languages of the world along with many ancient scripts and a plethora of symbols, mathematical and otherwise, that don’t belong to any language. The question arises as to what font(s) to use for the pasted characters. In general, the same font cannot be used for all characters, since TrueType glyph indices are 16-bit numbers, which limits a font to 65536 glyphs. Meanwhile Unicode has over 140,000 named characters. Furthermore, even if a font could contain glyphs for all Unicode characters, it wouldn’t be able to render them all without compromises in quality. East Asian characters, for example, ideally have different baselines from Latin characters. This post describes the two ways RichEdit chooses fonts for characters not present in the active font. This process is called “font binding”.
The first section describes RichEdit character repertoires. The second section explains how a character is assigned to a character repertoire. The third section describes how to find out what character repertoires are supported by a font. The fourth section shows how these two kinds of information can be combined to bind fonts to characters in a context-dependent way. The fifth section describes the alternative font binding technology (IProvideFontInfo) used by the XAML RichEditBox and TextBox controls as well as Office applications that run RichEdit in the D2D/DirectWrite mode.
RichEdit Character Repertoires
RichEdit’s built-in font binding facility is an extension of the GDI CreateFont() functionality that ensures the created font matches a given charset. If the font named in the call supports the charset, then the font is used, but if not, GDI instantiates a font that does. Before Unicode became popular, charsets defined character encodings for character repertoires typically associated with language systems, like Western European languages and Japanese. As such they were used for two purposes: 1) to define the encodings, and 2) to define character repertoires supported by fonts. The GDI CreateFont charset functionality addresses the latter purpose. This facility, which is a kind of “font fallback”, is very handy, since it’s usually easy to choose the charsets for characters that have charsets. In contrast, it’s harder to choose the correct languages for characters in general.
Charsets correspond to code pages. The Windows code pages are described here. For reference, the Windows charsets supported by RichEdit are
ANSI_CHARSET | EASTEUROPE_CHARSET | RUSSIAN_CHARSET |
GREEK_CHARSET | TURKISH_CHARSET | HEBREW_CHARSET |
ARABIC_CHARSET | BALTIC_CHARSET | VIETNAMESE_CHARSET |
DEFAULT_CHARSET | SYMBOL_CHARSET | THAI_CHARSET |
SHIFTJIS_CHARSET | GB2312_CHARSET | HANGUL_CHARSET |
CHINESEBIG5_CHARSET | PC437_CHARSET | OEM_CHARSET |
MAC_CHARSET |
When Windows 2000 added support for Indic and several other Southeast Asian scripts, the decision was made not to add charsets for new scripts since it was clear that Unicode was the best way to represent characters on computers. Unfortunately, that decision limited GDI’s convenient font “fallback” mechanism to character repertoires that have charsets. RichEdit needed to generalize this usage of charset. Accordingly, we defined the charrep, a character repertoire index. It usually corresponds to an ISO script, but there are charreps with no corresponding ISO script and vice versa. In addition to the charrep for each Windows charset, there are charreps for
Armenian | Syriac | Thaana | Devanagari | Bengali |
Gurmukhi | Gujarati | Oriya | Tamil | Telugu |
Kannada | Malayalam | Sinhala | Lao | Tibetan |
Myanmar | Georgian | Jamo | Ethiopic | Cherokee |
Aboriginal | Ogham | Runic | Khmer | Mongolian |
Braille | Yi | Math General | Math Alphanumeric | Limbu |
Taile | Newtailu | Sylotinagr | Kharoshthi | Kayahli |
Unicode symbol | Emoji | Glagolitic | Lisu | Vai |
N’Ko | Osmanya | Phagspa | Gothic | Deseret |
Tifinagh | Old italic | Old Turkic | Bopomofo | Cyrillic xb |
Javanese | Olchiki | Sorasompeng | Buginese | Coptic |
Meroitic | Enc. alpha-num | Brahmi | Carian | Cuneiform |
Cypriot | Egyp. hieroglyph | Aramaic | Pahlavi | Parthian |
Lycian | Lydian | Old Persian | Old Sarabian | Phoenician |
Shavian | Ugaritic | Adlam | Osage |
Most of these are described in The Unicode Standard. The charreps can be used for font binding and they are also used to improve performance by avoiding unnecessary text analysis. Ideally there would be charreps for all ISO scripts that Windows supports. The IProvideFontInfo font binding described in the last section of this post attempts to do just that.
Determining a Character’s Charrep
RichEdit does a kind of binary range search to find out which charrep a character nominally belongs to. ASCII is treated specially, since almost all fonts support “low ASCII”, the “neutral” range U+0020..U+003F. High ASCII, the range U+0040..U+007F, is supported by most fonts as well. The ANSI_CHARSET includes ASCII and the Western European ANSI set contained in Windows code page 1252. For the range U+00A0..U+00FF, there’s a charrep named “high Latin 1”.
East Asian (CJK: Chinese, Japanese, Korean) fonts all support ASCII, but typically don’t support high Latin 1, since at least some of those code positions are used for lead bytes of double-byte character sets or for kana characters. Chinese characters are used in Japanese, in both simplified and traditional Chinese, and in Korean. The CJK fonts often have a lot of Unicode symbol characters (not SYMBOL_CHARSET, discussed shortly), so a CJK charrep may be returned for those. An exception occurs in math zones, where a math charrep is preferred.
The default charrep is assigned to the Unicode Private Use Area characters, U+E000..U+F8FF, since these characters have no standardized scripts. Special attention is given to SYMBOL_CHARSET fonts, which don’t use Unicode. Characters in the range U+F020..U+F0FF are assigned the SYMBOL_CHARSET charrep, since Microsoft TrueType SYMBOL_CHARSET fonts use those locations as aliases for U+0020..U+00FF. The charrep assignments are similar to some assignment models based on scripts. But note that natural language isn’t used. This is because in general it’s easier and more reliable to figure out a reasonable charrep for a character than a natural language for a character. A RichEdit client can find out what charrep is assigned to a range of text by calling ITextFont2::GetCharRep().
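The range search described above might be sketched roughly as follows. This is only an illustration, not RichEdit's actual code: the CharRep enum, the kRanges table, and the function name CharRepFromChar are hypothetical, the table holds a tiny subset of ranges, and it simplifies by assigning all of ASCII to one charrep rather than treating low ASCII as neutral.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical charrep identifiers; the real RichEdit values differ.
enum class CharRep { Default, Ansi, HighLatin1, Greek, Russian, Hebrew, Arabic, Symbol };

struct Range {
    char32_t first;  // first code point in the range
    char32_t last;   // last code point in the range (inclusive)
    CharRep  rep;    // charrep nominally assigned to the range
};

// Sorted by 'first' and non-overlapping. An illustrative subset only.
constexpr std::array<Range, 9> kRanges{{
    {0x0020, 0x007F, CharRep::Ansi},
    {0x00A0, 0x00FF, CharRep::HighLatin1},
    {0x0370, 0x03FF, CharRep::Greek},
    {0x0400, 0x04FF, CharRep::Russian},
    {0x0590, 0x05FF, CharRep::Hebrew},
    {0x0600, 0x06FF, CharRep::Arabic},
    {0xE000, 0xF01F, CharRep::Default},  // Private Use Area
    {0xF020, 0xF0FF, CharRep::Symbol},   // aliases used by SYMBOL_CHARSET fonts
    {0xF100, 0xF8FF, CharRep::Default},  // rest of the Private Use Area
}};

CharRep CharRepFromChar(char32_t ch) {
    assert(ch <= 0x10FFFF);  // must be a valid Unicode code point
    // Binary search: find the first range starting after ch, then step back one.
    auto it = std::upper_bound(kRanges.begin(), kRanges.end(), ch,
        [](char32_t c, const Range& r) { return c < r.first; });
    if (it == kRanges.begin())
        return CharRep::Default;  // ch precedes every range in the table
    --it;
    // ch may fall in a gap between two ranges; treat that as the default.
    return ch <= it->last ? it->rep : CharRep::Default;
}
```

With this table, CharRepFromChar(0x03B1) yields the Greek charrep and CharRepFromChar(0xF041) yields the symbol charrep. A production table would also need context, such as whether the character is in a math zone.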
Determining what Charreps a Font Supports
Windows GDI has a handy structure known as the FONTSIGNATURE, which has bits claiming support for various code pages and Unicode ranges. Some fonts don’t have reliable values, so buyers beware. Nevertheless, it’s fast and useful, so RichEdit uses it to fill in a bit mask for supported character repertoires and has some back-up code to handle errant fonts. Some fonts claim to support a given character repertoire, but only support it partially. For example, Japanese fonts claim to support Greek and Cyrillic, but they only have glyphs for basic Greek and Cyrillic characters. RichEdit classifies other characters in these repertoires as “extended”, which means that they need “cmap” verification. The cmap is a TrueType font’s character-to-glyph map. If it returns 0 for a character, the character is missing and will display a missing-character glyph, usually an empty box. The cmap approach is valuable for other cases in which the FONTSIGNATURE may be inadequate. For example, a font may not claim to support Latin 1, but it nevertheless supports low and/or high ASCII. By checking the cmap for “0” and “a”, respectively, one can find out the amount of ASCII support available. This approach can also be useful for finding out about new Unicode ranges not represented in the FONTSIGNATURE, e.g., emoji.
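To make the two-level check concrete, here is a small portable model of it. Everything in it is an assumption made for illustration: FontModel stands in for a real font (a claimed-repertoire mask such as one derived from a FONTSIGNATURE, plus a map playing the role of the cmap), and kGreekBit, GlyphIndex, FontSupportsChar, and ProbeAscii are hypothetical names rather than RichEdit or GDI APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for a font.
struct FontModel {
    std::uint64_t claimedReps;                         // one bit per claimed charrep
    std::unordered_map<char32_t, std::uint16_t> cmap;  // code point -> glyph index
};

constexpr std::uint64_t kGreekBit = std::uint64_t{1} << 3;  // hypothetical bit position

// Glyph index for ch, or 0 (the missing-character glyph) if the font lacks it.
std::uint16_t GlyphIndex(const FontModel& font, char32_t ch) {
    auto it = font.cmap.find(ch);
    return it == font.cmap.end() ? 0 : it->second;
}

// Trust the claimed mask for basic characters; verify "extended" ones in the cmap.
bool FontSupportsChar(const FontModel& font, std::uint64_t repBit,
                      char32_t ch, bool extended) {
    assert(repBit != 0);
    if ((font.claimedReps & repBit) == 0)
        return false;                  // font does not claim this repertoire
    if (!extended)
        return true;                   // basic character: accept the claim
    return GlyphIndex(font, ch) != 0;  // extended character: check the cmap
}

struct AsciiSupport { bool low; bool high; };

// Probe '0' (low ASCII) and 'a' (high ASCII) as described in the text.
AsciiSupport ProbeAscii(const FontModel& font) {
    return { GlyphIndex(font, U'0') != 0, GlyphIndex(font, U'a') != 0 };
}
```

For a font that claims Greek but only maps U+03B1, FontSupportsChar returns true for U+03B1 and false for an extended character such as U+1F00. On Windows, the cmap lookup itself would come from the font rather than from a map like this one.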
Binding Fonts
Given the information in the two preceding sections, we need to ensure appropriate fonts are bound to characters. Let’s start with the basic algorithm and then consider some of the many fix-ups.
The simple algorithm is: assign a character flag (bit) to each character repertoire (charrep) and AND the resulting bit mask for a character against the bit mask for the current font. If a nonzero value results, the font claims to support the character’s charrep and the font can continue to be used. If the result is zero, font binding is needed.
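In code, the simple algorithm comes down to a single AND. The sketch below assumes a 64-bit mask and made-up charrep indices (kAnsi, kGreek, and so on); since the charrep list above has more than 64 entries, a real implementation would presumably need a wider mask or some other encoding.

```cpp
#include <cassert>
#include <cstdint>

using CharRepMask = std::uint64_t;

// Hypothetical charrep indices, for illustration only.
enum CharRepIndex : unsigned {
    kAnsi = 0, kGreek = 1, kRussian = 2, kShiftJis = 3, kDevanagari = 4
};

// One flag bit per charrep.
inline CharRepMask MaskFor(CharRepIndex rep) {
    assert(rep < 64);  // must fit in the 64-bit mask used by this sketch
    return CharRepMask{1} << rep;
}

// Font binding is needed when the font claims none of the character's charreps.
inline bool NeedsFontBinding(CharRepMask charMask, CharRepMask fontMask) {
    return (charMask & fontMask) == 0;
}
```

For a font whose mask is MaskFor(kAnsi) | MaskFor(kGreek), a Greek character gives a nonzero AND and keeps the current font, while a Devanagari character gives zero and triggers font binding.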
To keep the current user font choices when font binding, RichEdit scans the text runs backward from the insertion point looking for a font that supports the desired charrep. If one is found, it is used unless it was introduced by font binding. Otherwise, the default font for the charrep is used. The RichEdit client can change the default font for a charrep by using the EM_SETCHARFORMAT message with the SCF_ASSOCIATEFONT flag and an LCID (locale ID) that corresponds to the desired charrep.
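The backward scan can be pictured with the following sketch. The Run structure and ChooseFont function are hypothetical, and the sketch adopts one reading of the rule above: runs whose font was introduced by font binding are skipped and the scan continues, with the charrep's default font used if no user-chosen font qualifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical text run: a font plus the charreps that font claims to support.
struct Run {
    std::string   fontName;
    std::uint64_t fontMask;     // one bit per supported charrep
    bool          fromBinding;  // true if font binding, not the user, chose this font
};

// Scan runs [0, insertionRun) backward for a user-chosen font that supports
// repBit; if none is found, return the default font for the charrep.
std::string ChooseFont(const std::vector<Run>& runs, std::size_t insertionRun,
                       std::uint64_t repBit, const std::string& defaultFont) {
    assert(insertionRun <= runs.size());
    for (std::size_t i = insertionRun; i-- > 0; ) {
        const Run& run = runs[i];
        if ((run.fontMask & repBit) != 0 && !run.fromBinding)
            return run.fontName;  // keep the user's earlier choice
    }
    return defaultFont;
}
```

The point of preferring a user-chosen run is that text typed later in the same script picks up the font the user already selected for it, instead of whatever an earlier binding pass happened to pick.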
Special considerations apply to math zones and Chinese characters (among other scenarios). In math zones, a math font is used whenever it can handle the characters. This includes not only Unicode math symbols and math alphanumerics, but also Latin, Greek and Cyrillic text. Many Chinese characters are used in Japanese, but a Chinese font may not look pleasing to a Japanese person. In particular, the simplified Chinese fonts look quite different from the more traditional look of the same characters rendered with a Japanese font. So to bind appropriately, we scan forward and backward around an inserted Chinese character to see if any Hiragana or Katakana characters are present. If so, it’s most likely to be Japanese text and a Japanese font should be used. Similarly, if Hangul characters are found, a Korean font should be used. Other heuristics involve noting what the user locale is.
If even with its many heuristics RichEdit still cannot find a reasonable font and a Microsoft Office application is running, RichEdit queries the Office mso.dll if it’s loaded. Also, various kinds of “font fallback” may kick in at display time to save a character from being rendered as a missing-character glyph. Font binding is an area of active research since new character scripts continue to be added to Windows and Unicode continues to add new characters. In general, RichEdit’s built-in font binding does a good job of it, but it could be better.
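The kana/Hangul heuristic could look something like the sketch below. The names ClassifyHan and CjkChoice, the 32-character window, and the rule that the first kana or Hangul character found decides the result are all assumptions for illustration; the Unicode block ranges are the standard Hiragana, Katakana, Hangul Jamo, and Hangul Syllables blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

enum class CjkChoice { Japanese, Korean, Undetermined };

// Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF) blocks.
inline bool IsKana(char32_t ch) { return ch >= 0x3040 && ch <= 0x30FF; }

// Hangul Jamo (U+1100..U+11FF) and Hangul Syllables (U+AC00..U+D7A3).
inline bool IsHangul(char32_t ch) {
    return (ch >= 0x1100 && ch <= 0x11FF) || (ch >= 0xAC00 && ch <= 0xD7A3);
}

// Look at up to 'window' characters on each side of position 'pos' (where a
// Chinese character was inserted) for evidence of Japanese or Korean text.
CjkChoice ClassifyHan(const std::u32string& text, std::size_t pos,
                      std::size_t window = 32) {
    assert(pos <= text.size());
    const std::size_t begin = pos > window ? pos - window : 0;
    const std::size_t end   = std::min(text.size(), pos + window);
    for (std::size_t i = begin; i < end; ++i) {
        if (IsKana(text[i]))   return CjkChoice::Japanese;
        if (IsHangul(text[i])) return CjkChoice::Korean;
    }
    // No evidence either way; a caller might fall back to the user locale.
    return CjkChoice::Undetermined;
}
```

When the result is Undetermined, the other heuristics mentioned above, such as the user locale, would have to decide between Chinese, Japanese, and Korean fonts.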
DirectWrite font binding
Alternatively, RichEdit controls can call methods on the IProvideFontInfo interface to assign fonts to text runs. During RichEdit control initialization, the client may supply the IProvideFontInfo pointer via a callback to ITextHost::QueryInterface() as done by the XAML RichEditBox and TextBox controls. Or if the client has loaded the Microsoft Office shared library mso.dll, a D2D/DirectWrite RichEdit control will try to create an enhanced IProvideFontInfo object. The latter is a wrapper around the DirectWrite IDWriteFontFallback interface. In principle, this font binding handles all ISO scripts that Windows supports. But it doesn’t handle math-zone font binding.
The methods of IProvideFontInfo are:
- Get default font (the BSTR returned will be freed by the caller of this function)
virtual BSTR GetDefaultFont() = 0;
- Get font face ID to be used for new characters
virtual DWORD GetRunFontFaceId(
    _In_z_ const wchar_t* pCurrentFontName,          // Current font
    _In_ DWRITE_FONT_WEIGHT weight,                  // Bold, Extra Bold, ...
    _In_ DWRITE_FONT_STRETCH stretch,                // Condensed, expanded, ...
    _In_ DWRITE_FONT_STYLE style,                    // Italic, Oblique, Normal
    _In_ LCID lcid,                                  // Locale id
    _In_opt_count_(charCount) const wchar_t* pText,  // Input characters
    _In_ unsigned int charCount,                     // Character count of pText
    _In_ DWORD fontFaceIdCurrent,                    // Current font face Id
    _Out_ unsigned int& runCount) = 0;               // Character count for subset of pText covered
- Check if a different font should be used for new characters
virtual IDWriteFontFace* GetFontFace( _In_ DWORD fontFaceId) = 0; // Font ID
- Get name of a font face belonging to a font ID
virtual BSTR GetSerializableFontName( _In_ DWORD fontFaceId) = 0;
The GUID for IProvideFontInfo is 7502135B-17C1-4A25-BDC9-55E6BCB8598A.
The Office IProvideFontInfo instantiation includes two extra methods that belong to IProvideFontInfo2 (which inherits from IProvideFontInfo):
- Get Default Font face without saving a fontID
virtual HRESULT GetDefaultFontFace( IDWriteFontFace** pDwriteFontFace) = 0;
- Refresh font-face cache
virtual HRESULT RefreshFontFaceCache( const std::wstring& gdiName) = 0;
The GUID for IProvideFontInfo2 is F71EE023-E909-4F63-B569-EA08956D0004.
Hi, thanks for all your posts.
My question is not related to this one but to the “RichEdit HTML Support” one (comments are closed on this).
I’ve been experimenting with ITextRange2::SetText2(tomConvertHtml, “blah”) but it just returns E_NOTIMPL. If I debug I can see that when I enter msftedit’s SetText2 it checks the flags against the 0x7F828E82 value and returns with this error (since tomConvertHtml is not part of it).
So it looks like my version just doesn’t implement it. Could you tell me what version of Windows supports this? I’m running off the latest Windows 10.
filever on msftedit.dll says this:
Language 0x0409 (English (United States))
CharSet 0x04b0 Unicode
CompanyName Microsoft Corporation
FileDescription Rich Text Edit Control, v8.5
InternalName MsftEdit
OriginalFilenam MsftEdit.DLL.MUI
ProductName Microsoft® Windows® Operating System
ProductVersion 10.0.19041.1
FileVersion 10.0.19041.1 (WinBuild.160101.0800)
LegalCopyright © Microsoft Corporation. All rights reserved.
Fixed File Info (VS_FIXEDFILEINFO) for c:\Windows\System32\msftedit.dll
Signature: feef04bd
Struc Ver: 00010000
FileVer: 00060002:4a610315 (6.2:19041.789)
ProdVer: 000a0000:4a610315 (10.0:19041.789)
FlagMask: 0000003f
Flags: 00000000
OS: 00040004 NT Win32
FileType: 00000002 Dll
SubType: 00000000
FileDate: 00000000:00000000
Thanks!
PS: also official doc for ITextRange2::SetText2 has not been updated with this new flag either.
Many recent RichEdit enhancements only appear in the Microsoft Office riched20.dll. The RichEdit HTML reader uses the Office HTML parser so it isn’t likely to be ported to the msftedit.dll in the near future.
Ok, thanks for that answer… that’s what I thought. That’s sad.
Anyway, I have tried with Office riched20.dll but If I use SetText2(tomConvertHtml, ComBSTR(L”<html><body><p>hello world</p></body></html>”)) then now it kinda works (returns S_FALSE, not S_OK…), but nothing is rendered, and when I reread it using GetText2(tomConvertHtml) it gives back (with a leading CRLF)
In other words, it eats it but returns it as empty (and if I read the plain text it’s CR, and the Rtf text corresponds to the HTML).
So, is there any trick or any flag to add to tomConvertHtml (0x00900000)?
Thanks again!
In my unit tests for tomConvertHtml, I’ve used UTF-8 strings rather than UTF-16. You might want to allocate a BSTR with bytes rather than wchar_t’s. But it should work with either. Here’s an example with the equation 𝐸=𝑚𝑐² in a unit test with bytes:
const char szHtml5[] =
“”
“”
“body{font-family:Arial,sans-serif;font-size:10pt;}”
“.cf0{font-family:Segoe UI;}.cf1{font-weight:bold;font-family:Segoe UI;}.cf2{font-style:italic;font-family:Segoe UI;}.cf3{font-style:italic;text-decoration:underline;font-family:Cambria Math;color:#0000FF;}.cf4{font-style:italic;font-family:Cambria Math;}.cf5{font-style:italic;font-family:Cambria Math;}.cf6{font-family:Calibri;}.pf0{}”
“”
“
”
“this is ”
“bold ”
“and ”
“italic ”
“and a link ”
“MSW”
“ and an equation
”
“
”
“”
“E=m”
“c2
”
““;
UINT size = static_cast<UINT>(sizeof(szHtml5));
Resp::BStrHolder bstr(SysAllocStringByteLen(szHtml5, size));
TestAssert::HrSucceeded(psel->SetRange(0, tomForward));
TestAssert::HrSucceeded(psel->SetText2(tomConvertHtml, bstr));
Thanks, I’ve tried the UTF8 way too, but although SetText2 does work ok, when I reread it using GetText2, the returned HTML string doesn’t contain the initial string (other strings format like plain or RTF don’t either). It’s just like SetText2 doesn’t parse it at all.
Here is a small complete reproducing project: https://github.com/smourier/richedhtml and the source is here: https://github.com/smourier/richedhtml/blob/main/RichedHtml/RichedHtml.cpp#L426 maybe I’m not getting the selection right? or does it depend on some external / host / ambient settings?
PS: Yes, quite ironically, the editor here on devblogs has no respect for editor gurus…
Wow, the editor messed that up. The problem is that I’d have to escape every less-than and greater-than character so that the HTML editor would work. Hopefully you get the idea.