RichEdit Font Binding

Murray Sargent

July 25th, 2021

Suppose a user pastes some plain text into a document. In principle, the text can contain any Unicode character. That includes virtually all characters used in the current languages of the world along with many ancient scripts and a plethora of symbols, mathematical and otherwise, that don’t belong to any language. The question arises as to what font(s) to use for the pasted characters. In general, the same font cannot be used for all characters, since TrueType glyph indices are 16-bit numbers, which limits a font to 65536 glyphs. Meanwhile Unicode has over 140,000 named characters. Furthermore, even if a font could contain glyphs for all Unicode characters, it wouldn’t be able to render them all without compromises in quality. East Asian characters, for example, ideally have different baselines from Latin characters. This post describes the two ways RichEdit chooses fonts for characters not present in the active font. This process is called “font binding”.

The first section describes RichEdit character repertoires. The second section explains how a character is assigned to a character repertoire. The third section describes how to find out what character repertoires are supported by a font. The fourth section shows how these two kinds of information can be combined to bind fonts to characters in a context-dependent way. The fifth section describes the alternative font binding technology (IProvideFontInfo) used by the XAML RichEditBox and TextBox controls as well as Office applications that run RichEdit in the D2D/DirectWrite mode.

RichEdit Character Repertoires

RichEdit’s built-in font binding facility is an extension of the GDI CreateFont() functionality that ensures the created font matches a given charset. If the font named in the call supports the charset, then the font is used, but if not, GDI instantiates a font that does. Before Unicode became popular, charsets defined character encodings for character repertoires typically associated with language systems, like Western European languages and Japanese. As such they were used for two purposes: 1) to define the encodings, and 2) to define character repertoires supported by fonts. The GDI CreateFont charset functionality addresses the latter purpose. This facility, which is a kind of “font fallback”, is very handy, since it’s usually easy to choose the charsets for characters that have charsets. In contrast it’s harder to choose the correct languages for characters in general.

Charsets correspond to code pages. The Windows code pages are described here. For reference, the Windows charsets supported by RichEdit are

ANSI_CHARSET	EASTEUROPE_CHARSET	RUSSIAN_CHARSET
GREEK_CHARSET	TURKISH_CHARSET	HEBREW_CHARSET
ARABIC_CHARSET	BALTIC_CHARSET	VIETNAMESE_CHARSET
DEFAULT_CHARSET	SYMBOL_CHARSET	THAI_CHARSET
SHIFTJIS_CHARSET	GB2312_CHARSET	HANGUL_CHARSET
CHINESEBIG5_CHARSET	PC437_CHARSET	OEM_CHARSET
MAC_CHARSET

When Windows 2000 added support for Indic and several other Southeast Asian scripts, the decision was made not to add charsets for new scripts since it was clear that Unicode was the best way to represent characters on computers. Unfortunately, that decision limited GDI’s convenient font “fallback” mechanism to character repertoires that have charsets. RichEdit needed to generalize this usage of charset. Accordingly, we defined the charrep, a character repertoire index. It usually corresponds to an ISO script, but there are charreps with no corresponding ISO script and vice versa. In addition to the charrep for each Windows charset, there are charreps for

Armenian	Syriac	Thaana	Devanagari	Bengali
Gurmukhi	Gujarati	Oriya	Tamil	Telugu
Kannada	Malayalam	Sinhala	Lao	Tibetan
Myanmar	Georgian	Jamo	Ethiopic	Cherokee
Aboriginal	Ogham	Runic	Khmer	Mongolian
Braille	Yi	Math General	Math Alphanumeric	Limbu
Taile	Newtailu	Sylotinagr	Kharoshthi	Kayahli
Unicode symbol	Emoji	Glagolitic	Lisu	Vai
N’Ko	Osmanya	Phagspa	Gothic	Deseret
Tifinagh	Old italic	Old Turkic	Bopomofo	Cyrillic xb
Javanese	Olchiki	Sorasompeng	Buginese	Coptic
Meroitic	Enc. alpha-num	Brahmi	Carian	Cuneiform
Cypriot	Egyp. hieroglyph	Aramaic	Pahlavi	Parthian
Lycian	Lydian	Old Persian	Old Sarabian	Phoenician
Shavian	Ugaritic	Adlam	Osage

Most of these are described in The Unicode Standard. The charreps can be used for font binding and they are also used to improve performance by avoiding unnecessary text analysis. Ideally there would be charreps for all ISO scripts that Windows supports. The IProvideFontInfo font binding described in the last section of this post attempts to do just that.

Determining a Character’s Charrep

RichEdit does a kind of binary range search to find out which charrep a character nominally belongs to. ASCII is treated specially, since almost all fonts support “low ASCII”, the “neutral” range U+0020..U+003F. High ASCII, the range U+0040..U+007F is supported by most fonts as well. The ANSI_CHARSET includes ASCII and the Western European ANSI set contained in Windows code page 1252. For the range U+00A0..U+00FF, there’s a charrep named “high Latin 1”.

East Asian (CJK—Chinese Japanese Korean) fonts all support ASCII, but typically don’t support high Latin 1, since at least some of those code positions are used for lead bytes of double-byte character sets or for kana characters. Chinese characters are used in Japanese, in both simplified and traditional Chinese, and in Korean. The CJK fonts often have a lot of Unicode symbol characters (not SYMBOL_CHARSET discussed shortly), so a CJK charrep may be returned for those. An exception occurs in math zones, where a math charrep is preferred.

The default charrep is assigned to the Unicode Private Use Area characters, U+E000..U+F8FF, since these characters have no standardized scripts. Special attention is given to SYMBOL_CHARSET fonts, which don’t use Unicode: characters in the range U+F020..U+F0FF are assigned the SYMBOL_CHARSET charrep, since Microsoft TrueType SYMBOL_CHARSET fonts use those locations as aliases for U+0020..U+00FF. The charrep assignments are similar to some assignment models based on scripts, but note that natural language isn’t used. This is because in general it’s easier and more reliable to figure out a reasonable charrep for a character than a natural language for a character. A RichEdit client can find out what charrep is assigned to a range of text by calling ITextFont2::GetCharRep().
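To make the range search concrete, here is a minimal sketch of a sorted range table searched with std::upper_bound. The CharRep enum, the table contents, and the function name CharRepFromChar are all hypothetical, chosen for illustration; RichEdit's real table is far larger and its indices differ. Only the ASCII, high Latin 1, Private Use Area, and symbol-alias ranges are taken from the text above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical charrep identifiers for this sketch; RichEdit's real indices differ.
enum class CharRep { Default, LowAscii, HighAscii, HighLatin1, Greek, Cyrillic,
                     Hebrew, Arabic, Thai, Kana, Cjk, Hangul, Symbol };

struct Range {
    char32_t first;  // first code point of the range
    char32_t last;   // last code point of the range (inclusive)
    CharRep  rep;    // charrep nominally assigned to the range
};

// Sorted by `first` with no overlaps, so a binary search is valid.
constexpr std::array<Range, 14> kRanges{{
    {0x0020, 0x003F, CharRep::LowAscii},
    {0x0040, 0x007F, CharRep::HighAscii},
    {0x00A0, 0x00FF, CharRep::HighLatin1},
    {0x0370, 0x03FF, CharRep::Greek},
    {0x0400, 0x04FF, CharRep::Cyrillic},
    {0x0590, 0x05FF, CharRep::Hebrew},
    {0x0600, 0x06FF, CharRep::Arabic},
    {0x0E00, 0x0E7F, CharRep::Thai},
    {0x3040, 0x30FF, CharRep::Kana},
    {0x4E00, 0x9FFF, CharRep::Cjk},
    {0xAC00, 0xD7A3, CharRep::Hangul},
    {0xE000, 0xF01F, CharRep::Default},  // Private Use Area
    {0xF020, 0xF0FF, CharRep::Symbol},   // aliases used by SYMBOL_CHARSET fonts
    {0xF100, 0xF8FF, CharRep::Default},  // rest of the Private Use Area
}};

CharRep CharRepFromChar(char32_t ch) {
    assert(ch <= 0x10FFFF);  // must be a Unicode code point
    // Find the first range that starts after ch, then step back one range.
    auto it = std::upper_bound(kRanges.begin(), kRanges.end(), ch,
        [](char32_t c, const Range& r) { return c < r.first; });
    if (it == kRanges.begin()) return CharRep::Default;  // below every range
    --it;
    return ch <= it->last ? it->rep : CharRep::Default;  // gaps fall back to Default
}
```

A lookup costs O(log n) in the number of ranges, and characters that fall in a gap between ranges simply get the default charrep.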

Determining what Charreps a Font Supports

Windows GDI has a handy structure known as the FONTSIGNATURE, which has bits claiming support for various code pages and Unicode ranges. Some fonts don’t have reliable values, so buyers beware. Nevertheless, it’s fast and useful, so RichEdit uses it to fill in a bit mask for supported character repertoires and has some back-up code to handle errant fonts. Some fonts claim to support a given character repertoire, but only support it partially. For example Japanese fonts claim to support Greek and Cyrillic, but they only have glyphs for basic Greek and Cyrillic characters. RichEdit classifies other characters in these repertoires as “extended”, which means that they need “cmap” verification. The cmap is a TrueType font’s character-to-glyph map. If it returns 0 for a character, the character is missing and will display a missing-character glyph, usually an empty box. The cmap approach is valuable for other cases in which the FONTSIGNATURE may be inadequate. For example, a font may not claim to support Latin 1, but it nevertheless supports low and/or high ASCII. By checking the cmap for ‘0’ and ‘a’, respectively, one can find out the amount of ASCII support available. This approach can also be useful for finding out about new Unicode ranges not represented in the FONTSIGNATURE, e.g., emoji.
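The following sketch shows how a signature check and a cmap check might be combined. To stay portable it does not include windows.h: FontSignature mirrors the layout of the GDI FONTSIGNATURE structure, the kFs* constants carry the values of the corresponding FS_* code-page bits, and the Font type with its toy cmap is a hypothetical stand-in for a real font. None of the function names come from RichEdit.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Mirrors the layout of the GDI FONTSIGNATURE structure.
struct FontSignature {
    std::uint32_t fsUsb[4];  // Unicode subset bits
    std::uint32_t fsCsb[2];  // code-page bits
};

// Code-page bits in fsCsb[0]; the values match the FS_* constants in wingdi.h.
constexpr std::uint32_t kFsLatin1   = 0x00000001;
constexpr std::uint32_t kFsCyrillic = 0x00000004;
constexpr std::uint32_t kFsGreek    = 0x00000008;
constexpr std::uint32_t kFsJapanese = 0x00020000;

// Hypothetical stand-in for a font: its signature plus a toy cmap.
struct Font {
    FontSignature sig;
    std::unordered_map<char32_t, std::uint16_t> cmap;  // character -> glyph index
};

// Glyph index 0 means the character is missing, as in a TrueType cmap.
std::uint16_t GlyphIndex(const Font& font, char32_t ch) {
    auto it = font.cmap.find(ch);
    return it == font.cmap.end() ? 0 : it->second;
}

bool ClaimsCodePage(const Font& font, std::uint32_t fsBit) {
    return (font.sig.fsCsb[0] & fsBit) != 0;
}

// Trust the signature for basic characters, but verify "extended"
// characters against the cmap.
bool SupportsChar(const Font& font, char32_t ch, std::uint32_t fsBit, bool extended) {
    if (!ClaimsCodePage(font, fsBit)) return false;
    return !extended || GlyphIndex(font, ch) != 0;
}

enum AsciiSupport : unsigned { kNoAscii = 0, kLowAscii = 1, kHighAscii = 2 };

// Probe '0' for low ASCII and 'a' for high ASCII.
unsigned ProbeAscii(const Font& font) {
    unsigned support = kNoAscii;
    if (GlyphIndex(font, U'0') != 0) support |= kLowAscii;
    if (GlyphIndex(font, U'a') != 0) support |= kHighAscii;
    return support;
}
```

On Windows, the signature itself can be obtained with the GDI GetTextCharsetInfo() function, and glyph presence can be checked with GetGlyphIndicesW() using the GGI_MARK_NONEXISTING_GLYPHS flag.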

Binding Fonts

Given the information in the two preceding sections, we need to ensure appropriate fonts are bound to characters. Let’s start with the basic algorithm and then consider some of the many fix-ups.

The simple algorithm is: assign a character flag (bit) to each character repertoire (charrep) and AND the resulting bit mask for a character against the bit mask for the current font. If a nonzero value results, the font claims to support the character’s charrep and the font can continue to be used. If the result is zero, font binding is needed.
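In code, the test is a single AND. The sketch below uses hypothetical charrep indices and a 64-bit mask purely for illustration; an implementation with more than 64 charreps would need a wider mask.

```cpp
#include <cassert>
#include <cstdint>

using CharRepMask = std::uint64_t;

// Hypothetical charrep indices; RichEdit's real values differ.
enum CharRepIndex : unsigned { kAnsi = 0, kGreek = 1, kCyrillic = 2, kShiftJis = 3, kHangul = 4 };

// One flag bit per charrep.
constexpr CharRepMask FlagFor(CharRepIndex rep) { return CharRepMask{1} << rep; }

// A zero result means the current font does not claim the character's
// charrep, so font binding is needed.
constexpr bool NeedsFontBinding(CharRepMask charFlag, CharRepMask fontMask) {
    return (charFlag & fontMask) == 0;
}
```

For example, a font whose mask is FlagFor(kAnsi) | FlagFor(kGreek) keeps handling Greek text, while a Hangul character triggers font binding.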

To keep the current user font choices when font binding, RichEdit scans the text runs backward from the insertion point looking for a font that supports the desired charrep. If one is found, it is used unless it was introduced by font binding. Otherwise, the default font for the charrep is used. The RichEdit client can change the default font for a charrep by using the EM_SETCHARFORMAT message with the SCF_ASSOCIATEFONT flag and an LCID (locale ID) that corresponds to the desired charrep.
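One reading of this backward scan is sketched below. The Run structure, the fromBinding flag, and the ChooseFont function are hypothetical names, and the sketch assumes the scan stops at the first run whose font supports the charrep, falling back to the default font if that run was itself produced by font binding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Run {
    std::string   fontName;     // font applied to the run
    std::uint64_t fontMask;     // charreps the font claims to support
    bool          fromBinding;  // true if font binding chose this font
};

// Scans runs[0 .. insertionRun) backward, that is, the runs that precede the
// insertion point, looking for a user-chosen font that supports charFlag.
std::string ChooseFont(const std::vector<Run>& runs, std::size_t insertionRun,
                       std::uint64_t charFlag, const std::string& defaultFont) {
    assert(insertionRun <= runs.size());
    for (std::size_t i = insertionRun; i-- > 0; ) {
        const Run& run = runs[i];
        if ((run.fontMask & charFlag) == 0) continue;  // font lacks the charrep
        if (!run.fromBinding) return run.fontName;     // keep the user's choice
        break;  // nearest match came from font binding; use the default instead
    }
    return defaultFont;
}
```

Preferring a font the user picked earlier keeps the document's look consistent, while ignoring fonts that were only introduced by binding avoids propagating an earlier automatic choice.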

Special considerations apply to math zones and Chinese characters (among other scenarios). In math zones, a math font is used whenever it can handle the characters. This includes not only Unicode math symbols and math alphanumerics, but also Latin, Greek and Cyrillic text. Many Chinese characters are used in Japanese, but a Chinese font may not look pleasing to a Japanese person. In particular, the simplified Chinese fonts look quite different from the more traditional look of the same characters rendered with a Japanese font. So to bind appropriately, we scan forward and backward around an inserted Chinese character to see if any Hiragana or Katakana characters are present. If so, it’s most likely to be Japanese text and a Japanese font should be used. Similarly, if Hangul characters are found, a Korean font should be used. Other heuristics involve noting what the user locale is.
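The kana/Hangul scan can be sketched as follows. The function and type names, the 32-character window, and the nearest-neighbor-wins rule are assumptions made for illustration; the code point ranges are the Unicode Hiragana, Katakana, Hangul Jamo, and Hangul Syllables blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class CjkFont { Japanese, Korean, Default };

// Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF)
constexpr bool IsKana(char32_t ch) { return ch >= 0x3040 && ch <= 0x30FF; }

// Hangul Jamo (U+1100..U+11FF) and Hangul Syllables (U+AC00..U+D7A3)
constexpr bool IsHangul(char32_t ch) {
    return (ch >= 0x1100 && ch <= 0x11FF) || (ch >= 0xAC00 && ch <= 0xD7A3);
}

constexpr CjkFont Classify(char32_t ch) {
    return IsKana(ch) ? CjkFont::Japanese
         : IsHangul(ch) ? CjkFont::Korean
         : CjkFont::Default;
}

// Looks outward from the Chinese character at `pos`, backward then forward at
// each distance, and returns the font kind implied by the nearest kana or
// Hangul character. Default means no hint was found within the window.
CjkFont GuessCjkFont(const std::u32string& text, std::size_t pos, std::size_t window = 32) {
    assert(pos < text.size());
    for (std::size_t d = 1; d <= window; ++d) {
        if (d <= pos) {
            CjkFont f = Classify(text[pos - d]);
            if (f != CjkFont::Default) return f;
        }
        if (pos + d < text.size()) {
            CjkFont f = Classify(text[pos + d]);
            if (f != CjkFont::Default) return f;
        }
    }
    return CjkFont::Default;
}
```

When the result is Default, other heuristics such as the user locale would decide which font to use.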

If even with its many heuristics RichEdit still cannot find a reasonable font and a Microsoft Office application is running, RichEdit queries the Office mso.dll if it’s loaded. Also, various kinds of “font fallback” may kick in at display time to save a character from being rendered as a missing-character glyph. Font binding is an area of active research since new character scripts continue to be added to Windows and Unicode continues to add new characters. In general, RichEdit’s built-in font binding does a good job of it, but it could be better.

DirectWrite font binding

Alternatively, RichEdit controls can call methods on the IProvideFontInfo interface to assign fonts to text runs. During RichEdit control initialization, the client may supply the IProvideFontInfo pointer via a callback to ITextHost::QueryInterface() as done by the XAML RichEditBox and TextBox controls. Or if the client has loaded the Microsoft Office shared library mso.dll, a D2D/DirectWrite RichEdit control will try to create an enhanced IProvideFontInfo object. The latter is a wrapper around the DirectWrite IDWriteFontFallback interface. In principle, this font binding handles all ISO scripts that Windows supports. But it doesn’t handle math-zone font binding.

The methods of IProvideFontInfo are:

Get default font (the BSTR returned will be freed by the caller of this function)

    virtual BSTR GetDefaultFont() = 0;

Get font face ID to be used for new characters

    virtual DWORD GetRunFontFaceId(
        _In_z_ const wchar_t* pCurrentFontName,  // Current font
        _In_ DWRITE_FONT_WEIGHT weight,          // Bold, Extra Bold, ...
        _In_ DWRITE_FONT_STRETCH stretch,        // Condensed, expanded, ...
        _In_ DWRITE_FONT_STYLE style,            // Italic, Oblique, Normal
        _In_ LCID lcid,                          // Locale id
        _In_opt_count_(charCount) const wchar_t* pText, // Input characters
        _In_ unsigned int charCount,             // Character count of pText
        _In_ DWORD fontFaceIdCurrent,            // Current font face Id
        _Out_ unsigned int& runCount) = 0;       // Character count for subset of pText covered

Check if a different font should be used for new characters

    virtual IDWriteFontFace* GetFontFace(
        _In_ DWORD fontFaceId) = 0; // Font ID

Get name of a font face belonging to a font ID

    virtual BSTR GetSerializableFontName(
        _In_ DWORD fontFaceId) = 0;

The GUID for IProvideFontInfo is 7502135B-17C1-4A25-BDC9-55E6BCB8598A.
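Since GetRunFontFaceId() reports how many leading characters of pText are covered by the returned font face ID, a caller can presumably split a string into font runs by calling it repeatedly. The sketch below illustrates that loop with a reduced, hypothetical stand-in interface (FontInfoProvider) that keeps only the text, current-ID, and run-count parameters, so that it compiles without the Windows and DirectWrite headers. The toy provider and the loop details are assumptions and do not describe RichEdit's actual calling code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, reduced stand-in for IProvideFontInfo.
struct FontInfoProvider {
    virtual ~FontInfoProvider() = default;
    virtual std::uint32_t GetRunFontFaceId(const wchar_t* text, unsigned charCount,
                                           std::uint32_t currentId, unsigned& runCount) = 0;
};

struct FontRun {
    unsigned      start;   // offset of the run in the text
    unsigned      length;  // number of characters in the run
    std::uint32_t faceId;  // font face ID chosen for the run
};

std::vector<FontRun> SplitIntoFontRuns(FontInfoProvider& provider,
                                       const std::wstring& text, std::uint32_t currentId) {
    std::vector<FontRun> runs;
    const unsigned total = static_cast<unsigned>(text.size());
    unsigned offset = 0;
    while (offset < total) {
        unsigned runCount = 0;
        std::uint32_t id = provider.GetRunFontFaceId(text.data() + offset, total - offset,
                                                     currentId, runCount);
        // Defensive: a zero or oversized count would stall or overrun the loop.
        if (runCount == 0 || runCount > total - offset) runCount = total - offset;
        runs.push_back({offset, runCount, id});
        offset += runCount;
        currentId = id;  // assumption: the next call starts from the face just chosen
    }
    return runs;
}

// Toy provider for demonstration: face 1 covers ASCII, face 2 everything else.
struct ToyProvider : FontInfoProvider {
    std::uint32_t GetRunFontFaceId(const wchar_t* text, unsigned charCount,
                                   std::uint32_t, unsigned& runCount) override {
        const bool ascii = text[0] < 0x80;
        runCount = 1;
        while (runCount < charCount && (text[runCount] < 0x80) == ascii) ++runCount;
        return ascii ? 1u : 2u;
    }
};
```

With the toy provider, the string L"abc\u4E2D\u6587xy" splits into three runs: three ASCII characters, two Chinese characters, and two more ASCII characters.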

The Office IProvideFontInfo instantiation includes two extra methods that belong to IProvideFontInfo2 (which inherits from IProvideFontInfo):

Get Default Font face without saving a fontID

    virtual HRESULT GetDefaultFontFace(
        IDWriteFontFace** pDwriteFontFace) = 0;

Refresh font-face cache

    virtual HRESULT RefreshFontFaceCache(
        const std::wstring& gdiName) = 0;

The GUID for IProvideFontInfo2 is F71EE023-E909-4F63-B569-EA08956D0004.

Murray Sargent Principal Software Engineer, CXE (Office Shared)

6 comments


Simon Mourier August 15, 2021 12:04 am

Hi, thanks for all your posts 🙂

My question is not related to this one but to the “RichEdit HTML Support” one (comments are closed on this).

I’ve been experimenting with ITextRange2::SetText2(tomConvertHtml, “blah”) but it just returns E_NOTIMPL. If I debug I can see that when I enter msftedit’s SetText2 it checks the flags against the 0x7F828E82 value and returns with this error (since tomConvertHtml is not part of it).

So it looks like my version just doesn’t implement it. Could you tell me what version of Windows supports this? I’m running off the latest Windows 10.

filever on msftedit.dll says this:

Language 0x0409 (English (United States))
CharSet 0x04b0 Unicode
CompanyName Microsoft Corporation
FileDescription Rich Text Edit Control, v8.5
InternalName MsftEdit
OriginalFilenam MsftEdit.DLL.MUI
ProductName Microsoft® Windows® Operating System
ProductVersion 10.0.19041.1
FileVersion 10.0.19041.1 (WinBuild.160101.0800)
LegalCopyright © Microsoft Corporation. All rights reserved.

Fixed File Info (VS_FIXEDFILEINFO) for c:\Windows\System32\msftedit.dll
Signature: feef04bd
Struc Ver: 00010000
FileVer: 00060002:4a610315 (6.2:19041.789)
ProdVer: 000a0000:4a610315 (10.0:19041.789)
FlagMask: 0000003f
Flags: 00000000
OS: 00040004 NT Win32
FileType: 00000002 Dll
SubType: 00000000
FileDate: 00000000:00000000

Thanks!

PS: also official doc for ITextRange2::SetText2 has not been updated with this new flag either.
- Murray Sargent (Microsoft employee) August 15, 2021 2:12 pm
  
  Many recent RichEdit enhancements only appear in the Microsoft Office riched20.dll. The RichEdit HTML reader uses the Office HTML parser so it isn’t likely to be ported to the msftedit.dll in the near future.
  - Simon Mourier August 21, 2021 5:05 am
    Ok, thanks for that answer… that’s what I thought. That’s sad.
    
    Anyway, I have tried with Office riched20.dll, but if I use SetText2(tomConvertHtml, ComBSTR(L"<html><body><p>hello world</p></body></html>")) then now it kinda works (returns S_FALSE, not S_OK…), but nothing is rendered, and when I reread it using GetText2(tomConvertHtml) it gives back (with a leading CRLF)
    
    <html><head><style>body{font-family:Arial,sans-serif;font-size:10pt;}</style><style>.cf0{font-family:Calibri;font-size:9.7pt;background-color:#FFFFFF;}</style></head><body><p></p></body></html>
    
    In other words, it eats it but returns it as empty (and if I read the plain text it’s CR, and the Rtf text corresponds to the HTML).
    
    So, is there any trick or any flag to add to tomConvertHtml (0x00900000)?
    
    Thanks again!
    - Murray Sargent (Microsoft employee) August 21, 2021 12:28 pm
      
      In my unit tests for tomConvertHtml, I’ve used UTF-8 strings rather than UTF-16. You might want to allocate a BSTR with bytes rather than wchar_t’s. But it should work with either. Here’s an example with the equation 𝐸=𝑚𝑐² in a unit test with bytes:
      
      const char szHtml5[] =
      “”
      “”
      “body{font-family:Arial,sans-serif;font-size:10pt;}”
      “.cf0{font-family:Segoe UI;}.cf1{font-weight:bold;font-family:Segoe UI;}.cf2{font-style:italic;font-family:Segoe UI;}.cf3{font-style:italic;text-decoration:underline;font-family:Cambria Math;color:#0000FF;}.cf4{font-style:italic;font-family:Cambria Math;}.cf5{font-style:italic;font-family:Cambria Math;}.cf6{font-family:Calibri;}.pf0{}”
      “”
      “
      
      ”
      “this is ”
      “bold ”
      “and ”
      “italic ”
      “and a link ”
      “MSW”
      “ and an equation
      
      ”
      “
      
      ”
      “”
      “E=m”
      “c2
      
      ”
      ““;
      
      UINT size = static_cast<UINT>(sizeof(szHtml5));
      Resp::BStrHolder bstr(SysAllocStringByteLen(szHtml5, size));
      
      TestAssert::HrSucceeded(psel->SetRange(0, tomForward));
      TestAssert::HrSucceeded(psel->SetText2(tomConvertHtml, bstr));
      - Murray Sargent (Microsoft employee) August 21, 2021 12:30 pm
        
        Wow, the editor messed that up 🙁 The problem is that I’d have to escape every less-than and greater-than character so that the HTML editor would work. Hopefully you get the idea
      - Simon Mourier August 23, 2021 12:51 am
        
        Thanks, I’ve tried the UTF8 way too, but although SetText2 does work ok, when I reread it using GetText2, the returned HTML string doesn’t contain the initial string (other strings format like plain or RTF don’t either). It’s just like SetText2 doesn’t parse it at all.
        
        Here is a small complete reproducing project: https://github.com/smourier/richedhtml and the source is here: https://github.com/smourier/richedhtml/blob/main/RichedHtml/RichedHtml.cpp#L426 maybe I’m not getting the selection right? or does it depend on some external / host / ambient settings?
        
        PS: Yes, quite ironically, the editor here on devblogs has no respect for editor gurus… 🙂