July 25th, 2021

RichEdit Font Binding

Murray Sargent
Principal Software Engineer

Suppose a user pastes some plain text into a document. In principle, the text can contain any Unicode character. That includes virtually all characters used in the current languages of the world along with many ancient scripts and a plethora of symbols, mathematical and otherwise, that don’t belong to any language. The question arises as to what font(s) to use for the pasted characters. In general, the same font cannot be used for all characters, since TrueType glyph indices are 16-bit numbers, which limits a font to 65,536 glyphs. Meanwhile, Unicode has over 140,000 named characters. Furthermore, even if a font could contain glyphs for all Unicode characters, it wouldn’t be able to render them all without compromises in quality. East Asian characters, for example, ideally have different baselines from Latin characters. This post describes the two ways RichEdit chooses fonts for characters not present in the active font. This process is called “font binding”.

The first section describes RichEdit character repertoires. The second section explains how a character is assigned to a character repertoire. The third section describes how to find out what character repertoires are supported by a font. The fourth section shows how these two kinds of information can be combined to bind fonts to characters in a context-dependent way. The fifth section describes the alternative font binding technology (IProvideFontInfo) used by the XAML RichEditBox and TextBox controls as well as Office applications that run RichEdit in the D2D/DirectWrite mode.

RichEdit Character Repertoires

RichEdit’s built-in font binding facility is an extension of the GDI CreateFont() functionality that ensures the created font matches a given charset. If the font named in the call supports the charset, then the font is used, but if not, GDI instantiates a font that does. Before Unicode became popular, charsets defined character encodings for character repertoires typically associated with language systems, like Western European languages and Japanese. As such they were used for two purposes: 1) to define the encodings, and 2) to define character repertoires supported by fonts. The GDI CreateFont charset functionality addresses the latter purpose. This facility, which is a kind of “font fallback”, is very handy, since it’s usually easy to choose the charsets for characters that have charsets. In contrast it’s harder to choose the correct languages for characters in general.

Charsets correspond to code pages. The Windows code pages are described here. For reference, the Windows charsets supported by RichEdit are

ANSI_CHARSET EASTEUROPE_CHARSET RUSSIAN_CHARSET
GREEK_CHARSET TURKISH_CHARSET HEBREW_CHARSET
ARABIC_CHARSET BALTIC_CHARSET VIETNAMESE_CHARSET
DEFAULT_CHARSET SYMBOL_CHARSET THAI_CHARSET
SHIFTJIS_CHARSET GB2312_CHARSET HANGUL_CHARSET
CHINESEBIG5_CHARSET PC437_CHARSET OEM_CHARSET
MAC_CHARSET

When Windows 2000 added support for Indic and several other Southeast Asian scripts, the decision was made not to add charsets for new scripts since it was clear that Unicode was the best way to represent characters on computers. Unfortunately, that decision limited GDI’s convenient font “fallback” mechanism to character repertoires that have charsets. RichEdit needed to generalize this usage of charset. Accordingly, we defined the charrep, a character repertoire index. It usually corresponds to an ISO script, but there are charreps with no corresponding ISO script and vice versa. In addition to the charrep for each Windows charset, there are charreps for

Armenian, Syriac, Thaana, Devanagari, Bengali,
Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
Kannada, Malayalam, Sinhala, Lao, Tibetan,
Myanmar, Georgian, Jamo, Ethiopic, Cherokee,
Aboriginal, Ogham, Runic, Khmer, Mongolian,
Braille, Yi, Math General, Math Alphanumeric, Limbu,
Tai Le, New Tai Lue, Syloti Nagri, Kharoshthi, Kayah Li,
Unicode symbol, Emoji, Glagolitic, Lisu, Vai,
N’Ko, Osmanya, Phags-pa, Gothic, Deseret,
Tifinagh, Old Italic, Old Turkic, Bopomofo, Cyrillic xb,
Javanese, Ol Chiki, Sora Sompeng, Buginese, Coptic,
Meroitic, Enc. alpha-num, Brahmi, Carian, Cuneiform,
Cypriot, Egyp. hieroglyph, Aramaic, Pahlavi, Parthian,
Lycian, Lydian, Old Persian, Old South Arabian, Phoenician,
Shavian, Ugaritic, Adlam, Osage

Most of these are described in The Unicode Standard. The charreps can be used for font binding and they are also used to improve performance by avoiding unnecessary text analysis. Ideally there would be charreps for all ISO scripts that Windows supports. The IProvideFontInfo font binding described in the last section of this post attempts to do just that.

Determining a Character’s Charrep

RichEdit does a kind of binary range search to find out which charrep a character nominally belongs to. ASCII is treated specially, since almost all fonts support “low ASCII”, the “neutral” range U+0020..U+003F. High ASCII, the range U+0040..U+007F is supported by most fonts as well. The ANSI_CHARSET includes ASCII and the Western European ANSI set contained in Windows code page 1252. For the range U+00A0..U+00FF, there’s a charrep named “high Latin 1”.
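
As a rough illustration of this kind of lookup, here is a minimal sketch that handles ASCII up front and then binary-searches a sorted table of code-point ranges. The table is a tiny subset built from standard Unicode block boundaries, and the names `CharRep`, `CharRepRange`, `kCharRepRanges` and `CharRepFromChar` are hypothetical. RichEdit’s actual tables and charrep values aren’t shown in this post.

```cpp
#include <cstddef>

// Hypothetical charrep indices for this sketch; RichEdit's internal values may differ.
enum class CharRep { Default, LowAscii, HighAscii, HighLatin1, Greek, Russian, Hebrew, Arabic, Thai };

struct CharRepRange {
    char32_t first;  // first code point of the range (inclusive)
    char32_t last;   // last code point of the range (inclusive)
    CharRep  rep;    // charrep nominally assigned to the range
};

// Small illustrative subset; must stay sorted by `first` with no overlaps.
constexpr CharRepRange kCharRepRanges[] = {
    {0x00A0, 0x00FF, CharRep::HighLatin1},
    {0x0370, 0x03FF, CharRep::Greek},
    {0x0400, 0x04FF, CharRep::Russian},
    {0x0590, 0x05FF, CharRep::Hebrew},
    {0x0600, 0x06FF, CharRep::Arabic},
    {0x0E00, 0x0E7F, CharRep::Thai},
};

constexpr CharRep CharRepFromChar(char32_t ch) {
    // ASCII is handled before any searching.
    if (ch < 0x80) {
        if (ch < 0x20) return CharRep::Default;  // control characters: not classified in this sketch
        return ch < 0x40 ? CharRep::LowAscii : CharRep::HighAscii;
    }
    // Binary search over the sorted range table.
    std::size_t lo = 0;
    std::size_t hi = sizeof(kCharRepRanges) / sizeof(kCharRepRanges[0]);
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (ch < kCharRepRanges[mid].first)      hi = mid;
        else if (ch > kCharRepRanges[mid].last)  lo = mid + 1;
        else                                     return kCharRepRanges[mid].rep;
    }
    return CharRep::Default;  // not covered by this small table
}
```

With this table, `CharRepFromChar(U'0')` gives the low-ASCII charrep and `CharRepFromChar(U'\u03B1')` gives Greek.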

East Asian (CJK—Chinese Japanese Korean) fonts all support ASCII, but typically don’t support high Latin 1, since at least some of those code positions are used for lead bytes of double-byte character sets or for kana characters. Chinese characters are used in Japanese, in both simplified and traditional Chinese, and in Korean. The CJK fonts often have a lot of Unicode symbol characters (not SYMBOL_CHARSET discussed shortly), so a CJK charrep may be returned for those. An exception occurs in math zones, where a math charrep is preferred.

The default charrep is assigned to the Unicode Private Use Area characters, U+E000..U+F8FF, since these characters have no standardized scripts. Special attention is given to SYMBOL_CHARSET fonts, which don’t use Unicode. As an exception to the Private Use Area rule, characters in the range U+F020..U+F0FF are assigned the SYMBOL_CHARSET charrep, since Microsoft TrueType SYMBOL_CHARSET fonts use those locations as aliases for U+0020..U+00FF. The charrep assignments are similar to some assignment models based on scripts. But note that natural language isn’t used. This is because in general it’s easier and more reliable to figure out a reasonable charrep for a character than a natural language for a character. A RichEdit client can find out what charrep is assigned to a range of text by calling ITextFont2::GetCharRep().
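
The Private Use Area rules can be written down in a few lines. This is only a sketch of the ranges quoted above; `PuaKind`, `ClassifyPrivateUse` and `SymbolAliasToBase` are hypothetical names, not RichEdit functions.

```cpp
// Hypothetical classification of Private Use Area code points.
enum class PuaKind { NotPrivateUse, DefaultCharRep, SymbolCharRep };

constexpr PuaKind ClassifyPrivateUse(char32_t ch) {
    if (ch < 0xE000 || ch > 0xF8FF) return PuaKind::NotPrivateUse;
    // Symbol-font alias range inside the Private Use Area.
    if (ch >= 0xF020 && ch <= 0xF0FF) return PuaKind::SymbolCharRep;
    return PuaKind::DefaultCharRep;
}

// Maps a symbol-font alias in U+F020..U+F0FF back to U+0020..U+00FF.
// Code points outside the alias range are returned unchanged.
constexpr char32_t SymbolAliasToBase(char32_t ch) {
    return (ch >= 0xF020 && ch <= 0xF0FF) ? static_cast<char32_t>(ch - 0xF000) : ch;
}
```

For example, U+F0B7 classifies as the symbol charrep and maps back to U+00B7.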

Determining what Charreps a Font Supports

Windows GDI has a handy structure known as the FONTSIGNATURE, which has bits claiming support for various code pages and Unicode ranges. Some fonts don’t have reliable values, so buyer beware. Nevertheless, it’s fast and useful, so RichEdit uses it to fill in a bit mask for supported character repertoires and has some back-up code to handle errant fonts. Some fonts claim to support a given character repertoire, but only support it partially. For example, Japanese fonts claim to support Greek and Cyrillic, but they only have glyphs for basic Greek and Cyrillic characters. RichEdit classifies other characters in these repertoires as “extended”, which means that they need “cmap” verification. The cmap is a TrueType font’s character-to-glyph map. If it returns 0 for a character, the character is missing and will display a missing-character glyph, usually an empty box. The cmap approach is valuable for other cases in which the FONTSIGNATURE may be inadequate. For example, a font may not claim to support Latin 1, but it nevertheless supports low and/or high ASCII. By checking the cmap for ‘0’ and ‘a’, respectively, one can find out the amount of ASCII support available. This approach can also be useful for finding out about new Unicode ranges not represented in the FONTSIGNATURE, e.g., emoji.
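
To make the “claimed mask plus cmap verification” idea concrete, here is a portable sketch that uses a toy font in place of GDI. `ToyFont`, `GlyphIndex`, `kGreekRepBit` and the choice of the Greek Extended block (U+1F00..U+1FFF) as the “extended” set are all assumptions for illustration; a real implementation would fill the mask from FONTSIGNATURE and query the font’s actual cmap.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for a font: a claimed-support mask (as would be derived from
// FONTSIGNATURE) plus a toy cmap listing the characters that have glyphs.
struct ToyFont {
    std::uint64_t   claimedReps;  // one bit per charrep the font claims to support
    const char32_t* cmapChars;    // characters actually present in the cmap
    std::size_t     cmapCount;
};

// Toy cmap lookup: nonzero glyph index if present, 0 if the glyph is missing.
constexpr std::uint16_t GlyphIndex(const ToyFont& font, char32_t ch) {
    for (std::size_t i = 0; i < font.cmapCount; ++i) {
        if (font.cmapChars[i] == ch) return static_cast<std::uint16_t>(i + 1);
    }
    return 0;
}

constexpr std::uint64_t kGreekRepBit = std::uint64_t{1} << 3;  // hypothetical bit position

// Assumption: treat the Greek Extended block as "extended" Greek.
constexpr bool IsExtendedGreek(char32_t ch) { return ch >= 0x1F00 && ch <= 0x1FFF; }

constexpr bool FontSupportsGreekChar(const ToyFont& font, char32_t ch) {
    if ((font.claimedReps & kGreekRepBit) == 0) return false;  // font makes no Greek claim
    if (IsExtendedGreek(ch)) return GlyphIndex(font, ch) != 0;  // claim needs cmap verification
    return true;                                                // basic characters: trust the claim
}

// Probe ASCII support directly, whatever the font claims about Latin 1.
constexpr bool HasLowAscii(const ToyFont& font)  { return GlyphIndex(font, U'0') != 0; }
constexpr bool HasHighAscii(const ToyFont& font) { return GlyphIndex(font, U'a') != 0; }

// Example: a font that claims Greek but only has a basic Greek glyph.
constexpr char32_t kToyCmap[] = {U'0', U'a', U'\u03B1'};
constexpr ToyFont  kToyFont{kGreekRepBit, kToyCmap, 3};
```

Here `FontSupportsGreekChar(kToyFont, U'\u03B1')` is true on the strength of the claim alone, while U+1F00 fails the cmap check.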

Binding Fonts

Given the information in the two preceding sections, we need to ensure appropriate fonts are bound to characters. Let’s start with the basic algorithm and then consider some of the many fix-ups.

The simple algorithm is: assign a character flag (bit) to each character repertoire (charrep) and AND the resulting bit mask for a character against the bit mask for the current font. If a nonzero value results, the font claims to support the character’s charrep and the font can continue to be used. If the result is zero, font binding is needed.
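
In code, the basic test is a single AND. The sketch below uses hypothetical bit positions and a 64-bit mask. The list of charreps earlier in this post has more than 64 entries, so a real mask would need more bits than this.

```cpp
#include <cstdint>

using CharRepMask = std::uint64_t;

// One flag bit per charrep; the index values used by callers here are hypothetical.
constexpr CharRepMask CharRepBit(unsigned charRepIndex) {
    return CharRepMask{1} << charRepIndex;
}

// True when the current font does not claim the character's charrep,
// i.e. when the AND of the two masks is zero.
constexpr bool NeedsFontBinding(CharRepMask charMask, CharRepMask fontMask) {
    return (charMask & fontMask) == 0;
}
```

For a font whose mask is `CharRepBit(0) | CharRepBit(3)`, a character with mask `CharRepBit(3)` keeps the current font, while one with `CharRepBit(5)` triggers font binding.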

To keep the current user font choices when font binding, RichEdit scans the text runs backward from the insertion point looking for a font that supports the desired charrep. If one is found, it is used unless it was introduced by font binding. Otherwise, the default font for the charrep is used. The RichEdit client can change the default font for a charrep by using the EM_SETCHARFORMAT message with the SCF_ASSOCIATEFONT flag and an LCID (locale ID) that corresponds to the desired charrep.
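
The backward scan might look something like the following sketch. `TextRun`, `FindUserFontRun` and the bit constants are hypothetical, and the sketch makes one assumption about the text above: the scan stops at the first run whose font supports the charrep, and if that font came from font binding, the charrep’s default font is used instead.

```cpp
#include <cstdint>

// Hypothetical per-run information for this sketch.
struct TextRun {
    std::uint64_t fontReps;         // charrep bits supported by the run's font
    bool          fromFontBinding;  // true if font binding, not the user, chose the font
};

constexpr int kUseDefaultFont = -1;  // sentinel: fall back to the charrep's default font

// Scans backward from run `insertionRun` for a font that supports `charRepBit`.
// Returns the index of the run whose font should be reused, or kUseDefaultFont.
constexpr int FindUserFontRun(const TextRun* runs, int insertionRun, std::uint64_t charRepBit) {
    for (int i = insertionRun; i >= 0; --i) {
        if (runs[i].fontReps & charRepBit) {
            // Assumption: a font introduced by font binding is not reused.
            return runs[i].fromFontBinding ? kUseDefaultFont : i;
        }
    }
    return kUseDefaultFont;
}

// Example with made-up charrep bits.
constexpr std::uint64_t kLatinBit = 1, kGreekBit = 2, kHebrewBit = 4, kArabicBit = 8;
constexpr TextRun kExampleRuns[] = {
    {kLatinBit | kGreekBit, false},  // run 0: user-chosen font with Greek support
    {kLatinBit,             false},  // run 1: user-chosen Latin-only font
    {kHebrewBit,            true},   // run 2: font introduced by font binding
};
```

Starting from run 2, a Greek character reuses the font of run 0, while Hebrew and Arabic characters both fall back to the default font for their charrep.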

Special considerations apply to math zones and Chinese characters (among other scenarios). In math zones, a math font is used whenever it can handle the characters. This includes not only Unicode math symbols and math alphanumerics, but also Latin, Greek and Cyrillic text. Many Chinese characters are used in Japanese, but a Chinese font may not look pleasing to a Japanese person. In particular, the simplified Chinese fonts look quite different from the more traditional look of the same characters rendered with a Japanese font. So to bind appropriately, we scan forward and backward around an inserted Chinese character to see if any Hiragana or Katakana characters are present. If so, it’s most likely to be Japanese text and a Japanese font should be used. Similarly, if Hangul characters are found, a Korean font should be used. Other heuristics involve noting what the user locale is.
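
The kana/Hangul check can be sketched as a scan over a window of characters around the inserted Chinese character. The window size, the `CjkFontChoice` enum and the function names are hypothetical; the ranges are the Unicode Hiragana, Katakana, Hangul Jamo and Hangul Syllables blocks. The sketch returns on the first kana or Hangul character it meets in the window, which is one simple policy among several possible ones.

```cpp
#include <cstddef>

enum class CjkFontChoice { Japanese, Korean, UseOtherHeuristics };

// Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF) blocks.
constexpr bool IsKana(char32_t ch) { return ch >= 0x3040 && ch <= 0x30FF; }

// Hangul Jamo (U+1100..U+11FF) and Hangul Syllables (U+AC00..U+D7A3).
constexpr bool IsHangul(char32_t ch) {
    return (ch >= 0x1100 && ch <= 0x11FF) || (ch >= 0xAC00 && ch <= 0xD7A3);
}

// Looks at up to `window` characters on each side of text[pos].
constexpr CjkFontChoice GuessCjkFont(const char32_t* text, std::size_t length,
                                     std::size_t pos, std::size_t window) {
    if (pos >= length) return CjkFontChoice::UseOtherHeuristics;
    const std::size_t begin = pos > window ? pos - window : 0;
    const std::size_t end   = (length - pos > window) ? pos + window + 1 : length;
    for (std::size_t i = begin; i < end; ++i) {
        if (IsKana(text[i]))   return CjkFontChoice::Japanese;
        if (IsHangul(text[i])) return CjkFontChoice::Korean;
    }
    // No kana or Hangul nearby: fall back to other heuristics such as the user locale.
    return CjkFontChoice::UseOtherHeuristics;
}
```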

If even with its many heuristics RichEdit still cannot find a reasonable font and a Microsoft Office application is running, RichEdit queries the Office mso.dll if it’s loaded. Also, various kinds of “font fallback” may kick in at display time to save a character from being rendered as a missing-character glyph. Font binding is an area of active research since new character scripts continue to be added to Windows and Unicode continues to add new characters. In general, RichEdit’s built-in font binding does a good job of it, but it could be better.

DirectWrite Font Binding

Alternatively, RichEdit controls can call methods on the IProvideFontInfo interface to assign fonts to text runs. During RichEdit control initialization, the client may supply the IProvideFontInfo pointer via a callback to ITextHost::QueryInterface() as done by the XAML RichEditBox and TextBox controls. Or if the client has loaded the Microsoft Office shared library mso.dll, a D2D/DirectWrite RichEdit control will try to create an enhanced IProvideFontInfo object. The latter is a wrapper around the DirectWrite IDWriteFontFallback interface. In principle, this font binding handles all ISO scripts that Windows supports. But it doesn’t handle math-zone font binding.

The methods of IProvideFontInfo are:

  • Get default font (the BSTR returned will be freed by the caller of this function)
    virtual BSTR GetDefaultFont() = 0;
  • Get font face ID to be used for new characters
    virtual DWORD GetRunFontFaceId(
        _In_z_ const wchar_t* pCurrentFontName,         // Current font
        _In_ DWRITE_FONT_WEIGHT weight,                 // Bold, Extra Bold, ...
        _In_ DWRITE_FONT_STRETCH stretch,               // Condensed, expanded, ...
        _In_ DWRITE_FONT_STYLE style,                   // Italic, Oblique, Normal
        _In_ LCID lcid,                                 // Locale id
        _In_opt_count_(charCount) const wchar_t* pText, // Input characters
        _In_ unsigned int charCount,                    // Character count of pText
        _In_ DWORD fontFaceIdCurrent,                   // Current font face Id
        _Out_ unsigned int& runCount) = 0;              // Character count for subset of pText covered
  • Check if a different font should be used for new characters
    virtual IDWriteFontFace* GetFontFace(
        _In_ DWORD fontFaceId) = 0; // Font ID
  • Get name of a font face belonging to a font ID
    virtual BSTR GetSerializableFontName(
        _In_ DWORD fontFaceId) = 0;

The GUID for IProvideFontInfo is 7502135B-17C1-4A25-BDC9-55E6BCB8598A.
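
To show how the runCount output of GetRunFontFaceId() might be consumed, here is a portable sketch that splits text into font runs by calling a provider repeatedly. `ToyFontProvider` is a made-up stand-in with a much-reduced signature (no font name, weight, stretch, style or LCID) and a toy rule that gives ASCII one face ID and everything else another. The looping pattern is an assumption based on the parameter comments above, not RichEdit’s actual code.

```cpp
// Simplified, hypothetical stand-in for an IProvideFontInfo implementation.
struct ToyFontProvider {
    // Returns a font face ID for the start of `text` and sets `runCount` to the
    // number of leading characters that ID covers.
    // Toy rule: face ID 1 for ASCII characters, face ID 2 for everything else.
    constexpr unsigned GetRunFontFaceId(const char16_t* text, unsigned charCount,
                                        unsigned fontFaceIdCurrent, unsigned& runCount) const {
        if (charCount == 0) { runCount = 0; return fontFaceIdCurrent; }
        const bool ascii = text[0] < 0x80;
        unsigned n = 1;
        while (n < charCount && (text[n] < 0x80) == ascii) ++n;
        runCount = n;
        return ascii ? 1u : 2u;
    }
};

struct FontRun { unsigned start; unsigned length; unsigned fontFaceId; };

struct FontRuns {
    static constexpr unsigned kMaxRuns = 8;  // fixed capacity keeps the sketch simple
    FontRun  runs[kMaxRuns] = {};
    unsigned count = 0;
};

// Calls the provider until the whole text is covered (or capacity runs out),
// advancing by the run length the provider reports each time.
constexpr FontRuns SplitIntoFontRuns(const ToyFontProvider& provider, const char16_t* text,
                                     unsigned charCount, unsigned startingFontFaceId) {
    FontRuns result{};
    unsigned pos = 0;
    unsigned currentId = startingFontFaceId;
    while (pos < charCount && result.count < FontRuns::kMaxRuns) {
        unsigned runCount = 0;
        currentId = provider.GetRunFontFaceId(text + pos, charCount - pos, currentId, runCount);
        if (runCount == 0) break;  // defensive: never loop without making progress
        result.runs[result.count++] = FontRun{pos, runCount, currentId};
        pos += runCount;
    }
    return result;
}
```

With the toy provider, the five-character string u"ab\u03B1\u03B2c" splits into three runs: two ASCII characters, two Greek characters, and one ASCII character.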

The Office IProvideFontInfo instantiation includes two extra methods that belong to IProvideFontInfo2 (which inherits from IProvideFontInfo):

  • Get Default Font face without saving a fontID
    virtual HRESULT GetDefaultFontFace(
        IDWriteFontFace** pDwriteFontFace) = 0;
  • Refresh font-face cache
    virtual HRESULT RefreshFontFaceCache(
        const std::wstring& gdiName) = 0;

The GUID for IProvideFontInfo2 is F71EE023-E909-4F63-B569-EA08956D0004.

Author

Yale BS, MS, PhD in theoretical physics. Worked 22 years in laser theory & applications first at Bell Labs and then Professor of Optical Sciences, University of Arizona. Worked on technical word processing, writing the first math display program (1969) and the technical word processor PS (1980s). Developed the SST debugger we used to get Windows 2.0 running in protected mode thereby eliminating the 640KB DOS barrier (1988). Have more than 100 refereed publications, 3 laser-physics books, 4 ...

6 comments

  • Simon Mourier

    Hi, thanks for all your posts :-)

    My question is not related to this one but to the "RichEdit HTML Support" one (comments are closed on this).

    I've been experimenting with ITextRange2::SetText2(tomConvertHtml, "blah") but it just returns E_NOTIMPL. If I debug I can see that when I enter msftedit's SetText2 it checks the flags against the 0x7F828E82 value and returns with this error (since tomConvertHtml is not part of it).

    So it looks my version just doesn't implements...

    • Murray Sargent (Microsoft employee, Author)

      Many recent RichEdit enhancements only appear in the Microsoft Office riched20.dll. The RichEdit HTML reader uses the Office HTML parser so it isn’t likely to be ported to the msftedit.dll in the near future.

      • Simon Mourier · Edited

        Ok, thanks for that answer... that's what I thought. That's sad.

        Anyway, I have tried with Office riched20.dll but If I use SetText2(tomConvertHtml, ComBSTR(L"<html><body><p>hello world</p></body></html>")) then now it kinda works (returns S_FALSE, not S_OK...), but nothing is rendered, and when I reread it using GetText2(tomConvertHtml) it gives back (with a leading CRLF)

        <code>

        In other words, it eats it but returns it as empty (and if I read the plain text it's CR, and the Rtf text...

      • Murray Sargent (Microsoft employee, Author)

        In my unit tests for tomConvertHtml, I've used UTF-8 strings rather than UTF-16. You might want to allocate a BSTR with bytes rather than wchar_t's. But it should work with either. Here's an example with the equation 𝐸=𝑚𝑐² in a unit test with bytes:

        const char szHtml5[] =
        ""
        ""
        "body{font-family:Arial,sans-serif;font-size:10pt;}"
        ".cf0{font-family:Segoe UI;}.cf1{font-weight:bold;font-family:Segoe UI;}.cf2{font-style:italic;font-family:Segoe UI;}.cf3{font-style:italic;text-decoration:underline;font-family:Cambria Math;color:#0000FF;}.cf4{font-style:italic;font-family:Cambria Math;}.cf5{font-style:italic;font-family:Cambria Math;}.cf6{font-family:Calibri;}.pf0{}"
        ""
        ""
        "this is "
        ...

      • Simon Mourier · Edited

        Thanks, I've tried the UTF8 way too, but although SetText2 does work ok, when I reread it using GetText2, the returned HTML string doesn't contain the initial string (other strings format like plain or RTF don't either). It's just like SetText2 doesn't parse it at all.

        Here is a small complete reproducing project: https://github.com/smourier/richedhtml and the source is here: https://github.com/smourier/richedhtml/blob/main/RichedHtml/RichedHtml.cpp#L426 maybe I'm not getting the selection right? or does it depend on some external / host...

      • Murray Sargent (Microsoft employee, Author) · Edited

        Wow, the editor messed that up 🙁 The problem is that I’d have to escape every less-than and greater-than character so that the HTML editor would work. Hopefully you get the idea