RichEdit Font Binding
Suppose a user pastes some plain text into a document. In principle, the text can contain any Unicode character. That includes virtually all characters used in the current languages of the world along with many ancient scripts and a plethora of symbols, mathematical and otherwise, that don’t belong to any language. The question arises as to what font(s) to use for the pasted characters. In general, the same font cannot be used for all characters, since TrueType glyph indices are 16-bit numbers, which limits a font to 65,536 glyphs, while Unicode has over 140,000 named characters. Furthermore, even if a font could contain glyphs for all Unicode characters, it wouldn’t be able to render them all without compromises in quality. East Asian characters, for example, ideally have different baselines from Latin characters. This post describes the two ways RichEdit chooses fonts for characters not present in the active font. This process is called “font binding”.
The first section describes RichEdit character repertoires. The second section explains how a character is assigned to a character repertoire. The third section describes how to find out what character repertoires are supported by a font. The fourth section shows how these two kinds of information can be combined to bind fonts to characters in a context-dependent way. The fifth section describes the alternative font binding technology (IProvideFontInfo) used by the XAML RichEditBox and TextBox controls as well as Office applications that run RichEdit in the D2D/DirectWrite mode.
RichEdit Character Repertoires
RichEdit’s built-in font binding facility is an extension of the GDI CreateFont() functionality that ensures the created font matches a given charset. If the font named in the call supports the charset, then the font is used, but if not, GDI instantiates a font that does. Before Unicode became popular, charsets defined character encodings for character repertoires typically associated with language systems, like Western European languages and Japanese. As such they were used for two purposes: 1) to define the encodings, and 2) to define character repertoires supported by fonts. The GDI CreateFont charset functionality addresses the latter purpose. This facility, which is a kind of “font fallback”, is very handy, since it’s usually easy to choose the charsets for characters that have charsets. In contrast it’s harder to choose the correct languages for characters in general.
Charsets correspond to code pages. The Windows code pages are described here. For reference, the Windows charsets supported by RichEdit are
When Windows 2000 added support for Indic and several other Southeast Asian scripts, the decision was made not to add charsets for new scripts since it was clear that Unicode was the best way to represent characters on computers. Unfortunately, that decision limited GDI’s convenient font “fallback” mechanism to character repertoires that have charsets. RichEdit needed to generalize this usage of charset. Accordingly, we defined the charrep, a character repertoire index. It usually corresponds to an ISO script, but there are charreps with no corresponding ISO script and vice versa. In addition to the charrep for each Windows charset, there are charreps for
| Braille | Yi | Math General | Math Alphanumeric | Limbu |
| Tifinagh | Old Italic | Old Turkic | Bopomofo | Cyrillic xb |
| Lycian | Lydian | Old Persian | Old Sarabian | Phoenician |
Most of these are described in The Unicode Standard. The charreps can be used for font binding and they are also used to improve performance by avoiding unnecessary text analysis. Ideally there would be charreps for all ISO scripts that Windows supports. The IProvideFontInfo font binding described in the last section of this post attempts to do just that.
Determining a Character’s Charrep
RichEdit does a kind of binary range search to find out which charrep a character nominally belongs to. ASCII is treated specially, since almost all fonts support “low ASCII”, the “neutral” range U+0020..U+003F. High ASCII, the range U+0040..U+007F is supported by most fonts as well. The ANSI_CHARSET includes ASCII and the Western European ANSI set contained in Windows code page 1252. For the range U+00A0..U+00FF, there’s a charrep named “high Latin 1”.
East Asian (CJK—Chinese Japanese Korean) fonts all support ASCII, but typically don’t support high Latin 1, since at least some of those code positions are used for lead bytes of double-byte character sets or for kana characters. Chinese characters are used in Japanese, in both simplified and traditional Chinese, and in Korean. The CJK fonts often have a lot of Unicode symbol characters (not SYMBOL_CHARSET discussed shortly), so a CJK charrep may be returned for those. An exception occurs in math zones, where a math charrep is preferred.
The default charrep is assigned to the Unicode Private Use Area characters, U+E000..U+F8FF, since these characters have no standardized scripts. Special attention is given to SYMBOL_CHARSET fonts, which don’t use Unicode. Characters in the range U+F020..U+F0FF are the exception within the Private Use Area: they are assigned the SYMBOL_CHARSET charrep, since Microsoft TrueType SYMBOL_CHARSET fonts use those locations as aliases for U+0020..U+00FF. The charrep assignments are similar to some assignment models based on scripts. But note that natural language isn’t used. This is because in general it’s easier and more reliable to figure out a reasonable charrep for a character than a natural language for a character. A RichEdit client can find out what charrep is assigned to a range of text by calling ITextFont2::GetCharRep().
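To make the range search concrete, here is a minimal sketch of a character-to-charrep lookup over a sorted range table. The `CharRep` enumerators, the table contents, and the function names are assumptions made up for this illustration; RichEdit’s actual table is much larger and treats low and high ASCII separately.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>

// Hypothetical charrep identifiers for this sketch; RichEdit's real values differ.
enum class CharRep { Default, Ansi, HighLatin1, Greek, Cyrillic, Kana, Hangul, Symbol };

struct Range { char32_t first; char32_t last; CharRep rep; };

// Illustrative, deliberately incomplete table: sorted by 'first', non-overlapping.
constexpr Range kRanges[] = {
    {0x0020, 0x007F, CharRep::Ansi},        // low and high ASCII, merged for brevity
    {0x00A0, 0x00FF, CharRep::HighLatin1},
    {0x0370, 0x03FF, CharRep::Greek},
    {0x0400, 0x04FF, CharRep::Cyrillic},
    {0x3040, 0x30FF, CharRep::Kana},        // Hiragana and Katakana
    {0xAC00, 0xD7A3, CharRep::Hangul},      // Hangul syllables
    {0xE000, 0xF01F, CharRep::Default},     // Private Use Area
    {0xF020, 0xF0FF, CharRep::Symbol},      // aliases used by SYMBOL_CHARSET fonts
    {0xF100, 0xF8FF, CharRep::Default},     // rest of the Private Use Area
};

CharRep CharRepFromChar(char32_t ch) {
    // Binary search for the first range starting after ch; the only range
    // that can contain ch is the one just before it.
    auto it = std::upper_bound(std::begin(kRanges), std::end(kRanges), ch,
        [](char32_t c, const Range& r) { return c < r.first; });
    if (it == std::begin(kRanges)) return CharRep::Default;
    --it;
    return ch <= it->last ? it->rep : CharRep::Default;
}

// Map a symbol-font alias in U+F020..U+F0FF back to the code in U+0020..U+00FF.
char32_t SymbolAliasToCode(char32_t ch) {
    assert(ch >= 0xF020 && ch <= 0xF0FF);
    return ch - 0xF000;
}
```

Characters that fall between table entries get the default charrep here, which is a simplification chosen for the sketch rather than documented RichEdit behavior.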
Determining what Charreps a Font Supports
Windows GDI has a handy structure known as the FONTSIGNATURE, which has bits claiming support for various code pages and Unicode ranges. Some fonts don’t have reliable values, so buyer beware. Nevertheless, it’s fast and useful, so RichEdit uses it to fill in a bit mask for supported character repertoires and has some back-up code to handle errant fonts. Some fonts claim to support a given character repertoire, but only support it partially. For example, Japanese fonts claim to support Greek and Cyrillic, but they only have glyphs for basic Greek and Cyrillic characters. RichEdit classifies other characters in these repertoires as “extended”, which means that they need “cmap” verification. The cmap is a TrueType font’s character-to-glyph map. If it returns 0 for a character, the character is missing and will display a missing-character glyph, usually an empty box. The cmap approach is also valuable in other cases where the FONTSIGNATURE may be inadequate. For example, a font may not claim to support Latin 1, but it nevertheless supports low and/or high ASCII. By checking the cmap for ‘0’ (low ASCII) and ‘a’ (high ASCII), one can find out the amount of ASCII support available. This approach can also be useful for finding out about new Unicode ranges not represented in the FONTSIGNATURE, e.g., emoji.
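The two checks described above can be sketched in portable C++. The `FontSignature` struct below only mirrors the field layout of the Windows FONTSIGNATURE so the example builds without `<windows.h>`. `FontInfo`, its `cmap` map, and all function names are hypothetical stand-ins, not RichEdit or GDI APIs. The code-page bit positions follow the OpenType OS/2 code-page range assignments.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Same field layout as the Windows FONTSIGNATURE structure.
struct FontSignature {
    std::uint32_t fsUsb[4];  // Unicode subset bits
    std::uint32_t fsCsb[2];  // code-page bits
};

// A few code-page bit positions within fsCsb.
constexpr unsigned kCpLatin1   = 0;   // code page 1252
constexpr unsigned kCpCyrillic = 2;   // code page 1251
constexpr unsigned kCpGreek    = 3;   // code page 1253
constexpr unsigned kCpJapanese = 17;  // code page 932

// Hypothetical model of a font: its signature plus a character-to-glyph map.
struct FontInfo {
    FontSignature sig;
    std::map<char32_t, std::uint16_t> cmap;
};

// Fast check: does the font claim the code page?
bool ClaimsCodePage(const FontInfo& font, unsigned bit) {
    assert(bit < 64);  // fsCsb holds two 32-bit words
    return ((font.sig.fsCsb[bit / 32] >> (bit % 32)) & 1u) != 0;
}

// Slow check: glyph index 0 means the character is missing.
bool HasGlyph(const FontInfo& font, char32_t ch) {
    auto it = font.cmap.find(ch);
    return it != font.cmap.end() && it->second != 0;
}

struct AsciiSupport { bool low; bool high; };

// Probe '0' for low ASCII and 'a' for high ASCII, as described in the text.
AsciiSupport ProbeAscii(const FontInfo& font) {
    return {HasGlyph(font, U'0'), HasGlyph(font, U'a')};
}

// "Extended" characters need cmap verification even when the signature
// claims the repertoire; basic ones are accepted on the claim alone.
bool SupportsChar(const FontInfo& font, unsigned codePageBit,
                  char32_t ch, bool isExtended) {
    if (!ClaimsCodePage(font, codePageBit)) return false;
    return !isExtended || HasGlyph(font, ch);
}
```

In a real Windows program the signature would come from GDI and the glyph lookup from the font’s actual cmap; the map here only keeps the sketch self-contained.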
Font Binding
Given the information in the two preceding sections, we need to ensure appropriate fonts are bound to characters. Let’s start with the basic algorithm and then consider some of the many fix-ups.
The simple algorithm is: assign a character flag (bit) to each character repertoire (charrep) and AND the resulting bit mask for a character against the bit mask for the current font. If a nonzero value results, the font claims to support the character’s charrep and the font can continue to be used. If the result is zero, font binding is needed.
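As a minimal sketch of that test, assume each charrep is identified by a small index and that 64 bits are enough for the mask. The names `CharRepMask`, `MaskOf`, and `NeedsFontBinding` are invented for this example.

```cpp
#include <cassert>
#include <cstdint>

using CharRepMask = std::uint64_t;

// One bit per charrep; the index values themselves are hypothetical.
inline CharRepMask MaskOf(unsigned charRepIndex) {
    assert(charRepIndex < 64);  // one bit per charrep in a 64-bit mask
    return CharRepMask{1} << charRepIndex;
}

// A zero AND means the current font does not claim the character's charrep.
inline bool NeedsFontBinding(CharRepMask charMask, CharRepMask fontMask) {
    return (charMask & fontMask) == 0;
}
```

For instance, with a font mask of `MaskOf(0) | MaskOf(1)`, a character whose mask is `MaskOf(1)` keeps the current font, while one whose mask is `MaskOf(3)` triggers font binding.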
To keep the current user font choices when font binding, RichEdit scans the text runs backward from the insertion point looking for a font that supports the desired charrep. If one is found, it is used unless it was introduced by font binding. Otherwise, the default font for the charrep is used. The RichEdit client can change the default font for a charrep by using the EM_SETCHARFORMAT message with the SCF_ASSOCIATEFONT flag and an LCID (locale ID) that corresponds to the desired charrep.
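The backward scan might look roughly like the following sketch. The `Run` struct and `ChooseFont` are hypothetical, and the code takes one reading of the rule above: the scan stops at the nearest run whose font supports the charrep, and falls back to the default font if that run’s font came from font binding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using CharRepMask = std::uint64_t;

// Hypothetical model of a formatted text run.
struct Run {
    std::string fontName;
    CharRepMask fontMask;   // charreps the run's font claims to support
    bool fromFontBinding;   // true if font binding, not the user, chose the font
};

// Scan backward from the run containing the insertion point. Return the
// nearest supporting font if the user chose it; otherwise return defaultFont.
std::string ChooseFont(const std::vector<Run>& runs, std::size_t insertionRun,
                       CharRepMask charMask, const std::string& defaultFont) {
    if (runs.empty()) return defaultFont;
    assert(insertionRun < runs.size());
    for (std::size_t i = insertionRun + 1; i-- > 0;) {
        const Run& run = runs[i];
        if ((run.fontMask & charMask) == 0) continue;   // font lacks the charrep
        if (!run.fromFontBinding) return run.fontName;  // keep the user's choice
        break;  // nearest match was font-bound: use the default instead
    }
    return defaultFont;
}
```

The `defaultFont` argument stands for the per-charrep default, which a client could have changed with EM_SETCHARFORMAT and SCF_ASSOCIATEFONT.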
Special considerations apply to math zones and Chinese characters (among other scenarios). In math zones, a math font is used whenever it can handle the characters. This includes not only Unicode math symbols and math alphanumerics, but also Latin, Greek and Cyrillic text. Many Chinese characters are used in Japanese, but a Chinese font may not look pleasing to a Japanese person. In particular, the simplified Chinese fonts look quite different from the more traditional look of the same characters rendered with a Japanese font. So to bind appropriately, we scan forward and backward around an inserted Chinese character to see if any Hiragana or Katakana characters are present. If so, it’s most likely to be Japanese text and a Japanese font should be used. Similarly, if Hangul characters are found, a Korean font should be used. Other heuristics involve noting what the user locale is.
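The kana/Hangul heuristic can be illustrated with a short sketch. The window size, the `CjkFont` result type, and the function names are assumptions for this example. The code ranges are the Unicode Hiragana and Katakana blocks, Hangul Jamo, and Hangul syllables.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

enum class CjkFont { Japanese, Korean, LocaleDefault };

// Hiragana U+3040..U+309F and Katakana U+30A0..U+30FF are adjacent blocks.
constexpr bool IsKana(char32_t ch) { return ch >= 0x3040 && ch <= 0x30FF; }

// Hangul Jamo U+1100..U+11FF and Hangul syllables U+AC00..U+D7A3.
constexpr bool IsHangul(char32_t ch) {
    return (ch >= 0x1100 && ch <= 0x11FF) || (ch >= 0xAC00 && ch <= 0xD7A3);
}

// Look at up to 'window' characters on each side of the Chinese character
// at 'pos'. The first kana or Hangul character found decides the font;
// otherwise other heuristics, such as the user locale, take over.
CjkFont ClassifyHan(const std::u32string& text, std::size_t pos,
                    std::size_t window = 32) {
    assert(pos < text.size());
    const std::size_t begin = pos > window ? pos - window : 0;
    const std::size_t end = std::min(text.size(), pos + window + 1);
    for (std::size_t i = begin; i < end; ++i) {
        if (IsKana(text[i])) return CjkFont::Japanese;
        if (IsHangul(text[i])) return CjkFont::Korean;
    }
    return CjkFont::LocaleDefault;
}
```

This sketch scans the window from left to right for simplicity; a real implementation might prefer the nearest neighbor or weigh both directions differently.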
If even with its many heuristics RichEdit still cannot find a reasonable font and a Microsoft Office application is running, RichEdit queries the Office mso.dll if it’s loaded. Also, various kinds of “font fallback” may kick in at display time to save a character from being rendered as a missing-character glyph. Font binding is an area of active research since new character scripts continue to be added to Windows and Unicode continues to add new characters. In general, RichEdit’s built-in font binding does a good job of it, but it could be better.
DirectWrite Font Binding
Alternatively, RichEdit controls can call methods on the IProvideFontInfo interface to assign fonts to text runs. During RichEdit control initialization, the client may supply the IProvideFontInfo pointer via a callback to ITextHost::QueryInterface() as done by the XAML RichEditBox and TextBox controls. Or if the client has loaded the Microsoft Office shared library mso.dll, a D2D/DirectWrite RichEdit control will try to create an enhanced IProvideFontInfo object. The latter is a wrapper around the DirectWrite IDWriteFontFallback interface. In principle, this font binding handles all ISO scripts that Windows supports. But it doesn’t handle math-zone font binding.
The methods of IProvideFontInfo are:
- Get default font (the BSTR returned will be freed by the caller of this function)
virtual BSTR GetDefaultFont() = 0;
- Get font face ID to be used for new characters
virtual DWORD GetRunFontFaceId(
    _In_z_ const wchar_t* pCurrentFontName,            // Current font
    _In_ DWRITE_FONT_WEIGHT weight,                    // Bold, Extra Bold, ...
    _In_ DWRITE_FONT_STRETCH stretch,                  // Condensed, expanded, ...
    _In_ DWRITE_FONT_STYLE style,                      // Italic, Oblique, Normal
    _In_ LCID lcid,                                    // Locale id
    _In_opt_count_(charCount) const wchar_t* pText,    // Input characters
    _In_ unsigned int charCount,                       // Character count of pText
    _In_ DWORD fontFaceIdCurrent,                      // Current font face Id
    _Out_ unsigned int& runCount) = 0;                 // Character count for subset of pText covered
- Get the DirectWrite font face belonging to a font ID
virtual IDWriteFontFace* GetFontFace( _In_ DWORD fontFaceId) = 0; // Font ID
- Get name of a font face belonging to a font ID
virtual BSTR GetSerializableFontName( _In_ DWORD fontFaceId) = 0;
The GUID for IProvideFontInfo is 7502135B-17C1-4A25-BDC9-55E6BCB8598A.
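To illustrate how a caller might use the runCount output of GetRunFontFaceId, here is a hedged sketch that splits text into font runs. `IRunFontSource` is a hypothetical, portable stand-in that keeps only the text, count, current-ID, and runCount parameters; it is not the real interface. `ToySource` is a made-up provider, and the loop is an assumption about usage, not RichEdit’s actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the one IProvideFontInfo method used here; the
// real method also takes the font name, weight, stretch, style, and LCID.
struct IRunFontSource {
    virtual ~IRunFontSource() = default;
    virtual std::uint32_t GetRunFontFaceId(const wchar_t* text, unsigned charCount,
                                           std::uint32_t currentId,
                                           unsigned& runCount) = 0;
};

struct FontRun { std::size_t start; std::size_t length; std::uint32_t fontFaceId; };

// Ask the provider repeatedly which font face covers the next stretch of text.
std::vector<FontRun> SplitIntoFontRuns(IRunFontSource& source,
                                       const std::wstring& text,
                                       std::uint32_t currentId) {
    std::vector<FontRun> runs;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const unsigned remaining = static_cast<unsigned>(text.size() - pos);
        unsigned runCount = 0;
        const std::uint32_t id =
            source.GetRunFontFaceId(text.data() + pos, remaining, currentId, runCount);
        // Guard against a provider that reports no progress or too much.
        if (runCount == 0 || runCount > remaining) runCount = remaining;
        runs.push_back({pos, runCount, id});
        pos += runCount;
        currentId = id;
    }
    return runs;
}

// Toy provider for illustration only: face 1 for ASCII, face 2 otherwise.
struct ToySource : IRunFontSource {
    std::uint32_t GetRunFontFaceId(const wchar_t* text, unsigned charCount,
                                   std::uint32_t, unsigned& runCount) override {
        assert(charCount > 0);
        const bool ascii = text[0] < 0x80;
        runCount = 1;
        while (runCount < charCount && (text[runCount] < 0x80) == ascii) ++runCount;
        return ascii ? 1u : 2u;
    }
};
```

With `ToySource`, the five-character string made of "ab", two Chinese characters, and "c" splits into three runs of lengths 2, 2, and 1.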
The Office IProvideFontInfo instantiation includes two extra methods that belong to IProvideFontInfo2 (which inherits from IProvideFontInfo):
- Get Default Font face without saving a fontID
virtual HRESULT GetDefaultFontFace( IDWriteFontFace** pDwriteFontFace) = 0;
- Refresh font-face cache
virtual HRESULT RefreshFontFaceCache( const std::wstring& gdiName) = 0;
The GUID for IProvideFontInfo2 is F71EE023-E909-4F63-B569-EA08956D0004.