September 29th, 2022

Setting and Getting Text in Various Formats

Murray Sargent
Principal Software Engineer

You can get text from and set text into RichEdit in a variety of formats including RTF, HTML, MathML, OMML, UnicodeMath, Nemeth Braille, and speech. This post documents the RichEdit options that control a general way to access text: ITextRange2::SetText2(options, bstr) and ITextRange2::GetText2(options, pbstr). As such, this post is for programmers. All options work in the current Microsoft Office RichEdit (riched20.dll in an Office subdirectory) and many work in the Windows RichEdit (msftedit.dll). The options are defined in the following table, in which s and g in the s/g column stand for SetText2 and GetText2, respectively.

| Option | Value | s/g | Meaning |
| --- | --- | --- | --- |
| tomUnicodeBiDi | 0x00000001 | s | Use Unicode BiDi algorithm for inserted text |
| tomAdjustCRLF | 0x00000001 | g | If range start is inside multicode unit like CRLF, surrogate pair, etc., move to start of unit |
| tomUseCRLF | 0x00000002 | g | Paragraph ends use CRLF (U+000D U+000A) |
| tomTextize | 0x00000004 | g | Embedded objects export alt text; else U+FFFC |
| tomAllowFinalEOP | 0x00000008 | g | If range includes final EOP, export it; else don’t |
| tomUnlink | 0x00000008 | s | Disables link attributes if present |
| tomUnhide | 0x00000010 | s | Disables hidden attribute if present |
| tomFoldMathAlpha | 0x00000010 | g | Replace math alphanumerics with ASCII/Greek |
| tomIncludeNumbering | 0x00000040 | g | Lists include bullets/numbering |
| tomCheckTextLimit | 0x00000020 | s | Only insert up to text limit |
| tomDontSelectText | 0x00000040 | s | After insertion, call Collapse(tomEnd) |
| tomTranslateTableCell | 0x00000080 | g | Export spaces for table delimiters |
| tomNoMathZoneBrackets | 0x00000100 | g | Used with tomConvertUnicodeMath and tomConvertTeX. Set discards math zone brackets |
| tomLanguageTag | 0x00001000 | s/g | Sets BCP-47 language tag for range; gets tag |
| tomConvertRTF | 0x00002000 | s/g | Set or get RTF |
| tomGetTextForSpell | 0x00008000 | g | Export spaces for hidden/math text, table delims |
| tomConvertMathML | 0x00010000 | s/g | Set or get MathML |
| tomGetUtf16 | 0x00020000 | g | Causes tomConvertRTF, etc. to get UTF-16. SetText2 accepts 8-bit or 16-bit RTF |
| tomConvertLinearFormat | 0x00040000 | s/g | Alias for tomConvertUnicodeMath |
| tomConvertUnicodeMath | 0x00040000 | s/g | UnicodeMath |
| tomConvertOMML | 0x00080000 | s/g | Office MathML |
| tomConvertMask | 0x00F00000 | s/g | Mask for mutually exclusive modes |
| tomConvertRuby | 0x00100000 | s | See section below on Entering Ruby Text |
| tomConvertTeX | 0x00200000 | s/g | See LaTeX Math in Office |
| tomConvertMathSpeech | 0x00300000 | g | Math speech (English only here) |
| tomConvertSpeechTokens | 0x00400000 | g | Simple Unicode and speech tokens |
| tomConvertNemeth | 0x00500000 | s/g | Nemeth math braille in U+2800 block |
| tomConvertNemethAscii | 0x00600000 | g | Corresponding ASCII braille |
| tomConvertNemethNoItalic | 0x00700000 | g | Nemeth braille in U+2800 block w/o math italic |
| tomConvertNemethDefinition | 0x00800000 | g | Fine-grained speech in braille |
| tomConvertHtml | 0x00900000 | s/g | Convert HTML |
| tomConvertEnclose | 0x00A00000 | s | See section below on Entering Enclosed Text |
| tomConvertCRtoLF | 0x01000000 | g | Plain-text paragraphs end with LF, not CRLF |
| tomLaTeXDelim | 0x02000000 | g | Use LaTeX math-zone delimiters \(…\) inline, \[…\] display; else $…$, $$…$$. Set handles all |
| tomGhostText | 0x04000000 | s | Set ghost text (used for text prediction) |
| tomNoGhostText | 0x04000000 | g | Get text without ghost text |
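Several low-order values in the table appear twice, once for SetText2 and once for GetText2, so the same bit means different things depending on the call. The following is a minimal sketch of that idea, not part of RichEdit: the constants are copied from the table and redefined locally so the code builds without the RichEdit headers, and the names `Direction`, `OptionName`, `kLowFlags`, and `describeLowFlags` are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Which call the option word is being passed to.
enum class Direction { Set, Get };

struct OptionName {
    long value;        // bit value copied from the table above
    Direction dir;     // s (SetText2) or g (GetText2)
    const char* name;  // option name from the table
};

// A subset of the low-order flags, enough to show the overlap.
static const OptionName kLowFlags[] = {
    {0x00000001, Direction::Set, "tomUnicodeBiDi"},
    {0x00000001, Direction::Get, "tomAdjustCRLF"},
    {0x00000002, Direction::Get, "tomUseCRLF"},
    {0x00000004, Direction::Get, "tomTextize"},
    {0x00000008, Direction::Get, "tomAllowFinalEOP"},
    {0x00000008, Direction::Set, "tomUnlink"},
    {0x00000010, Direction::Set, "tomUnhide"},
    {0x00000010, Direction::Get, "tomFoldMathAlpha"},
    {0x00000020, Direction::Set, "tomCheckTextLimit"},
    {0x00000040, Direction::Get, "tomIncludeNumbering"},
    {0x00000040, Direction::Set, "tomDontSelectText"},
    {0x00000080, Direction::Get, "tomTranslateTableCell"},
};

// Hypothetical helper: list the names of the low-order flags present in
// `options`, interpreted for the given direction.
std::vector<std::string> describeLowFlags(long options, Direction dir) {
    std::vector<std::string> names;
    for (const OptionName& entry : kLowFlags) {
        if (entry.dir == dir && (options & entry.value) != 0) {
            names.emplace_back(entry.name);
        }
    }
    return names;
}
```

With this sketch, the value 0x00000008 is reported as tomAllowFinalEOP for `Direction::Get` and as tomUnlink for `Direction::Set`, which is why it helps to keep set and get option words in separate variables.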

Mutually exclusive options

Nonzero values within the mask defined by tomConvertMask (0x00F00000) are mutually exclusive, that is, they cannot be combined (OR’d) with one another. The options UnicodeMath, [La]TeX (tomConvertTeX), and Nemeth math braille (tomConvertNemeth) are also mutually exclusive. You can set only one at a time. But other options can be OR’d in if desired.
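One way to see why the masked values cannot be OR'd together is that they behave like a single numeric field, so combining two of them yields a third mode. The sketch below illustrates this with values copied from the table; the constants are redefined locally so it builds without the RichEdit headers, and the helpers `maskedMode` and `mathInputFormatsCompatible` are hypothetical names, not RichEdit APIs.

```cpp
#include <cassert>

// Values copied from the options table.
constexpr long tomConvertUnicodeMath = 0x00040000;
constexpr long tomConvertMask        = 0x00F00000;
constexpr long tomConvertRuby        = 0x00100000;
constexpr long tomConvertTeX         = 0x00200000;
constexpr long tomConvertMathSpeech  = 0x00300000;
constexpr long tomConvertNemeth      = 0x00500000;

// OR-ing two masked modes does not mean "both": the result is another mode.
static_assert((tomConvertRuby | tomConvertTeX) == tomConvertMathSpeech,
              "ruby | TeX collides with math speech");

// Hypothetical helper: the single mode selected inside tomConvertMask
// (0 if no masked mode is requested).
inline long maskedMode(long options) {
    return options & tomConvertMask;
}

// Hypothetical helper: true if at most one of UnicodeMath, TeX, and Nemeth
// math braille is requested. UnicodeMath lies outside the mask, so it needs
// its own bit test.
inline bool mathInputFormatsCompatible(long options) {
    const long mode = maskedMode(options);
    int count = 0;
    if ((options & tomConvertUnicodeMath) != 0) ++count;
    if (mode == tomConvertTeX) ++count;
    if (mode == tomConvertNemeth) ++count;
    return count <= 1;
}
```

Options outside the mask, such as tomNoMathZoneBrackets (0x00000100), can still be OR'd in: `mathInputFormatsCompatible(tomConvertTeX | 0x00000100)` is true in this sketch.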

Nemeth math braille options

A string of Nemeth math braille codes in the Unicode range U+2800..U+283F can be inserted and built up by calling ITextRange2::SetText2(tomConvertNemeth, bstr). If the string is valid, you can get it back in any of the math formats including Nemeth math braille. For example, if you insert the string

⠹⠂⠌⠆⠨⠏⠼⠮⠰⠴⠘⠆⠨⠏⠐⠹⠨⠈⠈⠙⠨⠹⠌⠁⠬⠃⠀⠎⠊⠝⠀⠨⠹⠼⠀⠨⠅⠀⠹⠂⠌⠜⠁⠘⠆⠐⠤⠃⠘⠆⠐⠻⠼

you see

[Image: integral]
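Before passing a string to SetText2(tomConvertNemeth, bstr), a caller might want to confirm that it contains only characters in the range named above. The following is a small sketch of such a pre-flight check; `isSixDotBraille` is a hypothetical helper, and it says nothing about whether the string is valid Nemeth braille, only that every code unit lies in U+2800..U+283F.

```cpp
#include <cassert>
#include <string>

// Hypothetical pre-flight check: true if every UTF-16 code unit lies in
// U+2800..U+283F, the braille range named in the text. An empty string
// passes trivially.
bool isSixDotBraille(const std::u16string& text) {
    for (char16_t ch : text) {
        if (ch < 0x2800 || ch > 0x283F) {
            return false;
        }
    }
    return true;
}
```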

You can also input braille with a standard keyboard by typing a control word \braille assigned to the Unicode character U+24B7 (Ⓑ). (See LaTeX Math in Office for how to add commands to math autocorrect.) The \braille command causes math input to accept braille input via a regular keyboard using the braille ASCII codes sometimes referred to as North American Braille Computer Codes. The character ~ (U+007E) disables this input mode. These braille codes are described in the post Nemeth Braille—the first math linear format and can be input using refreshable braille displays. Alternatively, such input can be automated by calling ITextSelection::TypeText(bstr). Just as in entering UnicodeMath, the equations build up on screen as soon as the math braille input becomes unambiguous. The implementation includes the math braille UI that cues the user where the insertion point is for unambiguous editing of math zones using braille. Note that as of this posting, the math braille facility isn’t hooked up to Narrator or other screen readers.
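To make the relationship between braille ASCII and the U+2800 block concrete, here is a minimal sketch that maps braille ASCII characters to Unicode braille cells. The 64-character mapping string is the commonly published braille ASCII ordering and is an assumption here rather than something taken from this post; `kBrailleAscii` and `brailleAsciiToUnicode` are hypothetical names, and folding lowercase letters to uppercase is a choice made for this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Braille ASCII in cell order: the character at index i corresponds to
// U+2800 + i. (Assumed from the commonly published braille ASCII table.)
static const char kBrailleAscii[] =
    " A1B'K2L@CIF/MSP\"E3H9O6R^DJG>NTQ,*5<-U8V.%[$+X!&;:4\\0Z7(_?W]#Y)=";
static_assert(sizeof(kBrailleAscii) == 65, "64 cells plus terminator");

// Hypothetical helper: map braille ASCII text to U+2800..U+283F.
// Lowercase letters are folded to uppercase. Characters with no cell in the
// table (for example '~') make the function return false.
bool brailleAsciiToUnicode(const std::string& ascii, std::u16string& out) {
    out.clear();
    for (unsigned char ch : ascii) {
        const char upper = static_cast<char>(std::toupper(ch));
        const char* hit =
            std::char_traits<char>::find(kBrailleAscii, 64, upper);
        if (hit == nullptr) {
            return false;
        }
        out.push_back(static_cast<char16_t>(0x2800 + (hit - kBrailleAscii)));
    }
    return true;
}
```

Under this mapping, the braille ASCII text `?1/2` becomes ⠹⠂⠌⠆, which matches the first four cells of the example string above, and `SIN` becomes ⠎⠊⠝.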

Getting (and Setting) Math Speech

The tomConvertMathSpeech option currently gets math speech only in English. Microsoft Office apps like Word, PowerPoint and OneNote deliver math speech in over 18 languages to the assistive technology (AT) program Narrator via the UIA ITextRangeProvider::GetText() function. Other ATs could also get math speech this way, although they usually get MathML and generate speech from that. Dictating (setting) math speech would be nice for both blind and sighted folks. Imagine, you can say 𝑎² + 𝑏² = 𝑐² faster than you can type it or write it! SetText2(tomConvertMathSpeech, bstr) is ready to handle such input, but the feature is not available yet.

Entering ruby text

In a nonmath context, the option tomConvertRuby (0x00100000) can be used to convert strings like “{…|…}” to ruby inline objects, where the first ellipsis represents the ruby text and the second ellipsis the base text. The ASCII curly braces and vertical bar are translated to the internal ruby-object structure characters U+FDD1, U+FDEF, and U+FDEE, respectively. Alternatively, the string can contain those structure characters directly. If a digit follows the start delimiter (‘{’ or U+FDD1), the digit defines the ruby options:

| rubyAlign | val | Meaning |
| --- | --- | --- |
| center | 0 | Center <ruby> with respect to <base> |
| distributeLetter | 1 | Distribute difference in space between longer and shorter text in the latter, evenly between each character |
| distributeSpace | 2 | Distribute difference in space between longer and shorter text in the latter using a ratio of 1:2:1 which corresponds to lead : inter-character : end |
| left | 3 | Align <ruby> with the left of <base> |
| right | 4 | Align <ruby> with the right of <base> |

If you add 5 to these values, the ruby object will display the ruby text below the base text instead of above it. For example, calling ITextRange2::SetText2(tomConvertRuby, bstr) with bstr containing the string “{1にほんご|日本語}” inserts

[Image: ruby]
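The option digit and the “{…|…}” string format described above can be assembled programmatically. This is a minimal sketch, assuming the caller passes the result to SetText2(tomConvertRuby, bstr) afterwards; `RubyAlign`, `rubyOptionDigit`, and `makeRubyMarkup` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <string>

// Alignment values from the rubyAlign table above.
enum class RubyAlign {
    Center = 0,
    DistributeLetter = 1,
    DistributeSpace = 2,
    Left = 3,
    Right = 4
};

// Hypothetical helper: the option digit that follows the start delimiter.
// Adding 5 displays the ruby text below the base text instead of above it.
int rubyOptionDigit(RubyAlign align, bool below) {
    return static_cast<int>(align) + (below ? 5 : 0);
}

// Hypothetical helper: build "{<digit><ruby>|<base>}" using the ASCII
// delimiters described in the text.
std::u16string makeRubyMarkup(const std::u16string& ruby,
                              const std::u16string& base,
                              RubyAlign align, bool below) {
    const int digit = rubyOptionDigit(align, below);
    assert(digit >= 0 && digit <= 9);  // must fit in a single digit
    std::u16string markup = u"{";
    markup.push_back(static_cast<char16_t>(u'0' + digit));
    markup += ruby;
    markup.push_back(u'|');
    markup += base;
    markup.push_back(u'}');
    return markup;
}
```

For example, `makeRubyMarkup(u"にほんご", u"日本語", RubyAlign::DistributeLetter, false)` produces the string “{1にほんご|日本語}” used above. The sketch always writes the digit, which sidesteps any ambiguity when the ruby text itself starts with a digit.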

The string can contain text in addition to ruby objects and the ruby objects can be nested to create compound ruby objects such as

[Image: compound ruby]
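The delimiter translation described at the start of this section can be sketched as a simple character mapping. RichEdit performs this translation itself when tomConvertRuby is set, so the code below is only an illustration. It assumes that “respectively” pairs ‘{’ with U+FDD1, ‘}’ with U+FDEF, and ‘|’ with U+FDEE, and the names `kRubyStart`, `kRubySeparator`, `kRubyEnd`, `toRubyStructureChars`, and `rubyDelimitersBalanced` are hypothetical.

```cpp
#include <cassert>
#include <string>

// Structure characters named in the text (assumed pairing, see lead-in).
constexpr char16_t kRubyStart     = 0xFDD1;  // from '{'
constexpr char16_t kRubySeparator = 0xFDEE;  // from '|'
constexpr char16_t kRubyEnd       = 0xFDEF;  // from '}'

// Hypothetical helper: replace the ASCII delimiters with the structure
// characters. Every '{', '|', and '}' is treated as a delimiter; no
// escaping scheme is assumed.
std::u16string toRubyStructureChars(const std::u16string& markup) {
    std::u16string out;
    out.reserve(markup.size());
    for (char16_t ch : markup) {
        switch (ch) {
            case u'{': out.push_back(kRubyStart); break;
            case u'|': out.push_back(kRubySeparator); break;
            case u'}': out.push_back(kRubyEnd); break;
            default:   out.push_back(ch); break;
        }
    }
    return out;
}

// Hypothetical helper: check that start and end delimiters nest properly,
// which matters once ruby objects are nested. Accepts either the ASCII or
// the structure-character form.
bool rubyDelimitersBalanced(const std::u16string& text) {
    int depth = 0;
    for (char16_t ch : text) {
        if (ch == u'{' || ch == kRubyStart) {
            ++depth;
        } else if (ch == u'}' || ch == kRubyEnd) {
            if (--depth < 0) {
                return false;
            }
        }
    }
    return depth == 0;
}
```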

Entering enclosed text

The post Rounded Rectangles and Ellipses describes ways to enclose text in possibly rounded rectangles and ellipses. The tomConvertEnclose option, used as SetText2(tomConvertEnclose, bstr), is similar to the tomConvertRuby option. It converts strings like “{…}” to a tomEnclose object.

Other ways to get/set text

In addition to ITextRange2::SetText2() and ITextRange2::GetText2(), the messages WM_SETTEXT, EM_SETTEXTEX, WM_GETTEXT, and EM_GETTEXTEX are useful. The set-text messages work with plain text or RTF in rich-text controls. EM_SETTEXTEX accepts both 16-bit and 8-bit RTF, while WM_SETTEXT doesn’t handle 16-bit RTF.
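If RTF is held as a UTF-16 string and has to go through a path that only handles 8-bit RTF, one option is to rewrite it as 7-bit ASCII using the RTF \uN keyword. The following is a sketch of that idea and not something this post prescribes: `narrowRtf` is a hypothetical helper, and it assumes the RTF default of one fallback character after each \uN (that is, \uc1), for which it writes ‘?’.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: rewrite UTF-16 RTF as 7-bit ASCII RTF.
// ASCII code units are copied unchanged. Every other code unit, including
// each half of a surrogate pair, becomes \uN followed by the fallback
// character '?'. N is written as a signed 16-bit decimal, so code units
// above 0x7FFF come out negative.
std::string narrowRtf(const std::u16string& rtf16) {
    std::string out;
    out.reserve(rtf16.size());
    for (char16_t unit : rtf16) {
        if (unit < 0x80) {
            out.push_back(static_cast<char>(unit));
        } else {
            const int value = unit > 0x7FFF
                                  ? static_cast<int>(unit) - 0x10000
                                  : static_cast<int>(unit);
            out += "\\u";
            out += std::to_string(value);
            out += "?";
        }
    }
    return out;
}
```

For instance, é (U+00E9) becomes `\u233?` and the object replacement character U+FFFC becomes `\u-4?`. When EM_SETTEXTEX is available this step is unnecessary, since it accepts 16-bit RTF directly.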

Author

Yale BS, MS, PhD in theoretical physics. Worked 22 years in laser theory & applications first at Bell Labs and then Professor of Optical Sciences, University of Arizona. Worked on technical word processing, writing the first math display program (1969) and the technical word processor PS (1980s). Developed the SST debugger we used to get Windows 2.0 running in protected mode thereby eliminating the 640KB DOS barrier (1988). Have more than 100 refereed publications, 3 laser-physics books, 4 PC books, 163 posts on Math in Office blog. Joined Microsoft Research in 1992. Since 1994 have been in Microsoft Office working mostly on RichEdit and OfficeMath. Member of Unicode Technical Committee (1994—) and MathML Working Group (1999—).
