May 30th, 2021

RichEdit HTML Support

Murray Sargent
Principal Software Engineer

RichEdit has had limited HTML support for many years, but it wasn’t general enough to document publicly. A recent RichEdit client (to be described in a future post) needs better support, so we have been improving it. For example, we have added HTML copy/paste, images, and math (of course!) to the Microsoft Office riched20.dll. Ideally RichEdit HTML should be able to represent any property that RichEdit RTF can represent. That still wouldn’t make RichEdit a general HTML editor replete with forms and JavaScript functionality. But it would add good interoperability with Office apps, Teams, and the web, all of which use HTML as a lingua franca. This post describes the current RichEdit HTML capabilities which are a subset of its RTF capabilities. The HTML converters are works in progress and this post will be updated as more functionality is added. For example, RichEdit can write HTML tables, but not yet read them.

Contents

RichEdit HTML Support

HTML copy/paste format

Rich text

Images

Programming details

Messages

TOM methods

Math format options

HTML copy/paste format

The “HTML format” clipboard format includes header and comment data in addition to the HTML to be copied (see https://docs.microsoft.com/en-us/windows/win32/dataxchg/html-clipboard-format#description). This info needs to be added to copy HTML between RichEdit, Word, PowerPoint, OneNote, Teams, and other apps. Frankly, having to add this info seems like overkill, since RTF can be copied and pasted without such overhead. We illustrate the format as written by RichEdit with the HTML for Einstein’s energy equation 𝐸 = 𝑚𝑐². In the HTML, OMML is the math format used by default since that’s what Word and PowerPoint expect. Here’s the HTML:

Version:1.0
StartHTML:0000000105
EndHTML:0000000844
StartFragment:0000000417
EndFragment:0000000811
 
<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml">
<head><style>body{font-family:Arial,sans-serif;font-size:10pt;}</style>
<style>.cf0{font-style:italic;font-family:Cambria Math;font-size:24pt;}</style></head>
<body><!--StartFragment --><p><m:oMathPara><m:oMath class="cf0">
<span class="cf0"><m:r><i>&#x1D438;</i></m:r></span>
<span class="cf0"><m:r><i>=</i></m:r></span><span class="cf0">
<m:r><i>&#x1D45A;</i></m:r></span>
<m:sSup><m:sSupPr><m:ctrlPr></m:ctrlPr></m:sSupPr><m:e><span class="cf0">
<m:r><i>&#x1D450;</i></m:r></span></m:e><m:sup><span class="cf0"><m:r><i>2</i>
</m:r></span></m:sup></m:sSup></m:oMath></m:oMathPara></p>
<!--EndFragment --></body></html>

Here the StartHTML entry in the header gives the byte offset, counted from the start of the clipboard data, at which the HTML begins (the <html> tag in this example), and EndHTML gives the offset at which the HTML ends. StartFragment gives the offset of the start of the content that the user selected, just after the <!--StartFragment --> comment, and EndFragment gives the offset of the end of the selection, just before the <!--EndFragment --> comment. In this example, the equation 𝐸 = 𝑚𝑐² is selected and displayed on its own line (display mode rather than inline mode). The start of the displayed equation is given by the OMML <m:oMathPara>. The corresponding MathML including an mml: prefix is

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block">
  <mml:mi>E</mml:mi>
  <mml:mo>=</mml:mo>
  <mml:mi>m</mml:mi>
  <mml:msup>
    <mml:mi>c</mml:mi>
    <mml:mn>2</mml:mn></mml:msup></mml:math>

The Programming details section describes how to write HTML with OMML or MathML with and without the mml: prefix. The HTML5 standard includes MathML without a prefix. RichEdit can write and read HTML with all three math formats.
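To make the header arithmetic in this section concrete, here is a minimal C++ sketch that wraps a fragment in the clipboard envelope and computes the four offsets. The function name BuildClipboardHtml and its prefix/fragment/suffix parameters are hypothetical and not part of RichEdit. The sketch assumes CRLF line endings and ten-digit zero-padded offsets as in the example above, in which case the header is 105 bytes long, matching the StartHTML value shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: wraps an HTML fragment in the "HTML Format" clipboard
// envelope. prefix and suffix are the context HTML before and after the
// fragment. All offsets are byte counts from the start of the returned string.
std::string BuildClipboardHtml(const std::string& prefix,
                               const std::string& fragment,
                               const std::string& suffix)
{
    // Comment markers spelled as in the RichEdit output shown in this post.
    static const char kStartMark[] = "<!--StartFragment -->";
    static const char kEndMark[]   = "<!--EndFragment -->";
    // Fixed-width fields keep the header length constant, so the header can
    // be measured with placeholder zeros before the real offsets are known.
    static const char kHeaderFmt[] =
        "Version:1.0\r\n"
        "StartHTML:%010zu\r\n"
        "EndHTML:%010zu\r\n"
        "StartFragment:%010zu\r\n"
        "EndFragment:%010zu\r\n";

    char header[128];
    const std::size_t zero = 0;
    int headerLen = std::snprintf(header, sizeof header, kHeaderFmt,
                                  zero, zero, zero, zero);
    assert(headerLen > 0 && headerLen < static_cast<int>(sizeof header));

    const std::size_t startHtml     = static_cast<std::size_t>(headerLen);
    const std::size_t startFragment = startHtml + prefix.size() + (sizeof kStartMark - 1);
    const std::size_t endFragment   = startFragment + fragment.size();
    const std::size_t endHtml       = endFragment + (sizeof kEndMark - 1) + suffix.size();

    std::snprintf(header, sizeof header, kHeaderFmt,
                  startHtml, endHtml, startFragment, endFragment);
    return std::string(header) + prefix + kStartMark + fragment + kEndMark + suffix;
}
```

For example, BuildClipboardHtml("<html><body>", "<p>Hi</p>", "</body></html>") returns a string whose StartHTML offset points at the <html> tag and whose StartFragment offset points at the <p> tag.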

Rich text

Character formatting includes font name and family, font height, text and background color, weight, spacing, bold, italic, underline, strikeout, subscript, superscript, small caps, all caps, and hyperlinks. Paragraph formatting includes numbered and bulleted lists, left, right, and centered alignments, and paragraph margins.

Images

RichEdit can read and write the HTML <img> element with a src attribute that has a base64 encoding of the binary image data. This is a technique used widely in Microsoft Office for HTML copy/paste. For example, the tag might begin with <img src="data:image/png;base64,
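As an illustration of the data-URI technique, the following C++ sketch base64-encodes raw image bytes and builds an <img> tag of the form described above. Base64Encode and MakeImgTag are hypothetical helpers written for this example, not RichEdit functions, and the sketch says nothing about which other attributes RichEdit itself writes on the tag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard base64 (RFC 4648 alphabet) with '=' padding.
std::string Base64Encode(const std::vector<std::uint8_t>& data)
{
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve((data.size() + 2) / 3 * 4);
    for (std::size_t i = 0; i < data.size(); i += 3) {
        const std::size_t n = data.size() - i;  // bytes left: 1, 2, or >= 3
        // Pack up to three bytes into a 24-bit group.
        std::uint32_t v = static_cast<std::uint32_t>(data[i]) << 16;
        if (n > 1) v |= static_cast<std::uint32_t>(data[i + 1]) << 8;
        if (n > 2) v |= data[i + 2];
        // Emit four 6-bit symbols, padding when the group is short.
        out += kAlphabet[(v >> 18) & 63];
        out += kAlphabet[(v >> 12) & 63];
        out += n > 1 ? kAlphabet[(v >> 6) & 63] : '=';
        out += n > 2 ? kAlphabet[v & 63] : '=';
    }
    return out;
}

// Builds an <img> element whose src is a data URI holding the image bytes.
std::string MakeImgTag(const std::string& mimeType,
                       const std::vector<std::uint8_t>& bytes)
{
    assert(!mimeType.empty());
    return "<img src=\"data:" + mimeType + ";base64," + Base64Encode(bytes) + "\">";
}
```

Calling MakeImgTag("image/png", pngBytes) with the bytes of a PNG file yields a tag that begins with <img src="data:image/png;base64, followed by the encoded data.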

Programming details

HTML content can be read in and written out via messages, hot keys (Ctrl+C, Ctrl+V, Ctrl+X), and TOM methods.

Messages

A client can get HTML content by sending the EM_STREAMOUT message with wParam = SF_HTML | SF_BINARY. The SF_BINARY flag (0x0008) first writes the data in the RichEdit binary format to temporary memory, and the SF_HTML flag (0x00100000) then writes that data out as HTML. If clipboard HTML is desired, also OR the SF_CLIPBOARD flag (0x80000000) into wParam.
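A minimal sketch of streaming HTML out might look like the following. The flag values are copied from the paragraph above and defined locally, on the assumption that SF_HTML, SF_BINARY, and SF_CLIPBOARD may not be present in the public richedit.h. The names HtmlStreamOutFlags, AppendChunk, and StreamOutHtml are hypothetical. The Windows-only part is guarded so that the flag arithmetic can be compiled anywhere.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Flag values as quoted in this post (local definitions, see lead-in).
constexpr std::uint32_t kSfBinary    = 0x00000008;
constexpr std::uint32_t kSfHtml      = 0x00100000;
constexpr std::uint32_t kSfClipboard = 0x80000000;

// wParam for EM_STREAMOUT: plain HTML, or HTML with the clipboard header.
constexpr std::uint32_t HtmlStreamOutFlags(bool clipboardFormat)
{
    return kSfHtml | kSfBinary | (clipboardFormat ? kSfClipboard : 0u);
}

#ifdef _WIN32
#include <windows.h>
#include <richedit.h>

// EDITSTREAM callback: appends each chunk to the std::string passed as cookie.
static DWORD CALLBACK AppendChunk(DWORD_PTR cookie, LPBYTE buf, LONG cb, LONG* pcb)
{
    reinterpret_cast<std::string*>(cookie)->append(
        reinterpret_cast<const char*>(buf), static_cast<std::size_t>(cb));
    *pcb = cb;
    return 0;  // zero reports success so streaming continues
}

// Returns the control's content as HTML, or an empty string on error.
std::string StreamOutHtml(HWND hwndRichEdit, bool clipboardFormat)
{
    assert(hwndRichEdit != nullptr);
    std::string html;
    EDITSTREAM es = {};
    es.dwCookie    = reinterpret_cast<DWORD_PTR>(&html);
    es.pfnCallback = AppendChunk;
    SendMessage(hwndRichEdit, EM_STREAMOUT,
                HtmlStreamOutFlags(clipboardFormat),
                reinterpret_cast<LPARAM>(&es));
    return es.dwError == 0 ? html : std::string();
}
#endif
```

Collecting the output in a std::string through the cookie keeps the callback stateless; a client that writes large documents could stream the chunks to a file instead.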

A client can stream in HTML content by sending the EM_ISTREAMIN message (WM_USER + 252), which streams in using the IStream interface pointed to by the lParam instead of using the usual EDITSTREAM struct. This choice is due to the use of the Office HTML parser for input, which requires mso.dll to be loaded. Set wParam equal to 1, which signifies HTML. Currently only HTML can be streamed in using the EM_ISTREAMIN message.
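The stream-in path could be sketched as follows. The message number and wParam value come from the paragraph above and are defined locally. The helper name StreamInHtml is hypothetical, and the use of SHCreateMemStream to wrap an in-memory buffer is just one convenient way to obtain an IStream. The sketch assumes the HTML is passed as a byte string; the encoding the parser expects and the meaning of the message's return value are not specified in this post, so the sketch does not rely on them.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kWmUser        = 0x0400;         // value of WM_USER
constexpr std::uint32_t kEmIStreamIn   = kWmUser + 252;  // EM_ISTREAMIN per this post
constexpr std::uint32_t kIStreamInHtml = 1;              // wParam value that signifies HTML

#ifdef _WIN32
#include <windows.h>
#include <objidl.h>   // IStream
#include <shlwapi.h>  // SHCreateMemStream; link with shlwapi.lib
#include <cstring>

// Hypothetical helper: streams a null-terminated HTML byte string into the
// control. Returns false only if the memory stream could not be created.
bool StreamInHtml(HWND hwndRichEdit, const char* html)
{
    assert(hwndRichEdit != nullptr && html != nullptr);
    IStream* stream = SHCreateMemStream(
        reinterpret_cast<const BYTE*>(html),
        static_cast<UINT>(std::strlen(html)));
    if (!stream)
        return false;
    // lParam carries the IStream pointer instead of an EDITSTREAM struct.
    SendMessage(hwndRichEdit, kEmIStreamIn, kIStreamInHtml,
                reinterpret_cast<LPARAM>(stream));
    stream->Release();
    return true;
}
#endif
```

As noted above, mso.dll must be loaded for the message to work, so a client outside Office would need to check that this path is available to it.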

Other messages that can be used are WM_COPY, WM_PASTE, WM_CUT, and EM_PASTESPECIAL, all of which are described on the web.

TOM methods

In addition to the ITextRange::Copy() and ITextRange::Paste() methods, you can input HTML content into a range by calling ITextRange2::SetText2(tomConvertHtml, bstr), where tomConvertHtml is given by 0x00900000. Similarly, you can get the HTML content from a range by calling ITextRange2::GetText2(tomConvertHtml, pbstr).
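The two TOM calls can be wrapped as in the sketch below. The tomConvertHtml value is taken from the paragraph above and defined locally in case the installed tom.h does not declare it. SetRangeHtml and GetRangeHtml are hypothetical helper names, and obtaining the ITextRange2 pointer (for example from the control's ITextDocument2) is left to the caller.

```cpp
#include <cassert>

constexpr long kTomConvertHtml = 0x00900000;  // tomConvertHtml as quoted in this post

#ifdef _WIN32
#include <windows.h>
#include <oleauto.h>  // SysAllocString, SysFreeString
#include <tom.h>      // ITextRange2

// Replaces the content of the range with the given HTML.
HRESULT SetRangeHtml(ITextRange2* range, const wchar_t* html)
{
    assert(range != nullptr && html != nullptr);
    BSTR bstr = SysAllocString(html);
    if (!bstr)
        return E_OUTOFMEMORY;
    HRESULT hr = range->SetText2(kTomConvertHtml, bstr);
    SysFreeString(bstr);
    return hr;
}

// Retrieves the content of the range as HTML.
// On success the caller owns *out and frees it with SysFreeString.
HRESULT GetRangeHtml(ITextRange2* range, BSTR* out)
{
    assert(range != nullptr && out != nullptr);
    *out = nullptr;
    return range->GetText2(kTomConvertHtml, out);
}
#endif
```

Returning the HRESULT unchanged lets the caller distinguish a failed conversion from an empty range.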

Math format options

By default, RichEdit writes equations in HTML in the OMML format since that format is what Office apps like Word and PowerPoint expect. But it can also write equations in MathML with or without an mml: prefix. To choose the math format, call ITextDocument2::SetMathProperties() with tomHtmlOMML, tomHtmlMathML, or tomHtmlMath, which are defined as follows:

  tomHtmlMathFormatMask      = 0x00300000,   // Mask for math-format flags
  tomHtmlOMML                = 0,            // m:
  tomHtmlMathML              = 0x00100000,   // mml:
  tomHtmlMath                = 0x00200000,   // No prefix MathML (HTML5)
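A short sketch of selecting the math format follows. The constants repeat the values listed above. ApplyMathFormat is a hypothetical helper that only illustrates the assumed options-and-mask semantics (bits inside the mask are replaced, all other bits are preserved), and UseMathMLForHtml is a hypothetical wrapper around the SetMathProperties(Options, Mask) call.

```cpp
#include <cassert>

// Values from the listing above.
constexpr long kTomHtmlMathFormatMask = 0x00300000;
constexpr long kTomHtmlOMML           = 0;
constexpr long kTomHtmlMathML         = 0x00100000;
constexpr long kTomHtmlMath           = 0x00200000;

// Illustration of the assumed semantics: replace only the math-format bits
// of a property word and leave every other bit untouched.
constexpr long ApplyMathFormat(long currentProps, long format)
{
    return (currentProps & ~kTomHtmlMathFormatMask) |
           (format & kTomHtmlMathFormatMask);
}

#ifdef _WIN32
#include <windows.h>
#include <tom.h>  // ITextDocument2

// Switches HTML output to MathML, with or without the mml: prefix.
HRESULT UseMathMLForHtml(ITextDocument2* doc, bool withMmlPrefix)
{
    assert(doc != nullptr);
    return doc->SetMathProperties(
        withMmlPrefix ? kTomHtmlMathML : kTomHtmlMath,
        kTomHtmlMathFormatMask);
}
#endif
```

Passing tomHtmlMathFormatMask as the mask limits the change to the math-format bits, so other math properties of the document should be left as they are.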

