RichEdit has had limited HTML support for many years, but it wasn’t general enough to document publicly. A recent RichEdit client (to be described in a future post) needs better support, so we have been improving it. For example, we have added HTML copy/paste, images, and math (of course!) to the Microsoft Office riched20.dll. Ideally RichEdit HTML should be able to represent any property that RichEdit RTF can represent. That still wouldn’t make RichEdit a general HTML editor replete with forms and JavaScript functionality. But it would add good interoperability with Office apps, Teams, and the web, all of which use HTML as a lingua franca. This post describes the current RichEdit HTML capabilities, which are a subset of its RTF capabilities. The HTML converters are works in progress, and this post will be updated as more functionality is added. For example, RichEdit can write HTML tables, but not yet read them.
HTML copy/paste format
The “HTML format” clipboard format includes header and comment data in addition to the HTML to be copied (see https://docs.microsoft.com/en-us/windows/win32/dataxchg/html-clipboard-format#description). This info needs to be added to copy HTML between RichEdit, Word, PowerPoint, OneNote, Teams, and other apps. Frankly, having to add this info seems like overkill, since RTF can be copied and pasted without such overhead. We illustrate the format as written by RichEdit with the HTML for Einstein’s energy equation 𝐸 = 𝑚𝑐². In the HTML, OMML is the math format used by default since that’s what Word and PowerPoint expect. Here’s the HTML:
Version:1.0
StartHTML:0000000105
EndHTML:0000000844
StartFragment:0000000417
EndFragment:0000000811
<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml">
<head><style>body{font-family:Arial,sans-serif;font-size:10pt;}</style>
<style>.cf0{font-style:italic;font-family:Cambria Math;font-size:24pt;}</style></head>
<body><!--StartFragment --><p><m:oMathPara><m:oMath class="cf0">
<span class="cf0"><m:r><i>𝐸</i></m:r></span>
<span class="cf0"><m:r><i>=</i></m:r></span><span class="cf0">
<m:r><i>𝑚</i></m:r></span>
<m:sSup><m:sSupPr><m:ctrlPr></m:ctrlPr></m:sSupPr><m:e><span class="cf0">
<m:r><i>𝑐</i></m:r></span></m:e><m:sup><span class="cf0"><m:r><i>2</i>
</m:r></span></m:sup></m:sSup></m:oMath></m:oMathPara></p>
<!--EndFragment --></body></html>
Here the StartHTML entry in the header gives the byte offset of the start of the HTML, that is, of the <html> tag that immediately follows the header (the header is 105 bytes long when its five lines end in CRLF). EndHTML gives the byte offset of the end of the HTML. StartFragment gives the byte offset of the start of the text that the user selected and EndFragment gives the byte offset of the end of the selection. All four offsets are counted from the beginning of the clipboard data. In this example, the equation 𝐸 = 𝑚𝑐² is selected and displayed on its own line (display mode rather than inline mode). The start of the displayed equation is given by the OMML <m:oMathPara>. The corresponding MathML including an mml: prefix is
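To make the header arithmetic concrete, here is a minimal sketch of how such a payload could be assembled. It is not RichEdit's implementation: the function names (Pad10, MakeHeader, BuildClipboardHtml) and the bare-bones wrapper markup are hypothetical, and the fragment is assumed to be UTF-8 already. The only trick is that each offset is zero-padded to ten digits, so the header length is known before the offsets are.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Zero-pads an offset to the 10 digits used in the header shown above.
std::string Pad10(std::size_t n)
{
    std::string digits = std::to_string(n);
    assert(digits.size() <= 10);
    return std::string(10 - digits.size(), '0') + digits;
}

// Formats the five header lines; each one ends in CRLF.
std::string MakeHeader(std::size_t startHtml, std::size_t endHtml,
                       std::size_t startFragment, std::size_t endFragment)
{
    return "Version:1.0\r\nStartHTML:" + Pad10(startHtml) +
           "\r\nEndHTML:" + Pad10(endHtml) +
           "\r\nStartFragment:" + Pad10(startFragment) +
           "\r\nEndFragment:" + Pad10(endFragment) + "\r\n";
}

// Wraps a UTF-8 HTML fragment in a hypothetical minimal page and prepends
// a header whose byte offsets point into the returned string.
std::string BuildClipboardHtml(const std::string& fragment)
{
    const std::string prefix = "<html><body><!--StartFragment -->";
    const std::string suffix = "<!--EndFragment --></body></html>";

    // The header has a fixed length, so measure it with placeholder zeros.
    const std::size_t startHtml     = MakeHeader(0, 0, 0, 0).size();
    const std::size_t startFragment = startHtml + prefix.size();
    const std::size_t endFragment   = startFragment + fragment.size();
    const std::size_t endHtml       = endFragment + suffix.size();

    return MakeHeader(startHtml, endHtml, startFragment, endFragment) +
           prefix + fragment + suffix;
}
```

Calling BuildClipboardHtml("<p>Hi</p>") yields a string whose StartHTML value is 105, matching the header above, and whose StartFragment/EndFragment offsets bracket exactly the text passed in.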
<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"> <mml:mi>E</mml:mi> <mml:mo>=</mml:mo> <mml:mi>m</mml:mi> <mml:msup> <mml:mi>c</mml:mi> <mml:mn>2</mml:mn></mml:msup></mml:math>
The Programming details section describes how to write HTML with OMML or MathML with and without the mml: prefix. The HTML5 standard includes MathML without a prefix. RichEdit can write and read HTML with all three math formats.
Rich text
Character formatting includes font and family, height, text and back color, weight, spacing, bold, italic, underline, strikeout, subscript, superscript, small caps, all caps and hyperlinks. Paragraph formatting includes numbered and bulleted lists, left, right, and centered alignments, and paragraph margins.
Images
RichEdit can read and write the HTML <img> element with a src attribute that has a base64 encoding of the binary image data. This is a technique used widely in Microsoft Office for HTML copy/paste. For example, the tag might begin with “<img src="data:image/png;base64,”.
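As an illustration of that encoding, the following sketch builds such a tag from raw image bytes. The helper names (Base64Encode, MakeImgTag) are hypothetical, and the tag carries only the src attribute; the attributes RichEdit actually writes may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard base64 (RFC 4648 alphabet, '=' padding).
std::string Base64Encode(const std::vector<std::uint8_t>& bytes)
{
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve((bytes.size() + 2) / 3 * 4);
    for (std::size_t i = 0; i < bytes.size(); i += 3) {
        const std::size_t remaining = bytes.size() - i;
        // Pack up to three bytes into the top 24 bits of a chunk.
        std::uint32_t chunk = static_cast<std::uint32_t>(bytes[i]) << 16;
        if (remaining > 1) chunk |= static_cast<std::uint32_t>(bytes[i + 1]) << 8;
        if (remaining > 2) chunk |= bytes[i + 2];
        // Emit four 6-bit groups, padding where input bytes were missing.
        out += kAlphabet[(chunk >> 18) & 0x3F];
        out += kAlphabet[(chunk >> 12) & 0x3F];
        out += remaining > 1 ? kAlphabet[(chunk >> 6) & 0x3F] : '=';
        out += remaining > 2 ? kAlphabet[chunk & 0x3F] : '=';
    }
    return out;
}

// Builds an <img> element whose src is a PNG data URI.
std::string MakeImgTag(const std::vector<std::uint8_t>& pngBytes)
{
    return "<img src=\"data:image/png;base64," + Base64Encode(pngBytes) + "\">";
}
```

Because every PNG file starts with the same signature bytes, the base64 data of a PNG image always begins with “iVBORw”, which is a handy way to recognize such tags in copied HTML.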
Programming details
HTML content can be read in and written out via messages, hot keys (Ctrl+C, Ctrl+V, Ctrl+X), and TOM methods.
Messages
A client can get HTML content by sending the EM_STREAMOUT message with wParam = SF_HTML | SF_BINARY. The SF_BINARY (0x0008) is needed to write the data in the RichEdit binary format to temporary memory and then the SF_HTML (0x00100000) writes that data out as HTML. If clipboard HTML is desired, OR the SF_CLIPBOARD (0x80000000) flag into wParam.
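A minimal sketch of that stream-out call follows. The flag constants are copied from the values quoted above under hypothetical names (kSfBinary, kSfHtml, kSfClipboard), since they may not appear in public SDK headers; the helper names are also hypothetical. The Windows-only part uses the standard EDITSTREAM callback mechanism and assumes the streamed bytes can simply be collected in a std::string.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <richedit.h>
#endif

// Values as quoted in this post (assumed, not taken from an SDK header).
constexpr std::uint32_t kSfBinary    = 0x00000008;
constexpr std::uint32_t kSfHtml      = 0x00100000;
constexpr std::uint32_t kSfClipboard = 0x80000000;

// wParam for EM_STREAMOUT: HTML via the binary format, optionally with the
// clipboard header and fragment comments.
constexpr std::uint32_t HtmlStreamOutFlags(bool forClipboard)
{
    return kSfHtml | kSfBinary | (forClipboard ? kSfClipboard : 0u);
}

#ifdef _WIN32
// EDITSTREAMCALLBACK that appends each chunk RichEdit supplies to a string.
static DWORD CALLBACK AppendToString(DWORD_PTR cookie, LPBYTE buffer,
                                     LONG cb, LONG* pcb)
{
    auto* out = reinterpret_cast<std::string*>(cookie);
    out->append(reinterpret_cast<const char*>(buffer), static_cast<size_t>(cb));
    *pcb = cb;   // report that every byte was consumed
    return 0;    // zero means success; nonzero aborts the stream
}

// Streams the control's content out as HTML bytes.
std::string GetHtml(HWND hwndRichEdit, bool forClipboard)
{
    std::string html;
    EDITSTREAM es = {};
    es.dwCookie = reinterpret_cast<DWORD_PTR>(&html);
    es.pfnCallback = AppendToString;
    SendMessage(hwndRichEdit, EM_STREAMOUT,
                HtmlStreamOutFlags(forClipboard),
                reinterpret_cast<LPARAM>(&es));
    return html;
}
#endif
```

With these values, wParam is 0x00100008 for plain HTML and 0x80100008 when the clipboard form is requested.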
A client can stream in HTML content by sending the EM_ISTREAMIN message (WM_USER + 252), which streams in using the IStream interface pointed to by the lParam instead of using the usual EDITSTREAM struct. This choice is due to the use of the Office HTML parser for input, which requires mso.dll to be loaded. Set wParam equal to 1, which signifies HTML. Currently only HTML can be streamed in using the EM_ISTREAMIN message.
Other messages that can be used are WM_COPY, WM_PASTE, WM_CUT, and EM_PASTESPECIAL, all of which are documented on the web.
TOM methods
In addition to the ITextRange::Copy() and ITextRange::Paste() methods, you can input HTML content into a range by calling ITextRange2::SetText2(tomConvertHtml, bstr), where tomConvertHtml is given by 0x00900000. Similarly, you can get the HTML content from a range by calling ITextRange2::GetText2(tomConvertHtml, pbstr).
Math format options
By default, RichEdit writes equations in HTML in the OMML format since that format is what Office apps like Word and PowerPoint expect. But it can write equations in MathML with or without an mml: prefix. The function to call to set which math format to use is ITextDocument2::SetMathProperties() with tomHtmlOMML, tomHtmlMathML, or tomHtmlMath, which are defined by
tomHtmlMathFormatMask = 0x00300000, // Mask for math-format flags
tomHtmlOMML = 0, // m:
tomHtmlMathML = 0x00100000, // mml:
tomHtmlMath = 0x00200000, // No prefix MathML (HTML5)
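The sketch below models how these values combine. It assumes SetMathProperties(Options, Mask) follows the usual mask convention, in which only the bits inside the mask are replaced; that behavior is an assumption here, and ApplyMathProperties and MathTagPrefix are hypothetical helpers that do not call RichEdit. The constants repeat the values listed above.

```cpp
#include <cassert>
#include <string>

// Values as listed in this post.
constexpr long tomHtmlMathFormatMask = 0x00300000;
constexpr long tomHtmlOMML           = 0;
constexpr long tomHtmlMathML         = 0x00100000;
constexpr long tomHtmlMath           = 0x00200000;

// Models a mask-style setter: bits outside `mask` keep their current value
// and bits inside it are taken from `options` (assumed convention).
constexpr long ApplyMathProperties(long current, long options, long mask)
{
    return (current & ~mask) | (options & mask);
}

// Returns the element prefix implied by the math-format bits.
std::string MathTagPrefix(long mathProperties)
{
    switch (mathProperties & tomHtmlMathFormatMask) {
        case tomHtmlOMML:   return "m:";
        case tomHtmlMathML: return "mml:";
        case tomHtmlMath:   return "";            // HTML5 MathML, no prefix
        default:            return "(undefined)"; // 0x00300000 is not listed
    }
}
```

Under that assumption, a client holding an ITextDocument2 pointer would select prefixed MathML with a call of the form SetMathProperties(tomHtmlMathML, tomHtmlMathFormatMask), which leaves all other math properties unchanged.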