{"id":43453,"date":"2014-12-08T07:00:00","date_gmt":"2014-12-08T07:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2014\/12\/08\/creating-double-precision-integer-multiplication-with-a-quad-precision-result-from-single-precision-multiplication-with-a-double-precision-result\/"},"modified":"2022-11-14T17:45:01","modified_gmt":"2022-11-15T01:45:01","slug":"creating-double-precision-integer-multiplication-with-a-quad-precision-result-from-single-precision-multiplication-with-a-double-precision-result","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20141208-00\/?p=43453","title":{"rendered":"Creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result"},"content":{"rendered":"<p>Suppose you want to multiply two double-word values producing a quad-word result, but your processor supports only single-word multiplication with a double-word result. For concreteness, let&#8217;s say that your processor supports 32 \u00d7 32 \u2192 64 multiplication and you want to implement 64 \u00d7 64 \u2192 128 multiplication. (Sound like any processor you know?)<\/p>\n<p>Oh boy, let&#8217;s do some high school algebra. Let&#8217;s start with unsigned multiplication.<\/p>\n<p>Let <var>x<\/var> = <var>A<\/var> \u00d7 2\u00b3\u00b2 + <var>B<\/var> and <var>y<\/var> = <var>C<\/var> \u00d7 2\u00b3\u00b2 + <var>D<\/var>, where <var>A<\/var>, <var>B<\/var>, <var>C<\/var>, and <var>D<\/var> are all in the range 0 \u2026 2\u00b3\u00b2 \u2212 1.<\/p>\n<table style=\"text-align: center;\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td align=\"right\" valign=\"baseline\"><var>x <\/var>\u00d7 <var>y<\/var> =\u00a0<\/td>\n<td colspan=\"3\" valign=\"baseline\"><var>A<\/var><var>C<\/var> \u00d7 2<sup>64<\/sup> + (<var>A<\/var><var>D<\/var> + <var>B<\/var><var>C<\/var>) \u00d7 2<sup>32<\/sup> + <var>B<\/var><var>D<\/var><\/td>\n<\/tr>\n<tr>\n<td align=\"right\" valign=\"baseline\">=\u00a0<\/td>\n<td valign=\"baseline\"><var>A<\/var><var>C<\/var> \u00d7 2<sup>64<\/sup> + <var>B<\/var><var>D<\/var><\/td>\n<td valign=\"baseline\">\u00a0+\u00a0<\/td>\n<td valign=\"baseline\">(<var>A<\/var><var>D<\/var> + <var>B<\/var><var>C<\/var>) \u00d7 2<sup>32<\/sup><\/td>\n<\/tr>\n<tr>\n<td valign=\"baseline\">\u00a0<\/td>\n<td style=\"border-top: solid black 1px;\" valign=\"baseline\">provisional result<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border-top: solid black 1px;\" valign=\"baseline\">cross-terms<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Each of the multiplications (not counting the power-of-two multiplications) is a 32 \u00d7 32 \u2192 64 multiplication, so they are something we have as a building block. And the basic implementation is simply to perform the four multiplications and add the pieces together. But if you have SSE, you can perform two multiplies in a single instruction.<\/p>\n<pre>    \/\/ Prepare our source registers\r\n    movq xmm0, x         \/\/ xmm0 = { 0, 0, A, B } = { *, *, A, B }\r\n    movq xmm1, y         \/\/ xmm1 = { 0, 0, C, D } = { *, *, C, D }\r\n    punpckldq xmm0, xmm0 \/\/ xmm0 = { A, A, B, B } = { *, A, *, B }\r\n    punpckldq xmm1, xmm1 \/\/ xmm1 = { C, C, D, D } = { *, C, *, D }\r\n    pshufd xmm2, xmm1, _MM_SHUFFLE(2, 0, 3, 1)\r\n                         \/\/ xmm2 = { D, D, C, C } = { *, D, *, C }\r\n<\/pre>\n<p>The <code>PMULUDQ<\/code> instruction multiplies 32-bit lanes 0 and 2 of its source and destination registers, producing 64-bit results. The values in lanes 1 and 3 do not participate in the multiplication, so it doesn&#8217;t matter what we put there. It so happens that the <code>PUNPCKLDQ<\/code> instruction duplicates the value, but we really don&#8217;t care. I used <code>*<\/code> to represent a don&#8217;t-care value.<\/p>\n<pre>    pmuludq xmm1, xmm0 \/\/ xmm1 = { AC, BD } \/\/ provisional result\r\n    pmuludq xmm2, xmm0 \/\/ xmm2 = { AD, BC } \/\/ cross-terms\r\n<\/pre>\n<p>In two <code>PMULUDQ<\/code> instructions, we created the provisional result and the cross-terms. Now we just need to add the cross-terms to the provisional result. Unfortunately, SSE does not have a 128-bit addition (or at least SSE2 doesn&#8217;t; who knows what they&#8217;ll add in the future), so we need to do that the old-fashioned way.<\/p>\n<pre>    movdqa result, xmm1\r\n    movdqa crossterms, xmm2\r\n    mov    eax, crossterms[0]\r\n    mov    edx, crossterms[4] \/\/ edx:eax = BC\r\n    add    result[4], eax\r\n    adc    result[8], edx\r\n    adc    result[12], 0      \/\/ add the first cross-term\r\n    mov    eax, crossterms[8]\r\n    mov    edx, crossterms[12] \/\/ edx:eax = AD\r\n    add    result[4], eax\r\n    adc    result[8], edx\r\n    adc    result[12], 0      \/\/ add the second cross-term\r\n<\/pre>\n<p>There we go, a 64 \u00d7 64 \u2192 128 multiply constructed from 32 \u00d7 32 \u2192 64 multiplies.<\/p>\n<p><b>Exercise<\/b>: Why didn&#8217;t I use the <code>rax<\/code> register to perform the 64-bit addition? (This is sort of a trick question.)<\/p>\n<p>That calculates an unsigned multiplication, but how do we do a signed multiplication? Let&#8217;s work modulo 2<sup>128<\/sup> so that signed and unsigned multiplication are equivalent. This means that we need to expand <var>x<\/var> and <var>y<\/var> to 128-bit values <var>X<\/var> and <var>Y<\/var>.<\/p>\n<p>Let <var>s<\/var> = the sign bit of <var>x<\/var> expanded to a 64-bit value, and similarly <var>t<\/var> = the sign bit of <var>y<\/var> expanded to a 64-bit value. In other words, <var>s<\/var> is <code>0xFFFFFFFF`FFFFFFFF<\/code> if <var>x<\/var> &lt; 0 and zero if <var>x<\/var> \u2265 0.<\/p>\n<p>The 128-bit values being multiplied are<\/p>\n<table border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td valign=\"baseline\"><var>X<\/var> =\u00a0<\/td>\n<td valign=\"baseline\"><var>s<\/var> \u00d7 2<sup>64<\/sup> + <var>x<\/var><\/td>\n<\/tr>\n<tr>\n<td valign=\"baseline\"><var>Y<\/var> =\u00a0<\/td>\n<td valign=\"baseline\"><var>t<\/var> \u00d7 2<sup>64<\/sup> + <var>y<\/var><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The product is therefore<\/p>\n<table style=\"text-align: center;\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td align=\"right\" valign=\"baseline\"><var>X <\/var>\u00d7 <var>Y<\/var> =\u00a0<\/td>\n<td valign=\"baseline\"><var>s<\/var><var>t<\/var> \u00d7 2<sup>128<\/sup>\u00a0<\/td>\n<td valign=\"baseline\">+\u00a0<\/td>\n<td valign=\"baseline\">(<var>s<\/var><var>y<\/var> + <var>t<\/var><var>x<\/var>) \u00d7 2<sup>64<\/sup>\u00a0<\/td>\n<td valign=\"baseline\">\u00a0+\u00a0<\/td>\n<td valign=\"baseline\"><var>x<\/var><var>y<\/var><\/td>\n<\/tr>\n<tr>\n<td>&nbsp;<\/td>\n<td style=\"border-top: solid black 1px;\">zero<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border-top: solid black 1px;\">adjustment<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border-top: solid black 1px; text-align: left;\" nowrap=\"nowrap\">unsigned product<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The first term is zero because it overflows the 128-bit result. That leaves the second term as the adjustment, which simplifies to &#8220;If <var>x &lt; 0<\/var> then subtract <var>y<\/var> from the high 64 bits; if <var>y &lt; 0<\/var> then subtract <var>x<\/var> from the high 64 bits.&#8221;<\/p>\n<pre>    if (x &lt; 0) result.m128i_u64[1] -= y;\r\n    if (y &lt; 0) result.m128i_u64[1] -= x;\r\n<\/pre>\n<p>If we were still playing with SSE, we could compute this as follows:<\/p>\n<pre>    movdqa xmm0, result   \/\/ xmm0 = { high, low }\r\n    movq   xmm1, x        \/\/ xmm1 = { 0, x }\r\n    movq   xmm2, y        \/\/ xmm2 = { 0, y }\r\n    pshufd xmm3, xmm1, _MM_SHUFFLE(1, 1, 3, 2) \/\/ xmm3 = { xhi, xhi, 0, 0 }\r\n    pshufd xmm1, xmm1, _MM_SHUFFLE(1, 0, 3, 2) \/\/ xmm1 = { x, 0 }\r\n    pshufd xmm4, xmm2, _MM_SHUFFLE(1, 1, 3, 2) \/\/ xmm4 = { yhi, yhi, 0, 0 }\r\n    pshufd xmm2, xmm2, _MM_SHUFFLE(1, 0, 3, 2) \/\/ xmm2 = { y, 0 }\r\n    psrad  xmm3, 31       \/\/ xmm3 = { s, s, 0, 0 } = { s, 0 }\r\n    psrad  xmm4, 31       \/\/ xmm4 = { t, t, 0, 0 } = { t, 0 }\r\n    pand   xmm3, xmm2     \/\/ xmm3 = { x &lt; 0 ? y : 0, 0 }\r\n    pand   xmm4, xmm1     \/\/ xmm4 = { y &lt; 0 ? x : 0, 0 }\r\n    psubq  xmm0, xmm3     \/\/ first adjustment\r\n    psubq  xmm0, xmm4     \/\/ second adjustment\r\n    movdqa result, xmm0   \/\/ update result\r\n<\/pre>\n<p>The code is a bit strange because SSE2 doesn&#8217;t have a full set of 64-bit integer opcodes. We would have liked to have used a <code>psraq<\/code> instruction to fill a 64-bit field with a sign bit. But there is no such instruction, so instead we duplicate the 64-bit sign bit into two 32-bit sign bits (one in lane 2 and one in lane 3) and then fill the lanes with that bit using <code>psrad<\/code>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Suppose you want to multiply two double-word values producing a quad-word result, but your processor supports only single-word multiplication with a double-word result. For concreteness, let&#8217;s say that your processor supports 32 \u00d7 32 \u2192 64 multiplication and you want to implement 64 \u00d7 64 \u2192 128 multiplication. (Sound like any processor you know?) Oh [&hellip;]<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-43453","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Suppose you want to multiply two double-word values producing a quad-word result, but your processor supports only single-word multiplication with a double-word result. For concreteness, let&#8217;s say that your processor supports 32 \u00d7 32 \u2192 64 multiplication and you want to implement 64 \u00d7 64 \u2192 128 multiplication. (Sound like any processor you know?) Oh [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/43453","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=43453"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/43453\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=43453"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=43453"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=43453"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}