{"id":106335,"date":"2022-03-10T07:00:00","date_gmt":"2022-03-10T15:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=106335"},"modified":"2022-03-10T07:21:15","modified_gmt":"2022-03-10T15:21:15","slug":"20220310-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220310-00\/?p=106335","title":{"rendered":"Optimizing code to darken a bitmap, part 4"},"content":{"rendered":"<p>Our investigation into a simple function to darken a bitmap is still trying to beat this function:<\/p>\n<pre>union Pixel\r\n{\r\n    uint8_t c[4]; \/\/ four channels: red, green, blue, alpha\r\n    uint32_t v;   \/\/ full pixel value as a 32-bit integer\r\n};\r\n\r\nvoid darken(Pixel* first, Pixel* last, int darkness)\r\n{\r\n  int lightness = 256 - darkness;\r\n  for (; first &lt; last; ++first) {\r\n    first-&gt;c[0] = (uint8_t)(first-&gt;c[0] * lightness \/ 256);\r\n    first-&gt;c[1] = (uint8_t)(first-&gt;c[1] * lightness \/ 256);\r\n    first-&gt;c[2] = (uint8_t)(first-&gt;c[2] * lightness \/ 256);\r\n  }\r\n}\r\n<\/pre>\n<p>We tried parallelizing the multiplication by treating a traditional 32-register as a bunch of 10-bit fields, but it turns out that all the overhead of shifting and masking ended up costing more than the savings of reducing the number of multiplications.<\/p>\n<p>But instead of doing fake SIMD, let&#8217;s just do real SIMD.<\/p>\n<p>First, we&#8217;ll use the x86 SIMD intrinsics. I&#8217;m going to limit myself to SSE2, since that&#8217;s the minimum requirement for x86-based Windows starting with Windows\u00a08.<\/p>\n<p>For simplicity of exposition, let&#8217;s assume that the start of the pixel array is aligned on a 16-byte boundary and the total size is a perfect multiple of 16. This avoids the hassle of dealing with the edge cases at the start and end.<\/p>\n<pre>void darken(Pixel* first, Pixel* last, int darkness)\r\n{\r\n  int lightness = 256 - darkness;\r\n  auto lightness128 = _mm_set_epi16(\r\n        256, lightness, lightness, lightness,\r\n        256, lightness, lightness, lightness);\r\n  void* end = last;\r\n  for (auto pixels = (__m128i*)first; pixels &lt; end; pixels++) {\r\n    auto val = _mm_loadu_si128(pixels);\r\n    auto vlo = _mm_unpacklo_epi8(val, _mm_setzero_si128());\r\n    vlo = _mm_mullo_epi16(vlo, alpha128);\r\n    vlo = _mm_srli_epi16(vlo, 8);\r\n    auto vhi = _mm_unpackhi_epi8(val, _mm_setzero_si128());\r\n    vhi = _mm_mullo_epi16(vhi, alpha128);\r\n    vhi = _mm_srli_epi16(vhi, 8);\r\n    val = _mm_packus_epi16(vlo, vhi);\r\n    _mm_storeu_si128(pixels, val);\r\n  }\r\n}\r\n<\/pre>\n<p>First, we set up our <code>lightness128<\/code> vector to consists of eight 16-bit lanes. The lanes corresponding to color channels get the specified lightness, and the lanes corresponding to alpha channels get a lightness of 256, which means &#8220;do not darken&#8221;.<\/p>\n<p>Inside the loop, we process 16 bytes at a time, which comes out to four pixels.<\/p>\n<p>First, we load the 16 bytes into an SSE2 register and call it <code>val<\/code>.<\/p>\n<p>Next, we unpack the low part of the register with a register full of zeroes, putting the result into <code>vlo<\/code>. The &#8220;unpack low&#8221; instruction interleaves the low bytes of the two source registers.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: center;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td>source 1<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A7<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A6<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A5<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A4<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A3<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A2<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A1<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A0<\/td>\n<\/tr>\n<tr>\n<td>source 2<\/td>\n<td style=\"border: solid 1px black;\">B7<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">B6<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">B5<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">B4<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">B3<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">B2<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">B1<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">B0<\/td>\n<td>\u2193<\/td>\n<\/tr>\n<tr>\n<td>&nbsp;<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<\/tr>\n<tr>\n<td>destination<\/td>\n<td style=\"border: solid 1px black;\">D15<\/td>\n<td style=\"border: solid 1px black;\">D14<\/td>\n<td style=\"border: solid 1px black;\">D13<\/td>\n<td style=\"border: solid 1px black;\">D12<\/td>\n<td style=\"border: solid 1px black;\">D11<\/td>\n<td style=\"border: solid 1px black;\">D10<\/td>\n<td style=\"border: solid 1px black;\">D09<\/td>\n<td style=\"border: solid 1px black;\">D08<\/td>\n<td style=\"border: solid 1px black;\">D07<\/td>\n<td style=\"border: solid 1px black;\">D06<\/td>\n<td style=\"border: solid 1px black;\">D05<\/td>\n<td style=\"border: solid 1px black;\">D04<\/td>\n<td style=\"border: solid 1px black;\">D03<\/td>\n<td style=\"border: solid 1px black;\">D02<\/td>\n<td style=\"border: solid 1px black;\">D01<\/td>\n<td style=\"border: solid 1px black;\">D00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In our case, the second source register is all zeroes, so it has the effect of performing a zero extension of the first eight 8-bit values (corresponding to the first two pixels) into eight 16-bit values.<\/p>\n<pre>    auto vlo = _mm_unpacklo_epi8(val, _mm_setzero_si128());\r\n<\/pre>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: center;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td>source 1<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A7<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A6<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A5<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A4<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A3<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A2<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A1<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\">A0<\/td>\n<\/tr>\n<tr>\n<td>source 2<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td>\u2193<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td>\u2193<\/td>\n<\/tr>\n<tr>\n<td>&nbsp;<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<td>\u2193<\/td>\n<\/tr>\n<tr>\n<td>destination<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td style=\"border: solid 1px black;\">A7<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td style=\"border: solid 1px black;\">A6<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td style=\"border: solid 1px black;\">A5<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td style=\"border: solid 1px black;\">A4<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td style=\"border: solid 1px black;\">A3<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td style=\"border: solid 1px black;\">A2<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td style=\"border: solid 1px black;\">A1<\/td>\n<td style=\"border: solid 1px black;\">00<\/td>\n<td style=\"border: solid 1px black;\">A0<\/td>\n<\/tr>\n<tr>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A7<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A6<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A5<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A4<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A3<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A2<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A1<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A0<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Next up is the multiplication and division:<\/p>\n<pre>    vlo = _mm_mullo_epi16(vlo, alpha128);\r\n    vlo = _mm_srli_epi16(vlo, 8);\r\n<\/pre>\n<p>We perform a parallel multiply of the 16-bit values against the values in our <code>lightness128<\/code> register, and then we perform a parallel right-shift by 8 positions.<\/p>\n<p>This combination of operations performs the <code>newPixel = oldPixel * lightness \/ 256<\/code> calculation on eight values at once. Recall that we preloaded the alpha channel with a lightness value of 256, so this multiplies by 256 and the shifts right by 8, which is a net nop.<\/p>\n<p>We perform the same sequence of operations on the high bytes. The only difference is that we unpack with the <code>unpackhi<\/code> flavor of the intrinsic, so that it operates on the high 8 bytes instead of the low 8 bytes, thereby performing the calculations on the last two pixels instead of the first two.<\/p>\n<p>We now have the desired results in sixteen 16-bit lanes, spread over two registers. We want to collapse those 16-bit lanes back into sixteen 8-bit lanes of a single register, which we do with the <code>pack<\/code> instruction. The <code>us<\/code> suffix means that this uses unsigned saturation. The unsigned part is important, but the saturation part isn&#8217;t, since we know that the values are already in the range 0\u2026255.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: center;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td>source 1<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A7<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A6<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A5<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A4<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A3<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A2<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A1<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">A0<\/td>\n<\/tr>\n<tr>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>\u2193<\/td>\n<td>&nbsp;<\/td>\n<td>\u2193<\/td>\n<td>&nbsp;<\/td>\n<td>\u2193<\/td>\n<td>&nbsp;<\/td>\n<td>\u2193<\/td>\n<td>&nbsp;<\/td>\n<td>\u2193<\/td>\n<td>&nbsp;<\/td>\n<td>\u2193<\/td>\n<td>&nbsp;<\/td>\n<td>\u2193<\/td>\n<td>&nbsp;<\/td>\n<td>\u2193<\/td>\n<\/tr>\n<tr>\n<td>destination<\/td>\n<td style=\"width: 3ex;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; width: 3ex;\">B7<\/td>\n<td style=\"border: solid 1px black;\">A7<\/td>\n<td style=\"border: solid 1px black;\">B6<\/td>\n<td style=\"border: solid 1px black;\">A6<\/td>\n<td style=\"border: solid 1px black;\">B5<\/td>\n<td style=\"border: solid 1px black;\">A5<\/td>\n<td style=\"border: solid 1px black;\">B4<\/td>\n<td style=\"border: solid 1px black;\">A4<\/td>\n<td style=\"border: solid 1px black;\">B3<\/td>\n<td style=\"border: solid 1px black;\">A3<\/td>\n<td style=\"border: solid 1px black;\">B2<\/td>\n<td style=\"border: solid 1px black;\">A2<\/td>\n<td style=\"border: solid 1px black;\">B1<\/td>\n<td style=\"border: solid 1px black;\">A1<\/td>\n<td style=\"border: solid 1px black;\">B0<\/td>\n<td style=\"border: solid 1px black;\">A0<\/td>\n<\/tr>\n<tr>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>\u2191<\/td>\n<td>&nbsp;<\/td>\n<td>\u2191<\/td>\n<td>&nbsp;<\/td>\n<td>\u2191<\/td>\n<td>&nbsp;<\/td>\n<td>\u2191<\/td>\n<td>&nbsp;<\/td>\n<td>\u2191<\/td>\n<td>&nbsp;<\/td>\n<td>\u2191<\/td>\n<td>&nbsp;<\/td>\n<td>\u2191<\/td>\n<td>&nbsp;<\/td>\n<td>\u2191<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td>source 2<\/td>\n<td style=\"border: solid 1px black; width: 6ex;\" colspan=\"2\">B7<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">B6<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">B5<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">B4<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">B3<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">B2<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">B1<\/td>\n<td style=\"border: solid 1px black;\" colspan=\"2\">B0<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>At each iteration of the loop, we process four pixels.<\/p>\n<p>This rewrite of the loop using SIMD pays off: It&#8217;s 3.5 times faster then the non-SIMD version.<\/p>\n<p>Next time, we&#8217;ll apply the same approach to the ARM version.<\/p>\n<p><b>Bonus chatter<\/b>: I tried reducing the strength of the multiplication by using the same &#8220;addition with masking&#8221; trick that I tried in the general-purpose register version. It didn&#8217;t help. The multiplication is fast enough that attempts to reduce its strength end up costing more in overhead than they do in savings by avoiding the multiplcation instruction.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Bringing out the big SIMD guns, for x86 at least.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-106335","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Bringing out the big SIMD guns, for x86 at least.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106335","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=106335"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106335\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=106335"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=106335"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=106335"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}