{"id":106340,"date":"2022-03-11T07:00:00","date_gmt":"2022-03-11T15:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=106340"},"modified":"2022-03-11T07:07:27","modified_gmt":"2022-03-11T15:07:27","slug":"20220311-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220311-00\/?p=106340","title":{"rendered":"Optimizing code to darken a bitmap, part 5"},"content":{"rendered":"<p>For our last trick, we&#8217;ll ARM-ify this simple function to darken a bitmap.<\/p>\n<pre>union Pixel\r\n{\r\n    uint8_t c[4]; \/\/ four channels: red, green, blue, alpha\r\n    uint32_t v;   \/\/ full pixel value as a 32-bit integer\r\n};\r\n\r\nvoid darken(Pixel* first, Pixel* last, int darkness)\r\n{\r\n  int lightness = 256 - darkness;\r\n  for (; first &lt; last; ++first) {\r\n    first-&gt;c[0] = (uint8_t)(first-&gt;c[0] * lightness \/ 256);\r\n    first-&gt;c[1] = (uint8_t)(first-&gt;c[1] * lightness \/ 256);\r\n    first-&gt;c[2] = (uint8_t)(first-&gt;c[2] * lightness \/ 256);\r\n  }\r\n}\r\n<\/pre>\n<p>The general principle is the same, but we just apply it to ARM Neon intrinsics instead of x86 SSE intrinsics.<\/p>\n<pre>void darken(Pixel* first, Pixel* last, int darkness)\r\n{\r\n  int lightness = 256 - darkness;\r\n  uint16x8_t lightness128 = vdupq_n_u16((uint16_t)lightness);\r\n  lightness128 = vsetq_lane_u16(256, lightness128, 3);\r\n  lightness128 = vsetq_lane_u16(256, lightness128, 7);\r\n  void* end = last;\r\n  for (auto pixels = (uint8_t*)first; pixels &lt; end; pixels += 16) {\r\n    uint8x16_t val = vld1q_u8(pixels);\r\n    uint8x16x2_t zipped = vzipq_u8(val, vdupq_n_u8(0));\r\n    uint16x8_t lo = vreinterpretq_u16_u8(zipped.val[0]);\r\n    lo = vmulq_u16(lo, lightness128);\r\n    auto hi = vreinterpretq_u16_u8(zipped.val[1]);\r\n    hi = vmulq_u16(hi, lightness128);\r\n    val = vuzpq_u8(vreinterpretq_u8_u16(lo), vreinterpretq_u8_u16(hi)).val[1];\r\n    vst1q_u8(pixels, val);\r\n  
}\r\n}\r\n<\/pre>\n<p>We want to set up our <code>lightness128<\/code> vector to consist of 8 lanes of 16-bit values, where the lanes corresponding to color channels get the specified lightness, and the lanes corresponding to alpha channels get a lightness of 256, which means &#8220;do not darken&#8221;. The quickest way I found to do this (not that I looked very hard) is to broadcast the <code>lightness<\/code> value to all the lanes, and then set lanes 3 and 7 explicitly to 256.<\/p>\n<p>Inside the loop, we process 16 bytes at a time, which comes out to four pixels.<\/p>\n<p>First, we load the 16 bytes into a Neon register and call it <code>val<\/code>.<\/p>\n<p>Next, we <i>zip<\/i> the 16-byte register with a register full of zeroes. The zip intrinsic produces two interleaved results: The element in <code>val[0]<\/code> contains the interleaved low bytes, and the element in <code>val[1]<\/code> contains the interleaved high bytes.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: center; font-size: 80%;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td>source 0<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A15<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A14<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A13<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A12<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A11<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A10<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A9<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A8<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A7<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A6<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A5<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A4<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A3<\/td>\n<td style=\"border: solid 1px black; width: 
4ex;\">A2<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A1<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A0<\/td>\n<\/tr>\n<tr>\n<td>source 1<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B15<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B14<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B13<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B12<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B11<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B10<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B9<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B8<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B7<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B6<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B5<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B4<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B3<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B2<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B1<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B0<\/td>\n<\/tr>\n<tr>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>val[0]<\/code><\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B7<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A7<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B6<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A6<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B5<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A5<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B4<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A4<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B3<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A3<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B2<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A2<\/td>\n<td style=\"border: solid 1px 
black; width: 4ex;\">B1<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A1<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B0<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A0<\/td>\n<\/tr>\n<tr>\n<td><code>val[1]<\/code><\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B15<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A15<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B14<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A14<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B13<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A13<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B12<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A12<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B11<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A11<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B10<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A10<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B9<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A9<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B8<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A8<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>If you aren&#8217;t interested in one of the results, you can just ignore it, and the optimizer will remove that calculation from the code generation. In our case, we use both parts, though, so that optimization doesn&#8217;t come into play.<\/p>\n<p>Zipping with zero has the effect of zero-extending all the lanes from 8-bit to 16-bit, once you reinterpret the values as eight 16-bit lanes rather than sixteen 8-bit lanes.<\/p>\n<p>For the low part, we take the first zipped-up value, reinterpret it as eight 16-bit lanes, and then perform a parallel multiply with our <code>lightness128<\/code> vector. 
Our x86 version took the result of the multiplication and shifted it right by 8 positions at this point, in order to divide by 256 and put the values in the right place to be combined with a pack instruction. For Neon, however, we&#8217;ll leave the values in the odd bytes for reasons we&#8217;ll see later.<\/p>\n<p>We perform the same set of calculations on the high part, again leaving the values in the odd bytes of the result.<\/p>\n<p>The final step is combining the results, which we do with the <i>unzip<\/i> instruction. This takes the even-numbered lanes from the first source register and puts them in the low-order lanes of the first destination. And it takes the even-numbered lanes from the second source register and puts them in the high-order lanes of the first destination. The odd-numbered lanes are collected and placed in the second destination. If you just relabel the bytes, you&#8217;ll see that this is the inverse of the <i>zip<\/i> instruction, hence its name.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: center; font-size: 80%;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td>source 0<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A15<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A14<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A13<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A12<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A11<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A10<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A9<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A8<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A7<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A6<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A5<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A4<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A3<\/td>\n<td style=\"border: solid 
1px black; width: 4ex;\">A2<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A1<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A0<\/td>\n<td><code>val[0]<\/code><\/td>\n<\/tr>\n<tr>\n<td>source 1<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B15<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B14<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B13<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B12<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B11<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B10<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B9<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B8<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B7<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B6<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B5<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B4<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B3<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B2<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B1<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B0<\/td>\n<td><code>val[1]<\/code><\/td>\n<\/tr>\n<tr>\n<td>\u2193zip\u2193<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>\u2191unzip\u2191<\/td>\n<\/tr>\n<tr>\n<td><code>val[0]<\/code><\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B7<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A7<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B6<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A6<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B5<\/td>\n<td style=\"border: solid 1px black; width: 
A5<\/td>">
4ex;\">A5<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B4<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A4<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B3<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A3<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B2<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A2<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B1<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A1<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B0<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A0<\/td>\n<td>source 0<\/td>\n<\/tr>\n<tr>\n<td><code>val[1]<\/code><\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B15<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A15<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B14<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A14<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B13<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A13<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B12<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A12<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B11<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A11<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B10<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A10<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B9<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A9<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">B8<\/td>\n<td style=\"border: solid 1px black; width: 4ex;\">A8<\/td>\n<td>source 1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>It is that second result that&#8217;s interesting to us: Each 16-bit lane consists of An in the low byte and Bn in the high byte, and we want to shift right 8 and save the low byte. 
&#8220;Shift right 8 and save the low byte&#8221; is the same as &#8220;extract the high byte&#8221;, which in our case is Bn. In other words, we want to save all the Bn bytes and pack them into a single register. Everything is set up perfectly for an unzip, and our result is in the unzip output <code>val[1]<\/code>, which we store to memory.<\/p>\n<p>I don&#8217;t have easy access to an AArch64 system for performance testing,\u00b9 so I can&#8217;t say how much faster this is than the original version, but I suspect it gives a speed-up comparable to that of the x86 version.<\/p>\n<p>\u00b9 So how do I know this code even works? I put this function into a test program and sent it to a colleague who does have an AArch64 system. He ran the program and sent me a screenshot of the output, and I visually confirmed that it looked correct. It took me back to the days of punch card programming, where you submitted your deck to the machine operator and then came back later for the results. Faced with such slow turnaround times, you double- and sometimes triple-checked that your program was exactly what you wanted, because if you messed up, you had to go through the whole cycle again.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Once more, with ARM feeling.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-106340","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Once more, with ARM 
feeling.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106340","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=106340"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106340\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=106340"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=106340"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=106340"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}