March 10th, 2022

Optimizing code to darken a bitmap, part 4

Raymond Chen

Our investigation into a simple function to darken a bitmap is still trying to beat this function:

union Pixel
{
    uint8_t c[4]; // four channels: red, green, blue, alpha
    uint32_t v;   // full pixel value as a 32-bit integer
};

void darken(Pixel* first, Pixel* last, int darkness)
{
  int lightness = 256 - darkness;
  for (; first < last; ++first) {
    first->c[0] = (uint8_t)(first->c[0] * lightness / 256);
    first->c[1] = (uint8_t)(first->c[1] * lightness / 256);
    first->c[2] = (uint8_t)(first->c[2] * lightness / 256);
  }
}

We tried parallelizing the multiplication by treating a traditional 32-bit register as a bunch of 10-bit fields, but it turns out that all the overhead of shifting and masking ended up costing more than the savings of reducing the number of multiplications.

But instead of doing fake SIMD, let’s just do real SIMD.

First, we’ll use the x86 SIMD intrinsics. I’m going to limit myself to SSE2, since that’s the minimum requirement for x86-based Windows starting with Windows 8.

For simplicity of exposition, let’s assume that the start of the pixel array is aligned on a 16-byte boundary and the total size is a perfect multiple of 16. This avoids the hassle of dealing with the edge cases at the start and end.

void darken(Pixel* first, Pixel* last, int darkness)
{
  int lightness = 256 - darkness;
  // Eight 16-bit multipliers: lightness for color channels, 256 for alpha.
  auto lightness128 = _mm_set_epi16(
        256, lightness, lightness, lightness,
        256, lightness, lightness, lightness);
  void* end = last;
  for (auto pixels = (__m128i*)first; pixels < end; pixels++) {
    auto val = _mm_loadu_si128(pixels);
    // First two pixels: widen to 16 bits, multiply, divide by 256.
    auto vlo = _mm_unpacklo_epi8(val, _mm_setzero_si128());
    vlo = _mm_mullo_epi16(vlo, lightness128);
    vlo = _mm_srli_epi16(vlo, 8);
    // Last two pixels: same steps on the high 8 bytes.
    auto vhi = _mm_unpackhi_epi8(val, _mm_setzero_si128());
    vhi = _mm_mullo_epi16(vhi, lightness128);
    vhi = _mm_srli_epi16(vhi, 8);
    // Narrow back to 8-bit channels and store.
    val = _mm_packus_epi16(vlo, vhi);
    _mm_storeu_si128(pixels, val);
  }
}

First, we set up our lightness128 vector to consist of eight 16-bit lanes. The lanes corresponding to color channels get the specified lightness, and the lanes corresponding to alpha channels get a lightness of 256, which means “do not darken”.
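To see which lane ends up where, here is a small sketch that builds the same vector and copies its lanes out to memory, lowest lane first. The helper name lightness_lanes is made up for illustration, and the sketch assumes darkness is in the range 0 to 256.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

// Hypothetical helper (not from the article): returns the eight 16-bit
// lanes of the multiplier vector, lowest lane first.
std::array<uint16_t, 8> lightness_lanes(int darkness)
{
  int lightness = 256 - darkness;
  // _mm_set_epi16 lists its arguments from the highest lane down to the
  // lowest, so the last argument lands in lane 0 (the first red channel).
  __m128i v = _mm_set_epi16(
        256, lightness, lightness, lightness,
        256, lightness, lightness, lightness);
  std::array<uint16_t, 8> lanes;
  _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes.data()), v);
  return lanes;
}
```

With a darkness of 64, lanes 0 through 2 and 4 through 6 hold 192, and lanes 3 and 7 (the alpha positions) hold 256.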

Inside the loop, we process 16 bytes at a time, which comes out to four pixels.

First, we load the 16 bytes into an SSE2 register and call it val.

Next, we unpack the low part of the register with a register full of zeroes, putting the result into vlo. The “unpack low” instruction interleaves the low bytes of the two source registers.

[Diagram: the low bytes of source 1 and source 2 are interleaved into the destination, whose bytes are labeled D15 through D00.]

In our case, the second source register is all zeroes, so it has the effect of performing a zero extension of the first eight 8-bit values (corresponding to the first two pixels) into eight 16-bit values.

    auto vlo = _mm_unpacklo_epi8(val, _mm_setzero_si128());

[Diagram: source 1 interleaved with an all-zero source 2 to form the destination.]
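The interleaving, and the zero extension that falls out of it, can be checked with a small sketch. The helper name interleave_low and the Bytes16 alias are assumptions made for this example.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Hypothetical helper (not from the article): interleaves the low eight
// bytes of a and b, starting with a's lowest byte.
Bytes16 interleave_low(const Bytes16& a, const Bytes16& b)
{
  __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.data()));
  __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.data()));
  __m128i r = _mm_unpacklo_epi8(va, vb);
  Bytes16 out;
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), r);
  return out;
}
```

When b is all zeroes, every byte of a is followed by a zero byte. On little-endian x86, each such pair reads back as a 16-bit value equal to the original byte, which is the zero extension described above.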

Next up is the multiplication and division:

    vlo = _mm_mullo_epi16(vlo, lightness128);
    vlo = _mm_srli_epi16(vlo, 8);

We perform a parallel multiply of the 16-bit values against the values in our lightness128 register, and then we perform a parallel right-shift by 8 positions.

This combination of operations performs the newPixel = oldPixel * lightness / 256 calculation on eight values at once. Recall that we preloaded the alpha channel with a lightness value of 256, so this multiplies by 256 and then shifts right by 8, which is a net nop.
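Here is a sketch of just the multiply-and-shift step applied to eight already-widened values, so the arithmetic can be checked in isolation. The helper name scale_lanes and the Lanes16 alias are made up for this example. It assumes every input lane is at most 255 and darkness is in the range 0 to 256.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

using Lanes16 = std::array<uint16_t, 8>;

// Hypothetical helper (not from the article): applies
// value * lightness / 256 to eight 16-bit lanes laid out as two pixels
// (red, green, blue, alpha, red, green, blue, alpha).
Lanes16 scale_lanes(const Lanes16& lanes, int darkness)
{
  int lightness = 256 - darkness;
  __m128i lightness128 = _mm_set_epi16(
        256, lightness, lightness, lightness,
        256, lightness, lightness, lightness);
  __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lanes.data()));
  // The largest possible product is 255 * 256 = 65280, which still fits
  // in 16 bits, so keeping only the low half of each product loses nothing.
  v = _mm_mullo_epi16(v, lightness128);
  // Logical right shift by 8 is an unsigned divide by 256.
  v = _mm_srli_epi16(v, 8);
  Lanes16 out;
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
  return out;
}
```

With a darkness of 64 (lightness 192), a red value of 200 becomes 200 * 192 / 256 = 150, while an alpha of 255 becomes 255 * 256 / 256 = 255.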

We perform the same sequence of operations on the high bytes. The only difference is that we unpack with the unpackhi flavor of the intrinsic, so that it operates on the high 8 bytes instead of the low 8 bytes, thereby performing the calculations on the last two pixels instead of the first two.

We now have the desired results in sixteen 16-bit lanes, spread over two registers. We want to collapse those 16-bit lanes back into sixteen 8-bit lanes of a single register, which we do with the pack instruction. The us suffix means that this uses unsigned saturation. The unsigned part is important, but the saturation part isn’t, since we know that the values are already in the range 0…255.

[Diagram: the 16-bit lanes of source 1 and source 2 are packed into the 8-bit lanes of the destination.]
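To illustrate why the unsigned part matters, this sketch packs the same lanes both ways. The helper name pack_lanes, its use_unsigned parameter, and the two aliases are assumptions made for this example. _mm_packs_epi16 is the signed-saturation counterpart of _mm_packus_epi16.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

using Lanes16 = std::array<uint16_t, 8>;
using Bytes16 = std::array<uint8_t, 16>;

// Hypothetical helper (not from the article): narrows sixteen 16-bit lanes
// to sixteen bytes. Pass use_unsigned = false to see signed saturation.
Bytes16 pack_lanes(const Lanes16& lo, const Lanes16& hi, bool use_unsigned = true)
{
  __m128i vlo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lo.data()));
  __m128i vhi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hi.data()));
  __m128i packed = use_unsigned
      ? _mm_packus_epi16(vlo, vhi)   // clamps each lane to 0..255
      : _mm_packs_epi16(vlo, vhi);   // clamps each lane to -128..127
  Bytes16 out;
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), packed);
  return out;
}
```

With unsigned saturation, a lane holding 200 comes through as 200. With signed saturation it would be clamped to 127, which would wreck any channel brighter than half intensity.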

At each iteration of the loop, we process four pixels.
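Because each iteration consumes four pixels, the loop leans on the earlier assumption that the total size is a multiple of 16 bytes. If that assumption doesn't hold, one possible approach is to run the SIMD loop over whole groups of four pixels and finish the leftovers with the scalar formula. This is only a sketch: the name darken_any_size is made up, and it is not the version that was benchmarked.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

union Pixel
{
    uint8_t c[4]; // four channels: red, green, blue, alpha
    uint32_t v;   // full pixel value as a 32-bit integer
};

// Hypothetical variant (not from the article): accepts any pixel count.
void darken_any_size(Pixel* first, Pixel* last, int darkness)
{
  int lightness = 256 - darkness;
  __m128i lightness128 = _mm_set_epi16(
        256, lightness, lightness, lightness,
        256, lightness, lightness, lightness);
  __m128i zero = _mm_setzero_si128();

  // Round the pixel count down to a multiple of four for the SIMD path.
  std::size_t count = static_cast<std::size_t>(last - first);
  Pixel* simd_end = first + (count & ~static_cast<std::size_t>(3));

  for (; first < simd_end; first += 4) {
    auto p = reinterpret_cast<__m128i*>(first);
    __m128i val = _mm_loadu_si128(p);   // unaligned load, so no alignment needed
    __m128i vlo = _mm_unpacklo_epi8(val, zero);
    __m128i vhi = _mm_unpackhi_epi8(val, zero);
    vlo = _mm_srli_epi16(_mm_mullo_epi16(vlo, lightness128), 8);
    vhi = _mm_srli_epi16(_mm_mullo_epi16(vhi, lightness128), 8);
    _mm_storeu_si128(p, _mm_packus_epi16(vlo, vhi));
  }

  // At most three pixels remain; handle them one channel at a time.
  for (; first < last; ++first) {
    first->c[0] = (uint8_t)(first->c[0] * lightness / 256);
    first->c[1] = (uint8_t)(first->c[1] * lightness / 256);
    first->c[2] = (uint8_t)(first->c[2] * lightness / 256);
  }
}
```

The scalar tail never touches memory past last, at the cost of one extra loop that runs at most three times.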

This rewrite of the loop using SIMD pays off: It’s 3.5 times faster than the non-SIMD version.

Next time, we’ll apply the same approach to the ARM version.

Bonus chatter: I tried reducing the strength of the multiplication by using the same “addition with masking” trick that I tried in the general-purpose register version. It didn’t help. The multiplication is fast enough that attempts to reduce its strength end up costing more in overhead than they save by avoiding the multiplication instruction.


9 comments


Adam Rosenfield March 11, 2022

> I’m going to limit myself to SSE2, since that’s the minimum requirement for x86-based Windows starting with Windows 8.

There’s also the option of detecting the CPU’s features at runtime and using an SSE3/SSE4/AVX/AVX2 implementation if those are supported, but of course you still need to keep the SSE2 version for compatibility in the worst case.
- MGetz March 14, 2022
  
  Doesn’t actually work believe it or not, you can thank Intel for this one too. As I mentioned in my comment the PS/PG lineup has AVX turned off. However that doesn’t extend to the cpuid instruction on the skylake versions of those chips which incorrectly report they support AVX2 because the upstream part does. How this made it out I don’t know or why Intel thought it was a good idea to remove AVX2 from something that should by rights have it I’ll never know.
  - Adam Rosenfield March 14, 2022
    
    That’s a rather baffling decision. I suppose one could always try out an AVX2 instruction, catch the illegal instruction exception and fall back to another method, but that’s a bit of a kludge.
  - MGetz March 14, 2022
    
    As best I can tell those parts exist so Intel can sell what would otherwise be waste silicon. I’m not actually sure of the volumes they move, but I’ve very rarely come across them here in the US. That doesn’t mean they don’t exist or cause problems though. The majority of Atom parts also don’t have AVX but at least they correctly report it. Recently the Diablo 2 Remake had to remove AVX as default. Unfortunately IsProcessorFeaturePresent hasn’t been updated to check for anything beyond SSE3 either. So you have to actually bugcheck it by doing the CPUID and then if it says AVX and Pentium… you check the known bad list.
jokoe March 10, 2022

this code is also a good example of a buffer overrun:

If your pixel count is not a multiple of 4, you are overwriting some memory that doesn't belong to you.
- Raymond Chen Author March 11, 2022
  
  See the paragraph that begins “For simplicity of exposition.” The prerequisites happen to be met in my scenario because the pixels came from a DIB section, which is page-aligned and page-granular.
- Falcon March 10, 2022
  
  The article specifically calls out that this is NOT the case. I’m guessing the code posted here is intended for demonstration purposes, rather than a complete solution to be copy-pasted as is.
jokoe March 10, 2022

in your code example, you store lightness in a variable named lightness128.
But in the multiplication you use an undefined variable alpha128.

That won't work …

also I would reshuffle the instructions a bit so that the processor can take advantage of multiple ALUs.
auto zero = _mm_setzero_si128();
for (auto pixels = (__m128i*)first; pixels < end; pixels++) {
  auto val = _mm_loadu_si128(pixels);
  auto vlo = _mm_unpacklo_epi8(val, zero);
  auto vhi = _mm_unpackhi_epi8(val, zero);
  vlo = _mm_mullo_epi16(vlo, lightness128);
  vhi = _mm_mullo_epi16(vhi, lightness128);
  vlo = _mm_srli_epi16(vlo, 8);
  vhi = _mm_srli_epi16(vhi, 8);
  val = _mm_packus_epi16(vlo, vhi);
  _mm_storeu_si128(pixels, val);
}
MGetz March 10, 2022 · Edited

Note to anyone thinking Raymond should have used AVX: AVX isn't available on all Intel processors, even brand new ones. The Pentium Gold/Silver line notoriously doesn't have it (despite Atom chips having it). This means that if you intend to support as many users as possible with a general-purpose library that isn't aimed only at higher-end hardware, you're limited to SSE through SSE4.1. That should cover hardware going back a while without being too restrictive.

Edit: It looks like Intel is finally changing this. The new Pentium Gold 7xxx series does finally support AVX2, and nothing beyond.
