Optimizing code to darken a bitmap, part 4

Raymond Chen

Our investigation into a simple function to darken a bitmap is still trying to beat this function:

union Pixel
{
    uint8_t c[4]; // four channels: red, green, blue, alpha
    uint32_t v;   // full pixel value as a 32-bit integer
};

void darken(Pixel* first, Pixel* last, int darkness)
{
  int lightness = 256 - darkness;
  for (; first < last; ++first) {
    first->c[0] = (uint8_t)(first->c[0] * lightness / 256);
    first->c[1] = (uint8_t)(first->c[1] * lightness / 256);
    first->c[2] = (uint8_t)(first->c[2] * lightness / 256);
  }
}

We tried parallelizing the multiplication by treating a traditional 32-register as a bunch of 10-bit fields, but it turns out that all the overhead of shifting and masking ended up costing more than the savings of reducing the number of multiplications.

But instead of doing fake SIMD, let’s just do real SIMD.

First, we’ll use the x86 SIMD intrinsics. I’m going to limit myself to SSE2, since that’s the minimum requirement for x86-based Windows starting with Windows 8.

For simplicity of exposition, let’s assume that the start of the pixel array is aligned on a 16-byte boundary and the total size is a perfect multiple of 16. This avoids the hassle of dealing with the edge cases at the start and end.

void darken(Pixel* first, Pixel* last, int darkness)
{
  int lightness = 256 - darkness;
  auto lightness128 = _mm_set_epi16(
        256, lightness, lightness, lightness,
        256, lightness, lightness, lightness);
  void* end = last;
  for (auto pixels = (__m128i*)first; pixels < end; pixels++) {
    auto val = _mm_loadu_si128(pixels);
    auto vlo = _mm_unpacklo_epi8(val, _mm_setzero_si128());
    vlo = _mm_mullo_epi16(vlo, alpha128);
    vlo = _mm_srli_epi16(vlo, 8);
    auto vhi = _mm_unpackhi_epi8(val, _mm_setzero_si128());
    vhi = _mm_mullo_epi16(vhi, alpha128);
    vhi = _mm_srli_epi16(vhi, 8);
    val = _mm_packus_epi16(vlo, vhi);
    _mm_storeu_si128(pixels, val);
  }
}

First, we set up our lightness128 vector to consists of eight 16-bit lanes. The lanes corresponding to color channels get the specified lightness, and the lanes corresponding to alpha channels get a lightness of 256, which means “do not darken”.

Inside the loop, we process 16 bytes at a time, which comes out to four pixels.

First, we load the 16 bytes into an SSE2 register and call it val.

Next, we unpack the low part of the register with a register full of zeroes, putting the result into vlo. The “unpack low” instruction interleaves the low bytes of the two source registers.

source 1   A7   A6   A5   A4   A3   A2   A1   A0
source 2 B7 B6 B5 B4 B3 B2 B1 B0
 
destination D15 D14 D13 D12 D11 D10 D09 D08 D07 D06 D05 D04 D03 D02 D01 D00

In our case, the second source register is all zeroes, so it has the effect of performing a zero extension of the first eight 8-bit values (corresponding to the first two pixels) into eight 16-bit values.

    auto vlo = _mm_unpacklo_epi8(val, _mm_setzero_si128());
source 1   A7   A6   A5   A4   A3   A2   A1   A0
source 2 00 00 00 00 00 00 00 00
 
destination 00 A7 00 A6 00 A5 00 A4 00 A3 00 A2 00 A1 00 A0
 
  A7 A6 A5 A4 A3 A2 A1 A0  

Next up is the multiplication and division:

    vlo = _mm_mullo_epi16(vlo, alpha128);
    vlo = _mm_srli_epi16(vlo, 8);

We perform a parallel multiply of the 16-bit values against the values in our lightness128 register, and then we perform a parallel right-shift by 8 positions.

This combination of operations performs the newPixel = oldPixel * lightness / 256 calculation on eight values at once. Recall that we preloaded the alpha channel with a lightness value of 256, so this multiplies by 256 and the shifts right by 8, which is a net nop.

We perform the same sequence of operations on the high bytes. The only difference is that we unpack with the unpackhi flavor of the intrinsic, so that it operates on the high 8 bytes instead of the low 8 bytes, thereby performing the calculations on the last two pixels instead of the first two.

We now have the desired results in sixteen 16-bit lanes, spread over two registers. We want to collapse those 16-bit lanes back into sixteen 8-bit lanes of a single register, which we do with the pack instruction. The us suffix means that this uses unsigned saturation. The unsigned part is important, but the saturation part isn’t, since we know that the values are already in the range 0…255.

source 1   A7 A6 A5 A4 A3 A2 A1 A0
                   
destination   B7 A7 B6 A6 B5 A5 B4 A4 B3 A3 B2 A2 B1 A1 B0 A0
                   
source 2 B7 B6 B5 B4 B3 B2 B1 B0  

At each iteration of the loop, we process four pixels.

This rewrite of the loop using SIMD pays off: It’s 3.5 times faster then the non-SIMD version.

Next time, we’ll apply the same approach to the ARM version.

Bonus chatter: I tried reducing the strength of the multiplication by using the same “addition with masking” trick that I tried in the general-purpose register version. It didn’t help. The multiplication is fast enough that attempts to reduce its strength end up costing more in overhead than they do in savings by avoiding the multiplcation instruction.

9 comments

Comments are closed. Login to edit/delete your existing comments

  • MGetz 0

    Note to anyone thinking Raymond should have used AVX: AVX isn’t available on all intel processors even brand new ones. The Pentium gold/silver line notoriously doesn’t have them (despite Atom chips having it). This means that if you intend to support as many users as possible for a library that would need to be general purpose and not intended only for higher end hardware you’re limited to SSE-SSE4.1. That should support at least awhile back without being too restrictive.

    Edit: It looks like intel is finally changing this.. the new Pentium gold 7xxx series does finally support: AVX2 and nothing beyond.

  • jokoe 0

    in your code example, you store lightness in a variable named lightness128.
    But in the multiplication you use an undefine variable alpha128.

    That wont work …

    also I would reshuffle the instructions a bit so that processor can take advantage of multiple ALUs.
    auto zero = _mm_setzero_si128()
    for (auto pixels = (__m128i*)first; pixels < end; pixels++) {
    auto val = _mm_loadu_si128(pixels);
    auto vlo = _mm_unpacklo_epi8(val, zero);
    auto vhi = _mm_unpackhi_epi8(val, zero);
    vlo = _mm_mullo_epi16(vlo, lightness128);
    vhi = _mm_mullo_epi16(vhi, lightness128);
    vlo = _mm_srli_epi16(vlo, 8);
    vhi = _mm_srli_epi16(vhi, 8);
    val = _mm_packus_epi16(vlo, vhi);
    _mm_storeu_si128(pixels, val);
    }

  • jokoe 0

    this code is also a good example for a buffer-overrun:

    If your pixels are not a multiple of 4, you are overwriting some memory that dont belong to you.

    • Falcon 0

      The article specifically calls out that this is NOT the case. I’m guessing the code posted here is intended for demonstration purposes, rather than a complete solution to be copy-pasted as is.

    • Raymond ChenMicrosoft employee 0

      See the paragraph that begins “For simplicity of exposition.” The prerequisites happen to be met in my scenario because the pixels came from a DIB section, which is page-aligned and page-granular.

  • Adam Rosenfield 0

    > I’m going to limit myself to SSE2, since that’s the minimum requirement for x86-based Windows starting with Windows 8.

    There’s also the option of detecting the CPU’s features at runtime and using an SSE3/SSE4/AVX/AVX2 implementation if those are supported, but of course you still need to keep the SSE2 version for compatibility in the worst case.

    • MGetz 0

      Doesn’t actually work believe it or not, you can thank Intel for this one too. As I mentioned in my comment the PS/PG lineup has AVX turned off. However that doesn’t extend to the cpuid instruction on the skylake versions of those chips which incorrectly report they support AVX2 because the upstream part does. How this made it out I don’t know or why Intel thought it was a good idea to remove AVX2 from something that should by rights have it I’ll never know.

      • Adam Rosenfield 0

        That’s a rather baffling decision. I suppose one could always try out an AVX2 instruction, catch the illegal instruction exception and fall back to another method, but that’s a bit of a kludge.

        • MGetz 0

          As best I can tell those parts exist so Intel can sell what would otherwise be waste silicon. I’m not actually sure the volumes they move but I’ve vary rarely come across them here in the US. That doesn’t mean they don’t exist or cause problems though. The majority of atom parts also don’t have AVX but at least they correctly report it. Recently the Diablo 2 Remake had to remove AVX as default. Unfortunately IsProcessorFeaturePresent hasn’t been updated to check for anything beyond sse3 either. So you have to actually bugcheck it by doing the CPUID and then if it says AVX and Pentium… you check the known bad list.

Feedback usabilla icon