Our investigation into a simple function to darken a bitmap left off with this function:
struct Pixel { uint8_t c[4]; // four channels: red, green, blue, alpha uint32_t v; // full pixel value as a 32-bit integer }; void darken(Pixel* first, Pixel* last, int darkness) { int lightness = 256 - darkness; for (; first < last; ++first) { first->c[0] = (uint8_t)(first->c[0] * lightness / 256); first->c[1] = (uint8_t)(first->c[1] * lightness / 256); first->c[2] = (uint8_t)(first->c[2] * lightness / 256); } }
There’s a lot of multiplication going on here, and multiplication tends to be one of the more expensive CPU instructions. Maybe we can collapse them into a single multiplication operation by running the calculations in parallel.
The idea here is to break the 32-bit integer into the respective channels, but spread them out so they act like independent lanes of a SIMD register. (This technique goes by the name SWAR, short for “SIMD within a register”.) Once they’ve been spread out into lanes, we perform a single multiplication, which acts like a parallel multiplication across all the lanes, thanks to the magic of the distributive law:
(100²a + 100b + 100c)d = 100²ad + 100bd + cd
If cd < 100, then there will be no carry into the hundreds place, and similar, if bd < 100, then there will be no carry into the ten thousands place. Under such conditions, you can pluck out the individual products by extracting the corresponding pairs of digits from the final product.
It so happens that my program uses only three darkness values: 8, 16, and 24, corresponding to lightness factors 248, 240, and 232, respectively. The application of the lightness factor can therefore be simplified to
newPixel = oldPixel * (lightness / 8) / 32;
which can be rewritten as
newPixel = oldPixel - ceil(oldPixel * (darkness / 8) / 32);
The truncation toward zero in the lightness calculation becomes a ceiling calculation when viewed as darkness.
Reinterpreting it as a darkness calculation is helpful because the value of darkness / 8
is limited to the values 1, 2, and 3. The factor is at most two bits.
Multiplying an 8-bit value with a 2-bit value produces a 10-bit result. And we can squeeze three 10-bit fields inside a 32-bit integer.
A quick mental check confirms that rounding up any fractional portion won’t take us into 11 bits:
255 × 3 < 256 × 3 = 256 × 4 − 256 < 256 × 4 − 32 = 10¹¹ − 32.
The idea therefore is that we take our three channel values and arrange them inside a 32-bit integer like this:
3 1 |
3 0 |
2 9 |
2 8 |
2 7 |
2 6 |
2 5 |
2 4 |
2 3 |
2 2 |
2 1 |
2 0 |
1 9 |
1 8 |
1 7 |
1 6 |
1 5 |
1 4 |
1 3 |
1 2 |
1 1 |
1 0 |
9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
b | g | r |
In other words,
value[ 0: 9] = r;
value[10:19] = g;
value[20:29] = b;
value[30:31]
are unused
The lanes are 10 bits wide, so multiplying the entire integer by 1, 2, or 3 will not trigger any carries across lanes.
After the multiply, we reinterpret the integer as a fixed-point value, by applying an implied scale of 32 = 2⁵. Rounding up the fixed-point value to the next integer is the same as rounding up the original integer to the next multiple of 32.
Once the multiplication and rounding is done, we extract the integer portion of the fixed-point fields (which means we take the top five bits of each field) and subtract them from our starting value.
constexpr unsigned pack_fields(uint8_t r, uint8_t g, uint8_t b) { return r | (g << 10) | (b << 20); } void darken(Pixel* first, Pixel* last, int darkness) { int factor = darkness / 8; for (; first < last; ++first) { uint32_t fields = pack_fields( first->c[0], first->c[1], first->c[2]); fields *= factor; fields += pack_fields(31, 31, 31); first->c[0] -= (fields >> 5) & 31; first->c[1] -= (fields >> 15) & 31; first->c[2] -= (fields >> 25) & 31; } }
Unfortunately, this benchmarks at 2.9× slower than the non-parallel version.
Okay, so maybe it’s the byte access that is killing us. Let’s load and store entire pixels at a time.
void darken(Pixel* first, Pixel* last, int darkness) { int factor = darkness / 8; for (; first < last; ++first) { uint32_t v = first->v; uint32_t fields = (v & 0xFF) | ((v & 0xFF00) << 2) | ((v & 0xFF0000) << 4); fields *= factor; fields += pack_fields(31, 31, 31); uint32_t diff = ((fields >> 5) & 0x1F) | ((fields >> 7) & 0x1F00) | ((fields >> 9) & 0x1F0000) | first->v = v - diff; } }
Still 2.1× slower than the non-parallel version. A tiny improvement but still way behind.
Some micro-tweaking here won’t help much: We waste our time building up the difference fields, when we can just subtract each of the pieces as we calculate them.
void darken(Pixel* first, Pixel* last, int darkness)
{
int factor = darkness / 8;
for (; first < last; ++first) {
uint32_t v = first->v;
uint32_t fields = (v & 0xFF) |
((v & 0xFF00) << 2) |
((v & 0xFF0000) << 4);
fields *= factor;
fields += pack_fields(31, 31, 31);
v -= (fields >> 5) & 0x1F;
v -= (fields >> 7) & 0x1F00;
v -= (fields >> 9) & 0x1F0000;
first->v = v;
}
}
As expected, this doesn’t really help. Still 2.1× slower than the non-parallel version.
Part of the problem might be that I’m running the benchmarks on an x86-64 system, and the x86-64 does not have dedicated bitfield manipulation instructions, unlike other processors like PowerPC, ARM, and AArch64.
Or maybe the problem is that I’m doing a multiply. Since I know that the darkness factor is going to be 1, 2, or 3, I can strength-reduce that to an addition. We’ll look at how to do that jumplessly next time.
I believe there is a typo.
“(100²a + 100b + 100c)d = 100²ad + 100bd + cd”
That should have been just “c”, not “100c” on the left side.
In the “quick mental check” part, do you mean 2¹⁰ rather than 10¹¹?
I’m gonna need a Microspeak post about “strength-reduce” 🙂
Strength reduction
The techniques you describe are cool, but from a narrative perspective, I feel that introducing new rules in part 2 is not fair. It completely changes the problem space and possibilities for optimization.
Imagine if in part 3 you change the setup again so darkness can only be 0 or 100%. Then the program can be “optimized” to be a no-op when darkness is 0, and a memset(0) when darkness is 100%.
I’m pretty sure you meant union Pixel not struct Pixel.
So, the summary of Episode 2 of “Optimizing code to darken a bitmap”: Making the code run 2.9x times slower. Highlights of Episode 3: How to make the code cause BSOD! 😉
Joke aside, I hope the 3rd episode justifies this entire post. I’ll keep my fingers crossed. 🤞
In the 90s, you would end using a look-up table for the multiplication. Building it takes 256 iterations and 256 bytes of memory, but after that, you replace a multiplication and a division with a byte array access, which is a lot faster. The table even fits inside the meager 8 KB cache of a 486. You could even have several pre-computed tables and use them on demand, saving even more time.
This was used a lot in software renderers for 3D games. It allowed a lot of tricks with the textures: shading, coloring, and even changing individual colors (say, turn blues to reds while leaving all other colors unchanged). And it was quick enough to apply it per-pixel in the inner texturing loop. In fact, that’s where the term “shader” comes from: a small piece of code which gets executed once per pixel while applying a texture, to “shade” it.
Ken Silverman, author of the Build engine used in Duke Nukem 3D, Shadow Warrior and Blood, tells an interesting story. During the development of Shadow Warrior, 3D Realms asked him to add MMX support. He played with it, and he found that due to his code being very optimized, and all the palette lookup tricks, it benefited little from SIMD, and in some cases the work of packing/unpacking bytes actually made it run slower (Build, as most game engines of the day, ran in 256-color paletted modes).
HI, try something like:
(since we’re going dark there should be no chance of any overflows or underflows to other bytes)
That was somewhat predictable. Multiplications are not as expensive as they used to be. I selected a few recent microarchitectures on the uops.info website, and they all show a throughput of 1 multiplication/cycle (not the latency, which is longer, but the multiplier is pipelined). Sure the CPU can process quite a few more instructions/cycle (and it’ll have to — loads, stores, loop increments, branching, etc.), but what good is that if you have to waste all the remaining resources (and then some more) with expensive conversion to a different representation and back?
This might work if the operation is inside of a pipeline with other operations that benefit from this representation, so that you could amortize the conversion cost across many different operations, performing it only at the start and finish.
Of course, and sorry to everyone else for the spoiler, but the highest performing implementation will use SIMD (i.e. AVX2 or AVX-512), and I wouldn’t bet that this conversion of representation trick would perform better there either. My guess is that the best performing version will simply zero/sign-extend from 8-bit to 16-bit lanes, do a 16-bit/lane multiply, then use shuffling trickery to pick up the most significant byte from each 16-bit lane.
Seems to be just an MSVC issue, clang and gcc will both swap to SIMD if you set the options
No AVX512: https://godbolt.org/z/beGh4qs1K
march=rocketlake https://godbolt.org/z/oPfaj6GsM