Our investigation into a simple function to darken a bitmap is still trying to beat this function:
```cpp
union Pixel
{
    uint8_t  c[4]; // four channels: red, green, blue, alpha
    uint32_t v;    // full pixel value as a 32-bit integer
};

void darken(Pixel* first, Pixel* last, int darkness)
{
    int lightness = 256 - darkness;
    for (; first < last; ++first) {
        first->c[0] = (uint8_t)(first->c[0] * lightness / 256);
        first->c[1] = (uint8_t)(first->c[1] * lightness / 256);
        first->c[2] = (uint8_t)(first->c[2] * lightness / 256);
    }
}
```
We transformed the problem into subtracting off the darkness rather than preserving the lightness, specialized the function to support only the darkness values 8, 16, and 24 (which rescale to factors of 1, 2, and 3), and used a bitfield trick so that we had to perform only one expensive multiply instruction per iteration rather than three, but that didn’t seem to be helping. We’ll now take advantage of the fact that the darkness factor is known to be 1, 2, or 3.
```cpp
void darken(Pixel* first, Pixel* last, int darkness)
{
    int factor = darkness / 8;
    uint32_t mask2 = factor >= 2 ? 0xFFFFFFFF : 0;
    uint32_t mask3 = factor >= 3 ? 0xFFFFFFFF : 0;
    for (; first < last; ++first) {
        uint32_t v = first->v;
        uint32_t fields = (v & 0xFF) | ((v & 0xFF00) << 2)
                                     | ((v & 0xFF0000) << 4);
        fields += (fields & mask2) + (fields & mask3);
        fields += pack_fields(31, 31, 31);
        v -= (fields >> 5) & 0x1F;
        v -= (fields >> 7) & 0x1F00;
        v -= (fields >> 9) & 0x1F0000;
        first->v = v;
    }
}
```
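(The helper `pack_fields` comes from an earlier installment and isn’t shown here. A plausible reconstruction consistent with the field layout above — channel 0 at bit 0, channel 1 at bit 10, channel 2 at bit 20 — would be the following; this is my sketch, not the post’s actual code:)

```cpp
#include <cstdint>

// Hypothetical reconstruction of pack_fields: place each value into the
// corresponding 10-bit field of the packed word (bits 0, 10, and 20).
constexpr uint32_t pack_fields(uint32_t a, uint32_t b, uint32_t c)
{
    return a | (b << 10) | (c << 20);
}
```

With this layout, `pack_fields(31, 31, 31)` adds 31 to every field, which rounds the subtrahend up before the right-shift by 5, matching the truncating division in the original code.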
The usage pattern for this function is that the darkness factor is in practice always 1, 2, or 3. So we calculate some masks that keep track of the actual value of the factor.
| factor | mask2      | mask3      | (fields & mask2) | + | (fields & mask3) |
|--------|------------|------------|------------------|---|------------------|
| 1      | 0x00000000 | 0x00000000 | 0                | + | 0                |
| 2      | 0xFFFFFFFF | 0x00000000 | fields           | + | 0                |
| 3      | 0xFFFFFFFF | 0xFFFFFFFF | fields           | + | fields           |
The masks let us zero out one or both of the fields terms when calculating the product.
Alas, this is 2.2× slower than the previous version. It seems that performing two bitwise AND operations and two additions is slower than a single multiply. My guess is that it’s because the factor is so small, and the CPU has an early-out for small factors.
Okay, it’s time to bring out the big guns. Time for the SIMD registers. We’ll do that next time.
Multiplications are a tricky thing, and their behavior has changed a lot over the last two decades. Today multiplications are super cheap and often as fast as additions. It depends on the dependency chain of all the operations and the number of available/free ALU sub-units that can perform a multiply/add. There is also fused multiply-add to consider. Today you can thumb-count a multiplication as cost 1 if you have at least two other operations at the same time not dependent on...
You can use this technique via SIMD in a register as well, since you know what you're subtracting is less than the original value, so you don't have to worry about underflow.
Further, what you're subtracting can be decomposed as follows:
Factor 1 = 8/256, or 1/32
Factor 2 = 16/256, or 1/16
Factor 3 = 24/256, or 1/16 + 1/32
You can compute these efficiently via SIMD in a register:
1/16 is just (pixel >> 4) & 0x0F0F0F
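Putting the commenter’s idea together, here is a minimal SWAR sketch (my code, not the commenter’s; note that it rounds the subtrahend down rather than up, so results differ by at most one from the post’s version):

```cpp
#include <cstdint>

// SWAR darken for darkness in {8, 16, 24}: subtract c/16 and/or c/32 from
// each channel. The subtrahend never exceeds the channel value, so the
// per-byte subtractions cannot borrow into the neighboring byte.
uint32_t darken_swar(uint32_t v, int darkness)
{
    uint32_t sub = 0;
    if (darkness & 16) sub += (v >> 4) & 0x000F0F0F; // c / 16 per channel
    if (darkness & 8)  sub += (v >> 5) & 0x00070707; // c / 32 per channel
    return v - sub; // alpha byte (bits 24..31) is untouched
}
```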
One thing I’m wondering about: Shouldn’t lightness be an unsigned integer? We originally defined darkness as range [0, 256]. So lightness is unsigned, too. I’m asking specifically because the current code treats the division by 256 as a signed division which requires more instructions than an unsigned division.
Early-out?
Not a CPU expert by any means, but as best I can tell from the context and a simple web search, CPUs can execute a multiplication in all of its fullness (multiply/carry/add/lather/rinse/repeat), but in some cases, with small factors, they stop as soon as the multiplication unit knows the result is complete and exit early, leading to shorter instruction execution times. Because multiplications can then have different execution times depending on their inputs, this can make scheduling...
An early-out would make multiplications a possible attack vector for timing attacks, similar to divisions. That led me to some information on the issue. Apparently, this kind of thing is uncommon in desktop and laptop CPUs, and it is definitely not the source of the issues in this blog post. Here is a list of susceptible CPUs
Because the decoders and the actual execution units are split, the decoders can use the fact that they can see ahead to make this happen. This is how you get things like "macro-op fusion" (not linking, as my last post with links is still awaiting moderation), where the CPU just does things more efficiently internally. It can also split instructions internally. But a classic example of fusion is (for most flag-modifying instructions)...
Thank you both.
So, if I understand correctly, this early-out thing works pretty much like when I was 13 years old and conversed with 7- or 8-year-old kids on the school bus. Impressed by my seniority, they'd ask me to mentally add or multiply large numbers for them. Unbeknownst to them, I had an easy time fulfilling their requests because their so-called "large" numbers had a lot of zeros in front of them. Early-out for me....
Yeah, that sounds unlikely to me, too. Integer multiply should be a fixed delay + fixed-throughput operation on any modern CPU. My guess is that the optimized version has significantly more instructions. Enough to bottleneck on retired instructions per cycle. Godbolt shows 24 vs 33 instructions, depending on how you count. That's at least two extra clock cycles. And even the original version probably doesn't bottleneck in the multiplier if we assume 6 instructions per...
Citing you: “Fastest code is code that doesn’t run.”, lesson learned!