More notes on calculating constants in SSE registers

Raymond Chen

A few weeks ago I noted some tricks for creating special bit patterns in all lanes, but I forgot to cover the case where you treat the 128-bit register as one giant lane: Setting all of the least significant N bits or all of the most significant N bits.

This is a variation of the trick for setting a bit pattern in all lanes, but the catch is that the pslldq instruction shifts by bytes, not bits.

We’ll assume that N is not a multiple of eight, because if it were a multiple of eight, then the pslldq or psrldq instruction does the trick (after using pcmpeqd to fill the register with ones).

One case is if N ≤ 64. This is relatively easy because we can build the value by first building the desired value in both 64-bit lanes, and then finishing with a big pslldq or psrldq to clear the lane we don’t like.

; set the bottom N bits, where N ≤ 64
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
unsigned shift right
64 − N bits
unsigned shift right
64 − N bits
psrlq   xmm0, 64 - N ;

0000 0000 0FFF FFFF 0000 0000 0FFF FFFF
unsigned shift right 64 bits
psrldq  xmm0, 8 ;

0000 0000 0000 0000 0000 0000 0FFF FFFF
 
; set the top N bits, where N ≤ 64
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
unsigned shift left
64 − N bits
unsigned shift left
64 − N bits
psllq   xmm0, 64 - N ; FFFF FFF0 0000 0000 FFFF FFF0 0000 0000
unsigned shift left 64 bits
pslldq  xmm0, 8 ; FFFF FFF0 0000 0000 0000 0000 0000 0000

If N ≥ 80, then we shift in zeroes into the top and bottom half, but then use a shuffle to patch up the half that needs to stay all-ones.

; set the bottom N bits, where N ≥ 80
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
unsigned shift right
128 − N bits
unsigned shift right
128 − N bits
psrlq   xmm0, 128 - N ; 0000 0000 0FFF FFFF 0000 0000 0FFF FFFF
copy shuffle
pshuflw xmm0, _MM_SHUFFLE(0, 0, 0, 0) ; 0000 0000 0FFF FFFF FFFF FFFF FFFF FFFF
 
; set the top N bits, where N ≥ 80
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
unsigned shift left
128 − N bits
unsigned shift left
128 − N bits
psllq   xmm0, 128 - N ; FFFF FFF0 0000 0000 FFFF FFF0 0000 0000
shuffle copy
pshufhw xmm0, _MM_SHUFFLE(3, 3, 3, 3) ; FFFF FFFF FFFF FFFF FFFF FFF0 0000 0000

We have N ≥ 80, which means that 128 – N ≤ 48, which means that there are at least 16 bits of ones left in low-order bits after we shift right. We then use a 4×16-bit shuffle to copy those known-all-ones 16 bits into the other lanes of the lower half. (A similar argument applies to setting the top bits.)

This leaves 64 < N < 80. That uses a different trick:

; set the bottom N bits, where N ≤ 120
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
unsigned shift right 8 bits
psrldq  xmm0, 1 ; 00FF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
signed shift right
120 − N bits
signed shift right
120 − N bits
psrad  xmm0, 120 - N ; 0000 00FF FFFF FFFF FFFF FFFF FFFF FFFF

The sneaky trick here is that we use a signed shift in order to preserve the bottom half. Unfortunately, there is no corresponding left shift that shifts in ones, so the best I can come up with is four instructions:

; set the top N bits, where 64 ≤ N ≤ 96
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
unsigned shift left
96 − N bits
unsigned shift left
96 − N bits
psllq   xmm0, 96 - N ; FFFF FFFF FFF0 0000 FFFF FFFF FFF0 0000
shuffle
pshufd  xmm0, _MM_SHUFFLE(3, 3, 1, 0) ; FFFF FFFF FFFF FFFF FFFF FFFF FFF0 0000
unsigned shift left 32 bits
pslldq  xmm0, 4 ; FFFF FFFF FFFF FFFF FFFF FF00 0000 0000

We view the 128-bit register as four 32-bit lanes. split the shift into two steps. First, we fill Lane 0 with the value we ultimately want in Lane 1, then we patch up the damage we did to Lane 2, then we do a shift the 128-bit value left 32 places to slide the value into position and zero-fill Lane 0.

Note that a lot of the ranges of N overlap, so you often have a choice of solutions. There are other three-instruction solutions I didn’t bother presenting here. The only one I couldn’t find a three-instruction solution for was setting the top N bits where 64 < N < 80.

If you find a three-instruction solution for this last case, share it in the comments.

0 comments

Discussion is closed.

Feedback usabilla icon