There are a few ways to load constants into SSE registers.
- Load them from memory.
- Load them from general purpose registers via
movd
. - Insert selected bits from general purpose registers via
pinsr[b|w|d|q]
. - Try to calculate them in clever ways.
Loading constants from memory incurs memory access penalties. Loading or inserting them from general purpose registers incurs cross-domain penalties. So let’s see what we can do with clever calculations.
The most obvious clever calculations are the ones for setting a register to all zeroes or all ones.
pxor xmm0, xmm0 ; set all bits to zero pcmpeqd xmm0, xmm0 ; set all bits to one
These two idioms are special-cased in the processor and execute
faster than normal pxor and pcmpeqd instructions
because the results are not dependent on the previous value
in xmm0
.
There’s not much more you can do to construct other values from zero, but a register with all bits set does create additional opportunities.
If you need a value loaded into all lanes whose bit pattern
is either a bunch of 0’s followed by a bunch of 1’s,
or a bunch of 1’s followed by a bunch of 0’s,
then you can shift in zeroes.
For example, assuming you’ve set all bits in xmm0
to 1,
here’s how you can load some other constants:
pcmpeqd xmm0, xmm0 ; set all bits to one -then- pslld xmm0, 30 ; all 32-bit lanes contain 0xC0000000 -or- psrld xmm0, 29 ; all 32-bit lanes contain 0x00000007 -or- psrld xmm0, 31 ; all 32-bit lanes contain 0x00000001
Intel suggests loading 1 into all lanes with the sequence
pxor xmm0, xmm0 ; xmm0 = { 0, 0, 0, 0 } pcmpeqd xmm1, xmm1 ; xmm1 = { -1, -1, -1, -1 } psubd xmm0, xmm1 ; xmm0 = { 1, 1, 1, 1 }
but that not only takes more instructions but also consumes two registers,
and registers are at a premium since there are only eight of them.
The only thing I can think of is that psubd
might be faster
than psrld
.
In general, to load 2ⁿ−1
into all lanes, you do
pcmpeqd xmm0, xmm0 ; set all bits to one -then- psrlw xmm0, 16-n ; clear top 16-n bits of all 16-bit lanes -or- psrld xmm0, 32-n ; clear top 32-n bits of all 32-bit lanes -or- psrlq xmm0, 64-n ; clear top 64-n bits of all 64-bit lanes
Conversely, if you want to load
~(2ⁿ−1) = -2ⁿ
into all lanes,
you shift the other way.
pcmpeqd xmm0, xmm0 ; set all bits to one -then- psllw xmm0, n ; clear bottom n bits of all 16-bit lanes = 2¹⁶ - 2ⁿ -or- pslld xmm0, n ; clear bottom n bits of all 32-bit lanes = 2³² - 2ⁿ -or- psllq xmm0, n ; clear bottom n bits of all 64-bit lanes = 2⁶⁴ - 2ⁿ
And if the value you want has all its set bits in the middle, you can combine two shifts (and stick something in between the two shifts to ameliorate the stall):
pcmpeqd xmm0, xmm0 ; set all bits to one -then- psrlw xmm0, 13 ; all lanes = 0x0007 psllw xmm0, 4 ; all lanes = 0x0070 -or- psrld xmm0, 31 ; all lanes = 0x00000001 pslld xmm0, 3 ; all lanes = 0x00000008
If you want to set high or low lanes to zero,
you can use pslldq
and
psrldq
.
pcmpeqd xmm0, xmm0 ; set all bits to one -then- pslldq xmm0, 2 ; clear bottom word, xmm0 = { -1, -1, -1, -1, -1, -1, -1, 0 } -or- pslldq xmm0, 4 ; clear bottom dword, xmm0 = { -1, -1, -1, 0 } -or- pslldq xmm0, 8 ; clear bottom qword, xmm0 = { -1, 0 } -or- psrldq xmm0, 2 ; clear top word, xmm0 = { 0, -1, -1, -1, -1, -1, -1, -1 } -or- psrldq xmm0, 4 ; clear top dword, xmm0 = { 0, -1, -1, -1 } -or- psrldq xmm0, 8 ; clear top qword, xmm0 = { 0, -1 }
No actual program today. Just some notes from my days writing SSE assembly language.
Bonus chatter:
There is an intrinsic for pxor xmmReg, xmmReg
:
_mm_setzero_si128
.
However, there is no corresponding intrinsic for
pcmpeqd xmmReg, xmmReg
,
which would presumably be called
_mm_setones_si128
or
_mm_setmone_epiNN
.
In order to get all-ones, you need to get a throwaway register
and compare it against itself.
The cheapest throwaway register is one that is set to zero,
since that is special-cased inside the processor.
__m128i zero = _mm_setzero_si128(); __m128i ones = _mm_cmpeq_epi32(zero, zero);
0 comments