{"id":43403,"date":"2014-12-15T07:00:00","date_gmt":"2014-12-15T07:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2014\/12\/15\/notes-on-calculating-constants-in-sse-registers\/"},"modified":"2014-12-15T07:00:00","modified_gmt":"2014-12-15T07:00:00","slug":"notes-on-calculating-constants-in-sse-registers","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20141215-00\/?p=43403","title":{"rendered":"Notes on calculating constants in SSE registers"},"content":{"rendered":"<p>\nThere are a few ways to load constants into SSE registers.\n<\/p>\n<ul>\n<li>Load them from memory.<\/li>\n<li>Load them from general purpose registers via\n    <code>movd<\/code>.<\/li>\n<li>Insert selected bits from general purpose registers via\n    <code>pinsr[b|w|d|q]<\/code>.<\/li>\n<li>Try to calculate them in clever ways.<\/li>\n<\/ul>\n<p>\nLoading constants from memory incurs memory access penalties.\nLoading or inserting them from general purpose registers incurs\ncross-domain penalties.\nSo let&#8217;s see what we can do with clever calculations.\n<\/p>\n<p>\nThe most obvious clever calculations are the ones for setting\na register to all zeroes or all ones.\n<\/p>\n<pre>\n    pxor    xmm0, xmm0 ; set all bits to zero\n    pcmpeqd xmm0, xmm0 ; set all bits to one\n<\/pre>\n<p>\nThese two idioms are special-cased in the processor and execute\nfaster than normal <code>pxor<\/code> and <code>pcmpeqd<\/code> instructions\nbecause the results are not dependent on the previous value\nin <code>xmm0<\/code>.\n<\/p>\n<p>\nThere&#8217;s not much more you can do to construct other\nvalues from zero,\nbut a register with all bits set does create additional\nopportunities.\n<\/p>\n<p>If you need a value loaded into all lanes whose bit pattern\nis either a bunch of 0&#8217;s followed by a bunch of 1&#8217;s,\nor a bunch of 1&#8217;s followed by a bunch of 0&#8217;s,\nthen you can shift in zeroes.\nFor example, assuming you&#8217;ve set all bits in <code>xmm0<\/code> to 
1,\nhere&#8217;s how you can load some other constants:\n<\/p>\n<pre>\n    pcmpeqd xmm0, xmm0 ; set all bits to one\n-then-\n    pslld  xmm0, 30    ; all 32-bit lanes contain 0xC0000000\n-or-\n    psrld  xmm0, 29    ; all 32-bit lanes contain 0x00000007\n-or-\n    psrld  xmm0, 31    ; all 32-bit lanes contain 0x00000001\n<\/pre>\n<p>\nIntel suggests loading 1 into all lanes with the sequence\n<\/p>\n<pre>\n    pxor    xmm0, xmm0 ; xmm0 = { 0, 0, 0, 0 }\n    pcmpeqd xmm1, xmm1 ; xmm1 = { -1, -1, -1, -1 }\n    psubd   xmm0, xmm1 ; xmm0 = { 1, 1, 1, 1 }\n<\/pre>\n<p>\nbut that not only takes more instructions but also consumes two registers,\nand registers are at a premium since there are only eight of them in 32-bit code\n(sixteen in 64-bit code).\nThe only justification I can think of is that <code>psubd<\/code> might be faster\nthan <code>psrld<\/code>.\n<\/p>\n<p>\nIn general, to load <code>2&#x207F;&minus;1<\/code>\ninto all lanes, you do\n<\/p>\n<pre>\n    pcmpeqd xmm0, xmm0 ; set all bits to one\n-then-\n    psrlw  xmm0, 16-n  ; clear top 16-n bits of all 16-bit lanes\n-or-\n    psrld  xmm0, 32-n  ; clear top 32-n bits of all 32-bit lanes\n-or-\n    psrlq  xmm0, 64-n  ; clear top 64-n bits of all 64-bit lanes\n<\/pre>\n<p>\nConversely, if you want to load\n<code>~(2&#x207F;&minus;1) = -2&#x207F;<\/code> into all lanes,\nyou shift the other way.\n<\/p>\n<pre>\n    pcmpeqd xmm0, xmm0 ; set all bits to one\n-then-\n    psllw  xmm0, n     ; clear bottom n bits of all 16-bit lanes = 2&sup1;&#x2076; - 2&#x207F;\n-or-\n    pslld  xmm0, n     ; clear bottom n bits of all 32-bit lanes = 2&sup3;&sup2; - 2&#x207F;\n-or-\n    psllq  xmm0, n     ; clear bottom n bits of all 64-bit lanes = 2&#x2076;&#x2074; - 2&#x207F;\n<\/pre>\n<p>\nAnd if the value you want has all its set bits in the middle,\nyou can combine two shifts (and, ideally, schedule an unrelated instruction\nbetween the two shifts to hide the dependency stall):\n<\/p>\n<pre>\n    pcmpeqd xmm0, xmm0 ; set all bits to one\n-then-\n    psrlw  xmm0, 13    ; all lanes = 0x0007\n    psllw  
xmm0, 4     ; all lanes = 0x0070\n-or-\n    psrld  xmm0, 31    ; all lanes = 0x00000001\n    pslld  xmm0, 3     ; all lanes = 0x00000008\n<\/pre>\n<p>\nIf you want to set high or low lanes to zero,\nyou can use <code>pslldq<\/code> and\n<code>psrldq<\/code>.\n<\/p>\n<pre>\n    pcmpeqd xmm0, xmm0 ; set all bits to one\n-then-\n    pslldq xmm0, 2     ; clear bottom word, xmm0 = { -1, -1, -1, -1, -1, -1, -1, 0 }\n-or-\n    pslldq xmm0, 4     ; clear bottom dword, xmm0 = { -1, -1, -1, 0 }\n-or-\n    pslldq xmm0, 8     ; clear bottom qword, xmm0 = { -1, 0 }\n-or-\n    psrldq xmm0, 2     ; clear top word, xmm0 = { 0, -1, -1, -1, -1, -1, -1, -1 }\n-or-\n    psrldq xmm0, 4     ; clear top dword, xmm0 = { 0, -1, -1, -1 }\n-or-\n    psrldq xmm0, 8     ; clear top qword, xmm0 = { 0, -1 }\n<\/pre>\n<p>\nNo actual program today.\nJust some notes from my days writing SSE assembly language.\n<\/p>\n<p>\n<b>Bonus chatter<\/b>:\nThere is an intrinsic for <code>pxor xmmReg, xmmReg<\/code>:\n<a HREF=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ys7dw0kh(v=vs.90).aspx\">\n<code>_mm_setzero_si128<\/code><\/a>.\nHowever, there is no corresponding intrinsic for\n<code>pcmpeqd xmmReg, xmmReg<\/code>,\nwhich would presumably be called\n<code>_mm_setones_si128<\/code>\nor\n<code>_mm_setmone_epiNN<\/code>.\nIn order to get all-ones, you need to get a throwaway register\nand compare it against itself.\nThe cheapest throwaway register is one that is set to zero,\nsince that is special-cased inside the processor.\n<\/p>\n<pre>\n__m128i zero = _mm_setzero_si128();\n__m128i ones = _mm_cmpeq_epi32(zero, zero);\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>There are a few ways to load constants into SSE registers. Load them from memory. Load them from general purpose registers via movd. Insert selected bits from general purpose registers via pinsr[b|w|d|q]. Try to calculate them in clever ways. Loading constants from memory incurs memory access penalties. 
Loading or inserting them from general purpose registers [&hellip;]<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-43403","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>There are a few ways to load constants into SSE registers. Load them from memory. Load them from general purpose registers via movd. Insert selected bits from general purpose registers via pinsr[b|w|d|q]. Try to calculate them in clever ways. Loading constants from memory incurs memory access penalties. Loading or inserting them from general purpose registers [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/43403","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=43403"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/43403\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=43403"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=43403"},{"taxonomy":"post_tag","embeddable":true,"href"
:"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=43403"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}