{"id":43503,"date":"2014-12-01T07:00:00","date_gmt":"2014-12-01T07:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2014\/12\/01\/counting-array-elements-which-are-below-a-particular-limit-value-using-sse\/"},"modified":"2014-12-01T07:00:00","modified_gmt":"2014-12-01T07:00:00","slug":"counting-array-elements-which-are-below-a-particular-limit-value-using-sse","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20141201-00\/?p=43503","title":{"rendered":"Counting array elements which are below a particular limit value using SSE"},"content":{"rendered":"<p>\nSome time ago, we looked at\n<a HREF=\"http:\/\/blogs.msdn.com\/b\/oldnewthing\/archive\/2014\/06\/13\/10533875.aspx\">\nhow doing something can be faster than not doing it<\/a>.\nThat is, we observed the non-classical effect of the branch predictor.\nI took the branch out of the inner loop,\nbut let&#8217;s see how much further I can push it.\n<\/p>\n<p>\nThe trick I&#8217;ll employ today is using SIMD in order to\noperate on multiple pieces of data simultaneously.\nTake the original program and replace the\n<code>count&shy;them<\/code>\nfunction with this one:\n<\/p>\n<pre>\nint countthem(int boundary)\n{\n __m128i xboundary = _mm_cvtsi32_si128(boundary);\n __m128i count = _mm_setzero_si128();\n for (int i = 0; i &lt; 10000; i++) {\n  __m128i value =  _mm_cvtsi32_si128(array[i]);\n  __m128i test = _mm_cmplt_epi32(value, xboundary);\n  count = _mm_sub_epi32(count, test);\n }\n return _mm_cvtsi128_si32(count);\n}\n<\/pre>\n<p>\nNow, this program doesn&#8217;t actually use any parallel operations,\nbut it&#8217;s our starting point.\nFor each 32-bit value,\nwe load it,\ncompare it agains the boundary value,\nand accumulate the result.\nThe <code>_mm_cmplt_epi32<\/code> function\ncompares the four 32-bit integers in the first parameter\nagainst the four 32-bit integers in the second parameter,\nproducing four new 32-bit integers.\nEach of the new 32-bit integers is <code>0xFFFFFFFF<\/code>\nif the corresponding first parameter is less than the second,\nor it is <code>0x00000000<\/cODE> if it is greater than or equal.\n<\/p>\n<p>\nIn this case, we loaded up the value we care about,\nthen compare it against the boundary value.\nThe result of the comparison is either 32 bits of 0 (for false)\nor 32 bits of 1 (for true),\nso this merely sets <code>test<\/code> equal to\n<code>0xFFFFFFFF<\/code> if the value is less than the boundary;\notherwise\n<code>0x0000000<\/code>.\nSince <code>0xFFFFFFFF<\/code> is the same as a 32-bit <code>-1<\/code>,\nwe subtract the value so that the count goes up by 1 if the\nvalue is less than the boundary.\n<\/p>\n<p>\nFinally, we convert back to a 32-bit integer\nand return it.\n<\/p>\n<p>\nWith this change, the running time drops from 2938 time units\nto 2709, an improvement of 8%.\n<\/p>\n<p>\nSo far,\nwe have been using only the bottom 32 bits of the 128-bit XMM registers.\nLet's turn on the parallelism.\n<\/p>\n<pre>\nint countthem(int boundary)\n{\n <font COLOR=\"blue\">__m128i *xarray = (__m128i*)array;<\/font>\n __m128i xboundary = <font COLOR=\"blue\">_mm_set1_epi32<\/font>(boundary);\n __m128i count = _mm_setzero_si128();\n for (int i = 0; i &lt; <font COLOR=\"blue\">10000 \/ 4<\/font>; i++) {\n  __m128i value = <font COLOR=\"blue\">_mm_loadu_si128(&amp;xarray[i])<\/font>;\n  __m128i test = _mm_cmplt_epi32(value, xboundary);\n  count = _mm_sub_epi32(count, test);\n }\n <font COLOR=\"blue\">__m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));\n count = _mm_add_epi32(count, shuffle1);\n __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));\n count = _mm_add_epi32(count, shuffle2);<\/font>\n return _mm_cvtsi128_si32(count);\n}\n<\/pre>\n<p>\nWe take our 32-bit integers and put them in groups of four,\nso instead of thinking of them as 10000 32-bit integers,\nwe think of them as 2500 128-bit blocks,\neach block containing four <i>lanes<\/i>,\nwith each lane holding one 32-bit integers.\n<\/p>\n<table BORDER=\"0\" CELLPADDING=\"3\" STYLE=\"border-collapse: collapse;text-align: center\">\n<tr>\n<td><\/td>\n<td STYLE=\"width: 1em\"><\/td>\n<td>Lane 3<\/td>\n<td>Lane 2<\/td>\n<td>Lane 1<\/td>\n<td>Lane 0<\/td>\n<\/tr>\n<tr>\n<td><code>xarray[0]<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[3]<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[2]<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[1]<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[0]<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>xarray[1]<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[7]<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[6]<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[5]<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[4]<\/code><\/td>\n<\/tr>\n<tr>\n<td>&#x22EE;<\/td>\n<td><\/td>\n<td>&#x22EE;<\/td>\n<td>&#x22EE;<\/td>\n<td>&#x22EE;<\/td>\n<td>&#x22EE;<\/td>\n<\/tr>\n<tr>\n<td><code>xarray[2499]<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[9999]<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[9998]<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[9997]<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[9996]<\/code><\/td>\n<\/tr>\n<\/table>\n<p>\nNow we can run our previous algorithm in parallel on each lane.\n<\/p>\n<table BORDER=\"0\" CELLPADDING=\"3\" STYLE=\"border-collapse: collapse;text-align: center\">\n<tr>\n<td><\/td>\n<td STYLE=\"width: 1em\"><\/td>\n<td>Lane 3<\/td>\n<td>Lane 2<\/td>\n<td>Lane 1<\/td>\n<td>Lane 0<\/td>\n<\/tr>\n<tr>\n<td><code>xboundary<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\"><code>boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>boundary<\/code><\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"6\">&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>test<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[3] &lt; boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[2] &lt; boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[1] &lt; boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[0] &lt; boundary<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>test<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[7] &lt; boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[6] &lt; boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[5] &lt; boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[4] &lt; boundary<\/code><\/td>\n<\/tr>\n<tr>\n<td>&#x22EE;<\/td>\n<td><\/td>\n<td>&#x22EE;<\/td>\n<td>&#x22EE;<\/td>\n<td>&#x22EE;<\/td>\n<td>&#x22EE;<\/td>\n<\/tr>\n<tr>\n<td><code>test<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[9999] &lt; boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[9998] &lt; boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[9997] &lt; boundary<\/code><\/td>\n<td STYLE=\"border: solid 1px black\"><code>array[9996] &lt; boundary<\/code><\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"6\">&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>count<\/code> = <var STYLE=\"font-style: normal\">&Sigma;<\/var> &minus;<code>test<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\">Lane 3 totals<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 2 totals<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 1 totals<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 0 totals<\/td>\n<\/tr>\n<\/table>\n<p>\nThe <code>xboundary<\/code> variable contains\na copy of the boundary in each of the four 32-bit lanes.\nWe load the values from the array four at a time&sup1;\nand compare them (in parallel) against the boundary,\nthen we tally them (in parallel).\nThe result of the loop is that each lane of <code>count<\/code>\nperforms a count of values for its lane.\n<\/p>\n<p>\nAfter we complete the loop, we combine the parallel results\nby adding the lanes together. We do this by shuffling the values\naround and performing more parallel adds.\nThe\n<code>_mm_shuffle_epi32<\/code> function lets you rearrange the\nlanes of an XMM register.\nThe <code>_MM_SHUFFLE<\/code> macro lets you specify how you\nwant the shuffle to occur.\nFor example,\n<code>_MM_SHUFFLE(1, 0, 3, 2)<\/code>\nsays that we want lanes 1, 0, 3 then 2 of the original value.\n(You can shuffle a value into multiple destination lanes;\nfor example,\n<code>_MM_SHUFFLE(0, 0, 0, 0)<\/code>\nsays that you want four copies of lane 0.\nThat's how we created <code>xboundary<\/code>.)\n<\/p>\n<table BORDER=\"0\" CELLPADDING=\"3\" STYLE=\"border-collapse: collapse;text-align: center\">\n<tr>\n<td><\/td>\n<td STYLE=\"width: 1em\"><\/td>\n<td>Lane 3<\/td>\n<td>Lane 2<\/td>\n<td>Lane 1<\/td>\n<td>Lane 0<\/td>\n<\/tr>\n<tr>\n<td><code>count<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\">Lane 3 totals<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 2 totals<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 1 totals<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 0 totals<\/td>\n<\/tr>\n<tr>\n<td><code>shuffle1<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\">Lane 1 totals<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 0 totals<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 3 totals<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 2 totals<\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"6\">&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>count += shuffle1<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\">Lane 3 + Lane 1<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 2 + Lane 0<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 1 + Lane 3<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 0 + Lane 2<\/td>\n<\/tr>\n<tr>\n<td><code>shuffle2<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\">Lane 2 + Lane 0<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 3 + Lane 1<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 0 + Lane 2<\/td>\n<td STYLE=\"border: solid 1px black\">Lane 1 + Lane 3<\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"6\">&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>count += shuffle2<\/code><\/td>\n<td><\/td>\n<td STYLE=\"border: solid 1px black\" ALIGN=\"left\">Lane 3 + Lane 1 +<br \/>\n                                        Lane 2 + Lane 0<\/td>\n<td STYLE=\"border: solid 1px black\" ALIGN=\"left\">Lane 2 + Lane 0 +<br \/>\n                                        Lane 3 + Lane 1<\/td>\n<td STYLE=\"border: solid 1px black\" ALIGN=\"left\">Lane 1 + Lane 3 +<br \/>\n                                        Lane 0 + Lane 2<\/td>\n<td STYLE=\"border: solid 1px black\" ALIGN=\"left\">Lane 0 + Lane 2 +<br \/>\n                                        Lane 1 + Lane 3<\/td>\n<\/tr>\n<\/table>\n<p>\nAt the end of the shuffling and adding,\nwe have calculated the sum of all\nfour lanes.\n(For\n<a HREF=\"http:\/\/blogs.msdn.com\/b\/oldnewthing\/archive\/2012\/11\/13\/10367904.aspx\">\nstyle points<\/a>, I put the answer in all the lanes.)\n<\/p>\n<p>\nThis new version runs in 688 time units,\nor 3.9 times faster than the previous one.\nThis makes sense because we are counting four\nvalues at each iteration.\nThe overall improvement is 4.3&times;.\n<\/p>\n<p>\nLet's see if we can reduce the loop overhead by\ndoing some unrolling.\n<\/p>\n<pre>\n<font COLOR=\"blue\">#define GETVALUE(n) __m128i value##n = _mm_loadu_si128(&amp;xarray[i+n])\n#define GETTEST(n) __m128i test##n = _mm_cmplt_epi32(value##n, xboundary)\n#define GETCOUNT(n)  count = _mm_sub_epi32(count, test##n)<\/font>\nint countthem(int boundary)\n{\n __m128i *xarray = (__m128i*)array;\n __m128i xboundary = _mm_set1_epi32(boundary);\n __m128i count = _mm_setzero_si128();\n for (int i = 0; i &lt; 10000 \/ 4; i += <font COLOR=\"blue\">4<\/font>) {\n  <font COLOR=\"blue\">GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3);\n   GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);\n  GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);<\/font>\n }\n __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));\n count = _mm_add_epi32(count, shuffle1);\n __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));\n count = _mm_add_epi32(count, shuffle2);\n return _mm_cvtsi128_si32(count);\n}\n<\/pre>\n<p>\nWe unroll the loop fourfold.\nAt each iteration, we load 16 values from memory,\nand then accumulate the totals.\nWe fetch all the memory values first,\nthen do the comparisons,\nthen accumulate the results.\nIf we had written it as\n<code>GETVALUE<\/code> immediately followed\nby <code>GETTEST<\/code>,\nthen the <code>_mm_cmplt_epi32<\/code>\nwould have stalled waiting for the result\nto arrive from memory.\nBy interleaving the operations,\nwe get some work done instead of stalling.\n<\/p>\n<p>\nThis version runs in 514 time units,\nan improvement of 33% over the previous version\nand an overall improvement of 5.7&times;.\n<\/p>\n<p>\nCan we unroll even further?\nLet's try fivefold.\n<\/p>\n<pre>\nint countthem(int boundary)\n{\n __m128i *xarray = (__m128i*)array;\n __m128i xboundary = _mm_set1_epi32(boundary);\n __m128i count = _mm_setzero_si128();\n for (int i = 0; i &lt; 10000 \/ 4; i += <font COLOR=\"blue\">5<\/font>) {\n  GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); <font COLOR=\"blue\">GETVALUE(4);<\/font>\n   GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  <font COLOR=\"blue\">GETTEST(4);<\/font>\n  GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); <font COLOR=\"blue\">GETCOUNT(4);<\/font>\n }\n __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));\n count = _mm_add_epi32(count, shuffle1);\n __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));\n count = _mm_add_epi32(count, shuffle2);\n return _mm_cvtsi128_si32(count);\n}\n<\/pre>\n<p>\nHuh?\nThis version runs marginally <i>slower<\/i>,\nat 528 time units.\nSo I guess further unrolling won't help any more.\n(For example, if you unroll a loop so much that\nyou have more live variables than registers,\nthe compiler will need to spill registers to memory.\nThe x86 has eight XMM registers available,\nso you can easily cross that limit.)\n<\/p>\n<p>\nBut wait, there's still room for tweaking.\nWe have been using\n<code>_mm_cmplt_epi32<\/code> to perform the comparison,\nexpecting the compiler to generate code like this:\n<\/p>\n<pre>\n    ; suppose xboundary is in xmm0 and count is in xmm1\n    movdqu   xmm2, xarray[i] ; xmm2 = value\n    pcmpltd  xmm2, xmm0      ; xmm2 = test\n    psubd    xmm1, xmm2\n<\/pre>\n<p>\nIf you crack open your Intel manual,\nyou'll see that there is no\n<code>PCMPLTD<\/code> instruction.\nThe compiler intrinsic is emulating the instruction by\nflipping the parameters and using <code>PCMPGTD<\/code>.\n<\/p>\n<pre>\n_mm_cmplt_epi32(x, y) &harr; _mm_cmpgt_epi32(y, x)\n<\/pre>\n<p>\nBut the <code>PCMPGTD<\/code> instruction writes the result\nback into the first parameter.\nIn other words, it always takes the form\n<\/p>\n<pre>\ny = _mm_cmpgt_epi32(y, x);\n<\/pre>\n<p>\nIn our case, <code>y<\/code> is <code>xboundary<\/code>,\nbut we don't want to modify <code>xboundary<\/code>.\nAs a result, the compiler needs to introduce a temporary register:\n<\/p>\n<pre>\n    movdqu   xmm2, xarray[i] ; xmm2 = value\n    movdqa   xmm3, xmm0      ; xmm3 = copy of xboundary\n    pcmpgtd  xmm3, xmm2      ; xmm3 = test\n    psubd    xmm1, xmm3\n<\/pre>\n<p>\nWe can take an instruction out of the sequence by switching to\n<code>_mm_cmpgt_epi32<\/code> and adjusting our logic accordingly,\ntaking advantage of the fact that<\/p>\n<pre>\nx &lt; y &hArr; &not;(x &ge; y) &hArr; &not;(x &gt; y &minus; 1)\n<\/pre>\n<p>\nassuming the subtraction does not underflow.\nFortunately, it doesn't in our case since <code>boundary<\/code>\nranges from 0 to 10, and subtracting 1 does not put us in any danger\nof integer underflow.\n<\/p>\n<p>\nWith this rewrite, we can switch to using\n<code>_mm_cmpgt_epi32<\/code>,\nwhich is more efficient for our particular scenario.\nSince we are now counting the values which <i>don't<\/i>\nmeet our criteria,\nwe need to take our final result and subtract it from 10000.\n<\/p>\n<pre>\n#define GETTEST(n) __m128i test##n = _mm_<font COLOR=\"blue\">cmpgt<\/font>_epi32(value##n, <font COLOR=\"blue\">xboundary1<\/font>)\nint countthem(int boundary)\n{\n __m128i *xarray = (__m128i*)array;\n <font COLOR=\"blue\">__m128i xboundary1 = _mm_set1_epi32(boundary - 1);<\/font>\n __m128i count = _mm_setzero_si128();\n for (int i = 0; i &lt; 10000 \/ 4; i += <font COLOR=\"blue\">5<\/font>) {\n  GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);\n   GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);\n  GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);\n }\n __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));\n count = _mm_add_epi32(count, shuffle1);\n __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));\n count = _mm_add_epi32(count, shuffle2);\n return <font COLOR=\"blue\">10000 -<\/font> _mm_cvtsi128_si32(count);\n}\n<\/pre>\n<p>\nNotice that we have two subtractions which cancel out.\nWe are subtracting the result of the comparison, and then\nwe subtract the total from 10000.\nThe two signs cancel out, and we can use addition for both.\nThis saves an instruction in the <code>return<\/code> because\nsubtraction is not commutative, but addition is.\n<\/p>\n<pre>\n#define GETCOUNT(n) count = _mm_<font COLOR=\"blue\">add<\/font>_epi32(count, test##n)\nint countthem(int boundary)\n{\n __m128i *xarray = (__m128i*)array;\n __m128i xboundary1 = _mm_set1_epi32(boundary - 1);\n __m128i count = _mm_setzero_si128();\n for (int i = 0; i &lt; 10000 \/ 4; i += 5) {\n  GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);\n   GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);\n  GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);\n }\n __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));\n count = _mm_add_epi32(count, shuffle1);\n __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));\n count = _mm_add_epi32(count, shuffle2);\n return <font COLOR=\"blue\">10000 +<\/font> _mm_cvtsi128_si32(count);\n}\n<\/pre>\n<p>\nYou can look at the transformation this way:\nThe old code considered the glass half empty.\nIt started with zero and added 1 each time it found an entry\nthat passed the test.\nThe new code considers the glass half full.\nIt assumes each entry passes the test,\nand it subtracts one each time it finds an element that fails the test.\n<\/p>\n<p>\nThis version runs in 453 time units,\nan improvement of 13% over the fourfold unrolled version\nand an improvement of 6.5&times; overall.\n<\/p>\n<p>\nOkay, let's unroll sixfold, just for fun.\n<\/p>\n<pre>\nint countthem(int boundary)\n{\n __m128i *xarray = (__m128i*)array;\n __m128i xboundary = _mm_set1_epi32(boundary - 1);\n __m128i count = _mm_setzero_si128();\n <font COLOR=\"blue\">int i = 0;\n {\n    GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3);\n     GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);\n    GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);\n }\n i += 4;<\/font>\n for (; i &lt; 10000 \/ 4; i += <font COLOR=\"blue\">6)<\/font> {\n  GETVALUE(0); GETVALUE(1); GETVALUE(2);\n  GETVALUE(3); GETVALUE(4); <font COLOR=\"blue\">GETVALUE(5);<\/font>\n   GETTEST(0);  GETTEST(1);  GETTEST(2);\n   GETTEST(3);  GETTEST(4);  <font COLOR=\"blue\">GETTEST(5);<\/font>\n  GETCOUNT(0); GETCOUNT(1); GETCOUNT(2);\n  GETCOUNT(3); GETCOUNT(4); <font COLOR=\"blue\">GETCOUNT(5);<\/font>\n }\n __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));\n count = _mm_add_epi32(count, shuffle1);\n __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));\n count = _mm_add_epi32(count, shuffle2);\n return 10000 + _mm_cvtsi128_si32(count);\n}\n<\/pre>\n<p>\nSince <code>10000 \/ 4 % 6 = 4<\/code>,\nwe have four values that don't fit in the loop.\nWe deal with those values up front,\nand then enter the loop to get the rest.\n<\/p>\n<p>\nThis version runs in 467 time units,\nwhich is 3% slower than the previous version.\nSo I guess it's time to stop unrolling.\nLet's go back to the previous version which ran faster.\n<\/p>\n<p>\nThe total improvement we got after all this tweaking\nis speedup of 6.5&times; over the original jumpless version.\nAnd most of that improvement (5.7&times;) came from\nunrolling the loop fourfold.\n<\/p>\n<p>\nAnyway, no real moral of the story today.\nI just felt like tinkering.\n<\/p>\n<p>\n<b>Notes<\/b>\n<\/p>\n<p>\n&sup1; The\n<code>_mm_loadu_si128<\/code>\nintrinsic is kind of weird.\nIts formal argument is a\n<code>__m128i*<\/code>,\nbut since it is for loading unaligned data,\nthe formal argument really should be\n<code>__m128i __unaligned*<\/code>.\nThe problem is that the <code>__unaligned<\/code> keyword\ndoesn't exist on x86 because prior to the introduction of MMX and SSE,\nx86 allowed arbitrary misaligned data.\nTherefore, you are in this weird situation where you have to\nuse an aligned pointer to access unaligned data.\n<\/p>\n<p>\n<b>Bonus chatter<\/b>: Clang at optimization level 3 does autovectorization.\nIt doesn't know some of the other tricks, like converting\n<code>x + 1<\/code>\nto\n<code>x - (-1)<\/code>, thereby saving an instruction and a register.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Some time ago, we looked at how doing something can be faster than not doing it. That is, we observed the non-classical effect of the branch predictor. I took the branch out of the inner loop, but let&#8217;s see how much further I can push it. The trick I&#8217;ll employ today is using SIMD in [&hellip;]<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-43503","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Some time ago, we looked at how doing something can be faster than not doing it. That is, we observed the non-classical effect of the branch predictor. I took the branch out of the inner loop, but let&#8217;s see how much further I can push it. The trick I&#8217;ll employ today is using SIMD in [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/43503","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=43503"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/43503\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=43503"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=43503"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=43503"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}