{"id":97416,"date":"2017-11-17T07:00:00","date_gmt":"2017-11-17T22:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=97416"},"modified":"2019-03-13T01:35:27","modified_gmt":"2019-03-13T08:35:27","slug":"20171117-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20171117-00\/?p=97416","title":{"rendered":"The wrong way of benchmarking the most efficient integer comparison function"},"content":{"rendered":"<p>On StackOverflow, there&#8217;s a question about <a HREF=\"https:\/\/stackoverflow.com\/q\/10996418\/902497\">the most efficient way to compare two integers and produce a result suitable for a comparison function<\/a>, where a negative value means that the first value is smaller than the second, a positive value means that the first value is greater than the second, and zero means that they are equal. <\/p>\n<p>There was much microbenchmarking of various options, ranging from the straightforward <\/p>\n<pre>\nint compare1(int a, int b)\n{\n    if (a &lt; b) return -1;\n    if (a &gt; b) return 1;\n    return 0;\n}\n<\/pre>\n<p>to the clever <\/p>\n<pre>\nint compare2(int a, int b)\n{\n    return (a &gt; b) - (a &lt; b);\n}\n<\/pre>\n<p>to the hybrid <\/p>\n<pre>\nint compare3(int a, int b)\n{\n    return (a &lt; b) ? -1 : (a &gt; b);\n}\n<\/pre>\n<p>to inline assembly <\/p>\n<pre>\nint compare4(int a, int b)\n{\n    __asm__ __volatile__ (\n        \"sub %1, %0 \\n\\t\"\n        \"jno 1f \\n\\t\"\n        \"cmc \\n\\t\"\n        \"rcr %0 \\n\\t\"\n        \"1: \"\n    : \"+r\"(a)\n    : \"r\"(b)\n    : \"cc\");\n    return a;\n}\n<\/pre>\n<p>The benchmark pitted the comparison functions against each other by comparing random pairs of numbers and adding up the results to prevent the code from being optimized out. <\/p>\n<p>But here&#8217;s the thing: Adding up the results is completely unrealistic. <\/p>\n<p>There are no meaningful semantics that could be applied to a sum of numbers for which only the sign is significant. No program that uses a comparison function will add the results. The only thing you can do with the result is compare it against zero and take one of three actions based on the sign. <\/p>\n<p>Adding up all the results means that you&#8217;re not using the function in a realistic way, which means that your benchmark isn&#8217;t realistic. <\/p>\n<p>Let&#8217;s try to fix that. Here&#8217;s my alternative test: <\/p>\n<pre>\n\/\/ Looks for \"key\" in sorted range [first, last) using the\n\/\/ specified comparison function. Returns iterator to found item,\n\/\/ or last if not found.\n\ntemplate&lt;typename It, typename T, typename Comp&gt;\nIt binarySearch(It first, It last, const T&amp; key, Comp compare)\n{\n \/\/ invariant: if key exists, it is in the range [first, first+length)\n \/\/ This binary search avoids the <a HREF=\"https:\/\/research.googleblog.com\/2006\/06\/extra-extra-read-all-about-it-nearly.html\">integer overflow problem<\/a>\n \/\/ by operating on lengths rather than ranges.\n auto length = last - first;\n while (length &gt; 0) {\n  auto step = length \/ 2;\n  It it = first + step;\n  auto result = compare(*it, key);\n  if (result &lt; 0) {\n   first = it + 1;\n   length -= step + 1;\n  } else if (result == 0) {\n   return it;\n  } else {\n   length = step;\n  }\n }\n return last;\n}\n\nint main(int argc, char **argv)\n{\n \/\/ initialize the array with sorted even numbers\n int a[8192];\n for (int i = 0; i &lt; 8192; i++) a[i] = i * 2;\n\n for (int iterations = 0; iterations &lt; 1000; iterations++) {\n  int correct = 0;\n  for (int j = -1; j &lt; 16383; j++) {\n   auto it = binarySearch(a, a+8192, j, COMPARE);\n   if (j &lt; 0 || j &gt; 16382 || j % 2) correct += it == a+8192;\n   else correct += it == a + (j \/ 2);\n  }\n  \/\/ if correct != 16384, then we have a bug somewhere\n  if (correct != 16384) return 1;\n }\n return 0;\n}\n<\/pre>\n<\/p>\n<p>Let&#8217;s look at the code generation for the various comparison functions. I used <a HREF=\"https:\/\/gcc.godbolt.org\/\">gcc.godbolt.org<\/a> with x86-64 gcc 7.2 and optimization <code>-O3<\/code>. <\/p>\n<p>If we try <code>compare1<\/code>, then the binary search looks like this: <\/p>\n<pre>\n    ; on entry, esi is the value to search for\n\n    lea rdi, [rsp-120]          ; rdi = first\n    mov edx, 8192               ; edx = length\n    jmp .L9\n.L25:                           ; was greater than\n    mov rdx, rax                ; length = step\n    test rdx, rdx               ; while (length &gt; 0)\n    jle .L19\n.L9:\n    mov rax, rdx                ;\n    sar rax, 1                  ; eax = step = length \/ 2\n    lea rcx, [rdi+rax*4]        ; it = first + step\n\n    <font COLOR=\"blue\">; result = compare(*it, key), and then test the result\n    cmp dword ptr [rcx], esi    ; compare(*it, key)\n    jl .L11                     ; if less than\n    jne .L25                    ; if not equal (therefore if greater than)<\/font>\n    ... return value in rcx     ; if equal, answer is in rcx\n\n.L11:                           ; was less than\n    add rax, 1                  ; step + 1\n    lea rdi, [rcx+4]            ; first = it + 1\n    sub rdx, rax                ; length -= step + 1\n    test rdx, rdx               ; while (length &gt; 0)\n    jg .L9\n.L19:\n    lea rcx, [rsp+32648]        ; rcx = last\n    ... return value in rcx\n<\/pre>\n<p><b>Exercise<\/b>: Why is <code>rsp - 120<\/code> the start of the array? <\/p>\n<p>Observe that despite using the lamest, least-optimized comparison function, we got the comparison-and-test code that is much what we would have written if we had done it in assembly language ourselves: We compare the two values, and then follow up with two branches based on the same shared flags. The comparison is still there, but the calculation and testing of the return value are gone. <\/p>\n<p>In other words, not only was <code>compare1<\/code> optimized down to one <code>cmp<\/code> instruction, but it also managed to delete instructions from the <code>binarySearch<\/code> function too. It had a net cost of negative instructions! <\/p>\n<p>What happened here? How did the compiler manage to optimize out all our code and leave us with the shortest possible assembly language equivalent? <\/p>\n<p>Simple: First, the compiler did some constant propagation. After inlining the <code>compare1<\/code> function, the compiler saw this: <\/p>\n<pre>\n    int result;\n    if (*it &lt; key) result = -1;\n    else if (*it &gt; key) result = 1;\n    else result = 0;\n    if (result &lt; 0) {\n      ... less than ...\n    } else if (result == 0) {\n      ... equal to ...\n    } else {\n      ... greater than ...\n    }\n<\/pre>\n<p>The compiler realized that it already knew whether constants were greater than, less than, or equal to zero, so it could remove the test against <code>result<\/code> and jump straight to the answer: <\/p>\n<pre>\n    int result;\n    if (*it &lt; key) { result = -1; <font COLOR=\"blue\">goto less_than;<\/font> }\n    else if (*it &gt; key) { result = 1; <font COLOR=\"blue\">goto greater_than;<\/font> }\n    else { result = 0; <font COLOR=\"blue\">goto equal_to;<\/font> }\n    if (result &lt; 0) {\nless_than:\n      ... less than ...\n    } else if (result == 0) {\nequal_to:\n      ... equal to ...\n    } else {\ngreater_than:\n      ... greater than ...\n    }\n<\/pre>\n<p>And then it saw that all of the tests against <code>result<\/code> were unreachable code, so it deleted them. <\/p>\n<pre>\n    int result;\n    if (*it &lt; key) { result = -1; <font COLOR=\"blue\">goto less_than;<\/font> }\n    else if (*it &gt; key) { result = 1; <font COLOR=\"blue\">goto greater_than;<\/font> }\n    else { result = 0; <font COLOR=\"blue\">goto equal_to;<\/font> }\n\nless_than:\n      ... less than ...\n      <font COLOR=\"blue\">goto done;<\/font>\n\nequal_to:\n      ... equal to ...\n      <font COLOR=\"blue\">goto done;<\/font>\n\ngreater_than:\n      ... greater than ...\n<font COLOR=\"blue\">done:<\/font>\n<\/pre>\n<p> That then left <code>result<\/code> as a write-only variable, so it too could be deleted: <\/p>\n<pre>\n    if (*it &lt; key) { <font COLOR=\"blue\">goto less_than;<\/font> }\n    else if (*it &gt; key) { <font COLOR=\"blue\">goto greater_than;<\/font> }\n    else { <font COLOR=\"blue\">goto equal_to;<\/font> }\n\nless_than:\n      ... less than ...\n      <font COLOR=\"blue\">goto done;<\/font>\n\nequal_to:\n      ... equal to ...\n      <font COLOR=\"blue\">goto done;<\/font>\n\ngreater_than:\n      ... greater than ...\n<font COLOR=\"blue\">done:<\/font>\n<\/pre>\n<p>Which is equivalent to the code we wanted all along: <\/p>\n<pre>\n    if (*it &lt; key) {\n      ... less than ...\n    } else if (*it &gt; key) {\n      ... greater than ...\n    } else {\n      ... equal to ...\n    }\n<\/pre>\n<p>The last optimization is realizing that the test in the <code>else if<\/code> could use the flags left over by the <code>if<\/code>, so all that was left was the conditional jump. <\/p>\n<p>Some very straightforward optimizations took our very unoptimized (but easy-to-analyze) code and turned it into something much more efficient. <\/p>\n<p>On the other hand, let&#8217;s look at what happens with, say, the second comparison function: <\/p>\n<pre>\n    ; on entry, edi is the value to search for\n\n    lea r9, [rsp-120]           ; r9 = first\n    mov ecx, 8192               ; ecx = length\n    jmp .L9\n.L11:                           ;\n    <font COLOR=\"blue\">test eax, eax               ; result == 0?\n    je .L10                     ; Y: found it<\/font>\n                                ; was greater than\n    mov rcx, rdx                ; length = step\n    test rcx, rcx               ; while (length &gt; 0)\n    jle .L19\n.L9:\n    mov rdx, rcx\n    <font COLOR=\"blue\">xor eax, eax                ; return value of compare2<\/font>\n    sar rdx, 1                  ; rdx = step = length \/ 2\n    lea r8, [r9+rdx*4]          ; it = first + step\n\n    <font COLOR=\"blue\">; result = compare(*it, key), and then test the result\n    mov esi, dword ptr [r8]     ; esi = *it\n    cmp esi, edi                ; compare *it with key\n    setl sil                    ; sil = 1 if less than\n    setg al                     ; al  = 1 if greater than\n                                ; eax = 1 if greater than\n    movzx esi, sil              ; esi = 1 if less than\n    sub eax, esi                ; result = (greater than) - (less than)\n    cmp eax, -1                 ; less than zero?\n    jne .L11                    ; N: Try zero or positive<\/font>\n\n                                ; was less than\n    add rdx, 1                  ; step + 1\n    lea r9, [r8+4]              ; first = it + 1\n    sub rcx, rdx                ; length -= step + 1\n    test rcx, rcx               ; while (length &gt; 0)\n    jg .L9\n.L19:\n    lea r8, [rsp+32648]         ; r8 = last\n.L10:\n    ... return value in r8\n<\/pre>\n<p>The second comparison function <code>compare2<\/code> uses the relational comparison operators to generate exactly 0 or 1. This is a clever way of generating <code>-1<\/code>, <code>0<\/code>, or <code>+1<\/code>, but unfortunately,  that was not our goal in the grand scheme of things. It was merely a step toward that goal. The way that <code>compare2<\/code> calculates the result is too complicated for the optimizer to understand, so it just does its best at calculating the formal return value from <code>compare2<\/code> and testing its sign. (The compiler does realize that the only possible negative value is <code>-1<\/code>, but that&#8217;s not enough insight to let it optimize the entire expression away.) <\/p>\n<p>If we try <code>compare3<\/code>, we get this: <\/p>\n<pre>\n    ; on entry, esi is the value to search for\n\n    lea rdi, [rsp-120]          ; rdi = first\n    mov ecx, 8192               ; ecx = length\n    jmp .L12\n.L28:                           ; was greater than\n    mov rcx, rax                ; length = step\n.L12:\n    mov rax, rcx\n    sar rax, 1                  ; rax = step = length \/ 2\n    lea rdx, [rdi+rax*4]        ; it = first + step\n\n    <font COLOR=\"blue\">; result = compare(*it, key), and then test the result\n    cmp dword ptr [rdx], esi    ; compare(*it, key)\n    jl .L14                     ; if less than\n    jle .L13                    ; if less than or equal (therefore equal)<\/font>\n\n    ; \"length\" is in eax now\n.L15:                           ; was greater than\n    test eax, eax               ; length == 0?\n    jg .L28                     ; N: continue looping\n    lea rdx, [rsp+32648]        ; rdx = last\n.L13:\n    ... return value in rdx\n\n.L14:                           ; was less than\n    add rax, 1                  ; step + 1\n    lea rdi, [rdx+4]            ; first = it + 1\n    sub rcx, rax                ; length -= step + 1\n    mov rax, rcx                ; rax = length\n    jmp .L15\n<\/pre>\n<p>The compiler was able to understand this version of the comparison function: It observed that if <code>a &lt; b<\/code>, then the result of <code>compare3<\/code> is always negative, so it jumped straight to the less-than case. Otherwise, it observed that the result was zero if <code>a<\/code> is not greater than <code>b<\/code> and jumped straight to that case too. The compiler did have some room for improvement with the placement of the basic blocks, since there is an unconditional jump in the inner loop, but overall it did a pretty good job. <\/p>\n<p>The last case is the inline assembly with <code>compare4<\/code>. As you might expect, the compiler had the most trouble with this. <\/p>\n<pre>\n    ; on entry, edi is the value to search for\n\n    lea r8, [rsp-120]           ; r8 = first\n    mov ecx, 8192               ; ecx = length\n    jmp .L12\n.L14:                           ; zero or positive\n    <font color=\"blue\">je .L13                     ; zero - done<\/font>\n                                ; was greater than\n    mov rcx, rdx                ; length = step\n    test rcx, rcx               ; while (length &gt; 0)\n    jle .L22\n.L12:\n    mov rdx, rcx\n    sar rdx, 1                  ; rdx = step = length \/ 2\n    lea rsi, [r8+rdx*4]         ; it = first + step\n\n    <font COLOR=\"blue\">; result = compare(*it, key), and then test the result\n    mov eax, dword ptr [rsi]    ; eax = *it\n    sub eax, edi\n    jno 1f\n    cmc\n    rcr eax, 1\n1:\n    test eax, eax               ; less than zero?\n    jne .L14                    ; N: Try zero or positive<\/font>\n\n                                ; was less than\n    add rdx, 1                  ; step + 1\n    lea r8, [rsi+4]             ; first = it + 1\n    sub rcx, rdx                ; length -= step + 1\n    test rcx, rcx               ; while (length &gt; 0)\n    jg .L12\n.L22:\n    lea rsi, [rsp+32648]        ; rsi = last\n.L13:\n    ... return value in rsi\n<\/pre>\n<p>This is pretty much the same as <code>compare2<\/code>: The compiler has no insight at all into the inline assembly, so it just dumps it into the code like a black box, and then once control exits the black box, it checks the sign in a fairly efficient way. But it had no real optimization opportunities because you can&#8217;t really optimize inline assembly. <\/p>\n<p>The conclusion of all this is that optimizing the instruction count in your finely-tuned comparison function is a fun little exercise, but it doesn&#8217;t necessarily translate into real-world improvements. In our case, we focused on optimizing the code that encodes the result of the comparison without regard for how the caller is going to decode that result. The contract between the two functions is that one function needs to package some result, and the other function needs to unpack it. But we discovered that the more obtusely we wrote the code for the packing side, the less likely the compiler would be able to see how to optimize out the entire hassle of packing and unpacking in the first place. In the specific case of comparison functions, it means that you may want to return <code>+1<\/code>, <code>0<\/code>, and <code>-1<\/code> explicitly rather than calculating those values in a fancy way, because it turns out compilers are really good at optimizing &#8220;compare a constant with zero&#8221;. <\/p>\n<p>You have to see how your attempted optimizations fit into the bigger picture because you may have hyper-optimized one part of the solution to the point that it prevents deeper optimizations in other parts of the solution. <\/p>\n<p><b>Bonus chatter<\/b>: If the comparison function is not inlined, then all of these optimization opportunities disappear. But I personally wouldn&#8217;t worry about it too much, because if the comparison function is not inlined, then the entire operation is going to be dominated by the function call overhead: Setting up the registers for the call, making the call, returning from the call, testing the result, and most importantly, the lost register optimization opportunities not only because the compiler loses opportunities to enregister values across the call, but also because the compiler has to protect against the possibility that the comparison function will mutate global state and consequently create aliasing issues. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Missing the forest for the blade of grass.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-97416","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Missing the forest for the blade of grass.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/97416","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=97416"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/97416\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=97416"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=97416"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=97416"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}