{"id":1273,"date":"2013-08-09T13:00:00","date_gmt":"2013-08-09T13:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/vcblog\/2013\/08\/09\/optimizing-c-code-dead-code-elimination\/"},"modified":"2019-02-18T18:40:57","modified_gmt":"2019-02-18T18:40:57","slug":"optimizing-c-code-dead-code-elimination","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/optimizing-c-code-dead-code-elimination\/","title":{"rendered":"Optimizing C++ Code : Dead Code Elimination"},"content":{"rendered":"<p>&nbsp;<\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">If you have arrived in the middle of this blog series, you might want instead to begin at the <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/vcblog\/archive\/2013\/05\/29\/optimizing-c-code.aspx\"><span style=\"color: #0563c1;font-family: Calibri;font-size: small\">beginning<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\">.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">This post examines the optimization called Dead-Code-Elimination, which I&rsquo;ll abbreviate to DCE.&nbsp; It does what it says: discards any calculations whose results are not actually <em>used<\/em> by the program.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Now, you will probably assert that your code calculates <em>only<\/em> results that are used, and never any results that are <em>not <\/em>used: only an idiot, after all, would gratuitously add useless code &ndash; calculating the first 1000 digits of <em>pi<\/em>, for example, whilst also doing something useful.&nbsp; So when would the DCE optimization ever have an effect?<\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">The reason for describing DCE so early in this series is that it could otherwise wreak havoc and confusion throughout our&nbsp; exploration of other, more interesting optimizations: consider this tiny example program, held in a file called <\/span><span style=\"color: #000000\"><span style=\"color: #ff0000;font-family: Consolas;background-color: #ffffff\">Sum.cpp<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\">:<\/span><\/span><\/span><\/p>\n<p style=\"padding-left: 30px\"><span style=\"color: #0000ff;font-family: courier new,courier;font-size: small;background-color: #ffffff\">int main() {<br \/>&nbsp;&nbsp;&nbsp; long long s = 0;<br \/>&nbsp;&nbsp;&nbsp; for (long long i = 1; i &lt;= 1000000000; ++i) s += i;<br \/>}<\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">We are interested in how fast the loop executes, calculating the sum of the first billion integers. &nbsp;(Yes, this is a spectacularly silly example, since the result has a closed formula, taught in High school.&nbsp; But that&rsquo;s not the point)<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Build this program with the command: &nbsp;<\/span><\/span><span style=\"color: #ff0000\"><span style=\"font-family: Consolas;background-color: #ffffff\">CL \/Od \/FA Sum.cpp<\/span><\/span><span style=\"font-family: Calibri;font-size: small\">&nbsp; and run with the command <\/span><span style=\"color: #ff0000;font-family: Consolas\">Sum<\/span><span style=\"font-family: Calibri;font-size: small\">.&nbsp; Note that this build <em>disables<\/em> optimizations, via the <\/span><span style=\"color: #ff0000;font-family: Consolas\">\/Od<\/span><span style=\"font-family: Calibri;font-size: small\"> switch.&nbsp; On my PC, it takes about 4 seconds to run.&nbsp; Now try compiling optimized-for-speed, using <\/span><span style=\"color: #ff0000;font-family: Consolas\">CL \/O2 \/FA Sum.cpp<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\">.&nbsp; On my PC, this version runs so fast there&rsquo;s no perceptible delay.&nbsp; Has the compiler really done such a fantastic job at optimizing our code?&nbsp; The answer is no (but in a surprising way, also yes):<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Let&rsquo;s look first at the code it generates for the <span style=\"color: #ff0000\">\/Od<\/span> case, stored into <span style=\"color: #ff0000\">Sum.asm<\/span>.&nbsp; I have trimmed and annotated&nbsp;the text to show only the loop body:<\/span><\/span><\/p>\n<p class=\"Code\"><span style=\"color: #0000ff;background-color: #ffffff\"><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; QWORD PTR s$[rsp], 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; long long s = 0<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; QWORD PTR i$1[rsp], 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; long long i = 1<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; jmp&nbsp;&nbsp;&nbsp; SHORT $LN3@main&nbsp;<\/span><span style=\"font-family: courier new,courier\">&nbsp;<br \/><a href=\"mailto:%24LN2@main\"><span style=\"color: #0000ff;background-color: #ffffff\">$LN2@main<\/span><\/a>:<\/span><span style=\"font-family: courier new,courier\"><br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; rax, QWORD PTR i$1[rsp]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rax =&nbsp;i<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc&nbsp;&nbsp;&nbsp; rax&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rax += 1<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; QWORD PTR i$1[rsp], rax&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; i = rax<\/span><span style=\"font-family: courier new,courier\">&nbsp;<br \/><\/span><span style=\"font-family: courier new,courier\"><a href=\"mailto:%24LN3@main\"><span style=\"color: #0000ff;background-color: #ffffff\">$LN3@main<\/span><\/a>:<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; cmp&nbsp;&nbsp;&nbsp; QWORD PTR i$1[rsp], 1000000000&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; i &lt;= 1000000000 ?<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; jg&nbsp;&nbsp;&nbsp;&nbsp; SHORT $LN1@main&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; no &ndash; we&rsquo;re done<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; rax, QWORD PTR i$1[rsp]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rax =&nbsp;i<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; rcx, QWORD PTR s$[rsp]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rcx = s<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; add&nbsp;&nbsp;&nbsp; rcx, rax&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rcx += rax<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; rax, rcx&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rax = rcx<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; QWORD PTR s$[rsp], rax&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; s = rax<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; jmp&nbsp;&nbsp;&nbsp; SHORT $LN2@main&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&amp;\nnbsp; ;; loop<br \/><\/span><span style=\"font-family: courier new,courier\">$LN1@main:<\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">The instructions look pretty much like you would expect.&nbsp; The variable <\/span><span style=\"color: #ff0000;font-family: Consolas\">i<\/span><span style=\"font-family: Calibri;font-size: small\"> is held on the stack at offset <\/span><span style=\"color: #ff0000;font-family: Consolas\">i$1<\/span><span style=\"font-family: Calibri;font-size: small\"> from the location pointed-to by the RSP register; elsewhere in the asm file, we find that <\/span><span style=\"color: #ff0000;font-family: Consolas\">i$1<\/span><span style=\"font-family: Calibri;font-size: small\"> = 0.&nbsp; We use the <\/span><span style=\"font-family: Consolas\">RAX<\/span><span style=\"font-family: Calibri;font-size: small\"> register to increment <\/span><span style=\"color: #ff0000;font-family: Consolas\">i<\/span><span style=\"font-family: Calibri;font-size: small\">.&nbsp; Similarly, variable <\/span><span style=\"color: #ff0000;font-family: Consolas\">s<\/span><span style=\"font-family: Calibri;font-size: small\"> is held on the stack (at offset <\/span><span style=\"color: #ff0000;font-family: Consolas\">s$<\/span><span style=\"font-family: Calibri;font-size: small\"> from the location pointed-to by the <\/span><span style=\"font-family: Consolas\">RSP<\/span><span style=\"font-family: Calibri;font-size: small\"> register; elsewhere in the asm file, we find that <\/span><span style=\"color: #ff0000;font-family: Consolas\">s$<\/span><span style=\"font-family: Calibri;font-size: small\"> = 8).&nbsp; The code uses <\/span><span style=\"font-family: Consolas\">RCX<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> to calculate the running sum each time around the loop.<\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">Notice how we load the value for <\/span><span style=\"color: #ff0000;font-family: Consolas\">i<\/span><span style=\"font-family: Calibri;font-size: small\"> from its &ldquo;home&rdquo; location on the stack, every time around the loop; and write the new value back to its &ldquo;home&rdquo; location.&nbsp; The same for the variable <\/span><span style=\"color: #ff0000;font-family: Consolas\">s<\/span><span style=\"font-family: Calibri;font-size: small\">.&nbsp; We could describe this code as &ldquo;na&iuml;ve&rdquo; &ndash; it&rsquo;s been generated by a dumb compiler (i.e., one with optimizations disabled).&nbsp; For example, it&rsquo;s wasteful to keep accessing memory for every iteration of the loop, when we could have kept the values for <\/span><span style=\"color: #ff0000;font-family: Consolas\">i<\/span><span style=\"font-family: Calibri;font-size: small\"> and <\/span><span style=\"color: #ff0000;font-family: Consolas\">s<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> in registers throughout.<\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">So much for the non-optimized code.&nbsp; What about the code generated for the optimized case?&nbsp; Let&rsquo;s look at the corresponding <\/span><span style=\"color: #ff0000;font-family: Consolas\">Sum.asm <\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\">for the optimized, <span style=\"color: #ff0000\">\/O2<\/span>, build.&nbsp; Again, I have trimmed the file down to just the part that implements the loop body, and the answer is:<\/span><\/span><\/p>\n<p class=\"Code\"><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; there&rsquo;s nothing here!<\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">Yes &ndash; it&rsquo;s empty!&nbsp; There are <em>no<\/em> instructions that calculate the value of <\/span><span style=\"color: #ff0000;font-family: Consolas\">s<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\">.&nbsp;&nbsp; <\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">Well, that answer is clearly wrong, you might say.&nbsp; But how do we <em>know <\/em>the answer is wrong?&nbsp; The optimizer has reasoned that the program does not <em>actually<\/em> make use of the answer for <\/span><span style=\"color: #ff0000;font-family: Consolas\">s<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> at any point; and so it has not bothered to include any code to calculate it.&nbsp; You can&rsquo;t say the answer is wrong, unless you check that answer, right?<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">We have just fallen victim to, been mugged in the street by, and lost our shirt to, the DCE optimization.&nbsp; If you don&rsquo;t observe an answer, the program (often) won&rsquo;t calculate that answer.&nbsp; <\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">You might view this effect in the optimizer as analogous, in a profoundly shallow way, to a fundamental result in Quantum Physics, often paraphrased in articles on popular science as &ldquo;<\/span><a href=\"http:\/\/en.wikipedia.org\/wiki\/If_a_tree_falls_in_a_forest\"><span style=\"color: #0563c1;font-family: Calibri;font-size: small\">If a tree falls<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> in the forest, and there&rsquo;s nobody around to hear, does it make a sound?&rdquo;<\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">Well, let&rsquo;s &ldquo;observe&rdquo; the answer in our program, by adding a <\/span><span style=\"color: #ff0000;font-family: Consolas\">printf<\/span><span style=\"font-family: Calibri;font-size: small\"> of variable <\/span><span style=\"color: #ff0000;font-family: Consolas\">s<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\">, into our code, as follows:<\/span><\/span><\/p>\n<p class=\"Code\" style=\"padding-left: 30px\"><span style=\"color: #0000ff\"><span style=\"font-family: courier new,courier\">#include &lt;stdio.h&gt;<br \/><\/span><span style=\"font-family: courier new,courier\">int main() {<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp; long long s = 0;<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp; for (long long i = 1; i &lt;= 1000000000; ++i) s += i;<br \/><\/span><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp; printf(&#8220;%lld &#8220;, s);<br \/><\/span><span style=\"font-family: courier new,courier\">}&nbsp; <\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">The <span style=\"color: #ff0000\">\/Od<\/span> version of this program prints the correct answer, still taking about 4 seconds to run.&nbsp; The <span style=\"color: #ff0000\">\/O2<\/span> version prints the same, correct answer, but runs much faster.&nbsp; &nbsp;(See the optional section below for the value of &ldquo;much faster&rdquo; &ndash; in fact, the speedup is around 7X)<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">At this point, we have explained the main point of this blog post: always be very careful&nbsp;in analyzing compiler optimizations, and in measuring their benefit, that we are not being misled by DCE.&nbsp; Here are four steps to help notice, and&nbsp;ward off, the unasked-for attention of DCE:<\/span><\/span><\/p>\n<ul>\n<li><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Check that the timings have not suddenly improved by an order of magnitude<\/span><\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: small\">Examine the generated code (using the <\/span><span style=\"color: #ff0000;font-family: Consolas\">\/FA<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> switch)<\/span><\/span><\/li>\n<li><span style=\"font-size: small\"><span style=\"font-family: Calibri\">If in doubt add a strategic <span style=\"color: #ff0000\">printf<\/span><\/span><\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: small\">Put the code of interest into its own <\/span><span style=\"font-family: Consolas\">.CPP<\/span><span style=\"font-family: Calibri;font-size: small\"> file, separate from the one holding <span style=\"color: #ff0000\">main<\/span>.&nbsp; This works, so long as you do <em>not <\/em>request whole-program optimization (via the <\/span><span style=\"color: #ff0000;font-family: Consolas\">\/GL<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> switch, that we&rsquo;ll cover later)<\/span><\/span><\/li>\n<\/ul>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">However, we can wring several more interesting lessons from this&nbsp;tiny example.&nbsp; I have marked these sections below as &ldquo;Optional-1&rdquo; through &ldquo;Optional-4&rdquo;.<\/span><\/span><\/p>\n<h2><span style=\"color: #323e4f;font-family: Calibri\">&nbsp;<\/span><span style=\"color: #323e4f;font-family: Calibri\">&nbsp;<\/span><\/h2>\n<h2><span style=\"color: #323e4f\"><span style=\"font-family: Calibri\">Optional-1 : Codegen for \/O2<\/span><\/span><\/h2>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Why is our <span style=\"color: #ff0000\">\/O2<\/span> version (including a <span style=\"color: #ff0000\">printf<\/span> to defeat DCE), so much faster than the <span style=\"color: #ff0000\">\/Od<\/span> version?&nbsp; Here is the code generated for the loop of the <span style=\"color: #ff0000\">\/O2<\/span> version, extracted from the <span style=\"color: #ff0000\">Sum.asm<\/span> file:<\/span><\/span><\/p>\n<p class=\"Code\" style=\"padding-left: 30px\"><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; xor&nbsp;&nbsp;&nbsp; edx, edx<br \/><\/span><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; eax, 1&nbsp;<br \/><\/span><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; ecx, edx<br \/><\/span><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; r8d, edx<br \/><\/span><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; r9d, edx<br \/><\/span><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; npad&nbsp;&nbsp; 13<br \/><\/span><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\"><a href=\"mailto:%24LL3@main\">$LL3@main<\/a>:<br \/><\/span><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc&nbsp;&nbsp;&nbsp; r9<br \/><\/span><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; add&nbsp;&nbsp;&nbsp; r8, 2<br \/><\/span><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; add&nbsp;&nbsp;&nbsp; rcx, 3<br \/><\/span><\/span><span style=\"color: #0000ff;font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; add&nbsp;&nbsp;&nbsp; r9, rax&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; r9&nbsp; = 2&nbsp; 8 18 32 50 &#8230;<br \/><\/span><span style=\"color: #0000ff;font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; add&nbsp;&nbsp;&nbsp; r8, rax&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; r8&nbsp; = 3 10 21 36 55 &#8230;<br \/><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; add&nbsp;&nbsp;&nbsp; rcx, rax&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rcx = 4 12 24 40 60 &#8230;<br \/><\/span><\/span><span style=\"color: #0000ff;font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; add&nbsp;&nbsp;&nbsp; rdx, rax&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rdx = 1&nbsp; 6 15 28 45 &#8230;<br \/><\/span><span style=\"color: #0000ff;font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; add&nbsp;&nbsp;&nbsp; rax, 4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rax = 1&nbsp; 5&nbsp; 9 13 17 &#8230;<br \/><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; cmp&nbsp;&nbsp;&nbsp; rax, 1000000000&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; i &lt;= 1000000000 ?<br \/><\/span><\/span><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; jle&nbsp;&nbsp;&nbsp; SHORT $LL3@main&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; yes, so loop back<\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">Note that the loop body contains approximately the same number of instructions as the non-optimized build, and yet it runs <em>much<\/em> faster.&nbsp; That&rsquo;s mostly because the instructions in this optimized loop use registers, rather than memory locations.&nbsp; As we all know, accessing a register is much faster than accessing memory.&nbsp; Here are <\/span><a href=\"http:\/\/software.intel.com\/sites\/products\/collateral\/hpc\/vtune\/performance_analysis_guide.pdf\"><span style=\"color: #0563c1;font-family: Calibri;font-size: small\">approximate latencies<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> that demonstrate how memory access can slow your program to a snail&rsquo;s pace:<\/span><\/span><\/p>\n<table style=\"width: 172px;height: 170px;background-color: #f0ffff\" border=\"0\">\n<tbody>\n<tr>\n<td>Location<\/td>\n<td>\n<p>Latency<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>Register<\/td>\n<td>1 cycle<\/td>\n<\/tr>\n<tr>\n<td>L1<\/td>\n<td>4 cycles<\/td>\n<\/tr>\n<tr>\n<td>L2<\/td>\n<td>10 cycles<\/td>\n<\/tr>\n<tr>\n<td>L3<\/td>\n<td>75 cycles<\/td>\n<\/tr>\n<tr>\n<td>DRAM<\/td>\n<td>60 ns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">So, the non-optimized version is reading and writing to stack locations, which will quickly migrate into L2 (10 cycle access time) and L1 cache (4 cycle access time).&nbsp; Both are slower than if all the calculation is done in registers, with access times around a single cycle.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">But there&rsquo;s more going on here to make the code run faster.&nbsp; Notice how the <span style=\"color: #ff0000\">\/Od<\/span> version increments the loop counter by 1 each time around the loop.&nbsp; But the <span style=\"color: #ff0000\">\/O2<\/span> version increments the loop counter (held in register RAX) by 4 each time around the loop.&nbsp; Eh?<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">The optimizer has <em>unrolled<\/em> the loop.&nbsp; So it adds four items together on each iteration, like this:<\/span><\/span><\/p>\n<p class=\"Code\" style=\"padding-left: 30px\"><span style=\"color: #0000ff;font-family: courier new,courier\">s = (1 + 2 + 3 + 4) + (5 + 6 + 7 + 8) + (9 + 10 + 11 + 12) + (13 + . . .<\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">By <em>unrolling <\/em>this loop, the test for loop-end is made every four iterations, rather than on every iteration &ndash; so the CPU spends more time doing useful work than forever checking whether it can stop yet!<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Also, rather than accumulate the results into a single location, it has decided to use 4 separate registers, to accumulate 4 partial sums, like this:<\/span><\/span><\/p>\n<p class=\"Code\" style=\"padding-left: 30px\"><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">RDX = 1 + 5 +&nbsp; 9 + 13 + &#8230; &nbsp;= &nbsp;1, &nbsp;6, 15, 28 &#8230;<br \/> R9&nbsp; = 2 + 6 + 10 + 14 + &#8230; &nbsp;= &nbsp;2,&nbsp; 8, 18, 32 &#8230;<br \/> R8&nbsp; = 3 + 7 + 11 + 15 + &#8230; &nbsp;= &nbsp;3, 10, 21, 36 &#8230;<br \/> RCX = 4 + 8 + 12 + 16 + &#8230; &nbsp;= &nbsp;4, 12, 24, 40 &#8230;<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">At the end of the loop, it adds the partial sums, in these four registers, together, to get the final answer.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">(I will leave it as an exercise for the reader how the optimizer copes with a loop whose trip count is not a multiple of 4)<\/span><\/span><\/p>\n<h2><span style=\"color: #323e4f;font-family: Calibri\">&nbsp;<\/span><\/h2>\n<h2><span style=\"color: #323e4f\"><span style=\"font-family: Calibri\">Optional-2 : Accurate Performance Measurements<\/span><\/span><\/h2>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Earlier, I said that the <span style=\"color: #ff0000\">\/O2<\/span> version of the program, without a <span style=\"color: #ff0000\">printf <\/span><em>inhibitor<\/em>, &ldquo;runs so fast there&rsquo;s no perceptible delay&rdquo;.&nbsp; Here is a program that makes that statement more precise:<\/span><\/span><\/p>\n<p class=\"Code\" style=\"padding-left: 30px\"><span style=\"color: #0000ff\"><span style=\"font-family: Consolas\">#include &lt;stdio.h&gt;<br \/><\/span><span style=\"font-family: Consolas\">#include &lt;windows.h&gt;<br \/><\/span><span style=\"font-family: Consolas\">int main() {<br \/><\/span><span style=\"font-family: Consolas\">&nbsp; LARGE_INTEGER start, stop;<br \/><\/span><span style=\"font-family: Consolas\">&nbsp; QueryPerformanceCounter(&amp;start);<br \/><\/span><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp; long long s = 0;<br \/><\/span><span style=\"font-family: Consolas\">&nbsp;&nbsp;&nbsp; for (long long i = 1; i &lt;= 1000000000; ++i) s += i;<br \/><\/span><span style=\"font-family: Consolas\">&nbsp; QueryPerformanceCounter(&amp;stop);<br \/><\/span><span style=\"font-family: Consolas\">&nbsp; double diff = stop.QuadPart &#8211; start.QuadPart;<br \/><\/span><span style=\"font-family: Consolas\">&nbsp; printf(&#8220;%f&#8221;, diff);<br \/><\/span><span style=\"font-family: Consolas\">}<\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">It uses <\/span><span style=\"color: #ff0000;font-family: Consolas\">QueryPerformanceCounter<\/span><span style=\"font-family: Calibri;font-size: small\"> to measure the elapsed time.&nbsp; (This is a &ldquo;sawn-off&rdquo; version of a <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2011\/12\/28\/high-resolution-timer-for-c.aspx\"><span style=\"color: #0563c1;font-family: Calibri;font-size: small\">high-resolution timer<\/span><\/a><span style=\"font-family: Calibri;font-size: small\"> described in a previous blog).&nbsp; There are lots of caveats to bear in-mind when measuring performance (you might want to browse a <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/06\/26\/auto-vectorizer-in-visual-studio-11-how-much-faster.aspx\"><span style=\"color: #0563c1;font-family: Calibri;font-size: small\">list I wrote previously<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\">), but they don&rsquo;t matter for this particular case, as we shall see in a moment:<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">On my PC, the <span style=\"color: #ff0000\">\/Od<\/span> version of this program prints a value for &nbsp;<\/span><\/span><span style=\"color: #ff0000;font-family: Consolas\">diff<\/span><span style=\"font-family: Calibri;font-size: small\"> of about 7 million <em>somethings<\/em>.&nbsp; (The units of the answer don&rsquo;t matter &ndash; just know that the number gets bigger as the program takes longer to run).&nbsp; The <span style=\"color: #ff0000\">\/O2<\/span> version prints a value for <\/span><span style=\"color: #ff0000;font-family: Consolas\">diff<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> of 0.&nbsp; And the cause, as explained above, is our friend DCE.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">When we add a <span style=\"color: #ff0000\">printf<\/span> to prevent DCE, the <span style=\"color: #ff0000\">\/Od<\/span> version runs in about 1 million <em>somethings <\/em>&#8211; a speedup of about 7X.<\/span><\/span><\/p>\n<h2>&nbsp;<\/h2>\n<h2><span style=\"color: #323e4f\"><span style=\"font-family: Calibri\">Optional-3 : x64 Assembler &ldquo;widening&rdquo;<\/span><\/span><\/h2>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">If we look carefully again at the assembler listings in this post, we find a few surprises in the part of the program that initializes our registers:<\/span><\/span><\/p>\n<p class=\"Code\"><span style=\"color: #0000ff;font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; xor&nbsp;&nbsp;&nbsp; edx, edx&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rdx = 0&nbsp;&nbsp;&nbsp;&nbsp; (64-bit!)<br \/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; eax, 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rax = i = 1 (64-bit!)<br \/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; ecx, edx&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; rcx = 0&nbsp;&nbsp;&nbsp;&nbsp; (64-bit!)<br \/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; r8d, edx&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;;; r8&nbsp; = 0&nbsp;&nbsp;&nbsp;&nbsp; (64-bit!)<br \/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mov&nbsp;&nbsp;&nbsp; r9d, edx&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; r9&nbsp; = 0&nbsp;&nbsp;&nbsp;&nbsp; (64-bit!)<br \/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; npad&nbsp;&nbsp; 13&nbsp;&nbsp;&nbsp;&nbsp;&amp;nbs\np;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;; multi-byte nop alignment padding<br \/>$LL3@main:<\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Recall first that the original C++ program uses <span style=\"color: #ff0000\">long long<\/span> variables for both the loop counter and the sum.&nbsp; In the VC++ compiler, these map onto 64-bit integers.&nbsp; So we should expect the generated code to make use of x64&rsquo;s 64-bit registers.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">We already explained, in a previous post, that <span style=\"color: #ff0000\">xor reg, reg<\/span> is a compact way to zero out the contents of <span style=\"color: #ff0000\">reg<\/span>.&nbsp; But our first instruction is apply <span style=\"color: #ff0000\">x<\/span><\/span><\/span><\/p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&nbsp; If you have arrived in the middle of this blog series, you might want instead to begin at the beginning. This post examines the optimization called Dead-Code-Elimination, which I&rsquo;ll abbreviate to DCE.&nbsp; It does what it says: discards any calculations whose results are not actually used by the program. Now, you will probably assert [&hellip;]<\/p>\n","protected":false},"author":271,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[66,36],"class_list":["post-1273","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cplusplus","tag-performance","tag-vc"],"acf":[],"blog_post_summary":"<p>&nbsp; If you have arrived in the middle of this blog series, you might want instead to begin at the beginning. This post examines the optimization called Dead-Code-Elimination, which I&rsquo;ll abbreviate to DCE.&nbsp; It does what it says: discards any calculations whose results are not actually used by the program. Now, you will probably assert [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/1273","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/271"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=1273"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/1273\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=1273"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=1273"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=1273"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}