{"id":36973,"date":"2004-12-16T07:00:00","date_gmt":"2004-12-16T15:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/2004\/12\/16\/optimization-is-often-counter-intuitive\/"},"modified":"2021-07-31T08:49:08","modified_gmt":"2021-07-31T15:49:08","slug":"20041216-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20041216-00\/?p=36973","title":{"rendered":"Optimization is often counter-intuitive"},"content":{"rendered":"<p>Anybody who&#8217;s done intensive optimization knows that optimization is often counter-intuitive. Things you think would be faster often aren&#8217;t.<\/p>\n<p>Consider, for example, the exercise of obtaining the current instruction pointer. There&#8217;s the na\u00efve solution:<\/p>\n<pre>__declspec(noinline)\r\nvoid *GetCurrentAddress()\r\n{\r\n  return _ReturnAddress();\r\n}\r\n\r\n...\r\nvoid *currentInstruction = GetCurrentAddress();\r\n<\/pre>\n<p>If you look at the disassembly, you&#8217;ll get something like this:<\/p>\n<pre>GetCurrentAddress:\r\n    mov eax, [esp]\r\n    ret\r\n\r\n...\r\n    call GetCurrentAddress\r\n    mov [currentInstruction], eax\r\n<\/pre>\n<p>&#8220;Pah,&#8221; you say to yourself, &#8220;look at how inefficient that is. I can reduce that to two instructions. Watch:<\/p>\n<pre>void *currentInstruction;\r\n__asm {\r\ncall L1\r\nL1: pop currentInstruction\r\n}\r\n<\/pre>\n<p>That&#8217;s half the instruction count of your bloated <a href=\"http:\/\/snltranscripts.jt.org\/88\/88ghansfranz.phtml\"> girly-code<\/a>.&#8221;<\/p>\n<p>But if you sit down and race the two code sequences, you&#8217;ll find that the function-call version is faster by a factor of two! How can that be?<\/p>\n<p>The reason is the &#8220;hidden variables&#8221; inside the processor. All modern processors contain much more state than you can see from the instruction sequence. There are TLBs, L1 and L2 caches, all sorts of stuff that you can&#8217;t see. The hidden variable that is important here is the return address predictor.<\/p>\n<p>The more recent Pentium (and I believe also Athlon) processors maintain an internal stack that is updated by each <code>CALL<\/code> and <code>RET<\/code> instruction. When a <code>CALL<\/code> is executed, the return address is pushed both onto the real stack (the one that the <code>ESP<\/code> register points to) as well as to the internal return address predictor stack; a <code>RET<\/code> instruction pops the top address of the return address predictor stack as well as the real stack.<\/p>\n<p>The return address predictor stack is used when the processor decodes a <code>RET<\/code> instruction. It looks at the top of the return address predictor stack and says, &#8220;I bet that <code>RET<\/code> instruction is going to return to that address.&#8221; It then speculatively executes the instructions at that address. Since programs rarely fiddle with return addresses on the stack, these predictions tend to be highly accurate.<\/p>\n<p>That&#8217;s why the &#8220;optimization&#8221; turns out to be slower. Let&#8217;s say that at the point of the <code>CALL L1<\/code> instruction, the return address predictor stack looks like this:<\/p>\n<table class=\"cp3\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th valign=\"BOTTOM\">Return address<br \/>\npredictor stack:<\/th>\n<td valign=\"BOTTOM\">\u00a0<\/td>\n<td valign=\"BOTTOM\">caller1<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller2<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller3<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">\u22ef<\/td>\n<\/tr>\n<tr>\n<th valign=\"BOTTOM\">Actual stack:<\/th>\n<td valign=\"BOTTOM\">\u00a0<\/td>\n<td valign=\"BOTTOM\">caller1<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller2<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller3<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">\u22ef<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Here, <code>caller1<\/code> is the function&#8217;s caller, <code>caller1<\/code> is the function&#8217;s caller&#8217;s caller, and so on. So far, the return address predictor stack is right on target. (I&#8217;ve drawn the actual stack below the return address predictor stack so you can see that they match.)<\/p>\n<p>Now you execute the <code>CALL<\/code> instruction. The return address predictor stack and the actual stack now look like this:<\/p>\n<table class=\"cp3\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th valign=\"BOTTOM\">Return address<br \/>\npredictor stack:<\/th>\n<td valign=\"BOTTOM\">\u00a0<\/td>\n<td valign=\"BOTTOM\">L1<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller1<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller2<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller3<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">\u22ef<\/td>\n<\/tr>\n<tr>\n<th valign=\"BOTTOM\">Actual stack:<\/th>\n<td valign=\"BOTTOM\">\u00a0<\/td>\n<td valign=\"BOTTOM\">L1<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller1<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller2<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller3<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">\u22ef<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>But instead of executing a <code>RET<\/code> instruction, you pop off the return address. This removes it from the actual stack, but doesn&#8217;t remove it from the return address predictor stack.<\/p>\n<table class=\"cp3\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th valign=\"BOTTOM\">Return address<br \/>\npredictor stack:<\/th>\n<td valign=\"BOTTOM\">\u00a0<\/td>\n<td valign=\"BOTTOM\">L1<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller1<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller2<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller3<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">\u22ef<\/td>\n<\/tr>\n<tr>\n<th valign=\"BOTTOM\">Actual stack:<\/th>\n<td valign=\"BOTTOM\">\u00a0<\/td>\n<td valign=\"BOTTOM\">caller1<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller2<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller3<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller4<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">\u22ef<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>I think you can see where this is going.<\/p>\n<p>Eventually your function returns. The processor decodes your <code>RET<\/code> instruction and looks at the return address predictor stack and says, &#8220;My predictor stack says that this <code>RET<\/code> is going to return to <code>L1<\/code>. I will begin speculatively executing there.&#8221;<\/p>\n<p>But oh no, the value on the top of the real stack isn&#8217;t <code>L1<\/code> at all. It&#8217;s <code>caller1<\/code>. The processor&#8217;s return address predictor predicted incorrectly, and it ended up wasting its time studying the wrong code!<\/p>\n<p>The effects of this bad guess don&#8217;t end there. After the <code>RET<\/code> instruction, the return address predictor stack looks like this:<\/p>\n<table class=\"cp3\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th valign=\"BOTTOM\">Return address<br \/>\npredictor stack:<\/th>\n<td valign=\"BOTTOM\">\u00a0<\/td>\n<td valign=\"BOTTOM\">caller1<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller2<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller3<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">\u22ef<\/td>\n<\/tr>\n<tr>\n<th valign=\"BOTTOM\">Actual stack:<\/th>\n<td valign=\"BOTTOM\">\u00a0<\/td>\n<td valign=\"BOTTOM\">caller2<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller3<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">caller4<\/td>\n<td valign=\"BOTTOM\">\u2192<\/td>\n<td valign=\"BOTTOM\">\u22ef<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Eventually your caller returns. Again, the processor consults its return address predictor stack and speculatively executes at <code>caller1<\/code>. But that&#8217;s not where you&#8217;re returning to. You&#8217;re really returning to <code>caller2<\/code>.<\/p>\n<p>And so on. By mismatching the <code>CALL<\/code> and <code>RET<\/code> instructions, you managed to cause every single return address prediction on the stack to be wrong. Notice in the diagram that, in the absence of somebody playing games with the return address predictor stack of the type that created the problem initially, <strong>not a single prediction on the return address predictor stack will be correct<\/strong>. None of the predicted return addresses match up with actual return addresses.<\/p>\n<p>Your peephole optimization has proven to be shortsighted.<\/p>\n<p>Some processors expose this predictor more explicitly. The Alpha AXP, for example, has several types of control flow instructions, all of which have the same logical effect, but which hint to the processor how it should maintain its internal predictor stack. For example, the <code>BR<\/code> instruction says, &#8220;Jump to this address, but do not push the old address onto the predictor stack.&#8221; On the other hand, the <code>JSR<\/code> instruction says, &#8220;Jump to this address, and push the old address onto the predictor stack.&#8221; There is also a <code>RET<\/code> instruction that says, &#8220;Jump to this address, and pop an address from the predictor stack.&#8221; (There&#8217;s also a fourth type that isn&#8217;t used much.)<\/p>\n<p>Moral of the story: Just because something looks better doesn&#8217;t mean that it necessarily <strong>is<\/strong> better.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Things you think would be faster often aren&#8217;t.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[26],"class_list":["post-36973","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-other"],"acf":[],"blog_post_summary":"<p>Things you think would be faster often aren&#8217;t.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/36973","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=36973"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/36973\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=36973"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=36973"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=36973"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}