{"id":103121,"date":"2019-11-21T07:00:00","date_gmt":"2019-11-21T15:00:00","guid":{"rendered":"http:\/\/devblogs.microsoft.com\/oldnewthing\/?p=103121"},"modified":"2019-11-21T06:28:37","modified_gmt":"2019-11-21T14:28:37","slug":"20191121-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20191121-00\/?p=103121","title":{"rendered":"Why does the Alpha AXP predict a coroutine transfer the way it does?"},"content":{"rendered":"<p>I noted some time ago that <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20170811-00\/?p=96805\"> the Alpha AXP has a dedicated <i>branch to coroutine<\/i> instruction<\/a>. The behavior of this instruction is to branch to the address held in the specified register, but the interesting part is how the instruction is predicted: The processor predicts a branch to the address at the top of the return address predictor stack, and the value at the top of the return address predictor stack is replaced by the address of the instruction that follows the <code>JSR_CO<\/code>.<\/p>\n<p>Why does the processor use this algorithm? Why doesn&#8217;t it just push the address of the following instruction onto the return address predictor stack?<\/p>\n<p>Let&#8217;s illustrate with an example. Note that the entire issue of <code>JSR_CO<\/code> applies only to stackful coroutines. Stackless coroutines use normal <code>JSR<\/code> subroutine calls.<\/p>\n<p>Consider two coroutines that call each other. On entry to the first coroutine, the return address predictor stack contains the return address predictions for the code that will run once the first coroutine returns.<\/p>\n<p>\u2026 <span style=\"border: solid 1px black; padding: 1ex;\">X2<\/span> <span style=\"border: solid 1px black; padding: 1ex;\">X1<\/span><\/p>\n<p>The first coroutine does some stuff and then calls the second coroutine. The processor incorrectly predicts a transfer to <code>X1<\/code>. This initial misprediction is unavoidable because there was no opportunity to put it on the predictor stack in the first place. The processor then puts the first coroutine&#8217;s resumption address onto the top of the predictor stack, replacing <code>X1<\/code>.<\/p>\n<p>\u2026 <span style=\"border: solid 1px black; padding: 1ex;\">X2<\/span> <span style=\"border: solid 1px black; padding: 1ex;\">C1<\/span><\/p>\n<p>The second coroutine runs for a while, and then transfers control back to the first coroutine. This time, the processor correctly predicts a transfer to <code>C1<\/code>. It then puts the second coroutine&#8217;s resumption address onto the return address predictor stack, replacing <code>C1<\/code>.<\/p>\n<p>\u2026 <span style=\"border: solid 1px black; padding: 1ex;\">X2<\/span> <span style=\"border: solid 1px black; padding: 1ex;\">C2<\/span><\/p>\n<p>The first coroutine does some stuff, say it calls a helper function that returns. The call is a normal <code>JSR<\/code>, so it pushes <code>C1b<\/code> onto the return address predictor stack, and the helper function returns with a normal <code>RET<\/code>, which pops the correctly-predicted address off the stack. In general, normal non-coroutine calls are predicted in the usual way, and when control is in the first coroutine, the pushes and pops on the predictor stack cancel out, and the return address predictor stack still looks like this:<\/p>\n<p>\u2026 <span style=\"border: solid 1px black; padding: 1ex;\">X2<\/span> <span style=\"border: solid 1px black; padding: 1ex;\">C2<\/span><\/p>\n<p>The first coroutine now calls back into the second coroutine, which resumes where it left off, namely at <code>C2<\/code>, which also matches the predicted return address. The resumption location in the first coroutine replaces <code>C2<\/code>, leaving<\/p>\n<p>\u2026 <span style=\"border: solid 1px black; padding: 1ex;\">X2<\/span> <span style=\"border: solid 1px black; padding: 1ex;\">C1c<\/span><\/p>\n<p>And so on. Control transfers back and forth between the two coroutines, and their resumption locations take turns at the top of the predictor stack. Let&#8217;s say that the second coroutine is now finished, so it transfers control to the first coroutine for the last time.<\/p>\n<p>\u2026 <span style=\"border: solid 1px black; padding: 1ex;\">X2<\/span> <span style=\"border: solid 1px black; padding: 1ex;\">C2z<\/span><\/p>\n<p>Here, <code>C2z<\/code> is a resumption address for the second coroutine, but it will never be used since the coroutine is finished. (The processor doesn&#8217;t know that, so it puts it onto the return address predictor stack anyway.)<\/p>\n<p>The first coroutine now returns to its caller, which was <code>X1<\/code>, but that return address was lost to the predictor stack ages ago. The return is mispredicted.<\/p>\n<p>\u2026 <span style=\"border: solid 1px black; padding: 1ex;\">X2<\/span><\/p>\n<p>But the bad state lasts for only one frame. When the caller returns to its own caller, the return address <code>X2<\/code> will be right there at the top of the return address predictor stack, and things are back to normal.<\/p>\n<p>Now, since these are stackful coroutines, there&#8217;s no requirement that the transfer back to the first coroutine happen at the top level of the second coroutine. In the case where the second coroutine transferred back to the first coroutine from a subroutine, the return address predictor will not only mispredict the transfer back to the first coroutine, but it&#8217;ll also have the other return addresses, too, corresponding to activation frames still active in the second coroutine.<\/p>\n<p>\u2026 <span style=\"border: solid 1px black; padding: 1ex;\">X2<\/span> <span style=\"border: solid 1px black; padding: 1ex;\">C2c<\/span> <span style=\"border: solid 1px black; padding: 1ex;\">C3<\/span> <span style=\"border: solid 1px black; padding: 1ex;\">C4<\/span><\/p>\n<p>This prediction stack will pay off if the first coroutine transfers back to the second coroutine, since it will resume with a stack whose <code>C2<\/code> through <code>C4<\/code> entries exactly match the activation frames.<\/p>\n<p>In general, the algorithm for managing the return address predictor stack works well if a coroutine transfers out of its stack at the same frame that it transferred in. If it transfers out in a nested call, then the predictor won&#8217;t work, but there wasn&#8217;t much you could have done about them anyway.<\/p>\n<p>You&#8217;re also out of luck if control cycles among three or more coroutines. Again, there wasn&#8217;t much you could have done about this either.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Working out the possibilities.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-103121","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Working out the possibilities.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/103121","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=103121"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/103121\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=103121"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=103121"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=103121"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}