November 21st, 2019

Why does the Alpha AXP predict a coroutine transfer the way it does?

I noted some time ago that the Alpha AXP has a dedicated branch to coroutine instruction. The behavior of this instruction is to branch to the address held in the specified register, but the interesting part is how the instruction is predicted: The processor predicts a branch to the address at the top of the return address predictor stack, and the value at the top of the return address predictor stack is replaced by the address of the instruction that follows the JSR_CO.

Why does the processor use this algorithm? Why doesn’t it just push the address of the following instruction onto the return address predictor stack?

Let’s illustrate with an example. Note that the entire issue of JSR_CO applies only to stackful coroutines. Stackless coroutines use normal JSR subroutine calls.

Consider two coroutines that call each other. On entry to the first coroutine, the return address predictor stack contains the return address predictions for the code that will run once the first coroutine returns.

… X2 X1

The first coroutine does some stuff and then calls the second coroutine. The processor incorrectly predicts a transfer to X1. This initial misprediction is unavoidable because there was no opportunity to put it on the predictor stack in the first place. The processor then puts the first coroutine’s resumption address onto the top of the predictor stack, replacing X1.

… X2 C1

The second coroutine runs for a while, and then transfers control back to the first coroutine. This time, the processor correctly predicts a transfer to C1. It then puts the second coroutine’s resumption address onto the return address predictor stack, replacing C1.

… X2 C2

The first coroutine does some stuff, say it calls a helper function that returns. The call is a normal JSR, so it pushes C1b onto the return address predictor stack, and the helper function returns with a normal RET, which pops the correctly-predicted address off the stack. In general, normal non-coroutine calls are predicted in the usual way, and when control is in the first coroutine, the pushes and pops on the predictor stack cancel out, and the return address predictor stack still looks like this:

… X2 C2

The first coroutine now calls back into the second coroutine, which resumes where it left off, namely at C2, which also matches the predicted return address. The resumption location in the first coroutine replaces C2, leaving

… X2 C1c

And so on. Control transfers back and forth between the two coroutines, and their resumption locations take turns at the top of the predictor stack. Let’s say that the second coroutine is now finished, so it transfers control to the first coroutine for the last time.

… X2 C2z

Here, C2z is a resumption address for the second coroutine, but it will never be used since the coroutine is finished. (The processor doesn’t know that, so it puts it onto the return address predictor stack anyway.)

The first coroutine now returns to its caller, which was X1, but that return address was lost to the predictor stack ages ago. The return is mispredicted.

… X2

But the bad state lasts for only one frame. When the caller returns to its own caller, the return address X2 will be right there at the top of the return address predictor stack, and things are back to normal.

Now, since these are stackful coroutines, there’s no requirement that the transfer back to the first coroutine happen at the top level of the second coroutine. In the case where the second coroutine transferred back to the first coroutine from a subroutine, the return address predictor will not only mispredict the transfer back to the first coroutine, but it’ll also have the other return addresses, too, corresponding to activation frames still active in the second coroutine.

… X2 C2c C3 C4

This prediction stack will pay off if the first coroutine transfers back to the second coroutine, since it will resume with a stack whose C2 through C4 entries exactly match the activation frames.

In general, the algorithm for managing the return address predictor stack works well if a coroutine transfers out of its stack at the same frame that it transferred in. If it transfers out in a nested call, then the predictor won’t work, but there wasn’t much you could have done about them anyway.

You’re also out of luck if control cycles among three or more coroutines. Again, there wasn’t much you could have done about this either.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

3 comments

Discussion is closed. Login to edit/delete existing comments.

  • Neil Rashbrook

    So what you're saying is that they're optimising for the common case where two coroutines are calling each other. Each coroutine resumes at the point where it called the other coroutine, so when that coroutine transfers back, you want to predict the coroutine transfer instruction to transfer to to the address after the previous coroutine transfer instruction. One way of doing that would be to have a dedicated predictor register for that address, but the...

    Read more
  • MNGoldenEagle

    So I’m a bit confused: why does the instruction replace the head of the predictor stack on a coroutine call instead of pushing it?

    • Raymond ChenMicrosoft employee Author

      Work it out. If it pushed without a corresponding pop, then none of the X predictions would be correct. The coroutines would gradually consume the entire prediction stack.