{"id":107954,"date":"2023-03-21T07:00:00","date_gmt":"2023-03-21T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=107954"},"modified":"2023-03-21T08:41:13","modified_gmt":"2023-03-21T15:41:13","slug":"20230321-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20230321-00\/?p=107954","title":{"rendered":"Why does the usage of the initial registers of a Win32 process depend on whether it is a 32-bit or 64-bit process?"},"content":{"rendered":"<p>Someone noticed that when you create a process suspended and snoop at its registers, the results vary depending on whether it is a 32-bit or 64-bit process.<\/p>\n<p>For a 32-bit process, the initial register state puts something in <code>eax<\/code> and something else in <code>ebx<\/code>.<\/p>\n<p>For a 64-bit process, the initial register state puts something in <code>rcx<\/code> and something else in <code>rdx<\/code>.<\/p>\n<p>Why do 32-bit and 64-bit processes use different registers to pass the initial state? Either make the 32-bit initial state use <code>ecx<\/code> and <code>edx<\/code>, or make the 64-bit initial state use <code>rax<\/code> and <code>rbx<\/code>. This appears to be an intentional divergence. What&#8217;s the reason for it?<\/p>\n<p>First of all, note that all of what I&#8217;m writing here is internal implementation detail that can change at any time. I&#8217;m discussing it to satisfy your curiosity, not to provide information that you can rely on.<\/p>\n<p>Okay, so back to the question. Why do 32-bit and 64-bit processes disagree?<\/p>\n<p>Well, really, the question is &#8220;What makes you think they should agree?&#8221;<\/p>\n<p>I sort of hid an assumption in the question. Did you spot it?<\/p>\n<p>The customer is asking not about 32-bit and 64-bit Windows, but about the x86-32 and x86-64 processor architectures. The question is based on a limited understanding of the world of CPUs. &#8220;<a title=\"We got both kinds. 
We got Country *and* Western.\" href=\"https:\/\/www.youtube.com\/watch?v=cSZfUnCK5qk\">We got both kinds. We got x86-32 <i>and<\/i> x86-64<\/a>!&#8221;<\/p>\n<p>Windows has supported many 32-bit processor architectures, and I&#8217;ve covered many of them in the past: <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190120-00\/?p=100745\"> x86-32<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20170807-00\/?p=96766\"> Alpha AXP<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20180402-00\/?p=98415\"> MIPS III<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20180806-00\/?p=99425\"> PowerPC<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190805-00\/?p=102749\"> SuperH-3<\/a>, and <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210531-00\/?p=105265\"> ARM<\/a>. It also has supported a number of 64-bit processor architectures, including <a href=\"https:\/\/docs.microsoft.com\/en-us\/previous-versions\/technet-magazine\/cc718978(v=msdn.10)\"> Alpha AXP<\/a> (using all 64 bits this time), <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20150727-00\/?p=90821\"> Itanium<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220831-00\/?p=107077\"> x86-64<\/a>, and <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220726-00\/?p=106898\"> AArch64<\/a>.<\/p>\n<p>All of these architectures are different, and there&#8217;s no <i>a priori<\/i> expectation that any two of them match up in register usage in any particular way.<\/p>\n<p>Here&#8217;s a comparison of calling conventions, with a lot of details omitted.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>32-bit architectures<\/th>\n<th>x86-32<\/th>\n<th>Alpha AXP<\/th>\n<th>MIPS 
III<\/th>\n<th>PowerPC<\/th>\n<th>SuperH-3<\/th>\n<th>ARM<\/th>\n<\/tr>\n<tr>\n<td>iarg1<\/td>\n<td>[<var>esp<\/var>+4]<\/td>\n<td><var>a0<\/var><\/td>\n<td><var>a0<\/var><\/td>\n<td><var>r3<\/var><\/td>\n<td><var>r4<\/var><\/td>\n<td><var>a1<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg2<\/td>\n<td>[<var>esp<\/var>+8]<\/td>\n<td><var>a1<\/var><\/td>\n<td><var>a1<\/var><\/td>\n<td><var>r4<\/var><\/td>\n<td><var>r5<\/var><\/td>\n<td><var>a2<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg3<\/td>\n<td>[<var>esp<\/var>+12]<\/td>\n<td><var>a2<\/var><\/td>\n<td><var>a2<\/var><\/td>\n<td><var>r5<\/var><\/td>\n<td><var>r6<\/var><\/td>\n<td><var>a3<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg4<\/td>\n<td>[<var>esp<\/var>+16]<\/td>\n<td><var>a3<\/var><\/td>\n<td><var>a3<\/var><\/td>\n<td><var>r6<\/var><\/td>\n<td><var>r7<\/var><\/td>\n<td><var>a4<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg5<\/td>\n<td>[<var>esp<\/var>+20]<\/td>\n<td><var>a4<\/var><\/td>\n<td>16(<var>sp<\/var>)<\/td>\n<td><var>r7<\/var><\/td>\n<td>@(16, <var>r15<\/var>)<\/td>\n<td>[<var>sp<\/var>, #0]<\/td>\n<\/tr>\n<tr>\n<td>iarg6<\/td>\n<td>[<var>esp<\/var>+24]<\/td>\n<td><var>a5<\/var><\/td>\n<td>20(<var>sp<\/var>)<\/td>\n<td><var>r8<\/var><\/td>\n<td>@(20, <var>r15<\/var>)<\/td>\n<td>[<var>sp<\/var>, #4]<\/td>\n<\/tr>\n<tr>\n<td>iarg7<\/td>\n<td>[<var>esp<\/var>+28]<\/td>\n<td>0(<var>sp<\/var>)<\/td>\n<td>24(<var>sp<\/var>)<\/td>\n<td><var>r9<\/var><\/td>\n<td>@(24, <var>r15<\/var>)<\/td>\n<td>[<var>sp<\/var>, #8]<\/td>\n<\/tr>\n<tr>\n<td>iarg8<\/td>\n<td>[<var>esp<\/var>+32]<\/td>\n<td>8(<var>sp<\/var>)<\/td>\n<td>28(<var>sp<\/var>)<\/td>\n<td><var>r10<\/var><\/td>\n<td>@(28, <var>r15<\/var>)<\/td>\n<td>[<var>sp<\/var>, #12]<\/td>\n<\/tr>\n<tr>\n<td>iarg9<\/td>\n<td>[<var>esp<\/var>+36]<\/td>\n<td>16(<var>sp<\/var>)<\/td>\n<td>32(<var>sp<\/var>)<\/td>\n<td>32(<var>r1<\/var>)<\/td>\n<td>@(32, <var>r15<\/var>)<\/td>\n<td>[<var>sp<\/var>, #16]<\/td>\n<\/tr>\n<tr>\n<td>fpargs<\/td>\n<td>on 
stack<\/td>\n<td><var>f16<\/var>\u2026<var>f21<\/var><\/td>\n<td><var>f12<\/var>\u2026<var>f15<\/var><\/td>\n<td><var>f1<\/var>\u2026<var>f13<\/var><\/td>\n<td><var>fr4<\/var>\u2026<var>fr7<\/var><\/td>\n<td><var>d0<\/var>\u2026<var>d7<\/var><\/td>\n<\/tr>\n<tr>\n<td>iret<\/td>\n<td><var>eax<\/var>, <var>edx<\/var><\/td>\n<td><var>v0<\/var><\/td>\n<td><var>v0<\/var>, <var>v1<\/var><\/td>\n<td><var>r3<\/var><\/td>\n<td><var>r0<\/var><\/td>\n<td><var>a1<\/var>, <var>a2<\/var><\/td>\n<\/tr>\n<tr>\n<td>fpret<\/td>\n<td><var>st(0)<\/var>, <var>st(1)<\/var><\/td>\n<td><var>f0<\/var>, <var>f1<\/var><\/td>\n<td><var>f0<\/var>\/<var>f1<\/var>, <var>f2<\/var>\/<var>f3<\/var><\/td>\n<td><var>f1<\/var><\/td>\n<td><var>fr0<\/var><\/td>\n<td><var>d0<\/var>, <var>d1<\/var><\/td>\n<\/tr>\n<tr>\n<td>home space?<\/td>\n<td>no<\/td>\n<td>no<\/td>\n<td>yes<\/td>\n<td>yes<\/td>\n<td>yes<\/td>\n<td>no<\/td>\n<\/tr>\n<tr>\n<td>i\/fp separate alloc<\/td>\n<td>no<\/td>\n<td>yes<\/td>\n<td>no<\/td>\n<td>yes<\/td>\n<td>sort-of<\/td>\n<td>yes<\/td>\n<\/tr>\n<tr>\n<td>reuse partial fp regs<\/td>\n<td>no<\/td>\n<td>no<\/td>\n<td>no<\/td>\n<td>no<\/td>\n<td>yes<\/td>\n<td>yes<\/td>\n<\/tr>\n<tr>\n<td>stack alignment<\/td>\n<td>4<\/td>\n<td>16<\/td>\n<td>8<\/td>\n<td>8<\/td>\n<td>4<\/td>\n<td>8<\/td>\n<\/tr>\n<tr>\n<td>red zone<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>232<\/td>\n<td>0<\/td>\n<td>8<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And for 64-bit architectures:<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>64-bit architectures<\/th>\n<th>Alpha 
AXP<\/th>\n<th>Itanium<\/th>\n<th>x86-64<\/th>\n<th>AArch64<\/th>\n<\/tr>\n<tr>\n<td>iarg1<\/td>\n<td><var>a0<\/var><\/td>\n<td><var>r32<\/var><\/td>\n<td><var>rcx<\/var><\/td>\n<td><var>x0<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg2<\/td>\n<td><var>a1<\/var><\/td>\n<td><var>r33<\/var><\/td>\n<td><var>rdx<\/var><\/td>\n<td><var>x1<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg3<\/td>\n<td><var>a2<\/var><\/td>\n<td><var>r34<\/var><\/td>\n<td><var>r8<\/var><\/td>\n<td><var>x2<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg4<\/td>\n<td><var>a3<\/var><\/td>\n<td><var>r35<\/var><\/td>\n<td><var>r9<\/var><\/td>\n<td><var>x3<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg5<\/td>\n<td><var>a4<\/var><\/td>\n<td><var>r36<\/var><\/td>\n<td>[<var>rsp<\/var>+32]<\/td>\n<td><var>x4<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg6<\/td>\n<td><var>a5<\/var><\/td>\n<td><var>r37<\/var><\/td>\n<td>[<var>rsp<\/var>+40]<\/td>\n<td><var>x5<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg7<\/td>\n<td>0(<var>sp<\/var>)<\/td>\n<td><var>r38<\/var><\/td>\n<td>[<var>rsp<\/var>+48]<\/td>\n<td><var>x6<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg8<\/td>\n<td>8(<var>sp<\/var>)<\/td>\n<td><var>r39<\/var><\/td>\n<td>[<var>rsp<\/var>+56]<\/td>\n<td><var>x7<\/var><\/td>\n<\/tr>\n<tr>\n<td>iarg9<\/td>\n<td>16(<var>sp<\/var>)<\/td>\n<td>[<var>sp<\/var>]<\/td>\n<td>[<var>rsp<\/var>+64]<\/td>\n<td>[<var>sp<\/var>, #0]<\/td>\n<\/tr>\n<tr>\n<td>fpargs<\/td>\n<td><var>f16<\/var>\u2026<var>f21<\/var><\/td>\n<td><var>f32<\/var>\u2026<var>f39<\/var><\/td>\n<td><var>xmm0<\/var>\u2026<var>xmm3<\/var><\/td>\n<td><var>v0<\/var>\u2026<var>v7<\/var><\/td>\n<\/tr>\n<tr>\n<td>iret<\/td>\n<td><var>v0<\/var><\/td>\n<td><var>ret0<\/var>\u2026<var>ret3<\/var><\/td>\n<td><var>rax<\/var><\/td>\n<td><var>x0<\/var><\/td>\n<\/tr>\n<tr>\n<td>fpret<\/td>\n<td><var>f0<\/var>, <var>f1<\/var><\/td>\n<td><var>f8<\/var><\/td>\n<td><var>xmm0<\/var><\/td>\n<td><var>d0<\/var>, <var>d1<\/var><\/td>\n<\/tr>\n<tr>\n<td>home 
space?<\/td>\n<td>no<\/td>\n<td>no<\/td>\n<td>yes<\/td>\n<td>no<\/td>\n<\/tr>\n<tr>\n<td>i\/fp separate alloc<\/td>\n<td>yes<\/td>\n<td>yes<\/td>\n<td>no<\/td>\n<td>yes<\/td>\n<\/tr>\n<tr>\n<td>reuse partial fp regs<\/td>\n<td>no<\/td>\n<td>no<\/td>\n<td>no<\/td>\n<td>no<\/td>\n<\/tr>\n<tr>\n<td>stack alignment<\/td>\n<td>16<\/td>\n<td>16<\/td>\n<td>16<\/td>\n<td>16<\/td>\n<\/tr>\n<tr>\n<td>red zone<\/td>\n<td>0<\/td>\n<td><a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20040113-00\/?p=41073\">\u221216<\/a><\/td>\n<td>0<\/td>\n<td>16<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Certainly you don&#8217;t expect all of these processors to agree on what registers to use. They don&#8217;t even all have the same registers to begin with!<\/p>\n<p>Okay, so maybe the question is &#8220;Yes, I know that Windows supports more than just x86-32 and x86-64, but the two architectures are clearly descended from each other, so why are things so different between them?&#8221;<\/p>\n<p>Well, why should they be the same? After all, x86-32 descended from 8086, but it&#8217;s not like we are still using the 8086 calling convention in x86-64 code. With a newer processor, we can take advantage of newer features, and that means we can <a title=\"The x86-64 processor (aka amd64, x64): Whirlwind tour\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220831-00\/?p=107077\"> re-optimize the calling convention to take advantage of them<\/a>: Use the SSE registers for floating point instead of the legacy <var>st(n)<\/var> registers. Use compile-time exception handling tables instead of run-time stack threading. Increase the stack alignment requirements to be SSE-friendly. 
Pass parameters in registers rather than on the stack, now that we are no longer under severe register pressure.<\/p>\n<p>We also see changes when moving from 32-bit ARM to 64-bit AArch64: The number of register-based parameters increases from four to eight, the partial floating point register backfill was dropped, the stack alignment became stricter, and the red zone expanded.<\/p>\n<p>I mean, clearly you have to change <i>something<\/i> because the 64-bit registers are bigger than the 32-bit registers. At a minimum, you&#8217;ll have to expand the register sizes. And once you decide to expand the register sizes, you&#8217;ve committed to the cost of change, so you may as well get your money&#8217;s worth.<\/p>\n<p>One thing you might notice is that the 32-bit and 64-bit Alpha AXP calling conventions are identical. What happened here? Was the 32-bit calling convention so perfect that nothing had to be improved for the 64-bit convention? What about the whole &#8220;expanding registers from 32-bit to 64-bit requires a change at least for the new register sizes&#8221;?<\/p>\n<p>Recall that the Alpha AXP was always a 64-bit processor. There was no 32-bit Alpha AXP processor. The &#8220;32-bit&#8221; Alpha AXP calling convention was developed with a 64-bit processor in hand, just with the understanding that pointers are only 32 bits in size, for now. (Though you could <a title=\"Footnotes in Win32 history: VLM (Very Large Memory) support\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20070801-00\/?p=25763\"> ask for memory in the parts of the address space that require the use of 64-bit pointers<\/a>. It would then be on you to figure out how to <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20170823-00\/?p=96875\"> cajole the compiler into using 64-bit pointers<\/a>.)<\/p>\n<p>When the 32-bit ABI for Alpha AXP was invented, the processor already had 64-bit registers and full support for 64-bit operations. 
It&#8217;s just that 32-bit Windows voluntarily restricted itself to 32 bits of address space. When designing the calling convention, the ABI designers made parameters 64-bit values, even if only the lower 32 bits were significant in practice. Everything was carefully designed so that the 32-bit calling convention could be repurposed as a 64-bit calling convention without any changes.\u00b9 (This came in handy when the Alpha AXP was used as <a href=\"https:\/\/docs.microsoft.com\/en-us\/previous-versions\/technet-magazine\/cc718978(v=msdn.10)\"> a proof-of-concept hardware platform for 64-bit Windows<\/a>, since it avoided having to change large portions of the compiler.) In other words, the 32-bit ABI for Alpha AXP was invented with the power of clairvoyance: They knew what the 64-bit processor was going to look like, and they could design the 32-bit ABI to be identical to the 64-bit one.<\/p>\n<p>One thing you may notice is that all of the calling conventions pass parameters in registers except for one: x86-32. Once again, <a title=\"The x86 architecture is the weirdo, part 2\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220418-00\/?p=106489\"> the x86 is the weirdo<\/a>. The internal kernel infrastructure for creating a process lets you specify an initial register state and an initial instruction pointer, but it doesn&#8217;t let you describe the contents of the stack. This means that all parameters must be passed in registers. This is straightforward for the register-based calling conventions, since they can just put the parameters directly in the initial register state. 
But for x86-32, that doesn&#8217;t work.<\/p>\n<p>What happens on x86-32 is that the kernel puts the parameters in some dummy registers, and then sets the initial instruction pointer not to the start of the process but rather to a helper function written in assembly language that takes those values from registers, pushes them onto the stack (to conform with the x86-32 calling convention), and then calls the <i>real<\/i> process start function.<\/p>\n<p>The registers to use for this &#8220;custom calling convention&#8221; are completely arbitrary, and the kernel folks chose <var>eax<\/var> and <var>ebx<\/var> out of alphabetical convenience.\u00b2 This choice was made over a decade before the x86-64 convention was invented, so there was nothing to be compatible with.<\/p>\n<p>So that&#8217;s the reason why the x86-32 and x86-64 architectures disagree on how to pass the initial parameters to the process. There was no reason why they had to agree in the first place. The 32-bit version picked two registers arbitrarily, and those didn&#8217;t happen to correspond in an attractive way with the x86-64 calling convention that would come later.\u00b3<\/p>\n<p>\u00b9 In theory, this would even let 32-bit and 64-bit Alpha AXP code coexist within a process, since they could just call into each other without having to do any calling convention thunking. The 64-bit code would have to be careful to pass only pointers to memory addressable with 32-bit pointers.<\/p>\n<p>\u00b2 What the kernel folks could have done was declare the process start function as using <a href=\"https:\/\/docs.microsoft.com\/en-us\/cpp\/cpp\/fastcall?view=msvc-170\"> the <code>__fastcall<\/code> calling convention<\/a>, which takes the first two parameters in <code>ecx<\/code> and <code>edx<\/code>. 
That would have avoided having to write the little helper function.<\/p>\n<p>\u00b3 I guess you could turn the question around and ask &#8220;Why doesn&#8217;t the x86-64 calling convention use <code>rax<\/code> and <code>rbx<\/code> for the first two register parameters, so it would align in an attractive manner with the custom calling convention used by this one specific corner of the kernel?&#8221; That was a one-off dark corner of the kernel that uses a custom calling convention that only a handful of people even know about, so there&#8217;s no reason that the people who designed the x86-64 calling convention even knew about it, much less possessed any desire to align with it in an attractive manner. And even if they knew about it, there&#8217;s really no need to accommodate it when designing a calling convention for general-purpose computing. There are probably quite a few one-off custom calling conventions scattered around the system. The one used by the kernel for starting new processes isn&#8217;t particularly prominent. 
(Indeed, it&#8217;s so deeply buried that it&#8217;s probably one of the most <i>obscure<\/i> ones.)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Well, I mean, it&#8217;s a different processor.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[26],"class_list":["post-107954","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-other"],"acf":[],"blog_post_summary":"<p>Well, I mean, it&#8217;s a different processor.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/107954","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=107954"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/107954\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=107954"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=107954"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=107954"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}