The x86-64 processor (aka amd64, x64): Whirlwind tour
I figure I’d tidy up the processor overview series by covering the last¹ processor on my list of “processors Windows has supported in its history,” namely, the x86-64. Other names for this architecture are amd64 (because AMD invented it) and x64 (which is super-confusing because it doesn’t correspond with x86, a common nickname for the x86-32).
This is going to be a quick overview because the x86-64 is a natural extension of the i386, which we covered some time ago. I’ll just highlight the differences.
Each existing 32-bit general-purpose register has been extended from 32 bits to 64. The name of the 64-bit register is based on the name of the 32-bit register, but with the leading e changed to a leading r. Eight new 64-bit registers were introduced, bringing the total to 16. Instead of giving quirky names to the new registers, they are just numbered: r8 through r15. To match the existing classic registers, the new registers also have aliases for referring to partial registers, and partial register aliases were invented for some of the classic registers that lacked them.
|Register|Bits 31:0|Bits 15:0|Bits 15:8|Bits 7:0|
|rax|eax|ax|ah|al|
|rbx|ebx|bx|bh|bl|
|rcx|ecx|cx|ch|cl|
|rdx|edx|dx|dh|dl|
|rsi|esi|si||sil|
|rdi|edi|di||dil|
|rbp|ebp|bp||bpl|
|rsp|esp|sp||spl|
|r8 … r15|r8d … r15d|r8w … r15w||r8b … r15b|
The eip and eflags registers are correspondingly expanded to 64-bit registers rip and rflags.
Additional restrictions have been imposed on the use of the ah, bh, ch, and dh registers. The details aren’t important for reading code, so I won’t bother digging into them.
Windows requires that the stack be 16-byte aligned at function call boundaries, and there is no red zone. Calling a function pushes the 8-byte return address onto the stack, so on entry to a function, the stack is misaligned. Functions typically realign the stack in their prologue.
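As a sketch of what this looks like (the 28h allocation is illustrative, not taken from any particular binary): the call instruction pushes the 8-byte return address, leaving rsp at 8 mod 16, so subtracting an odd multiple of 8 restores 16-byte alignment.

sub rsp, 28h ; realign: 8 (return address) + 28h = 30h, a multiple of 16
... ; body of the function
add rsp, 28h ; restore the stack pointer
ret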
The old 8087-based floating point registers are not used.² Instead, the SIMD XMM registers are used for floating point calculations. These registers are 128 bits wide and can be viewed as four single-precision floating point values or as two double-precision floating point values. When used to pass parameters or return floating point values, only the bottom lane is used.
Eight more XMM registers have been added, bringing the total to 16.
|Register|Preserved?|Role|
|xmm0|No|Parameter 1 and return value|
|xmm1|No|Parameter 2|
|xmm2|No|Parameter 3|
|xmm3|No|Parameter 4|
|xmm4, xmm5|No||
|xmm6 … xmm15|Yes||
The calling convention is register-based for the first four parameters, with remaining parameters on the stack. In practice, the stack-based parameters are not push‘d, but rather the values are mov‘d into the preallocated stack space.
For register-based parameters, integer parameters go into the general-purpose registers and floating point parameters go into the floating point registers. When a register is used to hold a parameter, its counterpart register goes unused. For example, a function that takes an integer and a double will pass the integer in rcx and the double in xmm1.
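For example, a hypothetical function f(int a, double b, int c, double d) would receive its parameters like this (the function, values, and data labels are invented for illustration):

mov ecx, 1 ; parameter 1 (integer) goes in rcx; xmm0 goes unused
movsd xmm1, qword ptr [two] ; parameter 2 (double) goes in xmm1; rdx goes unused
mov r8d, 3 ; parameter 3 (integer) goes in r8; xmm2 goes unused
movsd xmm3, qword ptr [four] ; parameter 4 (double) goes in xmm3; r9 goes unused
call f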
There are always 4 × 8 = 32 bytes of home space for the register-based parameters, even if the function has fewer than four formal parameters. (If this bothers you, then you can reinterpret the home space as a 32-byte red zone that resides above the return address.)
Integer return values up to 64 bits go into rax. Larger return values (such as structures that don’t fit in a register) are returned through a hidden pointer parameter supplied by the caller. Floating point return values are returned in xmm0.
The caller is responsible for cleaning the stack. In practice, the caller does not clean the stack after every call, but rather preallocates the stack space in the prologue, reuses the stack space for multiple calls, and then cleans it all up in the epilogue.
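A sketch of a typical function body under this convention (the function g, registers, and sizes are invented for illustration):

sub rsp, 38h ; prologue: preallocate home space plus a fifth-parameter slot
...
mov qword ptr [rsp+20h], r10 ; fifth parameter is mov'd into place, not push'd
call g ; first call
... ; later calls reuse the same stack space
call g
add rsp, 38h ; epilogue: clean it all up at once
ret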
Exception handling is done by unwind tables, not by threading exception handlers through the stack at runtime.
When a 32-bit partial register is the destination of an operation, the upper 32 bits are set to zero. For example, consider
add eax, ecx
On the 32-bit 80386, this adds the value of ecx to eax and puts the result back into eax. On x86-64, this performs the same calculation, but since the destination is the 32-bit partial register eax, the operation also zeroes out the upper 32 bits of rax.
Another way of looking at this is that writes to 32-bit partial registers are zero-extended to 64-bit values.³
Note, however, that operations on 16-bit and 8-bit partial registers leave the unused bits unchanged.
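To illustrate the difference (the values are invented):

mov rax, 1122334455667788h
mov eax, 0AABBCCDDh ; rax = 00000000AABBCCDDh (upper 32 bits zeroed)
mov ax, 0EEFFh ; rax = 00000000AABBEEFFh (bits 63:16 unchanged)
mov al, 11h ; rax = 00000000AABBEE11h (bits 63:8 unchanged)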
The 32-bit addressing modes carry over to 64-bit, with these exceptions:
- Absolute addressing mode is limited to signed 32-bit addresses.
- There is a new rip-relative addressing mode.
The offsets in the memory addressing modes are 32-bit signed values, for a reach of ±2GB.
The rip-relative addressing mode greatly reduces the number of fixups required to relocate a module. The enormous ±2GB reach means that any reasonably-sized module can use it to access all of its static data, be it a read-only table embedded in the code segment or read-write data in the data segment.
The disassembler automatically performs the necessary calculations to convert the rip-relative address to an absolute one at disassembly time, so you are unlikely even to realize that anything has changed.
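For example, an instruction encoded as a displacement from rip (the address shown here is invented) appears in the disassembly with the absolute address already computed:

mov eax, dword ptr [00007FF6C0DE1480h] ; actually encoded as mov eax, [rip+disp32]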
In general, immediates are capped at 32 bits. The exception is the mov reg, imm64 instruction, which accepts a full 64-bit immediate.
Segmentation is architecturally dead. The processor is always in flat mode. The fs and gs selectors have been repurposed as two additional registers that add an operating-system-defined value to the effective address.
mov rax, qword ptr gs:[rcx*8+1480h]
The base address assigned to the gs register is added to the effective address rcx * 8 + 0x1480, producing a final address that is the target of the memory operation.
Windows sets the gs register’s base address to a block of per-thread data. During context switches, the base address of the gs register is updated to point to the per-thread data of the incoming thread. The fs register has not yet been assigned a meaning and should not be used.⁴ The Windows ABI forbids modifying either of these segment registers.
Instruction set changes
Some rarely-used instructions have been removed, primarily the binary-coded decimal instructions.
New instructions for dealing with 64-bit registers:
movsxd r64, r32/m32 ; sign-extend 32-bit to 64-bit
There is no need for a zero-extend instruction because operations on 32-bit registers automatically zero-extend to 64-bit values, so if the value was the result of a calculation, you probably got the zero-extended value anyway. If you want to wipe out the top 32 bits of an existing 64-bit value, you could do
mov r32, r32 ; zero-extend 32-bit to 64-bit
This can result in some odd-looking instructions like
mov eax, eax ; zero-extend eax to rax
On its face, the instruction looks pointless, but we’re performing it for the zero-extending side effect.⁵
There are also specialized instructions for certain sign-extending scenarios:
cdqe ; sign-extend eax to rax
cqo ; sign-extend rax to rdx:rax
Lightweight leaf functions and exception handling
A lightweight leaf function is one which can perform all of its work using only non-preserved registers, the inbound parameter home space, and stack space occupied by stack-based inbound parameters (if any). Preserved registers and the stack pointer must remain unchanged for the entire lifetime of the function, and the return address must remain at the top of the stack.
The inability to move the stack pointer means that the stack pointer is not at a multiple of 16 for the lifetime of a lightweight leaf function.
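A minimal example of a lightweight leaf function (a hypothetical minimum function, not taken from any real binary):

; int my_min(int a, int b): uses only non-preserved registers, never moves rsp
my_min:
mov eax, ecx ; parameter 1
cmp eax, edx ; compare against parameter 2
cmovg eax, edx ; keep the smaller value
ret ; return address still at the top of the stack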
The x86-64 ABI abandons the stack-based exception handling model of its 32-bit older brother and joins the RISC crowd by using table-based exception handling. With the exception of lightweight leaf functions, all functions must declare unwind codes that allow the exception unwinder to restore registers from the stack and find the return address. Any function that does not have unwind codes is assumed to be a lightweight leaf function.
I’ll defer to the existing documentation (which I wrote).
Instructions that operate on the classic 32-bit or 8-bit registers tend to have the most compact encodings. Using any of the new registers (r8 through r15, or xmm8 through xmm15, or the new aliases sil, dil, spl or bpl) typically requires a one-byte prefix. An instruction that operates on word-sized data typically incurs an additional prefix byte in its encoding. And fancy addressing modes (involving scaling or multiple registers contributing to the effective address) also require yet another byte for the encoding.
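Some illustrative encodings (the specific instructions are arbitrary; the byte counts follow the rules just described):

mov eax, ecx ; 2 bytes: classic 32-bit registers
mov rax, rcx ; 3 bytes: REX.W prefix for 64-bit operands
mov r8d, ecx ; 3 bytes: REX prefix to reach a new register
mov ax, cx ; 3 bytes: operand-size prefix for word-sized data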
I’m not sure how aggressively the compiler allocates registers and chooses instructions which have compact encodings. It certainly didn’t stand out to me.
Bonus reading: x64 software conventions.
Bonus chatter: Now that I’ve exhausted my list of processors that Windows has supported over the years, I’ll have to start branching out into other processors. I’m open to suggestions. Though I probably won’t be as detailed as these processor overviews have been, since the original goal of these overviews was to give you enough information to get started debugging on Windows. For other processors, I’ll probably just focus on the one or two things that make them interesting, like SPARC register windows, or 68000’s separate data and address registers.
¹ Early versions of Windows CE allegedly supported the StrongARM and possibly even M32R and other architectures, but I can’t find any binaries for those versions, so I have nothing to investigate.
² They are still physically present and usable, but in practice, nobody uses them,⁶ and they are not part of the calling convention.
³ I strongly suspect this design decision was made to avoid introducing spurious register dependencies due to partial register operations.
⁴ On x86-32, the fs register is used to access the per-thread data. Why did Windows switch to using gs on x86-64? One theory is that there is a special instruction on x86-64 called SWAPGS that lets the kernel exchange the gs base address with another internal register. This instruction is used on transitions to and from user mode, so the kernel can quickly switch from user-mode thread data to kernel-mode thread data on entry and to switch it back on exit. No such courtesy instruction exists for the fs register. Another theory is that fs is reserved for the 32-bit emulation layer.
⁵ It also means that the x86-32 pun of interpreting xchg eax, eax as a nop does not work in x86-64. The self-exchange zeroes out the high 32 bits as a side effect. The Windows debugger doesn’t realize this, and if you ask it to assemble xchg eax, eax, it encodes it as 90, using the one-byte encoding of xchg eax, r32, unaware that this doesn’t work if the other register is also eax. The correct encoding of xchg eax, eax is 87 c0, using the larger two-byte encoding.
⁶ Apparently, gcc and clang do use them for the 80-bit floating point long double type.
Regarding suggestions for future processors to cover – I was anticipating this before the series ended and was going to suggest the 68000. It was an interesting architecture, and the first one I did assembly programming on. Arguably, it’s still fairly relevant today, with some active Amiga and Atari communities out there. As many would know, it was also the Macintosh’s first ISA.
Later members of the 68k family introduced some notable incompatibilities, such as the need to flush the I-cache after loading code into memory, exception stack frame differences and the MOVE from SR/MOVE from CCR instructions. The x86 family, on the other hand, seemed to place backwards compatibility higher on the priority list.
“I’ll have to start branching out into other processors. I’m open to suggestions.”
My childhood favorite was the really-different-for-its-time TMS 9900.
Those of us who know – for completely legitimate reasons, I swear – which instruction is represented by the e9 opcode (and, to a lesser extent, e8) definitely noticed the change from absolute to RIP-relative addressing.
I’d be very interested in SPARC if you had time for it
Wasn’t Windows NT originally developed on Intel 860, to make sure folks won’t sneak x86 specific hacks into the codebase? That would make it a Windows supported processor, won’t it?
I’d love a series specifically on SVE2–although a part of Armv9 (and the basic subset of AArch64 has been a prior series) it’s significantly interesting on its own.
68000 would be nice. I remember learning parts of 68000 ASM at one point and thinking that it was a far nicer and cleaner ISA than x86 with its nice flat memory space and no segmentation to worry about.
Imagine how different things might be if the engineers had gotten their way and the original PC had used the 68000…
The reason the 8086 had so many quirks is probably the same that led it to be used in the PC: compatibility with Intel’s 8080. The register and memory models of the 8086 allow you to take assembly or binary code from the 8080 (or even the Z80) and assemble or translate it into 8086 binary code. If you add an OS whose services are similar to those of CP/M (such as MS-DOS), you get instant compatibility with one of the biggest software libraries of the time, a big help when you are trying to launch a new computer. In particular, the segmented memory model allowed you to simulate a flat 64 KB memory map, just as on the 8080, by setting all four segment registers to the same address. Also, the way some non-orthogonal instructions worked just with BX, CX or DX mirrored the way some 8080 instructions were tied to the BC, DE or HL registers. Even SI and DI were similar to the Z80’s IX and IY index registers.
In short: had the 8086 been a clean design like the Motorola 68000, it probably wouldn’t have been used in the PC.
Segmentation in the 8086 also made it very easy to upgrade an 8080 embedded system to use more than 64 KiB of address space without a full rewrite (assuming you didn’t use self-modifying code): CS and ES can point directly to ROM, and DS and SS to RAM.
This lets you expand to 64 KiB of ROM (instead of whatever fitted around your RAM needs on the 8080) plus 64 KiB RAM with minimal software changes (mostly using ES instead of DS to access constants in ROM).
It was also not unknown to have a mix of “old” modules in 8080-converted code interfacing to “new” modules in 8086 native code that were called via a tiny stub in the 8080-converted code (change CS, call the new code, change CS back to the 8080 value), allowing a gradual migration from old to new.
As a 16 bit update of the 8080, the 8086 was a masterpiece of design – and it was never intended to be more than a holding pattern to keep customers on Intel CPUs while Intel perfected the iAPX 432 (but reality refused to conform to Intel’s plans).
> The legacy floating point registers are overlaid on top of the SIMD registers, and switching between legacy mode and SIMD mode requires the use of the very slow EMMS instruction.
This is true only for MMX (the legacy SIMD extensions), but not for SSE (the modern SIMD extensions, which are used in the calling convention). SSE registers (XMM0..XMM15) are completely separate from the 8087 floating-point unit. So technically you can use the 8087 floating-point registers in x64. No EMMS instruction is needed, but you will need to develop your own calling convention.
I wonder if this actually works. I can imagine the context switch not saving them.
FXSAVE and FXRSTOR
Of course they have to be saved otherwise they’d be useless.
Windows already saves them when doing a context switch. Most compilers don’t generate x87 code for x64, but it works and has higher precision (long double) than SSE.
si – source index
di – destination index
SPARC would be nice and nostalgic, having studied it in Uni. Same with the 68k, which we also wrote assembly for. The professor was kind enough to leave SPARC to just study material, and RISC programming was done on some theoretical model of a RISC CPU, or maybe an older MIPS, I can’t remember. Apparently when the SPARC trapped it just spilled its guts and left the OS to work out what to do or something “inefficient” like that (not really, just a design choice)?
For more modern processors RISC-V might be interesting (little-endian is still more efficient?). And for old school there would be the VAX (they stuffed a polynomial instruction into it?).
How about anything created by TI, whose products have always been in a category all of their own. No matter how you class a CPU family, by architecture or ISA type, the TI ones have always been “Other”. And I don’t mean early things like the TMS9900 which date back to a time when companies would try anything, I mean the TMS320 series which is still in use today, or the MSP430, or various other lesser-known ones, all of which require that you purge your mind of any idea how a normal CPU is supposed to work and instead enter an alternate reality where things are really… weird. Example: Jump instructions. Now if you’ve never encountered how TI do things you’d wonder how strange a simple jump instruction could get. Well the TI version has two variants, one of which stops execution for several cycles while it thinks about things and the other which keeps executing more instructions after the jump until it eventually catches up and jumps to where it’s supposed to go, a sort of combined jump and come-from instruction. So while MIPS and Sparc had an easy-enough-to-think-about single-entry delay slot, TI’s DSPs may or may not have multiple-entry delay slots.
And that’s just scratching the surface of the weirdness…
Early versions of Windows CE allegedly supported the StrongARM
Not allegedly. It did. I remember having PXA250 series processor EVK boards under early-access from Intel (2003 timeframe?) that booted up into Windows CE. We went with QNX on our design with those chips since it was for an embedded application with no screen or keyboard.
It would be nice to see an overview of RISC-V.
“In practice, nobody uses” the 8087 FPU registers? Maybe not Microsoft’s compilers but both GCC and Clang support the long double type which does use them. There certainly are applications that need more precision and/or range than are available from double. Only recently I wanted to estimate the total number of 2048-bit primes (a figure very relevant in cryptography) and the answer, about 1E613, is bigger than can be represented by a double.
Would love to see 6502 and Z80, especially with MS BASIC notes! I’m a sucker for nostalgia though.
MC68000 has a very nice instruction set compared to x86, but learning about some DSP processor such as Blackfin would be much more interesting as it offers a lot of unusual concepts…
How about CPU platforms for NT that never had a commercial release, like the i860, SPARC or Clipper? As the Alpha was always 64-bit, I presume that the unreleased 64-bit Windows on Alpha is much the same.
As you’ve dug up some quite niche binaries, there should probably be something for these architectures in vast Microsoft archives.
Besides the already noted wrong statement that SSE and x87 registers are shared, the statement that RDX holds the upper 64 bits of the return value is wrong too: MSVC (and thus Windows) does NOT support 128-bit integers!
Bonus comment: while code for the i386 that returns a compound type like _div_t (a structure built of 2 integers) uses EAX and EDX, code for the AMD64 that returns a structure holding a pair of 64-bit integers uses a hidden first pointer argument to this structure.
Unfortunately only partially: the second sentence of “Integer return values up to 64 bits go into rax If the return value is a 128-bit value, then the rdx register holds the upper 64 bits. Floating point return values are returned in xmm0.” is still wrong and doesn’t match the graphic — which also fails to match the third (and correct) sentence: XMM1 is not used for a second FP return value.