The post The AArch64 processor (aka arm64), part 13: Atomic access appeared first on The Old New Thing.
; load exclusive register byte
    ldxrb   Rd/zr, [Xn/sp]
; load exclusive register halfword
    ldxrh   Rd/zr, [Xn/sp]
; load exclusive register
    ldxr    Rd/zr, [Xn/sp]
; load exclusive register pair
    ldxp    Rd1/zr, Rd2/zr, [Xn/sp]
These instructions atomically load a byte, halfword, word, doubleword, or pair of registers from memory. The instruction also tells the processor to monitor the memory address to see if any other processor writes to that same address, or addresses in the same “exclusive reservation granule”. (Implementations are allowed to have granules as large as 2KB.)
Note that the atomicity guarantee is only partial if you use LDXP to load a pair of 64-bit registers.¹ The entire 128-bit value is not loaded atomically; instead, each 64-bit portion is loaded atomically separately. You can still get tearing between the two registers.
The only supported addressing mode is register indirect. No offsets or indexes allowed.
After an exclusive load, you can attempt to store a value back to the same address:
; store exclusive register byte
    stxrb   Rs/zr, Rt/zr, [Xn/sp]
; store exclusive register halfword
    stxrh   Rs/zr, Rt/zr, [Xn/sp]
; store exclusive register
    stxr    Rs/zr, Rt/zr, [Xn/sp]
; store exclusive register pair
    stxp    Rs/zr, Rt1/zr, Rt2/zr, [Xn/sp]
If the reservation obtained by the previous LDX instruction is still valid, then the value in Rt/zr is stored to memory, and Rs is set to 0. Otherwise, no store is performed, and Rs is set to 1.
Whether the store succeeds or fails, the STX instruction clears the reservation.
For these exclusive load and store instructions, the address must be a multiple of the number of bytes being loaded. If not, then the behavior is undefined: There is no requirement that an exception be raised.
So don’t do that.
It is also required that the STX match the LDX both in address and operand sizes. You cannot perform an LDX for one address and follow up with a STX to a different address. You also cannot perform a LDXR and follow up with a STXRH to the same address. You aren’t even allowed to do a LDXP with two 32-bit registers and follow up with a STXR with a single 64-bit register. Again, the behavior is undefined if you break this rule.
The last instruction allows you to hit the reset button:
; clear exclusive
    clrex
The CLREX instruction discards any active reservation and forces any subsequent STX to fail. This typically happens as part of interrupt handling or context switching, to ensure that undefined behavior doesn’t occur if the thread was interrupted while it was in the middle of a LDX/STX sequence.
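If the bookkeeping is hard to visualize, here is a toy single-processor model of the reservation. This is purely illustrative; the class and method names are made up and correspond to nothing in any real toolchain.

```python
# A toy model of the exclusive monitor (illustrative only).

class Processor:
    def __init__(self, mem):
        self.mem = mem           # shared memory: address -> value
        self.reservation = None  # address being monitored, if any

    def ldxr(self, addr):
        # load and begin monitoring the address
        self.reservation = addr
        return self.mem.get(addr, 0)

    def stxr(self, addr, value):
        # returns 0 on success, 1 on failure, like Rs
        if self.reservation != addr:
            return 1
        self.mem[addr] = value
        self.reservation = None  # success or failure clears it
        return 0

    def clrex(self):
        self.reservation = None

    def atomic_add(self, addr, delta):
        # the classic LDXR/STXR retry loop
        while True:
            old = self.ldxr(addr)
            if self.stxr(addr, old + delta) == 0:
                return old
```

A real monitor is also cleared when another processor writes to the same reservation granule; this sketch covers only the single-processor bookkeeping.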
These instructions are usually coupled with memory barriers, which we’ll look at soon, but the next entry will be a little diversion.
Bonus chatter: There is an optional instruction set extension (mandatory starting in version 8.4) which includes a large set of atomic read-modify-write operations.
; atomic read-modify-write operation
; Rt = previous value of [Xr]
; [Xr] = Rt op Rs
    ldadd   Rs/zr, Rt/zr, [Xr/sp]   ; add
    ldclr   Rs/zr, Rt/zr, [Xr/sp]   ; and not
    ldeor   Rs/zr, Rt/zr, [Xr/sp]   ; exclusive or
    ldset   Rs/zr, Rt/zr, [Xr/sp]   ; or
    ldsmax  Rs/zr, Rt/zr, [Xr/sp]   ; signed maximum
    ldsmin  Rs/zr, Rt/zr, [Xr/sp]   ; signed minimum
    ldumax  Rs/zr, Rt/zr, [Xr/sp]   ; unsigned maximum
    ldumin  Rs/zr, Rt/zr, [Xr/sp]   ; unsigned minimum
By default, there is no memory ordering. You can add the suffix a to load with acquire, the suffix l to store with release, or the suffix al to get both. Note, however, that the acquire suffix is ignored if the destination register Rt is zr.
Furthermore, you can suffix b for byte memory access or h for halfword memory access.
The overall syntax is therefore
Prefix   Op                                           Acquire      Release      Size
ld       add, clr, eor, set, smax, smin, umax, umin   (none), a    (none), l    (none), b, h
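The naming scheme is regular enough that you can decode it mechanically. Here is an illustrative sketch; the decode helper and its output format are made up for this discussion, not part of any assembler.

```python
# Decode the ld* atomic read-modify-write mnemonics described above.
# (Illustrative sketch, not an official assembler API.)

OPS = ["add", "clr", "eor", "set", "smax", "smin", "umax", "umin"]

def decode(mnemonic):
    assert mnemonic.startswith("ld")
    rest = mnemonic[2:]
    op = next(o for o in OPS if rest.startswith(o))  # operation
    rest = rest[len(op):]
    acquire = rest.startswith("a")                   # optional a
    if acquire:
        rest = rest[1:]
    release = rest.startswith("l")                   # optional l
    if release:
        rest = rest[1:]
    size = {"": "word/doubleword", "b": "byte", "h": "halfword"}[rest]
    return op, acquire, release, size
```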
For example, the instruction ldclrlh means:

  ld: atomic load/modify/store
  clr: clear bits
  l: release on store
  h: halfword size

If you don’t care about the previous value, then you can use a pseudoinstruction that uses zr as the destination.
; atomic read-modify-write operation
; [Xr] = [Xr] op Rs
    stadd   Rs/zr, [Xr/sp]   ; add
    stclr   Rs/zr, [Xr/sp]   ; and not
    steor   Rs/zr, [Xr/sp]   ; exclusive or
    stset   Rs/zr, [Xr/sp]   ; or
    stsmax  Rs/zr, [Xr/sp]   ; signed maximum
    stsmin  Rs/zr, [Xr/sp]   ; signed minimum
    stumax  Rs/zr, [Xr/sp]   ; unsigned maximum
    stumin  Rs/zr, [Xr/sp]   ; unsigned minimum
You can add the l suffix for store with release, and you can add b and h suffixes to operate on smaller sizes. You cannot request acquire on load for these instructions because the acquire is ignored due to the destination being zr.
The optional instruction set extension also provides for atomic exchanges:
; swap
; write Rs and return previous value in Rt (atomic)
    swp     Rs/zr, Rt/zr, [Xn/sp]   ; word or doubleword
    swpb    Ws/zr, Wt/zr, [Xn/sp]   ; byte
    swph    Ws/zr, Wt/zr, [Xn/sp]   ; halfword

; compare and swap
; if value is Rs, then write Rt; Rs receives previous value
; (atomic)
    cas     Rs/zr, Rt/zr, [Xn/sp]   ; word or doubleword
    casb    Ws/zr, Wt/zr, [Xn/sp]   ; byte
    cash    Ws/zr, Wt/zr, [Xn/sp]   ; halfword
    casp    Rs/zr, Rt/zr, [Xn/sp]   ; register pair
                                    ; Rs,R(s+1) and Rt,R(t+1)

; also a, l, and al versions for acquire/release semantics
The memory order modifiers go between the swp/cas prefix and the size suffix, except that they go after the p. So you have casab (compare and swap with acquire, byte size) but caspa (compare and swap pair with acquire).
As with the ld instructions, requests to acquire on load are ignored if the destination register is zr.
The memory operand must be writable, even if the comparison fails. If no value is stored, then any requested release semantics are ignored.
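The compare-and-swap behavior can be modeled like this (illustrative Python, with memory as a simple dictionary; not real hardware):

```python
# A model of cas: Rs holds the comparand and receives the previous
# value; Rt is stored only if the comparison succeeds.

def cas(mem, addr, rs, rt):
    # returns the new value of Rs (the previous memory value)
    previous = mem[addr]
    if previous == rs:
        mem[addr] = rt
    return previous
```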
Bonus reading: Atomics in AArch64.
¹ The load is required to be fully atomic starting with version 8.4 of the AArch64. On older processors, Windows uses CASP instead of LDXP/STXP.
The post The AArch64 processor (aka arm64), part 12: Memory access and alignment appeared first on The Old New Thing.
; load word or doubleword register
    ldr     Rn/zr, [...]
; load unsigned byte
    ldrb    Wn/zr, [...]
; load signed byte
    ldrsb   Rn/zr, [...]
; load unsigned halfword
    ldrh    Wn/zr, [...]
; load signed halfword
    ldrsh   Rn/zr, [...]
; load signed word
    ldrsw   Xn/zr, [...]
; load pair of registers
    ldp     Rd1/zr, Rd2/zr, [...]
; load pair of registers as signed word
    ldpsw   Xd1/zr, Xd2/zr, [...]
AArch64 does not have AArch32’s LDM instruction for loading up to 13 registers at once. As a consolation present, it gives you a LDP instruction for loading two registers, either 32-bit or 64-bit, from consecutive bytes of memory. (The first register uses the lower address.) The LDP instruction is commonly used with the 64-bit registers to load spilled registers from the stack.
There is a corresponding selection of instructions for storing to memory, but obviously the sign extension variations are not relevant.
; store word or doubleword register
    str     Rn/zr, [...]
; store byte
    strb    Wn/zr, [...]
; store halfword
    strh    Wn/zr, [...]
; store pair of registers
    stp     Rd1/zr, Rd2/zr, [...]
Not all addressing modes are available for all variations. This is not something you worry about when reading assembly language, but it’s something you need to keep in mind when writing it.
Size         [Xn/sp, #imm]   [Xn/sp, #imm]    [pc, #imm]   [Xn/sp, Rn/zr, extend]
             (−256 … +255)   [Xn/sp, #imm]!   (±1MB)
                             [Xn/sp], #imm

byte         •               •                             •
halfword     •               •                             •
word         •               •                loads only   •
doubleword   •               •                loads only   •
pair                         •
The reach of the second column is (0 … 4095) × size, except that the reach of the register pairs is (−64 … 63) × size.
All operand sizes support register indirect with offset. Only word and doubleword support PC-relative (and even those are supported only for loads). And register pairs support only register indirect with offset.
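As a back-of-the-envelope check of those reach numbers (helper names made up for illustration):

```python
# Reach of the scaled immediate offsets, computed from the
# operand size in bytes (a sanity check, not an encoder).

def unsigned_reach(size_bytes):
    # 12-bit unsigned element offset, scaled by operand size
    return (0, 4095 * size_bytes)

def pair_reach(size_bytes):
    # 7-bit signed element offset for register pairs
    return (-64 * size_bytes, 63 * size_bytes)
```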
There are some ambiguous encodings, because a constant offset in the range 0 … 255 that is a multiple of the operand size can be encoded either as a 9-bit signed byte offset, or as a 12-bit unsigned element offset. By default, assemblers will use the 12-bit unsigned element offset, but you can force the 9-bit signed byte offset by changing the opcode from LDxxx and STxxx to LDUxxx and STUxxx. The U stands for unscaled.
Windows enables automatic unaligned access fixups. Simple unaligned memory accesses are fixed up automatically by the processor, but you lose atomicity: It is possible for an unaligned memory access to read a torn value. Any such tearing is at the byte level.
Original value       12 34 56 78    (aligned)
Processor 1 reads       [34 56]     (misaligned)
Processor 2 writes   AB CD EF 01    (aligned)
The misaligned halfword read from processor 1 could produce 3456, 34EF, CD56, or CDEF. But it won’t produce 3DEF.
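You can enumerate the possible results of the tear, since each byte is read atomically and independently comes from either the old or the new value:

```python
# Simulating the byte-level tearing described above: the misaligned
# halfword covers the middle two bytes, and each byte is observed
# either before or after the other processor's write.

from itertools import product

old = [0x12, 0x34, 0x56, 0x78]
new = [0xAB, 0xCD, 0xEF, 0x01]

possible = set()
for b1_is_new, b2_is_new in product([False, True], repeat=2):
    b1 = new[1] if b1_is_new else old[1]
    b2 = new[2] if b2_is_new else old[2]
    possible.add((b1 << 8) | b2)
```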
You can still take alignment faults if the misaligned memory access is fancy, such as a locked load, store exclusive, or a load with a memory barrier. We’ll learn about these special memory accesses next time.
The post The AArch64 processor (aka arm64), part 11: Loading addresses appeared first on The Old New Thing.
The way to do this is with the help of the ADR and ADRP instructions.
; form pc-relative address to nearby label (±1MB)
; Xn = pc + offset
    adr     Xn, label
; form pc-relative address to nearby 4KB page (±4GB)
; Xn = (pc & ~4095) + (offset * 4096)
    adrp    Xn, label
The ADR instruction adds a 21-bit signed immediate to the current instruction’s address.
The ADRP instruction takes a 21-bit signed immediate, shifts it left 12 positions, and then adds it to the address of the starting byte of the page the current instruction is on. The result is the address of the starting byte of a page nearby.
Since modules are unlikely to be bigger than 4GB, the ±4GB reach should be enough to cover accesses to any global variables in the module from any code in that same module.
You can use ADRP with offset addressing to access any global variable in two instructions.
; first, set x0 to point to the start of the page
; that holds the global
    adrp    x0, global
; then use a positive offset to access the global
    ldr     w0, [x0, #PageOffset(global)]
All of the register offset addressing modes support offsets up to 4095 (most go even higher), so you are sure to be able to get it in two instructions.
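The two-instruction sequence can be sanity-checked with a little arithmetic model (the function names here are made up for illustration):

```python
# Model of the ADRP + offset address computation: split a target
# address into its 4KB page and page offset, then rebuild it the
# way the two-instruction sequence does.

PAGE = 4096

def adrp(pc, target):
    # ADRP materializes the page-to-page delta
    return (target & ~(PAGE - 1)) - (pc & ~(PAGE - 1))

def page_offset(target):
    return target & (PAGE - 1)

def materialize(pc, target):
    base = (pc & ~(PAGE - 1)) + adrp(pc, target)  # result of adrp
    return base + page_offset(target)             # the add/ldr offset
```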
If all you need is the address of a global variable, rather than its value, then you can use an ADD instruction:
; first, set x0 to point to the start of the page
; that holds the global
    adrp    x0, global
; then add the page offset to get the address of the global
    add     x0, x0, #PageOffset(global)
Both the unsigned offset in the [Rn, #imm] addressing mode as well as the unsigned immediate in the add instruction are 12 bits long, which is certainly-not-coincidentally exactly the number of bits in a page offset.
The Microsoft compiler goes to some effort to consolidate address calculations for global variables. For example, if it knows for sure that two global variables are laid out in memory a known distance apart, then it will use an ADRP to get the address of one, and then use that fixed offset from the first variable to get the second.
// global variables
int a, b;

// code generation for calling f(&a, b);

; x0 points to start of page containing variable "a"
    adrp    x0, a
; adjust to point directly at "a"
    add     x0, x0, #PageOffset(a)
; "b" is right next to "a", so load w1 via an offset from x0
    ldr     w1, [x0, #8]
; ready to call "f"
    bl      f
(We’ll learn more about the calling convention later.)
Note that this optimization is not available if a and b were declared extern, since the compiler doesn’t know anything about the layout of the memory in that case.
Using PC-relative addresses makes it easier to generate position-independent code. I don’t know whether Windows requires AArch64 code to be position-independent, but doing so reduces the number of fixups needed, so it’s still a good thing even if not required.
The post The AArch64 processor (aka arm64), part 10: Loading constants appeared first on The Old New Thing.
; move wide with zero
; Rd = imm16 << n
; n can be 0, 16, 32, or 48
    movz    Rd, #imm16, LSL #n
; move wide with not
; Rd = ~(imm16 << n)
; n can be 0, 16, 32, or 48
    movn    Rd, #imm16, LSL #n
; move wide with keep
; Rd[n+15:n] = imm16
    movk    Rd, #imm16, LSL #n
The MOVZ instruction loads a 16-bit unsigned value into one of the four lanes of a 64-bit destination, or one of the two lanes of a 32-bit destination. All the remaining lanes are set to zero.
The MOVN instruction does the same thing as MOVZ, except the whole thing is bitwise negated. (Be careful not to confuse MOVN with MVN.)
The MOVK instruction does the same thing as MOVZ, except that instead of setting the other lanes to zero, the other lanes are left unchanged.
Loading a 32-bit value can be done in two instructions by using MOVZ to load 16 bits into half of the register, then a MOVK into the other half.
    movz    r0, #0x1234           ; r0 = 0x00001234
    movk    r0, #0xABCD, LSL #16  ; r0 = 0xABCD1234
This technique can be extended to load a 64-bit value in four steps, but that’s getting quite unwieldy. The compiler is more likely to store the value in the code segment and use a PC-relative addressing mode to load it.
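The lane behavior is easy to model (illustrative Python, not assembler output):

```python
# Model of the MOVZ/MOVK lane behavior: 16-bit lanes within a
# 64-bit register, used to build a constant in four steps.

def movz(imm16, shift):
    # one lane set, all others zeroed
    return (imm16 & 0xFFFF) << shift

def movk(reg, imm16, shift):
    # replace one 16-bit lane, keep the others
    return (reg & ~(0xFFFF << shift)) | ((imm16 & 0xFFFF) << shift)

value = movz(0xDEF0, 0)
value = movk(value, 0x9ABC, 16)
value = movk(value, 0x5678, 32)
value = movk(value, 0x1234, 48)
```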
; special syntax for pc-relative loads
    ldr     x0, =0x123456789ABCDEF0 ; load 64-bit value
    ldr     w0, =0x12345678         ; load 32-bit value
As I noted in the discussion of addressing modes, the assembler and disassembler use this special equals-sign notation to represent a PC-relative load. It means that the value is stored in a literal pool in the code segment, and a PC-relative load is being used to fetch it. The assembler batches up all of these literals and emits them between functions. The PC-relative load has a reach of ±1MB, so you are unlikely to run into the problem that you had on AArch32, where the reach was only ±4KB, and you had to find a safe place to dump the literals in the middle of the function.
There are quite a number of instructions that generate constants, and if you use the MOV pseudoinstruction, the assembler will try to find one that works.
; load up a constant somehow
    mov     Rd, #imm
Instruction                      Used for
add  Rd, zr, #imm12              0x00000000`00000XXX
add  Rd, zr, #imm12, LSL #12     0x00000000`00XXX000
sub  Wd, wzr, #imm12             0x00000000`FFFFFXXX
sub  Wd, wzr, #imm12, LSL #12    0x00000000`FFXXX000
sub  Xd, xzr, #imm12             0xFFFFFFFF`FFFFFXXX
sub  Xd, xzr, #imm12, LSL #12    0xFFFFFFFF`FFXXX000
movz Rd, #imm16                  0x00000000`0000XXXX
movz Rd, #imm16, LSL #16         0x00000000`XXXX0000
movz Rd, #imm16, LSL #32         0x0000XXXX`00000000
movz Rd, #imm16, LSL #48         0xXXXX0000`00000000
movn Wd, #imm16                  0x00000000`FFFFXXXX
movn Wd, #imm16, LSL #16         0x00000000`XXXXFFFF
movn Xd, #imm16                  0xFFFFFFFF`FFFFXXXX
movn Xd, #imm16, LSL #16         0xFFFFFFFF`XXXXFFFF
movn Xd, #imm16, LSL #32         0xFFFFXXXX`FFFFFFFF
movn Xd, #imm16, LSL #48         0xXXXXFFFF`FFFFFFFF
orr  Xd, xzr, #imm               Value can be expressed as a bitwise operation constant
orr  Wd, wzr, #imm               Value can be expressed as the lower 32 bits of a bitwise operation constant
A common type of sort-of constant is the address of a global variable. It’s a constant whose value isn’t discovered until runtime. We’ll look at those next time.
The post The AArch64 processor (aka arm64), part 9: Sign and zero extension appeared first on The Old New Thing.
; unsigned extend byte
; Rd = (uint8_t)Rn
    uxtb    Rd/zr, Rn/zr        ; ubfx Rd, Rn, #0, #8
; unsigned extend halfword
; Rd = (uint16_t)Rn
    uxth    Rd/zr, Rn/zr        ; ubfx Rd, Rn, #0, #16
; signed extend byte
; Rd = (int8_t)Rn
    sxtb    Rd/zr, Rn/zr        ; sbfx Rd, Rn, #0, #8
; signed extend halfword
; Rd = (int16_t)Rn
    sxth    Rd/zr, Rn/zr        ; sbfx Rd, Rn, #0, #16
; unsigned extend word
    mov     Wd, Wn
; signed extend word
    sxtw    Xd/zr, Xn/zr        ; sbfx Xd, Xn, #0, #32
The odd man out here is the lack of an unsigned extend word pseudoinstruction, but a MOV works just as well, because the rule for 32-bit destinations is that they are zero-extended to a 64-bit value. So just moving a 32-bit register secretly zero-extends it. In practice, you can usually get the zero-extension for free by using a 32-bit register as the destination for the original calculation.
You can avoid using these instructions entirely if you can merge the extension into a subsequent extended-register operation:
; as two instructions, using r3 as scratch register
    uxtb    r3, r2           ; r3 = (uint8_t)r2
    add     r0, r1, r3       ; r0 = r1 + r3

; merged into one instruction, avoids scratch register
    add     r0, r1, r2, uxtb ; r0 = r1 + (uint8_t)r2
For extending a word to a doubleword, you can synthesize that easily enough:
; unsigned extend word in Xd to doubleword in Xd/X(d+1)
    mov     X(d+1), #0
; signed extend word in Xd to doubleword in Xd/X(d+1)
    asr     X(d+1), Xd, #63 ; copy sign bit to all bits
Next time, we’ll look at ways of loading constants.
The post The AArch64 processor (aka arm64), part 8: Bit shifting and rotation appeared first on The Old New Thing.
The hard-coded shifts are done by repurposing the versatile bitfield manipulation instructions.
; logical shift left by fixed amount
; ubfiz Rd, Rn, #shift, #(size-shift)
    lsl     Rd/zr, Rn/zr, #shift
; logical shift right by fixed amount
; ubfx Rd, Rn, #shift, #(size-shift)
    lsr     Rd/zr, Rn/zr, #shift
; arithmetic shift right by fixed amount
; sbfx Rd, Rn, #shift, #(size-shift)
    asr     Rd/zr, Rn/zr, #shift
Left shifting is done by doing a bit insertion of the surviving bits into the upper bits of the destination. It’s the special case where the number of bits is exactly equal to the register size minus the shift amount.
[Diagram: the bottom size−shift bits of the source are inserted at the top of the destination, above shift zero-filled bits.]
Right shifting is the same thing, but using the unsigned bitfield extract instruction to go in the opposite direction:
[Diagram: the top size−shift bits of the source are extracted to the bottom of the destination, below shift zero-filled bits.]
And arithmetic right shifting uses the signed bitfield extract in order to get signextension behavior.
[Diagram: the top size−shift bits of the source are extracted to the bottom of the destination, and the top shift bits are filled with copies of the sign bit.]
Rotation can be synthesized from double-register extraction by using the rotation source as both of the source registers for extraction.
; rotate right by fixed amount
; extr Rd, Rs, Rs, #shift
    ror     Rd/zr, Rs/zr, #shift
[Diagram: Rs concatenated with itself forms a double-width value; the register-sized window starting shift bits from the bottom becomes Rd, which is a right-rotation.]
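You can check the "same register twice" trick with a quick model (32-bit case, illustrative only):

```python
# EXTR with the same source register twice is a right-rotation.

MASK32 = 0xFFFFFFFF

def extr32(rn, rm, lsb):
    # extract 32 bits starting at lsb from the pair rn:rm
    pair = ((rn & MASK32) << 32) | (rm & MASK32)
    return (pair >> lsb) & MASK32

def ror32(rs, shift):
    return extr32(rs, rs, shift)
```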
Note that there is no “rotate with carry” instruction. The AArch32 rrx instruction does not exist in AArch64.¹ It would have been handy for finding the average of two unsigned integers without overflow.
The variable shifts have their own dedicated instructions.
; logical shift left variable
; Wd = Wn << (Wm & 31)
; Xd = Xn << (Xm & 63)
    lslv    Rd/zr, Rn/zr, Rm/zr
; logical shift right variable
; Wd = Wn >> (Wm & 31), unsigned shift
; Xd = Xn >> (Xm & 63), unsigned shift
    lsrv    Rd/zr, Rn/zr, Rm/zr
; arithmetic shift right variable
; Wd = Wn >> (Wm & 31), signed shift
; Xd = Xn >> (Xm & 63), signed shift
    asrv    Rd/zr, Rn/zr, Rm/zr
; rotate right variable
; Rd = Rn rotated right by Rm positions
    rorv    Rd/zr, Rn/zr, Rm/zr
Note that the shift amount is taken modulo the bit size of the operand. (This doesn’t really matter for RORV, since rotating by the operand bit size has no effect.)
The pseudoinstructions LSL, LSR, ASR, and ROR accept a register as the second input operand and convert it to the corresponding V instruction. This means that when writing assembly, you can just write LSL and let the assembler figure out which real opcode it corresponds to.
There are no S variants to the bit shifting instructions. They never update flags, unlike AArch32, which updated the carry with the last bit shifted out. If you want to know what bit got shifted out, you’ll have to calculate it yourself, say by shifting the same value again, but by one less position, and then inspecting the top/bottom bit (depending on the shift direction).
I have my guesses as to why the designers removed the flags behavior from these instructions: First, it removes a partial register update (flags), which creates a usually-unwanted dependency on the previous flags. Second, no major programming language gives you access to the bit that was shifted out, so it wasn’t used in practice anyway.
Exercise: Suppose there was no double-register extraction instruction or variable rotation instruction. Synthesize fixed and variable rotation from other instructions. (Answer below.)
Bonus chatter: In AArch32, the bottom 8 bits of the shift-count register were used. But in AArch64, only the bottom 5 (for 32-bit operands) or 6 (for 64-bit operands) bits are used.
Answer to exercise: You can synthesize a fixed rotation from a shift and a bitfield insertion.
; rotate r1 left by #imm, producing r0
;                                     r1 = ABCDEFGH
    lsl     r0, r1, #imm            ; r0 = EFGH0000
    bfxil   r0, r1, #(size-imm), #imm ; r0 = EFGHABCD
A variable rotation can be synthesized from a pair of shifts.
; rotate r1 left by r2, producing r0
; (destroys r1 and r2)
;                            r1 = ABCDEFGH
    lslv    r0, r1, r2     ; r0 = EFGH0000
    mvn     r2, r2         ; r2 = leftover bits minus one
    lsr     r1, r1, #1     ; pre-shift to cover the minus one
    lsrv    r2, r1, r2     ; r2 = 0000ABCD
    orr     r0, r0, r2     ; r0 = EFGHABCD
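One way to convince yourself that the shift-pair approach works is to model it against a direct rotation. Note the extra one-position pre-shift: mvn produces size − 1 − n rather than size − n, so the source must be shifted right one position before the variable shift (illustrative Python, 64-bit registers, shift amounts taken mod 64):

```python
# Check the shift-pair rotate-left synthesis against a direct rotate.

M = (1 << 64) - 1

def rotl_direct(x, n):
    n &= 63
    return (((x << n) & M) | (x >> (64 - n))) if n else x

def rotl_synth(r1, r2):
    r0 = (r1 << (r2 & 63)) & M  # lslv
    r2 = ~r2 & M                # mvn: leftover bits minus one
    r1 = r1 >> 1                # pre-shift the extra one
    r2 = r1 >> (r2 & 63)        # lsrv
    return r0 | r2              # orr
```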
¹ Although it doesn’t explicitly have a “rotate left through carry” instruction, you can still do it in a single instruction:

    adcs    r0, r1, r1     ; r0 = r1 rotated left through carry
The post The AArch64 processor (aka arm64), part 7: Bitfield manipulation appeared first on The Old New Thing.
The PowerPC had the rlwinm instruction, which was the Swiss army knife of bit operations. Well, AArch64 has its own all-purpose instruction, known as UBFM, which stands for unsigned bitfield move.
; unsigned bitfield move
;
; if immr ≤ imms:
;     take bits immr through imms and rotate right by immr
;
; if immr > imms:
;     take imms+1 low bits and rotate right by immr
    ubfm    Rd/zr, Rn/zr, #immr, #imms
This instruction hurts my brain. Although the description of the instruction appears to be two unrelated cases, they are handled by the same complex formula internally. It’s just that the formula produces different results depending on which case you’re in. The complex formula is the same one that is used to generate immediates for logical operations, so I’ll give the processor designers credit for the clever way they reduced transistor count.
Fortunately, you never see this instruction in the wild. The two cases are split into separate pseudo-instructions, which re-express the immr and imms values in a more intuitive way.
; unsigned bitfield extract
; (used when immr ≤ imms)
; extract w bits starting at position lsb
    ubfx    Rd/zr, Rn/zr, #lsb, #w
The UBFX instruction handles the case of UBFM where immr ≤ imms and reinterprets it as a bitfield extraction:
[Diagram: the w bits starting at position lsb of the source become the bottom w bits of the destination; the upper bits are zero-filled.]
Since immr ≤ imms, the right-rotation by immr is the same as a right-shift by immr.
And then we have the other case, where immr > imms:
; unsigned bitfield insert into zeroes
; (used when immr > imms)
; extract low-order w bits and shift left by lsb
    ubfiz   Rd/zr, Rn/zr, #lsb, #w
The UBFIZ instruction reinterprets the UBFM as a bitfield insertion, and reinterprets the right-rotation as a left-shift. This reinterpretation is valid because immr > imms, so we are always rotating more bits than we extracted.
[Diagram: the bottom w bits of the source are placed at position lsb of the destination; all other bits are zero-filled.]
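In Python terms, the two pseudo-instructions come down to the following (a model of the bit manipulation, not an encoder):

```python
# Models of the unsigned bitfield extract and insert-into-zeroes
# pseudo-instructions.

def ubfx(rn, lsb, w):
    # extract w bits starting at lsb, right-justified
    return (rn >> lsb) & ((1 << w) - 1)

def ubfiz(rn, lsb, w):
    # take the low w bits and shift them up to lsb
    return (rn & ((1 << w) - 1)) << lsb
```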
There is also a signed version of this instruction:
; signed bitfield move
;
; if immr ≤ imms:
;     take bits immr through imms and rotate right by immr
;     sign-fill upper bits
;
; if immr > imms:
;     take imms+1 low bits and rotate right by immr
;     sign-fill upper bits
    sbfm    Rd/zr, Rn/zr, #immr, #imms
This behaves the same as the unsigned version, except that the upper bits are filled with the sign bit of the bitfield. Like UBFM, the SBFM instruction is also never seen in the wild; it is always replaced by a pseudo-instruction.
; signed bitfield extract
; (used when immr ≤ imms)
; extract w bits starting at position lsb
; sign-fill upper bits
    sbfx    Rd/zr, Rn/zr, #lsb, #w

; signed bitfield insert into zeroes
; (used when immr > imms)
; extract low-order w bits and shift left by lsb
; sign-fill upper bits
    sbfiz   Rd/zr, Rn/zr, #lsb, #w
Here is the operation of SBFX in pictures:
[Diagram: the w bits starting at position lsb of the source become the bottom w bits of the destination; the upper bits are filled with copies of the sign bit.]
And here is SBFIZ:
[Diagram: the bottom w bits of the source are placed at position lsb of the destination; the bits above are filled with copies of the bitfield’s sign bit, and the bits below are zero-filled.]
Note that in the case of SBFIZ, the lower bits are still zero-filled.
The last bitfield opcode is BFM, which follows the same pattern, but just combines the results differently:
; bitfield move
;
; if immr ≤ imms:
;     take bits immr through imms and rotate right by immr
;     merge with existing bits in destination
;
; if immr > imms:
;     take imms+1 low bits and rotate right by immr
;     merge with existing bits in destination
    bfm     Rd/zr, Rn/zr, #immr, #imms
Again, you will never see this instruction in the wild because it always disassembles as a pseudoinstruction:
; bitfield extract and insert low
; (used when immr ≤ imms)
; replace bottom w bits in destination
; with w bits of source starting at lsb
;
; Rd[w-1:0] = Rn[lsb+w-1:lsb]
    bfxil   Rd/zr, Rn/zr, #lsb, #w
The BFXIL instruction is like the UBFX and SBFX instructions, but instead of filling the unused bits with zero or sign bits, the original bits of the destination are preserved.
[Diagram: the w bits starting at position lsb of the source replace the bottom w bits of the destination; the upper bits of the destination are unchanged.]
; bitfield insert
; (used when immr > imms)
; replace w bits in destination starting at lsb
; with low w bits of source
;
; Rd[lsb+w-1:lsb] = Rn[w-1:0]
    bfi     Rd/zr, Rn/zr, #lsb, #w
The BFI instruction is like the UBFIZ and SBFIZ instructions, but instead of filling the unused bits with zero or sign bits, the original bits of the destination are preserved.
[Diagram: the bottom w bits of the source replace the w bits at position lsb of the destination; all other bits of the destination are unchanged.]
; bitfield clear
; replace w bits in destination starting at lsb
; with zero
;
; Rd[lsb+w-1:lsb] = 0
    bfc     Rd/zr, #lsb, #w      ; bfi Rd/zr, zr, #lsb, #w
The BFC instruction just inserts zeroes.
[Diagram: the w bits at position lsb of the destination are zero-filled; all other bits are unchanged.]
The last instruction in the bitfield manipulation category is word/doubleword extraction.
; extract a register from a pair of registers
;
; Wd = ((Wn << 32) | Wm)[lsb+31:lsb]
; Xd = ((Xn << 64) | Xm)[lsb+63:lsb]
    extr    Rd/zr, Rn/zr, Rm/zr, #lsb
The extract register instruction treats its inputs as a register pair and extracts a register-sized stretch of bits from them. This can be used to synthesize multi-word shifts.
[Diagram: Rn concatenated with Rm forms a double-width value; the register-sized window starting lsb bits from the bottom becomes Rd.]
Note that the two input registers are concatenated in big-endian order.
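Here is a quick model of a multi-word right shift built from extraction (illustrative only, 64-bit registers):

```python
# Use the register-pair extraction to shift a 128-bit value hi:lo
# right by a few bits.

M = (1 << 64) - 1

def extr(rn, rm, lsb):
    pair = ((rn & M) << 64) | (rm & M)   # big-endian concatenation
    return (pair >> lsb) & M

hi, lo = 0x0123456789ABCDEF, 0xFEDCBA9876543210
new_lo = extr(hi, lo, 5)   # low word receives bits shifted in from hi
new_hi = hi >> 5           # high word is an ordinary shift
```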
It turns out that a lot of other operations can be reinterpreted as bitfield extractions. We’ll look at some of them next time.
Bonus chatter: AArch32 also had instructions bfi, bfc, ubfx, and sbfx, but each was treated as a unique instruction. AArch64 generalizes them to cover additional scenarios, leaving the classic instructions as special cases of the generalized instructions.
The post The AArch64 processor (aka arm64), part 6: Bitwise operations appeared first on The Old New Thing.
Let’s get the boring part out of the way.
; bitwise and with immediate
; Rd = Rn & imm
    and     Rd/sp, Rn/zr, #imm
; bitwise and with shifted register
; Rd = Rn & (Rm with shift)
    and     Rd/zr, Rn/zr, Rm/zr, shift
; bitwise and with immediate, set flags
; Rd = Rn & imm, set flags
    ands    Rd/zr, Rn/zr, #imm
; bitwise and with shifted register, set flags
; Rd = Rn & (Rm with shift), set flags
    ands    Rd/zr, Rn/zr, Rm/zr, shift
; bitwise clear
; Rd = Rn & ~(Rm with shift)
    bic     Rd/zr, Rn/zr, Rm/zr, shift
; bitwise clear, set flags
; Rd = Rn & ~(Rm with shift), set flags
    bics    Rd/zr, Rn/zr, Rm/zr, shift
; bitwise or with immediate
; Rd = Rn | imm
    orr     Rd/sp, Rn/zr, #imm
; bitwise or with shifted register
; Rd = Rn | (Rm with shift)
    orr     Rd/zr, Rn/zr, Rm/zr, shift
; bitwise or not with shifted register
; Rd = Rn | ~(Rm with shift)
    orn     Rd/zr, Rn/zr, Rm/zr, shift
; bitwise exclusive or with immediate
; Rd = Rn ^ imm
    eor     Rd/sp, Rn/zr, #imm
; bitwise exclusive or with shifted register
; Rd = Rn ^ (Rm with shift)
    eor     Rd/zr, Rn/zr, Rm/zr, shift
; bitwise exclusive or not with shifted register¹
; Rd = Rn ^ ~(Rm with shift)
    eon     Rd/zr, Rn/zr, Rm/zr, shift
There are a lot of combinations here. Let’s put them in a table.
Instruction   Immediate                  Shifted register
              to Rd/sp     to Rd/zr      to Rd/zr     to Rd/zr
              no flags     with flags    no flags     with flags

AND           •            •             •            •
BIC                                      •            •
ORR           •                          •
ORN                                      •
EOR           •                          •
EON                                      •
For the instructions that set flags, the N and Z flags represent the result of the operation, and the C and V flags are cleared.²
Stare at this table a bit and you start to see patterns.
All of the bitwise operations support a shifted register, which could be LSL #0 to mean “no shift”. The operations that do not complement the second input operand support an immediate. (There’s no need to support an immediate for the complement versions, because you can just complement the immediate.) And the AND-like operations are the only ones which support flags. We’ll see workarounds for the lack of flags support in the other bitwise operations when we get to control transfer.
With these instructions, we can create some pseudoinstructions:³
    tst     Rn/zr, #imm           ; ands zr, Rn/zr, #imm
    tst     Rn/zr, Rm/zr, shift   ; ands zr, Rn/zr, Rm/zr, shift
    mov     Rd, #imm              ; orr Rd, zr, #imm
    mov     Rd, Rn/zr, shift      ; orr Rd, zr, Rn/zr, shift
    mvn     Rd, Rn/zr, shift      ; orn Rd, zr, Rn/zr, shift
The TST pseudoinstruction performs a bitwise and of its arguments and sets flags, but discards the result. It’s common to use a power-of-two immediate here, to test a specific bit.
The MOV instruction sets a register equal to the value of another register or a supported immediate.
The MVN instruction sets a register to the bitwise inverse of another register.
Okay, so about those immediates.
The bitwise operations encode the immediates in a very strange way. If that’s the sort of thing that interests you, I encourage you to read Dominik Inführ’s explanation of how they are formed for the gory details.
The short version is that the immediate encodes a pattern whose size is 2, 4, 8, 16, 32, or 64 bits, and the pattern is repeated as many times as necessary to fill out the operand size. The pattern consists of a bunch of right-justified 1’s, with the leading bits filled with 0’s. Finally, after concatenating the copies of the pattern, you can rotate the whole thing to the right by any amount.
For example, single bits are expressible in this format, because you can ask for a 64bit pattern consisting of a single rightmost set bit, and then rotate that single bit into the position you like.
Conversely, all bits set except one can be generated by asking for a 64bit pattern consisting of 63 rightmost set bits (a single clear bit in position 63), and then rotate that 0 bit into the position you like.
Interestingly, you cannot generate all ones or all zeros with this pattern. Fortunately, you don’t need to. You can use zr for zero and the complement instruction with zr for ones. And operations with all ones or all zeroes can often be simplified to another instruction anyway, often avoiding a register dependency.
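If you’re curious whether a particular value fits the scheme, a brute-force check over the pattern space is easy to write. This is an illustrative sketch, not the hardware’s actual decode formula:

```python
# Brute-force check of the logical-immediate pattern scheme:
# right-justified 1's in an element of size 2..64, replicated to
# 64 bits, then rotated right.

def rot_right(value, n, width=64):
    n %= width
    mask = (1 << width) - 1
    return ((value >> n) | (value << (width - n))) & mask

def is_logical_immediate(value):
    for esize in (2, 4, 8, 16, 32, 64):
        for ones in range(1, esize):      # never zero, never all-ones
            pattern = (1 << ones) - 1     # right-justified 1's
            rep = 0
            for _ in range(64 // esize):  # replicate to 64 bits
                rep = (rep << esize) | pattern
            for r in range(esize):
                if rot_right(rep, r) == value:
                    return True
    return False
```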
Missing instruction     Replacement         Note
and Rd, Rn, #0          mov Rd, #0          AND with zero is zero
and Rd, Rn, #−1         mov Rd, Rn          AND with −1 is unchanged
orr Rd, Rn, #0          mov Rd, Rn          OR with zero is unchanged
orr Rd, Rn, #−1         orn Rd, zr, zr      OR with −1 is −1
eor Rd, Rn, #0          mov Rd, Rn          EOR with zero is unchanged
eor Rd, Rn, #−1         orn Rd, zr, Rn      EOR with −1 is bitwise negation
Okay, so that’s it for the bitwise logical operations. Next time, we’ll look at bit shifting.
¹ The EON instruction is new for AArch64. AArch32 does not have this opcode.
² AArch32 left C and V unchanged. My guess is that AArch64 forces both bits clear in order to avoid partial flags updates, which creates unintended dependencies among instructions.
³ AArch64 lost the TEQ instruction from AArch32, which I noted was of limited utility.
The post The AArch64 processor (aka arm64), part 5: Multiplication and division appeared first on The Old New Thing.
]]>; multiply and add ; Rd = Ra + (Rn × Rm) madd Rd/zr, Rn/zr, Rm/zr, Ra/zr ; multiply and subtract ; Rd = Ra  (Rn × Rm) msub Rd/zr, Rn/zr, Rm/zr, Ra/zr
The product is then added to or subtracted from a third register.
You get some pseudo-instructions if you hardcode the third input operand to zero.
; multiply
mul a, b, c ; madd a, b, c, zr

; multiply and negate
mneg a, b, c ; msub a, b, c, zr
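To make the accumulator behavior concrete, here's a tiny Python model of MADD and MSUB. It sketches only the arithmetic, truncated to 64 bits:

```python
MASK64 = (1 << 64) - 1

def madd(n, m, a):
    # Rd = Ra + (Rn × Rm), truncated to 64 bits
    return (a + n * m) & MASK64

def msub(n, m, a):
    # Rd = Ra − (Rn × Rm), truncated to 64 bits
    return (a - n * m) & MASK64

# mul and mneg are just these with zr (zero) as the accumulator:
assert madd(6, 7, 0) == 42                 # mul
assert msub(6, 7, 0) == (-42) & MASK64     # mneg
```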
The next fancier way of multiplying two registers is to multiply two 32-bit registers and get a 64-bit result.
; unsigned multiply and add long
; Xd = Xa + (Wn × Wm), unsigned multiply
umaddl Xd/zr, Wn/zr, Wm/zr, Xa/zr

; unsigned multiply and subtract long
; Xd = Xa − (Wn × Wm), unsigned multiply
umsubl Xd/zr, Wn/zr, Wm/zr, Xa/zr

; signed multiply and add long
; Xd = Xa + (Wn × Wm), signed multiply
smaddl Xd/zr, Wn/zr, Wm/zr, Xa/zr

; signed multiply and subtract long
; Xd = Xa − (Wn × Wm), signed multiply
smsubl Xd/zr, Wn/zr, Wm/zr, Xa/zr
Again, the result of the multiplication is added to or subtracted from an accumulator. The naming of this opcode is a little confusing, because the word long in the opcode talks about the multiplication, not the addition or subtraction. The multiplication is 32 × 32 → 64, and the result is then accumulated as a 64-bit value.
You can probably guess what the pseudo-instructions are. Just hardcode the zero register as the accumulator.
; unsigned multiply long
umull a, b, c ; umaddl a, b, c, zr

; unsigned multiply and negate long
umnegl a, b, c ; umsubl a, b, c, zr

; signed multiply long
smull a, b, c ; smaddl a, b, c, zr

; signed multiply and negate long
smnegl a, b, c ; smsubl a, b, c, zr
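Here's a small Python model of the "long" accumulate forms. Note how the 32-bit sources are zero- or sign-extended before the multiply; the helper names are mine:

```python
MASK64 = (1 << 64) - 1

def sext32(w):
    # Interpret the low 32 bits of w as a signed value.
    w &= 0xFFFFFFFF
    return w - (1 << 32) if w & 0x80000000 else w

def umaddl(wn, wm, xa):
    # Xd = Xa + (Wn × Wm), sources treated as unsigned 32-bit
    return (xa + (wn & 0xFFFFFFFF) * (wm & 0xFFFFFFFF)) & MASK64

def smaddl(wn, wm, xa):
    # Xd = Xa + (Wn × Wm), sources treated as signed 32-bit
    return (xa + sext32(wn) * sext32(wm)) & MASK64

# The "long" part is the multiply: 32 × 32 → 64, then 64-bit accumulate.
assert umaddl(0xFFFFFFFF, 2, 0) == 0x1FFFFFFFE
assert smaddl(0xFFFFFFFF, 2, 0) == (-2) & MASK64  # 0xFFFFFFFF is −1 signed
```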
The last multiplication instruction gives you the missing piece of the 64 × 64 → 128 multiply.
; unsigned multiply high
; Xd = (Xn × Xm) >> 64, unsigned multiply
umulh Xd/zr, Xn/zr, Xm/zr

; signed multiply high
; Xd = (Xn × Xm) >> 64, signed multiply
smulh Xd/zr, Xn/zr, Xm/zr
These give you the upper 64 bits of a 64 × 64 → 128 multiply. If you want the full 128 bits, you combine it with the corresponding 64 × 64 → 64 multiply to get the lower 64 bits.
; unsigned 64 × 64 → 128
; r1:r0 = r2 × r3
mul r0, r2, r3
umulh r1, r2, r3

; signed 64 × 64 → 128
; r1:r0 = r2 × r3
mul r0, r2, r3
smulh r1, r2, r3
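In Python, the full 128-bit product can be modeled like this, showing how MUL supplies the low half and UMULH/SMULH the high half. This is a sketch under the usual two's-complement conventions:

```python
MASK64 = (1 << 64) - 1

def sext64(x):
    # Interpret the low 64 bits of x as a signed value.
    x &= MASK64
    return x - (1 << 64) if x & (1 << 63) else x

def umul128(a, b):
    # Returns (high, low): umulh supplies the high half, mul the low.
    full = (a & MASK64) * (b & MASK64)
    return full >> 64, full & MASK64

def smul128(a, b):
    # Same, but smulh supplies the signed high half.
    full = sext64(a) * sext64(b)
    return (full >> 64) & MASK64, full & MASK64

# The low halves agree; only the high halves differ by signedness.
assert umul128(MASK64, MASK64)[1] == smul128(MASK64, MASK64)[1]
assert umul128(MASK64, MASK64) == (MASK64 - 1, 1)   # (2**64 − 1)²
assert smul128(MASK64, MASK64) == (0, 1)            # (−1) × (−1) = 1
```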
Don’t be fooled by the lack of symmetry: Even though there is a UMULL instruction, it is not the counterpart to UMULH, and the SMULL instruction is not the counterpart to SMULH!
Whereas there are a wide variety of ways to multiply two registers, there are only two ways to divide them.
; unsigned divide
; Rd = Rn ÷ Rm, unsigned divide, round toward zero
udiv Rd/zr, Rn/zr, Rm/zr

; signed divide
; Rd = Rn ÷ Rm, signed divide, round toward zero
sdiv Rd/zr, Rn/zr, Rm/zr
If you try to divide by zero, there is no exception. The result is just zero. If you want to trap division by zero, you’ll have to test for a zero denominator explicitly.
There is also no exception for dividing the most negative integer by −1. You just get the most negative integer back.
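These rules are easy to model in Python. This is a sketch; the helper names are mine:

```python
MASK64 = (1 << 64) - 1

def sext64(x):
    # Interpret the low 64 bits of x as a signed value.
    x &= MASK64
    return x - (1 << 64) if x & (1 << 63) else x

def sdiv(n, m):
    # Signed divide, rounding toward zero, with no exceptions.
    n, m = sext64(n), sext64(m)
    if m == 0:
        return 0                     # dividing by zero just produces zero
    q = abs(n) // abs(m)             # magnitude, truncated
    if (n < 0) != (m < 0):
        q = -q
    return q & MASK64                # INT64_MIN ÷ −1 wraps to INT64_MIN

INT64_MIN = 1 << 63                  # bit pattern 0x8000000000000000
assert sdiv(7, 0) == 0
assert sdiv(INT64_MIN, MASK64) == INT64_MIN      # MASK64 is −1 signed
assert sdiv((-7) & MASK64, 2) == (-3) & MASK64   # truncates toward zero
```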
None of the multiplication or division operations set flags.
There is no instruction for calculating the remainder. You can do that manually by calculating r = n − (n ÷ d) × d. This can be done by following up the division with an msub:
; unsigned remainder after division
udiv Rq, Rn, Rm     ; Rq = Rn ÷ Rm
msub Rr, Rq, Rm, Rn ; Rr = Rn − Rq × Rm
                    ;    = Rn − (Rn ÷ Rm) × Rm

; signed remainder after division
sdiv Rq, Rn, Rm     ; Rq = Rn ÷ Rm
msub Rr, Rq, Rm, Rn ; Rr = Rn − Rq × Rm
                    ;    = Rn − (Rn ÷ Rm) × Rm
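A quick Python model of the divide-then-msub sequence shows that the resulting remainder takes the sign of the dividend, just like C's % operator:

```python
def trunc_div(n, m):
    # Division that rounds toward zero (what udiv/sdiv produce),
    # with the ARM convention that n ÷ 0 is 0.
    if m == 0:
        return 0
    q = abs(n) // abs(m)
    return -q if (n < 0) != (m < 0) else q

def srem(n, m):
    # The msub step: Rr = Rn − Rq × Rm, where Rq = Rn ÷ Rm.
    return n - trunc_div(n, m) * m

assert srem(7, 3) == 1
assert srem(-7, 3) == -1     # remainder takes the sign of the dividend
assert srem(7, -3) == 1
assert srem(7, 0) == 7       # since 7 ÷ 0 is 0, the remainder is 7 itself
```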
Next time, we’ll look at the logical operations and their extremely weird immediates.
The post The AArch64 processor (aka arm64), part 4: Addition and subtraction appeared first on The Old New Thing.
op x, y, z ; x = y op z
They take two source operands, combine them according to some operation, and put the result in the destination register.
Similarly, most of the unary operation instructions look like
op x, y ; x = op y
The destination is typically a numbered register or sp, and can be a 64-bit register or a 32-bit subregister. If you use a 32-bit subregister, then the result is zero-extended to a 64-bit value.
Okay, let’s start with addition:
add Rd/sp, Rn/sp, #imm12
add Rd/sp, Rn/sp, #imm12, LSL #12

add Rd/zr, Rn/zr, Rm/zr, LSL #n
add Rd/zr, Rn/zr, Rm/zr, LSR #n
add Rd/zr, Rn/zr, Rm/zr, ASR #n

add Rd/sp, Rn/sp, Rm/zr, UXTB #n ; 0 ≤ n ≤ 4
add Rd/sp, Rn/sp, Rm/zr, UXTH #n ; 0 ≤ n ≤ 4
add Rd/sp, Rn/sp, Rm/zr, UXTW #n ; 0 ≤ n ≤ 4
add Rd/sp, Rn/sp, Rm/zr, UXTX #n ; 0 ≤ n ≤ 4
add Rd/sp, Rn/sp, Rm/zr, SXTB #n ; 0 ≤ n ≤ 4
add Rd/sp, Rn/sp, Rm/zr, SXTH #n ; 0 ≤ n ≤ 4
add Rd/sp, Rn/sp, Rm/zr, SXTW #n ; 0 ≤ n ≤ 4
add Rd/sp, Rn/sp, Rm/zr, SXTX #n ; 0 ≤ n ≤ 4
To ask for flags to be set based on the result, apply an S suffix to the opcode, producing ADDS.
Note that some of these encodings permit the operand to be sp, but others allow zr.
The first two versions add an immediate. It is either a 12-bit unsigned immediate (0 ≤ n ≤ 4095) or a 12-bit unsigned immediate shifted left by 12. This means that you can express constants of the form 0x00000XXX and 0x00XXX000. The disassembler does the LSL #12 for you, so you won’t actually see the #imm12, LSL #12 version disassembled as such. Instead, you’ll see the shifted constant:

add x0, x1, #0x123000 ; encoded as #0x123, LSL #12
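A little Python predicate makes the rule concrete. The function name is mine, a sketch rather than anything from the architecture manual:

```python
def is_addsub_immediate(value):
    # True if value fits the ADD/SUB immediate forms described above:
    # a 12-bit value, optionally shifted left by 12.
    if not 0 <= value <= 0xFFFFFF:
        return False
    if value <= 0xFFF:
        return True                  # plain #imm12
    return (value & 0xFFF) == 0      # #imm12, LSL #12

assert is_addsub_immediate(0x123)        # 0x00000XXX
assert is_addsub_immediate(0x123000)     # 0x00XXX000
assert not is_addsub_immediate(0x123456) # needs bits in both halves
```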
The next block of variants adds a shifted register. You are allowed to shift doublewords by up to 63 positions and words up to 31 positions. You don’t need any larger shifts, because unsigned shifting by an amount greater than or equal to the operand bit size just gives you zero, so you should just have used zr. And signed shifting right by an amount greater than or equal to the operand size is the same as shifting right by one less than the operand bit size.
The last block lets you use the extended registers. You can use all of the extended forms, and the shift amount can be up to four positions. These extended registers with shifts are convenient for calculating array offsets:
; x0 = x1 + (int32_t)x2 * 16
add x0, x1, x2, SXTW #4
In this case, x1 is the base of an array where each element is of size 16, and x2 is a 32-bit signed array index, and we calculate the address of the element into x0.
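Modeled in Python, the address calculation looks like this (a sketch; the helper names are mine):

```python
MASK64 = (1 << 64) - 1

def sext32(w):
    # Interpret the low 32 bits of w as a signed value.
    w &= 0xFFFFFFFF
    return w - (1 << 32) if w & 0x80000000 else w

def add_sxtw(base, index, shift):
    # Models: add Xd, Xn, Rm, SXTW #shift
    # Sign-extend the 32-bit index, shift it, and add to the base.
    return (base + (sext32(index) << shift)) & MASK64

array_base = 0x10000
assert add_sxtw(array_base, 3, 4) == array_base + 3 * 16
# A negative 32-bit index indexes backward, thanks to the sign extension.
assert add_sxtw(array_base, 0xFFFFFFFF, 4) == array_base - 16
```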
The ARM uses true carry. This means that for subtraction, the carry is clear when a borrow occurs, and subtract with carry subtracts an additional unit if inbound carry is clear.
The subtraction instruction has the same available variants as the addition instructions.
; calculate x = y − z
sub x, y, z ; same options as add

; calculate x = y − z, set flags
subs x, y, z ; same options as adds
Adding and subtracting with carry have only one encoding option.
; Rd = Rn + Rm + carry
adc Rd/zr, Rn/zr, Rm/zr

; Rd = Rn + Rm + carry, set flags
adcs Rd/zr, Rn/zr, Rm/zr

; Rd = Rn − Rm − !carry
sbc Rd/zr, Rn/zr, Rm/zr

; Rd = Rn − Rm − !carry, set flags
sbcs Rd/zr, Rn/zr, Rm/zr
From the addition and subtraction instructions, we can construct these pseudo-instructions, taking advantage of literal zeros and the hardcoded zero register: Reads from the zero register produce zero, and writes to the zero register are discarded.
; move register to/from sp
mov sp, Rn ; add sp, Rn, #0
mov Rn, sp ; add Rn, sp, #0

; move constant to register
mov Rn, #imm12 ; add Rn, zr, #imm12
mov Rn, #imm12, LSL #12 ; add Rn, zr, #imm12, LSL #12
Adding zero gives you the ability to move between sp and the general-purpose registers. And adding an immediate to the zero register loads a constant. We’ll see later that other register-to-register moves are encoded with a different pseudo-instruction, and there are plenty of options for loading constants beyond just this one.
The use of true carry permits the following group of pseudo-instructions for adding or subtracting negative numbers:
add a, b, #−n ; sub a, b, #n
adds a, b, #−n ; subs a, b, #n
sub a, b, #−n ; add a, b, #n
subs a, b, #−n ; adds a, b, #n
The immediate operand to the ADD and SUB instruction families is treated as unsigned, but you can switch to the opposite instruction to get negative values (provided n ≠ 0). Note that this works due to ARM’s use of true carry. (If ARM had used borrow, then this conversion would set the carry bit incorrectly.)
cmp x, y ; subs zr, x, y
cmn x, y ; adds zr, x, y
The compare and compare negative instructions are just subtraction and addition that set flags and throw away the result. Beware of the lie hiding inside the CMN instruction.
; negate (possibly setting flags)
neg x, y, shift ; sub x, zr, y, shift
negs x, y, shift ; subs x, zr, y, shift

; negate with carry (possibly setting flags)
ngc x, y ; sbc x, zr, y
ngcs x, y ; sbcs x, zr, y
Subtracting from zero gives you the ability to negate a value. Note that these pseudo-instructions are available only with shifted registers because the corresponding subtraction instructions support zr as the first input only when the second input is a shifted register. (Of course, you can shift by #0 if you didn’t really want to shift the second input.)
That turned out to be a lot to say about addition and subtraction. Next time, we’ll look at the fancier arithmetic operations: Multiplication and division.