{"id":105265,"date":"2021-05-31T07:00:00","date_gmt":"2021-05-31T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105265"},"modified":"2021-06-01T08:35:26","modified_gmt":"2021-06-01T15:35:26","slug":"20210531-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210531-00\/?p=105265","title":{"rendered":"The ARM processor (Thumb-2), part 1: Introduction"},"content":{"rendered":"<p>I&#8217;ve run out of historical processors that Windows supported, so I&#8217;m moving on to processors that are still in support. First up in this series is 32-bit ARM.<\/p>\n<p>As with all of these series, I&#8217;m focusing on how Windows 10\u00b9 uses the processor in user mode, with particular focus on the instructions you are most likely to encounter in compiler-generated code.<\/p>\n<p>The classic ARM processor generally follows the principles of Reduced Instruction Set Computing (RISC): It has fixed-length instructions, a large uniform register set, and the only operations on memory are loading and storing. However, Windows doesn&#8217;t use the ARM processor in classic mode, so some of the above statements aren&#8217;t true any more.<\/p>\n<p>Windows uses the ARM in a mode known as Thumb-2 mode.\u00b2 In Thumb-2 mode, some classic features are not available, such as most forms of predication. The Thumb-2 mode instruction encoding is variable-length, with a mix of 16-bit instructions and 32-bit instructions. Every instruction is required to begin on an even address, but 32-bit instructions are permitted to straddle a 4-byte boundary.<\/p>\n<p>In addition to classic ARM mode, Thumb mode, and Thumb-2 mode, there are also <a href=\"https:\/\/en.wikipedia.org\/wiki\/Jazelle\"> Jazelle<\/a> mode (which executes Java bytecode) and ThumbEE mode. I&#8217;m not going to cover them at all in this series, since Windows doesn&#8217;t use them. 
<b>From now on, I&#8217;m talking only about Thumb-2 mode<\/b>.<\/p>\n<p>The ARM architecture permits little-endian or big-endian operation. Windows runs the processor in little-endian mode and disables the <code>SETEND<\/code> instruction, so you can&#8217;t switch to big-endian even if you tried.<\/p>\n<p>The architectural terms for data sizes are<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Term<\/th>\n<th>Size<\/th>\n<\/tr>\n<tr>\n<td>byte<\/td>\n<td>\u20078 bits<\/td>\n<\/tr>\n<tr>\n<td>halfword<\/td>\n<td>16 bits<\/td>\n<\/tr>\n<tr>\n<td>word<\/td>\n<td>32 bits<\/td>\n<\/tr>\n<tr>\n<td>doubleword<\/td>\n<td>64 bits<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The ARM instruction set has 16 general-purpose integer registers, each 32 bits wide, and formally named <var>r0<\/var> through <var>r15<\/var>. They are conventionally used as follows:<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Register<\/th>\n<th>Mnemonic<\/th>\n<th>Meaning<\/th>\n<th>Preserved?<\/th>\n<\/tr>\n<tr>\n<td><var>r0<\/var><\/td>\n<td>(<var>a1<\/var>)<\/td>\n<td>argument 1 and return value<\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td><var>r1<\/var><\/td>\n<td>(<var>a2<\/var>)<\/td>\n<td>argument 2 and second return value<\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td><var>r2<\/var><\/td>\n<td>(<var>a3<\/var>)<\/td>\n<td>argument 3<\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td><var>r3<\/var><\/td>\n<td>(<var>a4<\/var>)<\/td>\n<td>argument 
4<\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td><var>r4<\/var><\/td>\n<td>(<var>v1<\/var>)<\/td>\n<td>&nbsp;<\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr>\n<td><var>r5<\/var><\/td>\n<td>(<var>v2<\/var>)<\/td>\n<td>&nbsp;<\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr>\n<td><var>r6<\/var><\/td>\n<td>(<var>v3<\/var>)<\/td>\n<td>&nbsp;<\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr>\n<td><var>r7<\/var><\/td>\n<td>(<var>v4<\/var>)<\/td>\n<td>&nbsp;<\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr>\n<td><var>r8<\/var><\/td>\n<td>(<var>v5<\/var>)<\/td>\n<td>&nbsp;<\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr>\n<td><var>r9<\/var><\/td>\n<td>(<var>v6<\/var>)<\/td>\n<td>&nbsp;<\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr>\n<td><var>r10<\/var><\/td>\n<td>(<var>v7<\/var>)<\/td>\n<td>&nbsp;<\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr>\n<td><var>r11<\/var><\/td>\n<td><var>fp<\/var> (<var>v8<\/var>)<\/td>\n<td>frame pointer<\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr>\n<td><var>r12<\/var><\/td>\n<td>(<var>ip<\/var>)<\/td>\n<td>intra-procedure-call scratch<\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td><var>r13<\/var><\/td>\n<td><var>sp<\/var><\/td>\n<td>stack pointer<\/td>\n<td>Yes<\/td>\n<\/tr>\n<tr>\n<td><var>r14<\/var><\/td>\n<td><var>lr<\/var><\/td>\n<td>link register<\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td><var>r15<\/var><\/td>\n<td><var>pc<\/var><\/td>\n<td>program counter<\/td>\n<td>N\/A<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The names in parentheses are used by some assemblers, but Microsoft&#8217;s toolchain doesn&#8217;t use those names. Some operating systems use <var>r9<\/var> for special purposes (usually as a table of contents\/gp or a thread-local pointer), but Windows does not assign it any special meaning. On Windows, it is available for general use, as long as the value is preserved across calls.<\/p>\n<p>The meanings of the last three registers (<var>sp<\/var>, <var>lr<\/var>, <var>pc<\/var>) are architectural.\u00b3 The rest are convention. 
We&#8217;ll learn more about register conventions later.<\/p>\n<p>The processor enforces 4-byte alignment for the <var>sp<\/var> register. Operations which misalign the stack result in unpredictable behavior.\u2074 Windows further requires that the stack be 8-byte aligned at function call boundaries.<\/p>\n<p>The ARM is notable for putting the program counter in the general-purpose register category, a feature which has been called &#8220;<a href=\"https:\/\/groups.google.com\/g\/comp.arch\/c\/xf7eQ0e8TZQ\/m\/cLFC_uYiWkcJ\">overly uniform<\/a>&#8221; by noted processor architect <a href=\"https:\/\/www.linkedin.com\/in\/mitch-alsup-8691537\"> Mitch Alsup<\/a>. The program counter register reads as the address of the current instruction plus four. The +4 is due to the pipelining of the original ARM implementation: By the time the pipeline gets to fetching the value of the register, the CPU has already advanced the program counter by four bytes. Even though later implementations of ARM have deeper pipelining, they continue to emulate the original pipelining for the purpose of reading from the program counter.\u2075 Writing to the program counter acts like a jump instruction: The next instruction to be executed is the one at the address you wrote.<\/p>\n<p>This magic treatment of the program counter register is a bit mind-blowing when you first encounter it.<\/p>\n<p>Floating point and SIMD support (Neon) is optional in the ARM architecture, but Windows requires both. 
This means that you also have 32 double-precision (64-bit) floating point registers. The first 16 of them (<var>d0<\/var> through <var>d15<\/var>) can also be accessed as 32 single-precision (32-bit) floating point registers; the upper 16 (<var>d16<\/var> through <var>d31<\/var>) have no single-precision aliases.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: center;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th colspan=\"2\">Registers<\/th>\n<th>Preserved?<\/th>\n<\/tr>\n<tr>\n<td><var>s0\u2007<\/var> + <var>s1\u2007<\/var><\/td>\n<td><var>d0\u2007<\/var><\/td>\n<td rowspan=\"4\" valign=\"center\">No<\/td>\n<\/tr>\n<tr>\n<td><var>s2\u2007<\/var> + <var>s3\u2007<\/var><\/td>\n<td><var>d1\u2007<\/var><\/td>\n<\/tr>\n<tr>\n<td>\u22ee<\/td>\n<td>\u22ee<\/td>\n<\/tr>\n<tr>\n<td><var>s14<\/var> + <var>s15<\/var><\/td>\n<td><var>d7\u2007<\/var><\/td>\n<\/tr>\n<tr>\n<td><var>s16<\/var> + <var>s17<\/var><\/td>\n<td><var>d8\u2007<\/var><\/td>\n<td rowspan=\"3\" valign=\"center\">Yes<\/td>\n<\/tr>\n<tr>\n<td>\u22ee<\/td>\n<td>\u22ee<\/td>\n<\/tr>\n<tr>\n<td><var>s30<\/var> + <var>s31<\/var><\/td>\n<td><var>d15<\/var><\/td>\n<\/tr>\n<tr>\n<td>(none)<\/td>\n<td><var>d16<\/var><\/td>\n<td rowspan=\"3\" valign=\"center\">No<\/td>\n<\/tr>\n<tr>\n<td>\u22ee<\/td>\n<td>\u22ee<\/td>\n<\/tr>\n<tr>\n<td>(none)<\/td>\n<td><var>d31<\/var><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The ARM does not have branch delay slots. You can breathe a sigh of relief.<\/p>\n<p>The flags register is formally known as the Application Program Status Register (APSR). 
These flags are available to user mode:<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Mnemonic<\/th>\n<th>Meaning<\/th>\n<th>Notes<\/th>\n<\/tr>\n<tr>\n<td>N<\/td>\n<td>Negative<\/td>\n<td>Set if the result is negative<\/td>\n<\/tr>\n<tr>\n<td>Z<\/td>\n<td>Zero<\/td>\n<td>Set if the result is zero<\/td>\n<\/tr>\n<tr>\n<td>C<\/td>\n<td>Carry<\/td>\n<td>Multiple purposes<\/td>\n<\/tr>\n<tr>\n<td>V<\/td>\n<td>Overflow<\/td>\n<td>Signed overflow<\/td>\n<\/tr>\n<tr>\n<td>Q<\/td>\n<td>Saturation<\/td>\n<td>Accumulated overflow<\/td>\n<\/tr>\n<tr>\n<td>GE[n]<\/td>\n<td>Greater than or equal to<\/td>\n<td>4 flags (SIMD)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The overflow flag records whether the most recent operation resulted in signed overflow. The saturation flag is used by multimedia instructions to accumulate whether any overflow occurred since it was last cleared. The GE flags record the result of SIMD operations. Flags are not preserved across calls.<\/p>\n<p>Under the Windows ABI, there is an 8-byte red zone beneath the stack pointer. However, you&#8217;ll never see the compiler using it because the red zone is reserved. It&#8217;s there for intrusive profilers.<\/p>\n<p>Intrusive profilers inject code into your binary to update hit counts. The ARM does not have an absolute addressing mode; access to memory is always indirect through registers. Therefore, the profiler needs to be able to &#8220;borrow&#8221; a register in order to access memory, and it does so by saving the current contents of two temporary registers to the red zone. This frees up just enough registers to be able to update profiling information.<\/p>\n<pre>    str     r12, [sp, #-4]  ; save r12 into the red zone\r\n    str     r0,  [sp, #-8]  ; save r0  into the red zone\r\n\r\n    ; We can now use r12 and r0 to update profiling statistics.\r\n    ... do profiling stuff with r12 and r0 ...\r\n\r\n    ; All done. 
Restore the registers we borrowed.\r\n    ldr     r0,  [sp, #-8]  ; recover r0  from the red zone\r\n    ldr     r12, [sp, #-4]  ; recover r12 from the red zone\r\n<\/pre>\n<p>\u00b9 Windows CE also supported ARM, but it supported both Thumb-2 mode and classic ARM, so <a title=\"ARM Calling Sequence Specification (Windows CE 5.0)\" href=\"https:\/\/docs.microsoft.com\/en-us\/previous-versions\/windows\/embedded\/ms933779(v=msdn.10)\"> its ABI was different<\/a>. This series covers the Windows 10 ABI.<\/p>\n<p>\u00b2 Thumb-2 is an expansion of an earlier instruction set known unsurprisingly as Thumb. (Exercise: Why didn&#8217;t they call it Thumb-1?) The idea of using a 16-bit instruction set <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190805-00\/?p=102749\"> came from the SuperH<\/a>, and ARM licensed it from Hitachi for use in Thumb mode.<\/p>\n<p>\u00b3 The use of <var>r13<\/var> as the stack pointer is not architectural in classic ARM, but it is architectural in Thumb-2. Making it architectural frees up room in the tight 16-bit instruction encoding space.<\/p>\n<p>\u2074 In processor-speak, <i>unpredictable<\/i> means that the processor can perform any operations it likes, provided they are permissible at the current privilege level. For example, an <i>unpredictable<\/i> operation in user mode can set all registers to 42. 
But it cannot perform privileged operations, and the result cannot be dependent upon state that is not visible to user mode.<\/p>\n<p>\u2075 As with branch delay slots, the +4 effect of reading from the program counter is another example of how a clever hack in a processor&#8217;s original architecture turns into a compatibility constraint for future implementations.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Moving into the present.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-105265","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Moving into the present.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105265","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105265"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105265\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105265"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105265"},{"taxon
omy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105265"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}