{"id":105324,"date":"2021-06-18T07:00:00","date_gmt":"2021-06-18T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105324"},"modified":"2021-06-18T07:39:32","modified_gmt":"2021-06-18T14:39:32","slug":"20210618-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210618-00\/?p=105324","title":{"rendered":"The ARM processor (Thumb-2), part 15: Miscellaneous instructions"},"content":{"rendered":"<p>There are far more ARM instructions than I&#8217;m going to cover here. I&#8217;ve skipped over the floating point instructions, the SIMD instructions, and some other specialty instructions that I haven&#8217;t yet seen come out of the compiler.<\/p>\n<p>Here are a few that are still interesting, even if I haven&#8217;t seen the compiler generate them.<\/p>\n<pre>    ; count leading zeroes (high order bits)\r\n    clz     Rd, Rm          ; Rd = number of leading zeroes in Rm\r\n\r\n    ; reverse bits\r\n    rbit    Rd, Rm          ; Rd = Rm bitwise reversed\r\n\r\n    ; reverse bytes\r\n    rev     Rd, Rm          ; Rd = Rm bytewise reversed\r\n\r\n    ; reverse bytes in each halfword\r\n    rev16   Rd, Rm          ; Rd[31:24] = Rm[23:16]\r\n                            ; Rd[23:16] = Rm[31:24]\r\n                            ; Rd[15: 8] = Rm[ 7: 0]\r\n                            ; Rd[ 7: 0] = Rm[15: 8]\r\n\r\n    ; reverse bytes in lower halfword and sign extend\r\n    revsh   Rd, Rm          ; Rd[31:8] = Rm[ 7:0] sign extended\r\n                            ; Rd[ 7:0] = Rm[15:8]\r\n<\/pre>\n<p>A few miscellaneous bit-fiddling instructions. 
The reversal instructions are primarily for changing data endianness.<\/p>\n<p>The next few instructions provide multiprocessing hints.<\/p>\n<pre>    ; yield to other threads\r\n    yield\r\n\r\n    ; wait for interrupt\r\n    wfi\r\n<\/pre>\n<p>The <code>YIELD<\/code> instruction is a hint to multi-threading processors that the current thread should be de-prioritized in favor of other threads. You typically see this instruction dropped into spin loops, via the intrinsic <code>__yield()<\/code>.<\/p>\n<p>The <code>WFI<\/code> instruction tells the processor to go into a low-power state until an interrupt occurs. There are other instructions related to &#8220;events&#8221; which I won&#8217;t bother going into.<\/p>\n<p>The next few instructions are for communicating with the operating system:<\/p>\n<pre>        svc     #imm8       ; system call\r\n        bkpt    #imm8       ; software breakpoint\r\n        udf     #imm8       ; undefined opcode\u00b9\r\n<\/pre>\n<p>The system call and breakpoint instructions both carry an 8-bit immediate that the operating system can choose to use for whatever purpose it desires. The breakpoint instruction breaks the rules and always executes even if an encompassing <code>IT<\/code> instruction would normally cause it to be ignored. In other words, <code>bkpt<\/code> overrides <code>IT<\/code>.<\/p>\n<p>The undefined opcode is a block of 256 instructions from <code>0xde00<\/code> through <code>0xdeff<\/code> that are architecturally set aside as undefined instructions and which will not be given meaning in future versions of the processor.<\/p>\n<p>But just because the processor leaves them undefined doesn&#8217;t mean that the operating system can&#8217;t <a title=\"The hunt for a faster syscall trap\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20041215-00\/?p=37003\"> give them special meaning<\/a>. 
Windows defines custom artificial instructions in the undefined space.\u00b2<\/p>\n<pre>    __debugbreak            ; udf #0xFE\r\n    __debugservice          ; udf #0xFD\r\n    __assertfail            ; udf #0xFC\r\n    __fastfail              ; udf #0xFB\r\n    __rdpmccntr64           ; udf #0xFA\r\n    __brkdiv0               ; udf #0xF9\r\n<\/pre>\n<p>Most of these are special ways of manually generating specific exceptions.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Opcode<\/th>\n<th>Exception<\/th>\n<th>Notes<\/th>\n<\/tr>\n<tr>\n<td><code>__debugbreak<\/code><\/td>\n<td><code>STATUS_<wbr \/>BREAKPOINT<\/code><\/td>\n<td>The &#8220;real&#8221; breakpoint instruction.<\/td>\n<\/tr>\n<tr>\n<td><code>__debugservice<\/code><\/td>\n<td><code>STATUS_<wbr \/>BREAKPOINT<\/code><\/td>\n<td>Communicate with debugger, <var>r12<\/var> is function code.<\/td>\n<\/tr>\n<tr>\n<td><code>__assertfail<\/code><\/td>\n<td><code>STATUS_<wbr \/>ASSERTION_<wbr \/>FAILURE<\/code><\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td><code>__fastfail<\/code><\/td>\n<td><code>STATUS_<wbr \/>STACK_<wbr \/>BUFFER_<wbr \/>OVERRUN<\/code><\/td>\n<td><a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190108-00\/?p=100655\"> Misleadingly-named<\/a>.<\/td>\n<\/tr>\n<tr>\n<td><code>__brkdiv0<\/code><\/td>\n<td><code>STATUS_<wbr \/>INTEGER_<wbr \/>DIVIDE_<wbr \/>BY_<wbr \/>ZERO<\/code><\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The <code>__brkdiv0<\/code> instruction is emitted by the compiler if it detects a zero denominator.<\/p>\n<pre>    cbnz    r0, @F          ; jump if denominator is nonzero\r\n    __brkdiv0               ; oops: manually raise div0 exception\r\n@@: bl      __rt_sdiv       ; software divide\/remainder\r\n                            ; (r0, r1) = (r1 \u00f7 r0, r1 mod r0)\r\n<\/pre>\n<p>The last artificial instruction is <code>__rdpmccntr64<\/code>, which reads a 64-bit 
cycle counter. This special instruction has a dedicated fast path through the trap handler, so it can produce the result in around 60 cycles.<\/p>\n<p>There is also an instruction to access coprocessor registers.<\/p>\n<pre>    ; move register from coprocessor\r\n    mrc (a bunch of stuff)\r\n<\/pre>\n<p>The coprocessor registers are encoded in a totally wacky way. There&#8217;s no point learning what each of the values means. All that matters is that they represent the register you want to read.<\/p>\n<p>There are a few coprocessor registers called <i>software thread ID registers<\/i>, which are not used by the processor itself but are provided with the intention that operating systems use them to record per-thread information. The two available from user mode are named <code>TPIDRURW<\/code> and <code>TPIDRURO<\/code>; from user mode, the first is read-write and the second is read-only. Windows uses <code>TPIDRURW<\/code> to hold the thread information.<\/p>\n<p>And of course, we have this guy:<\/p>\n<pre>    nop\r\n<\/pre>\n<p>Actually, there are two of this guy, a 16-bit <code>NOP<\/code> and a 32-bit <code>NOP<\/code>. The <code>NOP<\/code> instruction does nothing but occupy space. Use it to pad code to meet alignment requirements, but do not use it for timing because processors are allowed to optimize it out, or even run <i>faster<\/i>.<\/p>\n<p>Now that we have the basic instruction set under our belt, we&#8217;ll look at the calling convention next time.<\/p>\n<p><b>Bonus chatter<\/b>: Why doesn&#8217;t Windows use <code>udf #0xff<\/code>? The gcc toolchain uses <code>udf #0xff<\/code> as its &#8220;We should never get here&#8221; trap instruction. 
Putting an artificial instruction there would cause such a program to continue executing after it thought it had triggered a fatal exception.<\/p>\n<p>\u00b9 Although the ARM documentation provides the <code>udf<\/code> mnemonic for the undefined instruction, not all assemblers recognize it, so you may be forced to encode the hex value directly into your code if that&#8217;s what you want.<\/p>\n<p>\u00b2 I don&#8217;t know why Windows chose the <code>udf<\/code> space for these artificial opcodes instead of using the <code>svc<\/code> space. Maybe there&#8217;s some fine print in the processor manual that makes <code>svc<\/code> unsuitable for this sort of thing. We know that <code>bkpt<\/code> is a bad choice for an artificial opcode because <code>bkpt<\/code> executes even if an encompassing <code>IT<\/code> instruction would have skipped it.<\/p>\n<p>Then again, use of <code>udf<\/code> to create artificial instructions is explicitly listed in the processor architecture manual as a valid use of the <code>udf<\/code> instruction, so at least it&#8217;s not breaking any unwritten rules.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The stuff that didn&#8217;t fit anywhere else.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-105324","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>The stuff that didn&#8217;t fit anywhere 
else.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105324","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105324"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105324\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105324"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105324"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105324"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}