{"id":110374,"date":"2024-10-15T07:00:00","date_gmt":"2024-10-15T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=110374"},"modified":"2024-10-15T08:30:04","modified_gmt":"2024-10-15T15:30:04","slug":"20241015-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20241015-00\/?p=110374","title":{"rendered":"A quick introduction to return address protection technologies"},"content":{"rendered":"<p>Return Oriented Programming (ROP) is a malware technique that takes advantage of a memory write vulnerability to populate the stack with synthesized return addresses, each of which points to a code fragment (known as a <i>gadget<\/i>) that executes a few instructions before performing a return instruction. The idea is that an attacker can gain arbitrary code execution by cobbling together these small sequences of instructions into a larger operation.<\/p>\n<p>A common defense against ROP techniques is to use some form of return address protection by confirming that the return address that is about to be used matches the return address received at the start of the function. 
In a ROP attack, the synthesized return addresses do not correspond to any call, which gives the system an opportunity to detect that something bad has happened.<\/p>\n<p>We saw some time ago that <a title=\"The AArch64 processor (aka arm64), part 18: Return address protection\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220819-00\/?p=107020\"> the AArch64 architecture contains hardware support for return address validation<\/a> through the use of the <tt>pacibsp<\/tt> and <tt>autibsp<\/tt> pair of instructions, which sign a return address and validate the signature, respectively.<\/p>\n<p>Another approach is to use a <i>shadow stack<\/i>, which is another stack in memory into which copies of the original return addresses are recorded, and against which those return addresses are validated before being used.<\/p>\n<p>There are two common patterns for shadow stacks, known as <i>parallel shadow stacks<\/i> and <i>compact shadow stacks<\/i>.<\/p>\n<p>The <i>compact shadow stack<\/i> reserves another register to be used as a shadow stack pointer. For example, you might do this:<\/p>\n<pre>; function entry with return address on CPU stack\r\n; assume r15 is the shadow stack pointer\r\n\r\n    ; retrieve return address\r\n    mov     rax, [rsp]\r\n\r\n    ; push onto shadow stack\r\n    mov     [r15-8], rax\r\n    lea     r15, [r15-8]\r\n\r\n    \u27e6 main function body goes here \u27e7\r\n\r\n    ; before returning, pop the return address\r\n    ; from the shadow stack\r\n    mov     r11, [r15]\r\n    lea     r15, [r15+8]\r\n\r\n    ; and check that it matches the CPU stack\r\n    cmp     r11, [rsp]\r\n    jnz     fatal\r\n\r\n    ret\r\n<\/pre>\n<p>This is called a compact shadow stack because all the return addresses are stored in contiguous memory. 
The amount of memory required for the shadow stack is <code>sizeof(address)<\/code> \u00d7 call depth.<\/p>\n<table style=\"border-collapse: collapse; text-align: center;\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td>CPU stack<\/td>\n<td style=\"width: 10ex;\">\u00a0<\/td>\n<td>shadow stack<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">\u22ee<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">\u22ee<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">retaddr1<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">retaddr1<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">retaddr2<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">retaddr3<\/td>\n<td style=\"text-align: left;\">\u2190 r15<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">retaddr2<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">retaddr3<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td style=\"text-align: left;\">\u2190 rsp<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>By comparison the <i>parallel shadow stack<\/i> allocates a block of memory the same size as the CPU stack, and there is a buddy system between each 
byte of the CPU stack and each byte of the shadow stack. Access to the shadow stack is usually mediated by an otherwise-unused selector.<\/p>\n<pre>; function entry with return address on CPU stack\r\n; assume fs has a base address equal to the distance\r\n; between the CPU stack and the shadow stack\r\n\r\n    ; retrieve return address\r\n    mov     rax, [rsp]\r\n\r\n    ; copy to shadow stack\r\n    mov     fs:[rsp], rax\r\n\r\n    \u27e6 main function body goes here \u27e7\r\n\r\n    ; before returning, compare the return address\r\n    ; to the shadow stack\r\n    mov     r11, fs:[rsp]\r\n    cmp     r11, [rsp]\r\n    jnz     fatal\r\n\r\n    ret\r\n<\/pre>\n<p>This is called a parallel shadow stack because the two stacks run parallel to each other.<\/p>\n<table style=\"border-collapse: collapse; text-align: center;\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td>CPU stack<\/td>\n<td style=\"width: 10ex;\">\u00a0<\/td>\n<td>shadow stack<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">\u22ee<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">\u22ee<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">retaddr1<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">retaddr1<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">\u00a0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">\u00a0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">retaddr2<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">retaddr2<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">\u00a0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local 
var<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">\u00a0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">\u00a0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">retaddr3<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">retaddr3<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td>&nbsp;<\/td>\n<td style=\"border: solid 1px currentcolor;\">\u00a0<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px currentcolor;\">local var<\/td>\n<td style=\"text-align: left;\">\u2190 rsp<\/td>\n<td style=\"border: solid 1px currentcolor;\">\u00a0<\/td>\n<td style=\"text-align: left;\">\u2190 fs:rsp<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Here&#8217;s a table of pros and cons:<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>\u00a0<\/th>\n<th>Compact<\/th>\n<th>Parallel<\/th>\n<\/tr>\n<tr>\n<td>Code size<\/td>\n<td>Larger<\/td>\n<td>Smaller<\/td>\n<\/tr>\n<tr>\n<td>Memory consumption<\/td>\n<td>Smaller<\/td>\n<td>Larger<\/td>\n<\/tr>\n<tr>\n<td>Register pressure<\/td>\n<td>Greater<\/td>\n<td>Smaller<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Although both the compact and parallel stacks require a new dedicated register, the compact stack takes the register from the general purpose registers, which makes it unavailable for code generation. 
The parallel stack uses a selector that would otherwise go unused.<\/p>\n<p>A significant problem with software-based return address protection on x86-64 is that <a title=\"The Evolution of CFI Attacks and Defenses\" href=\"https:\/\/github.com\/microsoft\/MSRC-Security-Research\/blob\/master\/presentations\/2018_02_OffensiveCon\/The%20Evolution%20of%20CFI%20Attacks%20and%20Defenses.pdf\"> the return address is passed from the caller to the callee via memory, which opens a race condition<\/a> (page 29) where an attacker can modify the return address in memory after it has been pushed by the <tt>call<\/tt> instruction but before it is loaded by the <tt>mov rax, [rsp]<\/tt> at the start of the called function. (This is not a problem for processors which use a link register to pass the return address.)<\/p>\n<p>Intel Control-flow Enforcement Technology (CET) implements a compact shadow stack in hardware using a dedicated register not visible to user mode. When active, <tt>call<\/tt> instructions automatically push return addresses on to the shadow stack, and <tt>ret<\/tt> instructions automatically pop and validate return addresses from the shadow stack. Performing the shadow store as part of the <tt>call<\/tt> instruction removes the race condition.<\/p>\n<p>Okay, that was a lot of stuff just to provide the required reading in anticipation of the real topic, which we&#8217;ll pick up next time.<\/p>\n<p><b>Bonus chatter<\/b>: Some versions of return address protection simply ignore the return address on the CPU stack and just use the value from the shadow stack. 
Corrupt the return address all you want; we don&#8217;t use it!<\/p>\n<pre>; compact shadow stack version\r\n\r\n    ; on function entry,\r\n    ; push return address onto shadow stack\r\n    mov     rax, [rsp]\r\n    mov     [r15-8], rax\r\n    lea     r15, [r15-8]\r\n\r\n    \u27e6 main function body goes here \u27e7\r\n\r\n    ; return to the address on the shadow stack\r\n    pop     r11             ; discard CPU stack\r\n    mov     r11, [r15]      ; fetch from shadow stack\r\n    lea     r15, [r15+8]    ; pop from shadow stack\r\n    jmp     r11             ; go to where the shadow stack tells us\r\n\r\n; parallel shadow stack version\r\n\r\n    ; on function entry,\r\n    ; copy return address to shadow stack\r\n    mov     rax, [rsp]\r\n    mov     fs:[rsp], rax\r\n\r\n    \u27e6 main function body goes here \u27e7\r\n\r\n    ; return to the address on the shadow stack\r\n    pop     r11             ; discard CPU stack (rsp advances by 8)\r\n    mov     r11, fs:[rsp-8] ; fetch from shadow stack\r\n    jmp     r11             ; go to where the shadow stack tells us\r\n<\/pre>\n<p>You could go even further and remove the return address from the CPU stack entirely, which saves an instruction and also permits a more compact encoding.<\/p>\n<pre>; compact shadow stack version\r\n\r\n    ; on function entry,\r\n    ; pop return address from CPU stack\r\n    ; and push to shadow stack\r\n    pop     rax\r\n    mov     [r15-8], rax\r\n    lea     r15, [r15-8]\r\n\r\n    \u27e6 main function body goes here \u27e7\r\n\r\n    ; return to the address on the shadow stack\r\n    mov     r11, [r15]      ; fetch from shadow stack\r\n    lea     r15, [r15+8]    ; pop from shadow stack\r\n    jmp     r11             ; go to where the shadow stack tells us\r\n<\/pre>\n<p><b>Exercise<\/b>: Why can&#8217;t we use the &#8220;transfer the return address to the shadow stack and remove it from the CPU stack&#8221; technique for parallel shadow stacks?<\/p>\n<p>This technique has multiple downsides. 
One is that it makes building stack traces much harder since you have to consult the shadow stack to figure out who the caller is. And the <tt>jmp<\/tt> instruction at the end unbalances the return address predictor. And this technique does not play well with CET: The shadow stack just grows and grows because no <tt>ret<\/tt> instruction is ever executed. And finally, this technique is not compatible with the Windows x86-64 ABI, which requires that return addresses be on the CPU stack.<\/p>\n<p><b>Answer to exercise<\/b>: You might think you could transfer the return address to the parallel shadow stack like this:<\/p>\n<pre>; parallel shadow stack version\r\n\r\n    ; on function entry,\r\n    ; pop return address from CPU stack\r\n    ; and copy to shadow stack\r\n    pop     rax\r\n    mov     fs:[rsp], rax\r\n\r\n    \u27e6 main function body goes here \u27e7\r\n\r\n    ; return to the address on the shadow stack\r\n    mov     r11, fs:[rsp]   ; fetch from shadow stack\r\n    jmp     r11             ; go to where the shadow stack tells us\r\n<\/pre>\n<p>However, this doesn&#8217;t work because it would mean that if your function consumes no stack space, then any function you call will overwrite your shadow stack entry with its own return address.<\/p>\n<p><b>Bonus bonus chatter<\/b>: Shadow stacks add another reason why Windows insists on allocating thread and fiber stacks rather than letting programs provide their own stack memory: A program-provided stack doesn&#8217;t have an associated shadow stack.<\/p>\n<p>(We learned another reason some time ago: <a title=\"The Itanium's so-called stack\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20050421-28\/?p=35833\"> The Itanium&#8217;s backing store stack<\/a>.)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Detecting attempts to manipulate the return 
address.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-110374","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Detecting attempts to manipulate the return address.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110374","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=110374"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/110374\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=110374"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=110374"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=110374"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}