{"id":107195,"date":"2022-09-19T07:00:00","date_gmt":"2022-09-19T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=107195"},"modified":"2022-09-19T07:10:44","modified_gmt":"2022-09-19T14:10:44","slug":"20220919-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220919-00\/?p=107195","title":{"rendered":"Why load fs:[0x18] into a register and then dereference that, instead of just going for fs:[n] directly?"},"content":{"rendered":"<p>For the purpose of this discussion, I&#8217;m going to assume you are familiar with the x86 32-bit and 64-bit instruction set architectures.<\/p>\n<p>In Windows on x86, a pointer to per-thread information is kept in the <code>fs<\/code> register (for x86-32) or the <code>gs<\/code> register (for x86-64). If you disassemble through the kernel, you&#8217;ll see that accesses to the per-thread information usually goes through two steps:<\/p>\n<pre>    mov     eax, dword ptr fs:[0x00000018]\r\n    mov     eax, dword ptr [eax+n]\r\n<\/pre>\n<p>Why do it this way when you could combine it into one?<\/p>\n<pre>    mov     eax, dword ptr fs:[n]\r\n<\/pre>\n<p>Access to the per-thread data is abstracted into the helper function <a href=\"https:\/\/docs.microsoft.com\/en-us\/windows\/win32\/api\/winnt\/nf-winnt-ntcurrentteb\"> <code>NtCurrentTeb<\/code><\/a>, and the definition of that function changes based on the processor architecture.<\/p>\n<pre>\/\/ x86-32\r\nreturn (struct _TEB *) (ULONG_PTR) __readfsdword (0x18);\r\n\r\n\/\/ x86-64\r\nreturn (struct _TEB *)__readgsqword(FIELD_OFFSET(NT_TIB, Self));\r\n<\/pre>\n<p>One option for optimizing access to per-thread data is to create a custom accessor for each thing you need.<\/p>\n<pre>inline DWORD GetLastErrorFromTEB()\r\n{\r\n#if defined(_M_IX86)\r\n    return __readfsdword(FIELD_OFFSET(TEB, LastErrorCode));\r\n#elif defined(_M_AMD64)\r\n    return __readgsdword(FIELD_OFFSET(TEB, LastErrorCode));\r\n#else\r\n    ... 
other architectures here ...\r\n#endif\r\n}\r\n<\/pre>\n<p>This gets rather unwieldy as the number of fields in the <code>TEB<\/code> increases, since each one needs its own custom accessor. So maybe you abstract it into a macro.<\/p>\n<pre>#if defined(_M_IX86)\r\n#define GetDwordFieldFromTEB(Field) \\\r\n    __readfsdword(FIELD_OFFSET(TEB, Field))\r\n#define GetPointerFieldFromTEB(Field) \\\r\n    __readfsdword(FIELD_OFFSET(TEB, Field))\r\n#elif defined(_M_AMD64)\r\n#define GetDwordFieldFromTEB(Field) \\\r\n    __readgsdword(FIELD_OFFSET(TEB, Field))\r\n#define GetPointerFieldFromTEB(Field) \\\r\n    __readgsqword(FIELD_OFFSET(TEB, Field))\r\n#else\r\n    ... other architectures here ...\r\n#endif\r\n\r\nDWORD GetLastError()\r\n{\r\n    return GetDwordFieldFromTEB(LastErrorCode);\r\n}\r\n\r\nPEB *GetProcessEnvironmentBlock()\r\n{\r\n    return (PEB *)GetPointerFieldFromTEB(Peb);\r\n}\r\n<\/pre>\n<p>Another option is to teach the compiler&#8217;s peephole optimizer that, if compiling in Windows ABI mode, it can fold the instruction sequence, provided the first <code>fs<\/code> access is to exactly the value <code>0x18<\/code>. This would be a very special compiler optimization that would kick in only for a limited audience. While it&#8217;s true that <a href=\"https:\/\/docs.microsoft.com\/en-us\/previous-versions\/technet-magazine\/cc742531(v=msdn.10)\"> the compiler team has been known to produce custom versions of the compiler for one-off situations<\/a>, the savings for this particular micro-optimization would be, if you&#8217;re lucky, a few thousand instructions. That&#8217;s a lot of work to save a few dozen kilobytes.<\/p>\n<p>Furthermore, switching over could end up being worse: The offset field of an absolute selector-relative load is a 32-bit field, whereas the offset when applied to a general-purpose register can be made as small as eight bits. 
There is typically a penalty for accessing memory through a segment register whose base is not zero.\u00b9 Furthermore, you have to pay for the prefix byte, and are probably taking the processor down an execution path that is not heavily optimized. If you are going to access per-thread data more than once in a function, you may very well be better off just caching the result in a general-purpose register so you could use the smaller (and probably more efficient) flat addressing mode instead.<\/p>\n<p>Even if all the calculations show that accessing the per-thread data from scratch is better than caching it in a general-purpose register (say, because the x86-32 is so register-starved that freeing up even one register is a big deal), you have to redo the cost\/benefit calculations for the other architectures.<\/p>\n<pre>\/\/ arm\r\nreturn (struct _TEB *)(ULONG_PTR)_MoveFromCoprocessor(CP15_TPIDRURW);\r\n\r\n\/\/ arm64\r\nreturn (struct _TEB *)__getReg(18); \/\/ register x18\r\n\r\n\/\/ alpha\r\nreturn (struct _TEB *)_rdteb(); \/\/ PAL instruction\r\n\r\n\/\/ ia64\r\nreturn (struct _TEB *)_rdtebex(); \/\/ register r13\r\n\r\n\/\/ MIPS\r\nreturn (struct _TEB *)((PCR *)0x7ffff000)-&gt;Teb;\r\n\r\n\/\/ PowerPC\r\nreturn (struct _TEB *)__gregister_get(13); \/\/ register r13\r\n<\/pre>\n<p>These architectures break down into two categories: Those for which the offset from the TEB register can be folded into the instruction, and those for which it cannot.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Architecture<\/th>\n<th>Can fold?<\/th>\n<th>Notes<\/th>\n<\/tr>\n<tr>\n<td>x86-32<\/td>\n<td>Yes<\/td>\n<td><code>mov eax, fs:[n]<\/code><\/td>\n<\/tr>\n<tr>\n<td>x86-64<\/td>\n<td>Yes<\/td>\n<td><code>mov rax, gs:[n]<\/code><\/td>\n<\/tr>\n<tr>\n<td>arm<\/td>\n<td>No<\/td>\n<td>Cannot use offset load from coprocessor 
register<\/td>\n<\/tr>\n<tr>\n<td>arm64<\/td>\n<td>Yes<\/td>\n<td><code>ldr x0, [x18, #n]<\/code><\/td>\n<\/tr>\n<tr>\n<td>alpha<\/td>\n<td>No<\/td>\n<td>Address returned by dedicated instruction<\/td>\n<\/tr>\n<tr>\n<td>ia64<\/td>\n<td>Yes<\/td>\n<td>Can calculate effective address directly from r13<\/td>\n<\/tr>\n<tr>\n<td>MIPS<\/td>\n<td>No<\/td>\n<td>No absolute addressing mode<\/td>\n<\/tr>\n<tr>\n<td>PowerPC<\/td>\n<td>Yes<\/td>\n<td><code>lwz r0, n(r13)<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For the architectures that don&#8217;t support folding, there&#8217;s no benefit to the optimization. And fetching the address from scratch is a pessimization on Alpha, since the call to get the address of per-thread data is a system call.<\/p>\n<p>In order to benefit from this optimization on processors where it&#8217;s helpful, without hurting the ones where it&#8217;s harmful, you may end up having to write two versions of every function: One which gets the per-thread pointer from scratch every time, and one which caches it. 
The compiler cannot do this transformation for you because it doesn&#8217;t know whether the per-thread data can safely be cached: If there is a fiber switch, then the per-thread data cannot be cached since the fiber might be resuming on a different thread!<\/p>\n<p>Given that this optimization may not actually be beneficial, and even if it is, the benefit is slight, and even that slight benefit comes at a cost to other architectures, you&#8217;re probably better off not bothering with this micro-optimization and just writing portable code.<\/p>\n<p>\u00b9 <a href=\"https:\/\/www.intel.com\/content\/dam\/doc\/manual\/64-ia-32-architectures-optimization-manual.pdf\"> <i>Intel 64 and IA-32 Architectures Optimization Reference Manual<\/i><\/a>, April 2012.<\/p>\n<ul>\n<li>Chapter 2.1 <i>Intel Microarchitecture Code Name Sandy Bridge<\/i>, Section 2.1.5.2 <i>L1 DCache<\/i>: &#8220;If segment base is not zero, load latency increases.&#8221;<\/li>\n<li>Chapter 13 <i>Intel Atom Microarchitecture and Software Optimization<\/i>, Section 13.3.3.3 <i>Segment Base<\/i>: &#8220;Non-zero segment base will cause load and store operations to experience a delay.&#8221;<\/li>\n<\/ul>\n<p>Fog, Agner. 
<a href=\"https:\/\/agner.org\/optimize\/microarchitecture.pdf\"> The microarchitecture of Intel, AMD, and VIA CPUs<\/a>, August 17, 2021.<\/p>\n<ul>\n<li>Chapter 19 <i>AMD K8 and K10 pipeline<\/i>: &#8220;The time it takes to calculate an address and read from that address in the level-1 cache is 3 clock cycles if the segment base is zero and 4 clock cycles if the segment base is nonzero, according to my measurements.&#8221;<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Simplifying the compiler.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-107195","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Simplifying the compiler.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/107195","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=107195"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/107195\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=107195"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsof
t.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=107195"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=107195"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}