For the purpose of this discussion, I’m going to assume you are familiar with the x86 32-bit and 64-bit instruction set architectures.
In Windows on x86, a pointer to per-thread information is kept in the fs
register (for x86-32) or the gs
register (for x86-64). If you disassemble through the kernel, you’ll see that accesses to the per-thread information usually goes through two steps:
mov eax, dword ptr fs:[0x00000018] mov eax, dword ptr [eax+n]
Why do it this way when you could combine it into one?
mov eax, dword ptr fs:[n]
Access to the per-thread data is abstracted into the helper function NtCurrentTeb
, and the definition of that function changes based on the processor architecture.
// x86-32 return (struct _TEB *) (ULONG_PTR) __readfsdword (0x18); // x86-64 return (struct _TEB *)__readgsqword(FIELD_OFFSET(NT_TIB, Self));
One option for optimizing access to per-thread data is to create a custom accessor for each thing you need.
inline DWORD GetLastErrorFromTEB() { #if defined(_M_IX86) return __readfsdword(FIELD_OFFSET(TEB, LastErrorCode)); #elif defined(_M_AMD64) return __readgsdword(FIELD_OFFSET(TEB, LastErrorCode)); #else ... other architectures here ... #endif }
This gets rather unwieldy as the number of fields in the TEB
increases, since each one needs its own custom accessor. So maybe you abstract it into a macro.
#if defined(_M_IX86) #define GetDwordFieldFromTEB(Field) \ __readfsdword(FIELD_OFFSET(TEB, Field)) #define GetPointerFieldFromTEB(Field) \ __readfsdword(FIELD_OFFSET(TEB, Field)) #elif defined(_M_AMD64) #define GetDwordFieldFromTEB(Field) \ __readgsdword(FIELD_OFFSET(TEB, Field)) #define GetPointerFieldFromTEB(Field) \ __readgsqword(FIELD_OFFSET(TEB, Field)) #else ... other architectures here ... #endif DWORD GetLastError() { return GetDwordFieldFromTEB(LastErrorCode); } PEB GetProcessEnvironmentBlock() { return (PEB)GetPointerFieldFromTEB(Peb); }
Another option is to teach the compiler’s peephole optimizer that, if compiling in Windows ABI mode, it can fold the instruction sequence, provided the first fs
access is to exactly the value 0x18
. This would be a very special compiler optimization that would kick in only for a limited audience. While it’s true that the compiler team has been known to produce custom versions of the compiler for one-off situations, the savings for this particular micro-optimization would be, if you’re lucky, a few thousands of instructions. That’s a lot of work to save a few dozen kilobytes.
Furthermore, switching over could end up being worse: The offset field of an absolute selector-relative load is a 32-bit field, whereas the offset when applied to a general-purpose register can be made as small as eight bits. There is typically a penalty for accessing memory through a segment register whose base is not zero.¹ Furthermore, you have to pay for the prefix byte, and are probably taking the processor down an execution path that is not heavily optimized. If you are going to access per-thread data more than once in a function, you may very well have been better off just caching the result in a general-purpose register so you could use the smaller (and probably more efficient) flat addressing mode instead.
Even if all the calculations show that accessing the per-thread data from scratch is better than caching it in a general-purpose register (say, because the x86-32 is so register-starved that freeing up even one register is a big deal), you have to redo the cost/benefit calculations for the other architectures.
// arm return (struct _TEB *)(ULONG_PTR)_MoveFromCoprocessor(CP15_TPIDRURW); // arm64 return (struct _TEB *)__getReg(18); // register x18 // alpha return (struct _TEB *)_rdteb(); // PAL instruction // ia64 return (struct _TEB *)_rdtebex(); // register r13 // MIPS return (struct _TEB *)((PCR *)0x7ffff000)->Teb; // PowerPC return (struct _TEB *)__gregister_get(13); // register r13
These architectures break down into two categories: Those for which the offset from the TEB register can be folded into the instruction, and those for which it cannot.
Architecture | Can fold? | Notes |
---|---|---|
x86-32 | Yes | mov eax, fs:[n] |
x86-64 | Yes | mov rax, gs:[n] |
arm | No | Cannot use offset load from coprocessor register |
arm64 | Yes | ldr x0, [x18, #n] |
alpha | No | Address returned by dedicated instruction |
ia64 | Yes | Can calculate effective address directly from r13 |
MIPS | No | No absolute addressing mode |
PowerPC | Yes | lwz r0, n(r13) |
For the architectures that don’t support folding, there’s no benefit to the optimization. And fetching the address from scratch is a pessimization on Alpha, since the call to get the address of per-thread data is a system call.
In order to benefit from this optimization on processors where it’s helpful, without hurting the ones where it’s harmful, you may end up having to write two versions of every function: One which gets the per-thread pointer from scratch every time, and one which caches it. The compiler cannot do this transformation for you because it doesn’t know whether the per-thread data can safely be cached: If there is a fiber switch, then the per-thread data cannot be cached since the fiber might be resuming on a different thread!
Given that this optimization may not actually be beneficial, and even if it is, the benefit is slight, and even that slight benefit comes at a cost to other architectures, you’re probably better off not bothering with this micro-optimization and just writing portable code.
¹ Intel 64 and IA-32 Architectures Optimization Reference Manual, April 2012.
- Chapter 2.1 Intel Microarchitecture Code Name Sandy Bridge, Section 2.1.5.2 L1 DCache: “If segment base is not zero, load latency increases.”
- Chapter 13 Intel Atom Microarchitecture and Software Optimization, Section 13.3.3.3 Segment Base: “Non-zero segment base will cause load and store operations to experience a delay.”
Fog, Agner. The microarchitecture of Intel, AMD, and VIA CPUs, August 17, 2021.
- Chapter 19 AMD K8 and K10 pipeline: “The time it takes to calculate an address and read from that address in the level-1 cache is 3 clock cycles if the segment base is zero and 4 clock cycles if the segment base is nonzero, according to my measurements.”
Nice, another reference on how to get the TEB for all architectures where there's a public nt port using msvc intrinsics.
I did essentially the same thing some years ago (except for the PEB, so did indeed take advantage of such a manual optimisation, for x86 and amd64): https://gist.github.com/Wack0/849348f9d4f3a73dac864a556e9372a5
It's missing E2; although I know where the TEB pointer is on that architecture from an architectural point of view, I don't know the MSVC intrinsic for it.
IIRC,...
The two MOV instructions are 9 bytes, while one MOV eax, fs:[0x34] (the last error number) is 6 bytes. So it's definitely shorter.
I don't understand the argument about the non-zero segment base. Don't we have a non-zero base in mov eax, dword ptr fs:[0x00000018] ?
We have to access it via fs: in both versions of the code, so why adding the second instruction would help with the performance?
This would...
I think you repeated what I said in the paragraph (read to the end).
The TEB might grow but not every field is as important. Imagine how many billions of times Set/GetLastError is called every day around the world. Any speed/power optimizations surely would have been worth it for someone to assemble custom versions for x86…
Not sure why my brain latched on to this but I don't think:
<code>
is equivalent to
<code>
at all.
In the first moves the value at into then it defreferences that via which is the critical difference. So is not equivalent because that loads the value at which may or may not be the same as what is being searched for. The point being there is an indirection here...
0x18 is the offset of _NT_TIB.Self on x86 – i.e. the linear address of the TIB, which coincides with the address of the TEB since the TIB is inlined as the first field of the TEB. And the TEB starts at the base of fs. In other words on x86 Windows *(fsDescriptorBase+0x18) == fsDescriptorBase. Therefore *(fsDescriptorBase+n) == *(*(fsDescriptorBase+0x18)+n)
OUCH: why haven't you looked at the TEB alias NT_TIB first?
dword ptr fs:[0x18] alias NT_TIB.self holds the linear address of the TEB
dword ptr fs:[44] and qword ptr gs:[88] alias TEB.ThreadLocalStoragePointer hold the address of an extraneous memory block, the _tls_array, which holds the addresses of the static TLS blocks.
OTOH accesses of TEB.TlsSlots[64], which holds the dynamic TLS data, could be folded.
> OUCH: why haven’t you looked at the TEB alias NT_TIB first?
> dword ptr fs:[0x18] alias NT_TIB.self holds the linear address of the TEB
I find this incredibly rude, the article doesn't indicate anywhere this might be the case and a reader would have no knowledge of this. I did not and I'm also not a kernel mode developer. So assuming that knowledge should be present is extremely rude and presumptuous. The article should be...
AGAIN: you were to lazy to take a look at the definition of NT_TIB provided in WINNT.H!
“I’m also not a kernel mode developer”
But a naughty and apparently clueless kid.
OUCH²: NT_TIB alias TEB are user-mode components. The kernel-mode component is known as ETHREAD
The fold takes advantage of the Windows implementation detail that
fs:[0x18]
contains the base address of thefs
register.Ah ok, that’s rather critical to call out, because the post is written as if it was replacing an
lea
and not amov
.