The case of the application that used thread local storage it never allocated

Raymond Chen

An application compatibility issue was found with a program that crashed because its Thread Local Storage (TLS) slots were being corrupted.

Upon closer inspection, the real problem was not that the application’s TLS was being corrupted. The problem was that the application was using TLS slots it never allocated, so it was inadvertently using somebody else’s TLS slots as its own. And of course, when the true owner updated the TLS value, the application interpreted that as corruption.

It’s like just parking your car in a reserved space that belongs to someone else, and then when that other person parks in their own space, you complain, “Hey, what are you doing in my parking space?”

The program wants to allocate 38 TLS slots, and some reverse-compiling reveals that it does so like this:

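DWORD g_firstTls;

bool AllocateContiguousTlsSlots()
{
    // The first slot establishes the base index.
    g_firstTls = TlsAlloc();

    // Each of the remaining 37 slots must directly follow the previous one.
    for (int i = 1; i < 38; i++) {
        DWORD tls = TlsAlloc();
        if (tls != g_firstTls + i) return false;
    }
    return true;
}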

The program calls TlsAlloc() 38 times, and requires that the values it receives are all consecutive. If any non-consecutive value is received, then it declares that TLS slot allocation has failed.

The problem is that the code that calls AllocateContiguousTlsSlots() doesn’t check the return value. It assumes that the allocations all succeeded!
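For contrast, here is a minimal sketch of an allocation routine that does not depend on contiguous indices and that makes its result hard to ignore. It is not what the program did. The names `AllocateSlots`, `SlotAllocFn`, `SlotFreeFn`, `kSlotCount`, and `kOutOfIndexes` are hypothetical; the function pointers stand in for TlsAlloc and TlsFree so that the sketch stays portable, and `kOutOfIndexes` mirrors the value of TLS_OUT_OF_INDEXES.

```cpp
#include <array>
#include <cstdint>

constexpr int kSlotCount = 38;                       // the program wanted 38 slots
constexpr std::uint32_t kOutOfIndexes = 0xFFFFFFFF;  // same value as TLS_OUT_OF_INDEXES

// Hypothetical stand-ins for TlsAlloc and TlsFree.
using SlotAllocFn = std::uint32_t (*)();
using SlotFreeFn = void (*)(std::uint32_t);

// Records each index individually, so gaps between indices are harmless.
// [[nodiscard]] asks the compiler to warn when a caller drops the result.
[[nodiscard]] bool AllocateSlots(std::array<std::uint32_t, kSlotCount>& slots,
                                 SlotAllocFn alloc, SlotFreeFn release)
{
    for (int i = 0; i < kSlotCount; i++) {
        slots[i] = alloc();
        if (slots[i] == kOutOfIndexes) {
            // Roll back so that a failed call leaves nothing allocated.
            while (i-- > 0) release(slots[i]);
            return false;
        }
    }
    return true;
}
```

A caller would keep the array and pass `slots[n]` wherever it previously computed `g_firstTls + n`. The attribute only produces a warning, so the caller still has to act on a `false` result.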

What started happening is that the program allocated its first TLS slot and got 9. Then 10, then 11, and so on up to 15. But when it called TlsAlloc() again, the slot index it got back was 17, not the expected value of 16. This caused AllocateContiguousTlsSlots() to fail, but since the program never checked whether AllocateContiguousTlsSlots() succeeded, it just assumed that it had ownership of slots 9 through 46, even though it had only allocated 9 through 15, and then 17. Slots 16 and 18 through 46 were never allocated, but the program used them anyway.

Why did this start happening all of a sudden?

Because I made a performance optimization that reduced the memory usage of the system by 8KB per thread.

Over the decades of its existence, the main DLL used by the shell, shell32.dll, accumulated quite a few features, and some of those features require TLS slots, so they allocate the TLS slots at initialization. It got to the point where a program that used shell32.dll and other frequently-used DLLs would end up allocating a total of over 64 TLS slots during initialization.

The value of 64 is important, because once the 65th TLS slot is allocated, the kernel goes into “overflow” mode and allocates an extra 1024 TLS slots for each thread, bringing the total to the magic number of 1088. On a 32-bit system, 1024 TLS slots occupy 4KB of memory. On a 64-bit system, it takes 8KB of memory. The result was that nearly every application that used shell32.dll (which in practice is nearly every application) was paying an extra 8KB of memory per thread.
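The arithmetic behind those figures can be checked at compile time. This is only a sketch of the numbers quoted above, assuming each TLS slot holds one pointer-sized value; `kInlineSlots`, `kOverflowSlots`, and `OverflowBytes` are hypothetical names.

```cpp
#include <cstddef>

constexpr std::size_t kInlineSlots = 64;      // slots available before overflow mode
constexpr std::size_t kOverflowSlots = 1024;  // extra slots added in overflow mode

// Bytes needed per thread for the overflow slots, given the pointer size.
constexpr std::size_t OverflowBytes(std::size_t pointerSize)
{
    return kOverflowSlots * pointerSize;
}

static_assert(kInlineSlots + kOverflowSlots == 1088, "total slot count");
static_assert(OverflowBytes(4) == 4 * 1024, "4KB per thread with 32-bit pointers");
static_assert(OverflowBytes(8) == 8 * 1024, "8KB per thread with 64-bit pointers");
```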

But it’s rare that a single program exercises every single feature of shell32.dll.

I changed the code so that instead of pre-allocating the TLS slots, the various components allocated their TLS slots on demand when the component was used for the first time. Depending on the application, this resulted in a typical savings of 13 to 19 TLS slots, bringing the total number of allocated TLS slots below the magic number of 64.
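The on-demand pattern can be sketched roughly as follows. This is an illustration of the general idea, not the code in shell32.dll: `FakeTlsAlloc`, `GetFeatureSlot`, and `g_allocCalls` are hypothetical, and the fake allocator stands in for TlsAlloc so that the sketch is portable.

```cpp
#include <atomic>
#include <cstdint>

// Counts calls to the fake allocator so the effect is observable.
std::atomic<int> g_allocCalls{0};

// Hypothetical stand-in for TlsAlloc; returns a distinct index per call.
std::uint32_t FakeTlsAlloc()
{
    return 100 + static_cast<std::uint32_t>(g_allocCalls.fetch_add(1));
}

// The slot is allocated the first time the feature asks for it.
// A function-local static is initialized exactly once, even if several
// threads reach this line at the same time (guaranteed since C++11).
std::uint32_t GetFeatureSlot()
{
    static const std::uint32_t slot = FakeTlsAlloc();
    return slot;
}
```

A process that never calls `GetFeatureSlot()` never consumes a slot for that feature, which is where the savings come from.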

Hooray, I saved 8KB of memory per thread in nearly every process in the system. It also meant that certain buggy programs would not crash when they allocated a TLS slot and got an index larger than 64.

In the case of the application that was having compatibility problems, the reason they couldn’t get their 38 contiguous TLS slots was that I was too good at reducing the number of TLS slots used during initialization. With the specific mix of DLLs this particular application used, the number of allocated TLS slots was only 10. The in-use slots were 0 through 8, and 16.

Slot    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
In use  x  x  x  x  x  x  x  x  x  .  .  .  .  .  .  .  x  .  .  .
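To see how that layout defeats the program, here is a toy simulation. It is not the real kernel allocator: `ToySlotTable`, `SimulateProgram`, and `MakeArticleTable` are hypothetical, and the lowest-free-index policy is an assumption that happens to match the sequence of indices the program received.

```cpp
#include <bitset>
#include <cstdint>

// Toy model of a slot table. Assumption: lowest free index wins.
class ToySlotTable {
public:
    void MarkInUse(std::uint32_t index) { used_.set(index); }

    std::uint32_t Alloc()
    {
        for (std::uint32_t i = 0; i < used_.size(); i++) {
            if (!used_.test(i)) { used_.set(i); return i; }
        }
        return 0xFFFFFFFF;  // table full
    }

private:
    std::bitset<64> used_;
};

struct Result {
    bool contiguous;         // what AllocateContiguousTlsSlots() would return
    std::uint32_t first;     // index returned by the first allocation
    std::uint32_t offender;  // first index that broke the sequence, if any
};

// Replays the program's allocation logic against the toy table.
Result SimulateProgram(ToySlotTable& table)
{
    Result r{true, table.Alloc(), 0};
    for (std::uint32_t i = 1; i < 38; i++) {
        std::uint32_t tls = table.Alloc();
        if (tls != r.first + i) {
            r.contiguous = false;
            r.offender = tls;
            break;
        }
    }
    return r;
}

// Builds the starting state: slots 0 through 8 and slot 16 in use.
ToySlotTable MakeArticleTable()
{
    ToySlotTable table;
    for (std::uint32_t i = 0; i <= 8; i++) table.MarkInUse(i);
    table.MarkInUse(16);
    return table;
}
```

Running `SimulateProgram` on the table from `MakeArticleTable()` starts at slot 9 and stops when the eighth allocation returns 17 instead of 16.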

What’s up with that weirdo TLS slot out at index 16?

It turns out that one particular TLS slot has to be fixed at index 16 for a different compatibility reason. I don’t know the details, but I accepted that it was locked to index 16.

The compatibility team wrote a shim that kicks in when you run the program that was making untoward assumptions about TLS slot numbers. If the program calls TlsAlloc() and gets an index less than 16, then the compatibility shim loops back and calls TlsAlloc() again until it gets a value larger than 16. That way, when this program allocates TLS slots, it starts at slot 17. Early in the initialization of the program, all of the slots from 17 onward are available, so its 38 consecutive calls get consecutive TLS indices 17 through 54.
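The shim's behavior can be sketched against a toy slot table. This is a guess at the shape of the logic, not the actual shim: `ToySlotTable`, `ShimmedAlloc`, and `AllocateThroughShim` are hypothetical, the lowest-free-index policy is an assumption, and so is the choice to leave the rejected slots allocated (under that policy, freeing them would just return the same index again).

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Toy model of a slot table. Assumption: lowest free index wins.
class ToySlotTable {
public:
    void MarkInUse(std::uint32_t index) { used_.set(index); }

    std::uint32_t Alloc()
    {
        for (std::uint32_t i = 0; i < used_.size(); i++) {
            if (!used_.test(i)) { used_.set(i); return i; }
        }
        return 0xFFFFFFFF;  // table full
    }

private:
    std::bitset<64> used_;
};

// Keeps allocating until the index is above 16. Rejected slots stay
// allocated (an assumption), so the next attempt moves forward.
std::uint32_t ShimmedAlloc(ToySlotTable& table)
{
    std::uint32_t index;
    do {
        index = table.Alloc();
    } while (index <= 16);
    return index;
}

// Performs `count` allocations through the shim and returns the indices.
std::vector<std::uint32_t> AllocateThroughShim(ToySlotTable& table, int count)
{
    std::vector<std::uint32_t> indices;
    for (int i = 0; i < count; i++) indices.push_back(ShimmedAlloc(table));
    return indices;
}
```

Starting from a table with slots 0 through 8 and slot 16 in use, the first shimmed call burns through 9 to 15 and returns 17, and the remaining 37 calls continue from 18 up to 54.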

The AllocateContiguousTlsSlots() function succeeds, and the program doesn’t crash.

Another day in the life of application compatibility.



  • Sacha Roscoe 2

    The app compatibility stories are my favourite content on this site. Not checking for failure is of course one of the classic errors, and I can easily imagine that they would not have encountered this sort of situation in testing – as the article points out, most programs had over 64 slots in use before the optimisation (though would App Verifier have helped? I wouldn’t be surprised if it does things like non-contiguous slot allocation, and allocations higher than 64 even when lower slots are free).

    Applause and thanks to the app compat team for allowing us to run our software on ever newer versions of Windows! (For the most part anyway.)

  • Joshua Hudson 0

    You know, I feel for these guys. Assuming that if you allocate a bunch of TLS slots at the top of main that they’re all in sequence is a much more reasonable assumption than a random object is in TLS slot 16 already while the first TLS slot to be allocated is in slot 8.

    Also, whatever is forcing that TLS slot into 16 is really dodgy. I delay-load shell32.dll. What’s that gonna do if 16 is already taken?

    • Ian Boyd 0

      > Assuming that if you allocate a bunch of TLS slots at the top of main that they’re all in sequence is a much more reasonable assumption than a random object is in TLS slot 16 already while the first TLS slot to be allocated is in slot 8

      Ideally the kernel would hand you back a random 32-bit “handle”, rather than a “slot”.

      That way it is made clear to people that there is no sequence, or ordering, or range, and just a bunch of slots.

      I’m sure this is the sort of thing the App Verifier could do

      – convert the slot index to a random 32-bit value
      – hand *you* the random 32-bit value
      – and intercept your requests for that “slot”, rewriting it to the “real” slot

      TlsAlloc’s only guarantee of the return value is that it is a **DWORD** that you then later pass to TlsGetValue, TlsSetValue, and TlsFree. And anyone trying to do anything fancier with the return value than treating it as an opaque cookie gets a swift kick in the pants.

      • Richard Thompson 0

        I presume this was attempting to use the slots to contain an entire structure as an optimisation to avoid the pointer deref. As it was previously always in the 1024 separate block, that always worked.

        As Microsoft did a compatibility fix I presume it’s inside some fairly commonly-used framework.

        Of course, it will still break horribly if it ever spans the 64 and 1024 blocks, as the underlying problem of failing to check if it worked remains.
        [[nodiscard]] (and Rust’s #[must_use]) helps with that, but of course, the compiler cannot ever check whether you did something useful with the result…

    • Raymond Chen (Microsoft employee, Author) 0

      It’s not shell32 that reserved slot 16. It’s the kernel.

  • Motti Lanzkron 0

    Am I being particularly obtuse or does this function attempt to allocate 37 rather than 38 TLS slots?

    • Hubert Bartwinski 1

      The for loop does indeed call TlsAlloc 37 times. But there is an additional call to TlsAlloc before the for loop at the beginning of the function.

      g_firstTls = TlsAlloc();

  • John Wiltshire 0

    Is there a good reason to use so many slots? I’ve always assumed they were a limited system resource and just used one to point to a structure I can manage myself.
