April 7th, 2025

On priority inversion in the use of a spinlock to ensure atomic access to a shared_ptr

In my discussion of the internal implementation of std::atomic<std::shared_ptr<T>>, I noted that the use of a spinlock without a blocking fallback could result in a deadlock due to priority inversion. Commenter Anton Siluanov noted, “While priority inversion is a thing, from atomic I’d expect as quick as possible operation. And mutex impl can be done manually.”

It’s true that the spinlock is not held for long, but even the tiniest window will get hit, and probably sooner than you would like. My colleague Larry Osterman phrases this as “One in a million is next Tuesday.” James Hamilton (formerly of Microsoft, now at Amazon) described it more mundanely as “At scale, rare events aren’t rare.”

In this case, the race condition occurs if a higher priority thread tries to enter the spinlock while a lower priority thread holds it. And even though it’s a small race window by instruction count, it can actually be quite a long time if there is a poorly-timed context switch, and an even longer time if the control block has been paged out.

Thread 1 (low priority)                Thread 2 (high priority)
set lock bit
increment refcount in control block    danger zone
clear the lock bit
                                       set lock bit
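
To make the sequence above concrete, here is a minimal sketch of a pure spinlock that keeps its lock bit in the low bit of a pointer-sized word. It is only an illustration: `ControlBlock`, `SpinLockedSlot`, and `add_reference` are hypothetical names, and this is not the code of any real standard library.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a shared_ptr control block.
struct ControlBlock {
    std::atomic<long> refcount{1};
};
static_assert(alignof(ControlBlock) >= 2,
              "the low pointer bit must be free to serve as the lock bit");

// Hypothetical slot that stores a control block pointer plus a lock bit.
class SpinLockedSlot {
    static constexpr std::uintptr_t kLockBit = 1;
    std::atomic<std::uintptr_t> word_;

public:
    explicit SpinLockedSlot(ControlBlock* cb)
        : word_(reinterpret_cast<std::uintptr_t>(cb)) {}

    // "set lock bit": spin until this thread is the one that flipped it 0 -> 1.
    ControlBlock* lock() {
        for (;;) {
            std::uintptr_t old =
                word_.fetch_or(kLockBit, std::memory_order_acquire);
            if ((old & kLockBit) == 0) {
                return reinterpret_cast<ControlBlock*>(old);
            }
            // Pure spin with no blocking fallback: a high priority thread
            // stuck here never gives the lock holder a chance to run.
        }
    }

    // "clear the lock bit"
    void unlock() {
        std::uintptr_t old =
            word_.fetch_and(~kLockBit, std::memory_order_release);
        assert((old & kLockBit) != 0);  // we must have been holding the lock
        (void)old;
    }

    ControlBlock* add_reference() {
        ControlBlock* cb = lock();                              // set lock bit
        cb->refcount.fetch_add(1, std::memory_order_relaxed);   // danger zone
        unlock();                                               // clear the lock bit
        return cb;
    }
};
```

The danger zone is only a few instructions long, but a context switch or a page fault on the control block between `lock()` and `unlock()` stretches it to however long the thread takes to run again.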

If the high priority thread runs during the danger zone, and the system has no idle processors (say, because it’s a uniprocessor system, or the other processors are busy running medium-priority threads), then you have a priority inversion deadlock: the high priority thread spins waiting for the lock, and in doing so consumes the very CPU time that the low priority thread needs in order to release it.
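
One way to picture a blocking fallback is the sketch below, which spins a bounded number of times and then waits on a condition variable, so that a waiter stops consuming the processor. Everything here is an assumption made for illustration: the names `SpinThenBlockLock` and `hammer_counter` are hypothetical, the spin limit of 100 is arbitrary, and a production lock would more likely use an operating system primitive for the wait. Blocking frees the processor for the lock holder, but it does not by itself provide priority inheritance.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical spin-then-block lock. All atomics use the default
// sequentially consistent ordering to keep the wakeup logic simple.
class SpinThenBlockLock {
    static constexpr int kSpinLimit = 100;  // arbitrary, for illustration
    std::atomic<bool> locked_{false};
    std::atomic<int> waiters_{0};
    std::mutex mutex_;
    std::condition_variable cv_;

    bool try_lock_once() { return !locked_.exchange(true); }

public:
    void lock() {
        // Fast path: the lock is normally held only briefly, so spin a little.
        for (int i = 0; i < kSpinLimit; ++i) {
            if (try_lock_once()) return;
        }
        // Slow path: stop burning CPU and block until an unlocker wakes us.
        std::unique_lock<std::mutex> guard(mutex_);
        ++waiters_;
        cv_.wait(guard, [this] { return try_lock_once(); });
        --waiters_;
    }

    void unlock() {
        bool was_locked = locked_.exchange(false);
        assert(was_locked);
        (void)was_locked;
        // Pay for the mutex only when somebody is actually blocked.
        if (waiters_.load() > 0) {
            std::lock_guard<std::mutex> guard(mutex_);
            cv_.notify_one();
        }
    }
};

// Hypothetical helper: several threads bump a plain counter under the lock.
long hammer_counter(int thread_count, int iterations) {
    SpinThenBlockLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& thread : threads) thread.join();
    return counter;
}
```

Calling `hammer_counter(4, 10000)` should return 40000 if the lock provides mutual exclusion. The uncontended path is still a single atomic exchange, so the fallback costs little when nobody is waiting.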

Yes, it is a narrow race window, but narrow windows will get hit, and if they hit in just the wrong way, it will ruin somebody’s day.


Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

4 comments


  • Ian Boyd

    Here is an archived copy of the joke page that Larry's blog post linked to, the one his wife thought applied:

    https://web.archive.org/web/20040618021742/https://www.jumbojoke.com/000036.html

    Is 99.9% "good enough"? If so...

    - Two million documents will be lost by the IRS this year.
    - 811,000 faulty rolls of 35mm film will be loaded this year.
    - 22,000 checks will be deducted from the wrong bank accounts in the next 60 minutes.
    - 1,314 phone calls will be misplaced by telecommunication services every minute.
    - 12 babies will be given to the wrong parents each day.
    - 268,500 defective tires will be shipped this year. [Gee: with the Firestones on Explorers,...

  • Shawn Van Ness

    I’d love to hear some tips on how folks test code like this. During development, I sometimes sprinkle Sleep(rand() % 64) calls between each line. But I don’t leave those in place, in the final checkin.

    Right now I’m helping debug a rare race condition between 2 threads in an old late-90s game engine codebase, and I find myself doing exactly this kind of thing. (We made a nice new class to encapsulate the difficult locking and syncing.. but of course no meaningful unit-tests.)

    • Sander Saares (Microsoft employee)

      Rust has loom which can be useful for testing hand-crafted synchronization logic. It essentially substitutes built-in synchronization primitives with its own that allow it to manipulate the sequencing of operations behind the scenes to run your tests under all the possible permutations of events.

    • Mike Winterberg

      Application Verifier has “Concurrency Fuzzing” that isn’t exactly what you do because it only applies the random delays at existing synchronization points, but it may help.

      If the codebase is portable to non-Windows, there’s also Thread Sanitizer. Since it is in league with the compiler, it can presumably find more issues.

      https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/application-verifier-tests-within-application-verifier#cuzz

      I don’t think either will help much with “I’m writing library code that needs to survive first contact with the enemy (consumers that decide to muck about with priorities)” if you’re not already testing for that.