On priority inversion in the use of a spinlock to ensure atomic access to a `shared_ptr`

Raymond Chen

In my discussion of the internal implementation of `std::atomic<std::shared_ptr<T>>`, I noted that the use of a spinlock without a blocking fallback could result in a deadlock due to priority inversion. Commenter Anton Siluanov noted, “While priority inversion is a thing, from atomic I’d expect as quick as possible operation. And mutex impl can be done manually.”

It’s true that the spinlock is not held for long, but even the tiniest window will get hit, and probably sooner than you would like. My colleague Larry Osterman phrases this as “One in a million is next Tuesday.” James Hamilton (formerly of Microsoft, now at Amazon) described it more mundanely as “At scale, rare events aren’t rare.”

In this case, the race condition occurs if a higher priority thread tries to enter the spinlock while a lower priority thread holds it. And even though it’s a small race window by instruction count, it can actually be quite a long time if there is a poorly-timed context switch, and an even longer time if the control block has been paged out.

| Thread 1 (low priority) | Thread 2 (high priority) |
|---|---|
| set lock bit | |
| increment refcount in control block | danger zone |
| clear the lock bit | |
| | set lock bit |

If the high priority thread tries to take the lock during the danger zone, and the system has no idle processors (say, because it’s a uniprocessor system, or the other processors are busy running medium-priority threads), then you have a priority inversion deadlock: the high priority thread spins forever, consuming the CPU time that the low priority thread needs in order to release the lock.

Yes, it is a narrow race window, but narrow windows will get hit, and if they hit in just the wrong way, it will ruin somebody’s day.

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

4 comments


Ian Boyd April 9, 2025

The archive of the link of Larry’s blog post to a joke page his wife thought applied:

https://web.archive.org/web/20040618021742/https://www.jumbojoke.com/000036.html

Is 99.9% “good enough”? If so…

– Two million documents will be lost by the IRS this year.
– 811,000 faulty rolls of 35mm film will be loaded this year.
– 22,000 checks will be deducted from the wrong bank accounts in the next 60 minutes.
– 1,314 phone calls will be misplaced by telecommunication services every minute.
– 12 babies will be given to the wrong parents each day.
– 268,500 defective tires will be shipped this year. [Gee: with the Firestones on Explorers, it looks like this one came true!]
– 14,208 defective personal computers will be shipped this year. [Well, yeah….]
– 103,260 income tax returns will be processed incorrectly this year. [ditto!]
– 2,488,200 books will be shipped with the wrong cover in the next 12 months.
– 132,412,800 cans of soft drinks produced in the next 12 months will be flatter than one of the 268,500 defective tires.
– Two plane landings daily at O’Hare International Airport will be unsafe.
– 3,056 copies of tomorrow’s Wall Street Journal will be missing one of the three sections.
– 18,322 pieces of mail will be mishandled in the next hour.
– 291 pacemaker operations will be performed incorrectly this year.
– 880,000 credit cards in circulation will turn out to have incorrect cardholder information on their magnetic strips.
– $9,690 will be spent every day on defective, often unsafe sporting equipment.
– 55 malfunctioning automatic teller machines will be installed in the next 12 months.
– 20,000 incorrect drug prescriptions will be written in the next 12 months.
– 114,500 mismatched pairs of shoes will be shipped this year.
– $761,900 will be spent on tapes and CDs that won’t play.
– 107 incorrect medical procedures will be performed each day.
– 315 entries in Webster’s Third New International Dictionary of English Language will be misspelled.

And you thought 99.9% was good enough!!

Shawn Van Ness April 8, 2025

I’d love to hear some tips on how folks test code like this. During development, I sometimes sprinkle Sleep(rand() % 64) calls between each line. But I don’t leave those in place, in the final checkin.

Right now I’m helping debug a rare race condition between 2 threads in an old late-90s game engine codebase, and I find myself doing exactly this kind of thing. (We made a nice new class to encapsulate the difficult locking and syncing.. but of course no meaningful unit-tests.)
- Sander Saares April 8, 2025 · Edited
  
  Rust has loom which can be useful for testing hand-crafted synchronization logic. It essentially substitutes built-in synchronization primitives with its own that allow it to manipulate the sequencing of operations behind the scenes to run your tests under all the possible permutations of events.
- Mike Winterberg April 8, 2025
  
  Application Verifier has “Concurrency Fuzzing” that isn’t exactly what you do because it only applies the random delays at existing synchronization points, but it may help.
  
  If the codebase is portable to non-Windows, there’s also Thread Sanitizer. Since it is in league with the compiler, it can presumably find more issues.
  
  https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/application-verifier-tests-within-application-verifier#cuzz
  
  I don’t think either will help much with “I’m writing library code that needs to survive first contact with the enemy (consumers that decide to muck about with priorities)” if you’re not already testing for that.
