August 8th, 2014

The case of the orphaned LpdrLoaderLock critical section

A customer found that under heavy load, their application would occasionally hang. Debugging the hang shows that everybody was waiting for the Lpdr­Loader­Lock critical section. The catch was that the critical section was reported as locked, but the owner thread did not exist.

0:000> !critsec ntdll!LdrpLoaderLock
CritSec ntdll!LdrpLoaderLock+0 at 7758c0a0
WaiterWoken        No
LockCount          4
RecursionCount     2
OwningThread       150c
EntryCount         0
ContentionCount    4
*** Locked
0:000> ~~[150c]k
              ^ Illegal thread error in '~~[150c]k'

How can a critical section be owned by thread that no longer exists?

There are two ways this can happen. One is that there is a bug in the code that manages the critical section such that there is some code path that takes the critical section but forgets to release it. This is unlikely to be the case for the loader lock (since a lot of really smart people are in charge of the loader lock), but it’s a theoretical possibility. We’ll keep that one in our pocket.

Another possibility is that the code to exit the critical section never got a chance to run. For example, somebody may have thrown an exception across the stack frames which manage the critical section, or somebody may have tried to exit the thread from inside the critical section, or (gasp) somebody may have called Terminate­Thread to nuke the thread from orbit.

I suggested that the Terminate­Thread theory was a good one to start with, because even if it wasn’t the case, the investigation should go quickly because the light is better. You’re not so much interested in finding it as you are in ruling it out quickly.¹

The customer replied, “We had one call to Terminate­Thread in our application, and we removed it, but the problem still occurs. Is it worth also checking the source code of the DLLs our application links to?”

Okay, the fact that they found one at all means that there’s a good chance others are lurking.

Before we could say, “Yes, please continue your search,” the customer followed up. “We found a call to Terminate­Thread in a component provided by another company. The code created a worker thread, and decided to clean up the worker thread by terminating it. We commented out the call just as a test, and it seems to fix the problem. So at least now we know where the problem is and we can try to fix it properly.”

¹ In the medical profession, there’s the term ROMI, which stands for rule out myocardial infarction. It says that if a patient comes to you with anything that could possibly remotely be the symptom of a heart attack, like numbness in the arm or chest pain, you should perform a test to make sure. Even though the test is probably going to come back negative, you have to check just to be safe. Here, we’re ruling out Terminate­Thread, which I guess could go by the dorky acronym ROTT.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

0 comments

Discussion are closed.