September 12th, 2005

Understanding the consequences of WAIT_ABANDONED

One of the important distinctions between mutexes and the other synchronization objects is that mutexes have owners. If the thread that owns a mutex exits without releasing the mutex, the mutex is automatically released on the thread’s behalf.

But if this happens, you’re in big trouble.

One thing many people gloss over is the WAIT_ABANDONED return value from the synchronization functions such as WaitForSingleObject. They typically treat this as a successful wait, because it does mean that the object was obtained, but it also tells you that the previous owner left the mutex abandoned and that the system had to release it on the owner’s behalf.

Why are you in big trouble when this happens?

Presumably, you created that mutex to protect multiple threads from accessing a shared object while it is an unstable state. Code enters the mutex, then starts manipulating the object, temporarily making it unstable, but eventually restabilizing it and then releasing the mutex so that the next person can access the object.

For example, you might have code that manages an anchored doubly-linked list in shared memory that goes like this:

void MyClass::ReverseList()
{
 WaitForSingleObject(hMutex, INFINITE);
 int i = 0; // anchor
 do {
  int next = m_items[i].m_next;
  m_items[i].m_next = m_items[i].m_prev;
  m_items[i].m_prev = next;
  i = next;
 } while (i != 0);
 ReleaseMutex(hMutex);
}

There is nothing particularly exciting going on. Basic stuff, right?

But what if the program crashes while holding the mutex? (If you believe that your programs are bug-free, consider the possiblity that the program is running over the network and the network goes down, leading to an in-page exception. Or simply that the user went to Task Manager and terminated your program while this function is running.)

In that case, the mutex is automatically released by the operating system, leaving the linked list in a corrupted state. The next program to claim the mutex will receive WAIT_ABANDONED as the status code. If you ignore that status code, you end up operating on a corrupted linked list. Depending on how that linked list is used, it may result in a resource leak or the system creating an unintended second copy of something, or perhaps even a crash. The unfortunate demise of one program causes other programs to start behaving strangely.

Then again, the question remains, “What do you do, then, if you get WAIT_ABANDONED?” The answer is, “Good question.”

You might try to repair the corruption, if you keep enough auxiliary information around to recover a consistent state. You might even design your data structures to be transactional, so that the death of a thread manipulating the data structures does not leave them in a corrupted state. Or you might just decide that since things are corrupted, you should throw away everything and start over, losing the state of work in progress, but at least allowing new work to proceed unhindered.

Or you might simply choose to ignore the error and continue onwards with a corrupt data structure, hoping that whatever went wrong won’t result in cascade failures down the line. This is what most people do, though usually without even being aware that they’re doing it. And it’s really hard to debug the crashes that result from this approach.

Exercise: Why did we use indices instead of pointers in our linked list data structure?

[Raymond is currently away; this message was pre-recorded.]

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

0 comments

Discussion are closed.