This is a general problem, but I’m going to give a specific example.
C++ standard library collections do not support concurrent mutation. It is the caller’s responsibility to serialize mutation operations, and to ensure that no mutation occurs at the same time as any other operation. And the usual way of accomplishing this is to use a mutex of some kind.
class ThingManager { private: std::mutex things_lock_; std::vector<std::shared_ptr<Thing>> things_; public: void AddThing(std::shared_ptr<Thing> thing) { std::lock_guard guard(things_lock_); things_.push_back(std::move(thing)); } void RemoveThingById(int32_t id) { std::lock_guard guard(things_lock_); auto it = std::find_if(things_.begin(), things_.end(), [&](auto&& thing) { return thing->id() == id; }); if (it != things_.end()) { things_.erase(it); } } };
The idea here is that you give the ThingÂManager
a bunch of things, and then you can later remove them by providing the Thing
‘s ID. Presumably there are also methods to search for Thing
s or to perform some operation across all Thing
s, but those are just distractions from the exercise.
This particular object wants to support concurrent operation, so it internally uses a mutex to ensure safe operation.
Now, you can quibble about the use of find_if
instead of remove_if
, or using a std::vector
instead of a std::map
, but let’s set that aside.
The question is: What’s wrong with this code?
I sort of gave it away in the title: We are calling out to external code while holding our lock.
You probably know not to call out to external code when holding an internal lock, and the act of invoking a method on an object may remind you of that fact. But destructors just run by themselves. You don’t typically write code the trigger the destruction of an object. The object just destructs when it destructs.
Removing the shared_ptr<Thing>
from our vector could result in the Thing
‘s destruction if this was the last strong reference to the Thing
. And that destructor runs while the things_lock_
is still locked.
Now things get interesting.
You may not know all that happens inside the Thing
destructor. It may have been written by another team, or by you, six months ago. Or somebody could have derived from Thing
and given you a shared pointer to the derived object.¹ Or you might be given a shared pointer to a Thing
embedded inside a larger object.
Let’s do that thing with the derived type:
class SuperThing : Thing { private: ThingManager& manager_; int32_t helper_id_ = 0; public: SuperThing(ThingManager& manager) : manager_(manager) { auto helper = std::make_shared<Thing>(); helper_id_ = helper->id(); manager_.AddThing(helper); } ~SuperThing() { manager_.RemoveThingById(helper_id_); } };
The SuperÂThing
object is itself a Thing
, but it also uses a helper thing. At construction, it creates a helper thing and registers it with the manager, retaining the ID. And at destruction, it removes its helper thing.
And then this happens:
void test(ThingManager& manager) { auto s = std::make_shared<SuperThing>(); auto id = s->id(); manager.AddThing(s); s = nullptr; manager.RemoveThingById(id); }
This little test function creates a SuperÂThing
, adds it to the thing manager, and then immediately removes it.
The RemoveÂThingÂByÂId
function looks for a matching Id and finds it, so it erases the corresponding Thing
from the vector. That erasure destroys the shared_
ptr
, and since this is the last strong reference, the underlying Thing
is also destroyed.
This runs the destructor of our SuperÂThing
, which tries to remove its helper Thing
. And that calls back into the ThingManager
, which gets stuck trying to acquire a mutex that is already held (unwittingly, by itself).
This is not a purely theoretical exercise. This sort of thing happens, and it’s a source of bugs.
Next time, we’ll look at how to address these types of problems.
¹ If you work in Windows, a common scenario for this is that the shared_ptr
in the example above takes the form of a COM reference-counted pointer.
The mutex is irrelevant here; I've seen an example where object A releases a reference to object B, which causes code to run which tries to mutate object A, which is in an inconsistent state because it's in the middle of releasing its reference to object B. It's particularly fun when it tries to release its reference to object B again...
The solution is to transfer the reference to a temporary, so that by the time...
This is probably only a solution for the specific example and not the general problem, but one thing you could do is this:
<code>
Making a local copy of the shared_ptr before erasing it postpones the destructor call.
I can has recursive mutex?
This is a perfectly serious comment. I’m accustomed to recursive mutexes being the natural solution to this.
Recursive mutexes actually have similar problems. While holding the lock you typically also assume some pre- and post-conditions, but if in the middle of such a function you call another function that requires the lock those conditions may not be met.
You might also have an iterator for example, that is invalidated when the function you call also modifies the same vector.
Finally when you have more than one mutex in your program recursive mutexes make it...
Yup, I’ve run into this very problem many times of “Thing tries to remove itself from the list of Things managed by the ThingManager in its destructor”. IIRC I don’t think I’ve hit this specific issue of deadlock when trying to lock a mutex, but I’ve definitely run into other sorts of reentrancy problems. Care is very much needed here.
(I won’t spoil the answer and steal Raymond’s thunder from tomorrow.)
Minor issue, in
I believed you meant to write