When managing reference counts, there is an asymmetry between incrementing and decrementing: Incrementing the reference count can use relaxed semantics, but decrementing requires release semantics (and destroying requires acquire semantics).
The asymmetry may strike you as odd, but maybe it shouldn’t. After all, it’s not surprising that it’s easier to pull your toys out than to put them away.
Incrementing a reference count can be done with relaxed semantics (no memory ordering with respect to other memory locations) because the object is not at risk of being destroyed, and any memory operations that occur after the increment may as well have occurred before the increment. Incrementing a reference count doesn’t really impose any ordering requirements on memory accesses to the object.
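As a concrete sketch of the increment side (hypothetical type and member names, using plain std::atomic rather than any particular library's implementation):

```cpp
#include <atomic>
#include <cstdint>

struct RefCounted
{
    std::atomic<uint32_t> m_references{ 1 };

    void AddRef() noexcept
    {
        // Relaxed is sufficient: the caller already holds a reference,
        // so the object cannot be destroyed out from under us, and the
        // increment doesn't need to publish or consume any other writes.
        m_references.fetch_add(1, std::memory_order_relaxed);
    }
};
```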
Decrementing a reference count is a different story.
The danger with decrementing a reference count is that the object is destructed when the reference count goes to zero. Now, maybe you didn’t decrement the reference count to zero, but it’s possible that another thread decrements it to zero after you do. Therefore, any decrement must be done with release semantics so that any straggling writes to memory are visible to the destructing thread before it frees the memory. One reason is that you want the destructor to see a consistent object. And even if the delayed write doesn’t affect consistency, you don’t want it to complete after the memory is freed. That would be a use-after-free, which is undefined behavior. In practice, this will corrupt whatever object was allocated into the memory that was previously occupied by the destructed object.
Meanwhile, the thread that decrements the reference count to zero must perform an acquire to ensure that it doesn’t start destructing the object until all previous writes have drained.
There are two approaches to this double responsibility on the decrement.
One is to decrement with release semantics, and then establish an acquire fence if you realize that you are the one to do the final decrement. This is the strategy employed by C++/WinRT:
static uint32_t __stdcall Release(fast_abi_forwarder* self) noexcept
{
    uint32_t const remaining =
        self->m_references.fetch_sub(1, std::memory_order_release) - 1;
    if (remaining == 0)
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete self;
    }
    return remaining;
}
Another approach is to use an acquire-release on the decrement, thereby avoiding the need for a separate acquire when the reference count goes to zero. This is the strategy employed by Microsoft’s STL:
void _Decref() noexcept { // decrement use count
    if (_MT_DECR(_Uses) == 0) {
        _Destroy();
        _Decwref();
    }
}

void _Decwref() noexcept { // decrement weak reference count
    if (_MT_DECR(_Weaks) == 0) {
        _Delete_this();
    }
}
where _MT_DECR is defined as
#define _MT_DECR(x) _INTRIN_ACQ_REL(_InterlockedDecrement)(reinterpret_cast<volatile long*>(&x))
and _INTRIN_ACQ_REL performs an acquire-release atomic operation, or at least the closest version supported by the processor provided it is at least as strong as an acquire-release.
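In portable std::atomic terms, the acquire-release decrement strategy looks roughly like this (a sketch with invented names, not the STL's actual implementation):

```cpp
#include <atomic>
#include <cstdint>

struct ControlBlock
{
    std::atomic<uint32_t> m_uses{ 1 };

    // Returns the remaining reference count, mirroring the WinRT example.
    uint32_t Release() noexcept
    {
        // One acq_rel operation does both jobs: the release half makes
        // this thread's writes visible to whichever thread performs the
        // final decrement, and the acquire half ensures the final
        // decrementer sees every other thread's writes before it
        // destroys the object. No separate fence is needed.
        uint32_t const remaining =
            m_uses.fetch_sub(1, std::memory_order_acq_rel) - 1;
        if (remaining == 0)
        {
            delete this;
        }
        return remaining;
    }
};
```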
The libc++ library (llvm) also uses acquire-release, as does the libstdc++ library (gcc).
This is largely irrelevant for x86. Any CPU instruction with the lock prefix has a strong memory ordering. In the case of x86, the only time the memory ordering or memory fences make a difference is for the SIMD instructions. This being said, memory ordering affects how code can be moved by the compiler during optimization. This is usually irrelevant for function calls, like InterlockedDecrement/InterlockedIncrement, but it does affect inlined intrinsics. ARM is an entirely different animal, but function calls to InterlockedDecrement/InterlockedIncrement always enforce the strong memory ordering.
I am relatively new to the space of memory ordering… I feel like we programmed successfully for decades, just using interlocked instructions to incr/decr reference counts. Why is this a thing I have to worry about now?
It feels like there’s a camp that wants to make C++ programming harder, not easier. I am not a member of that camp.
You might have noticed that computers nowadays have more cores than they did decades ago, and software is more heavily multithreaded now. We can’t scale up any more, so we have to scale out.
@Shawn Van Ness The problem isn't that there are processors where interlocked instructions don't act as a fence. The problem is that there are processors where full fences are a lot more expensive than partial fences, so you want to use the weakest possible fence that gets the job done.
"In the headers I have, _INTRIN_ACQ_REL is just washed out to nothing, for both x64 and ARM. #define _INTRIN_ACQ_REL(x) x"
But look at how the macros are used:
They wash out to nothing, leaving the call to the interlocked function intact, so they do call the interlocked function. The point is that they don't expand to, say, a weaker non-fenced variant of the intrinsic.
@Shawn Van Ness The issue is the interlocked stuff you’re referring to is too heavy-handed, leaving performance on the table. The other memory orderings allow the processor to do less work and allow the compiler to do better optimization.
I get that we must use interlocked inc/dec because of threading. I am asking, why is that not sufficient? Are there processor archs where the interlocked instructions don’t inherently act as a fence? (Is this an ARM64 thing?)
In the headers I have, _INTRIN_ACQ_REL is just washed out to nothing, for both x64 and ARM. #define _INTRIN_ACQ_REL(x) x
(It sounds very scary to me, if someone were to start making chips where the interlocked instructions don’t have the same sequential ordering guarantee.)
This has nothing to do with C++. Memory ordering is specifically for the processor to know how to handle memory caching. The compiler just also uses it to know how it is allowed to perform optimizations.
Tiny nitpick on names: LLVM’s C++ library is called “libc++”, and GCC’s is called “libstdc++”.
There's an extra wrinkle / strategy used by shared_ptr and weak_ptr in LLVM's libc++. For the weak count, libc++ does a load acquire on the weak ref count, and if this is the last reference, it doesn't even do the decrement. If it's not the last one, then it does the acq_rel decrement. This saves a potentially expensive atomic store in the extremely common case of going from a ref count of 1 to 0, at the expense of unnecessary loads when there are a lot of weak_ptrs around. I even left a big comment...
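The shape of that optimization, sketched with std::atomic and invented names (not libc++'s actual code):

```cpp
#include <atomic>

struct WeakCount
{
    std::atomic<long> m_weaks{ 1 };

    // Returns true if this was the last weak reference and the caller
    // should delete the control block.
    bool ReleaseWeak() noexcept
    {
        // Common case: we hold the only weak reference. A load-acquire
        // is enough to synchronize with earlier decrements, and since
        // taking a new weak reference requires already holding one,
        // nobody can race with us. Skip the read-modify-write entirely.
        if (m_weaks.load(std::memory_order_acquire) == 1)
        {
            return true;
        }
        // Contended case: fall back to the acquire-release decrement.
        // Another thread may have released between our load and here,
        // so we still check whether our decrement was the last one.
        return m_weaks.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```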