When managing reference counts, there is an asymmetry between incrementing and decrementing: Incrementing the reference count can use relaxed semantics, but decrementing requires release semantics (and destroying requires acquire semantics).
The asymmetry may strike you as odd, but maybe it shouldn’t. After all, it’s not surprising that it’s easier to pull your toys out than to put them away.
Incrementing a reference count can be done with relaxed semantics (no memory ordering with respect to other memory locations) because the object is not at risk of being destroyed, and any memory operations that occur after the increment may as well have occurred before the increment. Incrementing a reference count doesn’t really impose any ordering requirements on memory accesses to the object.
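As a concrete sketch of the increment side (hypothetical type and member names, using plain std::atomic rather than any particular library's implementation):

```cpp
#include <atomic>
#include <cstdint>

struct RefCounted
{
    std::atomic<uint32_t> m_references{ 1 };

    void AddRef() noexcept
    {
        // Relaxed is sufficient: the caller already holds a reference,
        // so the object cannot be destroyed out from under us, and the
        // increment doesn't need to publish or consume any other writes.
        m_references.fetch_add(1, std::memory_order_relaxed);
    }
};
```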
Decrementing a reference count is a different story.
The danger with decrementing a reference count is that the object is destructed when the reference count goes to zero. Now, maybe you didn’t decrement the reference count to zero, but it’s possible that another thread decrements it to zero after you do. Therefore, any decrement must be done with release semantics so that any straggling writes to memory are visible to the destructing thread before it frees the memory. One reason is that you want the destructor to see a consistent object. And even if the delayed write doesn’t affect consistency, you don’t want it to complete after the memory is freed. That would be a use-after-free, which is undefined behavior. In practice, this will corrupt whatever object was allocated into the memory that was previously occupied by the destructed object.
Meanwhile, the thread that decrements the reference count to zero must perform an acquire to ensure that it doesn’t start destructing the object until all previous writes have drained.
There are two approaches to this double responsibility on the decrement.
One is to decrement with release semantics, and then establish an acquire fence if you realize that you are the one to do the final decrement. This is the strategy employed by C++/WinRT:
static uint32_t __stdcall Release(fast_abi_forwarder* self) noexcept
{
    uint32_t const remaining =
        self->m_references.fetch_sub(1, std::memory_order_release) - 1;
    if (remaining == 0)
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete self;
    }
    return remaining;
}
Another approach is to use an acquire-release on the decrement, thereby avoiding the need for a separate acquire when the reference count goes to zero. This is the strategy employed by Microsoft’s STL:
void _Decref() noexcept { // decrement use count
    if (_MT_DECR(_Uses) == 0) {
        _Destroy();
        _Decwref();
    }
}

void _Decwref() noexcept { // decrement weak reference count
    if (_MT_DECR(_Weaks) == 0) {
        _Delete_this();
    }
}
where _MT_DECR is defined as
#define _MT_DECR(x) _INTRIN_ACQ_REL(_InterlockedDecrement)(reinterpret_cast<volatile long*>(&x))
and _INTRIN_ACQ_REL performs an acquire-release atomic operation, or at least the closest version supported by the processor provided it is at least as strong as an acquire-release.
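In portable std::atomic terms, the acquire-release decrement strategy looks roughly like this (a sketch with invented names, not the STL's actual implementation):

```cpp
#include <atomic>
#include <cstdint>

struct ControlBlock
{
    std::atomic<uint32_t> m_uses{ 1 };

    // Returns the remaining reference count, mirroring the WinRT example.
    uint32_t Release() noexcept
    {
        // One acq_rel operation does both jobs: the release half makes
        // this thread's writes visible to whichever thread performs the
        // final decrement, and the acquire half ensures the final
        // decrementer sees every other thread's writes before it
        // destroys the object. No separate fence is needed.
        uint32_t const remaining =
            m_uses.fetch_sub(1, std::memory_order_acq_rel) - 1;
        if (remaining == 0)
        {
            delete this;
        }
        return remaining;
    }
};
```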
The libc++ library (llvm) also uses acquire-release, as does the libstdc++ library (gcc).
This is largely irrelevant for x86. Any CPU instruction with the lock prefix has a strong memory ordering. In the case of x86, the only time the memory ordering or memory fences make a difference is for the SIMD instructions. This being said, memory ordering affects how code can be moved by the compiler during optimization. This is usually irrelevant for function calls, like InterlockedDecrement/InterlockedIncrement, but it does affect inlined intrinsics. ARM is an entirely different animal, but function calls to InterlockedDecrement/InterlockedIncrement always enforce the strong memory ordering.
I am relatively new to the space of memory ordering… I feel like we programmed successfully for decades, just using interlocked instructions to incr/decr reference counts. Why is this a thing I have to worry about now?
It feels like there’s a camp that wants to make C++ programming harder, not easier. I am not a member of that camp.
You might have noticed that computers nowadays have more cores than they did decades ago, and software is more heavily multithreaded now. We can’t scale up any more, so we have to scale out.
@Shawn Van Ness The problem isn't that there are processors where interlocked instructions don't act as a fence. The problem is that there are processors where full fences are a lot more expensive than partial fences, so you want to use the weakest possible fence that gets the job done.
"In the headers I have, _INTRIN_ACQ_REL is just washed out to nothing, for both x64 and ARM. #define _INTRIN_ACQ_REL(x) x"
But look at how the macros are used:
They wash out to nothing, leaving the call to the interlocked function intact, so they do call the interlocked function. The point is that they don't expand to, say, a weaker non-fenced variant of the intrinsic.
@Shawn Van Ness The issue is the interlocked stuff you’re referring to is too heavy-handed, leaving performance on the table. The other memory orderings allow the processor to do less work and allow the compiler to do better optimization.
I get that we must use interlocked inc/dec because of threading. I am asking, why is that not sufficient? Are there processor archs where the interlocked instructions don’t inherently act as a fence? (Is this an ARM64 thing?)
In the headers I have, _INTRIN_ACQ_REL is just washed out to nothing, for both x64 and ARM. #define _INTRIN_ACQ_REL(x) x
(It sounds very scary to me, if someone were to start making chips where the interlocked instructions don’t have the same sequential ordering guarantee.)
This has nothing to do with C++. Memory ordering is specifically for the processor to know how to handle memory caching. The compiler just also uses it to know how it is allowed to perform optimizations.
Tiny nitpick on names: LLVM’s C++ library is called “libc++”, and GCC’s is called “libstdc++”.
There's an extra wrinkle / strategy used by shared_ptr and weak_ptr in LLVM's libc++. For the weak count, libc++ does a load acquire on the weak ref count, and if this is the last reference, it doesn't even do the decrement. If it's not the last one, then it does the acq_rel decrement. This saves a potentially expensive atomic store in the extremely common case of going from a ref count of 1 to 0, at the expense of unnecessary loads when there are a lot of weak_ptrs around. I even left a big comment...
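The shape of that optimization, sketched with std::atomic and invented names (not libc++'s actual code):

```cpp
#include <atomic>

struct WeakCount
{
    std::atomic<long> m_weaks{ 1 };

    // Returns true if this was the last weak reference and the caller
    // should delete the control block.
    bool ReleaseWeak() noexcept
    {
        // Common case: we hold the only weak reference. A load-acquire
        // is enough to synchronize with earlier decrements, and since
        // taking a new weak reference requires already holding one,
        // nobody can race with us. Skip the read-modify-write entirely.
        if (m_weaks.load(std::memory_order_acquire) == 1)
        {
            return true;
        }
        // Contended case: fall back to the acquire-release decrement.
        // Another thread may have released between our load and here,
        // so we still check whether our decrement was the last one.
        return m_weaks.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```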