Commenter Shawn wondered why we are so worried about memory access semantics: “Back in my day, we just used full barriers everywhere, and I didn’t hear nobody complainin’.”
Moore’s law says that the number of transistors in an integrated circuit doubles about every two years. For a long time, these resources were employed to make CPUs faster: Accelerating clock speeds, expanding caches, deepening pipelines, performing more prediction and speculation, increasing the number of registers, that sort of thing. All of these improvements sped up single-threaded operations.
The problem is that you can make things only so small before you run into physical limits. It’s hard to build things that are smaller than atoms, and the speed of light constrains how quickly two components can communicate with each other. So you have to look for other ways to get your performance gains.
CPU speeds have largely leveled off in the 4 GHz range; you can’t go faster that way any more. Instead of scaling up, we have to scale out: rather than making each core faster, we give each CPU more cores.
By analogy, if you have to transport more people per hour by bus, and you can’t make the bus go any faster, then you have to get more buses.¹
More CPU cores means that programs need to use concurrency in order to extract maximum performance. That in turn led to discoveries like lock convoys, the performance cost of fairness when under heavy load, and (the issue at hand) too many barriers.
Continuing the bus analogy: If you have a fleet of a dozen buses, you need to make sure that nobody tries to use the same ticket to board two different buses. This means that the buses have to somehow communicate with each other to make sure each ticket is used only once. If you have a lot of buses, the ticket validation could end up being the slowest part of loading passengers!
The x86 family of processors has fairly strong memory model semantics. You can dig into the Intel documents to see the precise wording, but it roughly comes down to loads carrying acquire semantics and stores carrying release semantics by default. Most other processors, however, have a more relaxed memory model by default, allowing them to do more aggressive reordering to improve performance.
In order to ensure any ordering beyond the default, you need to issue explicit fences. And those fences tend to be expensive. (After all, if they were cheap, then the architecture would just do them by default.) Your CPU that has grown to have caches the size of a small country? Yeah, those caches need to be flushed because the memory now has to be made visible to another processor, and they need to be invalidated because they may have been changed by another processor.
Therefore, when writing code that may be used in high-performance scenarios, you want to avoid unnecessary stalls and flushes. And that means choosing the weakest barrier that still achieves the desired result.
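To make that concrete, here is a minimal C++ sketch (the names are invented for illustration) of publishing a value through a flag with release/acquire ordering instead of the default sequentially consistent ordering, which would emit a stronger barrier than this situation needs:

```cpp
#include <atomic>
#include <thread>

int payload = 0;                       // data being published
std::atomic<bool> ready{false};        // guards visibility of payload

void producer()
{
    payload = 42;                      // plain store
    // Release: everything written before this store becomes visible
    // to a thread that observes ready == true with an acquire load.
    ready.store(true, std::memory_order_release);
}

void consumer()
{
    // Acquire: once we see true, the write to payload is visible.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int value = payload;               // safe: no full barrier was needed
    (void)value;
}
```

On x86, the release store and acquire load compile to ordinary mov instructions, whereas a sequentially consistent store typically costs a locked instruction or an explicit fence.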
In other words, people didn’t complain back then because it wasn’t a problem back then. CPUs were not fast enough, and programs not sufficiently multi-threaded, for fences to show up in performance traces. But now they do.
Bonus chatter: Commenter Ben noted that the libc++ implementation of shared_ptr is even more aggressive about avoiding barriers and skips the write-release barrier if the reference count is decrementing from 1 to 0, because decrementing to zero means that nobody else has access to the object, so it doesn’t matter that they can’t see what you’re doing with the memory.
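For reference, here is a minimal sketch of the classic reference-counting pattern these decisions are about; it illustrates the general technique and is not libc++’s actual source:

```cpp
#include <atomic>

// Sketch of a manually reference-counted object (assumes heap allocation).
struct RefCounted
{
    std::atomic<long> refs{1};

    void add_ref()
    {
        // Incrementing needs atomicity but no ordering: nobody acts
        // on the new value.
        refs.fetch_add(1, std::memory_order_relaxed);
    }

    void release()
    {
        // Decrement with release so this thread's writes to the object
        // happen-before its destruction...
        if (refs.fetch_sub(1, std::memory_order_release) == 1) {
            // ...and acquire so the destroying thread sees the writes
            // made by every other thread that dropped a reference.
            std::atomic_thread_fence(std::memory_order_acquire);
            delete this;
        }
    }
};
```

The optimization Ben describes drops even that release ordering when the decrement takes the count from 1 to 0, since at that point no other thread will ever look at the object again.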
Bonus bonus chatter: That article was primarily written as a note to myself for future reference.
¹ You could also make the bus bigger so it carries more passengers. That’s what SIMD tried to do: Let the CPU process more data at once. But it requires workloads that parallelize in a SIMD-friendly way.
The anticipated benefit is directly tied to the use case. For example, on Apple platforms, the dispatch_once macro implements an equivalent facility for plain C (and Objective-C) code. As you might guess, it wraps a state flag that indicates whether the critical section has been or is being executed.
Since it is often used to implement thread-safe static initialization, the expected runtime pattern is that many threads will hit the same once flag over the runtime of the program, with only the first few threads actually needing to synchronize on the critical section. For most of the program’s lifetime, the critical...
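A minimal sketch of that fast-path pattern (the names are invented, and this is not Apple’s actual implementation):

```cpp
#include <atomic>
#include <mutex>

std::atomic<bool> initialized{false};
std::mutex init_mutex;

void do_once(void (*init)())
{
    // Fast path: once initialization has completed, every caller pays
    // only for one acquire load and never touches the mutex.
    if (initialized.load(std::memory_order_acquire)) {
        return;
    }

    // Slow path: only the first few threads ever get here.
    std::lock_guard<std::mutex> lock(init_mutex);
    if (!initialized.load(std::memory_order_relaxed)) {  // the mutex already orders this
        init();
        // Release publishes everything init() wrote before the flag flips.
        initialized.store(true, std::memory_order_release);
    }
}
```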
❤️, but I still would like to better understand a real-life example where this makes a meaningful, measurable difference. Is it only useful within the standard library implementations of shared_ptr, mutex lock/unlock, condition variables, and the like?
(As an application-tier developer, what are the scenarios where I would want to think deeply and care about this? Is the bus-ticket-validation example meant as a real example where throughput would improve?)
And by accepting the cognitive load of this into my codebase, and the testing, should I expect to gain 1%, 10%, or 100% throughput? Only on ARM, or on x64 as well?
It depends on the scale of systems you're running on, and whether you're using raw atomics. If you just use std::shared_ptr, mutex lock/unlock, condition variables, and other library facilities, you get the benefit automatically. If you use any atomic operations yourself, then you ought to think about it.
You can, however, get an awful long way with just two rules of thumb:
1. Where a standard library facility exists, use it instead of a home-rolled option: e.g., use C++20 std::stop_source instead of a shutdown flag, or library queues instead of a mutex and condition variables (see the sketch after this list).
2. Only use atomics when a relaxed memory...
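For the first rule, here is a minimal sketch of what trading a home-rolled shutdown flag for the C++20 facilities might look like (the worker body is a stand-in):

```cpp
#include <chrono>
#include <stop_token>
#include <thread>

void worker(std::stop_token token)
{
    while (!token.stop_requested()) {
        // ... process one work item ...
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}

int main()
{
    // std::jthread hands its stop_token to the callable, and its
    // destructor requests stop and joins automatically, so there is
    // no hand-written atomic<bool> shutdown flag to get wrong.
    std::jthread t(worker);
    std::this_thread::sleep_for(std::chrono::seconds(1));
}   // t.request_stop() and t.join() happen implicitly here
```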
It is useful any time you are working with atomics for any reason, not just for library authors. All architectures will see benefits, because the weaker memory orders allow the compiler to perform more optimizations than it could with the stronger memory orders. As always, though, the actual performance benefit depends quite heavily on your specific workload and processor. If it’s easy enough to pick an appropriate memory order, you may as well do it instead of fussing over benchmarks; the cognitive load is quite low in most cases.
> You could also make the bug bigger so it carries more passengers.
Francœur from “Un monstre à Paris” seems to have found a new job.
High-level programming languages are not assembly. Even if the CPU guarantees that load/store operations have acquire/release semantics, it doesn’t mean that code written in a high-level language automatically inherits these guarantees. Therefore, high-level programming languages must establish their own memory models, and compilers are responsible for mapping them onto the CPU’s model.
Just to emphasize this point - I've seen improvements (a while ago, so details are incomplete) on x86-64 by changing an atomic load from acquire to relaxed, even though x86 load instructions imply acquire memory ordering.
The atomic load was for a "quit now" flag, which was checked before loading a new work item, and then once on each iteration of working (entirely locally - no inter-thread communication) on that work item. By making the flag load "relaxed", I enabled the optimizer to hoist a bunch of loop-invariant work (from inlined functions) out of the loop, for a significant speed-up...
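A rough reconstruction of the shape of that loop (the work itself is a stand-in; the original details weren’t given):

```cpp
#include <atomic>
#include <cstddef>

std::atomic<bool> quit_now{false};  // set from another thread to request shutdown

void process(double* data, std::size_t n, double scale)
{
    for (std::size_t i = 0; i < n; ++i) {
        // An acquire load forbids hoisting later memory accesses above it,
        // which pins loop-invariant loads inside the loop.  A relaxed load
        // imposes no such ordering, so the optimizer is free to hoist the
        // invariant work out.
        if (quit_now.load(std::memory_order_relaxed)) {
            return;
        }
        data[i] *= scale;  // purely local work, no inter-thread communication
    }
}
```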