Commenter Shawn wondered why we are so worried about memory access semantics: “Back in my day, we just used full barriers everywhere, and I didn’t hear nobody complainin’.”
Moore’s law says that the number of transistors in an integrated circuit doubles about every two years. For a long time, these resources were employed to make CPUs faster: Accelerating clock speeds, expanding caches, deepening pipelines, performing more prediction and speculation, increasing the number of registers, that sort of thing. All of these improvements sped up single-threaded operations.
The problem is that you can make things only so small before you run into physical limits. It’s hard to build things that are smaller than atoms, and the speed of light constrains how quickly two components can communicate with each other. So you have to look for other ways to get your performance gains.
CPU speeds have largely leveled off in the 4 GHz range; you can’t go faster that way any more. Instead of scaling up, we have to scale out: rather than making each core faster, we give each CPU more cores.
By analogy, if you have to transport more people per hour by bus, and you can’t make the bus go any faster, then you have to get more buses.¹
More CPU cores means that programs need to use concurrency in order to extract maximum performance. That in turn led to discoveries like lock convoys, the performance cost of fairness when under heavy load, and (the issue at hand) too many barriers.
Continuing the bus analogy: If you have a fleet of a dozen buses, you need to make sure that nobody tries to use the same ticket to board two different buses. This means that the buses have to somehow communicate with each other to make sure each ticket is used only once. If you have a lot of buses, the ticket validation could end up being the slowest part of loading passengers!
The x86 family of processors has fairly strong memory model semantics. You can dig into the Intel documents to see the precise wording, but it roughly comes down to loads carrying acquire semantics and stores carrying release semantics by default. Most other processors, however, have a more relaxed memory model by default, allowing them to do more aggressive reordering to improve performance.
In order to ensure any ordering beyond the default, you need to issue explicit fences. And those fences tend to be expensive. (After all, if they were cheap, then the architecture would just do them by default.) Your CPU that has grown to have caches the size of a small country? Yeah, those caches need to be flushed because the memory now has to be made visible to another processor, and they need to be invalidated because they may have been changed by another processor.
Therefore, when writing code that may be used in high-performance scenarios, you want to avoid unnecessary stalls and flushes. And that means choosing the weakest barrier that still achieves the desired result.
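To make that concrete, here is a minimal C++ sketch (the names are invented for illustration) of publishing a value through a flag with release/acquire ordering instead of the default sequentially consistent ordering, which would emit a stronger barrier than this situation needs:

```cpp
#include <atomic>
#include <thread>

int payload = 0;                       // data being published
std::atomic<bool> ready{false};        // guards visibility of payload

void producer()
{
    payload = 42;                      // plain store
    // Release: everything written before this store becomes visible
    // to a thread that observes ready == true with an acquire load.
    ready.store(true, std::memory_order_release);
}

void consumer()
{
    // Acquire: once we see true, the write to payload is visible.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int value = payload;               // safe: no full barrier was needed
    (void)value;
}
```

On x86, the release store and acquire load compile to ordinary mov instructions, whereas a sequentially consistent store typically costs a locked instruction or an explicit fence.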
In other words, people didn’t complain back then because it wasn’t a problem back then. CPUs were not fast enough, and programs not sufficiently multi-threaded, for fences to show up in performance traces. But now they do.
Bonus chatter: Commenter Ben noted that the libc++ implementation of shared_ptr is even more aggressive about avoiding barriers and skips the write-release barrier if the reference count is decrementing from 1 to 0, because decrementing to zero means that nobody else has access to the object, so it doesn’t matter that they can’t see what you’re doing with the memory.
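For reference, here is a minimal sketch of the classic reference-counting pattern these decisions are about; it illustrates the general technique and is not libc++’s actual source:

```cpp
#include <atomic>

// Sketch of a manually reference-counted object (assumes heap allocation).
struct RefCounted
{
    std::atomic<long> refs{1};

    void add_ref()
    {
        // Incrementing needs atomicity but no ordering: nobody acts
        // on the new value.
        refs.fetch_add(1, std::memory_order_relaxed);
    }

    void release()
    {
        // Decrement with release so this thread's writes to the object
        // happen-before its destruction...
        if (refs.fetch_sub(1, std::memory_order_release) == 1) {
            // ...and acquire so the destroying thread sees the writes
            // made by every other thread that dropped a reference.
            std::atomic_thread_fence(std::memory_order_acquire);
            delete this;
        }
    }
};
```

The optimization Ben describes drops even that release ordering when the decrement takes the count from 1 to 0, since at that point no other thread will ever look at the object again.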
Bonus bonus chatter: That article was primarily written as a note to myself for future reference.
¹ You could also make the bus bigger so it carries more passengers. That’s what SIMD tried to do: Let the CPU process more data at once. But it requires workloads that parallelize in a SIMD-friendly way.
The anticipated benefit is directly tied to the use case. For example, on Apple platforms, the dispatch_once macro implements an equivalent facility for plain C (and Objective-C) code. As you might guess, it wraps a state flag that indicates whether the critical section has been or is being executed.
Since it is often used to implement thread-safe static initialization, the expected runtime pattern is that many threads will hit the same once flag over the runtime of the program, with only the first few threads actually needing to synchronize on the critical section. For most of the program’s lifetime, the critical...
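A minimal sketch of that fast-path pattern (the names are invented, and this is not Apple’s actual implementation):

```cpp
#include <atomic>
#include <mutex>

std::atomic<bool> initialized{false};
std::mutex init_mutex;

void do_once(void (*init)())
{
    // Fast path: once initialization has completed, every caller pays
    // only for one acquire load and never touches the mutex.
    if (initialized.load(std::memory_order_acquire)) {
        return;
    }

    // Slow path: only the first few threads ever get here.
    std::lock_guard<std::mutex> lock(init_mutex);
    if (!initialized.load(std::memory_order_relaxed)) {  // the mutex already orders this
        init();
        // Release publishes everything init() wrote before the flag flips.
        initialized.store(true, std::memory_order_release);
    }
}
```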
❤️, but I still would like to better understand a real-life example where this makes a meaningful, measurable difference. Is it only useful within the standard library implementations of shared_ptr, mutex lock/unlock, condition variables, and the like?
(As an application-tier developer, what are the scenarios where I would want to think deeply and care about this? Is the bus-ticket-validation example meant as a real example where throughput would improve?)
And by accepting the cognitive load of this into my codebase, and the testing, should I expect to gain 1%, 10%, or 100% throughput? Only on ARM, or on x64 as well?
It depends on the scale of systems you're running on, and whether you're using raw atomics. If you just use std::shared_ptr, mutex lock/unlock, condition variables, and other library facilities, you get the benefit automatically. If you use any atomic operations yourself, then you ought to think about it.
You can, however, get an awful long way with just two rules of thumb:
1. Where a standard library facility exists, use it instead of a home-rolled option: e.g., use C++20 std::stop_source instead of a shutdown flag, or library queues instead of a mutex and condition variables (see the sketch after this list).
2. Only use atomics when a relaxed memory...
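For the first rule, here is a minimal sketch of what trading a home-rolled shutdown flag for the C++20 facilities might look like (the worker body is a stand-in):

```cpp
#include <chrono>
#include <stop_token>
#include <thread>

void worker(std::stop_token token)
{
    while (!token.stop_requested()) {
        // ... process one work item ...
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}

int main()
{
    // std::jthread hands its stop_token to the callable, and its
    // destructor requests stop and joins automatically, so there is
    // no hand-written atomic<bool> shutdown flag to get wrong.
    std::jthread t(worker);
    std::this_thread::sleep_for(std::chrono::seconds(1));
}   // t.request_stop() and t.join() happen implicitly here
```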
It is useful any time you are working with atomics for any reason, not just for library authors. All architectures will see benefits, because the weaker memory orders allow the compiler to perform more optimizations than it could with the stronger memory orders. As always, though, the actual performance benefit depends quite heavily on your specific workload and processor. If it’s easy enough to pick an appropriate memory order, you may as well do it instead of fussing over benchmarks; the cognitive load is quite low in most cases.
> You could also make the bug bigger so it carries more passengers.
Francœur from “Un monstre à Paris” seems to have found a new job.
High-level programming languages are not assembly. Even if the CPU guarantees that load/store operations have acquire/release semantics, it doesn’t mean that code written in a high-level language automatically inherits these guarantees. Therefore, high-level programming languages must establish their own memory models, and compilers are responsible for mapping them onto the CPU’s model.
Just to emphasize this point - I've seen improvements (a while ago, so details are incomplete) on x86-64 by changing an atomic load from acquire to relaxed, even though x86 load instructions imply acquire memory ordering.
The atomic load was for a "quit now" flag, which was checked before loading a new work item, and then once on each iteration of working (entirely locally - no inter-thread communication) on that work item. By making the flag load "relaxed", I enabled the optimizer to hoist a bunch of loop-invariant work (from inlined functions) out of the loop, for a significant speed-up...
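A rough reconstruction of the shape of that loop (the work itself is a stand-in; the original details weren’t given):

```cpp
#include <atomic>
#include <cstddef>

std::atomic<bool> quit_now{false};  // set from another thread to request shutdown

void process(double* data, std::size_t n, double scale)
{
    for (std::size_t i = 0; i < n; ++i) {
        // An acquire load forbids hoisting later memory accesses above it,
        // which pins loop-invariant loads inside the loop.  A relaxed load
        // imposes no such ordering, so the optimizer is free to hoist the
        // invariant work out.
        if (quit_now.load(std::memory_order_relaxed)) {
            return;
        }
        data[i] *= scale;  // purely local work, no inter-thread communication
    }
}
```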