High-performance multithreading is very hard
Among other things, you need to understand weak memory models. Hereby incorporating by reference Brad Abrams‘ discussion of volatile and MemoryBarrier(). In particular, Vance Morrison’s discussion of memory models is important reading. (Though I think Brad is being too pessimistic about volatile. Ensuring release semantics at the store of “singleton” is all you really need – you want to make sure the singleton is fully constructed before you let the world see it. volatile here is overkill.) Vance’s message also slyly introduces the concepts of “acquire” and “release” memory semantics. An interlocked operation with “acquire” semantics prevents future reads from being advanced to before the acquisition. An interlocked operation with “release” semantics prevents previous writes from being delayed until after the release.
In the absence of explicitly-named memory semantics, the Win32 Interlocked* functions by default provide full memory barrier semantics. However, some functions, like InterlockedIncrementAcquire, forego the full memory barrier semantics and provide only acquire or release semantics.