Most web sites out there tell you to use the concurrency primitives provided by your OS because this stuff is hard to understand. That's just not useful advice for RAMCloud, since we care so much about performance (and we're willing to go through as much pain as necessary to get it).
Main Issues
We assume the processor may reorder instructions and delay stores indefinitely unless told otherwise.
We assume the compiler may reorder instructions or remove them altogether for efficiency unless told otherwise.
A correct concurrency primitive must account for both of these issues.
Processor Tools
Memory fences: the x86 mfence instruction
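As a rough sketch of how such a fence might be wrapped, the hypothetical helper full_fence() below issues mfence through GCC inline assembly on x86-64. The fallback to GCC's __sync_synchronize() builtin is an assumption added only so the sketch builds on other architectures. The names full_fence, try_enter, flag_a, and flag_b are illustrative and do not come from RAMCloud. try_enter() shows the classic case that needs a full fence: a store followed by a load of a different location, which x86 may otherwise reorder.

```c
/* Hypothetical shared flags, one per thread (illustrative names). */
int flag_a = 0;
int flag_b = 0;

static void full_fence(void)
{
#if defined(__x86_64__)
    /* mfence orders all earlier loads and stores before all later ones.
     * The "memory" clobber also stops the compiler from moving memory
     * accesses across this statement. */
    __asm__ __volatile__("mfence" ::: "memory");
#else
    /* Assumption: on other targets, use GCC's full-barrier builtin. */
    __sync_synchronize();
#endif
}

/* Set *mine, then check *theirs. Returns 1 if the other flag is clear. */
int try_enter(int *mine, const int *theirs)
{
    *mine = 1;              /* store */
    full_fence();           /* keep the store ahead of the load below */
    return *theirs == 0;    /* load */
}
```

A thread would call try_enter(&flag_a, &flag_b) while its peer calls try_enter(&flag_b, &flag_a). This is only the fence placement, not a complete mutual-exclusion algorithm.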
Compiler Tools
Inline assembly
asm vs __asm__
The two keywords behave the same. The keyword asm is not available when compiling in strict ISO C mode (-ansi or a -std=c* option such as -std=c99), so if you want compatibility with such programs, you should use the alternate keyword __asm__. See Alternate Keywords in the GCC manual for details.
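A small sketch of the alternate spelling follows; pass_through is a hypothetical name chosen for illustration. The template is empty, so no instructions are emitted and the code should build on any GCC target, including under -std=c99, where plain asm is not recognized as a keyword.

```c
/* Returns x unchanged, but hides its value from the optimizer. */
static int pass_through(int x)
{
    /* Empty template: no instructions are emitted. The "+r" constraint
     * tells the compiler that x lives in a register and is both read
     * and written here, so it cannot assume x's value afterwards. */
    __asm__("" : "+r"(x));
    return x;
}
```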
volatile vs __volatile__
This otherwise useful howto claims the following:
"If our assembly statement must execute where we put it, (i.e. must not be moved out of a loop as an optimization), put the keyword volatile after asm and before the ()'s. So to keep it from moving, deleting and all..."
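However, this contradicts the GCC manual. According to the manual, the volatile keyword keeps the compiler from deleting an asm statement, but it will not stop the compiler from moving the asm instructions relative to other code, including across jump instructions (see Extended Asm in the GCC manual). As with asm and __asm__, the spellings volatile and __volatile__ behave the same.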
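One common way to pin an asm statement relative to the surrounding memory accesses is to add a "memory" clobber, which tells the compiler that the statement may read or write arbitrary memory. The sketch below uses an empty asm with that clobber as a compiler-only barrier. The names COMPILER_BARRIER, publish, try_consume, payload, and ready are hypothetical. This constrains only the compiler; the processor may still reorder the accesses, so a fence instruction is still needed wherever the architecture allows that reordering.

```c
/* Hypothetical shared state (illustrative names). */
int payload = 0;
int ready = 0;

/* Compiler-only barrier: emits no instructions, but the "memory"
 * clobber forbids the compiler from moving loads or stores across it. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

void publish(int value)
{
    payload = value;
    COMPILER_BARRIER();     /* payload store is emitted before ready store */
    ready = 1;
}

/* Returns the payload, or -1 if nothing has been published yet. */
int try_consume(void)
{
    if (!ready)
        return -1;
    COMPILER_BARRIER();     /* payload load is not hoisted above the check */
    return payload;
}
```

Without the clobber, volatile alone would keep the asm statement from being deleted but would leave the compiler free to reorder the two stores in publish() around it.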