Most websites tell you to use the concurrency primitives provided by your OS because this material is hard to understand. That advice is not useful for RAMCloud, since we care so much about performance (and we're willing to go through as much pain as necessary to get it).
Main Issues
We assume the processor may reorder instructions and delay stores indefinitely unless told otherwise.
We assume the compiler may reorder instructions or remove them altogether for efficiency unless told otherwise.
A correct concurrency primitive must account for both of these issues.
Processor Tools
Memory fences: mfence (an x86 instruction)
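As a rough illustration of how a full memory fence might be issued from C, the sketch below wraps the x86 mfence instruction in GCC-style inline assembly. The names full_fence, publish, data, and ready are hypothetical and are not taken from the RAMCloud sources. The fallback to the GCC builtin __sync_synchronize() on non-x86 targets is an assumption added so the sketch compiles elsewhere.

```c
#include <assert.h>

/* Hypothetical helper: a full memory fence. */
static inline void full_fence(void)
{
#if defined(__x86_64__) || defined(__i386__)
    /* mfence orders all earlier loads and stores before all later ones
     * in the processor. The "memory" clobber additionally forbids the
     * compiler from moving memory accesses across this statement. */
    __asm__ __volatile__("mfence" ::: "memory");
#else
    /* Assumed fallback for other targets: GCC/Clang full-barrier builtin. */
    __sync_synchronize();
#endif
}

/* Hypothetical shared state: a payload and a flag announcing it. */
int data = 0;
int ready = 0;

/* Store the payload, fence, then set the flag, so that neither the
 * compiler nor the processor may reorder the flag store ahead of the
 * payload store. */
void publish(int value)
{
    data = value;
    full_fence();
    ready = 1;
}

/* Single-threaded smoke test: both stores have happened on return. */
void check_publish(void)
{
    publish(7);
    assert(data == 7 && ready == 1);
}
```

This only shows the writer side. A thread that polls ready and then reads data would also need to keep its own loads from being reordered or cached in registers by the compiler.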
Compiler Tools
Inline assembly
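One common use of inline assembly as a compiler tool is an empty asm statement with a "memory" clobber, which acts as a compiler-only barrier. The sketch below is illustrative; compiler_barrier, counter, and increment_twice are hypothetical names. It emits no machine instruction, so it constrains only the compiler and not the processor.

```c
#include <assert.h>

/* Hypothetical helper: a compiler-only barrier. The empty instruction
 * string emits nothing. The "memory" clobber tells the compiler that
 * memory may have been read or written, so it must not keep memory
 * values cached in registers across this point or move memory
 * accesses past it. __volatile__ prevents the statement from being
 * deleted as dead code. */
static inline void compiler_barrier(void)
{
    __asm__ __volatile__("" ::: "memory");
}

/* Hypothetical shared variable. */
int counter = 0;

/* Without the barrier the compiler would be free to merge the two
 * increments into a single "counter += 2". With it, the first
 * increment must be stored to memory before the barrier and counter
 * must be reloaded afterwards. */
int increment_twice(void)
{
    counter++;
    compiler_barrier();
    counter++;
    return counter;
}
```

Because the processor is not constrained here, a primitive that also needs hardware ordering would still require a fence instruction.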
asm vs __asm__
The two keywords behave identically. The keyword asm is a GNU extension and is not recognized when GCC compiles in strict ISO C mode (for example with -ansi or -std=c99), so code that must build in that mode should use the alternate keyword __asm__. See "Alternate Keywords" in the GCC manual for details.