Inside Concurrency Primitives
Most web sites out there tell you to use the concurrency primitives provided by your OS because this stuff is hard to understand. That's just not useful advice for RAMCloud, since we care so much about performance (and we're willing to go through as much pain as necessary to get it).
Main Issues
We assume the processor may reorder instructions and delay stores indefinitely unless told otherwise.
We assume the compiler may reorder instructions or remove them altogether for efficiency unless told otherwise.
A correct concurrency primitive must account for both of these issues.
Processor Tools
Memory fences: Mfence
See Section 8.2 (Memory Ordering) in Intel 3a System Programming Manual. This suggests you can actually get away with a fair bit. I think it also says that the LOCK prefix and XCHG instruction act like an mfence (TODO: confirm).
Compiler Tools
Inline assembly
asm vs __asm__
The two keywords behave the same. The keyword asm is not available in ISO C programs, so if you want compatibility with those, you should use the alternate keyword __asm__. See Alternate Keywords in the GCC manual for details.
Since RAMCloud is written in C++, we should use asm rather than the uglier __asm__.
volatile vs __volatile__
The volatile keyword on asm statements keeps gcc from deleting those asm statements as useless code. You must inform gcc of all side effects of your asm statements. If your asm statements have important side effects that can't be expressed as output operands (or you don't use the output operands but still want the asm instructions to execute), you should use the volatile keyword. This keeps gcc from deleting the asm block as part of its optimization passes. From Extended Asm in the GCC manual:
The volatile keyword indicates that the instruction has important side-effects. GCC will not delete a volatile asm if it is reachable. (The instruction can still be deleted if GCC can prove that control-flow will never reach the location of the instruction.)
Volatile does not, however, prevent GCC from moving the asm instruction. From Extended Asm in the GCC manual:
Note that even a volatile asm instruction can be moved relative to other code, including across jump instructions.
The gcc manual is still unsatisfying in this regard. What can asm volatile statements be reordered with? If all the compiler knows is that running the asm is important, what does it assume about where the asm must run? The manual suggests it's not entirely conservative.
In a 2007 mailing list post Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily, Linus says:
[I]t's a good idea to have the "memory" clobber there too, because the semantics of the "volatile" just aren't very well defined from a code movement standpoint.
This otherwise useful howto (GCC-Inline-Assembly-HOWTO) claims the following:
If our assembly statement must execute where we put it, (i.e. must not be moved out of a loop as an optimization), put the keyword volatile after asm and before the ()'s. So to keep it from moving, deleting and all...
However, this contradicts the gcc manual, which clearly states that the volatile keyword on asm statements will not stop the compiler from moving the asm instructions, including across jump instructions (see Extended Asm in the GCC manual).