Most web sites out there tell you to use the concurrency primitives provided by your OS because this stuff is hard to understand. That's just not useful advice for RAMCloud, since we care so much about performance (and we're willing to go through as much pain as necessary to get it).

Main Issues

We assume the processor may reorder instructions and delay stores indefinitely unless told otherwise.

We assume the compiler may reorder instructions or remove them altogether for efficiency unless told otherwise.

A correct concurrency primitive must account for both of these issues.

Processor Tools

Memory fences: Mfence

See Section 8.2 (Memory Ordering) in Intel 3a System Programming Manual. This suggests you can actually get away with a fair bit. I think it also says that the LOCK prefix and XCHG instruction act like an mfence (TODO: confirm).

Compiler Tools

Inline assembly

asm vs __asm__

The two keywords behave the same. The keyword asm is not available in ISO C programs, so if you want compatibility with those, you should use the alternate keyword __asm__. See Alternate Keywords in the GCC manual for details.

Since RAMCloud is written in C++, we should use asm rather than the uglier __asm__.

volatile vs __volatile__

The volatile keyword on asm statements keeps gcc from deleting those asm statements as useless code. You must inform gcc of all side effects of your asm statements. If your asm statements have important side effects that can't be expressed as output operands (or you don't use the output operands but still want the asm instructions to execute), you should use the volatile keyword. This keeps gcc from deleting the asm block as part of its optimization passes. From Extended Asm in the GCC manual:

The volatile keyword indicates that the instruction has important side-effects. GCC will not delete a volatile asm if it is reachable. (The instruction can still be deleted if GCC can prove that control-flow will never reach the location of the instruction.)

Volatile does not, however, prevent GCC from moving the asm instruction. From Extended Asm in the GCC manual:

Note that even a volatile asm instruction can be moved relative to other code, including across jump instructions.

The gcc manual is still unsatisfying in this regard. What can asm volatile statements be reordered with? If all the compiler knows is that running the asm is important, what does it assume about where the asm must run? The manual suggests it's not entirely conservative.

In a 2007 mailing list post Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily, Linus says:

[I]t's a good idea to have the "memory" clobber there too, because the semantics of the "volatile" just aren't very well defined from a code movement standpoint.

This otherwise useful howto (GCC-Inline-Assembly-HOWTO) claims the following:

If our assembly statement must execute where we put it, (i.e. must not be moved out of a loop as an optimization), put the keyword volatile after asm and before the ()'s. So to keep it from moving, deleting and all...

However, this contradicts the gcc manual, which clearly states that the volatile keyword on asm statements will not stop the compiler from moving the asm instructions, including across jump instructions (see Extended Asm in the GCC manual).

Clobbers memory

From Extended Asm in the GCC manual:

If your assembler instructions access memory in an unpredictable fashion, add `memory' to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory. You will also want to add the volatile keyword if the memory affected is not listed in the inputs or outputs of the asm, as the `memory' clobber does not count as a side-effect of the asm.

In a LKML post, Trent Piepho found that in the assembly output for the following code, x[20] = 1 is ellided:

extern int a; /* keep asm from being elided for having no used output */
static inline void bar(void) { asm("call bar" : "=m"(a) : : "memory"); }
/* float x can't alias asm's output int a */
void foo(float *x) { x[20] = 1; bar(); x[20] = 2; }

I was able to confirm this behavior using the KNOPPIX 3.6 bootable CD image, which has gcc 3.3.4.

He concludes that:

Adding a memory clobber tells gcc that the asm modifies memory. It doesn't modify un-aliased local variables in registers. It does modify aliased local variables. It does not read from memory. gcc will move or elide a memory write before an asm with a memory clobber if nothing else (besides the asm) could see the write. A memory clobber doesn't count as a side-effect either, a non-volatile asm without unused outputs will be elided, even if has a memory clobber.

Linus responds:

It does leave us with very few ways of saying that an asm can read memory, and so it might be good to have it clarified that "volatile" implies that (at least with the memory clobber).

So it seems the only safe way to make sure an asm block is not not moved or deleted by the compiler is to use volatile together with a memory clobber.

Volatile keyword on data or pointers

lkml emails from linus http://lwn.net/Articles/233481/:

If the memory barriers are right, then the "volatile" doesn't matter. And if the memory barriers aren't right, then "volatile" doesn't help.

Basically, a volatile on a data structure can NEVER be right. If it makes a difference, it's a sign of improper locking.

potentially make the compiler generate worse code for no reason (the "no reason" being that if there aren't any barriers in between, the compiler should merge accesses)

The only thing "volatile" can ever do is generate worse code, WITH NO UPSIDES.

actual accessor functions can use it in a cast to make one particular access follow the rules of "don't cache this one dereference". That is useful as part of a bigger set of rules about that access (i.e., it might be the internal implementation of a "readb()", for example).

On pointers

In Section 7.1.6.1 (The cv-qualifiers) of the C++ standard:

A pointer or reference to a cv-qualified type need not actually point or refer to a cv-qualified object, but it is treated as if it does.

So you can use a pointer marked volatile to signify volatile accesses, i.e., those that should be read from memory. However, this is useful for interacting with IO devices, not for concurrency where you need memory barriers anyway.

Survey of Existing Concurrency Primitives

Linux

The various memory fences are defined in arch/x86/include/asm/system.h as follows:

#define mb()    asm volatile("mfence":::"memory")
#define rmb()   asm volatile("lfence":::"memory")
#define wmb()   asm volatile("sfence" ::: "memory")

FreeBSD

The various memory fences are defined in amd64/include/atomic.h as follows:

#define mb()    __asm __volatile("mfence;" : : : "memory")
#define wmb()   __asm __volatile("sfence;" : : : "memory")
#define rmb()   __asm __volatile("lfence;" : : : "memory")

where __volatile is defined as volatile in sys/cdefs.h. I couldn't find the definition for __asm, but I assume it's the same as asm.

Google perf tools

http://google-perftools.googlecode.com/svn/trunk/src/base/
They have a set of BSD-licensed atomic primitives, a spinlock, etc.

References

Extended Asm in the GCC manual describes how to use inline assembly in gcc. Unfortunately, it is not very precise or thorough. When describing the '+' constraint modifier, it says you should not use '+m', but Linus claims this is not true (see Re: [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints).

Sandeep.S's GCC-Inline-Assembly-HOWTO is a nice introduction to writing inline assembly in gcc. Unfortunately, it does not mention the + (plus) constraint modifier, and it is wrong about the semantics of asm volatile (see Warning box above).

Volatile Considered Harmful in the Linux Documentation describes why data should not be marked volatile.

A collection of posts by Linus on inline assembly in gcc that Norman Yarvin found interesting. The ones from 2007 are quite relevant to this document (some are also cited individually here).

A 2007 mailing list post Re: [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints by Linus which describes that "+m" as an asm output constraint is acceptable. This is in conflict with Extended Asm in the GCC manual, which Linus claims needs to be updated.

http://gcc.gnu.org/viewcvs/trunk/gcc/doc/extend.texi?r1=88547&r2=88999
Change to gcc Extended Asm docs (Revision 88999 of gcc/doc/extend.texi, Modified Wed Oct 13 19:14:44 2004 UTC by Dale Johannesen <dalej@apple.com>):

* doc/extend.texi (Extended Asm): Rewrite asm volatile description.

In a 1996 mailing list post Re: Is clobber "memory" in include/asm-i386/system.h necessary?, Linus said:

Essentially, think of the "memory" thing as a serializing flag rather than as a "this instruction changes memory" flag. It is extremely important that gcc not re-order memory instructions across these serializing instructions, because otherwise you might find that gcc moves a memory load over the serializing instruction and then you lose...

He changes his mind a bit in 2007, now believing that "memory" means that the asm instructions write to memory but do not read it.

In Section 6.7.3 Type Qualifiers of the C99 standard:

An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously. (*) What constitutes an access to an object that has volatile-qualified type is implementation-defined.

(*) A volatile declaration may be used to describe an object corresponding to a memory-mapped input/output port or an object accessed by an asynchronously interrupting function. Actions on objects so declared shall not be "optimized out" by an implementation or reordered except as permitted by the rules for evaluating expressions.

Among other details, Section 5.12.3 (Program Execution) says:

At sequence points, volatile objects are stable in the sense that previous accesses are complete and subsequent accesses have not yet occurred.

In Section 7.1.6.1 (The cv-qualifiers) of the C++ standard:

A pointer or reference to a cv-qualified type need not actually point or refer to a cv-qualified object, but it is treated as if it does.

Volatile is a hint to the implementation to avoid aggressive optimization involving the object because the value of the object might be changed by means undetectable by an implementation. See 1.9 for detailed semantics. In general, the semantics of volatile are intended to be the same in C++ as they are in C.