Mfence

This page describes the semantics of the x86 memory-fence instructions mfence, sfence, and lfence, which are used during synchronization to ensure that memory modifications made in one thread become visible to other threads. The information below was taken from the Intel instruction set manuals.

mfence

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.

Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue, speculative reads, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The MFENCE instruction provides a performance-efficient way of ensuring load and store ordering between routines that produce weakly-ordered results and routines that consume that data.
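As a rough illustration of the store-then-load case that MFENCE covers, the C sketch below shows a Dekker-style handshake: each thread raises its own flag, issues a full fence, and only then reads the other thread's flag. The names full_fence, try_enter, and leave are hypothetical and not taken from the Intel manuals. The inline assembly assumes GCC or Clang on x86; on other targets the sketch falls back to a C11 fence.

```c
#include <stdatomic.h>

/* Hypothetical helper: a full load+store fence.
 * Assumes GCC/Clang inline-assembly syntax on x86; elsewhere it falls back
 * to a C11 sequentially consistent fence. */
static inline void full_fence(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("mfence" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* flag[i] != 0 means thread i wants to enter the critical section. */
static atomic_int flag[2];

/* Returns 1 if thread `me` (0 or 1) may enter, 0 if the other thread has
 * already raised its flag. */
int try_enter(int me) {
    /* Announce intent with a plain store. */
    atomic_store_explicit(&flag[me], 1, memory_order_relaxed);

    /* Without a full fence the store above may still sit in the store
     * buffer while the load below executes, so both threads could read 0. */
    full_fence();

    if (atomic_load_explicit(&flag[1 - me], memory_order_acquire) != 0) {
        /* The other thread got there first: back off. */
        atomic_store_explicit(&flag[me], 0, memory_order_relaxed);
        return 0;
    }
    return 1;
}

/* Lowers the flag of thread `me` when it leaves the critical section. */
void leave(int me) {
    atomic_store_explicit(&flag[me], 0, memory_order_release);
}
```

A caller would wrap its critical section as `if (try_enter(id)) { ...; leave(id); }`. The sketch only demonstrates where the fence goes; it is not a complete or fair lock.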

sfence

Performs a serializing operation on all store-to-memory instructions that were issued prior to the SFENCE instruction. This serializing operation guarantees that every store instruction that precedes the SFENCE instruction in program order becomes globally visible before any store instruction that follows the SFENCE instruction. The SFENCE instruction is ordered with respect to store instructions, other SFENCE instructions, any LFENCE and MFENCE instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to load instructions.

Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The SFENCE instruction provides a performance-efficient way of ensuring store ordering between routines that produce weakly-ordered results and routines that consume this data.
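One common situation where a producer creates weakly ordered results is the use of non-temporal (streaming) stores. The C sketch below fills a buffer with such stores, issues SFENCE, and only then sets a ready flag. The names payload, ready, produce, and consume_sum are hypothetical. The intrinsics _mm_stream_si32 and _mm_sfence come from <emmintrin.h> and are assumed to be available (GCC or Clang with SSE2); otherwise the sketch falls back to ordinary stores.

```c
#include <stdatomic.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32, _mm_sfence */
#endif

#define PAYLOAD_LEN 16

static int payload[PAYLOAD_LEN];  /* data written by the producer */
static atomic_int ready;          /* set to 1 once payload is complete */

/* Producer: writes base, base+1, ... into payload, then publishes it. */
void produce(int base) {
    for (size_t i = 0; i < PAYLOAD_LEN; i++) {
#if defined(__SSE2__)
        /* Non-temporal store: weakly ordered with respect to other stores. */
        _mm_stream_si32(&payload[i], base + (int)i);
#else
        payload[i] = base + (int)i;
#endif
    }
#if defined(__SSE2__)
    /* Make the streaming stores globally visible before the flag store. */
    _mm_sfence();
#endif
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: returns the sum of the payload, or -1 if it is not ready yet. */
int consume_sum(void) {
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
        return -1;
    }
    int sum = 0;
    for (size_t i = 0; i < PAYLOAD_LEN; i++) {
        sum += payload[i];
    }
    return sum;
}
```

In a real program produce and consume_sum would run on different threads; the fence placement is the point of the sketch, not the data layout.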

lfence

Performs a serializing operation on all load-from-memory instructions that were issued prior to the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. In particular, an instruction that loads from memory and that precedes an LFENCE receives data from memory prior to completion of the LFENCE. (An LFENCE that follows an instruction that stores to memory might complete before the data being stored have become globally visible.) Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue and speculative reads. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The LFENCE instruction provides a performance-efficient way of ensuring load ordering between routines that produce weakly-ordered results and routines that consume that data.
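Because no later instruction begins execution until LFENCE completes, one frequent use is to bracket a timestamp read so that the measured work cannot drift across it. The C sketch below is an illustration under assumptions, not a method taken from the Intel text: fenced_timestamp and timed_sum are hypothetical names, the inline assembly assumes GCC or Clang on x86, and other targets fall back to clock(), which has a different unit and resolution.

```c
#include <stddef.h>
#include <stdint.h>
#if !(defined(__x86_64__) || defined(__i386__))
#include <time.h>
#endif

/* Hypothetical helper: reads a timestamp that is fenced on both sides. */
static inline uint64_t fenced_timestamp(void) {
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    /* The first LFENCE waits for earlier instructions to complete locally;
     * the second keeps later instructions from starting before RDTSC. */
    __asm__ __volatile__("lfence\n\trdtsc\n\tlfence"
                         : "=a"(lo), "=d"(hi)
                         :
                         : "memory");
    return ((uint64_t)hi << 32) | lo;
#else
    /* Fallback for non-x86 targets: processor time, not cycle counts. */
    return (uint64_t)clock();
#endif
}

/* Sums n values and stores the elapsed timestamp difference in *ticks. */
long timed_sum(const int *values, size_t n, uint64_t *ticks) {
    uint64_t start = fenced_timestamp();
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += values[i];
    }
    uint64_t end = fenced_timestamp();
    *ticks = end - start;
    return sum;
}
```

The tick count is only meaningful as a relative measure on one machine, and it can be distorted if the thread migrates between cores during the measurement.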

Resources

No references were found on the usage of x86 memory fences.
General reference for memory ordering (a.k.a. memory consistency), recommended in "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson:
http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
x86 (IA-32) memory ordering
1. An analysis is given in "A Better x86 Memory Model: x86-TSO":
  http://www.springerlink.com/content/f7717l1275624610/
2. The x86 (IA-32) memory ordering looks somewhat confusing in Table 1, "Summary of Memory Ordering", in:
  http://www.ee.ryerson.ca/~courses/coe518/LinuxJournal/elj2005-136-memoryordering1.pdf
  Memory ordering is progressively relaxed along SC->TSO->PSO->WO. x86 still keeps W->W order even though R->RW order is relaxed. Atomic operations on x86 also seem to act as memory fences, which differs from other CPU architectures.
3. The memory ordering of x86 is also discussed in section 2 of "A Better x86 Memory Model: x86-TSO":
  http://www.springerlink.com/content/f7717l1275624610/
  Note that the memory consistency models of Intel and AMD are not identical.
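The remark in item 2 that x86 atomic operations also act as fences can be illustrated with a small spinlock that contains no explicit fence instruction. This is a sketch under assumptions: lock_word, shared_counter, try_lock, unlock, and locked_increment are hypothetical names, and the claim that compilers typically implement the test-and-set with an implicitly locked XCHG applies to common x86 compilers rather than being guaranteed by the C standard.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag lock_word = ATOMIC_FLAG_INIT;
static int shared_counter;  /* protected by lock_word */

/* Tries to take the lock once; returns true on success.
 * On x86 the test-and-set is typically an XCHG, which is implicitly
 * locked and therefore also orders surrounding loads and stores. */
bool try_lock(void) {
    return !atomic_flag_test_and_set_explicit(&lock_word,
                                              memory_order_acquire);
}

/* Releases the lock. On x86 a release store is an ordinary store,
 * since stores are not reordered with earlier loads or stores. */
void unlock(void) {
    atomic_flag_clear_explicit(&lock_word, memory_order_release);
}

/* Increments the shared counter under the lock and returns the new value. */
int locked_increment(void) {
    while (!try_lock()) {
        /* Spin until the lock becomes free. */
    }
    int value = ++shared_counter;
    unlock();
    return value;
}
```

On architectures with weaker ordering the same source compiles to additional barrier or acquire/release instructions, which is the difference the note above points at.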
MIPS R10000
1. The R10k guarantees sequential consistency (SC), using a speculative execution mechanism, for software compatibility. The design seems successful, although it might limit wider instruction issue with more than two load/store pipelines. Wider instruction issue is uncommon because of the diminishing returns of instruction-level parallelism.
  Hill also recommends R10000's approach:
  http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=707614;
  (The paper can be downloaded for free via Google Scholar.)
2. The R10k has a 'sync' instruction. It only waits for the external memory (write) buffers to drain and synchronizes with a special uncached mode named 'uncached accelerated', which is mainly used to speed up command streams to peripherals such as a graphics engine.