Rdtsc and Synchronization

(All of the following measurements were made on ramcloud3 around October 27, 2010)

Rdtsc

Experiment: make 2 calls to rdtsc back-to-back, compute difference in times.

Time: 24 cycles (10ns)

Notes: when I first measured this the time was 36 cycles. However, if the measurement was repeated in a loop, after about 2600-2700 iterations the cost dropped to 24 cycles. When I came back a day later, the cost was consistently 24 cycles. Current theory: during the first measurements the machine was unloaded; perhaps energy-management causes the processor to run more slowly until it has been active a while, then the clock speeds up? On the second day there were other users on the machine so perhaps the processor never entered energy-management mode.
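A minimal sketch of how such a back-to-back measurement might be written, assuming an x86 machine and a GCC or Clang compiler that provides __rdtsc() in x86intrin.h. The function names, and the choice to report the smallest observed gap instead of an average, are illustrative assumptions, not the original test program. The 2.4 GHz figure used in the usage note is inferred from the "24 cycles (10ns)" result above.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(); x86 only, GCC/Clang */

/* Smallest observed gap, in cycles, between two back-to-back rdtsc reads.
 * Taking the minimum filters out interrupts and thread migrations. */
uint64_t min_rdtsc_delta(int iterations) {
    assert(iterations > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < iterations; i++) {
        uint64_t t0 = __rdtsc();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best) {
            best = t1 - t0;
        }
    }
    return best;
}

/* Convert a cycle count to nanoseconds for an assumed clock rate in GHz. */
double cycles_to_ns(uint64_t cycles, double ghz) {
    return (double)cycles / ghz;
}
```

Calling min_rdtsc_delta(10000) and passing the result to cycles_to_ns(..., 2.4) would give a figure comparable to the one reported above. Running enough iterations matters, given the warm-up effect described in the notes.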

Synchronization with shared memory

Experiment: two threads loop over access to a single shared memory location. The first thread waits for the location to be 0, then sets it to 1; the second thread does the reverse. Several iterations were timed, and the average round-trip time was computed.

Time: 208 cycles (87ns)

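Notes: it looks like the system pays the cost for a full last-level cache miss (~40ns) each time ownership of the shared memory location switches from one thread's cache to the other. A round trip involves two such ownership transfers, which would account for roughly 80ns of the 87ns measured.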
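The ping-pong described above could be sketched as follows. This is an assumption-laden reconstruction, not the original program: it uses C11 atomics for the shared location, clock_gettime instead of rdtsc for timing, and hypothetical names (flag, ponger, shared_flag_round_trip_ns). The measured time includes the start-up of the second thread, so a large round count is needed for a meaningful average.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int flag;       /* the single shared memory location */
static int rounds_to_run;     /* written before the second thread starts */

/* Second thread: wait for the location to be 1, then set it back to 0. */
static void *ponger(void *arg) {
    (void)arg;
    for (int i = 0; i < rounds_to_run; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) { }
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

/* First thread: wait for 0, then set to 1. Returns average round trip in ns,
 * or -1.0 if the second thread could not be created. */
double shared_flag_round_trip_ns(int rounds) {
    assert(rounds > 0);
    pthread_t t;
    struct timespec t0, t1;
    atomic_store(&flag, 0);
    rounds_to_run = rounds;
    if (pthread_create(&t, NULL, ponger, NULL) != 0) {
        return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0) { }
        atomic_store_explicit(&flag, 1, memory_order_release);
    }
    /* Wait for the final reply so that every round trip is complete. */
    while (atomic_load_explicit(&flag, memory_order_acquire) != 0) { }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / rounds;
}
```

Both threads spin without yielding, so the sketch only gives sensible numbers when the two threads run on separate hardware threads.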

Synchronization with pthreads mutexes

Experiment: two threads loop using two mutexes. The first thread locks the first mutex and unlocks the second; the second thread waits for the second mutex, then unlocks the first. Many iterations were timed, and the average round-trip time was computed.

Time: varies from 500-2500 cycles with occasional outliers much higher (10000 cycles or more). The average over 1000 iterations is around 5000 cycles.

Synchronization with pthreads mutexes (polling)

Experiment: same as above, except that while (pthread_mutex_trylock(...) != 0) is used instead of pthread_mutex_lock(...).

Time: 575 cycles (240ns)

Notes: in this experiment the round-trip times were much more consistent.

Synchronization with pthreads spin locks

Experiment: same as mutex experiment above, except using pthread_spinlock_t instead of pthread_mutex_t.

Time: 198 cycles (82ns)

Synchronization with pthreads condition variables

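Experiment: same as the first mutex experiment above, except using condition variables. The code for the first thread is below (the code for the second thread is the converse).

void *func1(void *arg) {
    /* The mutex is held for the whole loop; pthread_cond_wait releases it
       while this thread sleeps and reacquires it before returning. */
    pthread_mutex_lock(&mutex);
    while (1) {
        /* Wait until the other thread has taken ownership. */
        while (owner == 1) {
            pthread_cond_wait(&c1, &mutex);
        }
        /* Take ownership back and wake the other thread. */
        owner = 1;
        pthread_cond_signal(&c2);
    }
    /* Not reached: the benchmark loop never exits. */
    pthread_mutex_unlock(&mutex);
    return NULL;
}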

Time per round-trip: 11900 cycles (4950ns)

Notes: there is a lot of variation in these times.
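For reference, a bounded and self-contained variant of the experiment might look like the sketch below. It reuses the names mutex, c1, c2 and owner from the code above. Everything else is an assumption made so that the example terminates and can be timed: the second thread's body, the initial value of owner, the fixed round count, and the use of clock_gettime.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t c1 = PTHREAD_COND_INITIALIZER;
static pthread_cond_t c2 = PTHREAD_COND_INITIALIZER;
static int owner;          /* 1 or 2: which thread took ownership last */
static int rounds_total;   /* written before the second thread starts */

/* Second thread: the converse of the first thread's loop. */
static void *cv_thread2(void *arg) {
    (void)arg;
    pthread_mutex_lock(&mutex);
    for (int i = 0; i < rounds_total; i++) {
        while (owner == 2) {
            pthread_cond_wait(&c2, &mutex);
        }
        owner = 2;
        pthread_cond_signal(&c1);
    }
    pthread_mutex_unlock(&mutex);
    return NULL;
}

/* Runs the first thread's loop in the caller and returns the average
 * round-trip time in ns, or -1.0 if thread creation fails. */
double condvar_round_trip_ns(int rounds) {
    assert(rounds > 0);
    pthread_t t2;
    struct timespec t0, t1;
    owner = 2;             /* assumed start state: first thread moves first */
    rounds_total = rounds;
    if (pthread_create(&t2, NULL, cv_thread2, NULL) != 0) {
        return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_mutex_lock(&mutex);
    for (int i = 0; i < rounds; i++) {
        while (owner == 1) {
            pthread_cond_wait(&c1, &mutex);
        }
        owner = 1;
        pthread_cond_signal(&c2);
    }
    pthread_mutex_unlock(&mutex);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / rounds;
}
```

Each hand-off normally puts one thread to sleep and wakes the other through the kernel, which is consistent with the microsecond-scale round trips reported above.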

Additional measurements made December 10-25, 2010

Synchronization with Linux pipes

Experiment: similar to those above, except pass 1 byte back and forth through 2 pipes for communication.

Time: 23900 cycles (9987ns)
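One way the pipe experiment might be set up is sketched below. It assumes the two endpoints are threads in one process (the notes do not say whether threads or processes were used), and the names echo_args, echo_peer and pipe_round_trip_ns are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

struct echo_args { int read_fd; int write_fd; int rounds; };

/* Peer thread: read one byte from one pipe, send it back on the other. */
static void *echo_peer(void *p) {
    struct echo_args *a = p;
    char byte;
    for (int i = 0; i < a->rounds; i++) {
        if (read(a->read_fd, &byte, 1) != 1) break;
        if (write(a->write_fd, &byte, 1) != 1) break;
    }
    return NULL;
}

/* Returns the average round-trip time in ns, or -1.0 on any failure. */
double pipe_round_trip_ns(int rounds) {
    assert(rounds > 0);
    int to_peer[2], from_peer[2];
    if (pipe(to_peer) != 0) return -1.0;
    if (pipe(from_peer) != 0) {
        close(to_peer[0]); close(to_peer[1]);
        return -1.0;
    }
    struct echo_args args = { to_peer[0], from_peer[1], rounds };
    pthread_t peer;
    double result = -1.0;
    if (pthread_create(&peer, NULL, echo_peer, &args) == 0) {
        struct timespec t0, t1;
        char byte = 'x';
        int ok = 1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < rounds && ok; i++) {
            ok = write(to_peer[1], &byte, 1) == 1 &&
                 read(from_peer[0], &byte, 1) == 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(to_peer[1]);          /* unblocks the peer if we stopped early */
        pthread_join(peer, NULL);
        if (ok) {
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            result = ns / rounds;
        }
    } else {
        close(to_peer[1]);
    }
    close(to_peer[0]); close(from_peer[0]); close(from_peer[1]);
    return result;
}
```

Every round trip here costs four system calls plus two thread wake-ups, which is one plausible reason for pipes being the slowest mechanism measured.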

Using boost::thread to spawn a thread

Experiment: time the following code (where func1 does nothing):

boost::thread thread(func1);
thread.join();

Time: ranges from 14-120 us, with average times of 17-18 us.
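For comparison, an analogous create/join measurement can be written in plain C. Note that this sketch swaps in pthread_create and pthread_join in place of boost::thread, so it is not the code that produced the numbers above. The names do_nothing and thread_spawn_join_avg_us are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Thread body that does nothing, like func1 in the experiment above. */
static void *do_nothing(void *arg) {
    return arg;
}

/* Average time in microseconds to create and join one thread,
 * or -1.0 if thread creation fails. */
double thread_spawn_join_avg_us(int trials) {
    assert(trials > 0);
    struct timespec t0, t1;
    double total_us = 0.0;
    for (int i = 0; i < trials; i++) {
        pthread_t t;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pthread_create(&t, NULL, do_nothing, NULL) != 0) {
            return -1.0;
        }
        pthread_join(t, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }
    return total_us / trials;
}
```

Timing each trial separately, as done here, also makes it easy to record the minimum and maximum if the spread is of interest.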

Synchronization with boost::mutex

Experiment: 2 threads, each unlocking a mutex required by the other thread, then locking its own. Measured the round-trip time.

Time: wide variation (65ns - 40 us); average time 8-9 us

Synchronization with boost::mutex polling

Experiment: 2 threads, each unlocking a mutex required by the other thread, then locking its own; try_lock was used instead of lock to avoid OS context switches. Measured the round-trip time.

Time: wide variation (90ns - 5 us); average time 145ns

Pthreads spin locks (again)

Repeated experiment from above using pthread_spinlock_t, but with a slightly different measurement approach. Got 2 very different sets of results (consistent behavior within a single run, but different behaviors between runs).

"Good" runs: average time about 50ns, individual round-trips varied from 35-80ns

"Bad" runs: average time about 180ns, individual round-trips varied from 90-280ns

After further measurements, it appears that the "good" runs are when the 2 threads are hyperthreads sharing the same core, whereas in the "bad" runs they are allocated to different cores.

Memspin2: shared-memory synchronization

Repeated the measurements above using shared memory, but with some additional controls (memspin2):

There is a master and either 1 or 2 slaves. In the single-slave case, the slave can be on the same core as the master, or on a different core. In the two-slave case the master alternates pings between the two slaves; one is on the same core as the master and the other is on a different core.
In some measurements __sync_synchronize is invoked just before setting the shared flag variable.
In some measurements rdtscp is invoked after each ping as part of the timing. This makes it possible to measure the ping time separately for each slave.
In all measurements pause is invoked in the spin loop; this speeds up the case where both threads are running on the same core but has no impact on the cross-core case.

Here is the average ping time for each scenario. In the case where there were 2 slaves, the ping time is the total to ping both of the slaves (if rdtscp was enabled then this time is broken down between the two slaves).

__sync_synchronize?	rdtscp?	Same core	Different Core	Both
No	No	19ns	83ns	138ns
No	Yes	34ns	88ns	166ns (45ns/121ns)
Yes	No	29ns	86ns	161ns
Yes	Yes	45ns	136ns	198ns (58ns/140ns)
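The placement of pause and __sync_synchronize in such a ping loop might look like the following single-slave sketch. It is not the memspin2 source: thread-to-core pinning and the rdtscp timing are omitted, the shared flag uses C11 atomics, timing uses clock_gettime, and all names are hypothetical. The pause instruction is issued through the GCC/Clang builtin __builtin_ia32_pause and compiled out on non-x86 targets.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int ping_flag;   /* 1 = ping outstanding, 0 = answered */
static atomic_int stop_flag;   /* set by the master when it is done */

/* Spin-loop hint; a no-op where the x86 pause instruction is unavailable. */
static inline void spin_pause(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

/* Slave: answer each ping by clearing the flag. */
static void *slave_loop(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&stop_flag, memory_order_acquire)) {
        if (atomic_load_explicit(&ping_flag, memory_order_acquire) == 1) {
            atomic_store_explicit(&ping_flag, 0, memory_order_release);
        } else {
            spin_pause();
        }
    }
    return NULL;
}

/* Master: returns average ping time in ns, or -1.0 on failure.
 * use_barrier selects whether __sync_synchronize precedes each flag write. */
double memspin_ping_avg_ns(int pings, int use_barrier) {
    assert(pings > 0);
    pthread_t slave;
    struct timespec t0, t1;
    atomic_store(&ping_flag, 0);
    atomic_store(&stop_flag, 0);
    if (pthread_create(&slave, NULL, slave_loop, NULL) != 0) {
        return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < pings; i++) {
        if (use_barrier) {
            __sync_synchronize();   /* full barrier just before the write */
        }
        atomic_store_explicit(&ping_flag, 1, memory_order_release);
        while (atomic_load_explicit(&ping_flag, memory_order_acquire) != 0) {
            spin_pause();
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    atomic_store_explicit(&stop_flag, 1, memory_order_release);
    pthread_join(slave, NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / pings;
}
```

Reproducing the same-core and different-core columns of the table would additionally require pinning the two threads to chosen hardware threads, which this sketch leaves to the scheduler.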

bmutexNonBlocking: acquire/release lock

Time to execute the following code: about 20ns for a lock/unlock pair.

    boost::mutex m;
    ...
    m.lock();
    m.unlock();
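An uncontended lock/unlock pair can be timed in the same way from C. This sketch substitutes pthread_mutex_t for boost::mutex, so it illustrates the measurement technique and is not the bmutexNonBlocking program itself. The function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Average time in ns for one uncontended lock/unlock pair. */
double mutex_lock_unlock_avg_ns(int iterations) {
    assert(iterations > 0);
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_mutex_destroy(&m);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / iterations;
}
```

With only one thread involved the lock is never contended, so no system call is expected on the fast path; the result reflects only the atomic operations inside the mutex.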