Rdtsc and Synchronization

(All of the following measurements were made on ramcloud3 around October 27, 2010)

Rdtsc

Experiment: make 2 calls to rdtsc back-to-back, compute difference in times.

Time: 24 cycles (10ns)

Notes: when I first measured this, the time was 36 cycles. However, when the measurement was repeated in a loop, the cost dropped to 24 cycles after about 2600-2700 iterations. When I came back a day later, the cost was consistently 24 cycles. Current theory: during the first measurements the machine was unloaded; perhaps energy management causes the processor to run more slowly until it has been active for a while, after which the clock speeds up. On the second day there were other users on the machine, so perhaps the processor never entered energy-management mode.
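
For reference, a minimal sketch of the back-to-back measurement (assuming GCC on x86-64; __rdtsc is the compiler intrinsic from x86intrin.h):

    #include <x86intrin.h>   /* __rdtsc */
    #include <stdio.h>

    int main() {
        /* Two back-to-back timestamp reads; the difference is the
         * cost of rdtsc itself. */
        unsigned long long t1 = __rdtsc();
        unsigned long long t2 = __rdtsc();
        printf("rdtsc-to-rdtsc: %llu cycles\n", t2 - t1);
        return 0;
    }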

Synchronization with shared memory

Experiment: two threads loop over access to a single shared memory location. The first thread waits for the location to be 0, then sets it to 1; the second thread does the reverse. Several iterations were timed, and the average round-trip time was computed.

Time: 208 cycles (87ns)

Notes: it looks like the system pays the cost of a full last-level cache miss (~40ns) each time ownership of the shared memory location switches from one thread's cache to the other.
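
A sketch of the two spin loops (iteration count and rdtsc timing elided; volatile keeps the compiler from caching the flag in a register):

    #include <pthread.h>

    volatile int flag = 0;   /* the shared location that bounces between caches */

    void *thread1(void *arg) {              /* waits for 0, sets 1 */
        for (int i = 0; i < 1000000; i++) {
            while (flag != 0) { /* spin */ }
            flag = 1;
        }
        return NULL;
    }

    void *thread2(void *arg) {              /* waits for 1, sets 0 */
        for (int i = 0; i < 1000000; i++) {
            while (flag != 1) { /* spin */ }
            flag = 0;
        }
        return NULL;
    }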

Synchronization with pthreads mutexes

Experiment: two threads loop using two mutexes. The first thread locks the first mutex and unlocks the second; the second thread locks the second mutex and unlocks the first. Many iterations were timed, and the average round-trip time was computed.

Time: varies from 500-2500 cycles with occasional outliers much higher (10000 cycles or more). The average over 1000 iterations is around 5000 cycles.
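
One way the ping-pong can be coded (a sketch; unlocking a mutex locked by another thread is technically undefined for default pthread mutexes, but the pattern works as a cross-thread signaling mechanism on Linux):

    #include <pthread.h>

    pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;  /* main locks m2 before the threads start,
                                                      * so thread 2 blocks first */

    void *func1(void *arg) {
        for (int i = 0; i < 10000; i++) {
            pthread_mutex_lock(&m1);     /* wait for thread 2's release */
            pthread_mutex_unlock(&m2);   /* let thread 2 proceed */
        }
        return NULL;
    }

    void *func2(void *arg) {
        for (int i = 0; i < 10000; i++) {
            pthread_mutex_lock(&m2);
            pthread_mutex_unlock(&m1);
        }
        return NULL;
    }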

Synchronization with pthreads mutexes (polling)

Experiment: same as above, except that while (pthread_mutex_trylock(...) != 0) is used instead of pthread_mutex_lock(...).

Time: 575 cycles (240ns)

Notes: in this experiment the round-trip times were much more consistent.
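
In each thread the blocking lock becomes a polling loop, roughly:

    while (pthread_mutex_trylock(&m1) != 0) {
        /* poll in user space rather than blocking in the kernel */
    }
    pthread_mutex_unlock(&m2);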

Synchronization with pthreads spin locks

Experiment: same as mutex experiment above, except using pthread_spinlock_t instead of pthread_mutex_t.

Time: 198 cycles (82ns)
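
A sketch of the change (pthread spin locks need explicit initialization, unlike the static mutex initializers):

    pthread_spinlock_t s1, s2;
    pthread_spin_init(&s1, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_init(&s2, PTHREAD_PROCESS_PRIVATE);
    ...
    pthread_spin_lock(&s1);      /* in the loop, replacing pthread_mutex_lock */
    pthread_spin_unlock(&s2);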

Synchronization with pthreads condition variables

Experiment: same as the first mutex experiment above, except using condition variables. The code for the first thread is below (the code for the second thread is the converse).

#include <pthread.h>

/* Declarations assumed (the original shows only func1): */
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t c1 = PTHREAD_COND_INITIALIZER;
pthread_cond_t c2 = PTHREAD_COND_INITIALIZER;
int owner = 0;                 /* whose turn just ran; initial value assumed */

void *func1(void *arg) {
    pthread_mutex_lock(&mutex);
    while (1) {                /* runs for the duration of the benchmark */
        while (owner == 1) {
            pthread_cond_wait(&c1, &mutex);   /* sleep until thread 2's turn ends */
        }
        owner = 1;
        pthread_cond_signal(&c2);             /* wake thread 2 */
    }
    pthread_mutex_unlock(&mutex);             /* unreachable as written */
    return NULL;
}

Time per round-trip: 11900 cycles (4950ns)

Notes: there is a lot of variation in these times.

Additional measurements made December 10-25, 2010

Synchronization with Linux pipes

Experiment: similar to those above, except that 1 byte is passed back and forth through 2 pipes for communication.

Time: 23900 cycles (9987ns)
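
A sketch of the round trip (error handling omitted; p1 carries the byte one way, p2 the other):

    #include <unistd.h>

    int p1[2], p2[2];
    pipe(p1);
    pipe(p2);

    /* Thread 1: one round trip. */
    char b = 'x';
    write(p1[1], &b, 1);   /* ping */
    read(p2[0], &b, 1);    /* wait for the echo */

    /* Thread 2 does the converse: read(p1[0], ...), then write(p2[1], ...). */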

Using boost::thread to spin up a thread

Experiment: time the following code (where func1 does nothing):

boost::thread thread(func1);
thread.join();

Time: ranges from 14-120 us, with average times of 17-18 us.

Synchronization with boost::mutex

Experiment: 2 threads, each unlocking a mutex required by the other thread, then locking its own. Measured the round-trip time.

Time: wide variation (65ns - 40 us); average time 8-9 microseconds

Synchronization with boost::mutex polling

Experiment: 2 threads, each unlocking a mutex required by the other thread, then locking its own;
used try_lock to avoid OS context switches. Measured the round-trip time.

Time: wide variation (90ns - 5 us); average time 145ns
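
The polling step with boost looks roughly like this (try_lock returns immediately instead of descheduling the thread):

    boost::mutex m1;
    ...
    while (!m1.try_lock()) {
        /* spin in user space; no OS context switch */
    }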

Pthreads spin locks (again)

Repeated experiment from above using pthread_spinlock_t, but with a slightly different measurement approach. Got 2 very different sets of results (consistent behavior within a single run, but different behaviors between runs).

"Good" runs: average time about 50ns individual round-trips varied from 35-80ns

"Bad" runs: average time about 180ns, individual round-trips varied from 90-280ns

After further measurements, it appears that the "good" runs occur when the 2 threads are hyperthreads sharing the same core, whereas in the "bad" runs they are allocated to different cores.
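
One way to control the placement (and test this theory) is to pin each thread to a specific CPU; a sketch using pthread_setaffinity_np, assuming the CPU numbers of two hyperthreads on the same core are known for this machine:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the given thread to one CPU. */
    void pin(pthread_t t, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(t, sizeof(set), &set);
    }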

Memspin2: shared-memory synchronization

Repeated the measurements above using shared memory, but with some additional controls (memspin2):

  • There is a master and either 1 or 2 slaves. In the single-slave case, the slave can be on the same core as the master or on a different core. In the two-slave case the master alternates pings between the two slaves; one is on the same core as the master and the other is on a different core.
  • In some measurements __sync_synchronize is invoked just before setting the shared flag variable.
  • In some measurements rdtscp is invoked after each ping as part of the timing. This makes it possible to measure the ping time separately for each slave.
  • In all measurements pause is invoked in the spin loop; this speeds up the case where both threads are running on the same core but has no impact on the cross-core case. (A sketch of the spin loop appears after this list.)
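
A sketch of the slave's spin loop under these controls (variable names are illustrative; memspin2's actual code may differ):

    #include <x86intrin.h>   /* _mm_pause, __rdtscp */

    volatile int flag;

    /* Slave: wait for the master's ping, then respond. */
    while (flag != 1) {
        _mm_pause();             /* "pause": relax the pipeline while spinning */
    }
    __sync_synchronize();        /* full barrier (only in some variants) */
    flag = 0;                    /* the response the master spins on */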

Here is the average ping time for each scenario. In the two-slave case the ping time is the total time to ping both slaves (if rdtscp was enabled, this total is broken down between the two slaves).

__sync_synchronize?   rdtscp?   Same core   Different core   Both
No                    No        19ns        83ns             138ns
No                    Yes       34ns        88ns             166ns (45ns/121ns)
Yes                   No        29ns        86ns             161ns
Yes                   Yes       45ns        136ns            198ns (58ns/140ns)

bmutexNonBlocking: acquire/release lock

Time to execute the following code: about 20ns for a lock/unlock pair.

    boost::mutex m;
    ...
    m.lock();
    m.unlock();