Rdtsc and Synchronization
(All of the following measurements were made on ramcloud3 around October 27, 2010)
Rdtsc
Experiment: make 2 calls to rdtsc back-to-back, compute difference in times.
Time: 24 cycles (10ns)
Notes: when I first measured this the time was 36 cycles. However, if the measurement was repeated in a loop, after about 2600-2700 iterations the cost dropped to 24 cycles. When I came back a day later, the cost was consistently 24 cycles. Current theory: during the first measurements the machine was unloaded; perhaps energy-management causes the processor to run more slowly until it has been active a while, then the clock speeds up? On the second day there were other users on the machine so perhaps the processor never entered energy-management mode.
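For reference, here is a minimal sketch of the back-to-back measurement (not the original benchmark code), using GCC's __rdtsc intrinsic:

#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    unsigned long long t1 = __rdtsc();    // first timestamp
    unsigned long long t2 = __rdtsc();    // second timestamp, taken immediately after
    printf("rdtsc overhead: %llu cycles\n", t2 - t1);
    return 0;
}

As described in the notes, this would be repeated many times in a loop and the results averaged.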
Synchronization with shared memory
Experiment: two threads ping-pong on a single shared memory location: the first thread waits for the location to be 0, then sets it to 1; the second thread does the reverse. Several iterations were timed, and the average round-trip time was computed.
Time: 208 cycles (87ns)
Notes: it looks like the system pays the cost for a full last-level cache miss (~40ns) each time ownership of the shared memory location switches from one thread's cache to the other.
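A sketch of the kind of program used here (the original code is not shown on this page; the iteration count and the use of a volatile flag are assumptions):

#include <pthread.h>
#include <stdio.h>
#include <x86intrin.h>

#define ITERATIONS 1000000
static volatile int flag = 0;

static void *pong(void *arg) {                 // second thread: waits for 1, sets to 0
    for (int i = 0; i < ITERATIONS; i++) {
        while (flag != 1) ;                    // spin
        flag = 0;
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, pong, NULL);
    unsigned long long start = __rdtsc();
    for (int i = 0; i < ITERATIONS; i++) {     // first thread: waits for 0, sets to 1
        while (flag != 0) ;                    // spin
        flag = 1;
    }
    unsigned long long stop = __rdtsc();
    pthread_join(t, NULL);
    printf("%llu cycles per round trip\n", (stop - start) / ITERATIONS);
    return 0;
}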
Synchronization with pthreads mutexes
Experiment: two threads loop using two mutexes. The first thread locks the first mutex and unlocks the second; the second thread waits for the second mutex, then unlocks the first. Many iterations were timed, and the average round-trip time was computed.
Time: varies from 500-2500 cycles with occasional outliers much higher (10000 cycles or more). The average over 1000 iterations is around 5000 cycles.
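A sketch of the handoff structure described above (assumptions: the iteration count and the initial locking of the second mutex; note that unlocking a mutex acquired by another thread is technically undefined for default pthread mutexes, but it is the handoff pattern described here):

#include <pthread.h>

#define ITERATIONS 1000
static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

static void *func1(void *arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&m1);      // wait until the second thread releases m1
        pthread_mutex_unlock(&m2);    // let the second thread proceed
    }
    return NULL;
}

static void *func2(void *arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&m2);
        pthread_mutex_unlock(&m1);
    }
    return NULL;
}

int main(void) {
    pthread_mutex_lock(&m2);          // m2 starts out held so func2 blocks until func1 runs
    pthread_t t1, t2;
    pthread_create(&t1, NULL, func1, NULL);
    pthread_create(&t2, NULL, func2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}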
Synchronization with pthreads mutexes (polling)
Experiment: same as above, except that while (pthread_mutex_trylock(...) != 0) is used instead of pthread_mutex_lock(...).
Time: 575 cycles (240ns)
Notes: in this experiment the round-trip times were much more consistent.
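Concretely, the first thread's loop body becomes something like this (a sketch, reusing the m1/m2 names from the mutex sketch above):

while (pthread_mutex_trylock(&m1) != 0)
    ;                              // spin in user space instead of blocking in the kernel
pthread_mutex_unlock(&m2);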
Synchronization with pthreads spin locks
Experiment: same as the mutex experiment above, except using pthread_spinlock_t instead of pthread_mutex_t.
Time: 198 cycles (82ns)
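The only change from the mutex sketch is the lock type and the initialization call; roughly (names illustrative):

pthread_spinlock_t s1, s2;
pthread_spin_init(&s1, PTHREAD_PROCESS_PRIVATE);
pthread_spin_init(&s2, PTHREAD_PROCESS_PRIVATE);
...
pthread_spin_lock(&s1);            // first thread: wait for the handoff
pthread_spin_unlock(&s2);          // release the lock the second thread is waiting for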
Synchronization with pthreads condition variables
Experiment: same as with first mutex experiment above, except using condition variables. The code for the first thread is below (the code for the second thread is the converse).
void *func1(void *arg) {
    pthread_mutex_lock(&mutex);
    while (1) {
        while (owner == 1) {
            pthread_cond_wait(&c1, &mutex);   // wait until the other thread takes ownership back
        }
        owner = 1;                            // claim ownership
        pthread_cond_signal(&c2);             // wake the other thread
    }
    pthread_mutex_unlock(&mutex);             // never reached: the loop above runs forever
    return NULL;
}
Time per round-trip: 11900 cycles (4950ns)
Notes: there is a lot of variation in these times.
Additional measurements made December 10-25, 2010
Synchronization with Linux pipes
Experiment: similar to those above, except pass 1 byte back and forth through 2 pipes for communication.
Time: 23900 cycles (9987ns)
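A sketch of the pipe round trip (the original program is not shown here; the thread structure and iteration count are assumptions):

#include <pthread.h>
#include <unistd.h>

#define ITERATIONS 100000
static int p1[2], p2[2];                 // p1: main -> helper, p2: helper -> main

static void *helper(void *arg) {
    char b;
    for (int i = 0; i < ITERATIONS; i++) {
        read(p1[0], &b, 1);              // wait for the ping
        write(p2[1], &b, 1);             // send the reply
    }
    return NULL;
}

int main(void) {
    pipe(p1);
    pipe(p2);
    pthread_t t;
    pthread_create(&t, NULL, helper, NULL);
    char b = 'x';
    for (int i = 0; i < ITERATIONS; i++) {
        write(p1[1], &b, 1);
        read(p2[0], &b, 1);              // one round trip per iteration
    }
    pthread_join(t, NULL);
    return 0;
}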
Using boost::thread to spin a thread
Experiment: time the following code (where func1 does nothing):
boost::thread thread(func1); thread.join();
Time: ranges from 14-120 us, with average times of 17-18 us.
Synchronization with boost::mutex
Experiment: 2 threads, each unlocking a mutex required by the other thread, then locking its own. Measured the round-trip time.
Time: wide variation (65ns - 40 us); average time 8-9 microseconds
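A sketch of one thread's loop in the boost::mutex version, assuming it mirrors the pthread two-mutex handoff above:

#include <boost/thread/mutex.hpp>

boost::mutex m1, m2;

void loop1() {
    for (int i = 0; i < 1000; i++) {
        m1.lock();                 // wait for the other thread's handoff
        m2.unlock();               // release the mutex the other thread is waiting for
    }
}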
Synchronization with boost::mutex polling
Experiment: 2 threads, each unlocking a mutex required by the other thread, then locking its own; used try_lock to avoid OS context switches. Measured the round-trip time.
Time: wide variation (90ns - 5 us); average time 145ns
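Roughly, the blocking lock() in the sketch above is replaced by a try_lock() polling loop:

while (!m1.try_lock())
    ;                              // poll in user space; avoids blocking in the kernel
m2.unlock();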
Pthreads spin locks (again)
Repeated the experiment from above using pthread_spinlock_t, but with a slightly different measurement approach. Got 2 very different sets of results (consistent behavior within a single run, but different behaviors between runs).
"Good" runs: average time about 50ns individual round-trips varied from 35-80ns
"Bad" runs: average time about 180ns, individual round-trips varied from 90-280ns
After further measurements, it appears that the "good" runs occur when the 2 threads are hyperthreads sharing the same core, whereas in the "bad" runs they are allocated to different cores.
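One way to control for this (not necessarily what was done in these runs) is to pin each thread to a specific logical CPU with pthread_setaffinity_np; the pinToCpu helper below is just for illustration:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to logical CPU 'cpu' (Linux-specific).
static int pinToCpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}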
Memspin2: shared-memory synchronization
Repeated the measurements above using shared memory, but with some additional controls (memspin2):
- There is a master and either 1 or 2 slaves. In the single-slave case, the slave can be on the same core as the master, or on a different core. In the two-slave case the master alternates pings between the two slaves; one is on the same core as the master and the other is on a different core.
- In some measurements __sync_synchronize is invoked just before setting the shared flag variable.
- In some measurements rdtscp is invoked after each ping as part of the timing. This makes it possible to measure the ping time separately for each slave.
- In all measurements pause is invoked in the spin loop; this speeds up the case where both threads are running on the same core but has no impact on the cross-core case.
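A sketch of what the master's inner loop looks like with these options (variable and function names are illustrative, not taken from memspin2):

#include <stdint.h>
#include <x86intrin.h>

volatile int flag;                     // shared with the slave, which resets it to 0

static inline uint64_t ping(int useFence) {
    if (useFence)
        __sync_synchronize();          // full memory barrier just before setting the shared flag
    flag = 1;                          // ping the slave
    while (flag != 0)
        _mm_pause();                   // "pause" in the spin loop
    unsigned aux;
    return __rdtscp(&aux);             // optional per-ping timestamp (rdtscp)
}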
Here is the average ping time for each scenario. In the case where there were 2 slaves, the ping time is the total to ping both of the slaves (if rdtscp was enabled then this time is broken down between the two slaves).
__sync_synchronize? | rdtscp? | Same core | Different core | Both
---|---|---|---|---
No | No | 19ns | 83ns | 138ns
No | Yes | 34ns | 88ns | 166ns (45ns/121ns)
Yes | No | 29ns | 86ns | 161ns
Yes | Yes | 45ns | 136ns | 198ns (58ns/140ns)
bmutexNonBlocking: acquire/release lock
Time to execute the following code: about 20ns for a lock/unlock pair.
boost::mutex m; ... m.lock(); m.unlock();