(All of the following measurements were made on ramcloud3 around October 27, 2010)
Experiment: make 2 calls to rdtsc back-to-back, compute difference in times.
Time: 24 cycles (10ns)
Notes: when I first measured this the time was 36 cycles. However, if the measurement was repeated in a loop, after about 2600-2700 iterations the cost dropped to 24 cycles. When I came back a day later, the cost was consistently 24 cycles. Current theory: during the first measurements the machine was unloaded; perhaps energy-management causes the processor to run more slowly until it has been active a while, then the clock speeds up? On the second day there were other users on the machine so perhaps the processor never entered energy-management mode.
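The back-to-back rdtsc measurement can be sketched as below (a minimal reconstruction, not the original benchmark code; the function name is ours). The warm-up loop addresses the effect described above, where the clock appears slower until the processor has been busy for a while:

```cpp
#include <cstdint>
#include <x86intrin.h>   // __rdtsc

// Cost of two back-to-back rdtsc reads, in cycles.
uint64_t rdtsc_pair_cost() {
    // Warm up so energy management brings the clock up to full speed
    // (the 36-cycle vs. 24-cycle effect described above).
    for (int i = 0; i < 5000; i++) {
        (void)__rdtsc();
    }
    uint64_t start = __rdtsc();
    uint64_t end = __rdtsc();
    return end - start;
}
```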
Experiment: two threads loop over access to a single shared memory location. The first thread waits for the location to be 0, then sets it to 1; the second thread does the reverse. Several iterations were timed, and the average round-trip time was computed.
Time: 208 cycles (87ns)
Notes: it looks like the system pays the cost for a full last-level cache miss (~40ns) each time ownership of the shared memory location switches from one thread's cache to the other.
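The shared-memory ping-pong can be sketched as follows (our reconstruction using C++11 atomics rather than the original code; the function name is an assumption). Each store transfers ownership of the cache line to the other thread, which is where the cache-miss cost above comes from:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Thread A waits for the flag to be 0, then sets it to 1; thread B does
// the reverse. Returns the average round-trip time in nanoseconds.
double shared_flag_roundtrip_ns(int iters) {
    std::atomic<int> flag(0);
    auto start = std::chrono::steady_clock::now();
    std::thread b([&] {
        for (int i = 0; i < iters; i++) {
            while (flag.load(std::memory_order_acquire) != 1) {}
            flag.store(0, std::memory_order_release);
        }
    });
    for (int i = 0; i < iters; i++) {
        while (flag.load(std::memory_order_acquire) != 0) {}
        flag.store(1, std::memory_order_release);
    }
    b.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    return double(ns) / iters;
}
```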
Experiment: two threads loop using two mutexes. The first thread locks the first mutex and unlocks the second; the second thread waits for the second mutex, then unlocks the first. Many iterations were timed, and the average round-trip time was computed.
Time: varies from 500-2500 cycles with occasional outliers much higher (10000 cycles or more). The average over 1000 iterations is around 5000 cycles.
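The handoff pattern above unlocks each mutex from a thread that did not lock it, which POSIX leaves undefined for default mutexes. This sketch substitutes two POSIX semaphores as handoff tokens (our substitution, plainly not the original code); the ping-pong communication pattern is the same:

```cpp
#include <chrono>
#include <thread>
#include <semaphore.h>

// Each thread waits on its own token, then posts the other thread's.
// Returns the average round-trip time in nanoseconds.
double semaphore_roundtrip_ns(int iters) {
    sem_t s1, s2;
    sem_init(&s1, 0, 1);   // thread 1 may run first
    sem_init(&s2, 0, 0);
    auto start = std::chrono::steady_clock::now();
    std::thread t2([&] {
        for (int i = 0; i < iters; i++) {
            sem_wait(&s2);   // wait for thread 1's handoff
            sem_post(&s1);   // hand control back
        }
    });
    for (int i = 0; i < iters; i++) {
        sem_wait(&s1);
        sem_post(&s2);
    }
    t2.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    sem_destroy(&s1);
    sem_destroy(&s2);
    return double(ns) / iters;
}
```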
Experiment: same as above, except that `while (pthread_mutex_trylock(...) != 0)` is used instead of `pthread_mutex_lock(...)`.
Time: 575 cycles (240ns)
Notes: in this experiment the round-trip times were much more consistent.
Experiment: same as the mutex experiment above, except using `pthread_spinlock_t` instead of `pthread_mutex_t`.
Time: 198 cycles (82ns)
Experiment: same as with first mutex experiment above, except using condition variables. The code for the first thread is below (the code for the second thread is the converse).
```c
void *func1(void *arg) {
    pthread_mutex_lock(&mutex);
    while (1) {
        while (owner == 1) {
            pthread_cond_wait(&c1, &mutex);
        }
        owner = 1;
        pthread_cond_signal(&c2);
    }
    // Unreachable as written: the measured loop never exits.
    pthread_mutex_unlock(&mutex);
    return NULL;
}
```
Time per round-trip: 11900 cycles (4950ns)
Notes: there is a lot of variation in these times.
Experiment: similar to those above, except pass 1 byte back and forth through 2 pipes for communication.
Time: 23900 cycles (9987ns)
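The pipe-based ping-pong can be sketched as below (our reconstruction of the setup, not the original code; the function name is an assumption). One byte travels from the main thread to the echo thread through one pipe and comes back through the other:

```cpp
#include <chrono>
#include <thread>
#include <unistd.h>

// Returns the average round-trip time in nanoseconds, or -1 if the
// pipes could not be created.
double pipe_roundtrip_ns(int iters) {
    int ping[2], pong[2];
    if (pipe(ping) != 0 || pipe(pong) != 0) {
        return -1;
    }
    std::thread echo([&] {
        char c;
        for (int i = 0; i < iters; i++) {
            if (read(ping[0], &c, 1) == 1) {   // receive the byte...
                write(pong[1], &c, 1);         // ...and send it back
            }
        }
    });
    auto start = std::chrono::steady_clock::now();
    char c = 'x';
    for (int i = 0; i < iters; i++) {
        write(ping[1], &c, 1);
        read(pong[0], &c, 1);
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    echo.join();
    for (int fd : {ping[0], ping[1], pong[0], pong[1]}) {
        close(fd);
    }
    return double(ns) / iters;
}
```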
Experiment: time the following code (where `func1` does nothing):

```cpp
boost::thread thread(func1);
thread.join();
```
Time: ranges from 14-120 us, with average times of 17-18 us.
Experiment: 2 threads, each unlocking a mutex required by the other thread, then locking its own. Measured the round-trip time.
Time: wide variation (65ns - 40 us); average time 8-9 microseconds
Experiment: 2 threads, each unlocking a mutex required by the other thread, then locking its own; used try_lock to avoid OS context switches. Measured the round-trip time.
Time: wide variation (90ns - 5 us); average time 145ns
Repeated experiment from above using `pthread_spinlock_t`, but with a slightly different measurement approach. Got 2 very different sets of results (consistent behavior within a single run, but different behaviors between runs).
"Good" runs: average time about 50ns; individual round-trips varied from 35-80ns.
"Bad" runs: average time about 180ns; individual round-trips varied from 90-280ns.
After further measurements, it appears that the "good" runs occur when the 2 threads are hyperthreads sharing the same core, whereas in the "bad" runs they are allocated to different cores.
Repeated the measurements above using shared memory, but with some additional controls (memspin2): `pause` is invoked in the spin loop, which speeds up the case where both threads are running on the same core but has no impact on the cross-core case. Here is the average ping time for each scenario. In the case where there were 2 slaves, the ping time is the total to ping both of the slaves (if rdtscp was enabled then this time is broken down between the two slaves).
| `__sync_synchronize`? | rdtscp? | Same core | Different core | Both |
|---|---|---|---|---|
| No | No | 19ns | 83ns | 138ns |
| No | Yes | 34ns | 88ns | 166ns (45ns/121ns) |
| Yes | No | 29ns | 86ns | 161ns |
| Yes | Yes | 45ns | 136ns | 198ns (58ns/140ns) |
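The `pause`-based spin loop mentioned above can be sketched like this (a minimal sketch; the function name is ours, and `_mm_pause` is the compiler intrinsic for the x86 `pause` instruction):

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause

// Spin until the shared flag reaches the given value. The pause hint
// releases pipeline resources to the sibling hyperthread, which is why
// it helps most when both threads share one core.
void spin_until(std::atomic<int>& flag, int value) {
    while (flag.load(std::memory_order_acquire) != value) {
        _mm_pause();
    }
}
```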
Time to execute the following code: about 20ns for a lock/unlock pair.

```cpp
boost::mutex m;
...
m.lock();
m.unlock();
```
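An uncontended lock/unlock pair can be timed as below (a sketch using `std::mutex` in place of `boost::mutex`; on Linux both wrap the same pthread mutex, so the cost should be comparable; the function name is ours):

```cpp
#include <chrono>
#include <mutex>

// Average cost of an uncontended lock/unlock pair, in nanoseconds.
double lock_unlock_ns(int iters) {
    std::mutex m;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {
        m.lock();
        m.unlock();
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    return double(ns) / iters;
}
```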