RAMCloud Benchmarks
A single backup operation (ClusterPerf with 100-byte writes, 1 master, 3 backups)
On the Master
Averaged over 1912 sample timelines:
Timeline on a Master
 0.0 us --- Begin backup (BackupManager::sync())
   |
   |
 2.0 us --- First write RPC sent out
   |
   |
 3.3 us --- Second write RPC sent out
   |
   |
 4.5 us --- Third write RPC sent out
   |
   |
   |   [~4 us "dead time"]
   |
   |
 8.6 us --- First write RPC completes (duration: 6.6 us)
   |
 9.8 us --- Second write RPC completes (duration: 6.5 us)
   |
10.8 us --- Third write RPC completes (duration: 6.3 us)
10.9 us --- End backup
Major time sinks in issue path
- Acquiring Dispatch::Lock in TransportManager::WorkerSession::clientSend for every write RPC
- Cost: 3 x ~250ns
- InfRcTransport<Infiniband>::getTransmitBuffer(): waiting for a free tx buffer for every write RPC
- Cost: 3 x ~200ns (first write RPC more expensive than 2nd and 3rd)
- Calling into Infiniband transport: postSendZeroCopy (unavoidable?)
- Cost: 3 x ~400ns (first write RPC more expensive than 2nd and 3rd)
On the Backup
Timeline on Backup
(   0.0 ns     0.0 ns    0.0 ns ) |    0.0 ns    0.0 ns    0.0 ns | [   0.0 ns ] Invoking appendToBuffer() [dispatch]
(  41.3 ns    62.2 ns    4.3 us ) |   41.3 ns   62.2 ns    4.3 us | [  62.6 ns ] Completed appendToBuffer() [dispatch]
(  56.3 ns    83.4 ns    4.3 us ) |   15.0 ns   21.3 ns   37.8 ns | [   4.1 ns ] Invoking serviceManager->handleRpc() [dispatch]
( 123.5 ns   178.2 ns    6.6 us ) |   67.2 ns   94.7 ns    2.3 us | [  33.1 ns ] Completed serviceManager->handleRpc() [dispatch]
( 147.0 ns   255.1 ns   52.3 us ) |   23.5 ns   77.0 ns   45.7 us | [ 650.5 ns ] Invoking service.handleRpc() [worker]
( 256.1 ns   389.0 ns   53.7 us ) |  109.1 ns  133.9 ns    1.4 us | [  27.5 ns ] Invoking service.dispatch() [worker]
( 276.2 ns   414.3 ns   54.0 us ) |   20.1 ns   25.3 ns  219.9 ns | [   7.7 ns ] Invoking callHandler() [worker]
( 360.1 ns   537.9 ns   56.9 us ) |   83.9 ns  123.5 ns    2.9 us | [  47.7 ns ] Invoking SegmentInfo->write() [worker]
( 483.2 ns   731.3 ns   58.6 us ) |  123.1 ns  193.5 ns    1.7 us | [  54.0 ns ] Completed SegmentInfo->write() [worker]
( 514.2 ns   770.0 ns   60.0 us ) |   31.0 ns   38.6 ns    1.4 us | [  22.8 ns ] Completed callHandler() [worker]
( 538.4 ns   795.9 ns   63.1 us ) |   24.2 ns   25.9 ns    3.1 us | [  33.6 ns ] Completed service.dispatch() [worker]
( 561.6 ns   833.4 ns   65.0 us ) |   23.2 ns   37.5 ns    1.9 us | [  19.8 ns ] Completed service.handleRpc() [worker]
( 646.1 ns   991.6 ns   73.3 us ) |   84.5 ns  158.2 ns    8.3 us | [ 172.9 ns ] Invoking worker->rpc->sendReply() [dispatch]
( 827.1 ns     1.2 us   77.8 us ) |  181.0 ns  253.9 ns    4.5 us | [  56.1 ns ] Invoking postSend() [dispatch]
(   1.1 us     1.6 us   82.5 us ) |  273.1 ns  317.4 ns    4.7 us | [  79.6 ns ] Completed postSend() [dispatch]
(   1.3 us     1.8 us   87.2 us ) |  152.1 ns  222.2 ns    4.7 us | [  95.4 ns ] Completed worker->rpc->sendReply() [dispatch]
Benchmark IB Send vs. RDMA
A simple program that benchmarks a 56-byte write.
Averaged over 100 samples.
One-way (with completion on sender)
Using IB send: 2.753 us
Using RDMA: 2.50495 us
RTT (RPC-style)
Using IB send: 4.969 us (explains the write RPC latency seen in RAMCloud: ~5 us RTT + ~1 us of overhead ≈ 6 us)
Using RDMA: 4.866 us
We see that a one-way RDMA write (2.50 us) easily beats the round-trip, IB-send-based RPC that RAMCloud currently uses (4.97 us).