RAMCloud Benchmarks
A single backup operation (ClusterPerf with 100-byte writes, 1 master, 3 backups)
On the Master
Averaged over 1912 sample timelines:
Timeline on a Master
 0.0 us --- Begin backup (BackupManager::sync())
   |
   |
 2.0 us --- First write RPC sent out
   |
   |
 3.3 us --- Second write RPC sent out
   |
   |
 4.5 us --- Third write RPC sent out
   |
   |
   |   [~4 us "dead time"]
   |
   |
 8.6 us --- First write RPC completes (duration: 6.6 us)
   |
 9.8 us --- Second write RPC completes (duration: 6.5 us)
   |
10.8 us --- Third write RPC completes (duration: 6.3 us)
10.9 us --- End backup
Major time sinks in issue path
- Acquiring Dispatch::Lock in TransportManager::WorkerSession::clientSend for every write RPC
- Cost: 3 x ~250ns
- InfRcTransport<Infiniband>::getTransmitBuffer(): waiting for a free tx buffer for every write RPC
- Cost: 3 x ~200ns (first write RPC more expensive than 2nd and 3rd)
- Calling into Infiniband transport: postSendZeroCopy (unavoidable?)
- Cost: 3 x ~400ns (first write RPC more expensive than 2nd and 3rd)
On the Backup
Timeline on Backup
(   0.0 ns     0.0 ns    0.0 ns ) |    0.0 ns    0.0 ns    0.0 ns | [   0.0 ns ] Invoking appendToBuffer() [dispatch]
(  41.3 ns    62.2 ns    4.3 us ) |   41.3 ns   62.2 ns    4.3 us | [  62.6 ns ] Completed appendToBuffer() [dispatch]
(  56.3 ns    83.4 ns    4.3 us ) |   15.0 ns   21.3 ns   37.8 ns | [   4.1 ns ] Invoking serviceManager->handleRpc() [dispatch]
( 123.5 ns   178.2 ns    6.6 us ) |   67.2 ns   94.7 ns    2.3 us | [  33.1 ns ] Completed serviceManager->handleRpc() [dispatch]
( 147.0 ns   255.1 ns   52.3 us ) |   23.5 ns   77.0 ns   45.7 us | [ 650.5 ns ] Invoking service.handleRpc() [worker]
( 256.1 ns   389.0 ns   53.7 us ) |  109.1 ns  133.9 ns    1.4 us | [  27.5 ns ] Invoking service.dispatch() [worker]
( 276.2 ns   414.3 ns   54.0 us ) |   20.1 ns   25.3 ns  219.9 ns | [   7.7 ns ] Invoking callHandler() [worker]
( 360.1 ns   537.9 ns   56.9 us ) |   83.9 ns  123.5 ns    2.9 us | [  47.7 ns ] Invoking SegmentInfo->write() [worker]
( 483.2 ns   731.3 ns   58.6 us ) |  123.1 ns  193.5 ns    1.7 us | [  54.0 ns ] Completed SegmentInfo->write() [worker]
( 514.2 ns   770.0 ns   60.0 us ) |   31.0 ns   38.6 ns    1.4 us | [  22.8 ns ] Completed callHandler() [worker]
( 538.4 ns   795.9 ns   63.1 us ) |   24.2 ns   25.9 ns    3.1 us | [  33.6 ns ] Completed service.dispatch() [worker]
( 561.6 ns   833.4 ns   65.0 us ) |   23.2 ns   37.5 ns    1.9 us | [  19.8 ns ] Completed service.handleRpc() [worker]
( 646.1 ns   991.6 ns   73.3 us ) |   84.5 ns  158.2 ns    8.3 us | [ 172.9 ns ] Invoking worker->rpc->sendReply() [dispatch]
( 827.1 ns     1.2 us   77.8 us ) |  181.0 ns  253.9 ns    4.5 us | [  56.1 ns ] Invoking postSend() [dispatch]
(   1.1 us     1.6 us   82.5 us ) |  273.1 ns  317.4 ns    4.7 us | [  79.6 ns ] Completed postSend() [dispatch]
(   1.3 us     1.8 us   87.2 us ) |  152.1 ns  222.2 ns    4.7 us | [  95.4 ns ] Completed worker->rpc->sendReply() [dispatch]
Benchmark IB Send vs. RDMA
A simple program that benchmarks a 56-byte write.
Averaged over 100 samples.
One-way (with completion on sender)
Using IB send: 2.753 us
Using RDMA: 2.50495 us
RTT (RPC-style)
Using IB send: 4.969 us (explains the write RPC latency seen in RAMCloud: ~5 us RTT + ~1 us of overhead ≈ 6 us)
Using RDMA: 4.866 us
We see that a one-way RDMA write (2.50 us) easily beats the round-trip, IB-send-based RPC that RAMCloud currently uses (4.97 us).