RAMCloud Benchmarks
A single backup operation (ClusterPerf with 100-byte writes, 1 master, 3 backups)
On the Master
Averaged over 1912 sample timelines:
Timeline on a Master
 0 us   --- Begin backup
   |
   |
 2.0 us --- First write RPC sent out
   |
   |
 3.3 us --- Second write RPC sent out
   |
   |
 4.5 us --- Third write RPC sent out
   |
   |
   |        [~ 4 us "dead time"]
   |
   |
 8.6 us --- First write RPC completes (duration: 6.6 us)
   |
 9.8 us --- Second write RPC completes (duration: 6.5 us)
   |
10.8 us --- Third write RPC completes (duration: 6.3 us)
10.9 us --- End backup
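The shape of this timeline follows from how the master issues its replication RPCs: all three writes are started asynchronously before any reply is reaped, which is what produces the "dead time" window. A minimal sketch of that pattern; BackupWriteRpc, start(), and wait() are illustrative stand-ins, not RAMCloud's actual interfaces:

    #include <array>

    // Illustrative stand-in for an asynchronous backup write RPC.
    // Real transport work is elided; only the call pattern matters here.
    struct BackupWriteRpc {
        void start() { /* post the request; returns immediately */ }
        void wait()  { /* spin until the reply arrives */ }
    };

    // Issue pattern from the timeline: start all three write RPCs
    // back-to-back (2.0, 3.3, 4.5 us), then reap them in order
    // (8.6, 9.8, 10.8 us). Between the two loops the master has
    // nothing to do: the ~4 us "dead time".
    void replicateSegment(std::array<BackupWriteRpc, 3>& rpcs)
    {
        for (auto& rpc : rpcs)
            rpc.start();
        for (auto& rpc : rpcs)
            rpc.wait();
    }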
Major time sinks in the issue path
- Acquiring Dispatch::Lock in TransportManager::WorkerSession::clientSend for every write RPC
  - Cost: 3 x ~250 ns
- InfRcTransport<Infiniband>::getTransmitBuffer(): waiting for a free tx buffer for every write RPC
  - Cost: 3 x ~200 ns (the first write RPC is more expensive than the 2nd and 3rd)
- Calling into the Infiniband transport: postSendZeroCopy (unavoidable?)
  - Cost: 3 x ~400 ns (the first write RPC is more expensive than the 2nd and 3rd)
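Costs at this scale are easiest to confirm with a TSC-based micro-benchmark over many iterations. A minimal sketch for the lock-acquisition item above; std::mutex stands in for Dispatch::Lock, the 2.4 GHz TSC frequency is an assumption, and an uncontended mutex will read lower than a contended dispatch lock:

    #include <cstdint>
    #include <cstdio>
    #include <mutex>
    #include <x86intrin.h>  // __rdtsc()

    int main()
    {
        std::mutex lock;    // stand-in for Dispatch::Lock (assumption)
        const int samples = 1000000;

        uint64_t start = __rdtsc();
        for (int i = 0; i < samples; i++) {
            lock.lock();
            lock.unlock();
        }
        uint64_t cycles = __rdtsc() - start;

        // Convert cycles to ns using the machine's TSC frequency
        // (assumed 2.4 GHz here; measure yours).
        double ns = static_cast<double>(cycles) / samples / 2.4;
        printf("lock+unlock: %.1f ns per pair\n", ns);
        return 0;
    }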
Benchmark IB Send vs. RDMA
A simple program to benchmark a 56-byte write.
Averaged over 100 samples.
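A minimal sketch of what the sender's hot path might look like in both modes, assuming an already-connected RC queue pair and a registered 56-byte buffer; resource setup, error handling, and the exchange of the remote address/rkey are elided:

    #include <infiniband/verbs.h>
    #include <cstdint>
    #include <cstring>

    // Posts one 56-byte message and busy-waits for the send-side
    // completion. useRdma selects IBV_WR_RDMA_WRITE over IBV_WR_SEND.
    static void postAndWait(ibv_qp* qp, ibv_cq* sendCq, ibv_mr* mr,
                            char* buf, uint64_t raddr, uint32_t rkey,
                            bool useRdma)
    {
        ibv_sge sge;
        memset(&sge, 0, sizeof(sge));
        sge.addr   = reinterpret_cast<uint64_t>(buf);
        sge.length = 56;
        sge.lkey   = mr->lkey;

        ibv_send_wr wr;
        memset(&wr, 0, sizeof(wr));
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = useRdma ? IBV_WR_RDMA_WRITE : IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;       // completion on sender
        if (useRdma) {
            wr.wr.rdma.remote_addr = raddr;      // learned out of band
            wr.wr.rdma.rkey        = rkey;
        }

        ibv_send_wr* bad = nullptr;
        if (ibv_post_send(qp, &wr, &bad) != 0)
            return;                              // error handling elided

        ibv_wc wc;
        while (ibv_poll_cq(sendCq, 1, &wc) == 0)
            ;                                    // spin until completion
    }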
One-way (with completion on sender)
Using IB send: 2.753 us
Using RDMA: 2.505 us
RTT (RPC-style)
Using IB send: 4.969 us (this explains the ~6 us write RPC latency seen in RAMCloud: ~5 us RTT + ~1 us of software overhead)
Using RDMA: 4.866 us
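For the RPC-style numbers, the client measures a full ping-pong: post a receive for the reply, send the request, and spin until the reply's receive completion appears. A sketch of the client half under the same assumptions as above; the server is assumed to echo each message back:

    #include <infiniband/verbs.h>
    #include <cstdint>
    #include <cstring>

    // One round trip: the RTT sample ends when the reply's receive
    // completion shows up. Separate buffers for request and reply.
    static void pingOnce(ibv_qp* qp, ibv_cq* sendCq, ibv_cq* recvCq,
                         ibv_mr* mr, char* reqBuf, char* repBuf)
    {
        // Post the receive first so the reply cannot race past us.
        ibv_sge rsge;
        memset(&rsge, 0, sizeof(rsge));
        rsge.addr   = reinterpret_cast<uint64_t>(repBuf);
        rsge.length = 56;
        rsge.lkey   = mr->lkey;

        ibv_recv_wr rwr;
        memset(&rwr, 0, sizeof(rwr));
        rwr.sg_list = &rsge;
        rwr.num_sge = 1;
        ibv_recv_wr* badRecv = nullptr;
        ibv_post_recv(qp, &rwr, &badRecv);

        // Send the 56-byte request.
        ibv_sge ssge;
        memset(&ssge, 0, sizeof(ssge));
        ssge.addr   = reinterpret_cast<uint64_t>(reqBuf);
        ssge.length = 56;
        ssge.lkey   = mr->lkey;

        ibv_send_wr swr;
        memset(&swr, 0, sizeof(swr));
        swr.sg_list    = &ssge;
        swr.num_sge    = 1;
        swr.opcode     = IBV_WR_SEND;
        swr.send_flags = IBV_SEND_SIGNALED;
        ibv_send_wr* badSend = nullptr;
        ibv_post_send(qp, &swr, &badSend);

        // RTT endpoint: the reply has arrived.
        ibv_wc wc;
        while (ibv_poll_cq(recvCq, 1, &wc) == 0)
            ;
        // Also reap the send completion so the send queue drains.
        while (ibv_poll_cq(sendCq, 1, &wc) == 0)
            ;
    }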
A one-way RDMA (2.505 us) thus easily beats the round-trip, IB-send-based RPC that RAMCloud currently uses (4.969 us).