View Source

RAMCloud Benchmarks

A single backup operation (ClusterPerf with 100-byte writes, 1 master, 3 backups). For each operation, a timeline of events was logged. Not all timelines had the same "shape", as not all writes are handled by the same sequence of events. Thus, the most common timeline "shape" was chosen, and the timelines below represent the average of the most common timeline shape. This procedure was done for both the backup and the master.

On the Master

Averaged over 1912 same-shape timelines.

0 us --- Begin backup (BackupManager::sync())
| 
|
2.0 us --- First write RPC sent out
| 
|
3.3 us --- Second write RPC sent out
| 
|
4.5 us --- Third write RPC sent out
| 
|
| [~ 4 us "dead time"]
|
|
8.6 us --- First write RPC completes (duration: 6.6 us)
|
9.8 us --- Second write RPC completes (duration: 6.5 us)
|
10.8 us --- Third write RPC completes (duration: 6.3 us)
10.9 us --- End backup

Major time sinks in issue path

Acquiring Dispatch::Lock in TransportManager::WorkerSession::clientSend for every write RPC
- Cost: 3 x ~250ns

InfRcTransport<Infiniband>::getTransmitBuffer(): waiting for free tx buffer for every write RPC
- Cost: 3 x ~200ns (first write RPC more expensive than 2nd and 3rd)

Calling into Infiniband transport: postSendZeroCopy (unavoidable?)
- Cost: 3 x ~400ns (first write RPC more expensive than 2nd and 3rd)

On the the Backup

Average over 9584 same-shape timelines.

0 us ---- InfRcTransport Poller picks ups incoming RPC [dispatch thread]
|
255 ns -- Invoke service.handleRpc()                   [worker thread]
|
833 ns -- Completed service.handleRpc()                [worker thread]
|
991 ns -- Begin sending reply                          [dispatch thread]
|
|
|
1.8 us -- Completed worker->rpc->sendReply()           [dispatch thread]

Benchmark IB Send vs. RDMA

Simple program to benchmark 56-byte write.

Averaged over 100 samples.

One-way (with completion on sender)

Using IB send: 2.753 us

Using RDMA: 2.50495 us

RTT (RPC-style)

Using IB send: 4.969 us (explains write RPC latency seen in RAMCloud: 5 + 1 = 6 us)

Using RDMA: 4.866 us

We see that a one-way RDMA easily beats the round-trip IB send's currently used RAMCloud RPC.