Performance Improvement Log

This page is intended for recording steps we have taken over time to improve RAMCloud performance, along with measurements of the resulting performance gains. Add new entries at the beginning of the page, so that the entries are in reverse chronological order.

Enqueue replication Rpcs from the service thread to dispatch thread instead of taking the Dispatch lock (December 2014, Henry Qin)

When servicing a Write Rpc, the service thread used to take the dispatch lock to block the dispatch thread from executing, and then proceed to use the transport to send the replication Rpc.

We introduced the DispatchExec mechanism as a new Poller for the dispatch thread. The service thread now hands work to DispatchExec instead of taking the dispatch lock.

This optimization reduces the median latency for writes from 14.4 us to 13.4 us.
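
The hand-off described above can be sketched roughly as follows. This is a minimal illustration, not the actual DispatchExec code: the class name WorkQueue is hypothetical, and a mutex-protected deque is assumed for simplicity (the real mechanism may use a different, possibly lock-free, structure).

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for a poller-owned work queue; not RAMCloud's
// DispatchExec. A service thread enqueues work; the dispatch thread runs it.
class WorkQueue {
  public:
    // Called from a service thread: hand one unit of work to the
    // dispatch thread without blocking the dispatch thread's main loop.
    void enqueue(std::function<void()> work) {
        std::lock_guard<std::mutex> guard(lock);
        pending.push_back(std::move(work));
    }

    // Called from the dispatch thread's polling loop: run everything that
    // has been queued so far. Returns the number of items executed.
    int poll() {
        std::deque<std::function<void()>> batch;
        {
            // Hold the lock only long enough to take the queued items.
            std::lock_guard<std::mutex> guard(lock);
            batch.swap(pending);
        }
        for (auto& work : batch) {
            work();
        }
        return static_cast<int>(batch.size());
    }

  private:
    std::mutex lock;
    std::deque<std::function<void()>> pending;
};
```

In this sketch the lock is held only for the brief enqueue and swap steps, so the dispatch thread is never blocked for the duration of a send.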

Fetch multiple completions at once in InfRcTransport (August 2014, John Ousterhout)

InfRcTransport used to retrieve completions (both from serverRxCq and clientRxCq) one-at-a-time. This optimization changed the code to retrieve many at once, if there are several available. This improved "clusterperf readThroughput" from 875 kreads/sec to 948 kreads/sec.
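
As an illustration of batched retrieval, here is a small sketch that uses an in-memory mock in place of a real completion queue. All names (MockCompletionQueue, drainBatched, MAX_BATCH) are hypothetical and the batch size of 8 is an arbitrary assumption. In a verbs-based transport the same idea would correspond to passing a num_entries value greater than 1 to ibv_poll_cq.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>

struct Completion {
    uint64_t id;
};

// Hypothetical in-memory stand-in for a completion queue.
struct MockCompletionQueue {
    std::deque<Completion> ready;
    int pollCalls;   // how many times the queue has been polled

    MockCompletionQueue() : ready(), pollCalls(0) {}

    // Copies up to maxEntries completions into out; returns how many
    // were copied (0 if the queue is empty).
    int poll(int maxEntries, Completion* out) {
        ++pollCalls;
        int n = std::min(maxEntries, static_cast<int>(ready.size()));
        for (int i = 0; i < n; i++) {
            out[i] = ready.front();
            ready.pop_front();
        }
        return n;
    }
};

// Drains the queue in batches of up to MAX_BATCH entries per poll call.
// Returns the total number of completions retrieved.
inline int drainBatched(MockCompletionQueue& cq) {
    static const int MAX_BATCH = 8;   // assumed batch size
    Completion batch[MAX_BATCH];
    int total = 0;
    int n;
    while ((n = cq.poll(MAX_BATCH, batch)) > 0) {
        total += n;   // a real transport would process batch[0..n-1] here
    }
    return total;
}
```

With 10 completions pending, this loop polls 3 times (retrieving 8, 2, and 0 entries) where a one-at-a-time loop would poll 11 times.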

Reclaim multiple transmit buffers at once in InfRcTransport (August 2014, John Ousterhout)

InfRcTransport used to call reapTxBuffers automatically at the end of the poll method. As a result, it typically reclaimed only one buffer at a time. This optimization changed the code so that it calls reapTxBuffers only when transmit buffers are running low, so it can generally reclaim several at a time. This improved "clusterperf readThroughput" from 812 kreads/sec to 875 kreads/sec (in this benchmark, the dispatch thread is the bottleneck).
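
The "reclaim only when running low" policy can be sketched with simple counters. This is an illustration under assumptions, not the InfRcTransport code: TxBufferPool, its members, and the low-watermark parameter are all hypothetical, and no real buffers are managed.

```cpp
#include <cstddef>

// Hypothetical transmit-buffer bookkeeping (counters only).
class TxBufferPool {
  public:
    TxBufferPool(size_t total, size_t low)
        : freeCount(total), completed(0), lowWatermark(low), reapCalls(0) {}

    // Take a buffer for a new transmission. Finished buffers are reclaimed
    // only if the pool is running low, so one reap usually recovers several.
    // Returns false if no buffer is available even after reclaiming.
    bool acquire() {
        if (freeCount < lowWatermark) {
            reap();
        }
        if (freeCount == 0) {
            return false;
        }
        --freeCount;
        return true;
    }

    // Simulates the NIC finishing one transmission.
    void transmitDone() { ++completed; }

    // Returns all finished buffers to the free pool in one pass.
    void reap() {
        ++reapCalls;
        freeCount += completed;
        completed = 0;
    }

    size_t freeCount;      // buffers available for new transmissions
    size_t completed;      // finished transmissions not yet reclaimed
    size_t lowWatermark;   // reclaim only when freeCount drops below this
    int reapCalls;         // number of reclaim passes performed
};
```

In this sketch, a pool of 8 buffers with a low watermark of 3 performs no reclaim pass during the first six sends, then recovers six buffers in a single pass on the seventh.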

Optimizing class Object to treat contiguous Buffers as normal byte buffers (July 2014, Henry Qin)

Before this optimization, we used (relatively) slow Buffer methods like getRange() to access data inside an Object even if the memory was contiguous.

This optimization causes Object to treat contiguous Buffers as normal void*, saving the overhead of Buffer.

We measured an improvement of 25 ns in the read Rpc.
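
A rough sketch of the fast path follows. The types Chunk and ChunkedBuffer and the function contiguousView are hypothetical stand-ins, not RAMCloud's Buffer or Object classes. The point is only that a single-chunk buffer can be accessed through a plain pointer, while a multi-chunk buffer must go through a slower gathering path.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical chunk and buffer types for illustration only.
struct Chunk {
    const char* data;
    size_t length;
};

struct ChunkedBuffer {
    std::vector<Chunk> chunks;

    size_t totalLength() const {
        size_t total = 0;
        for (const Chunk& c : chunks) total += c.length;
        return total;
    }

    // General (slow) path: gather bytes [offset, offset + length) into
    // dest by walking the chunk list.
    void copyOut(size_t offset, size_t length, char* dest) const {
        size_t chunkStart = 0;   // buffer offset of the current chunk
        for (const Chunk& c : chunks) {
            if (length == 0) break;
            if (offset < chunkStart + c.length) {
                size_t skip = offset - chunkStart;
                size_t n = std::min(length, c.length - skip);
                std::memcpy(dest, c.data + skip, n);
                dest += n;
                offset += n;
                length -= n;
            }
            chunkStart += c.length;
        }
    }
};

// Returns a pointer to a contiguous view of the whole buffer. With exactly
// one chunk this is the chunk's own memory (no copy); otherwise the bytes
// are gathered into scratch and a pointer into scratch is returned.
inline const char* contiguousView(const ChunkedBuffer& buffer,
                                  std::vector<char>& scratch) {
    if (buffer.chunks.size() == 1) {
        return buffer.chunks[0].data;   // fast path: plain byte pointer
    }
    scratch.resize(buffer.totalLength());
    if (!scratch.empty()) {
        buffer.copyOut(0, scratch.size(), scratch.data());
    }
    return scratch.data();
}
```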

Merging the check for tablet existence with incrementing the read count on the tablet (July 2014, Henry Qin)

Before this optimization, reading an object required first checking for existence of the tablet containing the Key, reading the Object, and then looking at the same tablet to increment the read count on it. It also involved copying a TabletManager::Tablet object onto the stack.

This optimization combines the two operations which require tablet lookup into a single operation, saving roughly 40 ns of time for the repeated look-up.
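
The combined operation might look something like the sketch below. The names TabletTable and checkAndIncrementReadCount are hypothetical, and tablets are keyed by a bare table id here as a simplifying assumption (real tablets also cover a key-hash range).

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical per-tablet state.
struct Tablet {
    uint64_t readCount;
    Tablet() : readCount(0) {}
};

// Hypothetical tablet table keyed by table id.
class TabletTable {
  public:
    void addTablet(uint64_t tableId) { tablets[tableId]; }

    // One lookup serves both purposes: it reports whether the tablet exists
    // and, if so, bumps the read count in place. No Tablet is copied.
    bool checkAndIncrementReadCount(uint64_t tableId) {
        auto it = tablets.find(tableId);
        if (it == tablets.end()) {
            return false;
        }
        ++it->second.readCount;
        return true;
    }

    // Returns the read count, or 0 if the tablet does not exist.
    uint64_t readCount(uint64_t tableId) const {
        auto it = tablets.find(tableId);
        return (it == tablets.end()) ? 0 : it->second.readCount;
    }

  private:
    std::unordered_map<uint64_t, Tablet> tablets;
};
```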

Prefetching on Incoming Packet and Log Entry (June 2014, Henry Qin)

After various experiments isolating Last Level Cache misses, and the addition of the randomized read benchmark readDistRandom, we added prefetching on the incoming packet whenever it is shorter than 1000 bytes, as well as prefetching on the Log entry whenever it is less than 300 bytes.

This reduces the median read time on `clusterperf readDist` by 30 ns, while reducing the median read time on `clusterperf readDistRandom` by 190 ns.

                Old      New
readDist        4.67us   4.64us
readDistRandom  4.94us   4.75us
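
A size-limited prefetch of this kind could be sketched as below. This is not the RAMCloud code: the function name prefetchIfSmall is hypothetical, a 64-byte cache line is assumed, and __builtin_prefetch is a GCC/Clang builtin, so other compilers would need a different intrinsic. The sketch also ignores the alignment of the starting address.

```cpp
#include <cstddef>

// Issues one prefetch per assumed cache line covering [addr, addr + length),
// but only if length is below limit (e.g. 1000 bytes for an incoming packet,
// 300 bytes for a log entry). Returns the number of prefetches issued.
inline int prefetchIfSmall(const void* addr, size_t length, size_t limit) {
    static const size_t CACHE_LINE = 64;   // assumption: 64-byte cache lines
    if (length == 0 || length >= limit) {
        return 0;
    }
    const char* p = static_cast<const char*>(addr);
    int issued = 0;
    for (size_t offset = 0; offset < length; offset += CACHE_LINE) {
        __builtin_prefetch(p + offset);    // hint only; never faults
        ++issued;
    }
    return issued;
}
```

The size limit keeps the number of prefetch instructions small, since prefetching a large region has a cost of its own.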

ObjectFinder and TransportManager (June 2014, John Ousterhout)

Before this optimization, ObjectFinder::lookup had to invoke TransportManager::getSession in each call, with 2 inefficiencies:

The interface to getSession passed in a char*, but the unordered_map in TransportManager is keyed with strings; this meant that a new string object had to be created around the char* argument during each call. I changed the interface to pass in a string instead. This saved approximately 80ns in each RPC.
ObjectFinder::lookup has already done a hash table lookup to find information about the tablet, such as its service locator. I modified ObjectFinder to cache the SessionRef in its own data structure, which eliminates the call to TransportManager::getSession. This saved another 60ns in each RPC.

The overall impact of these 2 changes was to reduce the median read time in "clusterperf readDist" from 4.91µs to 4.77µs.
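
Both changes can be illustrated with a small sketch. The types below (Session, SessionTable, TabletEntry) and the function sessionFor are hypothetical stand-ins for the real classes, and SessionRef is assumed here to be a std::shared_ptr purely for illustration.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical session type; SessionRef is assumed to be a shared_ptr.
struct Session {
    std::string locator;
};
typedef std::shared_ptr<Session> SessionRef;

// Hypothetical stand-in for the session table in a transport manager.
class SessionTable {
  public:
    SessionTable() : lookups(0), sessions() {}

    // Takes a std::string, matching the map's key type, so no temporary
    // string has to be built around a char* on every call.
    SessionRef getSession(const std::string& locator) {
        ++lookups;
        SessionRef& slot = sessions[locator];
        if (!slot) {
            slot = std::make_shared<Session>(Session{locator});
        }
        return slot;
    }

    int lookups;   // number of hash table lookups performed

  private:
    std::unordered_map<std::string, SessionRef> sessions;
};

// Hypothetical per-tablet record that caches the session.
struct TabletEntry {
    std::string serviceLocator;
    SessionRef session;   // empty until the first lookup
};

// Returns the cached session if present; otherwise looks it up once and
// caches it, so later calls skip the session table entirely.
inline SessionRef sessionFor(TabletEntry& entry, SessionTable& table) {
    if (!entry.session) {
        entry.session = table.getSession(entry.serviceLocator);
    }
    return entry.session;
}
```

A production cache would also need some way to discard a stale session when a tablet moves; that is omitted from this sketch.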

Buffer rewrite (June 2014, John Ousterhout)

Rewrote Buffer.cc and Buffer.h from scratch to streamline and simplify, in the hopes of speeding up basic operations.

One overall approach was to eliminate layers within the Buffer class. For example, allocating a new chunk used to have to pass through many levels of method call, with many of the methods doing nothing except passing their arguments to the next method in the chain. In the new version the most common operations are collapsed into a single method.
The Buffer::Iterator class was simplified by moving almost all the computation to the next method and handling special cases related to the first chunk in the constructor. In the old version, there was significant complexity in each of next, getData, and getLength, with significant duplication, and extra code to deal with the first chunk that had to be executed for every single chunk. In the new version, getData and getLength are in-line methods that do nothing except return precomputed values.

Performance comparisons:

                                        Old     New
create, append 1 chunk, delete       17.0ns  12.3ns
create, alloc 1 chunk, delete        22.3ns  15.6ns
create, copy in 1 chunk, delete      25.4ns  13.2ns
extend existing chunk (alloc only)    9.3ns   5.7ns
copy 2 small chunks out of buffer    19.2ns  19.2ns
iterate over buffer with 5 chunks    51.0ns  22.6ns

The median read time in "clusterperf readDist" dropped about 40ns as a result of these changes (from 4.95µs to 4.91µs).
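
The iterator simplification described above (all computation in next, with getData and getLength returning precomputed values) can be sketched as follows. This is an illustration only: Chunk and ChunkIterator are hypothetical types over a std::vector, not the actual Buffer::Iterator.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical chunk type for illustration only.
struct Chunk {
    const char* data;
    size_t length;
};

// Hypothetical iterator: the constructor and next() do all the work, so
// getData() and getLength() are trivial inline accessors. The vector passed
// in must outlive the iterator.
class ChunkIterator {
  public:
    explicit ChunkIterator(const std::vector<Chunk>& chunkList)
        : chunks(chunkList), index(0), currentData(nullptr), currentLength(0) {
        load();   // first-chunk handling happens once, here
    }

    bool isDone() const { return index >= chunks.size(); }

    // Advances to the next chunk and precomputes its data and length.
    void next() {
        if (!isDone()) {
            ++index;
            load();
        }
    }

    const char* getData() const { return currentData; }
    size_t getLength() const { return currentLength; }

  private:
    // Caches the current chunk's pointer and length (or null/0 at the end).
    void load() {
        if (isDone()) {
            currentData = nullptr;
            currentLength = 0;
        } else {
            currentData = chunks[index].data;
            currentLength = chunks[index].length;
        }
    }

    const std::vector<Chunk>& chunks;
    size_t index;
    const char* currentData;
    size_t currentLength;
};
```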