This page is intended for recording steps we have taken over time to improve RAMCloud performance, along with measurements of the resulting performance gains. Add new entries at the beginning of the page, so that the entries are in reverse chronological order.

Prefetching on Incoming Packet and Log Entry (June 2014, Henry Qin)

After various experiments to isolate last-level cache misses, and after adding the randomized read benchmark readDistRandom, we added prefetching of the incoming packet whenever it is shorter than 1000 bytes, and prefetching of the log entry whenever it is shorter than 300 bytes.

This reduced the median read time in `clusterperf readDist` by 30 ns and the median read time in `clusterperf readDistRandom` by 190 ns:

                Old      New
readDist        4.67µs   4.64µs
readDistRandom  4.94µs   4.75µs
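
The size-gated prefetching described in this entry can be sketched roughly as follows. This is a minimal illustration, not the actual RAMCloud code: the helper names (`prefetchRange`, `prefetchIfShort`), the constant names, and the 64-byte cache line size are all assumptions, and `__builtin_prefetch` is the GCC/Clang builtin rather than anything RAMCloud-specific. Only the two thresholds (1000 and 300 bytes) come from the text above.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (typical on x86-64, but not guaranteed).
constexpr std::size_t CACHE_LINE_SIZE = 64;

// Thresholds taken from the entry above; the constant names are hypothetical.
constexpr std::size_t MAX_PACKET_PREFETCH_BYTES = 1000;
constexpr std::size_t MAX_LOG_ENTRY_PREFETCH_BYTES = 300;

// Issue a prefetch hint for every cache line that overlaps
// [addr, addr + length). Returns the number of lines hinted.
inline std::size_t prefetchRange(const void* addr, std::size_t length)
{
    if (length == 0)
        return 0;
    const auto start = reinterpret_cast<std::uintptr_t>(addr);
    const std::uintptr_t mask = ~static_cast<std::uintptr_t>(CACHE_LINE_SIZE - 1);
    const std::uintptr_t first = start & mask;               // line holding first byte
    const std::uintptr_t last = (start + length - 1) & mask; // line holding last byte
    std::size_t lines = 0;
    for (std::uintptr_t p = first; p <= last; p += CACHE_LINE_SIZE) {
        // GCC/Clang builtin; a hint only, it never faults.
        __builtin_prefetch(reinterpret_cast<const void*>(p));
        ++lines;
    }
    return lines;
}

// Prefetch only when the object is strictly shorter than the threshold,
// so that large objects do not flood the cache with hints.
// Returns true if prefetch hints were issued.
inline bool prefetchIfShort(const void* addr, std::size_t length,
                            std::size_t threshold)
{
    if (length >= threshold)
        return false;
    prefetchRange(addr, length);
    return true;
}
```

Under these assumptions, a receive path would call `prefetchIfShort(packet, packetLength, MAX_PACKET_PREFETCH_BYTES)` as soon as a packet arrives, and a read path would call `prefetchIfShort(entry, entryLength, MAX_LOG_ENTRY_PREFETCH_BYTES)` once the log entry's address is known, so that the memory fetch overlaps with other work before the data is touched.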

ObjectFinder and TransportManager (June 2014, John Ousterhout)

Before this optimization, ObjectFinder::lookup had to invoke TransportManager::getSession in each call, with 2 inefficiencies:

The overall impact of these 2 changes was to reduce the median read time in "clusterperf readDist" from 4.91µs to 4.77µs.

Buffer rewrite (June 2014, John Ousterhout)

Rewrote Buffer.cc and Buffer.h from scratch to streamline and simplify, in the hopes of speeding up basic operations.

Performance comparisons:

                                        Old     New
create, append 1 chunk, delete       17.0ns  12.3ns
create, alloc 1 chunk, delete        22.3ns  15.6ns
create, copy in 1 chunk, delete      25.4ns  13.2ns
extend existing chunk (alloc only)    9.3ns   5.7ns
copy 2 small chunks out of buffer    19.2ns  19.2ns
iterate over buffer with 5 chunks    51.0ns  22.6ns

The median read time in "clusterperf readDist" dropped about 40ns as a result of these changes (from 4.95µs to 4.91µs).