Performance Improvement Log

This page is intended for recording steps we have taken over time to improve RAMCloud performance, along with measurements of the resulting performance gains. Add new entries at the beginning of the page, so that the entries are in reverse chronological order.

Buffer rewrite (June 2014, John Ousterhout)

Rewrote Buffer.cc and Buffer.h from scratch to streamline and simplify, in the hopes of speeding up basic operations.

One overall approach was to eliminate layers within the Buffer class. For example, allocating a new chunk used to pass through many levels of method calls, with many of the methods doing nothing except passing their arguments to the next method in the chain. In the new version the most common operations are each collapsed into a single method.

The Buffer::Iterator class was simplified by moving almost all of the computation into next() and handling the special cases related to the first chunk in the constructor. In the old version there was significant complexity in each of next(), getData(), and getLength(), with considerable duplication, and the extra code for the first chunk had to be executed for every single chunk. In the new version, getData() and getLength() are inline methods that do nothing except return precomputed values.
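
The iterator structure described above can be illustrated with a minimal sketch. This is not the actual RAMCloud code: the Chunk struct, the RangeIterator class, and the collectRange helper are hypothetical names invented for illustration, and the real Buffer::Iterator may differ in its members and edge-case handling. The sketch only shows the division of labor: first-chunk work happens once in the constructor, next() does the per-chunk computation, and the accessors return cached values.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Hypothetical, simplified chunk; not RAMCloud's actual Buffer::Chunk.
struct Chunk {
    const char* data;    // First byte of this chunk's storage.
    uint32_t length;     // Number of bytes in this chunk.
    const Chunk* next;   // Next chunk in the buffer, or nullptr.
};

// Hypothetical iterator over the byte range [offset, offset + length).
class RangeIterator {
  public:
    RangeIterator(const Chunk* first, uint32_t offset, uint32_t length)
        : current(first), currentData(nullptr), currentLength(0),
          bytesLeft(length)
    {
        // First-chunk special cases run exactly once, here: skip chunks
        // that lie entirely before the range, then trim the first one.
        while (current != nullptr && offset >= current->length) {
            offset -= current->length;
            current = current->next;
        }
        if (current == nullptr || bytesLeft == 0) {
            current = nullptr;
            return;
        }
        currentData = current->data + offset;
        currentLength = std::min(current->length - offset, bytesLeft);
    }

    bool isDone() const { return current == nullptr; }

    // All per-chunk computation lives here; it never has to ask
    // whether it is looking at the first chunk.
    void next() {
        if (current == nullptr) return;
        bytesLeft -= currentLength;
        current = current->next;
        if (current == nullptr || bytesLeft == 0) {
            current = nullptr;
            currentData = nullptr;
            currentLength = 0;
            return;
        }
        currentData = current->data;
        currentLength = std::min(current->length, bytesLeft);
    }

    // Accessors do nothing except return precomputed values.
    const char* getData() const { return currentData; }
    uint32_t getLength() const { return currentLength; }

  private:
    const Chunk* current;     // Chunk being visited, nullptr when done.
    const char* currentData;  // Start of the in-range bytes of current.
    uint32_t currentLength;   // Number of in-range bytes in current.
    uint32_t bytesLeft;       // Bytes of the range not yet passed.
};

// Hypothetical helper: copy a range out of a chunk list via the iterator.
std::string collectRange(const Chunk* first, uint32_t offset,
                         uint32_t length) {
    std::string out;
    for (RangeIterator it(first, offset, length); !it.isDone(); it.next()) {
        out.append(it.getData(), it.getLength());
    }
    return out;
}
```

With this layout the loop body calls only the two trivial accessors, so the cost per chunk is concentrated in a single call to next().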

Performance comparisons:

                                        Old     New
create, append 1 chunk, delete       17.0ns  12.3ns
create, alloc 1 chunk, delete        22.3ns  15.6ns
create, copy in 1 chunk, delete      25.4ns  13.2ns
extend existing chunk (alloc only)    9.3ns   5.7ns
copy 2 small chunks out of buffer    19.2ns  19.2ns
iterate over buffer with 5 chunks    51.0ns  22.6ns
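
For readers who want relative figures, the Old and New columns above can be converted into speedup ratios and percentage savings. The following is a small sketch of that arithmetic only; the Measurement struct and the function names are hypothetical, and the numbers are copied directly from the table.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical record for one row of the table above.
struct Measurement {
    std::string name;
    double oldNs;   // Time before the rewrite, in nanoseconds.
    double newNs;   // Time after the rewrite, in nanoseconds.
};

// Ratio of old time to new time; 1.0 means no change.
double speedup(const Measurement& m) {
    return m.oldNs / m.newNs;
}

// Percentage of the old time that was eliminated.
double percentSaved(const Measurement& m) {
    return 100.0 * (m.oldNs - m.newNs) / m.oldNs;
}

// The six rows of the table, in the same order.
std::vector<Measurement> bufferMeasurements() {
    return {
        {"create, append 1 chunk, delete",     17.0, 12.3},
        {"create, alloc 1 chunk, delete",      22.3, 15.6},
        {"create, copy in 1 chunk, delete",    25.4, 13.2},
        {"extend existing chunk (alloc only)",  9.3,  5.7},
        {"copy 2 small chunks out of buffer",  19.2, 19.2},
        {"iterate over buffer with 5 chunks",  51.0, 22.6},
    };
}

// Print one line per row with its speedup and percentage saved.
void printSummary() {
    for (const Measurement& m : bufferMeasurements()) {
        std::printf("%-36s %5.2fx  %5.1f%% saved\n",
                    m.name.c_str(), speedup(m), percentSaved(m));
    }
}
```

By this arithmetic, iterating over a 5-chunk buffer became a little more than twice as fast, while copying two small chunks out of a buffer was unchanged.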

The median read time in "clusterperf readDist" dropped about 40ns as a result of these changes (from 4.95µs to 4.91µs).