The purpose of this page is to list tools and techniques that people can use to measure the performance of RAMCloud at various levels. The scope is intended to be broad, from using CPU performance counters, to RAMCloud's internal performance monitoring infrastructure, to existing RAMCloud client applications that collect statistics.

TimeTrace

If you are trying to analyze the latency of a particular operation to see where the time is going, you will probably find the TimeTrace class useful. An instance of this class keeps a circular buffer of events and the times at which they occurred, which you can then print. Typically, you'll use the instance that is part of each RAMCloud context. You can then sprinkle calls to the record method in your code, like this:

context->timeTrace->record("starting read operation");
...
context->timeTrace->record("found hash table entry");
...
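
record can also take printf-style arguments: depending on the version you have, TimeTrace accepts a format string plus a few 32-bit integer values, which are formatted lazily when the trace is printed so the recording path stays cheap. A sketch, where readId and bucket are hypothetical local variables:

context->timeTrace->record("read %u hit hash bucket %u", readId, bucket);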

Then you run a benchmark that executes the desired code multiple times, enough to fill up the circular buffer. Once the benchmark has run, you can print the buffer to the log using the printToLog method, like this:

context->timeTrace->printToLog();

From the information that's printed you can identify the places where the most time is being spent, and then add additional calls to TimeTrace to divide up the large blocks of time. Eventually you should be able to figure out exactly where all the time is going. If the recording is happening on a RAMCloud server but you are running a benchmark from a client machine, you can invoke a ServerControl RPC from the client to ask the server to dump its time trace buffer to the log:

cluster->objectServerControl(tableId, key, keyLength, WireFormat::LOG_TIME_TRACE);

In this example, cluster is a pointer to the RamCloud object for the cluster; tableId is the identifier for a table, and key and keyLength specify the key for an object. The RPC is routed to the server containing the object given by these parameters, and that server then dumps its time trace to its log.
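
For example, a minimal sketch (the table name "test" and the one-byte key "0" are arbitrary placeholders; any object residing on the server of interest will do):

uint64_t tableId = cluster->getTableId("test");
cluster->objectServerControl(tableId, "0", 1, WireFormat::LOG_TIME_TRACE);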

Internal RAMCloud Metrics

RAMCloud has a fairly extensive internal metrics-gathering infrastructure. The RawMetrics class defines a number of uint64_t counters, grouped by subsystem, that are maintained by various modules. For example, servicing an RPC increments both a count counter and a ticks counter (CPU cycles spent executing the operation) for that opcode, such as 'rpc.ReadCount' and 'rpc.ReadTicks'.

Accessing Metrics

The ClusterMetrics class provides a very simple way for clients to gather the RawMetrics of every server in the cluster. A common use is to gather the cluster's metrics immediately before and after an experiment and take the difference (ClusterMetrics::difference()) to get just the counts attributable to that experiment.
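
A minimal sketch of that pattern (assuming client is a RamCloud* for the cluster; exact signatures may vary slightly across versions, and runExperiment is a hypothetical stand-in for the workload under test):

ClusterMetrics before(client);      // gathers RawMetrics from every server
runExperiment();                    // hypothetical: the workload under test
ClusterMetrics after(client);
ClusterMetrics diff = after.difference(before);
// diff now holds only the counts accumulated during the experiment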

Incrementing Metrics

There is a global pointer (in the RAMCloud:: namespace) called metrics that points to the global instance of RawMetrics. Just #include "RawMetrics.h" and dereference that pointer. For example, the 'rpc.ReadTicks' metric is accessible via metrics->rpc.ReadTicks.
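
For example, a quick sketch using the counters mentioned above (elapsedTicks is a hypothetical value computed from Cycles::rdtsc() deltas):

metrics->rpc.ReadCount++;                // one more read serviced
metrics->rpc.ReadTicks += elapsedTicks;  // cycles spent servicing it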

Adding New Metrics

Adding metrics requires adding fields to the scripts/rawmetrics.py file, which is used to auto-generate the RawMetrics class definition. Once added, you should be able to #include "RawMetrics.h" in your module and twiddle the 'metrics->myGroup.myCounter' counter.

Quite frequently you may find yourself wanting to add a temporary counter as you try to track down what's happening. The RawMetrics class defines a number of miscellaneous counters for this purpose so you don't have to bother modifying rawmetrics.py for a temporary value. These are named 'temp.ticks0' through 'temp.ticks10' and 'temp.count0' through 'temp.count10'.
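
For instance, here is a sketch that times a suspect block with the first temp counter pair (Cycles::rdtsc() is described in the next section):

uint64_t start = Cycles::rdtsc();
// ... the code being investigated ...
metrics->temp.count0++;                           // times this path executed
metrics->temp.ticks0 += Cycles::rdtsc() - start;  // cycles spent in it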

The CycleCounter and Cycles Classes

CycleCounter and Cycles make it very easy to keep track of CPU time spent executing blocks of code.

The Cycles class wraps the CPU's timestamp (tick/cycle) counter and gives you methods to read it (Cycles::rdtsc()), as well as to convert between ticks and seconds/microseconds/nanoseconds.

CycleCounter simply takes a timestamp reading in its constructor via Cycles::rdtsc() and computes a delta when stop() is called. It can also take a pointer in its constructor, in which case the destructor implicitly calls stop() and adds the delta to the value pointed at. Here's an example:

CycleCounter<> ticks;  // default template type is uint64_t
... do something ...
printf("took %lu ticks\n", ticks.stop());

extern uint64_t globalWeirdConditionCounter;
if (someConditionRequiringWork) {
    CycleCounter<> ticks2(&globalWeirdConditionCounter);
    ... do the work ...
    // the ticks2 destructor will call stop() and apply the
    // delta to globalWeirdConditionCounter automatically.
}

printf("Spent a total of %f seconds in weird condition\n",
    Cycles::toSeconds(globalWeirdConditionCounter));

Warnings / Caveats

A few standard caveats apply to raw cycle counting with rdtsc: the instruction is not serializing, so out-of-order execution can blur very short intervals; on CPUs without an invariant TSC, frequency scaling changes the tick rate; and timestamps taken on different cores are only directly comparable if the cores' TSCs are synchronized.

RAMCloud Client Applications

A number of client applications have been developed for measuring performance. You might want to use them as is, or as a basis for your own benchmark clients.

LogCleanerBenchmark is one application that stresses the write throughput of RAMCloud. The user may specify a key locality distribution (uniform, zipfian, or hot-and-cold), a fixed object size, the percentage of memory to use for live data, and many other options. It then blasts the server with writes until the cleaning overhead converges to a stable value. Afterwards it dumps a large number of metrics, including latency histograms, ClusterMetrics output, and an assortment of log- and cleaner-specific metrics.

Standalone Applications

In building performance-critical pieces of the system, I (Steve) found it useful to build simple standalone applications that instantiate a very small piece of the system (for example, just a hash table, or log, with no RPC subsystem) and micro-benchmark specific hot code paths. Examples of these include HashTableBenchmark, CleanerCompactionBenchmark, and RecoverSegmentBenchmark. In some sense these are like ad-hoc unit tests of performance. (Some day it would be nice if they were less ad-hoc and we had a more extensive collection of them.)

Running these locally is much easier than spinning up a cluster, and you can stress individual components much more than you otherwise would. It's also very easy to run them through gprof, perf, or other tools for analysis.
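
As a rough illustration of the shape of such a program (this is a sketch, not one of the actual benchmarks; doOperation is a hypothetical stand-in for the hot path under test, such as a hash table lookup):

#include <cstdio>
#include "Cycles.h"

using namespace RAMCloud;

static volatile uint64_t sink;
static void doOperation() { sink++; }   // stand-in for the real hot path

int
main()
{
    const int count = 1000000;
    uint64_t start = Cycles::rdtsc();
    for (int i = 0; i < count; i++)
        doOperation();
    uint64_t ticks = Cycles::rdtsc() - start;
    printf("%.2f ns per operation\n",
           Cycles::toSeconds(ticks) * 1e9 / count);
    return 0;
}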

Monitoring the Kernel

RAMCloud generally interacts with the kernel as little as possible, but when you need to see what it is doing there, you may want to look into SystemTap and strace. If SystemTap is anything like DTrace, it should be enormously helpful.

Monitoring Memory Bandwidth

Memory bandwidth becomes a bottleneck in many cases. This is often true during recovery, especially when recovery masters and backups are stacked on the same machines. The cleaner can also exhaust memory bandwidth under heavy write workloads with larger blocks (~10KB).

All modern Intel CPUs appear to have performance counters in their on-die memory controllers. On some machines you can only get aggregate statistics (number of cache lines read, fully written, partially written); on others, statistics are gathered per memory channel. The great thing about using memory controller counters is that you catch all traffic: device I/O, traffic due to speculative execution, everything. To get at this you'll likely want to consider one of two tools for our current machines: