Measuring RAMCloud Performance

The purpose of this page is to list tools and techniques that people can use to measure the performance of RAMCloud at various levels. The scope is intended to be broad, from using CPU performance counters, to RAMCloud's internal performance monitoring infrastructure, to existing RAMCloud client applications that collect statistics.

TimeTrace

If you are trying to analyze the latency of a particular operation to see where the time is going, you will probably find the TimeTrace class useful. An instance of this class keeps a circular buffer of events and the times when they occurred, which you can then print. Typically, you'll use the instance that is part of each RAMCloud context. You can then sprinkle calls to the record method in your code, like this:

context->timeTrace->record("starting read operation");
...
context->timeTrace->record("found hash table entry");
...

Then you run a benchmark that executes the desired code multiple times, enough to fill up the circular buffer. Once the benchmark has run, you can print out the buffer to the log using the printToLog method, like this:

context->printToLog();

From the information that's printed you can identify the places where the most time is being spent, and then add additional calls to TimeTrace to divide up the large blocks of time. Eventually you should be able to figure out exactly where all the time is going. If the recording is happening on a RAMCloud server but you are running a benchmark from a client machine, you can invoke a ServerControl RPC on the client to ask the server to dump its time trace buffer to the log:

cluster->objectServerControl(tableId, key, keyLength, WireFormat::LOG_TIME_TRACE);

In this example, cluster is a pointer to the RamCloud object for a cluster. TableId is the identifier for a table, and key and keyLength specify the key for an object; the RPC will be sent to the server containing the object given by these parameters, and the server will then dump its time trace log.

Internal RAMCloud Metrics

RAMCloud has a fairly extensive internal metrics-gathering infrastructure. The RawMetrics class defines a number of subsystem-grouped uint64_t counters that are maintained by various modules. For example, every RPC opcode serviced increments an operation counter and a ticks counter (cpu cycles spent executing the operation) such as 'rpc.ReadCount' and 'rpc.ReadTicks'.

Accessing Metrics

The ClusterMetrics class provides a very simple way for clients to gather the RawMetrics of every server in the cluster. A common use is to gather the cluster's metrics immediately before and after an experiment and taking the difference (ClusterMetrics::difference()) to get just the counts applicable to that experiment.

Incrementing Metrics

There is a global (in the RAMCloud:: namespace) pointer called metrics that points to the global instance of RawMetrics. Just #include <RawMetrics.h> and dereference that pointer. For example, the 'rpc.ReadTicks' metric is accessible via metrics->rpc.ReadTicks.

Adding New Metrics

Adding metrics requires adding fields to the scripts/rawmetrics.py file, which is used to auto-generate the RawMetrics class definition. Once added, you should be able to #include "RawMetrics.h" in your module and twiddle the 'metrics->myGroup.myCounter' counter.

Quite frequently you may find yourself wanting to add a temporary counter as you try to track down what's happening. The RawMetrics class defines a number of miscellaneous counters for this purpose so you don't have to bother modifying rawmetrics.py for a temporary value. These are named 'temp.ticks0' through 'temp.ticks10' and 'temp.count0' through 'temp.count10'.

The CycleCounter and Cycle Classes

CycleCounter and Cycles make it very easy to keep track of CPU time spent executing blocks of code.

Cycles:: wraps the CPU's timestamp (tick/cycle) counter and gives you methods to read it (::rdtsc()), as well as convert betweek ticks and seconds/microseconds/nanoseconds.

CycleCounter simply takes a timestamp reading in the constructor via Cycles::rdtsc(), and computes a delta when stop() is called. Moreover, it optionally takes a pointer in its constructor and implicitly calls stop() in the destructor, adding the delta to the value pointed at. Here's an example:

CycleCounter<> ticks;  // default template type is uint64_t
... do something ...
printf("took %lu ticks\n", ticks.stop());

extern uint64_t globalWeirdConditionCounter;
if (someConditionRequiringWork) {
    CycleCounter<> ticks2(&globalWeirdConditionCounter);
    ... do the work ...
    // the ticks2 destructor will call stop() and apply the
    // delta to globalWeirdConditionCounter automatically.
}

printf("Spent a total of %f seconds in weird condition\n",
    Cycles::toSeconds(globalWeirdConditionCounter));

Warnings / Caveats

Note that the counters are reasonably expensive to access. Beware in hot code paths and especially with counters that are shared between threads.

The counters use atomic integers for consistency. Incrementing them frequently may lead to surprising overheads, not only because they require atomic ops, but because cache lines must be ping-ponged among cores. It is usually a better idea to aggregate statistics locally (in a stack variable, perhaps) and then apply them to the global counter at a lower frequency. For instance, if you have a tight loop over N operations, do not increment the RawMetrics counter once for each loop, rather increment it by N after the loop. This may seem obvious, but it has bitten us multiple times in very hot paths. (Someone should write a DelayedMetric class that wraps a local uint64_t and a RawMetrics pointer, keeps track of a local count by overloading the arithmetic operators, and updates the RawMetric counter once in the destructor.)

There may also be a risk of false sharing with our counters (I don't think they're cacheline aligned). In general, your mom was dead wrong when she taught you to share as a child. In general you should keep data thread-local and try to share as little and infrequently as possible. With metrics counters this is especially easy.

RAMCloud Client Applications

A number of client applications have been developed for measuring performance. You might want to use them as is, or as a basis for your own benchmark clients.

LogCleanerBenchmark is one application that stresses the write throughput of RAMCloud. The user may specify a key locality distribution (uniform, zipfian, or hot-and-cold), a fixed object size, and the percentage of memory to use for live data), and many other options. It then blasts the server with writes until the cleaning overhead converges to a stable value. Afterwards it dumps a large number of metrics, including latency histograms, ClusterMetrics output, and an assortment of log- and cleaner-specific metrics.

Standalone Applications

In building performance-critical pieces of the system, I (Steve) found it useful to build simple standalone applications that instantiate a very small piece of the system (for example, just a hash table, or log, with no RPC subsystem) and micro-benchmark specific hot code paths. Examples of these include HashTableBenchmark, CleanerCompactionBenchmark, and RecoverSegmentBenchmark. In some sense these are like ad-hoc unit tests of performance. (Some day it would be nice if they were less ad-hoc and we had a more extensive collection of them.)

Running these locally is much easier than spinning up a cluster, and you can stress individual components much more than you otherwise would. It's also very easy to run them through gprof, perf, or other tools for analysis.

Monitoring the Kernel

RAMCloud generally interacts with the kernel as little as possible. You may want to look into SystemTap, and strace. If SystemTap is anything like DTrace, it should be enormously helpful.

Monitoring Memory Bandwidth

In many cases memory bandwidth becomes a bottleneck. For example, during recovery this is often true, especially when stacking recovery masters and backups on the same machines. The cleaner can also exhaust memory bandwidth under heavy write workloads with larger blocks (~10KB).

All modern Intel CPUs appear to have performance counters in their on-die memory controllers. On some machines you can only get aggregate statistics (number of cache lines read, fully written, partially written). On others, statistics are gathered per memory channel. The great thing about using memory controller counters is that you catch all traffic – device IO, traffic due to speculative execution -- whatever, it's there. To get at this you'll likely want to consider one of two tools for our current machines:

For rc01 to rc80, Steve's 'perfctr' kernel module (~rumble/tmp/ramcloud-old/perfctr on the cluster)

These are X3470 Nehalem-based Xeons, which have a very simple set of memory controller registers on the "uncore" that can be read via MSRs. Steve wrote an ugly and simple kernel module that will set up these counters and let you read them via /proc/perfctr. Each counter counts the number of 64-byte cache lines. For details see Intel's "Performance Analysis Guide for Intel Core i7 and Xeon 5500 Processors".

It's possible that Linux's perfcounter framework does or will support this, but it's a bit beastly and difficult to figure out. The Nehalem counters were so simple that a custom kernel module was the most expedient. Note that the same functionality could be implemented in userspace using the 'msr' module and special files under /proc that allow access to MSRs.
For rcmonster, use the Intel Performance Counter Monitor.

rcmonster is a Sandy Bridge machine, with more complicated counters that are accessible via the PCI configuration space. Here you have two options, either you can read aggregate statistics from the "Home Agent", which sits between the QPI interconnect and the memory controller, or read per-channel values from the memory controller itself (there are 4 memory channels per socket). If using multiple sockets, memory controllers apparently reside at different bus addresses. If you really want the details, look at the "Intel Xeon Processor E5-2600 Product Family Uncore Guide". (Intel's Architecture Developer's Manual Vol 3B also has some details, but is impossibly terse). If you'd rather just get work done, the open source Intel Performance Counter Monitor framework groks this functionality for you and is very hackable open source c++. You'll want to use the "pcm-memory.x" tool.