The purpose of this page is to list tools and techniques that people can use to measure the performance of RAMCloud at various levels. The scope is intended to be broad, from using CPU performance counters, to RAMCloud's internal performance monitoring infrastructure, to existing RAMCloud client applications that collect statistics.

TimeTrace

If you are trying to analyze the latency of a particular operation to see where the time is going, you will probably find the TimeTrace class useful. An instance of this class keeps a circular buffer of events and the times when they occurred, which you can then print. Typically, you'll use the instance that is part of each RAMCloud context. You can then sprinkle calls to the record method in your code, like this:

No Format
context->timeTrace->record("starting read operation");
...
context->timeTrace->record("found hash table entry");
...

Then you run a benchmark that executes the desired code multiple times, enough to fill up the circular buffer. Once the benchmark has run, you can print out the buffer to the log using the printToLog method, like this:

No Format
context->timeTrace->printToLog();

From the information that's printed you can identify the places where the most time is being spent, and then add additional calls to TimeTrace to divide up the large blocks of time. Eventually you should be able to figure out exactly where all the time is going. If the recording is happening on a RAMCloud server but you are running a benchmark from a client machine, you can invoke a ServerControl RPC from the client to ask the server to dump its time trace buffer to the log:

No Format
cluster->objectServerControl(tableId, key, keyLength, WireFormat::LOG_TIME_TRACE);

In this example, cluster is a pointer to the RamCloud object for a cluster, tableId is the identifier for a table, and key and keyLength specify the key for an object; the RPC will be sent to the server containing the object given by these parameters, and that server will then dump its time trace to its log.

Internal RAMCloud Metrics

RAMCloud has a fairly extensive internal metrics-gathering infrastructure. The RawMetrics class defines a number of subsystem-grouped uint64_t counters that are maintained by various modules. For example, every RPC opcode serviced increments an operation-count counter and a ticks counter (CPU cycles spent executing the operation), such as 'rpc.ReadCount' and 'rpc.ReadTicks'.
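
To give a concrete feel for the pattern, here is a rough sketch of how an operation handler might bump its counters. The field names (rpc.readCount, rpc.readTicks) are placeholders rather than the actual generated RawMetrics members, and the code assumes RAMCloud's RawMetrics.h and CycleCounter.h headers:

Code Block
// Sketch only: the field names used here (rpc.readCount, rpc.readTicks)
// are illustrative placeholders for the generated RawMetrics members.
void
handleReadRpc()
{
    metrics->rpc.readCount++;                      // one more Read serviced
    CycleCounter<RawMetric> ticks(&metrics->rpc.readTicks);
    // ... do the actual work of the read ...
}   // ticks' destructor adds the elapsed cycles to rpc.readTicks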

...

Code Block
CycleCounter<> ticks;  // default template type is uint64_t
... do something ...
printf("took %lu ticks\n", ticks.stop());

extern uint64_t globalWeirdConditionCounter;
if (someConditionRequiringWork) {
    CycleCounter<> ticks2(&globalWeirdConditionCounter);
    ... do the work ...
    // the ticks2 destructor will call stop() and apply the
    // delta to globalWeirdConditionCounter automatically.
}

printf("Spent a total of %f seconds in weird condition\n",
    Cycles::toSeconds(globalWeirdConditionCounter));

...

  • Note that the counters are reasonably expensive to access. Be careful in hot code paths, especially with counters that are shared between threads.

    The counters use atomic integers for consistency. Incrementing them frequently may lead to surprising overheads, not only because they require atomic ops, but because cache lines must be ping-ponged among cores. It is usually a better idea to aggregate statistics locally (in a stack variable, perhaps) and then apply them to the global counter at a lower frequency. For instance, if you have a tight loop over N operations, do not increment the RawMetrics counter once per iteration; rather, increment it by N after the loop. This may seem obvious, but it has bitten us multiple times in very hot paths. (Someone should write a DelayedMetric class that wraps a local uint64_t and a RawMetrics pointer, keeps track of a local count by overloading the arithmetic operators, and updates the RawMetric counter once in the destructor; a rough sketch of the idea appears below.)

    There may also be a risk of false sharing with our counters (I don't think they're cacheline aligned). Your mom was dead wrong when she taught you to share as a child: keep data thread-local and share as little and as infrequently as possible. With metrics counters this is especially easy.
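
    Along the lines of the DelayedMetric suggestion above, here is a rough, hypothetical sketch (no such class exists in the tree) of a wrapper that batches updates locally and touches the shared counter only once:

Code Block
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the DelayedMetric idea; no such class exists in
// the RAMCloud tree. Updates accumulate in a plain local uint64_t and are
// applied to the shared (atomic) counter exactly once, in the destructor.
template<typename Counter>
class DelayedMetric {
  public:
    explicit DelayedMetric(Counter* target)
        : target(target), local(0) {}
    ~DelayedMetric() { *target += local; }               // one shared update
    DelayedMetric& operator+=(uint64_t delta) { local += delta; return *this; }
    DelayedMetric& operator++() { local++; return *this; }
  private:
    Counter* target;    // shared counter (e.g. a RawMetrics field)
    uint64_t local;     // thread-local tally; no atomic ops, no sharing
};

// Usage: one update to the shared counter for n iterations instead of n.
void processBatch(std::atomic<uint64_t>* sharedCounter, int n)
{
    DelayedMetric<std::atomic<uint64_t>> count(sharedCounter);
    for (int i = 0; i < n; i++) {
        // ... do one unit of work ...
        ++count;
    }
}   // count's destructor adds n to *sharedCounter here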

...

In building performance-critical pieces of the system, I (Steve) found it useful to build simple standalone applications that instantiate a very small piece of the system (for example, just a hash table, or a log, with no RPC subsystem) and micro-benchmark specific hot code paths. Examples of these include HashTableBenchmark, CleanerCompactionBenchmark, and RecoverSegmentBenchmark. In some sense these are like ad-hoc unit tests of performance. (Some day it would be nice if they were less ad-hoc and we had a more extensive collection of them.)

Running these locally is much easier than spinning up a cluster, and you can stress individual components much more than you otherwise would. It's also very easy to run them through gprof, perf, or other tools for analysis.
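
For illustration, the skeleton of such a benchmark is usually just a main() that builds one component, drives it in a tight loop, and reports per-operation cost. In this sketch std::unordered_map stands in for the component under test; it is not one of the actual benchmark programs listed above:

Code Block
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Minimal sketch of a standalone micro-benchmark; std::unordered_map
// stands in for the component under test (e.g. RAMCloud's HashTable).
int main()
{
    const int count = 1000000;
    std::unordered_map<uint64_t, uint64_t> table;
    table.reserve(count);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < count; i++)
        table[i] = i;                              // hot path being measured
    auto stop = std::chrono::high_resolution_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    printf("%d inserts in %.3f s (%.1f ns/insert)\n",
           count, secs, secs * 1e9 / count);
    return 0;
}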

...