...

  • Note that the counters are reasonably expensive to access. Beware of them in hot code paths, especially when counters are shared between threads.

    The counters use atomic integers for consistency. Incrementing them frequently can add surprising overhead, not only because they require atomic ops, but because the cache lines must ping-pong among cores. It is usually better to aggregate statistics locally (in a stack variable, perhaps) and then apply them to the global counter at a lower frequency. For instance, if you have a tight loop over N operations, do not increment the RawMetrics counter once per iteration; increment it by N after the loop. This may seem obvious, but it has bitten us multiple times in very hot paths. (Someone should write a DelayedMetric class that wraps a local uint64_t and a RawMetrics pointer, keeps track of a local count by overloading the arithmetic operators, and updates the RawMetric counter once in the destructor; a sketch of that idea appears below.)

    There may also be a risk of false sharing with our counters (I don't think they're cache-line aligned). Your mom was dead wrong when she taught you to share as a child: keep data thread-local and share as little, and as infrequently, as possible. With metrics counters this is especially easy.
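
    Here is a minimal sketch of that DelayedMetric idea, assuming the shared counter is exposed as a std::atomic<uint64_t> (the real RawMetrics fields may differ):

        #include <atomic>
        #include <cstdint>

        // Wraps a local uint64_t plus a pointer to a shared counter; the
        // shared counter is touched exactly once, in the destructor.
        class DelayedMetric {
          public:
            explicit DelayedMetric(std::atomic<uint64_t>* metric)
                : metric(metric), local(0) {}
            // Hot path: these touch only the thread-local total.
            DelayedMetric& operator++() { ++local; return *this; }
            DelayedMetric& operator+=(uint64_t delta) { local += delta; return *this; }
            ~DelayedMetric() {
                if (local != 0)
                    metric->fetch_add(local, std::memory_order_relaxed);  // one atomic op total
            }
          private:
            std::atomic<uint64_t>* metric;   // shared, possibly contended counter
            uint64_t local;                  // cheap thread-local running total
        };

        // Usage: aggregate in a stack variable, pay the atomic cost once.
        void processBatch(std::atomic<uint64_t>* opCount, size_t n) {
            DelayedMetric ops(opCount);
            for (size_t i = 0; i < n; i++) {
                // ... do the real work ...
                ++ops;
            }
        }   // destructor applies n to *opCount in a single fetch_add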

...

In building performance-critical pieces of the system, I (Steve) found it useful to build simple standalone applications that instantiate a very small piece of the system (for example, just a hash table or a log, with no RPC subsystem) and micro-benchmark specific hot code paths. Examples include HashTableBenchmark, CleanerCompactionBenchmark, and RecoverSegmentBenchmark. In some sense these are ad-hoc unit tests of performance. (Some day it would be nice if they were less ad-hoc and we had a more extensive collection of them.)

Running these locally is much easier than spinning up a cluster, and you can stress individual components much more than you otherwise would. It's also very easy to run them through gprof, perf, or other tools for analysis.
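
As a rough illustration of what these look like, here is a bare-bones skeleton of a standalone micro-benchmark (hypothetical code, not taken from HashTableBenchmark or the others; it times a placeholder hot path with std::chrono):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // Placeholder for the hot path under test (e.g. a hash table lookup).
    static uint64_t hotPath(uint64_t x) {
        return x * 2654435761ULL;
    }

    int main() {
        const uint64_t iterations = 100000000;
        uint64_t sink = 0;   // keeps the compiler from deleting the loop

        auto start = std::chrono::steady_clock::now();
        for (uint64_t i = 0; i < iterations; i++)
            sink += hotPath(i);
        auto stop = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(stop - start).count();
        printf("%.2f ns per op (sink=%llu)\n",
               1e9 * secs / iterations,
               (unsigned long long) sink);
        return 0;
    }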

...