The purpose of this page is to list tools and techniques that people can use to measure the performance of RAMCloud at various levels. The scope is intended to be broad, from using CPU performance counters, to RAMCloud's internal performance monitoring infrastructure, to existing RAMCloud client applications that collect statistics.
If you are trying to analyze the latency of a particular operation to see where the time is going, you will probably find the TimeTrace class useful. An instance of this class keeps a circular buffer of events and the times when they occurred, which you can then print. Typically, you'll use the instance that is part of each RAMCloud Context. You can then sprinkle calls to the record method in your code, like this:
context->timeTrace->record("starting read operation");
...
context->timeTrace->record("found hash table entry");
...
Then you run a benchmark that executes the desired code multiple times, enough to fill up the circular buffer. Once the benchmark has run, you can print out the buffer to the log using the printToLog method, like this:
context->timeTrace->printToLog();
From the information that's printed you can identify the places where the most time is being spent, and then add additional calls to TimeTrace to divide up the large blocks of time. Eventually you should be able to figure out exactly where all the time is going. If the recording is happening on a RAMCloud server but you are running a benchmark from a client machine, you can invoke a ServerControl RPC on the client to ask the server to dump its time trace buffer to the log:
cluster->objectServerControl(tableId, key, keyLength, WireFormat::LOG_TIME_TRACE);
In this example, cluster is a pointer to the RamCloud object for a cluster. tableId is the identifier for a table, and key and keyLength specify the key for an object; the RPC will be sent to the server containing the object given by these parameters, and the server will then dump its time trace log.
RAMCloud has a fairly extensive internal metrics-gathering infrastructure. The RawMetrics class defines a number of subsystem-grouped uint64_t counters that are maintained by various modules. For example, every RPC opcode serviced increments an operation counter and a ticks counter (CPU cycles spent executing the operation), such as 'rpc.ReadCount' and 'rpc.ReadTicks'.
Accessing Metrics
The ClusterMetrics class provides a very simple way for clients to gather the RawMetrics of every server in the cluster. A common use is to gather the cluster's metrics immediately before and after an experiment and take the difference (ClusterMetrics::difference()) to get just the counts applicable to that experiment.
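For example, a benchmark client might bracket its workload as in the following sketch (the exact ClusterMetrics constructor and accessors shown are assumptions; check ClusterMetrics.h for the real interface):

#include "ClusterMetrics.h"
#include "RamCloud.h"

void measureExperiment(RAMCloud::RamCloud* cluster)
{
    // Snapshot every server's RawMetrics before the experiment.
    RAMCloud::ClusterMetrics before(cluster);

    // ... run the workload being measured ...

    // Snapshot again; the difference contains only the counts
    // accrued during the experiment.
    RAMCloud::ClusterMetrics after(cluster);
    RAMCloud::ClusterMetrics delta = after.difference(before);
}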
Incrementing Metrics
There is a global (in the RAMCloud:: namespace) pointer called metrics that points to the global instance of RawMetrics. Just #include <RawMetrics.h> and dereference that pointer. For example, the 'rpc.ReadTicks' metric is accessible via metrics->rpc.ReadTicks.
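In code, incrementing a metric is just an ordinary update through that pointer; for instance (a minimal sketch, where elapsedTicks is a hypothetical cycle count computed by the caller):

#include "RawMetrics.h"

// Count one more serviced read and the CPU cycles it consumed.
// (elapsedTicks is a hypothetical value computed elsewhere.)
metrics->rpc.ReadCount++;
metrics->rpc.ReadTicks += elapsedTicks;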
Adding New Metrics
Adding metrics requires adding fields to the scripts/rawmetrics.py file, which is used to auto-generate the RawMetrics class definition. Once added, you should be able to #include "RawMetrics.h" in your module and twiddle the 'metrics->myGroup.myCounter' counter.
Quite frequently you may find yourself wanting to add a temporary counter as you try to track down what's happening. The RawMetrics class defines a number of miscellaneous counters for this purpose so you don't have to bother modifying rawmetrics.py for a temporary value. These are named 'temp.ticks0' through 'temp.ticks10' and 'temp.count0' through 'temp.count10'.
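For example, to instrument a suspicious code path temporarily, you might write something like this (a sketch; the surrounding code is hypothetical, and Cycles is described in the next section):

#include "Cycles.h"
#include "RawMetrics.h"

// How often do we take this branch, and how long does it take?
metrics->temp.count0++;
uint64_t start = Cycles::rdtsc();
// ... the code path under suspicion ...
metrics->temp.ticks0 += Cycles::rdtsc() - start;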
The CycleCounter and Cycles Classes
CycleCounter and Cycles make it very easy to keep track of CPU time spent executing blocks of code.
Cycles wraps the CPU's timestamp (tick/cycle) counter and gives you methods to read it (Cycles::rdtsc()), as well as to convert between ticks and seconds/microseconds/nanoseconds.
CycleCounter simply takes a timestamp reading in the constructor via Cycles::rdtsc(), and computes a delta when stop() is called. Moreover, it optionally takes a pointer in its constructor and implicitly calls stop() in the destructor, adding the delta to the value pointed at. Here's an example:
CycleCounter<> ticks;    // default template type is uint64_t
... do something ...
printf("took %lu ticks\n", ticks.stop());

extern uint64_t globalWeirdConditionCounter;
if (someConditionRequiringWork) {
    CycleCounter<> ticks2(&globalWeirdConditionCounter);
    ... do the work ...
    // the ticks2 destructor will call stop() and apply the
    // delta to globalWeirdConditionCounter automatically.
}
printf("Spent a total of %f seconds in weird condition\n",
       Cycles::toSeconds(globalWeirdConditionCounter));
Warnings / Caveats
A number of client applications have been developed for measuring performance. You might want to use them as is, or as a basis for your own benchmark clients.
LogCleanerBenchmark is one application that stresses the write throughput of RAMCloud. The user may specify a key locality distribution (uniform, zipfian, or hot-and-cold), a fixed object size, the percentage of memory to use for live data, and many other options. It then blasts the server with writes until the cleaning overhead converges to a stable value. Afterwards it dumps a large number of metrics, including latency histograms, ClusterMetrics output, and an assortment of log- and cleaner-specific metrics.
In building performance-critical pieces of the system, I (Steve) found it useful to build simple standalone applications that instantiate a very small piece of the system (for example, just a hash table, or log, with no RPC subsystem) and micro-benchmark specific hot code paths. Examples of these include HashTableBenchmark, CleanerCompactionBenchmark, and RecoverSegmentBenchmark. In some sense these are like ad-hoc unit tests of performance. (Some day it would be nice if they were less ad-hoc and we had a more extensive collection of them.)
Running these locally is much easier than spinning up a cluster, and you can stress individual components much more than you otherwise would. It's also very easy to run them through gprof, perf, or other tools for analysis.
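For instance, the skeleton of such a micro-benchmark might look like the following (a minimal sketch using the Cycles class described above; the component setup and hot path are elided):

#include <cstdio>
#include "Cycles.h"

using namespace RAMCloud;

int main()
{
    const int count = 1000000;
    // ... set up the component under test (e.g. a hash table) ...
    uint64_t start = Cycles::rdtsc();
    for (int i = 0; i < count; i++) {
        // ... exercise the hot code path ...
    }
    double seconds = Cycles::toSeconds(Cycles::rdtsc() - start);
    printf("%.1f ns per iteration\n", seconds * 1e9 / count);
    return 0;
}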
RAMCloud generally interacts with the kernel as little as possible. You may want to look into SystemTap, and strace. If SystemTap is anything like DTrace, it should be enormously helpful.
In many cases memory bandwidth becomes a bottleneck. This is often true during recovery, especially when stacking recovery masters and backups on the same machines. The cleaner can also exhaust memory bandwidth under heavy write workloads with larger blocks (~10KB).
All modern Intel CPUs appear to have performance counters in their on-die memory controllers. On some machines you can only get aggregate statistics (number of cache lines read, fully written, partially written). On others, statistics are gathered per memory channel. The great thing about using memory controller counters is that you catch all traffic: device IO, traffic due to speculative execution, whatever; it's all there. To get at this you'll likely want to consider one of two tools for our current machines: