How To Measure Performance

This page serves to collect ideas about how to measure performance; these might someday form the basis of a paper on this topic.

Never trust a number generated by a computer

All performance measurements are initially bad: there are bugs in the measurements, you are not measuring what you think you are measuring, or there are bugs in the system.
Biggest danger: the initial measurements look good, so you don't bother to check whether they are even correct.
If there is anything about a measurement that seems even a little suspicious, distrust everything until you explain it; it's possible that there is an error that has skewed all of your measurements.
Find multiple ways to measure each key result, and make sure they are consistent with each other.
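As a rough illustration of measuring one result in more than one way, the sketch below times the same loop with a single timer around the whole loop and with a sum of per-call timers, then reports how far the two disagree. The names measure_two_ways, relative_gap, and busy_work are made up for this example, and busy_work is only a stand-in for whatever operation you actually care about.

```python
import time


def measure_two_ways(op, n):
    """Time n calls of op with one outer timer and with a sum of per-call timers."""
    per_call_ns = 0
    loop_start = time.perf_counter_ns()
    for _ in range(n):
        start = time.perf_counter_ns()
        op()
        per_call_ns += time.perf_counter_ns() - start
    total_ns = time.perf_counter_ns() - loop_start
    return total_ns, per_call_ns


def relative_gap(a, b):
    """Relative disagreement between two measurements of the same quantity."""
    return abs(a - b) / max(a, b)


def busy_work():
    # Stand-in workload (an assumption for this sketch).
    sum(range(10_000))


total_ns, per_call_ns = measure_two_ways(busy_work, 200)
gap = relative_gap(total_ns, per_call_ns)
print(f"loop timer: {total_ns} ns, per-call sum: {per_call_ns} ns, gap: {gap:.1%}")
```

The per-call sum can never exceed the loop timer, since every per-call interval lies inside the loop. If the gap is large, the timing overhead itself is a significant part of what you are measuring, which is worth explaining before trusting either number.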

To understand performance at one level, you must measure the constituent factors that explain that performance; you should only trust a result once you understand clearly why you are getting that result.

Never trust a result if it doesn't make intuitive sense. If you see anything that isn't what you expected, measure more in order to understand exactly what is going on, then fix either the system or your intuition.
Common mistake: people make up plausible stories to explain a particular result, rather than measuring to see whether the story is true. Once you have a hypothesis about why something is happening, measure the system to verify that it is actually happening.
A common mistake is to measure performance end-to-end for some operation, then conclude based on intuitive arguments that a particular sub-portion accounts for most of the time. Instead measure one level deeper (i.e. time exactly the portion you think is slow, without anything else around it) to make sure your conclusions are correct.
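One possible way to measure one level deeper is to wrap each sub-portion of an operation in its own timer, so that the claim "this part accounts for most of the time" is checked rather than assumed. The sketch below is only an illustration: handle_request and its three phases (parse, compute, serialize) are hypothetical, and the timed helper is an assumption of this example, not part of any particular system.

```python
import json
import time
from contextlib import contextmanager

phase_ns = {}  # accumulated nanoseconds per named phase


@contextmanager
def timed(name):
    """Add the elapsed time of the enclosed block to phase_ns[name]."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        phase_ns[name] = phase_ns.get(name, 0) + time.perf_counter_ns() - start


def handle_request(payload):
    # Hypothetical operation with three phases, each timed on its own.
    with timed("total"):
        with timed("parse"):
            values = json.loads(payload)
        with timed("compute"):
            result = sorted(values)
        with timed("serialize"):
            return json.dumps(result)


payload = json.dumps(list(range(50_000, 0, -1)))
for _ in range(20):
    handle_request(payload)

for name in ("parse", "compute", "serialize"):
    share = phase_ns[name] / phase_ns["total"]
    print(f"{name:10s} {share:6.1%} of total")
```

If the shares do not add up to nearly 100%, some time is being spent outside the phases you instrumented, and that unexplained remainder is itself something to measure.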

If there are 3 interesting design ideas in your system, it is not sufficient to measure the system in aggregate, with all 3 ideas included; it's possible that one of the ideas is actually a bad idea that has made performance worse, but it is compensated by the other ideas. Find a way to measure each of the ideas independently to see how well they work.
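A minimal sketch of measuring ideas independently: run the benchmark for every on/off combination of the ideas, then look at the average effect of switching each one on. Everything named here is an assumption for illustration. The idea names (batching, caching, prefetch) are invented, and fake_run is a stand-in cost model with made-up numbers that exists only to exercise the harness; in practice you would replace it with a function that runs your real benchmark.

```python
from itertools import product


def ablation(run, ideas):
    """Call run(enabled) for every on/off combination of the given ideas."""
    results = {}
    for flags in product([False, True], repeat=len(ideas)):
        enabled = frozenset(name for name, on in zip(ideas, flags) if on)
        results[enabled] = run(enabled)
    return results


def marginal_effect(results, idea):
    """Average change in the metric from turning idea on, over all settings of the others."""
    deltas = [results[enabled | {idea}] - value
              for enabled, value in results.items() if idea not in enabled]
    return sum(deltas) / len(deltas)


def fake_run(enabled):
    # Stand-in for a real benchmark: made-up latencies in ms, NOT measured data.
    latency = 100
    if "batching" in enabled:
        latency -= 30
    if "caching" in enabled:
        latency -= 25
    if "prefetch" in enabled:
        latency += 10  # this idea makes things worse
    return latency


ideas = ["batching", "caching", "prefetch"]
results = ablation(fake_run, ideas)
print("all ideas on:", results[frozenset(ideas)], "vs none:", results[frozenset()])
for idea in ideas:
    print(f"{idea}: {marginal_effect(results, idea):+.1f} ms")
```

In this toy model the aggregate looks like a clear win, yet one of the three ideas has a positive (harmful) marginal effect that only shows up once each idea is measured separately.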

Figure out exactly what questions you would like to answer about your system.
Do this before running your performance measurements. Define your experiments around the questions you want to answer, not vice versa.

I was too scared to edit your wiki page, John, so I'm putting my ideas here. Please integrate them (or not) as you see fit. I'm sure this is worth writing, even if no PC would accept it.
When your graph is so obvious to you that you wonder whether others will find it boring, you're just about done.
Don't just look at averages. Also look at distributions, and check individual samples to make sure there's not a pattern (in RAMCloud, Alex and Mendel found that every other RPC was slower).
Script your entire experiment and graph end-to-end.
Generate your graphs incrementally, so you can see what the entire graph will look like while you're still gathering data. Fill the graph in like a progressively-rendered image, then repeat each data point to add error bars.
Record more data than you plot, so that when you decide you want to graph something else, you may not need to re-run the experiment.
Find out and show what optimal would be, so we can judge how good your solution is. Ideally, your optimal will be derived from some "fundamental" limit, like the speed of light or the speed of your network.
Know what would happen if you fixed your current bottleneck: measure what the next bottleneck would be. If you push your system close to the limit, you may exhaust several resources at nearly the same time; the second bottleneck may not be far behind.
Never use a pie chart.
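To illustrate the advice above about looking past averages, here is a small sketch that keeps every raw sample, summarizes the distribution, and checks for one specific per-sample pattern: alternating fast and slow calls. The function names summarize, even_odd_ratio, and time_calls are invented for this example, and the timed lambda is only a placeholder workload.

```python
import statistics
import time


def summarize(samples):
    """Report the mean alongside the median, 99th percentile, and maximum."""
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p99": cuts[98],
        "max": max(samples),
    }


def even_odd_ratio(samples):
    """Mean of odd-indexed samples divided by mean of even-indexed samples.

    A ratio far from 1.0 suggests that every other sample is slower (or faster).
    """
    return statistics.fmean(samples[1::2]) / statistics.fmean(samples[0::2])


def time_calls(op, n):
    """Return the individual latency of each of n calls, in nanoseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter_ns()
        op()
        samples.append(time.perf_counter_ns() - start)
    return samples


samples = time_calls(lambda: sum(range(5_000)), 1_000)
print(summarize(samples))
print(f"odd/even ratio: {even_odd_ratio(samples):.2f}")
```

The even/odd check catches only one kind of pattern; plotting the samples in order is a more general way to spot others. Keeping the raw samples list around also means you can compute a different summary later without re-running the experiment.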