Always Measure One Level Deeper

作者是JOHN OUSTERHOUT, 主页在 https://web.stanford.edu/~ouster/cgi-bin/home.php 搞过不少系统，最近比较知名的系统就是Raft算法和RAMCloud. 这篇文章作者主要抱怨系统性能分析不够细致比较粗糙，同时也错过了不少系统优化和对系统更加深入理解的机会。

Key Insights:

Performance measurement is less straightforward than it might seem; it is easy to believe results that are incorrect or misleading and overlook important system behaviors.

The key to good performance measurement is to make many more measurements besides the ones you think will be important; it is crucial to understand not just the system’s performance but also why it performs that way.

Performance measurement done well results in new discoveries about the system being measured and new intuition about system behavior for the person doing the measuring.

一次好的性能测试可以加深对系统理解，还有助于提升研究人员的直觉。

A good performance evaluation provides a deep understanding of a system’s behavior, quantifying not only the overall behavior but also its internal mechanisms and policies. It explains why a system behaves the way it does, what limits that behavior, and what problems must be addressed in order to improve the system. Done well, perfor- mance evaluation exposes interesting system properties that were not obvi- ous previously. It not only improves the quality of the system being measured but the developer’s intuition, resulting in better systems in the future.

作者接着提出了5个常见错误

Mistake 1: Trusting the numbers. 太过于相信数字，而没有进行交叉验证和深入的定量分析。
Mistake 2: Guessing instead of measuring. 喜欢猜测/推测而不是进行测量和实验。
Mistake 3: Superficial measurements. 太过于肤浅的测量而没有做定性和定量分析。
Mistake 4: Confirmation bias. 确认偏误，这个也可以理解，毕竟要赶在deadline之前写一篇好文章。
Mistake 5: Haste. 草率，作者主要说的就是测量不够充分时间不够长。（你就说论文发表重不重要吧！）

然后提出了4点改进意见：

Rule 1: Allow lots of time. 作者建议至少要给出2~3个月的时间来做充分测试，强度也够高的。
Rule 2: Never trust a number gen- erated by a computer. 不要轻易详细测试数字，要做交叉验证
- Take different measurements at the same level. 在相同级别上做不同的测试。
- Measure the system’s behavior at a lower level to break down the factors that determine performance. 对系统分解在底层做测试，然后可以解释系统行为。
- Make back-of-the-envelope calcula- tions to see if the measurements are in the ballpark expected 保底计算
- Run simulations and compare their results to measurements of the real im- plementation. 模拟测试和真实系统对比。
Rule 3: Use your intuition to ask questions, not to answer them. 用直觉提出问题，但是不要用直觉回答问题。
Rule 4: Always measure one level deeper. 总是进一步测量。也是这篇文章题目。就是对系统拆解，测量底层每个部分，然后综合起来解释为什么性能更好或者是更差。从拆解的过程中，其实可以发现更多的优化机会，或者是发现自己做错的地方。

If you want to understand the performance of a system at a particular level, you must measure not just that level but also the next level deeper. That is, measure the underlying factors that contribute to the performance at the higher level. If you are measuring over- all latency for remote procedure calls, you could measure deeper by break- ing down that latency, determining how much time is spent in the client machine, how much time is spent in the network, and how much time is spent on the server. You could also measure where time is spent on the client and server. If you are measuring the overall throughput of a system, the system probably con- sists of a pipeline containing several components. Measure the utilization of each component (the fraction of time that component is busy). At least one component should be 100% utilized; if not, it should be possible to achieve a higher throughput.

Measuring deeper is the best way to validate top-level measurements and uncover bugs. Once you have col- lected some deeper measurements, ask yourself whether they seem consistent with the top-level measurements and with each other. You will almost certainly discover things that do not make sense; make additional measurements to resolve the contradictions.

Measuring deeper will also indicate whether you are getting the best possi- ble performance and, if not, how to im- prove it. Use deeper measurements to find out what is limiting performance. Try to identify the smallest elements that have the greatest impact on overall performance. For example, if the over- all metric is latency, measure the indi- vidual latencies of components along the critical path; typically, there will be a few components that account for most of the overall latency. You can then fo- cus on optimizing those components.

Do not spend a lot of time agoniz- ing over which deeper measurements to make. If the top-level measurements contain contradictions or things that are surprising, start with measurements that could help resolve them. Or pick measurements that will identify per- formance bottlenecks. If nothing else, choose a few metrics that are most ob- vious and easiest to collect, even if you are not sure they will be particularly illuminating. Once you look at the results, you will almost certainly find things that do not make sense; from this point on, track down and resolve everything that does not make perfect sense. Along the way you will discover other surprises; track them down as well. Over time, you will develop intuition about what kinds of deeper measurements are most likely to be fruitful.

Measuring deeper is the single most important ingredient for high-quality performance measurement. Focusing on this one rule will prevent most of the mistakes anyone could potentially make. For example, in order to make deeper measurements you will have to allocate extra time. Measuring deeper will expose bugs and inconsistencies, so you will not accidentally trust bogus data. Most of the suggestions under Rule 2 (Never trust a number generated by a computer) are actually examples of measuring deeper. You will never need to guess the reasons for performance, since you will have actual data. Your measurements will not be superficial. Finally, you are less likely to be derailed by subconscious bias, since the deeper measurements will expose weakness- es, as well as strengths.