
 

Preliminary results from running the FacebookHadoop workload under different environments:


 

After fixing timetracing in DpdkDriver in commit `c16fa4c`, it became clear that the ~15us latency at the 99th percentile was not due to network queueing delay. It also looks more like kernel jitter, because the large gaps in the time trace on the server side cannot be pinpointed to a fixed location.


 

It appears that the kernel compile-time option `CONFIG_NO_HZ_FULL`, which is essential for full-dynticks mode, is not set by default on Ubuntu 15.04 and 16.04. This can be checked with `grep NO_HZ /boot/config*`. Therefore, to turn off the timer interrupts, use Ubuntu 14.04 or compile your own kernel.
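A quick way to run the check against the running kernel (assuming the standard Ubuntu layout of per-kernel config files under /boot):

```shell
# Check whether the running kernel was built with full-dynticks support.
# On stock Ubuntu 15.04/16.04 kernels this prints
# "# CONFIG_NO_HZ_FULL is not set".
grep NO_HZ "/boot/config-$(uname -r)"

# On a kernel that does have it, the tick can then be stopped on selected
# cores via boot parameters, e.g. (the core list 1-7 is just an example):
#   nohz_full=1-7 rcu_nocbs=1-7
```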


 

After slicing DpdkDriver::receivePackets and ObjectPool::construct with timetraces, it appears that the 16us jitters are spread (evenly?) across all the slices.


 

It seems that the ~15us (or ~30,000-cycle) jitters are not due to SMIs (System Management Interrupts), which are invisible to the kernel. This can be verified by running the `user_loop` test program with `THRESHOLD_CYCLES` defined as 100000 and reading MSR 0x34 before and after the run. Here is a sample output:

The number of jitters printed matches the delta between the two MSR 0x34 readings.
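The two MSR readings can be taken with `rdmsr` from msr-tools; a minimal sketch (the `./user_loop` path is an assumption, substitute wherever the test program was built):

```shell
# MSR 0x34 (MSR_SMI_COUNT) holds the number of SMIs since reset.
# Requires msr-tools and the msr kernel module.
sudo modprobe msr
before=$(sudo rdmsr -u -p 0 0x34)   # -u: unsigned decimal, -p 0: read on CPU 0
./user_loop                          # busy-loop jitter detector, built with
                                     # THRESHOLD_CYCLES defined as 100000
after=$(sudo rdmsr -u -p 0 0x34)
echo "SMIs during the run: $((after - before))"
```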

(PS: Today I had to do a kernel upgrade on Ubuntu; it turned out to be really simple: http://askubuntu.com/a/236458)

Linux kernel 4.9 introduced a new hardware latency tracer (Documentation/trace/hwlat_detector.txt). After setting `tracing_thresh` to 10us, I got the following output:

It doesn't look very useful, as it only records the maximum jitter observed (this behavior can be verified at http://lxr.free-electrons.com/ident?v=4.9;i=get_sample).
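For reference, the tracer is driven entirely through tracefs; a sketch, assuming tracefs is mounted at the usual location and run as root:

```shell
# Enable the hwlat tracer (kernel >= 4.9) and report any gap longer than
# 10 usecs during which the sampling loop could not run.
cd /sys/kernel/debug/tracing
echo hwlat > current_tracer
echo 10 > tracing_thresh        # threshold in microseconds
echo 1 > tracing_on
sleep 60                        # let it sample for a while
echo 0 > tracing_on
cat trace                       # inspect the recorded latencies
```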


 

It turns out that the 15us jitter is somehow related to CPU power management. I followed the configuration documented in a Dell technical white paper, "Controlling Processor C-State Usage in Linux". Running `turbostat --debug sleep 10` again showed that there were still 8 SMIs per second, but the CPUs were mostly staying in the C1 state. The 15us jitters disappeared after this re-configuration, which was verified via both `user_loop` and my "slice-ObjectPool-construct" experiment. This article[1] provides an explanation of why power-saving states can result in latency spikes. However, that is a different scenario and doesn't explain the earlier results of our `user_loop` test, since the CPU there is running in a tight loop; the jitter is probably due to something else correlated with C-states.
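The usual knobs for this kind of re-configuration, sketched below; the specific values are illustrative, not necessarily what the white paper prescribes for every machine:

```shell
# Keep cores out of deep C-states.
# (a) Boot-time: add to the kernel command line, then update grub and reboot:
#       intel_idle.max_cstate=1 processor.max_cstate=1
#     (idle=poll forbids even C1, trading power/turbo headroom for latency)
# (b) Run-time: request a 0us wakeup latency by holding /dev/cpu_dma_latency
#     open; the request only lasts while the writer keeps the file open.
# Verify the result: cores should now report mostly C0/C1 residency.
sudo turbostat --debug sleep 10
```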

The longer SMI-induced jitters are harder to eliminate on the HPE ProLiant m510 server cartridges in HPE Moonshot systems: it requires disabling "Memory Pre-Failure Notification" in the BIOS[2]. HP provides the `conrep`[3] utility to automate this task, but it failed the platform check ("ProLiant m510 Server Cartridge" is not in the supported platform list) after I installed it from the apt repository. I also failed to find the "Service Options" menu when I tried to change the setting manually at boot time, following the instructions here[4].

PS: According to [5,6], "Intel Processor Trace can be used to trace SMM code", though I am not sure why `perf script` failed to show that in the trace output.

[1] "Are hardware power management features causing latency spikes in my application?" https://access.redhat.com/articles/65410

[2] "Configuring and tuning HPE ProLiant Servers for low-latency applications": https://www.hpe.com/h20195/v2/GetPDF.aspx/c05281307.pdf

[3] https://downloads.linux.hpe.com/SDR/project/stk/

[4] https://serverfault.com/questions/508847/how-do-i-disable-processor-power-and-utilization-monitoring-on-an-hp-proliant/508873

[5] https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-1708.html

[6] https://www.slideshare.net/pipatmet/intel-processor-trace-what-are-recorded


 

The document "Low Latency Performance Tuning for Red Hat Enterprise Linux 7" is pretty comprehensive and covers most of the tuning techniques I have read about so far.


 

Mellanox DPDK is now based on DPDK 16.11 instead of 2.2. Its quick start guide[1] also mentions that the unnecessary SMIs used for Power Monitoring and Memory Pre-Failure Notification should be disabled. A more comprehensive guide on performance tuning for Mellanox adapter cards can be found here: https://community.mellanox.com/docs/DOC-2489.

[1] http://www.mellanox.com/related-docs/prod_software/MLNX_DPDK_Quick_Start_Guide_v16.11_2.3.pdf


 

All you need to know about Intel PT (and perf, in general):

http://halobates.de/blog/p/category/pt

https://github.com/01org/processor-trace/tree/master/doc

https://perf.wiki.kernel.org/index.php/Tutorial

http://vger.kernel.org/~acme/perf/


 

The cluster diagram at Cluster Intro is out of date. I created a new one using https://github.com/glaville/graphviz-infiniband-topology. It's kind of ugly but does the job for me.


 

An excellent explanation of how the NIC and its driver work together to send/receive packets: http://stackoverflow.com/questions/36625892/descriptor-concept-in-nic. The DPDK performance optimization guidelines (https://software.intel.com/en-us/articles/dpdk-performance-optimization-guidelines-white-paper) say a bit more about DPDK-specific config options such as `RX_PTHRESH` and `RX_WTHRESH`.

