/
Yilong's Random Notes on Homa Implementation

Yilong's Random Notes on Homa Implementation

 

Preliminary results running FacebookHadoop workload under different environments:


 

After fixing timetracing in DpdkDriver in commit `c16fa4c`, it became clear that the ~15us latency at 99% was not due to network queueing delay. In addition, it looks more like kernel jitter because large gap in the time trace on the server side cannot be pinpointed to a fixed location.


 

It appears that the kernel compile-time option `CONFIG_NO_HZ_FULL`, which is essential for full-dynticks mode, is not set by default on Ubuntu 15.04 & 16.04. This can be checked by `grep NO_HZ /boot/config*`. Therefore, to turn off the timer interrupts, use Ubuntu 14.04 or compile your own kernel.


 

After slicing DpdkDriver::receivePackets and ObjectPool::construct with timetraces, it appears that the 16us jitters are spread (evenly?) across all the slices.


 

It seems like the ~15us (or 30000 cycles) jitters are not due to SMI (Service Management Interrupt) that are invisible to the kernel. This can be verified by running the `user_loop` test program with `THRESHOLD_CYCLES` defined to 100000 and reading msr 0x34 before and after the test program. Here is a sample output:

The number of jitters printed matches the delta of the two msr 0x34 readings.

(PS: Today I have to do a kernel upgrade in Ubuntu, it turns out to be really simple: http://askubuntu.com/a/236458

Linux kernel 4.9 introduced a new hardware latency tracer(Documentation/trace/hwlat_detector.txt). By setting the `tracing_thresh` to 10us, I got the following output:

It doesn't look very useful as it only records the maximum jitter observed (this behavior can be verified at http://lxr.free-electrons.com/ident?v=4.9;i=get_sample).


 

It turns out that the 15us jitter is somehow related to CPU power management. I followed the configuration documented in a Dell technical white paper "Controlling Processor C-State Usage in Linux". Running `turbostat --debug sleep 10` again showed that there were still 8 SMI per second but the CPU were mostly staying in c1 state. The 15us jitters disappeared after this re-configuration, which could be verified via both `user_loop` and my "slice-ObjectPool-construct" experiment. This article[1] provides an explanation why the power-saving states could result in latency spikes. However, that is a different scenario and doesn't explain the earlier results of our `user_loop` test since the CPU is running in a tight loop. The jitter is probably due to some other thing correlated with C-states.

The longer SMI-induced jitters are harder to eliminate on the "HPE ProLiant m510 Server Cartridges in HPE Moonshot Systems". It requires disabling the "Memory Pre-Failure Notification" in the BIOS[2]. HP provides the `conrep`[3] utility to automate this task but it failed the platform check ("ProLiant m510 Server Cartridge" not in supported platform list) after I installed it from the apt repository. And I failed to find the "Service Options" when I tried to do it manually at boot-time following the instructions here[4].

PS: According to [5,6], "Intel Processor Trace can be used to trace SMM code", though I am not sure why `perf script` failed to show that in the trace output.

[1] "Are hardware power management features causing latency spikes in my application?" https://access.redhat.com/articles/65410

[2] "Configuring and tuning HPE ProLiant Servers for low-latency applications": https://www.hpe.com/h20195/v2/GetPDF.aspx/c05281307.pdf

[3] https://downloads.linux.hpe.com/SDR/project/stk/

[4] https://serverfault.com/questions/508847/how-do-i-disable-processor-power-and-utilization-monitoring-on-an-hp-proliant/508873

[5] https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-1708.html

[6] https://www.slideshare.net/pipatmet/intel-processor-trace-what-are-recorded


 

The document "Low Latency Performance Tuning for Red Hat ENterprise LInux 7" is pretty comprehensive and has covered most of the tuning techniques I have read so far.


 

Mellanox DPDK is now based on DPDK 16.11 instead of 2.2. In its quick start guide[1], it also mentions that unnecessary SMIs used for Power Monitoring and for Memory PreFailure Notification should be disabled. A more comprehensive guide on performance tuning for Mellanox adaptor cards can be found here: https://community.mellanox.com/docs/DOC-2489.

[1] http://www.mellanox.com/related-docs/prod_software/MLNX_DPDK_Quick_Start_Guide_v16.11_2.3.pdf


 

All you need to know about Intel PT (and perf, in general):

http://halobates.de/blog/p/category/pt

https://github.com/01org/processor-trace/tree/master/doc

https://perf.wiki.kernel.org/index.php/Tutorial

http://vger.kernel.org/~acme/perf/


 

The cluster diagram at Cluster Intro is out of date. I created a new one using https://github.com/glaville/graphviz-infiniband-topology. It's kind of ugly but does the work for me.


 

An excellent explanation of how the NIC and its driver work collaboratively to send/receive packets: http://stackoverflow.com/questions/36625892/descriptor-concept-in-nic. The DPDK performance optimization guidelines (https://software.intel.com/en-us/articles/dpdk-performance-optimization-guidelines-white-paper) mention a bit more on DPDK-specific config options like `RX_PTHRESH`, `RX_WTHRESH`, etc.


 

The main problem of using TSC is that there is no obvious way to convert cycles to nanoseconds. The Intel docs state that Invariant TSC runs at a constant rate, which may differ from the processors' realtime frequency. The Linux kernel computes this TSC frequency by counting how may TSC cycles occurs between two hardware timers. The result can be displayed by `dmesg | grep TSC`. On the CloudLab m510 machines, the command above outputs something like  "[ 0.000000] tsc: Fast TSC calibration using PIT ... [ 3.760036] tsc: Refined TSC clocksource calibration: 1995.379 MHz". RAMCloud currently uses `gettimeofday` to estimate the TSC frequency. To get a sense of how accurate it is, I redo the estimation using the newer `clock_gettime` with different clock sources. Here is the result:

It appears that clock MONOTONIC_RAW 1) gives the most consistent result across the cluster as well as multiple runs and 2) is the only one whose result matches the 1995.379 MHz computed by kernel. However, does it suggest clock MONOTONIC_RAW gives the most accurate estimate? I don't know yet.

Finally, some posts/articles that I find useful to read:

http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/

http://stackoverflow.com/questions/6498972/faster-equivalent-of-gettimeofday

https://randomascii.wordpress.com/2011/07/29/rdtsc-in-the-age-of-sandybridge/


 

After calibrating the TSC clocks on different machines and merging the time traces, it's easy to compute the one-way propagation delay of each packet. If you order the delays of all the outbound DATA/ALL_DATA packets from one client by their TX times, it becomes clear that when there are already several ongoing messages and another 0.5MB message starts, the delay number tends to increase visibly. The delay number becomes more or less stable again at least after all the unscheduled packets of that long message are transmitted. Finally, the number decreases after a long message finishes. This is suggesting that a transient queue is being built up in the sender's NIC. It turns out that there is a unsigned integer underflow in QueueEstimator::getQueueSize(uint64_t) when the caller passes in a stale time that is less than QueueEstimator::currentTime and, as a result, getQueueSize incorrectly returns 0 as the estimated queue size. Therefore, in the worst case, the transport could enqueue two more full packets even though there are already two in the NIC's TX queue. But how can the caller passes in a stale time value in the first place? Well, one example would be that, in HomaTransport::tryToTransmitData(), we are using the stale time value stored in Dispatch::currentTime to save the cost of one RDTSC instruction. After fixing this bug, we don't see the delay number go up significantly and remain high for a long period like before.


 

When looking at the time trace of a small symmetric EchoRpc (30 bytes) that takes > 12us to finish, I noticed that, on the server-side, a large message (0.5MB) finished right before the server started to process the EchoRpc. The deletion took ~2us. The time was mostly spent in destroying Transport::serverRpc::replyPayload, which contains 341 Buffer::Chunks and 4 malloc-ed blocks. John suggests also run the experiments with 0.1 and 1MB large messages to verify if it is indeed a bottleneck. The latency numbers from both experiments have no visible difference, suggesting that this is not (yet) the bottleneck.


 

Why does it take so long for a master to start sending out the reply of an EchoRpc that fits in one ALL_DATA packet? Firstly, it appears that inserting a `serverRpc` into the unordered_map `incomingRpcs` is really expensive: it can easily take more than 250ns! Also, after calling "context->workerManager→handleRpc(serverRpc);" in HomaTransport, it takes > 100ns before the RPC can be dispatched by the MasterService.


 

I have automated most of my analyses. Now the workflow is as follows:

  1. Run the clusterperf experiment to generate log files.
  2. Go to the log directory and run `RAMCloud/scripts/debugTailLatency/run_analysis.sh`. This will automatically merge time traces and extract information that prove to be useful in the past.
  3. Look into `bad_tail_latency.txt` to select the specific RpcId for further investigation.
  4. Run `RAMCloud/scripts/debugTailLatency/lookaround.sh "clientId X, sequence Y"` with (X, Y) being the RpcId selected in step 3. This will extract a detailed time line of the specific RPC that also includes a few number of time trace records before and after our events of interest.
  5. Also look into other *.txt files generated in the directory for measurements of specific metrics.


 

To speed up the master service time for small EchoRpc's, I implemented an optimization to avoid inserting "short" ServerRpc's into `incomingRpcs` in the first place. Here, a "short" ServerRpc is defined as 1) the request fits in one data packet (so we can start servicing it as soon as we receive the first packet) and 2) the response is a unscheduled message (so we don't need to receive grants). However, there is a problem with the definition of "short" ServerRpc: it ignores the fact that the processing time could be large (e.g., LOG_TIME_TRACE). As a result, if the client decides to retry the RPC, the server will have no record of such RPC and start servicing it again: thrashing the server significantly. Therefore, it's probably not a good idea. Need to think what's the most efficient data structure for storing `incomingRpcs`. For example, does it even have to be a generic hash map from RpcId to ServerRpc*? One alternative I can think of is to index ServerRpc in two levels (first clientId then sequence) and then exploit the assumptions about them to achieve cache-efficiency. 

Update: Looks like this optimization has no visible effect on end-to-end latency number even though server response time for those small EchoRpc's is reduced by 160ns(!) on average.

Update2: The optimization does have somewhat visible benefit on end-to-end numbers (usually <=0.1 us decrease across min, median and 99%-tile). Perhaps I didn't compile the executable properly?


 

Three optimizations that prove to be beneficial (though I haven't quantified them independently):

  • In tryToTransmitData(), allow sending a packet as long as it fits in the queue space. The old logic is to only check for messages waiting to be sent if the queue space is larger or equal to maxDataPerPacket. This change specifically tries to optimize the gap between "sendRequest invoked.*" and "client sending ALL_DATA.*".
  • Zero-copy RX in DpdkDriver: One caveat is that we need to increase the number of mbufs in the DPDK mempool; otherwise, when the driver runs out of mbufs it will refuse to receive any more packet... However, increasing NB_MBUF appears to have negative effect on the latency numbers. I have not figured out why.
  • Delay sending GRANTs till the end of processing packets: This is not intended to be a performance optimization initially but it turns out to also make the code faster (maybe better cache behavior?).

One optimization that doesn't work out (yet): as a follow-up to the third item above, I implemented an experimental TX buffer API for DpdkDriver (bufferPacket & flushTxBuffer) trying to send GRANTs in a batch (so we can save the cost of ringing the doorbell register). The gain was negative. I haven't figured out why.

The current performance numbers for W4 at 50% workload (NB_MBUF=32767):


 

We are finally capable of answering the question of what factors contribute to the extra delay of every single packet and how much. In sum, there are four major sources of delay:

  • Queueing delay at the NIC's transmit queue: This number can be inferred quite accurately based on the estimated queued bytes right after sending the packet and the NIC's bandwidth.
  • Queueing delay at the TOR switch: It is difficult to measure this metric directly but we can compute it by subtracting the other three sources of delay from the total extra delay and verify the result is consistent with the log. To be more specific, if a packet P is indeed delayed at the TOR then the receiver should NOT receive any packet that has priority lower than P or bubble from the time when P arrives at the TOR till P is received.
  • Long polling gap: Even though it is not possible to know exactly when a packet arrives at the RX queue, we do know, at every polling iteration, what was the last time the NIC was empty (i.e. lastIdleTime), which is the maximum polling delay a packet could have experienced.
  • RX overhead: This is extra software overhead to receive the packet (i.e., talking to the NIC, copying out mbuf, etc.) compared to the minimum time observed in the experiment  (e.g., 0.3 us on m510 nodes).

Other less important/common sources of delay include:

  • Interrupt: Depending on when it occurs, it is included in major source 2, 3, or 4.
  • Switch delay: Can't be measured directly, included in queueing delay at TOR.

Here is an example table listing each and every delay source for all Echo messages of 30 bytes:


 

Today I look into the performance collapse problem where once the configured load factor passes a certain threshold the transmission throughput measured on the client side drops to almost 0. It turns out that the cause is purely on the sender side. We implemented the logic of picking the next message to send in a straightforward way: we iterate over all the outgoing messages and find the one with minimum bytes left and sufficient grant(s). Therefore, once the rate of new requests arriving passes the rate the sender can finish sending them, outgoing requests gradually queue up and slow down the sender even further in return. One solution to such vicious circle is to use a simple data structure that maintains the top k (e.g. 10) outgoing requests with min. remaining bytes. When the sender is about to pick the next request to send, it will first select from this set. Only when all the requests in this set are waiting for grants (i.e. slow path) will the sender iterate over the entire outgoing request list. This solution can be easily extended to also include outgoing responses but I am refraining from doing this because I have not yet seen a reasonably large number of outgoing responses in the time trace. As a result of this optimization, the largest load factor I can push towards in my synthetic workload `ShortLong` (i.e. 75% 30B + 25% 0.5MB) is increased from 50% to 63% 53.7% to 67.6%.

Update 1 (06/24): After printing network throughput over 1 ms intervals in the time trace, I realized that I had been using 1024^3 instead of 10^9 to represent 10Gbps in ClusterPerf. That is, I have been underestimating the loads we can achieve by a factor of 0.93(=10^9/1024^3)...

Update 2 (06/25): `HomaTransport::outgoingRequests` is not the only sender-side bottleneck when the message arrival rate is high. Besides, is it still necessary to implement a more efficient data structure if there is a simple throttling strategy we can apply at the application level? Other sender-side bottlenecks are:

  • HomaTransport::outgoingRpcs and HomaTransport::serverTimerList: Without application-level throttling, the number of elements in these two lists can be huge. And each invocation of `HomaTransport::checkTimeouts` needs to traverse each list once.
  • Checking if any outstanding RPC has finished in ClusterPerf: This was implemented naively by calling `isReady` on each and every outstanding RPC after `context->dispatch->poll()`. However, it turns out that `RpcWrapper::isReady` takes 100~200 ns (how come?) just to return false. Therefore, I can only afford to check one outstanding RPC after each polling iteration at this moment.

Finally, limiting the number of outstanding RPCs to 20 in ClusterPerf seems to work pretty well in practice. I can achieve almost 80% load in the `ShortLong` workload with this basic throttling alone (i.e., no specialized data structure need). Here are the results of various load factors:


 

Setting the correct RTT parameter in Homa (i.e. HomaTransport::roundTripMicros) turns out to be a little bit more tricky than I thought. I used to simply think that it was some constant one-way propagation delay multiplied by 2. However, the more precise answer is actually the propagation delay of a full packet (5 us) + server response time (1~2 us) + the propagation delay of a grant packet (2 us). Overestimating the RTT would make the 99%-tile latency of our short messages worse because it increases the chance of queueing delay at the NIC's transmit queue and the ToR. In contrast, underestimating the RTT would result in the sender consistently wasting bandwidth (i.e. passing bubbles) when there aren't many messages to send. In the extreme case where there is only one message to send, if RTT is smaller then RTT_real then the message will be blocked and waiting for grants for (RTT_real - RTT) in every RTT_real interval. Right now, to set the correct RTT on m510, I run my `OneMedium` workload (i.e. 100% 130000B) under low load factor (e.g. 20%) with 1 client and 1 server to find the minimum `HomaTransport::roundTripMicros` such that the min., median, and 90%-tile completion time of the EchoRpc are all ~230 us. Since the theoretical lower bound of the one-way message TX time is 112.86 us (= 5 us + (130000B - 1470B)*1542/1470/10Gbps), 230 us of RPC completion time suggests we have almost saturated the network bandwidth.

On the other hand, BasicTransport does NOT suffer from this problem because it's granting bytes more aggressively: the receiver tries to maintain at least 1 RTT incoming bytes granted for each message; every time this number falls below 1 RTT it is set to 1 RTT + grantIncrement (i.e. 5*maxDataPerPacket). Therefore, the sender only passes bubble at the beginning before receiving its first GRANT.


 

Draft:

Apparently, setting `HomaTransport::roundTripMicros = 14` and `DpdkDriver::maxTransmitQueueSize = 4*maxPacketSize` is sufficient to drive the maximum load factor we can achieve in workload `OneMedium` to 90%. From the time trace, we know that the data bytes throughput is ~9000Mbps  (excluding DATA header overhead and Ethernet header overhead) and the control bytes throughput is ~120Mbps (excluding Ethernet packet overhead). Therefore, the real network throughput is ~9700Mbps (=9000*1542/1470+120*72/30), pretty close to 10Gbps.

Actually, `HomaTransport::roundTripMicros = 14` alone enough is to achieve 90% load. The reason appears to be that under high network load, the real RTT is actually larger than 8 us (i.e., queueing delay on both directions increases, server response time increases, etc). When we set RTT to 8 us, the sender will receive the first grant right before it is about to send the first scheduled packet only in an ideal world. Any delay of a GRANT translates directly to waste of bw and likely CPU. However, if we set RTT in a way s.t. the `transmitLimit` is still K packets ahead of `transmitOffset` when the first grant arrives, we can tolerate up to K consecutive GRANT delay without wasting bw (i.e., as long as 1 out of K+1 grants manages to catch up with us, we are all good).

Now, I just need a definitive proof about this.


 

As we already know, zero-copy RX in DpdkDriver is not that easy to get right because it requires us to carefully manage the rte_mbuf's, which is a limited resource, at the transport layer. It's very easy to get into deadlock when we run out of the statically allocated rte_mbuf's. However, I think a hybrid solution might be OK: we can expose the decision of whether to copy rte_mbuf into Driver::PacketBuf up to the transport layer where we can decide not to copy data for, say, short requests and/or control packets.


 

A few numbers useful in understanding the protocol overhead of DATA and ALL_DATA packets:

  • sizeof(DataHeader) = 30 bytes
  • sizeof(AllDataHeader) = 20 bytes
  • sizeof(Echo::Request) = sizeof(Echo::Response) = 12 bytes
  • DpdkDriver::getPacketOverhead() = 42 bytes
  • Min. length of the Ethernet payload = 46 bytes

So a 30-byte EchoRpc request message actually translates to 104 bytes (= 30 + 12 + 20 + 42) on the wire.

I also spent some time today trying to understand the end-to-end delay of a single packet in the m510 cluster. Suppose there is a linear relationship between the packet size and the end-to-end delay (it probably should!): MinDelayMicros = A * packetBytes + B, where `A * packetBytes` is the transmission delay (aka store-and-forward delay, packetization delay) and B is the sum of the propagation delay and processing delay (see https://en.wikipedia.org/wiki/End-to-end_delay for details). Ideally, in a 10Gbps network, A should be 0.1 ns/bit if the switch supports cut-through switching or 0.2 ns/bit if the switch does store-and-forward. However, it appears that the A derived from our experiment data is actually ~0.25 ns/bit. I don't have an explanation yet.


 

After bringing BasicTransport more or less up-to-date with Homa, I test its performance using W4. It reveals two problems. 1) It's really not a good idea to use FIFO for messages > 10KB: there is a big jump in slowdown when the message size passes 10KB. 2) The 99%-tile latency of small messages are pretty bad even under 50% load: this is due to the lack of congestion control mechanism on the receiver side in Basic (it's just aggressively granting messages whenever it receives a DATA packet). Eventually, we agreed to switch to SRPT but leave the granting mechanism unchanged. To separately evaluate the performance gain due to congestion control (rather than packet priority), we are just going to run Homa with only one priority.


 

A fast way to tell if the computation bottleneck is on the server side: search for "start of polling iteration X, last poll was Y ns ago" in the time trace to see if there is constantly empty polling iterations; if so, the client is probably the bottleneck and you can check the client's time trace in the same way.


 

I haven't done any rigorous benchmarking but this ska::flat_hash_map (https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/) is clearly much faster than std::map, std::unordered_map, sparsepp (https://github.com/greg7mdp/sparsepp) when the map is relatively small (e.g. ~20 entries). I am now using this map for incomingRpcs, outgoingRpcs and MessageAccumulator::fragments in HomaTransport.