
...

View file: experiment-results.txt


 

After fixing time tracing in DpdkDriver in commit `c16fa4c`, it became clear that the ~15 us 99th-percentile latency was not due to network queueing delay. It looks more like kernel jitter, because the large gap in the server-side time trace cannot be pinpointed to a fixed location.

View file: tail-latency-timetrace.txt


 

It appears that the kernel compile-time option `CONFIG_NO_HZ_FULL`, which is essential for full-dynticks mode, is not set by default on Ubuntu 15.04 and 16.04. This can be checked with `grep NO_HZ /boot/config*`. Therefore, to turn off the timer interrupts, use Ubuntu 14.04 or compile your own kernel.

...

View file: slice-objectpool-construct.txt


 

It seems that the ~15 us (or ~30000-cycle) jitters are not due to SMIs (System Management Interrupts), which are invisible to the kernel. This can be verified by running the `user_loop` test program with `THRESHOLD_CYCLES` defined to 100000 and reading MSR 0x34 before and after the run. Here is a sample output:

View file: SMI_interrupts.txt

The number of jitters printed matches the delta of the two MSR 0x34 readings.
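
(For reference, a minimal sketch of how the MSR 0x34 reading can be done programmatically through `/dev/cpu/*/msr`; this assumes the `msr` kernel module is loaded and root privileges, and it is not part of the actual test program — `rdmsr 0x34` from msr-tools works just as well.)

```cpp
// Minimal sketch: read MSR 0x34 (SMI_COUNT) on CPU 0 via /dev/cpu/0/msr.
// Requires `modprobe msr` and root; the MSR number is passed as the file offset.
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

uint64_t readSmiCount()
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/cpu/0/msr");
        return 0;
    }
    uint64_t value = 0;
    if (pread(fd, &value, sizeof(value), 0x34) != sizeof(value)) {
        perror("pread MSR 0x34");
    }
    close(fd);
    return value;
}

int main()
{
    uint64_t before = readSmiCount();
    // ... run the jitter test (e.g., user_loop) here ...
    uint64_t after = readSmiCount();
    printf("SMIs during the run: %" PRIu64 "\n", after - before);
    return 0;
}
```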

...

View file: hwlat.txt

The hwlat detector doesn't look very useful, as it only records the maximum jitter observed (this behavior can be verified at http://lxr.free-electrons.com/ident?v=4.9;i=get_sample).

...

[2] "Configuring and tuning HPE ProLiant Servers for low-latency applications": https://www.hpe.com/h20195/v2/GetPDF.aspx/c05281307.pdf

[3] https://downloads.linux.hpe.com/SDR/project/stk/

...

The cluster diagram at Cluster Intro is out of date. I created a new one using https://github.com/glaville/graphviz-infiniband-topology. It's kind of ugly but it does the job for me.

(updated cluster topology diagram attached)


 

An excellent explanation of how the NIC and its driver work together to send/receive packets: http://stackoverflow.com/questions/36625892/descriptor-concept-in-nic. The DPDK performance optimization guidelines (https://software.intel.com/en-us/articles/dpdk-performance-optimization-guidelines-white-paper) say a bit more about DPDK-specific config options like `RX_PTHRESH`, `RX_WTHRESH`, etc.

...

View file: cyclesPerSec.txt

It appears that CLOCK_MONOTONIC_RAW 1) gives the most consistent result across the cluster as well as across multiple runs and 2) is the only one whose result matches the 1995.379 MHz computed by the kernel. However, does this suggest that CLOCK_MONOTONIC_RAW gives the most accurate estimate? I don't know yet.
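
(For context, a simplified sketch of how a cycles-per-second estimate can be derived from CLOCK_MONOTONIC_RAW — not the actual calibration code, just the idea: count TSC ticks across an interval timed by clock_gettime.)

```cpp
// Sketch: estimate the TSC frequency against CLOCK_MONOTONIC_RAW.
#include <cinttypes>
#include <cstdio>
#include <ctime>
#include <x86intrin.h>

static double nanos(const timespec& t)
{
    return t.tv_sec * 1e9 + t.tv_nsec;
}

int main()
{
    timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    uint64_t tscStart = __rdtsc();

    // Busy-wait ~100 ms as measured by CLOCK_MONOTONIC_RAW.
    do {
        clock_gettime(CLOCK_MONOTONIC_RAW, &stop);
    } while (nanos(stop) - nanos(start) < 100e6);

    uint64_t tscStop = __rdtsc();
    double cyclesPerSec =
            (tscStop - tscStart) * 1e9 / (nanos(stop) - nanos(start));
    printf("estimated TSC frequency = %.3f MHz\n", cyclesPerSec / 1e6);
    return 0;
}
```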

...

https://randomascii.wordpress.com/2011/07/29/rdtsc-in-the-age-of-sandybridge/


 

After calibrating the TSC clocks on different machines and merging the time traces, it's easy to compute the one-way propagation delay of each packet. If you order the delays of all the outbound DATA/ALL_DATA packets from one client by their TX times, it becomes clear that when there are already several ongoing messages and another 0.5 MB message starts, the delay tends to increase visibly. The delay becomes more or less stable again no earlier than when all the unscheduled packets of that long message have been transmitted, and it decreases after a long message finishes. This suggests that a transient queue is being built up in the sender's NIC. It turns out that there is an unsigned integer underflow in QueueEstimator::getQueueSize(uint64_t) when the caller passes in a stale time that is less than QueueEstimator::currentTime; as a result, getQueueSize incorrectly returns 0 as the estimated queue size. Therefore, in the worst case, the transport could enqueue two more full packets even though there are already two in the NIC's TX queue. But how can the caller pass in a stale time value in the first place? Well, one example is that, in HomaTransport::tryToTransmitData(), we use the stale time value stored in Dispatch::currentTime to save the cost of one RDTSC instruction. After fixing this bug, we no longer see the delay go up significantly and remain high for a long period like before.
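
(A standalone sketch of the failure mode and the guard that fixes it; the class and field names below are illustrative, not the actual QueueEstimator code.)

```cpp
// Sketch of the NIC TX queue estimator bug described above. Without the
// staleness guard, `time - lastUpdateTime` underflows when the caller passes
// an old timestamp, the computed "drained bytes" becomes huge, and the
// estimate collapses to 0.
#include <algorithm>
#include <cstdint>

class TxQueueEstimator {
  public:
    explicit TxQueueEstimator(double bytesPerCycle)
        : bytesPerCycle(bytesPerCycle), queuedBytes(0), lastUpdateTime(0) {}

    // Record that `bytes` were handed to the NIC at TSC time `time`.
    void packetQueued(uint32_t bytes, uint64_t time) {
        queuedBytes = getQueueSize(time) + bytes;
    }

    // Estimated bytes still sitting in the NIC TX queue at TSC time `time`.
    double getQueueSize(uint64_t time) {
        if (time <= lastUpdateTime) {
            // Staleness guard (the fix): keep the previous estimate instead
            // of letting the unsigned subtraction below wrap around.
            return queuedBytes;
        }
        double drained =
                static_cast<double>(time - lastUpdateTime) * bytesPerCycle;
        queuedBytes = std::max(0.0, queuedBytes - drained);
        lastUpdateTime = time;
        return queuedBytes;
    }

  private:
    double bytesPerCycle;     // NIC bandwidth in bytes per TSC cycle
    double queuedBytes;       // estimate as of lastUpdateTime
    uint64_t lastUpdateTime;  // TSC time of the last estimate update
};
```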


 

When looking at the time trace of a small symmetric EchoRpc (30 bytes) that took > 12 us to finish, I noticed that, on the server side, a large message (0.5 MB) finished right before the server started to process the EchoRpc. The deletion took ~2 us; the time was mostly spent destroying Transport::serverRpc::replyPayload, which contains 341 Buffer::Chunks and 4 malloc-ed blocks. John suggested also running the experiments with 0.1 MB and 1 MB large messages to verify whether this is indeed a bottleneck. The latency numbers from both experiments show no visible difference, suggesting that this is not (yet) the bottleneck.


 

Why does it take so long for a master to start sending out the reply of an EchoRpc that fits in one ALL_DATA packet? First, it appears that inserting a `serverRpc` into the unordered_map `incomingRpcs` is really expensive: it can easily take more than 250 ns! Also, after calling `context->workerManager->handleRpc(serverRpc);` in HomaTransport, it takes > 100 ns before the RPC can be dispatched by the MasterService.

View file: high_latency_of_handling_single_packet_RPC.txt


 

I have automated most of my analyses. Now the workflow is as follows:

  1. Run the clusterperf experiment to generate log files.
  2. Go to the log directory and run `RAMCloud/scripts/debugTailLatency/run_analysis.sh`. This automatically merges the time traces and extracts information that has proved useful in the past.
  3. Look into `bad_tail_latency.txt` to select the specific RpcId for further investigation.
  4. Run `RAMCloud/scripts/debugTailLatency/lookaround.sh "clientId X, sequence Y"` with (X, Y) being the RpcId selected in step 3. This extracts a detailed timeline of the specific RPC, including a few time trace records before and after the events of interest.
  5. Also look into other *.txt files generated in the directory for measurements of specific metrics.


 

To speed up the master's service time for small EchoRpc's, I implemented an optimization that avoids inserting "short" ServerRpc's into `incomingRpcs` in the first place. Here, a "short" ServerRpc is one where 1) the request fits in one data packet (so we can start servicing it as soon as we receive the first packet) and 2) the response is an unscheduled message (so we don't need to receive grants). However, there is a problem with this definition: it ignores the fact that the processing time could be large (e.g., LOG_TIME_TRACE). As a result, if the client decides to retry the RPC, the server will have no record of it and will start servicing it again, thrashing the server significantly. Therefore, it's probably not a good idea. I need to think about the most efficient data structure for storing `incomingRpcs`. For example, does it even have to be a generic hash map from RpcId to ServerRpc*? One alternative I can think of is to index ServerRpc's in two levels (first clientId, then sequence) and exploit assumptions about them to achieve cache efficiency; see the sketch below.
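
(A rough sketch of what the two-level index could look like; all the types and names below are my own illustration, not HomaTransport code.)

```cpp
// Sketch: two-level index for incoming ServerRpcs, clientId first, then
// sequence number.
#include <cstdint>
#include <unordered_map>

struct ServerRpcStub { /* ... per-RPC state ... */ };

class IncomingRpcIndex {
  public:
    void insert(uint64_t clientId, uint64_t sequence, ServerRpcStub* rpc) {
        rpcsByClient[clientId][sequence] = rpc;
    }

    ServerRpcStub* lookup(uint64_t clientId, uint64_t sequence) {
        auto client = rpcsByClient.find(clientId);
        if (client == rpcsByClient.end())
            return nullptr;
        auto rpc = client->second.find(sequence);
        return (rpc == client->second.end()) ? nullptr : rpc->second;
    }

  private:
    // Outer level: one entry per client. Inner level: active sequences for
    // that client; in practice it could exploit the fact that sequence
    // numbers are small and mostly increasing (e.g., a small ring buffer)
    // instead of being another generic hash map.
    std::unordered_map<uint64_t,
            std::unordered_map<uint64_t, ServerRpcStub*>> rpcsByClient;
};
```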

Update: Looks like this optimization has no visible effect on end-to-end latency number even though server response time for those small EchoRpc's is reduced by 160ns(!) on average.

Update 2: The optimization does have a somewhat visible benefit on end-to-end numbers (usually a <= 0.1 us decrease across min, median, and 99%-tile). Perhaps I didn't compile the executable properly before?

View file: incomingRpcs-opt-micro-benchmark.txt


 

Three optimizations that prove to be beneficial (though I haven't quantified them independently):

  • In tryToTransmitData(), allow sending a packet as long as it fits in the available queue space. The old logic only looked for messages waiting to be sent when the free queue space was at least maxDataPerPacket. This change specifically tries to shrink the gap between "sendRequest invoked.*" and "client sending ALL_DATA.*".
  • Zero-copy RX in DpdkDriver: One caveat is that we need to increase the number of mbufs in the DPDK mempool; otherwise, when the driver runs out of mbufs it will refuse to receive any more packets. However, increasing NB_MBUF appears to have a negative effect on the latency numbers. I have not figured out why.
  • Delay sending GRANTs till the end of processing packets: This is not intended to be a performance optimization initially but it turns out to also make the code faster (maybe better cache behavior?).

One optimization that doesn't work out (yet): as a follow-up to the third item above, I implemented an experimental TX buffer API for DpdkDriver (bufferPacket & flushTxBuffer) trying to send GRANTs in a batch (so we can save the cost of ringing the doorbell register). The gain was negative. I haven't figured out why.

The current performance numbers for W4 at 50% workload (NB_MBUF=32767):

View file: 0530_facebookHadoop_result.txt


 

We are finally capable of answering the question of what factors contribute to the extra delay of every single packet and how much. In sum, there are four major sources of delay:

  • Queueing delay at the NIC's transmit queue: This number can be inferred quite accurately from the estimated queued bytes right after sending the packet and the NIC's bandwidth (see the worked example after this list).
  • Queueing delay at the TOR switch: It is difficult to measure this metric directly, but we can compute it by subtracting the other three sources of delay from the total extra delay and then verify that the result is consistent with the log. More specifically, if a packet P is indeed delayed at the TOR, then the receiver should NOT receive any packet with priority lower than P, nor any bubble, from the time P arrives at the TOR until P is received.
  • Long polling gap: Even though it is not possible to know exactly when a packet arrives at the RX queue, we do know, at every polling iteration, the last time the NIC was empty (i.e. lastIdleTime), which bounds the polling delay a packet could have experienced.
  • RX overhead: This is the extra software overhead of receiving the packet (i.e., talking to the NIC, copying out the mbuf, etc.) relative to the minimum time observed in the experiment (e.g., 0.3 us on m510 nodes).
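
(Worked example for the first source above; the 3000-byte backlog is made up for illustration.)

```cpp
// Worked example: convert an estimated NIC TX backlog into queueing delay.
// 3000 bytes * 8 bits/byte / 10 Gbps = 2.4 us.
#include <cstdio>

int main()
{
    double queuedBytes = 3000;      // estimated backlog right after sending
    double bandwidthBps = 10e9;     // 10 Gbps NIC
    double delayUs = queuedBytes * 8 / bandwidthBps * 1e6;
    printf("queueing delay = %.2f us\n", delayUs);   // prints 2.40
    return 0;
}
```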

Other less important/common sources of delay include:

  • Interrupt: Depending on when it occurs, it is included in major source 2, 3, or 4.
  • Switch delay: Can't be measured directly, included in queueing delay at TOR.

Here is an example table listing each and every delay source for all Echo messages of 30 bytes:

View file: per_message_delay_30B.txt


 

Today I looked into the performance collapse problem where, once the configured load factor passes a certain threshold, the transmission throughput measured on the client side drops to almost zero. It turns out that the cause is purely on the sender side. We implemented the logic of picking the next message to send in a straightforward way: iterate over all the outgoing messages and find the one with the minimum bytes left and sufficient grant(s). Therefore, once the rate at which new requests arrive exceeds the rate at which the sender can finish sending them, outgoing requests gradually queue up and slow down the sender even further in return. One solution to this vicious circle is a simple data structure that maintains the top k (e.g. 10) outgoing requests with the minimum remaining bytes; see the sketch below. When the sender is about to pick the next request to send, it first selects from this set. Only when all the requests in this set are waiting for grants (i.e., the slow path) does the sender iterate over the entire outgoing request list. This solution can be easily extended to also include outgoing responses, but I am refraining from doing that because I have not yet seen a reasonably large number of outgoing responses in the time trace. As a result of this optimization, the largest load factor I can push to in my synthetic workload `ShortLong` (i.e. 75% 30B + 25% 0.5MB) increased from 53.7% to 67.6%.
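
(Sketch of the "top k by remaining bytes" candidate set; the data structure and names below are my own illustration, not what was actually committed.)

```cpp
// Sketch: bounded candidate set of the k outgoing messages with the fewest
// bytes remaining. The full outgoing list remains the slow-path fallback.
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>

struct OutgoingMessage {
    uint32_t bytesLeft;
    bool hasGrants;         // true if it may transmit right now
    // ... payload, offsets, etc. ...
};

class TopKCandidates {
  public:
    explicit TopKCandidates(size_t k) : k(k) {}

    // Must be called BEFORE m->bytesLeft changes (the set is ordered by it).
    void remove(OutgoingMessage* m) { candidates.erase(m); }

    // Must be called AFTER m->bytesLeft has been updated.
    void insert(OutgoingMessage* m) {
        candidates.insert(m);
        if (candidates.size() > k)
            candidates.erase(std::prev(candidates.end()));  // evict the largest
    }

    // Fast path: smallest message that can transmit now; nullptr means all k
    // candidates are waiting for grants, so the caller falls back to scanning
    // the entire outgoing list.
    OutgoingMessage* pickNext() {
        for (OutgoingMessage* m : candidates)
            if (m->hasGrants)
                return m;
        return nullptr;
    }

  private:
    struct ByBytesLeft {
        bool operator()(const OutgoingMessage* a,
                        const OutgoingMessage* b) const {
            return a->bytesLeft < b->bytesLeft ||
                   (a->bytesLeft == b->bytesLeft && a < b);
        }
    };
    size_t k;
    std::set<OutgoingMessage*, ByBytesLeft> candidates;
};
```

A std::set is just the simplest thing that captures the idea; for k around 10, a small sorted array would likely be even faster.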

Update 1 (06/24): After printing network throughput over 1 ms intervals in the time trace, I realized that I had been using 1024^3 instead of 10^9 to represent 10Gbps in ClusterPerf. That is, I have been underestimating the loads we can achieve by a factor of 0.93(=10^9/1024^3)...

Update 2 (06/25): `HomaTransport::outgoingRequests` is not the only sender-side bottleneck when the message arrival rate is high. Besides, is it still necessary to implement a more efficient data structure if there is a simple throttling strategy we can apply at the application level? Other sender-side bottlenecks are:

  • HomaTransport::outgoingRpcs and HomaTransport::serverTimerList: Without application-level throttling, the number of elements in these two lists can be huge. And each invocation of `HomaTransport::checkTimeouts` needs to traverse each list once.
  • Checking if any outstanding RPC has finished in ClusterPerf: This was implemented naively by calling `isReady` on each and every outstanding RPC after `context->dispatch->poll()`. However, it turns out that `RpcWrapper::isReady` takes 100~200 ns (how come?) just to return false. Therefore, I can only afford to check one outstanding RPC after each polling iteration at this moment.

Finally, limiting the number of outstanding RPCs to 20 in ClusterPerf seems to work pretty well in practice. I can achieve almost 80% load in the `ShortLong` workload with this basic throttling alone (i.e., no specialized data structure needed). Here are the results at various load factors:

View file: cperf_basic_throttling_20_outstandingRpcs.txt


 

Setting the correct RTT parameter in Homa (i.e. HomaTransport::roundTripMicros) turns out to be a bit trickier than I thought. I used to think of it simply as some constant one-way propagation delay multiplied by 2. The more precise answer is actually the propagation delay of a full packet (5 us) + the server response time (1~2 us) + the propagation delay of a grant packet (2 us). Overestimating the RTT makes the 99%-tile latency of our short messages worse because it increases the chance of queueing delay at the NIC's transmit queue and at the ToR. In contrast, underestimating the RTT results in the sender consistently wasting bandwidth (i.e. passing bubbles) when there aren't many messages to send. In the extreme case where there is only one message to send, if RTT is smaller than RTT_real then the message will be blocked waiting for grants for (RTT_real - RTT) in every RTT_real interval. Right now, to set the correct RTT on m510, I run my `OneMedium` workload (i.e. 100% 130000B) under a low load factor (e.g. 20%) with 1 client and 1 server to find the minimum `HomaTransport::roundTripMicros` such that the min., median, and 90%-tile completion times of the EchoRpc are all ~230 us. Since the theoretical lower bound of the one-way message TX time is 112.86 us (= 5 us + (130000B - 1470B)*1542/1470/10Gbps), a 230 us RPC completion time suggests we have almost saturated the network bandwidth.
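
(Sanity check of the 112.86 us lower bound quoted above.)

```cpp
// Lower bound for sending a 130000 B message one way on a 10 Gbps link,
// with 1470 B of payload carried in each 1542 B wire packet.
#include <cstdio>

int main()
{
    double firstPacketUs = 5.0;               // full packet propagation delay
    double remainingPayload = 130000 - 1470;  // bytes after the first packet
    double wireBytes = remainingPayload * 1542 / 1470;   // payload -> wire bytes
    double txTimeUs = wireBytes * 8 / 10e9 * 1e6;        // serialization at 10 Gbps
    printf("min one-way message time = %.2f us\n", firstPacketUs + txTimeUs);
    // prints ~112.86
    return 0;
}
```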

View file: setting_roundTripMicros.txt

On the other hand, BasicTransport does NOT suffer from this problem because it grants bytes more aggressively: the receiver tries to keep at least 1 RTT of incoming bytes granted for each message; every time this number falls below 1 RTT, it is topped up to 1 RTT + grantIncrement (i.e. 5*maxDataPerPacket). Therefore, the sender only passes bubbles at the beginning, before receiving its first GRANT.
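
(A sketch of that granting rule; the function and variable names are mine, not the actual BasicTransport code.)

```cpp
// Receiver-side granting rule sketched from the description above.
#include <cstdint>

// Called for each incoming DATA packet of a message. Returns the new grant
// offset to send in a GRANT packet, or 0 if no GRANT is needed right now.
uint32_t maybeGrant(uint32_t receivedBytes,   // contiguous bytes received so far
                    uint32_t grantOffset,     // bytes granted so far
                    uint32_t rttBytes,        // 1 RTT worth of bytes
                    uint32_t grantIncrement)  // e.g. 5 * maxDataPerPacket
{
    // Keep at least one RTT of granted-but-not-yet-received bytes outstanding.
    if (grantOffset < receivedBytes + rttBytes) {
        return receivedBytes + rttBytes + grantIncrement;
    }
    return 0;
}
```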


 

Draft:

Apparently, setting `HomaTransport::roundTripMicros = 14` and `DpdkDriver::maxTransmitQueueSize = 4*maxPacketSize` is sufficient to drive the maximum load factor we can achieve in workload `OneMedium` to 90%. From the time trace, we know that the data-byte throughput is ~9000 Mbps (excluding the DATA header and Ethernet header overhead) and the control-byte throughput is ~120 Mbps (excluding the Ethernet packet overhead). Therefore, the real network throughput is ~9700 Mbps (= 9000*1542/1470 + 120*72/30), pretty close to 10 Gbps.
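
(The arithmetic behind the ~9700 Mbps estimate.)

```cpp
// Back-of-the-envelope check of the wire throughput quoted above.
#include <cstdio>

int main()
{
    double dataMbps = 9000;     // measured DATA payload throughput
    double controlMbps = 120;   // measured control payload throughput
    // Scale payload bits back up to wire bits: 1470 B of data payload rides
    // in a 1542 B wire packet; 30 B of control payload rides in a 72 B one.
    double wireMbps = dataMbps * 1542 / 1470 + controlMbps * 72 / 30;
    printf("estimated wire throughput = %.0f Mbps\n", wireMbps);  // ~9729
    return 0;
}
```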

Actually, `HomaTransport::roundTripMicros = 14` alone is enough to achieve 90% load. The reason appears to be that under high network load the real RTT is actually larger than 8 us (i.e., queueing delay in both directions increases, server response time increases, etc.). When we set RTT to 8 us, only in an ideal world does the sender receive the first grant right before it is about to send the first scheduled packet. Any delay of a GRANT translates directly into wasted bandwidth and, likely, wasted CPU. However, if we set RTT such that `transmitLimit` is still K packets ahead of `transmitOffset` when the first grant arrives, we can tolerate up to K consecutive GRANT delays without wasting bandwidth (i.e., as long as 1 out of every K+1 grants manages to catch up with us, we are all good).

Now, I just need a definitive proof about this.


 

As we already know, zero-copy RX in DpdkDriver is not that easy to get right because it requires us to carefully manage the rte_mbuf's, which are a limited resource, at the transport layer. It's very easy to get into a deadlock when we run out of the statically allocated rte_mbuf's. However, I think a hybrid solution might be OK: we can expose the decision of whether to copy an rte_mbuf into a Driver::PacketBuf up to the transport layer, where we can decide not to copy data for, say, short requests and/or control packets.


 

A few numbers useful in understanding the protocol overhead of DATA and ALL_DATA packets:

  • sizeof(DataHeader) = 30 bytes
  • sizeof(AllDataHeader) = 20 bytes
  • sizeof(Echo::Request) = sizeof(Echo::Response) = 12 bytes
  • DpdkDriver::getPacketOverhead() = 42 bytes
  • Min. length of the Ethernet payload = 46 bytes

So a 30-byte EchoRpc request message actually translates to 104 bytes (= 30 + 12 + 20 + 42) on the wire.
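
(Quick check of the 104-byte figure, using the constants listed above.)

```cpp
// Wire size of a 30-byte EchoRpc request sent as a single ALL_DATA packet.
#include <cstdio>

int main()
{
    int echoPayload = 30;        // application-level message
    int rpcHeader = 12;          // sizeof(Echo::Request)
    int allDataHeader = 20;      // sizeof(AllDataHeader)
    int packetOverhead = 42;     // DpdkDriver::getPacketOverhead()
    int payload = echoPayload + rpcHeader + allDataHeader;   // 62 B, already
                                 // above the 46 B Ethernet payload minimum
    printf("on-wire bytes = %d\n", payload + packetOverhead);  // prints 104
    return 0;
}
```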

I also spent some time today trying to understand the end-to-end delay of a single packet in the m510 cluster. Suppose there is a linear relationship between the packet size and the end-to-end delay (there probably should be!): MinDelayMicros = A * packetBytes + B, where `A * packetBytes` is the transmission delay (aka store-and-forward delay, or packetization delay) and B is the sum of the propagation delay and the processing delay (see https://en.wikipedia.org/wiki/End-to-end_delay for details). Ideally, in a 10 Gbps network, A should be 0.1 ns/bit if the switch supports cut-through switching, or 0.2 ns/bit if the switch does store-and-forward. However, the A derived from our experiment data is actually ~0.25 ns/bit. I don't have an explanation yet.

View file: study_single_packet_end_to_end_delay.txt


 

After bringing BasicTransport more or less up to date with Homa, I tested its performance using W4. This reveals two problems. 1) It's really not a good idea to use FIFO for messages > 10 KB: there is a big jump in slowdown when the message size passes 10 KB. 2) The 99%-tile latency of small messages is pretty bad even under 50% load: this is due to the lack of a congestion control mechanism on the receiver side in Basic (it just aggressively grants messages whenever it receives a DATA packet). Eventually, we agreed to switch to SRPT but leave the granting mechanism unchanged. To separately evaluate the performance gain due to congestion control (rather than packet priorities), we are just going to run Homa with only one priority.

View file: slowdownImpl_Basic50.pdf
View file: slowdownImpl_Basic80.pdf


 

A fast way to tell whether the computation bottleneck is on the server side: search for "start of polling iteration X, last poll was Y ns ago" in the time trace to see if there are consistently empty polling iterations; if so, the client is probably the bottleneck, and you can check the client's time trace in the same way.


 

I haven't done any rigorous benchmarking, but ska::flat_hash_map (https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/) is clearly much faster than std::map, std::unordered_map, and sparsepp (https://github.com/greg7mdp/sparsepp) when the map is relatively small (e.g. ~20 entries). I am now using this map for incomingRpcs, outgoingRpcs, and MessageAccumulator::fragments in HomaTransport.
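
(A minimal usage sketch, assuming the single-header flat_hash_map.hpp that accompanies that blog post is on the include path; the RpcId struct and its hash below are stand-ins for illustration, not the RAMCloud definitions.)

```cpp
// ska::flat_hash_map as a drop-in replacement for std::unordered_map.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include "flat_hash_map.hpp"

struct RpcId {
    uint64_t clientId;
    uint64_t sequence;
    bool operator==(const RpcId& o) const {
        return clientId == o.clientId && sequence == o.sequence;
    }
};

struct RpcIdHash {
    size_t operator()(const RpcId& id) const {
        // Simple mix of the two fields; good enough for a small map.
        return id.clientId * 0x9e3779b97f4a7c15ull ^ id.sequence;
    }
};

int main()
{
    ska::flat_hash_map<RpcId, int, RpcIdHash> incomingRpcs;
    incomingRpcs[RpcId{1, 42}] = 7;
    auto it = incomingRpcs.find(RpcId{1, 42});
    if (it != incomingRpcs.end())
        printf("found value %d\n", it->second);
    return 0;
}
```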