...

[2] "Configuring and tuning HPE ProLiant Servers for low-latency applications": https://www.hpe.com/h20195/v2/GetPDF.aspx/c05281307.pdf

[3] https://downloads.linux.hpe.com/SDR/project/stk/

...

After calibrating the TSC clocks on the different machines and merging their time traces, it's easy to compute the one-way propagation delay of each packet. Ordering the delays of all outbound DATA/ALL_DATA packets from one client by their TX times makes it clear that, when several messages are already in flight and another 0.5MB message starts, the delay tends to increase visibly. The delay only becomes more or less stable again after all the unscheduled packets of that long message have been transmitted, and it decreases once the long message finishes. This suggests that a transient queue is building up in the sender's NIC. It turns out that there is an unsigned integer underflow in QueueEstimator::getQueueSize(uint64_t): when the caller passes in a stale time that is less than QueueEstimator::currentTime, getQueueSize incorrectly returns 0 as the estimated queue size. As a result, in the worst case, the transport could enqueue two more full packets even though there are already two in the NIC's TX queue. But how can the caller pass in a stale time value in the first place? One example: in HomaTransport::tryToTransmitData(), we use the stale time value stored in Dispatch::currentTime to save the cost of one RDTSC instruction. After fixing this bug, the delay no longer rises significantly and stays high for a long period as it did before.
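The following is a minimal sketch of the underflow, assuming a queue estimator that tracks the estimated bytes still sitting in the NIC's TX queue and drains them at a fixed rate; the field names and drain math are illustrative, not the actual RAMCloud QueueEstimator:

```cpp
#include <cstdint>

// Illustrative queue estimator; not the real RAMCloud class.
class QueueEstimator {
  public:
    explicit QueueEstimator(double bytesPerCycle)
        : bytesPerCycle(bytesPerCycle), queuedBytes(0), currentTime(0) {}

    // Record that `bytes` more bytes were handed to the NIC at `time`.
    void packetQueued(uint32_t bytes, uint64_t time) {
        queuedBytes = getQueueSize(time) + bytes;
    }

    // Estimate how many bytes are still in the TX queue at `time` (TSC ticks).
    uint32_t getQueueSize(uint64_t time) {
        // BUG being illustrated: if the caller passes a stale time (e.g., a
        // cached Dispatch::currentTime) that is less than currentTime, the
        // unsigned subtraction below wraps around to a huge value, the
        // "drained" estimate dwarfs queuedBytes, and the method returns 0
        // even though packets are still sitting in the NIC's TX queue.
        // FIX: clamp stale timestamps to the last time we saw.
        if (time < currentTime) {
            time = currentTime;
        }
        double drained = static_cast<double>(time - currentTime) * bytesPerCycle;
        queuedBytes = (queuedBytes > drained) ? (queuedBytes - drained) : 0;
        currentTime = time;
        return static_cast<uint32_t>(queuedBytes);
    }

  private:
    double bytesPerCycle;   // assumed NIC drain rate, bytes per TSC tick
    double queuedBytes;     // estimated bytes still in the TX queue
    uint64_t currentTime;   // TSC time of the last estimate
};
```

Without the clamp, the transport believes the TX queue is empty and can add two more full packets on top of the ones already queued, which matches the transient delay increases seen in the traces.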


 

When looking at the time trace of a small symmetric EchoRpc (30 bytes) that took > 12us to finish, I noticed that, on the server side, a large message (0.5MB) finished right before the server started to process the EchoRpc. The deletion took ~2us, mostly spent destroying Transport::ServerRpc::replyPayload, which contained 341 Buffer::Chunks and 4 malloc-ed blocks. John suggested also running the experiments with 0.1MB and 1MB large messages to verify whether this is indeed a bottleneck. The latency numbers from the two experiments show no visible difference, suggesting that this is not (yet) the bottleneck.


 

Why does it take so long for a master to start sending out the reply to an EchoRpc that fits in one ALL_DATA packet? First, inserting a `serverRpc` into the unordered_map `incomingRpcs` is surprisingly expensive: it can easily take more than 250ns! Also, after calling `context->workerManager->handleRpc(serverRpc)` in HomaTransport, it takes > 100ns before the RPC can be dispatched by the MasterService.

Attached file: high_latency_of_handling_single_packet_RPC.txt
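
As a rough sanity check on the insertion cost, a standalone micro-benchmark along the following lines could be used; this is an illustrative sketch with stand-in RpcId/hash types and std::chrono timing, not the attached RAMCloud micro-benchmark, and a warm-cache loop like this will tend to underestimate the cost seen inside the transport:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Stand-ins for RAMCloud's RpcId and ServerRpc; illustrative only.
struct RpcId {
    uint64_t clientId;
    uint64_t sequence;
    bool operator==(const RpcId& other) const {
        return clientId == other.clientId && sequence == other.sequence;
    }
};
struct RpcIdHash {
    std::size_t operator()(const RpcId& id) const {
        return std::hash<uint64_t>()(id.clientId) ^ (id.sequence * 0x9e3779b97f4a7c15ULL);
    }
};

int main() {
    std::unordered_map<RpcId, void*, RpcIdHash> incomingRpcs;
    const int count = 1000000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; i++) {
        incomingRpcs.emplace(RpcId{1, static_cast<uint64_t>(i)}, nullptr);
    }
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    // Average cost per insertion; rehashing and pointer chasing dominate.
    std::printf("avg insert: %.1f ns\n", ns / count);
    return 0;
}
```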


 

I have automated most of my analyses. Now the workflow is as follows:

  1. Run the clusterperf experiment to generate log files.
  2. Go to the log directory and run `RAMCloud/scripts/debugTailLatency/run_analysis.sh`. This will automatically merge the time traces and extract information that has proved useful in the past.
  3. Look into `bad_tail_latency.txt` to select the specific RpcId for further investigation.
  4. Run `RAMCloud/scripts/debugTailLatency/lookaround.sh "clientId X, sequence Y"` with (X, Y) being the RpcId selected in step 3. This will extract a detailed timeline of the specific RPC, including a few time trace records before and after the events of interest.
  5. Also look into other *.txt files generated in the directory for measurements of specific metrics.


 

To speed up the master's service time for small EchoRpcs, I implemented an optimization that avoids inserting "short" ServerRpcs into `incomingRpcs` in the first place. Here, a "short" ServerRpc is one whose 1) request fits in one data packet (so we can start servicing it as soon as we receive the first packet) and 2) response is an unscheduled message (so we don't need to receive grants). However, there is a problem with this definition: it ignores the fact that the processing time could be large (e.g., LOG_TIME_TRACE). As a result, if the client decides to retry the RPC, the server will have no record of it and will start servicing it again, thrashing the server significantly. Therefore, it's probably not a good idea. I need to think about the most efficient data structure for storing `incomingRpcs`. For example, does it even have to be a generic hash map from RpcId to ServerRpc*? One alternative is to index ServerRpcs in two levels (first clientId, then sequence) and exploit the assumptions about them to achieve cache efficiency; see the sketch below.
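
A minimal sketch of the two-level idea, assuming that sequence numbers from a given client arrive roughly in order and that only a small window of them is active at any time; the container choices and names below are hypothetical, not an implemented design:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// Stand-in for HomaTransport::ServerRpc.
struct ServerRpc;

// Hypothetical two-level index for incomingRpcs: an outer map keyed by
// clientId (few, mostly stable entries) and, per client, a dense window of
// sequence numbers stored contiguously so common-case lookups stay cheap.
class IncomingRpcIndex {
    struct PerClient {
        uint64_t baseSequence = 0;
        std::deque<ServerRpc*> window;   // slot i holds baseSequence + i
    };
    std::unordered_map<uint64_t, PerClient> clients;

  public:
    void insert(uint64_t clientId, uint64_t sequence, ServerRpc* rpc) {
        PerClient& pc = clients[clientId];
        if (pc.window.empty()) {
            pc.baseSequence = sequence;
        }
        if (sequence < pc.baseSequence) {
            return;   // sequences behind the window are not handled in this sketch
        }
        // Extend the window so `sequence` has a slot; unseen slots stay null.
        while (pc.baseSequence + pc.window.size() <= sequence) {
            pc.window.push_back(nullptr);
        }
        pc.window[sequence - pc.baseSequence] = rpc;
    }

    ServerRpc* lookup(uint64_t clientId, uint64_t sequence) {
        auto it = clients.find(clientId);
        if (it == clients.end()) {
            return nullptr;
        }
        PerClient& pc = it->second;
        if (sequence < pc.baseSequence ||
                sequence - pc.baseSequence >= pc.window.size()) {
            return nullptr;
        }
        return pc.window[sequence - pc.baseSequence];
    }

    void erase(uint64_t clientId, uint64_t sequence) {
        if (lookup(clientId, sequence) != nullptr) {
            PerClient& pc = clients[clientId];
            pc.window[sequence - pc.baseSequence] = nullptr;
            // Trimming completed slots at the front (and advancing
            // baseSequence) is omitted for brevity.
        }
    }
};
```

Whether something like this beats the generic unordered_map in practice would depend on how dense the active sequence window really is under retries and out-of-order completion.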

Update: this optimization appears to have no visible effect on the end-to-end latency numbers, even though the server's response time for those small EchoRpcs is reduced by 160ns(!) on average.

Attached file: incomingRpcs-opt-micro-benchmark.txt