Inf Under Load

Goal

Measure and understand the performance of Infiniband (as used within RAMCloud) under load.

Parameters of the experiments

  • Read operations are performed by clients. A single table/object is
    read over and over. More details are at Workload+Generator.
  • Cluster used - cluster hardware info is at Cluster+Configuration
    rc02 - server (master)
    rc03 - client (queen)
    rc04-31 - client (worker) - multiple if required.
  • The number of workers was limited to as many nodes as required to
    generate the load for each experiment (a subset of rc04-31). For a
    load of 4 I would use only rc04-07, while a load of 40 would require
    using some nodes twice.
  • Measurements are being performed using rdtsc
  • Code being measured - InfRcTransport.cc
    InfRcTransport<Infiniband>::ServerRpc::sendReply()
            InfRcTransport<Infiniband>::getTransmitBuffer()
                    infiniband->pollCompletionQueue()
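
A minimal sketch of how rdtsc-based measurements like these might be
gathered. It assumes GCC or Clang on x86, where __rdtsc() comes from
<x86intrin.h>, and falls back to std::chrono::steady_clock elsewhere.
The names readTicks, CycleStats and ScopedTimer are hypothetical and
are not taken from the RAMCloud sources.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   // __rdtsc() on GCC/Clang
#else
#include <chrono>
#endif

// Read a tick counter: the TSC on x86, a steady clock otherwise.
inline uint64_t readTicks() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return static_cast<uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical accumulator for one measured function.
struct CycleStats {
    uint64_t totalTicks = 0;   // sum of elapsed ticks over all calls
    uint64_t calls = 0;        // number of timed calls

    double average() const {
        return calls ? static_cast<double>(totalTicks) /
                       static_cast<double>(calls)
                     : 0.0;
    }
};

// Adds the ticks elapsed during its lifetime to a CycleStats.
class ScopedTimer {
  public:
    explicit ScopedTimer(CycleStats& stats)
        : stats_(stats), start_(readTicks()) {}
    ~ScopedTimer() {
        stats_.totalTicks += readTicks() - start_;
        ++stats_.calls;
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

  private:
    CycleStats& stats_;
    uint64_t start_;
};
```

Placing a ScopedTimer at the top of a function such as
getTransmitBuffer() would time each call. Timing an empty scope the
same way gives a rough estimate of the cost of the timer itself.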
    

Results/Graphs

Reference Graph - Throughput of the system for 100 byte object reads using different Transmit Buffer Pool sizes

Analysis of throughput curves.

  • Summary - throughput drops to 50% under high load.
  • The throughput of the system is measured here against increasing
    load. The load is in terms of read operations on 100 byte objects.
  • We notice that the throughput of the system drops by a factor of 2
    at high loads. This is observed even though we are nowhere near the
    network limits at this point. The measured outgoing throughput is
    390217 ops/sec, which at 100 bytes per object is about 39 MB/sec or
    312 Mbits/sec, well under the expected 32 Gbps limit.
  • The red, blue and green lines were measured with 24 RX buffers and
    8, 32 and 64 TX buffers respectively.
  • The violet line was measured with 48 RX buffers and 8 TX
    buffers. Notice that adding buffers to the pool on the receive side
    allows the transmit side to see a higher throughput. I do not
    understand the reasons for this.
  • A set of further measurements are taken during the same experiment
    and plotted on different graphs to aid understanding.
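
The conversion from ops/sec to network bandwidth quoted above can be
checked with a short sketch. It counts object payload only (no RPC or
Infiniband headers), which is an assumption; WireRate and payloadRate
are hypothetical names.

```cpp
#include <cassert>

// Payload bandwidth implied by an operation rate.
struct WireRate {
    double bytesPerSec;
    double bitsPerSec;
    double fractionOfLink;   // bitsPerSec divided by the link capacity
};

// opsPerSec:      measured read operations per second
// objectBytes:    payload size of each object read
// linkBitsPerSec: nominal link capacity in bits per second
inline WireRate payloadRate(double opsPerSec, double objectBytes,
                            double linkBitsPerSec) {
    assert(linkBitsPerSec > 0.0);
    WireRate r;
    r.bytesPerSec = opsPerSec * objectBytes;
    r.bitsPerSec = r.bytesPerSec * 8.0;
    r.fractionOfLink = r.bitsPerSec / linkBitsPerSec;
    return r;
}
```

With the numbers from this experiment, payloadRate(390217, 100, 32e9)
gives 39021700 bytes/sec and 312173600 bits/sec, which is just under
1% of the 32 Gbps link.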

Average number of buffers returned by pollCompletionQueue across different Buffer Pool sizes

Analysis

  • Summary - 3 empty buffers become available at a time under load.
    Number of receive buffers also affects pollCompletionQueue().
  • The average is taken only over calls to pollCompletionQueue() that
    returned at least one buffer. Calls that returned no buffers are
    excluded.
  • The red, blue and green lines were measured with 24 RX buffers and
    8, 32 and 64 TX buffers respectively.
  • The violet line was measured with 48 RX buffers and 8 TX buffers.
  • An interesting trend appears to be independent of the number of
    buffers in the pool: the average drops at the same load
    irrespective of buffer-pool size.
  • Why does doubling the number of receive buffers affect the number of
    empty transmit buffers returned? Compare Red against Violet.
  • Look at the average number of buffers returned under high load.
    This number appears to be around 3. If empty buffers were returned
    as soon as they became available, we would expect it to be 1 under
    load. The higher number suggests that buffers may be returned in
    sets of 3. What is the reason for this behavior?
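
A sketch of how the plotted average might be computed, assuming each
call to pollCompletionQueue() reports how many empty buffers it
returned. PollStats is a hypothetical name, not RAMCloud code. It
keeps the all-calls average next to the non-zero-calls average so the
two can be compared.

```cpp
#include <cassert>
#include <cstdint>

// Counts buffers returned per poll of the completion queue.
struct PollStats {
    uint64_t calls = 0;            // every poll, including empty ones
    uint64_t nonZeroCalls = 0;     // polls that returned >= 1 buffer
    uint64_t buffersReturned = 0;  // total buffers over all polls

    void record(int returned) {
        assert(returned >= 0);
        ++calls;
        if (returned > 0) {
            ++nonZeroCalls;
            buffersReturned += static_cast<uint64_t>(returned);
        }
    }

    // The quantity plotted here: average over non-empty polls only.
    double avgPerNonZeroCall() const {
        return nonZeroCalls ? static_cast<double>(buffersReturned) /
                              static_cast<double>(nonZeroCalls)
                            : 0.0;
    }

    // Average over all polls, for comparison.
    double avgPerCall() const {
        return calls ? static_cast<double>(buffersReturned) /
                       static_cast<double>(calls)
                     : 0.0;
    }
};
```

For example, a sequence of polls returning 0, 0, 3, 0, 3, 3 buffers
yields an average of 3 per non-empty poll but 1.5 per poll overall.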

Latency Graph - Time spent in pollCompletionQueue per read (average) - fixed pool of buffers - comparing time taken by successful calls against calls that return 0

Analysis

  • This is the same latency curve as above restricted to the case where
    the size of the buffer pool for TX buffers is 8.
  • The Red line represents avg time taken by the getTransmitBuffer() call.
  • The Blue line represents avg time taken across all the calls to
    pollCompletionQueue()
  • The Green line represents the average time taken by calls to
    pollCompletionQueue() that returned zero empty buffers.
  • The Violet line represents the average time taken by calls to
    pollCompletionQueue() that returned one or more empty buffers.
  • Note that the time taken per successful call increased only
    slightly with load. The number of calls, however, increased with
    load, so the overall time taken by getTransmitBuffer() increased.
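
The split between the Blue, Green and Violet lines can be sketched as
below, assuming each poll is timed individually and tagged with the
number of buffers it returned. PollTimeByOutcome is a hypothetical
name. The per-read figure shows how many cheap polls can add up to a
large total even when each successful poll stays roughly constant.

```cpp
#include <cassert>
#include <cstdint>

// Time spent in polls, split by whether the poll returned any buffers.
struct PollTimeByOutcome {
    uint64_t emptyCalls = 0, emptyTicks = 0;      // returned 0 buffers
    uint64_t successCalls = 0, successTicks = 0;  // returned >= 1 buffer

    void record(int buffersReturned, uint64_t ticks) {
        assert(buffersReturned >= 0);
        if (buffersReturned == 0) {
            ++emptyCalls;
            emptyTicks += ticks;
        } else {
            ++successCalls;
            successTicks += ticks;
        }
    }

    static double ratio(uint64_t num, uint64_t den) {
        return den ? static_cast<double>(num) / static_cast<double>(den)
                   : 0.0;
    }

    double avgEmpty() const { return ratio(emptyTicks, emptyCalls); }
    double avgSuccess() const { return ratio(successTicks, successCalls); }
    double avgAll() const {
        return ratio(emptyTicks + successTicks, emptyCalls + successCalls);
    }
    // Total poll time charged to each read operation.
    double ticksPerRead(uint64_t reads) const {
        return ratio(emptyTicks + successTicks, reads);
    }
};
```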

Latency Graph - Time spent in pollCompletionQueue per read (average) across different Transmit Buffer Pool sizes

Analysis

  • Summary - pollCompletionQueue() (and hence getTransmitBuffer())
    takes longer to run with increasing load. Note that
    pollCompletionQueue() is called repeatedly until a successful
    return, that is, until it yields an empty buffer from the pool.
  • Red/Blue lines represent 24 RX buffers and 8 TX buffers.
    The Red line is the time taken on average by the getTransmitBuffer()
    call and the Blue line is the time taken on average across the set
    of calls to pollCompletionQueue() required to get back an empty
    buffer. Note that the times measured are small enough that the time
    spent in the timer calls themselves has an effect here.
  • Green/Violet lines represent 24 RX buffers and 32 TX buffers
  • Orange/Pink lines represent 24 RX buffers and 64 TX buffers
  • This is a plot of the time taken by the different functions during
    the experiments.
  • Total time spent in pollCompletionQueue() was tracked and then
    divided by the number of read calls to calculate the average.
  • This tracks the curve of time spent within the getTransmitBuffer()
    call well. The difference between the two needs to be explained.
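
A sketch of the retry structure described in the summary: poll until
at least one empty buffer is available, then charge the polls to the
read that needed them. It assumes polling happens only when no free
buffer is at hand. ScriptedQueue, ReadStats and acquireBuffer are
hypothetical stand-ins, not the RAMCloud implementation; the queue
simply replays a scripted sequence of return counts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for the completion queue: each poll() returns the next
// scripted number of freed transmit buffers, cycling through the script.
class ScriptedQueue {
  public:
    explicit ScriptedQueue(std::vector<int> script)
        : script_(std::move(script)) {
        bool anyPositive = false;
        for (int n : script_) {
            assert(n >= 0);
            anyPositive = anyPositive || n > 0;
        }
        assert(anyPositive);  // otherwise acquireBuffer() never returns
        (void)anyPositive;
    }

    int poll() {
        int n = script_[next_ % script_.size()];
        ++next_;
        return n;
    }

  private:
    std::vector<int> script_;
    std::size_t next_ = 0;
};

struct ReadStats {
    uint64_t reads = 0;   // buffers handed out, one per read
    uint64_t polls = 0;   // polls issued while waiting for a buffer

    double pollsPerRead() const {
        return reads ? static_cast<double>(polls) /
                       static_cast<double>(reads)
                     : 0.0;
    }
};

// Take one free buffer, polling the queue until one is available.
inline void acquireBuffer(ScriptedQueue& queue, int& freeBuffers,
                          ReadStats& stats) {
    while (freeBuffers == 0) {
        freeBuffers += queue.poll();
        ++stats.polls;
    }
    --freeBuffers;
    ++stats.reads;
}
```

Accumulating elapsed time instead of a poll count in the loop gives
the total-time-divided-by-reads average used for this graph.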