Goal
Measure and understand the performance of Infiniband under load.
Parameters of the experiments
- Read operations performed by clients. A single table/object is read
over and over.
- Cluster used - cluster hardware info is at Cluster+Configuration
rc02 - server (master)
rc03 - client (queen)
rc04-31 - client (worker) - multiple if required.
- Call stack in code being measured
InfRcTransport<Infiniband>::ServerRpc::sendReply()
InfRcTransport<Infiniband>::getTransmitBuffer()
infiniband->pollCompletionQueue(commonTxCq, MAX_TX_QUEUE_DEPTH, retArray);
Results/Graphs
Reference Graph - Throughput of the system for 100 byte object
reads using different Transmit Buffer Pool sizes
Analysis of throughput curves.
- Summary - throughput drops to 50% under high load.
- The throughput of the system is measured here against increasing
load. The load is in terms of read operations on 100 byte objects.
- We notice that the throughput of the system drops by a factor of 2
for high loads. This is observed even though we are nowhere near the
network limits at this point. The measured outgoing throughput is
390217 ops/sec or 39M bytes/sec or 310M bits/sec, which is well under
the expected 32Gbps limit.
- The red, blue and green lines were measured with 24 RX buffers and
8, 32 and 64 TX buffers respectively.
- The violet line was measured with 48 RX buffers and 8 TX
buffers. Notice that adding buffers to the pool on the receive side
allows the transmit side to see a higher throughput - I do not
understand the reasons for this.
- A set of further measurements was taken during the same experiment
and plotted on different graphs to aid understanding.
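The bandwidth figures quoted above can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: `LinkUsage` and `payloadUsage` are hypothetical names, and the calculation counts object payload alone, ignoring RPC and transport headers.

```cpp
#include <cassert>

// Hypothetical helper types for illustration; not part of the measured code.
struct LinkUsage {
    double bytesPerSec;     // payload bytes sent per second
    double bitsPerSec;      // the same rate expressed in bits
    double fractionOfLink;  // share of the nominal link capacity used
};

// Convert a read rate into payload bandwidth and compare it with the
// link capacity. Only object payload is counted (an assumption); header
// overhead would raise the real figure somewhat.
inline LinkUsage payloadUsage(double opsPerSec, double objectBytes,
                              double linkBitsPerSec) {
    LinkUsage u;
    u.bytesPerSec = opsPerSec * objectBytes;
    u.bitsPerSec = u.bytesPerSec * 8.0;
    u.fractionOfLink = u.bitsPerSec / linkBitsPerSec;
    return u;
}
```

Calling `payloadUsage(390217, 100, 32e9)` gives about 39M bytes/sec and about 312M bits/sec (rounded to 310M in the text), which is just under 1% of the 32Gbps figure.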
Latency Graph - Time spent in pollCQ per read (average) across
different Transmit Buffer Pool sizes
Analysis
- Summary - pollCompletionQueue() (and hence getTransmitBuffer()) takes
longer to run with increasing load. Note that pollCompletionQueue()
is called repeatedly until it returns successfully, i.e. with an empty
buffer from the pool.
- Red/Blue lines represent 24 RX buffers and 8 TX buffers
- Green/Violet lines represent 24 RX buffers and 32 TX buffers
- Orange/Pink lines represent 24 RX buffers and 64 TX buffers
- This is a plot of measurements of time taken by the different functions
during the experiments.
- Total time spent in pollCQ was tracked and then divided by the
number of read calls to calculate the average.
- This tracks the curve of time spent within the getTransmitBuffer
call well. The difference between the two needs to be explained.
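The polling behaviour and the per-read average described above can be modelled roughly as follows. This is a simplified sketch, not the transport's actual code: `PollFn`, `PollStats` and `acquireTransmitBuffer` are hypothetical names, the poll function stands in for pollCompletionQueue(), and it assumes polling happens only when the free pool is empty and that each read consumes one transmit buffer.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for one poll of the transmit completion queue:
// returns how many transmit buffers were reclaimed (0 if none yet).
using PollFn = std::function<int()>;

struct PollStats {
    uint64_t pollCalls = 0;  // number of calls to the poll function
    uint64_t pollNanos = 0;  // total time spent inside those calls
    uint64_t reads = 0;      // reads served (one buffer each, assumed)

    // Total poll time divided by the number of reads.
    double avgPollNanosPerRead() const {
        return reads == 0 ? 0.0 : static_cast<double>(pollNanos) / reads;
    }
    double avgPollCallsPerRead() const {
        return reads == 0 ? 0.0 : static_cast<double>(pollCalls) / reads;
    }
};

// Model of acquiring one transmit buffer: while the free pool is empty,
// keep polling until at least one buffer comes back, timing every poll.
inline void acquireTransmitBuffer(int& freeBuffers, const PollFn& poll,
                                  PollStats& stats) {
    using Clock = std::chrono::steady_clock;
    while (freeBuffers == 0) {
        Clock::time_point start = Clock::now();
        int reclaimed = poll();
        Clock::time_point end = Clock::now();
        stats.pollCalls++;
        stats.pollNanos += static_cast<uint64_t>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(
                end - start).count());
        freeBuffers += reclaimed;
    }
    --freeBuffers;   // hand one buffer to the caller
    stats.reads++;   // assumption: one transmit buffer per read reply
}
```

In this model the average poll time per read grows either because each poll gets slower or because more polls are needed per read, which is the distinction the later graphs try to separate.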
Latency Graph - Time spent in pollCQ per read (average) - fixed
pool of buffers - comparing time taken by successful calls against
calls that return 0
Analysis
- This is the same latency curve as above restricted to the case where
the size of the buffer pool for TX buffers is 8.
- The Red line represents avg time taken by the getTransmitBuffer() call.
- The Blue line represents avg time taken across all the calls to
pollCompletionQueue().
- The Green line represents the average time taken by calls to
pollCompletionQueue() that returned zero empty buffers.
- The Violet line represents the average time taken by calls to
pollCompletionQueue() that returned a non-zero number of empty buffers.
- Note that the time taken per successful call increased slightly with
load. The number of calls, however, increased with load, so the overall
time taken by getTransmitBuffer() increased.
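One way to produce the per-outcome averages plotted here is to keep separate counters for polls that returned zero buffers and polls that returned at least one. The accumulator below is a hypothetical sketch (the name `PollBreakdown` and its fields are not from the measured code), and it assumes the duration of each poll has already been measured by the caller.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical accumulator that splits poll timings by outcome.
struct PollBreakdown {
    uint64_t emptyCalls = 0;    // polls that returned zero buffers
    uint64_t emptyNanos = 0;    // time spent in those polls
    uint64_t successCalls = 0;  // polls that returned at least one buffer
    uint64_t successNanos = 0;  // time spent in those polls

    // Record one poll: how many buffers it returned and how long it took.
    void record(int buffersReturned, uint64_t nanos) {
        if (buffersReturned == 0) {
            emptyCalls++;
            emptyNanos += nanos;
        } else {
            successCalls++;
            successNanos += nanos;
        }
    }

    uint64_t totalNanos() const { return emptyNanos + successNanos; }

    double avgEmptyNanos() const {
        return emptyCalls == 0 ? 0.0
            : static_cast<double>(emptyNanos) / emptyCalls;
    }
    double avgSuccessNanos() const {
        return successCalls == 0 ? 0.0
            : static_cast<double>(successNanos) / successCalls;
    }
    double avgAllNanos() const {
        uint64_t calls = emptyCalls + successCalls;
        return calls == 0 ? 0.0
            : static_cast<double>(totalNanos()) / calls;
    }
};
```

Because the total is the number of calls times the per-call average in each category, the total can rise sharply with load even when the per-call averages barely move, provided the number of empty polls grows.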
Latency Graph - Average number of buffers returned by pollCQ
across different Buffer Pool sizes
Analysis
- The red, blue and green lines were measured with 24 RX buffers and
8, 32 and 64 TX buffers respectively.
- The violet line was measured with 48 RX buffers and 8 TX buffers.
- An interesting trend appears that seems to be independent of the number of
buffers in the pool: the average drops at the same load
irrespective of buffer pool size.
- Why does doubling the number of receive buffers affect the number of
empty transmit buffers returned? Compare Red against Violet.