Infiniband Verbs Performance

Setup

  • 2x Westmere 5620 boxes (2.4GHz)
  • Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] cards
  • Mellanox Infiniband switch (model "IS5030/5"; for Infiniband measurements)
  • Arista 7124S (I think) 10GigE switch (for RoCE measurements)

Words of Warning

The OpenFabrics Alliance distribution includes several benchmark tools, e.g.:

  • ib_send_lat measures 1/2 ping-pong times using ibv_post_send and ibv_post_recv calls.
  • ib_write_lat measures one-way RDMA write times plus a send/recv from server back to client to indicate write completion (and probably divides by two).
  • etc.

However, ib_send_lat's results (and probably others...) are a little difficult to make sense of. Here are some things worth keeping in mind:

  1. All results given are one-way. So multiply by two for approximate RTT times.
  2. The program defaults to using inline sends for messages up to 400 bytes. Inline means that the data to be sent is colocated with the WQE, avoiding an extra fetch. For small transactions, this apparently results in a 50% better RTT. See the -I (capital i) flag.
  3. It reports min, max and "typical" times. Typical is the median, not the mean. Mean is often up to 20% worse than median.
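The median-vs-mean gap in item 3 comes from latency distributions being heavy-tailed: most samples cluster near the median, but rare large outliers pull the mean up. A toy illustration with synthetic numbers (not measured data):

```python
import statistics

# Hypothetical RTT samples in usec: most near 5, a few near 6, and two
# large outliers, roughly the shape of the histograms further below.
samples = [5.0] * 95 + [6.0] * 3 + [50.0, 100.0]

median = statistics.median(samples)  # unaffected by the outliers
mean = statistics.mean(samples)      # dragged up by the tail
print(f"median = {median:.2f} usec, mean = {mean:.2f} usec")
```

With just two outliers in a hundred samples, the mean lands well above the median, which is why reporting only the "typical" (median) time flatters the result.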

In addition, it appears that the ib_write_lat benchmark uses RDMA writes in one direction and ibv_post_send calls in the other to signal completion. If times are calculated as in ib_send_lat, then the values reported are 1/2 RTT, which conflates RDMA write and send/receive operation times. This is probably more honest for an actual RDMA user, but may give a worse one-way time for the write operation itself.

RTT Results using ib_send_lat

Note:

  • Measurements given are average RTT in microseconds, not the one-way median microseconds the program emits by default
    • (ib_send_lat was modified slightly to wrap run_iter() in rdtscs)
  • RoCE is 10GigE through the Arista switch
  • "UD" => unreliable datagram, "RC" => reliable connected
  • "Inlined" => data transmit buffers are inlined in the work request struct
  • ib_send_lat doesn't touch the transmit/receive data at all
  • examples of commands used:
    • 128 bytes, not inlined:
      • ib_send_lat -c UD -I 0 -s 128 -n 1000000
    • 128 bytes, inlined:
      • ib_send_lat -c UD -I 400 -s 128 -n 1000000
    • RoCE requires an additional -x0 -i2
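The averages in the table were derived from raw rdtsc tick counts (run_iter() wrapped in rdtscs, as noted above). A quick sanity check of the conversion, assuming the 2.4 GHz Westmere clock from the setup, i.e. 2400 ticks per microsecond:

```python
# Convert rdtsc tick counts to microseconds, assuming a 2.4 GHz clock
# (2400 ticks per microsecond) as in the Westmere 5620 setup.
TICKS_PER_USEC = 2400

def ticks_to_usec(ticks, rtts=1):
    """Average microseconds per RTT given a total tick count."""
    return ticks / rtts / TICKS_PER_USEC

# Example from the 4-byte RC histogram below: 12355458471 ticks over
# 1,000,000 RTTs works out to about 5.148 usec per RTT.
print(ticks_to_usec(12_355_458_471, 1_000_000))
```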

    Packet Size | Inlined? | Type | Avg. Infiniband usec | Avg. RoCE usec
    ------------+----------+------+----------------------+---------------
    4           | N        | UD   | 5.10                 | 6.32
    4           | Y        | UD   | 3.24                 | 4.48
    4           | N        | RC   | 5.41                 | 6.49
    4           | Y        | RC   | 3.56                 | 4.62
    128         | N        | UD   | 5.50                 | 6.90
    128         | Y        | UD   | 3.88                 | 5.28
    128         | N        | RC   | 5.45                 | 6.86
    128         | Y        | RC   | 3.94                 | 5.19
    1024        | N/A      | UD   | 7.14                 | 9.59
    1024        | N/A      | RC   | 7.07                 | 9.46

RTT Result Histograms

  • 4-byte send/receive ping-pong over Infiniband using reliable QPs (RC)
    client did 1000000 RTTs in 12355458471 ticks (12355 ticks/RTT)
    client min ticks/RTT: 11361, max ticks/RTT: 329547
    avg: 5.148 usec
    histogram:
      4    : 111740     (cdf 11.1740%)
      5    : 836703     (cdf 94.8443%)
      6    : 38461      (cdf 98.6904%)
      7    : 7673       (cdf 99.4577%)
      8    : 3820       (cdf 99.8397%)
      9    : 340        (cdf 99.8737%)
      10   : 53         (cdf 99.8790%)
      11   : 1084       (cdf 99.9874%)
      12   : 29         (cdf 99.9903%)
      13   : 30         (cdf 99.9933%)
      14   : 22         (cdf 99.9955%)
      15   : 10         (cdf 99.9965%)
      16   : 12         (cdf 99.9977%)
      17   : 2          (cdf 99.9979%)
      21   : 1          (cdf 99.9980%)
      23   : 1          (cdf 99.9981%)
      24   : 1          (cdf 99.9982%)
      28   : 1          (cdf 99.9983%)
      29   : 2          (cdf 99.9985%)
      31   : 1          (cdf 99.9986%)
      34   : 1          (cdf 99.9987%)
      35   : 1          (cdf 99.9988%)
      38   : 2          (cdf 99.9990%)
      41   : 1          (cdf 99.9991%)
      60   : 1          (cdf 99.9992%)
      70   : 1          (cdf 99.9993%)
      93   : 1          (cdf 99.9994%)
      97   : 1          (cdf 99.9995%)
      102  : 1          (cdf 99.9996%)
      104  : 1          (cdf 99.9997%)
      106  : 1          (cdf 99.9998%)
      109  : 1          (cdf 99.9999%)
    
  • 4-byte send/receive ping-pong over Infiniband using unreliable datagram QPs (UD)
    client did 1000000 RTTs in 12630466584 ticks (12630 ticks/RTT)
    client min ticks/RTT: 11901, max ticks/RTT: 253923
    avg: 5.263 usec
    histogram:
      4    : 159        (cdf 0.0159%)
      5    : 951073     (cdf 95.1232%)
      6    : 37846      (cdf 98.9078%)
      7    : 7682       (cdf 99.6760%)
      8    : 2831       (cdf 99.9591%)
      9    : 264        (cdf 99.9855%)
      10   : 59         (cdf 99.9914%)
      11   : 16         (cdf 99.9930%)
      12   : 12         (cdf 99.9942%)
      13   : 14         (cdf 99.9956%)
      14   : 6          (cdf 99.9962%)
      15   : 12         (cdf 99.9974%)
      16   : 10         (cdf 99.9984%)
      17   : 3          (cdf 99.9987%)
      18   : 1          (cdf 99.9988%)
      24   : 2          (cdf 99.9990%)
      40   : 1          (cdf 99.9991%)
      49   : 1          (cdf 99.9992%)
      50   : 2          (cdf 99.9994%)
      90   : 1          (cdf 99.9995%)
      93   : 1          (cdf 99.9996%)
      101  : 1          (cdf 99.9997%)
      104  : 2          (cdf 99.9999%)
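The cdf column in the histograms above is just the running bucket total as a percentage of all samples. A minimal sketch of that computation (the function name and shape are my own, not from the modified ib_send_lat):

```python
# Rebuild cumulative percentages from a latency histogram: each bucket
# (usec) maps to a sample count; cdf = running total / total samples.
def cdf(histogram, total):
    running = 0
    out = {}
    for bucket in sorted(histogram):
        running += histogram[bucket]
        out[bucket] = 100.0 * running / total
    return out

# First few buckets of the 4-byte RC histogram above:
hist = {4: 111740, 5: 836703, 6: 38461}
for bucket, pct in cdf(hist, 1_000_000).items():
    print(f"{bucket:<4} : cdf {pct:.4f}%")
```

Running this on the first three RC buckets reproduces the 11.1740%, 94.8443%, and 98.6904% figures shown above.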