= Understanding the Bandwidth Limit of RAMCloud's InfUdDriver
:toc:
:toc-placement!:

toc::[]

# Introduction

# Infiniband Verbs Performance Tests

The source code of IB perftest is available at https://github.com/linux-rdma/perftest. To build perftest on `rcXX` machines:

----
$ git clone git@github.com:linux-rdma/perftest.git
$ cd perftest
$ ./autogen.sh
$ ./configure --prefix=`pwd` && make -j install
----

To measure the bandwidth between `rcXX` machines using unreliable datagram (UD), start the server side of `perftest` on, say, `rc02` first:

----
yilongl@rc02:~/perftest/bin$ ./ib_send_bw --connection UD --run_infinitely
----

Then start the client side of `perftest` on another machine:

----
yilongl@rc03:~/perftest/bin$ ./ib_send_bw rc02 --connection UD --run_infinitely --size 1000 --post_list 1
----

To measure the bidirectional bandwidth, append the option `--bidirectional` to both of the commands above.

TODO: mention some of the useful options like `--inline_size`, etc.

Simplify perftest:

* only `--run_infinitely`
* no mcg
* rdma_cm == OFF
* no DC, RawEth, or XRC
* only IBV_TRANSPORT_IB (i.e. no IBV_TRANSPORT_IWARP)
* no sleep on CQ events
* no bw or msgrate limit
* no HAVE_VERBS_EXP

TODO: performance degrades significantly when msg_size drops below 1729; what happened here?

----
server: ./ib_send_bw --connection UD --run_infinitely
client: ./ib_send_bw rc02 --connection UD --run_infinitely --post_list 1 --tx-depth 128 --cq-mod 8 --size 1728
----

Hypothesis: it might have something to do with cache and/or TLB misses on the server side; perhaps reducing msg_size by one byte reduces the PCIe traffic enough that the sender can send much faster, while the receiver can't keep up and unfortunately enters and stays in this pathological state. How can we design an experiment to verify this? Is there any packet loss in this case (i.e., does IB link-layer flow control kick in for UD)?

# Instrument mlx4 driver to add time traces

The Infiniband drivers used on the rc cluster are documented on the https://ramcloud.atlassian.net/wiki/spaces/RAM/pages/25493518/Cluster+Configuration+with+Debian+8.3#ClusterConfigurationwithDebian8.3-InfinibandDrivers.1[RAMCloud Wiki]. The downloaded tarball `MLNX_OFED_LINUX-3.1-1.0.3-debian8.1-x86_64` contains source code in addition to pre-compiled binaries. The two most relevant libraries are `libibverbs` and `libmlx4`, whose source code can be found under `MLNX_OFED_LINUX-3.1-1.0.3-debian8.1-x86_64/src/MLNX_OFED_SRC-3.1-1.0.3/SOURCES/`.

https://github.com/linux-rdma/rdma-core/commits/ef7827c37c9d21811e6d63d055cf6a9b43ce606c

During compilation, our application is linked against the shared library `libibverbs` dynamically. On application startup, `libibverbs` automatically loads `libmlx4` via `dlopen` inside `load_driver(const char* name)`. The path to `libmlx4` is specified in a config file inside `IBV_CONFIG_DIR`. For instance, the config file for mlx4 is `/etc/libibverbs.d/mlx4.driver` on rc machines.

----
yilongl@rcmaster:~$ cat /etc/libibverbs.d/mlx4.driver
driver /usr/lib/libibverbs/libmlx4
----

To compute the absolute path to the actual `.so` file, `load_driver` of `libibverbs` appends a suffix to the path prefix contained in the config file (a rough sketch of this lookup is shown below).

----
yilongl@rcmaster:~$ ls -l /usr/lib/libibverbs/
total 316
-rw-r--r-- 1 root root 145024 Sep 29  2015 libmlx4-rdmav2.so
-rw-r--r-- 1 root root 173760 Sep 29  2015 libmlx5-rdmav2.so
----

I obtained most of the information above by running `ib_send_bw` under gdb and setting a breakpoint on `dlopen`.
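Putting the pieces together, the lookup is roughly the following. This is a simplified sketch, not the actual `libibverbs` code: the `-rdmav2.so` suffix is inferred from the directory listing above, and the real `load_driver` also deals with multiple config files and bare driver names.

[source,c]
----
/* Sketch of the driver lookup described above (NOT the real libibverbs
 * implementation).  The path prefix comes from a line such as
 * "driver /usr/lib/libibverbs/libmlx4" in /etc/libibverbs.d/mlx4.driver;
 * the "-rdmav2.so" suffix is inferred from the ls output above.
 * Build: gcc -o driver_lookup driver_lookup.c -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void *load_driver_sketch(const char *path_prefix)
{
    char *so_name = NULL;

    /* "/usr/lib/libibverbs/libmlx4" -> "/usr/lib/libibverbs/libmlx4-rdmav2.so" */
    if (asprintf(&so_name, "%s-rdmav2.so", path_prefix) < 0)
        return NULL;

    /* This is the dlopen call that a "break dlopen" in gdb stops at. */
    void *handle = dlopen(so_name, RTLD_NOW);
    if (handle == NULL)
        fprintf(stderr, "failed to load %s: %s\n", so_name, dlerror());

    free(so_name);
    return handle;
}

int main(void)
{
    return load_driver_sketch("/usr/lib/libibverbs/libmlx4") ? 0 : 1;
}
----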
Note that gdb has trouble finding the source code of `libibverbs` by default; we can fix this by using gdb's `directory` command to add the missing source directory manually. You can verify that the driver library is successfully loaded by `libibverbs` using `info sharedlibrary` inside gdb.

----
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007ffff7ddcae0  0x00007ffff7df5130  Yes         /lib64/ld-linux-x86-64.so.2
                                        No          linux-vdso.so.1
0x00007ffff7bd5450  0x00007ffff7bd8be8  Yes (*)     /usr/lib/libibumad.so.3
0x00007ffff78d8580  0x00007ffff7943d96  Yes         /lib/x86_64-linux-gnu/libm.so.6
0x00007ffff76c01e0  0x00007ffff76cdf90  Yes (*)     /usr/lib/librdmacm.so.1
0x00007ffff74ab6c0  0x00007ffff74b6dda  Yes         /usr/lib/libibverbs.so.1
0x00007ffff728e9f0  0x00007ffff729a771  Yes         /lib/x86_64-linux-gnu/libpthread.so.0
0x00007ffff6efd4a0  0x00007ffff7029ef3  Yes         /lib/x86_64-linux-gnu/libc.so.6
0x00007ffff6c9c650  0x00007ffff6cc086f  Yes (*)     /usr/lib/x86_64-linux-gnu/libnl-route-3.so.200
0x00007ffff6a71560  0x00007ffff6a7cf35  Yes (*)     /lib/x86_64-linux-gnu/libnl-3.so.200
0x00007ffff6865ed0  0x00007ffff686697e  Yes         /lib/x86_64-linux-gnu/libdl.so.2
0x00007ffff663ee00  0x00007ffff665e381  Yes         /usr/lib/libibverbs/libmlx5-rdmav2.so
0x00007ffff6433360  0x00007ffff6437d95  Yes (*)     /usr/lib/x86_64-linux-gnu/libnuma.so.1
(*): Shared library is missing debugging information.
----

Now we can proceed to instrument `libmlx4` and rebuild the library. Modify the first line of `Makefile.am` to remove `-Werror`, then:

----
$ ./autogen.sh
$ ./configure --prefix=`pwd` && make -j install
$ ls -l libs
----

Then modify `/etc/libibverbs.d/mlx4.driver` to point to the newly-built library. Nope, this doesn't work; the application somehow loads two versions of `libmlx4`.

# Understanding PCIe overhead

* https://www.xilinx.com/support/documentation/white_papers/wp350.pdf[Understanding Performance of PCI Express Systems]
* https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/neugebauer2018understanding.pdf[Understanding PCIe performance for end host networking]
* https://github.com/opcm/pcm[Intel Processor Counter Monitor]
* https://ieeexplore.ieee.org/document/8048990[Evaluating Effect of Write Combining on PCIe Throughput to Improve HPC Interconnect Performance]
* https://community.mellanox.com/docs/DOC-2496[Understanding PCIe Configuration for Maximum Performance]
* https://www.techarp.com/bios-guide/pci-e-maximum-payload-size/
* https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/517428
* https://www.google.com/search?q=pcie+performance+counter+site:software.intel.com
* https://www.design-reuse.com/articles/15900/realizing-the-performance-potential-of-a-pci-express-ip.html[Realizing the Performance Potential of a PCI-Express IP]
* http://xillybus.com/tutorials/pci-express-tlp-pcie-primer-tutorial-guide-1[Down to the TLP: How PCI express devices talk (Part I)]
* https://codywu2010.wordpress.com/2015/11/26/pci-express-max-read-request-max-payload-size-and-why-you-care/[PCI Express Max Read Request, Max Payload Size and why you care]

In the Homa paper, we said the rc cluster has 24Gbps effective bandwidth. That is not accurate: our HCAs and switches run at 40Gbps, while PCIe 2.0 x8 limits us to 32Gbps, which is 4GB/s! Yet our ClusterPerf results are far below this number (e.g., infud and infrc achieve only 1.9GB/s and 2.6GB/s respectively when reading 1MB objects).
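Where does the 4GB/s figure come from? PCIe 2.0 signals at 5GT/s per lane with 8b/10b encoding, so an x8 link tops out at 32Gbps of data throughput. The quick check below is a back-of-the-envelope sketch of that arithmetic; it ignores TLP/DLLP protocol overhead, which is discussed below.

[source,c]
----
/* Back-of-the-envelope check of the PCIe 2.0 x8 limit quoted above.
 * PCIe 2.0: 5 GT/s per lane, 8b/10b encoding (8 data bits per 10 wire bits).
 * TLP/DLLP protocol overheads are ignored here. */
#include <stdio.h>

int main(void)
{
    const double gtps_per_lane = 5.0;       /* PCIe 2.0 signaling rate per lane */
    const double encoding      = 8.0/10.0;  /* 8b/10b line encoding */
    const int    lanes         = 8;         /* x8 slot on the rc machines */

    double gbps = gtps_per_lane * encoding * lanes;        /* = 32 Gbps */
    printf("raw PCIe limit: %.0f Gbps = %.0f GB/s\n", gbps, gbps / 8.0);
    return 0;
}
----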
To test whether we can indeed achieve 4GB/s, I decided to use `ib_send_bw` to stress the system. To get the best possible result, I chose the `send` verb (which is simpler and thus has less overhead) together with the RC transport (which has a smaller WQE header compared to the UD transport). With just one client and one server, I was able to get 3218MB/s.

----
yilongl@rc02: ib_send_bw --connection RC --run_infinitely
yilongl@rc03: ib_send_bw rc02 --connection RC --run_infinitely
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address:  LID 0x21 QPN 0x0269 PSN 0x911b06
 remote address: LID 0x31 QPN 0x02fe PSN 0x5a9cbe
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
 65536      257237         0.00               3216.14               0.051458
 65536      257362         0.00               3217.70               0.051483
 65536      257371         0.00               3217.81               0.051485
 65536      257371         0.00               3217.81               0.051485
 65536      257370         0.00               3217.80               0.051485
----

From other experiments, I already knew that the RX path in `ib_send_bw` is much more costly than the TX path, because 1) the sender can use selective signaling (e.g., cq_mod=100 by default) and 2) `ibv_poll_cq` on the RX completion queue is somehow much slower than in RAMCloud's InfUdDriver. Therefore, I started another `ib_send_bw` server at rc02 and another client at rc04. This gives us 1768MB/s * 2 = 3536MB/s aggregated downlink bandwidth at rc02. Adding another node doesn't improve things further (1184MB/s * 3 = 3552MB/s).

Now the question is whether ~3550MB/s is the application's effective bandwidth (i.e., goodput). If so, how can we explain the ~600MB/s gap to 4194MB/s (i.e., 4GB/s)?

TODO: possible factors are:

* PCIe TLP headers (~24B for every 256B MaxPayloadSize, since we are not doing message inlining and the data is read via DMA): 4194MB/s * 256 / (256 + 24) = 3835MB/s; see the sketch at the end of this section
* C_rc (i.e., the CPU's read completion combining size)?
* WQE header? Probably not, because the data is read via DMA instead of written by MMIO?

TODO: What the heck is the following lspci output telling us? Hint: PCIe speed, number of lanes, MaxPayloadSize, MaxReadReq, latency (to what?), cache line size (32B, not 64B?!).

----
root@rc02:/home/yilongl# lspci -vv | grep Mellanox -A 50
01:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
        Subsystem: Mellanox Technologies Device 0018
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
----
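To make the first factor above concrete, here is the TLP-header estimate as a tiny program. The 256B MaxPayloadSize and the ~24B per-TLP overhead are the assumptions from the TODO list (the true overhead also depends on MaxReadReq, the CPU's read-completion combining, and DLLP traffic), so treat this as an estimate rather than a measurement.

[source,c]
----
/* Estimated data bandwidth once per-TLP overhead is accounted for,
 * using the assumptions from the TODO list above: every 256B of DMAed
 * payload (MaxPayloadSize) carries roughly 24B of TLP header/framing. */
#include <stdio.h>

int main(void)
{
    const double raw_mb_per_s = 4194.0;  /* the 4GB/s PCIe 2.0 x8 limit, as quoted above */
    const double max_payload  = 256.0;   /* assumed MaxPayloadSize in bytes */
    const double tlp_overhead = 24.0;    /* assumed header + framing bytes per TLP */

    double efficiency = max_payload / (max_payload + tlp_overhead);
    printf("estimated bandwidth: %.0f MB/s (efficiency %.1f%%)\n",
           raw_mb_per_s * efficiency, 100.0 * efficiency);   /* ~3835 MB/s, 91.4% */
    return 0;
}
----

Even this optimistic estimate (3835MB/s) sits above the ~3550MB/s measured above, which suggests that TLP headers alone do not explain the whole gap.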