= Understanding the Bandwidth Limit of RAMCloud's InfUdDriver
:toc:
:toc-placement!:

toc::[]

# Introduction

# Infiniband Verbs Performance Tests

The source code of IB perftest is available at https://github.com/linux-rdma/perftest. To build perftest on `rcXX` machines:

----
$ git clone git@github.com:linux-rdma/perftest.git
$ cd perftest
$ ./autogen.sh
$ ./configure --prefix=`pwd` && make -j install
----

To measure the bandwidth between `rcXX` machines using unreliable datagram (UD), start the server side of `perftest` on, say, `rc02` first:

----
yilongl@rc02:~/perftest/bin$ ./ib_send_bw --connection UD --run_infinitely
----

Then start the client side of `perftest` on another machine:

----
yilongl@rc03:~/perftest/bin$ ./ib_send_bw rc02 --connection UD --run_infinitely --size 1000 --post_list 1
----

To measure the bidirectional bandwidth, append the option `--bidirectional` to both of the commands above.

TODO: mention some of the useful options like `--inline_size`, etc.

Simplify perftest:

* only `--run_infinitely`
* no mcg
* rdma_cm == OFF
* no DC, RawEth, or XRC
* only IBV_TRANSPORT_IB (i.e. no IBV_TRANSPORT_IWARP)
* no sleep on CQ events
* no bw or msgrate limit
* no HAVE_VERBS_EXP

TODO: performance degrades significantly when msg_size drops below 1729; what happened here?

----
server: ./ib_send_bw --connection UD --run_infinitely
client: ./ib_send_bw rc02 --connection UD --run_infinitely --post_list 1 --tx-depth 128 --cq-mod 8 --size 1728
----

Hypothesis: it might have something to do with cache and/or TLB misses on the server side; perhaps reducing msg_size by one byte reduces the PCIe traffic enough that the sender can send much faster, while the receiver can't keep up and unfortunately enters and stays in this pathological state. How can we design an experiment to verify this? Is there any packet loss in this case (i.e., does IB link-layer flow control kick in for UD)?

# Instrument mlx4 driver to add time traces

The Infiniband drivers used on the rc cluster are documented on the https://ramcloud.atlassian.net/wiki/spaces/RAM/pages/25493518/Cluster+Configuration+with+Debian+8.3#ClusterConfigurationwithDebian8.3-InfinibandDrivers.1[RAMCloud Wiki]. The downloaded tarball `MLNX_OFED_LINUX-3.1-1.0.3-debian8.1-x86_64` contains source code in addition to pre-compiled binaries. The two most relevant libraries are `libibverbs` and `libmlx4`, whose source code can be found under `MLNX_OFED_LINUX-3.1-1.0.3-debian8.1-x86_64/src/MLNX_OFED_SRC-3.1-1.0.3/SOURCES/`.

https://github.com/linux-rdma/rdma-core/commits/ef7827c37c9d21811e6d63d055cf6a9b43ce606c

During compilation, our application is linked against the shared library `libibverbs` dynamically. On application startup, `libibverbs` automatically loads `libmlx4` via `dlopen` inside `load_driver(const char* name)`. The path to `libmlx4` is specified in a config file inside `IBV_CONFIG_DIR`. For instance, the config file for mlx4 is `/etc/libibverbs.d/mlx4.driver` on rc machines.

----
yilongl@rcmaster:~$ cat /etc/libibverbs.d/mlx4.driver
driver /usr/lib/libibverbs/libmlx4
----

To compute the absolute path to the actual `.so` file, `load_driver` of `libibverbs` appends a suffix to the path prefix contained in the config file (a rough sketch of this lookup is shown below).

----
yilongl@rcmaster:~$ ls -l /usr/lib/libibverbs/
total 316
-rw-r--r-- 1 root root 145024 Sep 29  2015 libmlx4-rdmav2.so
-rw-r--r-- 1 root root 173760 Sep 29  2015 libmlx5-rdmav2.so
----

I obtained most of the information above by running `ib_send_bw` under gdb and setting a breakpoint on `dlopen`.
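Putting the pieces together, the lookup is roughly the following. This is a simplified sketch, not the actual `libibverbs` code: the `-rdmav2.so` suffix is inferred from the directory listing above, and the real `load_driver` also deals with multiple config files and bare driver names.

[source,c]
----
/* Sketch of the driver lookup described above (NOT the real libibverbs
 * implementation).  The path prefix comes from a line such as
 * "driver /usr/lib/libibverbs/libmlx4" in /etc/libibverbs.d/mlx4.driver;
 * the "-rdmav2.so" suffix is inferred from the ls output above.
 * Build: gcc -o driver_lookup driver_lookup.c -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void *load_driver_sketch(const char *path_prefix)
{
    char *so_name = NULL;

    /* "/usr/lib/libibverbs/libmlx4" -> "/usr/lib/libibverbs/libmlx4-rdmav2.so" */
    if (asprintf(&so_name, "%s-rdmav2.so", path_prefix) < 0)
        return NULL;

    /* This is the dlopen call that a "break dlopen" in gdb stops at. */
    void *handle = dlopen(so_name, RTLD_NOW);
    if (handle == NULL)
        fprintf(stderr, "failed to load %s: %s\n", so_name, dlerror());

    free(so_name);
    return handle;
}

int main(void)
{
    return load_driver_sketch("/usr/lib/libibverbs/libmlx4") ? 0 : 1;
}
----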
Note that gdb has trouble finding the source code of `libibverbs` by default; we can fix this by using gdb's `directory` command to add the missing source directory manually. You can verify that the driver library is successfully loaded by `libibverbs` using `info sharedlibrary` inside gdb.

----
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007ffff7ddcae0  0x00007ffff7df5130  Yes         /lib64/ld-linux-x86-64.so.2
                                        No          linux-vdso.so.1
0x00007ffff7bd5450  0x00007ffff7bd8be8  Yes (*)     /usr/lib/libibumad.so.3
0x00007ffff78d8580  0x00007ffff7943d96  Yes         /lib/x86_64-linux-gnu/libm.so.6
0x00007ffff76c01e0  0x00007ffff76cdf90  Yes (*)     /usr/lib/librdmacm.so.1
0x00007ffff74ab6c0  0x00007ffff74b6dda  Yes         /usr/lib/libibverbs.so.1
0x00007ffff728e9f0  0x00007ffff729a771  Yes         /lib/x86_64-linux-gnu/libpthread.so.0
0x00007ffff6efd4a0  0x00007ffff7029ef3  Yes         /lib/x86_64-linux-gnu/libc.so.6
0x00007ffff6c9c650  0x00007ffff6cc086f  Yes (*)     /usr/lib/x86_64-linux-gnu/libnl-route-3.so.200
0x00007ffff6a71560  0x00007ffff6a7cf35  Yes (*)     /lib/x86_64-linux-gnu/libnl-3.so.200
0x00007ffff6865ed0  0x00007ffff686697e  Yes         /lib/x86_64-linux-gnu/libdl.so.2
0x00007ffff663ee00  0x00007ffff665e381  Yes         /usr/lib/libibverbs/libmlx5-rdmav2.so
0x00007ffff6433360  0x00007ffff6437d95  Yes (*)     /usr/lib/x86_64-linux-gnu/libnuma.so.1
(*): Shared library is missing debugging information.
----

Now we can proceed to instrument `libmlx4` and rebuild the library. Modify the first line of `Makefile.am` to remove `-Werror`, then:

----
$ ./autogen.sh
$ ./configure --prefix=`pwd` && make -j install
$ ls -l libs
----

Then modify `/etc/libibverbs.d/mlx4.driver` to point to the newly-built library. Nope, this doesn't work; the application somehow loads two versions of `libmlx4`.

# Understanding PCIe overhead

* https://www.xilinx.com/support/documentation/white_papers/wp350.pdf[Understanding Performance of PCI Express Systems]
* https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/neugebauer2018understanding.pdf[Understanding PCIe performance for end host networking]
* https://github.com/opcm/pcm[Intel Processor Counter Monitor]
* https://ieeexplore.ieee.org/document/8048990[Evaluating Effect of Write Combining on PCIe Throughput to Improve HPC Interconnect Performance]
* https://community.mellanox.com/docs/DOC-2496[Understanding PCIe Configuration for Maximum Performance]
* https://www.techarp.com/bios-guide/pci-e-maximum-payload-size/
* https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/517428
* https://www.google.com/search?q=pcie+performance+counter+site:software.intel.com
* https://www.design-reuse.com/articles/15900/realizing-the-performance-potential-of-a-pci-express-ip.html[Realizing the Performance Potential of a PCI-Express IP]
* http://xillybus.com/tutorials/pci-express-tlp-pcie-primer-tutorial-guide-1[Down to the TLP: How PCI express devices talk (Part I)]
* https://codywu2010.wordpress.com/2015/11/26/pci-express-max-read-request-max-payload-size-and-why-you-care/[PCI Express Max Read Request, Max Payload Size and why you care]

In the Homa paper, we said the rc cluster has 24Gbps effective bandwidth. That is not accurate: our HCAs and switches run at 40Gbps, while PCIe 2.0 x8 limits us to 32Gbps, which is 4GB/s! Yet our ClusterPerf results are far below this number (e.g., infud and infrc achieve only 1.9GB/s and 2.6GB/s respectively when reading 1MB objects).
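Where does the 4GB/s figure come from? PCIe 2.0 signals at 5GT/s per lane with 8b/10b encoding, so an x8 link tops out at 32Gbps of data throughput. The quick check below is a back-of-the-envelope sketch of that arithmetic; it ignores TLP/DLLP protocol overhead, which is discussed below.

[source,c]
----
/* Back-of-the-envelope check of the PCIe 2.0 x8 limit quoted above.
 * PCIe 2.0: 5 GT/s per lane, 8b/10b encoding (8 data bits per 10 wire bits).
 * TLP/DLLP protocol overheads are ignored here. */
#include <stdio.h>

int main(void)
{
    const double gtps_per_lane = 5.0;       /* PCIe 2.0 signaling rate per lane */
    const double encoding      = 8.0/10.0;  /* 8b/10b line encoding */
    const int    lanes         = 8;         /* x8 slot on the rc machines */

    double gbps = gtps_per_lane * encoding * lanes;        /* = 32 Gbps */
    printf("raw PCIe limit: %.0f Gbps = %.0f GB/s\n", gbps, gbps / 8.0);
    return 0;
}
----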
To test whether we can indeed achieve 4GB/s, I decided to use `ib_send_bw` to stress the system. To get the best possible result, I chose the `send` verb (which is simpler and thus has less overhead) together with the RC transport (which has a smaller WQE header compared to the UD transport). With just one client and one server, I was able to get 3218MB/s.

----
yilongl@rc02: ib_send_bw --connection RC --run_infinitely
yilongl@rc03: ib_send_bw rc02 --connection RC --run_infinitely
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address:  LID 0x21 QPN 0x0269 PSN 0x911b06
 remote address: LID 0x31 QPN 0x02fe PSN 0x5a9cbe
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
 65536      257237         0.00               3216.14               0.051458
 65536      257362         0.00               3217.70               0.051483
 65536      257371         0.00               3217.81               0.051485
 65536      257371         0.00               3217.81               0.051485
 65536      257370         0.00               3217.80               0.051485
----

From other experiments, I already knew that the RX path in `ib_send_bw` is much more costly than the TX path, because 1) the sender can use selective signaling (e.g., cq_mod=100 by default) and 2) `ibv_poll_cq` on the RX completion queue is somehow much slower than in RAMCloud's InfUdDriver. Therefore, I started another `ib_send_bw` server at rc02 and another client at rc04. This gives us 1768MB/s * 2 = 3536MB/s aggregated downlink bandwidth at rc02. Adding another node doesn't improve things further (1184MB/s * 3 = 3552MB/s).

Now the question is whether ~3550MB/s is the application's effective bandwidth (i.e., goodput). If so, how can we explain the ~600MB/s gap to 4194MB/s (i.e., 4GB/s)?

TODO: possible factors are:

* PCIe TLP headers (~24B for every 256B MaxPayloadSize, since we are not doing message inlining and the data is read via DMA): 4194MB/s * 256 / (256 + 24) = 3835MB/s; see the sketch at the end of this section
* C_rc (i.e., the CPU's read completion combining size)?
* WQE header? Probably not, because the data is read via DMA instead of written by MMIO?

TODO: What the heck is the following lspci output telling us? Hint: PCIe speed, number of lanes, MaxPayloadSize, MaxReadReq, latency (to what?), cache line size (32B, not 64B?!).

----
root@rc02:/home/yilongl# lspci -vv | grep Mellanox -A 50
01:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
        Subsystem: Mellanox Technologies Device 0018
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
----
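To make the first factor above concrete, here is the TLP-header estimate as a tiny program. The 256B MaxPayloadSize and the ~24B per-TLP overhead are the assumptions from the TODO list (the true overhead also depends on MaxReadReq, the CPU's read-completion combining, and DLLP traffic), so treat this as an estimate rather than a measurement.

[source,c]
----
/* Estimated data bandwidth once per-TLP overhead is accounted for,
 * using the assumptions from the TODO list above: every 256B of DMAed
 * payload (MaxPayloadSize) carries roughly 24B of TLP header/framing. */
#include <stdio.h>

int main(void)
{
    const double raw_mb_per_s = 4194.0;  /* the 4GB/s PCIe 2.0 x8 limit, as quoted above */
    const double max_payload  = 256.0;   /* assumed MaxPayloadSize in bytes */
    const double tlp_overhead = 24.0;    /* assumed header + framing bytes per TLP */

    double efficiency = max_payload / (max_payload + tlp_overhead);
    printf("estimated bandwidth: %.0f MB/s (efficiency %.1f%%)\n",
           raw_mb_per_s * efficiency, 100.0 * efficiency);   /* ~3835 MB/s, 91.4% */
    return 0;
}
----

Even this optimistic estimate (3835MB/s) sits above the ~3550MB/s measured above, which suggests that TLP headers alone do not explain the whole gap.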