network failure latency
To detect and respond to machine failures, we need to understand how the networks we have (IP/Ethernet, Infiniband, RoCE) respond to missing hosts.
Infiniband RC (reliable connected QPs)
- ibv_poll_cq() returns no error status if the remote QP disappears
- we will need to manually time out
- ibv_post_send(), ibv_post_recv() both fail within ~250ns if the remote QP is dead
- open question: how does ibv_post_recv() detect the absence of the remote QP, and why can't ibv_poll_cq() make use of this?
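Since ibv_poll_cq() gives no error in this case, the manual timeout could look roughly like the sketch below. It is only a sketch under assumptions: poll_once_fn, poll_with_deadline, and fake_poll are hypothetical names, and the callback stands in for a wrapper around ibv_poll_cq(cq, 1, &wc) so the loop can be exercised without verbs hardware.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback: >0 if a completion was reaped, 0 if the queue was
 * empty, <0 on a polling error. A real one would wrap ibv_poll_cq(). */
typedef int (*poll_once_fn)(void *ctx);

enum poll_result { POLL_OK = 0, POLL_TIMEOUT = 1, POLL_ERROR = 2 };

/* Monotonic clock in nanoseconds, unaffected by wall-clock adjustments. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-poll until a completion arrives, polling fails, or timeout_ns elapses. */
enum poll_result poll_with_deadline(poll_once_fn poll_once, void *ctx,
                                    uint64_t timeout_ns)
{
    assert(poll_once != NULL);
    const uint64_t deadline = now_ns() + timeout_ns;
    for (;;) {
        int n = poll_once(ctx);
        if (n > 0)
            return POLL_OK;
        if (n < 0)
            return POLL_ERROR;
        if (now_ns() >= deadline)
            return POLL_TIMEOUT; /* treat the remote QP as gone */
    }
}

/* Stand-in for a real CQ: reports a completion after *ctx empty polls;
 * a negative count means the completion never arrives. */
int fake_poll(void *ctx)
{
    long *remaining = ctx;
    if (*remaining < 0)
        return 0;
    if (*remaining == 0)
        return 1;
    (*remaining)--;
    return 0;
}
```

The timeout value is the open design question: it bounds how long a dead peer goes unnoticed, but a value that is too small will misreport a slow peer as a failed one.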
Infiniband UD (unreliable datagram QPs)
- ibv_poll_cq(): ??
- ibv_post_send() and ibv_post_recv() both appear to block indefinitely
- there doesn't appear to be an analogue to the ICMP port-unreachable message sent in response to a datagram arriving at a closed UDP port
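For comparison, here is a rough sketch of the IP behavior that UD seems to lack. It assumes Linux semantics, where an ICMP port-unreachable reply is reported on a connected UDP socket as ECONNREFUSED on a later call; other stacks may differ. pick_closed_udp_port and udp_probe_errno are hypothetical helper names, and the probe only targets loopback.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Fill in a sockaddr for 127.0.0.1:port. */
static struct sockaddr_in loopback_addr(uint16_t port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    return addr;
}

/* Bind to an ephemeral loopback port, note its number, and release it, so the
 * port is very likely closed afterwards. Returns 0 on failure. */
uint16_t pick_closed_udp_port(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return 0;
    struct sockaddr_in addr = loopback_addr(0);
    socklen_t len = sizeof addr;
    uint16_t port = 0;
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == 0 &&
        getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
        port = ntohs(addr.sin_port);
    close(fd);
    return port;
}

/* Send one datagram to 127.0.0.1:port on a connected UDP socket and return
 * the errno seen by connect/send/recv, or 0 if a reply datagram arrived. */
int udp_probe_errno(uint16_t port)
{
    assert(port != 0);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return errno;
    /* Bound the wait in case no ICMP error ever comes back. */
    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
    struct sockaddr_in addr = loopback_addr(port);
    char byte = 0;
    int err = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        send(fd, &byte, 1, 0) < 0 ||
        recv(fd, &byte, 1, 0) < 0)
        err = errno; /* captured before close() can overwrite it */
    close(fd);
    return err;
}
```

Calling udp_probe_errno(pick_closed_udp_port()) shows the kind of prompt, explicit failure signal we would like from a UD QP. A result of EAGAIN or EWOULDBLOCK instead means no ICMP error arrived within the one-second receive timeout.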