network failure latency

To detect and respond to machine failures, we need to understand how the networks we have (IP/Ethernet, Infiniband, RoCE) respond to missing hosts.

Infiniband RC (reliable connected QPs)

  • ibv_poll_cq() returns no error status if the remote QP disappears
    • so we will need to implement our own timeout around completion polling
  • ibv_post_send(), ibv_post_recv() both fail within ~250ns if the remote QP is dead
    • how does ibv_post_recv() detect the absence of the remote QP, and why can't ibv_poll_cq() make use of the same mechanism?
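
Since ibv_poll_cq() gives no error when the remote QP disappears, the manual timeout could look roughly like the sketch below. It is only an illustration, not a tested implementation: poll_fn, wait_for_completion(), and the two stand-in pollers are hypothetical names, and the callback stands in for a wrapper around ibv_poll_cq(cq, 1, &wc) so that the sketch compiles without libibverbs. It busy-polls against a monotonic deadline.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() under strict -std modes */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical callback: returns >0 if a completion was reaped, 0 if the
 * completion queue was empty, <0 on a polling error. A real version would
 * wrap ibv_poll_cq(cq, 1, &wc) and should also inspect wc.status. */
typedef int (*poll_fn)(void *ctx);

enum wait_result { WAIT_COMPLETED, WAIT_TIMED_OUT, WAIT_POLL_ERROR };

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-poll until a completion arrives, polling fails, or timeout_ns elapses.
 * The deadline is checked only after an empty poll, so at least one poll is
 * always attempted. */
enum wait_result wait_for_completion(poll_fn poll, void *ctx, uint64_t timeout_ns)
{
    const uint64_t deadline = now_ns() + timeout_ns;

    for (;;) {
        int n = poll(ctx);
        if (n > 0)
            return WAIT_COMPLETED;
        if (n < 0)
            return WAIT_POLL_ERROR;
        if (now_ns() >= deadline)
            return WAIT_TIMED_OUT;  /* treat the peer as suspect */
    }
}

/* Stand-in poller that mimics a dead peer: the queue stays empty forever. */
int never_completes(void *ctx)
{
    (void)ctx;
    return 0;
}

/* Stand-in poller: reports an empty queue *ctx times, then one completion. */
int completes_after_n_polls(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return 0;
    }
    return 1;
}
```

Calling wait_for_completion(never_completes, NULL, 1000000) returns WAIT_TIMED_OUT after roughly 1 ms, which is the case we would treat as a suspected machine failure. The timeout value itself is a policy choice that these notes do not yet settle.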

Infiniband UD (unreliable datagram QPs)

  • ibv_poll_cq ??
  • ibv_post_send() and ibv_post_recv() both appear to block indefinitely
    • there doesn't appear to be an analogue to the ICMP port-unreachable message sent in response to a closed UDP port