This page describes issues we have had with the cluster (most recent ones first) and how we have dealt with them.

The above appears to be due to some NICs coming up as PCIe x4 (rather than x8) at boot. To check for this, run lspci -d 15b3: -xxx | grep ^70 | cut -d " " -f 4 as root. If you see 82, all is good (x8). 42 means x4. Anything else is probably even worse.
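To make the 82 vs. 42 values less cryptic, here is a minimal sketch of a decoder for that byte. It assumes the byte at offset 0x72 is the low byte of the PCIe Link Status register on these cards, which matches the 82/42 values above but is not something this page states. In that register, bits 3:0 hold a speed code and bits 7:4 hold the low bits of the negotiated width. The function name decode_link_byte is made up for illustration.

```shell
# Decode one hex byte as printed by the lspci pipeline above (e.g. "82").
# Assumption: this is the low byte of the PCIe Link Status register.
decode_link_byte() {
    byte=$((0x$1))
    speed=$((byte & 0x0f))          # bits 3:0: speed code (1 = 2.5 GT/s, 2 = 5 GT/s)
    width=$(((byte >> 4) & 0x0f))   # bits 7:4: link width; only valid up to x8,
                                    # since an x16 link sets a bit in the next byte
    printf 'width x%d, speed code %d\n' "$width" "$speed"
}

# Example use on a node (as root), one line per NIC:
#   for b in $(lspci -d 15b3: -xxx | grep ^70 | cut -d " " -f 4); do
#       decode_link_byte "$b"
#   done
```

If the assumption holds, 82 decodes to width x8 and 42 to width x4, both with speed code 2. Running lspci -d 15b3: -vv as root and reading the LnkSta line should give the same information in a friendlier form.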

It appears to have come back after a reboot or power cycle. Perhaps it is related to the x4 vs. x8 PCIe issue.

Try running ibdiagnet -P all=1 as root, then look at /var/tmp/ibdiagnet2/ibdiagnet2.log to find out which links look dubious (i.e. are experiencing bit errors).