  • Strange bandwidth issues: ib_send_bw tests on the second rack (rc41-rc80, with ConnectX-3 NICs and 56Gbps SwitchX switches) show only ~1650MB/s between nodes.
    1. Bandwidths are apparently fine (~3200MB/s) between rc01-rc40, which are on the old switches (though they route through the new switches).
    2. Strangely, bandwidths between nodes in the first rack and nodes in the second rack are also fine (~3200MB/s).
      1. So the problem appears only when new NICs talk to new NICs!?

The above appears to be due to some NICs coming up as PCIe x4 (rather than x8) at boot. To check, run lspci -d 15b3: -xxx |grep ^70 |cut -d " " -f 4 as root. If you see '82', all's good; '42' means x4. Any other value is probably even worse.
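The byte printed by that pipeline can be decoded rather than memorized. The sketch below assumes the byte at offset 0x72 is the low byte of the PCIe Link Status register on these cards, so the low nibble is the link speed generation and the high nibble holds the low four bits of the negotiated width. The function names decode_link_byte and check_mlx_width are hypothetical helpers, not part of any existing tool.

```shell
#!/bin/bash
# Decode the byte printed by the lspci pipeline above (e.g. "82" or "42").
# Assumption: it is the low byte of the PCIe Link Status register, so
# bits 3:0 are the link speed and bits 7:4 the low four bits of the width.
decode_link_byte() {
    local byte=$((16#$1))                 # parse the two hex digits
    local speed=$((byte & 0x0f))          # 1 = 2.5GT/s, 2 = 5GT/s, 3 = 8GT/s
    local width=$(((byte >> 4) & 0x0f))   # number of lanes (x1, x4, x8)
    echo "x${width} gen${speed}"
}

# Hypothetical helper: print a line for every Mellanox device (vendor 15b3)
# whose link is not x8. Must run as root, since lspci -xxx reads config
# space beyond the first 64 bytes.
check_mlx_width() {
    local byte
    lspci -d 15b3: -xxx | grep '^70' | cut -d " " -f 4 | while read -r byte; do
        case "$(decode_link_byte "$byte")" in
            x8*) ;;  # expected width, nothing to report
            *) echo "$(hostname): link is $(decode_link_byte "$byte") (raw $byte)" ;;
        esac
    done
}
```

Under this reading, 82 is x8 at 5GT/s and 42 is x4 at 5GT/s. A 5GT/s lane with 8b/10b encoding carries 500MB/s, so x8 tops out at 4000MB/s and x4 at 2000MB/s before protocol overhead, which fits the ~3200MB/s and ~1650MB/s measured with ib_send_bw.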

  • rc43's InfiniBand interface isn't detected. The card probably needs to be re-seated in its PCIe slot.

 

  • A handful of ports on the second rack are down: rc45, rc53, rc59, rc63, rc70
    1. Figure out whether these are cable, NIC, or switch port issues.
       Try running ibdiagnet -P all=1 as root and look at /var/tmp/ibdiagnet2/ibdiagnet2.log to find out which links look dubious (i.e. are experiencing bit errors).
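As a complement to the fabric-wide ibdiagnet run, the local port state on each affected node can be collected with ibstat. This is only a sketch: ibstat (from infiniband-diags) is not mentioned above, the function names port_states and check_ib_ports are hypothetical, and it assumes passwordless ssh to the nodes over a network other than InfiniBand.

```shell
#!/bin/bash
# Read ibstat output on stdin and print "<port> <state> <physical state>"
# for each port, e.g. "1 Down Polling".
port_states() {
    awk '
        /^[[:space:]]*Port [0-9]+:/      { port = $2; sub(":", "", port) }
        /^[[:space:]]*State:/            { state = $2 }
        /^[[:space:]]*Physical state:/   { print port, state, $3 }
    '
}

# Hypothetical helper: run ibstat on each host given as an argument and
# prefix every result line with the host name.
check_ib_ports() {
    local host
    for host in "$@"; do
        ssh -n "$host" ibstat 2>/dev/null | port_states | sed "s/^/$host /"
    done
}
```

For the nodes listed above this would be invoked as check_ib_ports rc45 rc53 rc59 rc63 rc70. A host that prints nothing either could not be reached or has no InfiniBand device visible to ibstat.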
  • Update the SSD firmware on all drives to version 0309 (http://www.crucial.com/support/firmware.aspx). Otherwise the drives will start crashing once they have been up for >= 5184 hours.
    1. Perhaps the easiest approach is to hexedit the bootable updater's script so that it flashes without any interaction, then PXE boot the updater on all machines?
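To decide which machines are most urgent, the current firmware version and accumulated power-on hours can be read with smartctl from smartmontools. The sketch below assumes the drives report hours as a plain number in SMART attribute 9 (Power_On_Hours) and that smartctl is installed on the nodes. The function names are hypothetical.

```shell
#!/bin/bash
# Read `smartctl -A` output on stdin and print the raw value of
# SMART attribute 9 (Power_On_Hours), which is the tenth column.
power_on_hours() {
    awk '$1 == 9 { print $10 }'
}

# Read `smartctl -i` output on stdin and print the firmware version.
firmware_version() {
    awk -F': *' '/^Firmware Version:/ { print $2 }'
}

# Hypothetical helper: print one summary line for a drive such as /dev/sda.
# Must run as root so smartctl can query the device.
report_drive() {
    local dev=$1 fw hours
    fw=$(smartctl -i "$dev" | firmware_version)
    hours=$(smartctl -A "$dev" | power_on_hours)
    echo "$(hostname) $dev firmware=$fw power_on_hours=$hours"
}
```

Running report_drive /dev/sda on every node gives a list that can be sorted by hours, so drives closest to the 5184-hour mark are flashed first.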