  • Strange bandwidth issue: ib_send_bw tests on the second rack (rc41-rc80, with ConnectX-3 NICs and 56Gbps SwitchX switches) show only ~1650MB/s between nodes.
    1. Bandwidth is apparently fine (~3200MB/s) between rc01-rc40, which are on the old switches (though they route through the new switches).
    2. Strangely, bandwidth between nodes in the first rack and nodes in the second rack is also fine (~3200MB/s).
      1. So the problem appears only when new NICs talk to new NICs.

The above appears to be caused by some NICs coming up as PCIe x4 (rather than x8) at boot. To check, run lspci -d 15b3: -xxx | grep ^70 | cut -d " " -f 4. An output of 82 means the link is fine (x8); 42 means x4. Any other value probably indicates something even worse.
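A minimal sketch of how the byte printed by the check above could be decoded. It assumes the byte is the low byte of the PCIe Link Status register, where the upper nibble holds the negotiated link width and the lower nibble holds the link speed code. That reading is consistent with 82 meaning x8 and 42 meaning x4, but it is not confirmed for these cards. The function name decode_link_byte is made up for illustration.

```shell
# Decode one hex byte (e.g. "82") as printed by the lspci pipeline.
# ASSUMPTION: the byte is the low byte of the PCIe Link Status register:
#   upper nibble = negotiated link width, lower nibble = link speed code
#   (speed code 1 = 2.5GT/s, 2 = 5GT/s, 3 = 8GT/s).
decode_link_byte() {
    byte=$((0x$1))                  # parse the argument as hexadecimal
    width=$(( (byte >> 4) & 0xF ))  # upper nibble: link width
    speed=$(( byte & 0xF ))         # lower nibble: speed code
    echo "x${width} speed_code=${speed}"
}
```

For example, decode_link_byte 82 prints "x8 speed_code=2" and decode_link_byte 42 prints "x4 speed_code=2". On a node with more than one matching device the lspci pipeline prints one byte per device, so each line would have to be decoded separately.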

  • rc43's InfiniBand interface isn't detected. The card probably needs to be re-seated in its PCIe slot.

  • A handful of ports on the second rack are down: rc45, rc53, rc59, rc63, rc70.
    1. Figure out whether these are cable, NIC, or switch port issues.

Try running ibdiagnet -P all=1 as root, then look at /var/tmp/ibdiagnet2/ibdiagnet2.log to find out which links look dubious (i.e. are experiencing bit errors).
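As a complement to the ibdiagnet log, the port state on each affected node could be read from ibstat. The sketch below is only a helper for pulling the first "State:" line out of ibstat output. It assumes ibstat (from infiniband-diags) is installed on the nodes and prints the usual "State: ..." line for each port. The function name first_port_state is hypothetical.

```shell
# Read `ibstat` output on stdin and print the logical state of the first
# port (e.g. "Active" or "Down"). Only lines that start with "State:" after
# optional whitespace match, so "Physical state:" lines are skipped.
# ASSUMPTION: ibstat prints the port state as "State: <value>".
first_port_state() {
    awk -F': *' '/^[[:space:]]*State:/ { print $2; exit }'
}
```

Assuming passwordless ssh to the nodes, something like ssh rc45 ibstat | first_port_state could then be run for each of rc45, rc53, rc59, rc63 and rc70, before and after swapping a cable.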
  • Update the SSD firmware on all drives to version 0309 (http://www.crucial.com/support/firmware.aspx). Otherwise the drives will start crashing once they have accumulated >= 5184 power-on hours.
    1. Perhaps the easiest approach is to hex-edit the bootable updater's script so that it flashes without any interaction, and then PXE boot the updater on all machines?
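To decide which machines are most urgent, the remaining margin before the 5184-hour threshold could be estimated from SMART data. This is a rough sketch, assuming smartctl (smartmontools) is available and that the drive reports Power_On_Hours as SMART attribute 9 with a plain integer raw value in the tenth column of smartctl -A output. The function names are made up for illustration.

```shell
# Threshold from the note above: trouble starts at >= 5184 power-on hours.
BUG_HOURS=5184

# Read `smartctl -A` output on stdin and print the raw Power_On_Hours value.
# ASSUMPTION: attribute ID 9, raw value is a plain integer in column 10.
power_on_hours() {
    awk '$1 == 9 { print $10; exit }'
}

# Print how many hours are left before the threshold (0 if already reached).
hours_remaining() {
    hours=$1
    if [ "$hours" -ge "$BUG_HOURS" ]; then
        echo 0
    else
        echo $(( BUG_HOURS - hours ))
    fi
}
```

For example, hours_remaining "$(smartctl -A /dev/sda | power_on_hours)" would print the margin for one drive. The device path /dev/sda is a placeholder.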