Cluster Tasks

This page describes issues we have had with the cluster (most recent ones first) and how we have dealt with them.

  • Strange bandwidth issues: ib_send_bw tests on the second rack (rc41-rc80, with ConnectX-3 NICs and 56Gbps SwitchX switches) show only ~1650MB/s between nodes.
    1. Bandwidths are apparently fine (~3200MB/s) between rc01-rc40, which are on the old switches (though they route through the new switches).
    2. Strangely, bandwidths between nodes in the first rack and nodes in the second rack are also fine (~3200MB/s).
      1. So the problem appears only when new NICs talk to new NICs!?

The above appears to be due to some NICs coming up as PCIe x4 (rather than x8) at boot. To check for this, run lspci -d 15b3: -xxx | grep ^70 | cut -d " " -f 4 on the node. An output of 82 means all is good (x8); 42 means the link came up as x4. Any other value is probably even worse.
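The 82 and 42 values can be decoded rather than memorized. The sketch below assumes that the byte printed by the lspci pipeline is the low byte of the PCIe Link Status register (which would be the case if the card's PCI Express capability starts at offset 0x60; that offset is an assumption, not something this page confirms). In that register the low nibble is the link speed generation and the next bits are the lane count. The function names and the per-lane rate table are illustrative only.

```python
# Approximate usable rate per lane in MB/s after line encoding:
# Gen1 = 2.5 GT/s with 8b/10b, Gen2 = 5 GT/s with 8b/10b,
# Gen3 = 8 GT/s with 128b/130b.
LANE_MB_PER_S = {1: 250.0, 2: 500.0, 3: 8000 * 128 / 130 / 8}


def decode_link_byte(hex_byte):
    """Decode the low byte of the Link Status register (assumed layout).

    Returns (speed_generation, lane_count). Only widths below x16 fit in
    this byte, which is enough for an x8 card.
    """
    value = int(hex_byte, 16)
    speed_gen = value & 0x0F   # bits 3:0, link speed
    width = value >> 4         # bits 7:4, low part of the link width field
    return speed_gen, width


def theoretical_mb_per_s(speed_gen, width):
    """Upper bound on link throughput before protocol overhead."""
    return LANE_MB_PER_S[speed_gen] * width


if __name__ == "__main__":
    for byte in ("82", "42"):
        gen, width = decode_link_byte(byte)
        print(byte, "-> Gen%d x%d," % (gen, width),
              "%.0f MB/s ceiling" % theoretical_mb_per_s(gen, width))
```

Under these assumptions, 82 decodes to Gen2 x8 with a 4000MB/s ceiling and 42 to Gen2 x4 with a 2000MB/s ceiling, which is consistent with the ~3200MB/s and ~1650MB/s figures measured with ib_send_bw once protocol overhead is taken into account.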

  • rc43's InfiniBand interface isn't detected. The card probably needs to be re-seated in its PCIe slot.

It appears to have come back after a reboot or power cycle. Perhaps it's related to the x4 vs. x8 pcie issue.

  • A handful of ports on the second rack are down:  rc45, rc53, rc59, rc63, rc70
    1. Figure out whether these are cable, NIC, or switch port issues.
Try running ibdiagnet -P all=1 as root and look at /var/tmp/ibdiagnet2/ibdiagnet2.log to find out which links look dubious (i.e. are experiencing bit errors).
  • Update SSD firmware on all drives to version 0309 (http://www.crucial.com/support/firmware.aspx). Otherwise they'll start crashing after being up for >= 5184 hours.
    1. Done. Modified autoexec.bat inside boot2880.img (from the ISO linked above) to run the following commands at boot:
      1. The sleeps appeared to be necessary. x8sit's shutdown.com didn't work for some reason, so the machines should be reset via IPMI once they've had enough time to finish.

        sleep 3
        echo yes | dosmcli.exe --bus ALL -f fwa.img -u 0 --segmented 10
        sleep 3
        echo yes | dosmcli.exe --bus ALL -f fwa.img -u 1 --segmented 10
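The IPMI reset mentioned above could be scripted roughly as follows. This is only a sketch: the BMC hostname pattern (node name plus "-ipmi"), the ADMIN user, and the lanplus interface are assumptions about this cluster, not facts from this page. The password is expected in the IPMI_PASSWORD environment variable, which ipmitool reads when given -E. By default the script only prints the commands it would run.

```python
import subprocess


def node_names(first, last):
    """Return node names such as rc01 ... rc80 (inclusive range)."""
    return ["rc%02d" % n for n in range(first, last + 1)]


def reset_command(node, user="ADMIN", bmc_suffix="-ipmi"):
    """Build the ipmitool command that resets one node.

    bmc_suffix and user are hypothetical; adjust them to the real BMC
    naming scheme and account.
    """
    return ["ipmitool", "-I", "lanplus", "-H", node + bmc_suffix,
            "-U", user, "-E", "chassis", "power", "reset"]


def reset_nodes(nodes, dry_run=True):
    """Print (dry run) or execute the reset command for each node."""
    for node in nodes:
        cmd = reset_command(node)
        if dry_run:
            print(" ".join(cmd))
        else:
            # Keep going even if one BMC is unreachable.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    reset_nodes(node_names(1, 80), dry_run=True)
```

Running it with dry_run=True first makes it easy to check the host list before any machine is actually reset.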
  • Update BIOS versions to 1.2 on all 80 machines (hoping this helps with the ConnectX-3 HCAs showing up as x4 or x2 devices, rather than x8)
    1. Done. Now all cards show up as PCIe 1.0, rather than 2.0!