This page logs instances of dead machines in reverse chronological order; among other things, we are using it to track down the mysterious machine crashes that occurred starting in August 2011.

As of Oct 3rd, the switch for rc1-40 has been replaced, and the cluster is up to its full strength.

As of May 14, 2012 the ConnectX2 cards have been replaced by ConnectX3 cards, and the cluster is up to its full strength.

As of Dec 7, 2011 the power has been restored to all nodes.

As of Dec 1, 2011 the power has been shut off on odd nodes (rc01, rc03, ..., rc19).

As of Nov 30, 2011 the flash disks were replaced with the original hard disks in rc01-rc20.

As of Nov 18, 2011 (following a reboot of the machines listed below) fans were reset to max speed. (Ryan: I don't think this stuck for long, since the fans went back to normal speed on reboot. After a couple of days they were all back at normal levels.)

As of Nov 11, 2011 the InfiniBand cables have been removed from rc01-rc20. These were removed after the failures of rc07, rc10, and rc16 were reported.

As of Nov 7, 2011 the InfiniBand stacks (OFED) have been removed from rc01-rc20. Let's see if the IB drivers are causing the problem.

Notes

Schedule of cluster changes

Aug 28: John's email "Can't ssh rc08"

Aug 24: unplugged 1/3rd of memory

Aug 17: SSDs arrived and installed

Aug 4: new servers installed and running recoveries with magnetic disks


Nov 30th:

Compared the BIOS settings of several commonly failing machines (rc10, rc16, rc20) with those of some of the rc41+ machines, to little effect. Nothing too interesting, though a number of CMOS checksum errors do seem to be logged on the machines that fail. This isn't consistent across failing nodes, and I haven't looked widely enough to know whether the same errors also occur on good machines. Otherwise the settings are pretty close across machines. A few are set to IDE rather than AHCI, but that doesn't appear to explain anything (some IDE nodes failed a lot, but so have many AHCI ones).

Nov 11th:

Histogram of failures as of Nov 11th:

rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09: 
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15: 
rc16: *********
rc17: 
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22: 
rc23: 
rc24: *
rc25: *
rc26: *
rc27: 
rc28: 
rc29: 
rc30: 
rc31: 
rc32: 
rc33: ***
rc34: 
rc35: *
rc36: ***
rc37: 
rc38: *
rc39: 
rc40:

There were 53 failures on even nodes, 31 on odd nodes.
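The even/odd totals can be re-derived from the histogram with a short tally. The sketch below is only an illustration: the `FAILURES` list is transcribed by hand from the asterisks above (one count per asterisk), and the name `tally` and the grouping labels are made up for this example. It also splits the counts into rc01-rc20 versus rc21-rc40.

```python
# Failure counts per node as of Nov 11th, transcribed by hand from the
# histogram above (one count per asterisk, rc01 first, rc40 last).
FAILURES = [
    2, 2, 4, 5, 3, 2, 3, 4, 0, 9,   # rc01-rc10
    5, 3, 3, 1, 0, 9, 0, 4, 4, 8,   # rc11-rc20
    2, 0, 0, 1, 1, 1, 0, 0, 0, 0,   # rc21-rc30
    0, 0, 3, 0, 1, 3, 0, 1, 0, 0,   # rc31-rc40
]


def tally(failures):
    """Return failure totals split by node parity and by lower/upper half.

    `tally` is a hypothetical helper written for this page, not part of
    any cluster tooling.
    """
    totals = {"even": 0, "odd": 0, "rc01-rc20": 0, "rc21-rc40": 0}
    for node, count in enumerate(failures, start=1):
        totals["even" if node % 2 == 0 else "odd"] += count
        totals["rc01-rc20" if node <= 20 else "rc21-rc40"] += count
    return totals


if __name__ == "__main__":
    for group, total in tally(FAILURES).items():
        print(f"{group}: {total}")
```

With the counts as transcribed, this gives 53 even and 31 odd (matching the totals above), and 71 failures on rc01-rc20 against 13 on rc21-rc40.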

 

"System Temp" as reported by ipmi (rc01 first, rc40 last):

System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 20 degrees C | ok
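
For comparing readings like these across nodes, a small parser may help. This is a minimal sketch that assumes every line follows the `name | N degrees C | status` layout shown above; the function name `parse_temps` and the three-line `SAMPLE` (copied from the first two and last readings in the list) are illustrative only.

```python
import re
from statistics import mean

# Matches lines such as "System Temp | 29 degrees C | ok".
# The layout is assumed from the readings listed above.
LINE_RE = re.compile(
    r"^\s*(?P<name>[^|]+?)\s*\|\s*(?P<temp>-?\d+(?:\.\d+)?)\s*degrees C"
    r"\s*\|\s*(?P<status>\S+)\s*$"
)


def parse_temps(text):
    """Return a list of (temperature_in_C, status) for each matching line.

    Lines that do not match the assumed layout are skipped.
    """
    readings = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            readings.append((float(match.group("temp")), match.group("status")))
    return readings


# Three readings copied from the list above, used only as sample input.
SAMPLE = """\
System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 20 degrees C | ok
"""

if __name__ == "__main__":
    temps = [temp for temp, _status in parse_temps(SAMPLE)]
    print(f"min={min(temps)} max={max(temps)} mean={mean(temps):.1f}")
```

Feeding the full list through the same function would make it easy to line the temperatures up against the per-node failure counts.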

 

Nov 7th: Removing the OFED stack on rc01-rc20 to see whether it is causing the problem. I don't know why the lower 20 are overrepresented compared to the upper 20.