This page logs instances of dead machines; we are using it to track down the mysterious machine crashes occurring in August/September/October/November 2011.
NB: As of Dec 1, 2011 the power has been shut off on the odd nodes (rc01, rc03, ..., rc19).
NB: As of Nov 30, 2011 the flash disks were replaced with the original hard disks in rc01-rc20.
NB: As of Nov 18, 2011 (following rebooting the machines listed below) fans were reset to max speed. (Ryan: I don't think this stuck for long, since the fans went back to normal speed on reboot. After a couple of days they were all back at normal levels.)
NB: As of Nov 11, 2011 the InfiniBand cables have been removed from rc01-rc20. These were removed after the failures of rc07, rc10, and rc16 were reported.
NB: As of Nov 7, 2011 the InfiniBand stacks (OFED) have been removed from rc01-rc20. Let's see if the IB drivers are causing the problem.
Schedule of cluster changes
Aug 28: John's email "Can't ssh rc08"
Aug 24: unplugged 1/3rd of memory
Aug 17: SSDs arrived and installed
Aug 4: new servers installed and running recoveries with magnetic disks
Nov 30th:
Compared the BIOS settings of several commonly failing machines (rc10, rc16, rc20) against some of the rc41+ machines and found nothing that explains the failures. There do seem to be a number of CMOS checksum errors logged on the machines that fail, but this isn't consistent across failing nodes, and I haven't looked widely enough to rule out the same errors on good machines. Otherwise the settings are close across machines. A few are set to IDE rather than AHCI, but that doesn't appear to explain anything (some IDE nodes failed a lot, but so have many AHCI ones).
Nov 11th:
Histogram of failures as of Nov 11th:
rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09:
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15:
rc16: *********
rc17:
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22:
rc23:
rc24: *
rc25: *
rc26: *
rc27:
rc28:
rc29:
rc30:
rc31:
rc32:
rc33: ***
rc34:
rc35: *
rc36: ***
rc37:
rc38: *
rc39:
rc40:
There were 53 failures on even nodes, 31 on odd nodes.
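The even/odd split can be rechecked mechanically. The following is a minimal sketch, assuming the per-node counts are transcribed by hand from the Nov 11th histogram (one failure per asterisk); the names `FAILURES` and `tally` are made up for illustration. It also splits the counts into rc01-rc20 and rc21-rc40, since the lower 20 nodes are the ones that had OFED removed.

```python
# Failure counts per node number, transcribed by hand from the Nov 11th
# histogram (one failure per asterisk). Nodes with no failures are omitted.
FAILURES = {
    1: 2, 2: 2, 3: 4, 4: 5, 5: 3, 6: 2, 7: 3, 8: 4, 10: 9,
    11: 5, 12: 3, 13: 3, 14: 1, 16: 9, 18: 4, 19: 4, 20: 8,
    21: 2, 24: 1, 25: 1, 26: 1, 33: 3, 35: 1, 36: 3, 38: 1,
}


def tally(failures):
    """Sum failures by node parity and by lower/upper half of the cluster."""
    even = sum(n for node, n in failures.items() if node % 2 == 0)
    odd = sum(n for node, n in failures.items() if node % 2 == 1)
    lower = sum(n for node, n in failures.items() if node <= 20)
    upper = sum(n for node, n in failures.items() if node > 20)
    return {"even": even, "odd": odd, "rc01-rc20": lower, "rc21-rc40": upper}


if __name__ == "__main__":
    for group, count in tally(FAILURES).items():
        print(f"{group}: {count}")
```

Under this transcription the totals agree with the 53/31 split, and 71 of the 84 failures fall on rc01-rc20.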
"System Temp" as reported by ipmi (rc01 first, rc40 last):
System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 20 degrees C | ok
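For comparing readings across nodes it may help to parse these lines rather than read them by eye. This is a rough sketch, assuming each reading has the pipe-delimited shape shown above (sensor name, integer value in degrees C, status); `parse_temp`, `summarize`, and `SAMPLE` are hypothetical names, and the sample holds only three of the readings.

```python
import re

# Matches lines of the form "System Temp | 29 degrees C | ok".
# Extra padding around the pipes is tolerated.
LINE_RE = re.compile(
    r"^\s*(?P<name>[^|]+?)\s*\|\s*(?P<value>-?\d+)\s*degrees C\s*\|\s*(?P<status>\w+)\s*$"
)


def parse_temp(line):
    """Return (sensor name, temperature in C, status), or None if no match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return m.group("name"), int(m.group("value")), m.group("status")


def summarize(lines):
    """Return (min, max) temperature over parseable lines, or None if none parse."""
    temps = [p[1] for p in map(parse_temp, lines) if p is not None]
    if not temps:
        return None
    return min(temps), max(temps)


# Three of the readings above, used only as a sample.
SAMPLE = [
    "System Temp | 29 degrees C | ok",
    "System Temp | 28 degrees C | ok",
    "System Temp | 20 degrees C | ok",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```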
Nov 7th: Trying to see if the OFED stack is causing the problem by removing it on rc01-rc20. I don't know why the lower 20 nodes are overrepresented among the failures compared to the upper 20.