This page logs instances of dead machines; we are using it to track down the mysterious machine crashes that began in August 2011.
- June 15: rc37, rc38
- February 15: rc79
- February 6: rc80
- February 2: rc79 (2 am), rc79 (11 pm)
- January 30: rc06, rc24, rc65
- January 26: rc19, rc20
- January 12: rc33
- January 6: rc33
- January 5: rc32
- January 3: rc24, rc32
- December 23: rc32, rc32 (10 minutes later...)
- December 21: rc21, rc32
- December 16: rc23
- December 15: rc27
- December 8: rc03, rc04, rc33
As of Dec 7, 2011 the power has been restored to all nodes
- December 7: rc10, rc20, rc32
- December 5: rc16, rc18, rc20, rc36
As of Dec 1, 2011 the power has been shut off on odd nodes (rc01,rc03,...,rc19)
- December 1: rc04, rc18
- November 30: rc04 (After disk replacement)
As of Nov 30, 2011 the flash disks were replaced with original hard disks in rc01-rc20
- November 30: rc05, rc07
- November 28: rc01, rc07, rc16, rc20
- November 27: rc05, rc11, rc12, rc16, rc19, rc20, rc21
- November 24: rc04, rc16
- November 21: rc07, rc08, rc12, rc18, rc19, rc20, rc36
- November 19: rc16
As of Nov 18, 2011 (following rebooting the machines listed below) fans were reset to max speed (Ryan: I don't think this stuck long since the fans went to normal speed on reboot. After a couple of days they were all back at normal levels).
- November 18: rc10, rc20, rc33 (rc20 had to be rebooted twice again to restart it), rc10 (again)
- November 16: rc19
- November 15:
- November 14: rc08, rc10, rc11, rc16, rc19, rc33 (then rc19 crashed again)
As of Nov 11, 2011 the infiniband cables have been removed from rc01-rc20. These were removed after the failures of rc07, rc10, and rc16 were reported.
- November 11: rc07, rc10, rc16
- November 10: rc10, rc36
- November 9: rc03, rc04, rc07, rc10, rc11, rc33
- November 8: rc03, rc20 (Ryan: I've noticed over a few days rc20 must be rebooted twice to get it to come back up.)
As of Nov 7, 2011 the infiniband stacks (OFED) have been removed from rc01-rc20. Let's see if the IB drivers are causing the problem.
- November 7: rc02, rc03, rc08, rc16, rc18, rc20
- November 4: rc10, rc16, rc18
- November 3: rc18
- November 2: rc11, rc12, rc20
- October 20: rc04, rc20, rc35
- October 19: rc04, rc20
- October 18: rc01, rc05, rc11, rc16, rc24
- October 17: rc13, rc16, rc20
- October 13: rc16
- October 12: rc21, rc38
- October 11: rc03, rc08, rc10, rc16, rc19
- October 10: rc02, rc05, rc06, rc08, rc10, rc12, rc16, rc33, rc36
  - rc02 and rc17 were up, but reported a read-only root (/) file system
- October 7: rc06, rc10, rc11, rc13, rc16, rc19
- October 6: rc04
- October 5: rc01
- October 4: rc08
- October 3: rc05, rc10, rc12, rc14, rc18, rc33
- September 30: rc07, rc10, rc11, rc13, rc19, rc20, rc36
  - JO restarted all of them, and all came up except rc19 & rc20.
  - rc19 & 20 back up (PSU decided to work again today?!)
- September 28: rc04, rc19, rc20, rc21, rc25, rc26
  - rc04 was powered off
  - rc19/20 appears to have a bad power supply
  - rc21 was at Linux login prompt with cursor blinking, but didn't respond to keyboard
  - rc25/26 was not plugged in (oops)
Notes
Schedule of cluster changes
Aug 28: John's email "Can't ssh rc08"
Aug 24: unplugged 1/3rd of memory
Aug 17: SSDs arrived and installed
Aug 4: new servers installed and running recoveries with magnetic disks
Nov 30th:
Compared BIOS settings of several frequently failing machines (rc10, rc16, rc20) with some of the rc41+ machines, to little effect. Nothing too interesting, though a number of CMOS checksum errors are logged on the ones that fail. This isn't consistent across failing nodes, and I haven't looked widely enough to see whether the errors also occur on good machines. Otherwise settings are pretty close across machines. A few are set to IDE rather than AHCI, but that doesn't appear to explain anything (some IDE nodes failed a lot, but so have many AHCI ones).
Nov 11th:
Histogram of failures as of Nov 11th:
rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09:
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15:
rc16: *********
rc17:
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22:
rc23:
rc24: *
rc25: *
rc26: *
rc27:
rc28:
rc29:
rc30:
rc31:
rc32:
rc33: ***
rc34:
rc35: *
rc36: ***
rc37:
rc38: *
rc39:
rc40:
There were 53 failures on even nodes, 31 on odd nodes.
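The even/odd tally can be reproduced from the histogram with a short script. This is a minimal sketch: the `counts` dictionary is transcribed by hand from the asterisks (one failure per asterisk), and the `tally` helper is a hypothetical name, not something used on the cluster. It also splits the total between rc01-rc20 and rc21-rc40.

```python
# Failure counts per node as of Nov 11, transcribed from the histogram
# (one failure per asterisk; nodes with no asterisks are 0).
counts = {
    1: 2, 2: 2, 3: 4, 4: 5, 5: 3, 6: 2, 7: 3, 8: 4, 9: 0, 10: 9,
    11: 5, 12: 3, 13: 3, 14: 1, 15: 0, 16: 9, 17: 0, 18: 4, 19: 4, 20: 8,
    21: 2, 22: 0, 23: 0, 24: 1, 25: 1, 26: 1, 27: 0, 28: 0, 29: 0, 30: 0,
    31: 0, 32: 0, 33: 3, 34: 0, 35: 1, 36: 3, 37: 0, 38: 1, 39: 0, 40: 0,
}


def tally(counts):
    """Sum failures by node parity and by lower/upper half of the cluster."""
    return {
        "even": sum(c for n, c in counts.items() if n % 2 == 0),
        "odd": sum(c for n, c in counts.items() if n % 2 == 1),
        "rc01-rc20": sum(c for n, c in counts.items() if n <= 20),
        "rc21-rc40": sum(c for n, c in counts.items() if n > 20),
    }


print(tally(counts))
# {'even': 53, 'odd': 31, 'rc01-rc20': 71, 'rc21-rc40': 13}
```

The even and odd totals match the 53 and 31 stated above; the same data puts 71 of the 84 failures on rc01-rc20.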
"System Temp" as reported by ipmi (rc01 first, rc40 last):
System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 20 degrees C | ok
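Readings in this format can be pulled into numbers for a quick summary. This is a minimal sketch that assumes each reading looks exactly like the `System Temp | NN degrees C | ok` text pasted here; `parse_system_temps` is a hypothetical helper name, and the sample input is only the first five readings.

```python
import re

# Matches one reading such as "System Temp | 29 degrees C | ok".
READING = re.compile(r"System Temp\s*\|\s*(\d+)\s*degrees C\s*\|\s*(\w+)")


def parse_system_temps(text):
    """Return (temperature_in_C, status) pairs in the order they appear."""
    return [(int(temp), status) for temp, status in READING.findall(text)]


# Sample input: the first five readings from the listing.
sample = (
    "System Temp | 29 degrees C | ok System Temp | 28 degrees C | ok "
    "System Temp | 27 degrees C | ok System Temp | 28 degrees C | ok "
    "System Temp | 27 degrees C | ok"
)

readings = parse_system_temps(sample)
temps = [t for t, _ in readings]
print(min(temps), max(temps), all(s == "ok" for _, s in readings))
# 27 29 True
```

The pattern does not depend on line breaks, so it works whether the readings are one per line or run together.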
Nov 7th: Trying to see if the OFED stack is causing the problem by removing it on rc01-rc20. I don't know why the lower 20 are overrepresented compared to the upper 20.