This page logs instances of dead machines; we are using it to track down the mysterious machine crashes occurring in September/October/November 2011.
NB: As of Nov 7, 2011 the InfiniBand stacks (OFED) have been removed from rc01-rc20, to see whether the IB drivers are causing the problem.
NB: As of Nov 11, 2011 the InfiniBand cables have been removed from rc01-rc20. They were removed after the failures of rc07, rc10, and rc16 were reported.
NB: As of Nov 18, 2011 (following rebooting the machines listed below) fans were reset to max speed.
- November 18: rc10, rc20, rc33 (rc20 again had to be rebooted twice before it came back up), rc10 (again)
- November 16: rc19
- November 15:
- November 14: rc08, rc10, rc11, rc16, rc19, rc33 (then rc19 crashed again)
- November 11: rc07, rc10, rc16
- November 10: rc10, rc36
- November 9: rc03, rc04, rc07, rc10, rc11, rc33
- November 8: rc03, rc20 (Ryan: I've noticed over a few days rc20 must be rebooted twice to get it to come back up.)
- November 7: rc02, rc03, rc08, rc16, rc18, rc20
- November 4: rc10, rc16, rc18
- November 3: rc18
- November 2: rc11, rc12, rc20
- October 20: rc04, rc20, rc35
- October 19: rc04, rc20
- October 18: rc01, rc05, rc11, rc16, rc24
- October 17: rc13, rc16, rc20
- October 13: rc16
- October 12: rc21, rc38
- October 11: rc03, rc08, rc10, rc16, rc19
- October 10: rc02, rc05, rc06, rc08, rc10, rc12, rc16, rc33, rc36
  - rc02 and rc17 were up, but reported a read-only root (/) file system
- October 7: rc06, rc10, rc11, rc13, rc16, rc19
- October 6: rc04
- October 5: rc01
- October 4: rc08
- October 3: rc05, rc10, rc12, rc14, rc18, rc33
- September 30: rc07, rc10, rc11, rc13, rc19, rc20, rc36
  - JO restarted all of them, and all came up except rc19 & rc20.
  - rc19 & rc20 back up (PSU decided to work again today?!)
- September 28: rc04, rc19, rc20, rc21, rc25, rc26
  - rc04 was powered off
  - rc19/rc20 appear to have a bad power supply
  - rc21 was at the Linux login prompt with the cursor blinking, but didn't respond to the keyboard
  - rc25/rc26 were not plugged in (oops)
Notes
Nov 11th:
Histogram of failures as of Nov 11th:
rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09:
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15:
rc16: *********
rc17:
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22:
rc23:
rc24: *
rc25: *
rc26: *
rc27:
rc28:
rc29:
rc30:
rc31:
rc32:
rc33: ***
rc34:
rc35: *
rc36: ***
rc37:
rc38: *
rc39:
rc40:
There were 53 failures on even nodes, 31 on odd nodes.
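The even/odd split can be rechecked from the histogram counts. The following is a minimal sketch, assuming the star counts were transcribed correctly from the histogram; the `failures` dictionary and the `tally` helper are names made up for this illustration. It also splits the counts into rc01-rc20 versus rc21-rc40.

```python
# Failure counts per node number, transcribed from the Nov 11 histogram.
# Nodes that do not appear here had no recorded failures.
failures = {
    1: 2, 2: 2, 3: 4, 4: 5, 5: 3, 6: 2, 7: 3, 8: 4, 10: 9,
    11: 5, 12: 3, 13: 3, 14: 1, 16: 9, 18: 4, 19: 4, 20: 8,
    21: 2, 24: 1, 25: 1, 26: 1, 33: 3, 35: 1, 36: 3, 38: 1,
}


def tally(counts, predicate):
    """Sum the failure counts of every node for which predicate(node) is true."""
    return sum(n for node, n in counts.items() if predicate(node))


even = tally(failures, lambda node: node % 2 == 0)
odd = tally(failures, lambda node: node % 2 == 1)
lower = tally(failures, lambda node: node <= 20)   # rc01-rc20
upper = tally(failures, lambda node: node > 20)    # rc21-rc40

print(f"even: {even}, odd: {odd}")
print(f"rc01-rc20: {lower}, rc21-rc40: {upper}")
```

With the counts as transcribed, the even/odd totals match the 53 and 31 stated above, and rc01-rc20 account for 71 of the 84 failures.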
"System Temp" as reported by ipmi (rc01 first, rc40 last):
System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 20 degrees C | ok
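Readings in this `name | value | status` layout are easy to pull into numbers for comparison. The sketch below assumes every reading looks like the ones listed here; `LINE_RE` and `parse_temps` are hypothetical names for this example, and the command that produced the readings is not recorded on this page.

```python
import re

# Matches a single reading shaped like "System Temp | 29 degrees C | ok".
LINE_RE = re.compile(
    r"^\s*(?P<name>[^|]+?)\s*\|\s*(?P<value>-?\d+)\s+degrees C\s*\|\s*(?P<status>\S+)\s*$"
)


def parse_temps(text):
    """Return the integer temperatures from every line that matches LINE_RE."""
    temps = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:  # silently skip lines in any other format
            temps.append(int(match.group("value")))
    return temps


# The first three readings from the listing, one per line.
sample = """\
System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
"""

temps = parse_temps(sample)
print(temps, "min:", min(temps), "max:", max(temps))
```

Lines that do not match are skipped rather than raising an error, so stray output mixed in with the readings would not break the parse.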
Nov 7th: Removing the OFED stack on rc01-rc20 to see whether it is causing the problem. I don't know why the lower 20 nodes are overrepresented compared to the upper 20.