This page logs instances of dead machines; we are using it to track down the mysterious machine crashes occurring in September/October/November 2011.
NB: As of Nov 7, 2011 the InfiniBand stack (OFED) has been removed from rc01-rc20, to test whether the IB drivers are causing the problem.
NB: As of Nov 11, 2011 the InfiniBand cables have been removed from rc01-rc20. The cables were removed after the failures of rc07, rc10, and rc16 were reported.
Nov 11th:
Histogram of failures as of Nov 11th:
rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09:
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15:
rc16: *********
rc17:
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22:
rc23:
rc24: *
rc25: *
rc26: *
rc27:
rc28:
rc29:
rc30:
rc31:
rc32:
rc33: ***
rc34:
rc35: *
rc36: ***
rc37:
rc38: *
rc39:
rc40:
There were 53 failures on even nodes, 31 on odd nodes.
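The even/odd tally can be re-derived from the histogram with a short script. This is only a sketch: the `HISTOGRAM` string is transcribed by hand from the Nov 11th data, and the names `parse_histogram`, `even`, `odd`, `lower`, and `upper` are made up for illustration. It also splits the counts into rc01-rc20 versus rc21-rc40.

```python
import re

# Failure histogram as of Nov 11th, transcribed from this page
# (one '*' per recorded failure; an empty entry means no failures).
HISTOGRAM = """
rc01: ** rc02: ** rc03: **** rc04: ***** rc05: *** rc06: ** rc07: ***
rc08: **** rc09: rc10: ********* rc11: ***** rc12: *** rc13: *** rc14: *
rc15: rc16: ********* rc17: rc18: **** rc19: **** rc20: ******** rc21: **
rc22: rc23: rc24: * rc25: * rc26: * rc27: rc28: rc29: rc30: rc31: rc32:
rc33: *** rc34: rc35: * rc36: *** rc37: rc38: * rc39: rc40:
"""


def parse_histogram(text):
    """Return {node_number: failure_count} from 'rcNN: ***' entries."""
    pairs = re.findall(r"rc(\d+):\s*(\**)", text)
    return {int(node): len(stars) for node, stars in pairs}


counts = parse_histogram(HISTOGRAM)

# Split by node parity and by lower/upper half of the cluster.
even = sum(c for n, c in counts.items() if n % 2 == 0)
odd = sum(c for n, c in counts.items() if n % 2 == 1)
lower = sum(c for n, c in counts.items() if n <= 20)
upper = sum(c for n, c in counts.items() if n > 20)

print(f"even nodes: {even}, odd nodes: {odd}")
print(f"rc01-rc20: {lower}, rc21-rc40: {upper}")
```

Re-running this after adding new asterisks to `HISTOGRAM` keeps the summary line in step with the histogram.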
"System Temp" as reported by ipmi (rc01 first, rc40 last):
System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 20 degrees C | ok
Nov 7th: Removing the OFED stack on rc01-rc20 to see whether it is causing the problem. I don't know why the lower 20 machines (rc01-rc20) are overrepresented in the failures compared to the upper 20 (rc21-rc40).