This page logs instances of dead machines; we are using it to track down the mysterious machine crashes occurring in September/October/November 2011.

 

NB: As of Nov 7, 2011 the InfiniBand stack (OFED) has been removed from rc01-rc20, to test whether the IB drivers are causing the problem.

NB: As of Nov 11, 2011 the InfiniBand cables have been removed from rc01-rc20. They were removed after the failures of rc07, rc10, and rc16 were reported.

NB: As of Nov 18, 2011, after the machines listed below were rebooted, the fans were reset to maximum speed.

Notes

Nov 11th:

Histogram of failures as of Nov 11th (one * per failure):

rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09: 
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15: 
rc16: *********
rc17: 
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22: 
rc23: 
rc24: *
rc25: *
rc26: *
rc27: 
rc28: 
rc29: 
rc30: 
rc31: 
rc32: 
rc33: ***
rc34: 
rc35: *
rc36: ***
rc37: 
rc38: *
rc39: 
rc40:

There were 53 failures on even nodes, 31 on odd nodes.
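The even/odd totals can be rechecked from the histogram with a short script. This is only a sketch: the counts are transcribed by hand from the Nov 11th histogram above, and the names `HISTOGRAM` and `parse_histogram` are made up for this example. It also splits the totals into rc01-rc20 versus rc21-rc40.

```python
# Failure counts transcribed from the Nov 11th histogram (one '*' = one failure).
HISTOGRAM = """\
rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09:
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15:
rc16: *********
rc17:
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22:
rc23:
rc24: *
rc25: *
rc26: *
rc27:
rc28:
rc29:
rc30:
rc31:
rc32:
rc33: ***
rc34:
rc35: *
rc36: ***
rc37:
rc38: *
rc39:
rc40:
"""


def parse_histogram(text):
    """Return {node_number: failure_count} from lines like 'rc04: *****'."""
    counts = {}
    for line in text.splitlines():
        name, _, stars = line.partition(":")
        name = name.strip()
        if not name.startswith("rc"):
            continue  # skip anything that is not a node line
        counts[int(name[2:])] = stars.count("*")
    return counts


counts = parse_histogram(HISTOGRAM)
even = sum(c for n, c in counts.items() if n % 2 == 0)
odd = sum(c for n, c in counts.items() if n % 2 == 1)
lower = sum(c for n, c in counts.items() if n <= 20)   # rc01-rc20
upper = sum(c for n, c in counts.items() if n > 20)    # rc21-rc40

print(f"even={even} odd={odd} lower20={lower} upper20={upper} total={even + odd}")
# prints: even=53 odd=31 lower20=71 upper20=13 total=84
```

The result agrees with the 53/31 split stated above. It also shows that 71 of the 84 failures logged by Nov 11th were on rc01-rc20.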

 

"System Temp" as reported by IPMI (rc01 first, rc40 last; only 39 readings are listed, so one node's reading is missing):

System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 20 degrees C | ok
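Readings in this format can be summarized with a small parser. This is a minimal sketch that assumes every line looks exactly like the ones above. The names `LINE_RE` and `parse_system_temps` are hypothetical, and `sample` holds just three of the lines listed above rather than the full output.

```python
import re

# Matches lines such as "System Temp | 29 degrees C | ok".
LINE_RE = re.compile(
    r"^\s*System Temp\s*\|\s*(-?\d+(?:\.\d+)?) degrees C\s*\|\s*(\S+)\s*$"
)


def parse_system_temps(text):
    """Return a list of (temperature_in_C, status) tuples, in input order."""
    readings = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            readings.append((float(match.group(1)), match.group(2)))
    return readings


# Three lines copied from the listing above, for illustration only.
sample = """\
System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 20 degrees C | ok
"""

readings = parse_system_temps(sample)
temps = [temp for temp, _ in readings]
all_ok = all(status == "ok" for _, status in readings)
print(f"n={len(readings)} min={min(temps)} max={max(temps)} all_ok={all_ok}")
```

Because the listing has one reading fewer than there are nodes, the position of a reading cannot safely be mapped back to a node name. The sketch therefore reports only aggregate values.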

 

Nov 7th: Removed the OFED stack from rc01-rc20 to see whether it is causing the problem. I don't know why the lower 20 nodes (rc01-rc20) are overrepresented compared to the upper 20 (rc21-rc40).