This page logs instances of dead machines; we are using it to track down the mysterious machine crashes occurring from August through November 2011.
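To help spot repeat offenders, the dated entries on this page can be tallied per node. The following is only a minimal sketch: the `tally` function and the `LOG` sample are hypothetical names introduced for illustration, and the sample lines are copied from the log on this page. It assumes each entry has the form `- Month Day: rcNN, rcNN, ...` and that "NB:" notes should be ignored. It counts every mention of a node on an entry line, so a node named again in a parenthetical comment on the same day is counted again.

```python
import re
from collections import Counter

# Sample entries copied from the log on this page (not the full log).
LOG = """\
- December 7: rc10, rc20, rc32
- December 5: rc16, rc18, rc20, rc36
- December 1: rc04, rc18
"""

# Assumed entry format: "- <Month> <Day>: <comma-separated nodes>"
ENTRY = re.compile(r"^- (\w+ \d+):(.*)$")
# Assumed node naming: "rc" followed by two digits.
NODE = re.compile(r"\brc\d{2}\b")


def tally(log_text):
    """Count how many times each node is mentioned on dated entry lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue  # skip "NB:" notes, blank lines, and anything else
        counts.update(NODE.findall(match.group(2)))
    return counts


if __name__ == "__main__":
    # Print nodes from most to least frequently mentioned.
    for node, n in tally(LOG).most_common():
        print(node, n)
```

Pasting the full log into `LOG` would give a rough per-node count, which could then be compared against the dates of the hardware changes noted in the "NB:" lines.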
NB: As of Dec 7, 2011 the power has been restored to all nodes.
- December 7: rc10, rc20, rc32
- December 5: rc16, rc18, rc20, rc36
NB: As of Dec 1, 2011 the power has been shut off on the odd-numbered nodes (rc01, rc03, ..., rc19).
- December 1: rc04, rc18
- November 30: rc04 (after disk replacement)
NB: As of Nov 30, 2011 the flash disks were replaced with the original hard disks in rc01-rc20.
- November 30: rc05, rc07
- November 28: rc01, rc07, rc16, rc20
- November 27: rc05, rc11, rc12, rc16, rc19, rc20, rc21
- November 24: rc04, rc16
- November 21: rc07, rc08, rc12, rc18, rc19, rc20, rc36
- November 19: rc16
NB: As of Nov 18, 2011 (after rebooting the machines listed below) the fans were reset to max speed. (Ryan: I don't think this stuck for long, since the fans went back to normal speed on reboot. After a couple of days they were all back at normal levels.)
- November 18: rc10, rc20, rc33 (rc20 again had to be rebooted twice to restart it), rc10 (again)
- November 16: rc19
- November 15:
- November 14: rc08, rc10, rc11, rc16, rc19, rc33 (then rc19 crashed again)
NB: As of Nov 11, 2011 the InfiniBand cables have been removed from rc01-rc20. They were removed after the failures of rc07, rc10, and rc16 were reported.
- November 11: rc07, rc10, rc16
- November 10: rc10, rc36
- November 9: rc03, rc04, rc07, rc10, rc11, rc33
- November 8: rc03, rc20 (Ryan: Over the past few days I've noticed that rc20 must be rebooted twice before it comes back up.)
NB: As of Nov 7, 2011 the InfiniBand stacks (OFED) have been removed from rc01-rc20. Let's see if the IB drivers are causing the problem.
...