This page logs instances of dead machines; we are using it to track down the mysterious machine crashes occurring in August/September/October/November 2011.
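To look for nodes that fail repeatedly, the dated entries on this page can be tallied per node. The following is a minimal sketch, not a tool that was actually used here: the function name `tally_crashes` and the parsing rules are assumptions, and the sample text is copied from the December entries in this log. It counts only lines that start with a month and day, and it ignores parenthetical comments, so a crash mentioned only inside a comment is not counted.

```python
import re
from collections import Counter

# Sample entries copied from the December 2011 part of this log.
LOG = """\
December 7: rc10, rc20, rc32
December 5: rc16, rc18, rc20, rc36
December 1: rc04, rc18
"""

# A dated entry looks like "November 18: rc10, rc20"; an optional bullet may precede it.
ENTRY = re.compile(r"^\s*(?:•\s*)?[A-Z][a-z]+ \d{1,2}:(.*)$")


def tally_crashes(log_text):
    """Count how often each node name appears in the dated entries (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue  # skip "NB:" notes and any other non-entry lines
        # Drop parenthetical comments so nodes named inside them are not double counted.
        nodes = re.sub(r"\([^)]*\)", "", match.group(1))
        counts.update(re.findall(r"rc\d+", nodes))
    return counts


if __name__ == "__main__":
    for node, crashes in tally_crashes(LOG).most_common():
        print(node, crashes)
```

A node listed twice on the same day, such as "rc10 (again)", is counted twice, which matches the intent of recording a second crash.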

NB: As of Dec 7, 2011, the power has been restored to all nodes.

  • December 7: rc10, rc20, rc32
  • December 5: rc16, rc18, rc20, rc36

NB: As of Dec 1, 2011, the power has been shut off on the odd-numbered nodes (rc01, rc03, ..., rc19).

  • December 1: rc04, rc18
  • November 30: rc04 (after disk replacement)

NB: As of Nov 30, 2011, the flash disks in rc01-rc20 were replaced with the original hard disks.

  • November 30: rc05, rc07
  • November 28: rc01, rc07, rc16, rc20
  • November 27: rc05, rc11, rc12, rc16, rc19, rc20, rc21
  • November 24: rc04, rc16
  • November 21: rc07, rc08, rc12, rc18, rc19, rc20, rc36
  • November 19: rc16

NB: As of Nov 18, 2011 (after rebooting the machines listed below), the fans were reset to max speed. (Ryan: I don't think this stuck for long, since the fans went back to normal speed on reboot. After a couple of days they were all back at normal levels.)

  • November 18: rc10, rc20, rc33 (rc20 again had to be rebooted twice before it restarted), rc10 (again)
  • November 16: rc19
  • November 15:
  • November 14: rc08, rc10, rc11, rc16, rc19, rc33 (then rc19 crashed again)

NB: As of Nov 11, 2011, the InfiniBand cables have been removed from rc01-rc20. They were removed after the failures of rc07, rc10, and rc16 were reported.

  • November 11: rc07, rc10, rc16
  • November 10: rc10, rc36
  • November 9: rc03, rc04, rc07, rc10, rc11, rc33
  • November 8: rc03, rc20 (Ryan: I've noticed over the past few days that rc20 must be rebooted twice before it comes back up.)

NB: As of Nov 7, 2011, the InfiniBand stacks (OFED) have been removed from rc01-rc20. Let's see whether the IB drivers are causing the problem.

...