This page logs instances of dead machines; we are using it to track down the mysterious machine crashes occurring in September/October/November 2011.

NB: As of November 7, 2011, the InfiniBand stack (OFED) has been removed from rc01-rc20, to test whether the IB drivers are causing the problem.

  • November 9: rc03, rc04, rc07, rc10, rc11, rc33
  • November 8: rc03, rc20 (Ryan: I've noticed over a few days rc20 must be rebooted twice to get it to come back up.)
  • November 7: rc02, rc03, rc08, rc16, rc18, rc20
  • November 4: rc10, rc16, rc18
  • November 3: rc18
  • November 2: rc11, rc12, rc20
  • October 20: rc04, rc20, rc35
  • October 19: rc04, rc20
  • October 18: rc01, rc05, rc11, rc16, rc24
  • October 17: rc13, rc16, rc20
  • October 13: rc16
  • October 12: rc21, rc38
  • October 11: rc03, rc08, rc10, rc16, rc19
  • October 10: rc02, rc05, rc06, rc08, rc10, rc12, rc16, rc33, rc36
    • rc02 and rc17 were up, but reported a read-only root (/) file system
  • October 7: rc06, rc10, rc11, rc13, rc16, rc19
  • October 6: rc04
  • October 5: rc01
  • October 4: rc08
  • October 3: rc05, rc10, rc12, rc14, rc18, rc33
  • September 30: rc07, rc10, rc11, rc13, rc19, rc20, rc36
    • JO restarted all of them, and all came up except rc19 & rc20.
      • rc19 & 20 back up (PSU decided to work again today?!)
  • September 28: rc04, rc19, rc20, rc21, rc25, rc26
    • rc04 was powered off
    • rc19/20 appears to have a bad power supply
    • rc21 was at Linux login prompt with cursor blinking, but didn't respond to keyboard
    • rc25/26 was not plugged in (oops)
Notes

Nov 7th: Removed the OFED stack from rc01-rc20 to see whether it is causing the problem. I don't know why the lower 20 machines are overrepresented compared to the upper 20.
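
To put a number on the lower-20 overrepresentation, the log above can be tallied per machine. The following is a minimal sketch, not a tool we actually run: the names CRASH_LOG and tally are made up for illustration, the dates are transcribed by hand from the list, and the split at rc20 assumes "upper" simply means any machine numbered above rc20. It counts every logged entry, including ones with a known non-crash cause (e.g. rc25/26 not plugged in).

```python
from collections import Counter

# Hand-transcribed from the log above: date -> rc machine numbers found dead.
# (Hypothetical structure for illustration; the wiki list is the source of truth.)
CRASH_LOG = {
    "Nov 9": [3, 4, 7, 10, 11, 33],
    "Nov 8": [3, 20],
    "Nov 7": [2, 3, 8, 16, 18, 20],
    "Nov 4": [10, 16, 18],
    "Nov 3": [18],
    "Nov 2": [11, 12, 20],
    "Oct 20": [4, 20, 35],
    "Oct 19": [4, 20],
    "Oct 18": [1, 5, 11, 16, 24],
    "Oct 17": [13, 16, 20],
    "Oct 13": [16],
    "Oct 12": [21, 38],
    "Oct 11": [3, 8, 10, 16, 19],
    "Oct 10": [2, 5, 6, 8, 10, 12, 16, 33, 36],
    "Oct 7": [6, 10, 11, 13, 16, 19],
    "Oct 6": [4],
    "Oct 5": [1],
    "Oct 4": [8],
    "Oct 3": [5, 10, 12, 14, 18, 33],
    "Sep 30": [7, 10, 11, 13, 19, 20, 36],
    "Sep 28": [4, 19, 20, 21, 25, 26],
}


def tally(log, split=20):
    """Return (per-machine Counter, entries at or below split, entries above split)."""
    per_machine = Counter(n for machines in log.values() for n in machines)
    lower = sum(c for n, c in per_machine.items() if n <= split)
    upper = sum(c for n, c in per_machine.items() if n > split)
    return per_machine, lower, upper


if __name__ == "__main__":
    per_machine, lower, upper = tally(CRASH_LOG)
    print(f"rc01-rc20: {lower} entries, above rc20: {upper} entries")
    # List the machines that appear most often in the log.
    for n, count in per_machine.most_common(5):
        print(f"rc{n:02d}: {count}")
```

Adding a new day is just one more dictionary entry, so the same tally can be rerun after the OFED removal to see whether the lower/upper split changes.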