This page logs instances of dead machines; we are using it to track down the mysterious machine crashes occurring in August/September/October/November 2011. The crashes started around August 28, 2011 (see John's email "Can't ssh to rc08").

  • December 1: rc04, rc18
  • November 30: rc04 (After disk replacement)

...

  • November 7: rc02, rc03, rc08, rc16, rc18, rc20
  • November 4: rc10, rc16, rc18
  • November 3: rc18
  • November 2: rc11, rc12, rc20
  • October 20: rc04, rc20, rc35
  • October 19: rc04, rc20
  • October 18: rc01, rc05, rc11, rc16, rc24
  • October 17: rc13, rc16, rc20
  • October 13: rc16
  • October 12: rc21, rc38
  • October 11: rc03, rc08, rc10, rc16, rc19
  • October 10: rc02, rc05, rc06, rc08, rc10, rc12, rc16, rc33, rc36
    • rc02 and rc17 were up, but reported a read-only / file system
  • October 7: rc06, rc10, rc11, rc13, rc16, rc19
  • October 6: rc04
  • October 5: rc01
  • October 4: rc08
  • October 3: rc05, rc10, rc12, rc14, rc18, rc33
  • September 30: rc07, rc10, rc11, rc13, rc19, rc20, rc36
    • JO restarted all of them, and all came up except rc19 & rc20.
      • rc19 & 20 back up (PSU decided to work again today?!)
  • September 28: rc04, rc19, rc20, rc21, rc25, rc26
    • rc04 was powered off
    • rc19/20 appear to have a bad power supply
    • rc21 was at Linux login prompt with cursor blinking, but didn't respond to keyboard
    • rc25/26 were not plugged in (oops)
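
To see which machines fail most often, the dated entries above can be tallied per host. The following is only a rough sketch: it assumes the entries are pasted in as plain "Month day: rcNN, rcNN" lines, and the names `LOG`, `HOST` and `tally` are made up for illustration. Only a handful of entries are included here, so the counts it prints do not reflect the full log.

```python
import re
from collections import Counter

# A few dated entries copied from the list above. The full log is longer
# (and partly elided), so these counts are illustrative only.
LOG = """\
December 1: rc04, rc18
November 30: rc04 (After disk replacement)
November 4: rc10, rc16, rc18
November 3: rc18
"""

# Host names on this page look like "rc" followed by digits.
HOST = re.compile(r"\brc\d+\b")


def tally(log_text):
    """Count how many dated entries mention each host."""
    counts = Counter()
    for line in log_text.splitlines():
        # Split "Month day: hosts..." at the first colon; skip lines without one.
        _date, sep, hosts = line.partition(":")
        if not sep:
            continue
        counts.update(HOST.findall(hosts))
    return counts


if __name__ == "__main__":
    # Print hosts from most to least frequently dead.
    for host, n in tally(LOG).most_common():
        print(host, n)
```

Only the top-level dated lines should be fed in; the indented follow-up notes use shorthand such as "rc19 & 20" that this pattern would only partly match.
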
Notes

Schedule of cluster changes

Aug 28: John's email "Can't ssh rc08"

Aug 24: unplugged 1/3 of memory

Aug 17: SSDs arrived and installed

Aug 4: new servers installed and running recoveries with magnetic disks


Nov 30th:

Compared the BIOS settings of several commonly failing machines (rc10, rc16, rc20) against some of the rc41+ machines, without finding much. Nothing too interesting, though a number of CMOS checksum errors are logged on the ones that fail. This isn't consistent across failing nodes, and I haven't looked widely enough to know whether the errors also occur on good machines. Otherwise the settings are very similar across machines. A few are set to IDE rather than AHCI, but that doesn't appear to explain anything (some IDE nodes failed a lot, but so have many AHCI ones).

...