...

The current custodian is responsible for restarting, debugging, and reimaging machines, and for generally keeping the cluster working. The outgoing custodian is also responsible for notifying the next week's custodian that it is their turn.

Crashes

This page logs instances of dead machines; we are using it to track down the mysterious machine crashes that occurred starting in August 2011.

  • July 19: rc03 (reimage), rc04 (reimage), rc20 (no response to IPMI, see below), rc26 (reimage), rc37, rc38 (no response to IPMI, see below), rc39, rc78 (reimage), rc80
    • rc20: 1G cable was loose; it doesn't seat well in the NIC. If the problem continues, we may need to bend the clip or try other cables.
    • rc38: Port 8 on the 1G switch seems to be bad. Hopefully diagnostics on the switch can tell us more. Connected to port 48T, which is the 1/10G uplink port, since no other ports were free. Diagnostics didn't provide much info. In the 4 days of uptime, port 8 never successfully detected a connected cable, even when tried with cables from neighboring, known-working endpoints. It seems to be out of commission.
  • June 15: rc37 (failed to reboot), rc38 (failed to reboot), rc79
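
Since the point of this log is to track down recurring crashes, a small script can tally which machines show up in more than one dated entry. The following is only a rough sketch: the function names `crash_counts` and `repeat_offenders` are made up for illustration, and it assumes each top-level entry has the form "<date>: <comma-separated hosts with optional notes>" as in the two entries above.

```python
import re
from collections import Counter

# Top-level entries copied from the log above (sub-bullets omitted).
LOG = """\
July 19: rc03 (reimage), rc04 (reimage), rc20 (no response to ipmi, see below), rc26 (reimage), rc37, rc38 (no response to ipmi, see below), rc39, rc78 (reimage), rc80
June 15: rc37 (failed to reboot), rc38 (failed to reboot), rc79
"""


def crash_counts(log_text):
    """Count, per host, the number of dated entries that mention it."""
    counts = Counter()
    for line in log_text.splitlines():
        # Split off the date; skip lines that are not "<date>: <hosts>".
        _date, sep, hosts = line.partition(":")
        if not sep:
            continue
        # A host counts once per entry, even if it is mentioned twice.
        counts.update(set(re.findall(r"\brc\d+\b", hosts)))
    return counts


def repeat_offenders(counts, threshold=2):
    """Return hosts that appear in at least `threshold` entries, sorted."""
    return sorted(host for host, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    print(repeat_offenders(crash_counts(LOG)))
```

Run against the two entries above, this reports rc37 and rc38 as the machines that appear in both.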

...