Cluster Custodian

If you notice a down machine that doesn't seem to respond to rcreboot, ping the IRC user listed below for the current week.

  • 7/22/12: stutsman
  • 7/29/12: ankitak
  • 8/5/12: ongardie
  • 8/12/12: daeschli
  • 8/19/12: mendel
  • 8/26/12: ouster
  • 9/2/12: syang0
  • 9/9/12: satoshi

The current custodian is responsible for restarting, debugging, and reimaging machines and generally keeping the cluster working. Also, the outgoing custodian is responsible for notifying the next week's custodian that it is their turn.
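
A minimal sketch of the kind of liveness check the custodian might run, assuming the hosts are reachable by the names rc01 through rc80 and that a single ICMP ping is an adequate test (both are assumptions, not anything prescribed on this page). It simply reports which machines are not answering ping so they can be followed up with rcreboot or IPMI.

    #!/usr/bin/env python3
    # Hypothetical helper: report which rc hosts do not answer ping.
    # The rc01..rc80 hostnames are an assumption; edit to match the cluster.
    import subprocess

    def is_pingable(host, timeout_secs=2):
        """Return True if host answers one ICMP echo within the timeout."""
        return subprocess.call(
            ["ping", "-c", "1", "-W", str(timeout_secs), host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

    if __name__ == "__main__":
        hosts = ["rc%02d" % n for n in range(1, 81)]
        down = [h for h in hosts if not is_pingable(h)]
        if down:
            print("Not responding to ping: " + ", ".join(down))
        else:
            print("All hosts answered ping.")

A host that answers ping but refuses ssh (like rc46 in the September 4 entry below) won't show up here, so treat this as a first pass rather than a full health check.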

Crashes

This page logs instances of dead machines in reverse chronological order; among other things, we are using it to track down the mysterious machine crashes that occurred starting in August 2011.

As of Oct 3rd, the switch for rc1-40 has been replaced, and the cluster is back to full strength.

  • September 12: rc20 still down; rebooted the switch, but port 9 stayed down; tried several host/cable permutations; ports 8 and 9 still appear to be bad.
  • September 4: rebooted but fsck failed, so cleaned up the file system (zapped; expect a host ID change warning). - Satoshi, Ryan
    • rc20: crashed (rebooted, but still no response)
    • rc45, 47, 49-52: require password (possible NFS problem)
    • rc46: responds to pings but does not allow ssh
    • rc48: does not respond to pings
  • August 7: rc20 (no response to ipmi, see below)
    • rc20: Ports 8 and 9 on the 1G switch seem to be bad, and the jack on rc20 itself is loose. There were no more free ports on that switch, so I plugged in the extra 50-port switch and routed rc20 and one other host (I think rc30)'s IPMI through it. Then I was able to IPMI to rc20, which was up but its network(s) seemed down. Nothing too interesting. Zapped it, since it had been offline for a while. Unfortunately, the install script hangs because it sets the ethernet MTU to 9000; I guess the 50-port switch can't handle jumbo frames. Can we get the installer to use a normal-sized MTU? We should be using a normal MTU even once our switch problems are worked out. I dropped the MTU back to 1500 and changed the NFS settings to use default-sized reads and writes. NFS still didn't work over UDP, so I changed it to use TCP (see the sketch after this list). Things should work for now, but rc20 and rc30 are gimped: they have a 100Mbit connection to rcnfs now. We'll need to troubleshoot/RMA the HP switch on the first rack (RAM-445). Update August 9: I rebooted the HP switch, and port 9 works again. rc20 and rc30 are back on it. Works for now; return the switch if this keeps happening (RAM-445). -Diego
  • August 2: rc79
  • July 31: rc79
  • July 30: rc20 (no response to ipmi, see below), rc79
    • rc20: Port 9 on 1G switch seems to be bad.
  • July 23: rc20 (no response to ipmi), rc61, rc63, rc64, rc68 (these are likely due to operator error, but needed manual maintenance to fix)
  • July 19: rc03 (reimage), rc04 (reimage), rc20 (no response to ipmi, see below), rc26 (reimage), rc37, rc38 (no response to ipmi, see below), rc39, rc78 (reimage), rc80
    • rc20: 1G cable loose; doesn't seat well in NIC. If problem continues may need to bend clip or try other cables.
    • rc38: Port 8 on 1G switch seems to be bad. Hopefully diagnostics on switch can tell us more. Connected to port 48T which is the 1/10G uplink port, no other ports free. Diagnostics didn't provide much info. In the 4 days of uptime port 8 never successfully detected a connected cable, even after trying neighboring, known-working end-points' cables with it. It seems to be out-of-commission.
  • June 15: rc37 (failed to reboot), rc38 (failed to reboot), rc79
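
The August 7 entry above describes two misconfigurations that kept causing trouble: an interface left at a jumbo-frame MTU of 9000 that the spare 50-port switch can't handle, and NFS mounts running over UDP. The sketch below shows one way to spot both on a Linux host; the /proc and /sys paths and the proto=udp mount option it looks for are standard Linux conventions, not details taken from these notes, so treat it as a starting point rather than the procedure that was actually used.

    #!/usr/bin/env python3
    # Hypothetical check for the two problems noted on August 7:
    # interfaces configured for jumbo frames, and NFS mounts using UDP.
    import re

    def jumbo_interfaces(dev_path="/proc/net/dev", sys_root="/sys/class/net"):
        """Return (name, mtu) for interfaces whose MTU is above 1500."""
        jumbo = []
        with open(dev_path) as f:
            lines = f.readlines()[2:]              # skip the two header lines
        for line in lines:
            name = line.split(":")[0].strip()
            try:
                with open("%s/%s/mtu" % (sys_root, name)) as mtu_file:
                    mtu = int(mtu_file.read().strip())
            except OSError:
                continue
            if mtu > 1500:
                jumbo.append((name, mtu))
        return jumbo

    def udp_nfs_mounts(mounts_path="/proc/mounts"):
        """Return (device, mountpoint) for NFS mounts using UDP transport."""
        udp = []
        with open(mounts_path) as f:
            for line in f:
                device, mountpoint, fstype, options = line.split()[:4]
                if fstype.startswith("nfs") and re.search(r"\bproto=udp\b", options):
                    udp.append((device, mountpoint))
        return udp

    if __name__ == "__main__":
        for name, mtu in jumbo_interfaces():
            print("interface %s has MTU %d (jumbo frames)" % (name, mtu))
        for device, mountpoint in udp_nfs_mounts():
            print("NFS mount %s on %s is using UDP" % (device, mountpoint))

On a freshly zapped machine this would flag the MTU 9000 that the install script sets, which is what hung the install on the 50-port switch.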

...