This page logs instances of dead machines in reverse chronological order; among other things, we are using it to track down the mysterious machine crashes that started in August 2011.

As of Oct 3rd, the switch for rc1-40 has been replaced, and the cluster is back up to full strength.

  • September 12: rc20 still down; rebooted the switch, but port 9 stayed down; tried several host/cable permutations; ports 8 and 9 still appear to be dead.
  • September 4: rebooted but fsck failed, so cleaned up the file system (zapped; you will see a warning about the host ID change). - Satoshi, Ryan
    • rc20: crash (rebooted but still no response)
    • rc45, 47, 49-52: require password (possible NFS problem)
    • rc46: responds to pings but does not allow ssh
    • rc48: does not respond to pings
  • August 7: rc20 (no response to ipmi, see below)
    • rc20: Ports 8 and 9 on the 1G switch seem to be bad, and the jack on rc20 itself is loose. There were no more ports on that switch, so I plugged in the extra 50-port switch and routed rc20's and rc30's IPMI through it. Then I was able to IPMI to rc20, which was up but its network(s) seemed down. Nothing too interesting. Zapped it, since it's been offline for a while. Unfortunately, the install script hangs because it sets the ethernet MTU to 9000; I guess that 50-port switch can't handle jumbo frames. We should be using a normal MTU anyway, even once we get our switch problems worked out. I dropped the MTU back to 1500 and changed the NFS settings to use default-sized reads and writes. NFS still didn't work over UDP, so I changed it to use TCP (see the sketch after this list for the gist of that workaround). Things should work for now, but rc20 and rc30 are gimped: they have a 100Mbit connection to rcnfs now. We'll need to troubleshoot/RMA the HP switch on the first rack (RAM-445). Update August 9: I rebooted the HP switch, and port 9 works again. rc20 and rc30 are back on it. Works for now. Return the switch if this keeps happening (RAM-445). -Diego
  • August 2: rc79
  • July 31: rc79
  • July 30: rc20 (no response to ipmi, see below), rc79
    • rc20: Port 9 on 1G switch seems to be bad.
  • July 23: rc20 (no response to ipmi), rc61, rc63, rc64, rc68 (these are likely due to operator error, but needed manual maintenance to fix)
  • July 19: rc03 (reimage), rc04 (reimage), rc20 (no response to ipmi, see below), rc26 (reimage), rc37, rc38 (no response to ipmi, see below), rc39, rc78 (reimage), rc80
    • rc20: 1G cable loose; doesn't seat well in the NIC. If the problem continues, we may need to bend the clip or try other cables.
    • rc38: Port 8 on the 1G switch seems to be bad; hopefully diagnostics on the switch can tell us more. Connected rc38 to port 48T, which is the 1/10G uplink port; no other ports were free. Diagnostics didn't provide much info: in 4 days of uptime, port 8 never successfully detected a connected cable, even after trying cables from neighboring, known-working endpoints. It seems to be out of commission.
  • June 15: rc37 (failed to reboot), rc38 (failed to reboot), rc79
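
The rc20/rc30 workaround described in the August 7 entry above boils down to two steps: drop the interface MTU from 9000 back to 1500, and remount the rcnfs export over TCP with the client's default read/write sizes. Below is a minimal sketch of that sequence; the interface name, export path, and mount point are hypothetical placeholders, not values from our actual install scripts.

```python
#!/usr/bin/env python
# Hypothetical sketch of the MTU/NFS workaround: back off jumbo frames
# and remount the NFS export over TCP with default rsize/wsize.
# The interface, export, and mount point below are illustrative only.
import subprocess

IFACE = "eth0"                 # assumed name of the 1G interface
NFS_EXPORT = "rcnfs:/export"   # hypothetical export on rcnfs
MOUNT_POINT = "/mnt/rcnfs"     # hypothetical mount point

def run(cmd):
    """Echo and run a command, raising if it fails."""
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Drop the MTU from 9000 back to the standard 1500.
run(["ip", "link", "set", "dev", IFACE, "mtu", "1500"])

# 2. Remount over TCP, letting the client pick default read/write sizes
#    instead of forcing large rsize/wsize values.
run(["umount", MOUNT_POINT])
run(["mount", "-t", "nfs", "-o", "proto=tcp", NFS_EXPORT, MOUNT_POINT])
```

Forcing proto=tcp matches what ended up working here: with the 50-port switch in the path, the UDP mount kept failing even after the MTU was reduced.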

As of May 14, 2012, the ConnectX2 cards have been replaced by ConnectX3 cards, and the cluster is back up to full strength.

  • February 15: rc79
  • February 6: rc80
  • February 2: rc79 (2 am), rc79 (11 pm)
  • January 30: rc06, rc24, rc65
  • January 26: rc19, rc20
  • January 12: rc33
  • January 6: rc33
  • January 5: rc32
  • January 3: rc24, rc32
  • December 23: rc32, rc32 (10 minutes later...)
  • December 21: rc21, rc32
  • December 16: rc23
  • December 15: rc27
  • December 8: rc03, rc04, rc33

...