Dead Machines

This page logs instances of dead machines in reverse chronological order; among other things, we are using it to track down the mysterious machine crashes that occurred starting in August 2011.

As of Oct 3rd, the switch for rc1-40 has been replaced, and the cluster is up to its full strength.

September 12: rc20 still down; rebooted switch, port 9 still stayed down; tried several host/cable permutations; port 8 and 9
September 4: rebooted but fsck failed, so cleaned up the file system (zapped: You will see the warning of host ID change.) - Satoshi, Ryan
- rc20: crash (reboot but still no response)
- rc45, 47, 49-52: require password (possible NFS problem)
- rc46: responds to pings but does not allow ssh
- rc48: does not respond to pings
August 7: rc20 (no response to ipmi, see below)
- rc20: Ports 8 and 9 on the 1G switch seem to be bad, and the jack on rc20 itself is loose. There were no more ports on that switch, so I plugged in the extra 50-port switch and routed rc20 and rc30's IPMI through this 50-port switch. Then I was able to IPMI to rc20, which was up but its network(s) seemed down. Nothing too interesting. Zapped it, since it's been offline for a while. Unfortunately, the install script hangs because it sets the ethernet MTU to 9000. I guess that 50-port switch can't handle jumbo frames. We should be using a normal MTU, even if we get our switch problems worked out. I dropped the MTU back to 1500 and changed the NFS settings to use default-sized reads and writes. NFS still didn't work over UDP, so I changed it to use TCP. Things should work for now, but rc20 and rc30 are gimped – they have a 100Mbit connection to rcnfs now. We'll need to troubleshoot/RMA the HP switch on the first rack (RAM-445).
  - Update August 9: I rebooted the HP switch, and port 9 works again. rc20 and rc30 are back on it. Works for now. Return the switch if this keeps happening (RAM-445). -Diego
August 2: rc79
July 31: rc79
July 30: rc20 (no response to ipmi, see below), rc79
- rc20: Port 9 on 1G switch seems to be bad.
July 23: rc20 (no response to ipmi), rc61, rc63, rc64, rc68 (these are likely due to operator error, but needed manual maintenance to fix)
July 19: rc03 (reimage), rc04 (reimage), rc20 (no response to ipmi, see below), rc26 (reimage), rc37, rc38 (no response to ipmi, see below), rc39, rc78 (reimage), rc80
- rc20: 1G cable loose; doesn't seat well in NIC. If problem continues may need to bend clip or try other cables.
- rc38: Port 8 on 1G switch seems to be bad. Hopefully diagnostics on switch can tell us more. Connected to port 48T, which is the 1/10G uplink port, since no other ports were free. Diagnostics didn't provide much info. In the 4 days of uptime port 8 never successfully detected a connected cable, even after trying neighboring, known-working end-points' cables with it. It seems to be out of commission.
June 15: rc37 (failed to reboot), rc38 (failed to reboot), rc79

As of May 14, 2012 the ConnectX2 cards have been replaced by ConnectX3 cards, and the cluster is up to its full strength.

February 15: rc79
February 6: rc80
February 2: rc79 (2 am), rc79 (11 pm)
January 30: rc06, rc24, rc65
January 26: rc19, rc20
January 12: rc33
January 6: rc33
January 5: rc32
January 3: rc24, rc32
December 23: rc32, rc32 (10 minutes later...)
December 21: rc21, rc32
December 16: rc23
December 15: rc27
December 8: rc03, rc04, rc33

As of Dec 7, 2011 the power has been restored to all nodes.

December 7: rc10, rc20, rc32
December 5: rc16, rc18, rc20, rc36

As of Dec 1, 2011 the power has been shut off on odd nodes (rc01, rc03, ..., rc19).

December 1: rc04, rc18
November 30: rc04 (After disk replacement)

As of Nov 30, 2011 the flash disks were replaced with original hard disks in rc01-rc20.

November 30: rc05, rc07
November 28: rc01, rc07, rc16, rc20
November 27: rc05, rc11, rc12, rc16, rc19, rc20, rc21
November 24: rc04, rc16
November 21: rc07, rc08, rc12, rc18, rc19, rc20, rc36
November 19: rc16

As of Nov 18, 2011 (following rebooting the machines listed below) fans were reset to max speed (Ryan: I don't think this stuck long since the fans went to normal speed on reboot. After a couple of days they were all back at normal levels).

November 18: rc10, rc20, rc33 (rc20 had to be rebooted twice again to restart it), rc10 (again)
November 16: rc19
November 15:
November 14: rc08, rc10, rc11, rc16, rc19, rc33 (then rc19 crashed again)

As of Nov 11, 2011 the infiniband cables have been removed from rc01-rc20. These were removed after the failures of rc07, rc10, and rc16 were reported.

November 11: rc07, rc10, rc16
November 10: rc10, rc36
November 9: rc03, rc04, rc07, rc10, rc11, rc33
November 8: rc03, rc20 (Ryan: I've noticed over a few days rc20 must be rebooted twice to get it to come back up.)

As of Nov 7, 2011 the infiniband stacks (OFED) have been removed from rc01-rc20. Let's see if the IB drivers are causing the problem.

November 7: rc02, rc03, rc08, rc16, rc18, rc20
November 4: rc10, rc16, rc18
November 3: rc18
November 2: rc11, rc12, rc20
October 20: rc04, rc20, rc35
October 19: rc04, rc20
October 18: rc01, rc05, rc11, rc16, rc24
October 17: rc13, rc16, rc20
October 13: rc16
October 12: rc21, rc38
October 11: rc03, rc08, rc10, rc16, rc19
October 10: rc02, rc05, rc06, rc08, rc10, rc12, rc16, rc33, rc36
- rc02 and rc17 were up, but reported a read-only / file system
October 7: rc06, rc10, rc11, rc13, rc16, rc19
October 6: rc04
October 5: rc01
October 4: rc08
October 3: rc05, rc10, rc12, rc14, rc18, rc33
September 30: rc07, rc10, rc11, rc13, rc19, rc20, rc36
- JO restarted all of them, and all came up except rc19 & rc20.
  - rc19 & 20 back up (PSU decided to work again today?!)
September 28: rc04, rc19, rc20, rc21, rc25, rc26
- rc04 was powered off
- rc19/20 appears to have a bad power supply
- rc21 was at Linux login prompt with cursor blinking, but didn't respond to keyboard
- rc25/26 was not plugged in (oops)

Notes

Schedule of cluster changes

Aug 28: John's email "Can't ssh rc08"

Aug 24: unplugged 1/3rd of memory

Aug 17: SSDs arrived and installed

Aug 4: new servers installed and running recoveries with magnetic disks

Nov 30th:

Compared BIOS settings of various commonly failing machines (rc10, 16, 20) to some of the rc41+ machines, to little effect. Nothing too interesting, though there do seem to be a number of CMOS checksum errors logged for the ones that fail. This isn't consistent across failing nodes, and I haven't looked widely enough to know whether the errors also occur on good machines. Otherwise settings are pretty close across machines. A few are set to IDE rather than AHCI, but that doesn't appear to explain anything (some IDE nodes failed a lot, but then so have many AHCI ones).

Nov 11th:

Histogram of failures as of Nov 11th:

rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09: 
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15: 
rc16: *********
rc17: 
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22: 
rc23: 
rc24: *
rc25: *
rc26: *
rc27: 
rc28: 
rc29: 
rc30: 
rc31: 
rc32: 
rc33: ***
rc34: 
rc35: *
rc36: ***
rc37: 
rc38: *
rc39: 
rc40:

There were 53 failures on even nodes, 31 on odd nodes.
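The even/odd totals above can be re-derived from the histogram with a short script. This is only a sketch: the rows are copied from the table above, and the helper names (parse_histogram, split_totals) are made up for illustration. It also splits the counts into rc01-rc20 versus rc21-rc40, since the lower half of the rack is where most failures show up.

```python
import re

# Histogram rows copied from the Nov 11th table above (one '*' per failure).
HISTOGRAM = """\
rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09:
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15:
rc16: *********
rc17:
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22:
rc23:
rc24: *
rc25: *
rc26: *
rc27:
rc28:
rc29:
rc30:
rc31:
rc32:
rc33: ***
rc34:
rc35: *
rc36: ***
rc37:
rc38: *
rc39:
rc40:
"""

def parse_histogram(text):
    """Return {node_number: failure_count} from 'rcNN: ***' rows."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"rc(\d+):\s*(\**)\s*$", line.strip())
        if m:
            counts[int(m.group(1))] = len(m.group(2))
    return counts

def split_totals(counts):
    """Sum failures by node parity and by lower/upper half of the rack."""
    return {
        "even": sum(c for n, c in counts.items() if n % 2 == 0),
        "odd": sum(c for n, c in counts.items() if n % 2 == 1),
        "rc01-rc20": sum(c for n, c in counts.items() if n <= 20),
        "rc21-rc40": sum(c for n, c in counts.items() if n > 20),
    }

counts = parse_histogram(HISTOGRAM)
totals = split_totals(counts)
print(totals)
# {'even': 53, 'odd': 31, 'rc01-rc20': 71, 'rc21-rc40': 13}
```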

"System Temp" as reported by ipmi (rc01 first, rc40 last):

System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 20 degrees C | ok
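A listing like the one above can be turned into numbers with a few lines of Python. This is a sketch that assumes every row has the "System Temp | NN degrees C | ok" shape shown here; the function name parse_system_temps is hypothetical. The listing above appears to contain 39 rows rather than 40, so the sketch returns readings in listing order and does not try to attach node names to them.

```python
import re

# Matches rows such as "System Temp | 29 degrees C | ok".
ROW = re.compile(r"^System Temp\s*\|\s*(\d+) degrees C\s*\|\s*\w+\s*$")

def parse_system_temps(text):
    """Return the Celsius readings, in listing order, from matching rows."""
    temps = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            temps.append(int(m.group(1)))
    return temps

# The first three and last three rows of the listing above, as a small sample.
SAMPLE = """\
System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 20 degrees C | ok
"""

temps = parse_system_temps(SAMPLE)
print(len(temps), min(temps), max(temps))
# 6 20 29
```

Checking len(temps) against the expected node count before pairing readings with host names is a cheap way to catch a missing row.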

Nov 7th: Trying to see if OFED stack is causing the problem by removing it on rc01-rc20. Why the lower 20 are overrepresented compared to the upper 20 I don't know.