This page logs instances of dead machines in reverse chronological order; among other things, we are using it to track down the mysterious machine crashes that occurred starting in August 2011.
As of Oct 3rd, the switch for rc01-rc40 has been replaced, and the cluster is up to its full strength.
- September 12: rc20 still down; rebooted the switch, but port 9 stayed down; tried several host/cable permutations; ports 8 and 9 still appear to be bad.
- September 4: rebooted, but fsck failed, so the file system was cleaned up (node zapped; you will see a warning about the host ID changing). - Satoshi, Ryan
- rc20: crashed (rebooted, but still no response)
- rc45, rc47, rc49-rc52: require a password (possible NFS problem)
- rc46: responds to pings but does not allow ssh
- rc48: does not respond to pings
- August 7: rc20 (no response to ipmi, see below)
- rc20: Ports 8 and 9 on the 1G switch seem to be bad, and the jack on rc20 itself is loose. There were no free ports left on that switch, so I plugged in the extra 50-port switch and routed rc20 and rc30's IPMI through it. I was then able to IPMI to rc20, which was up but its network(s) seemed down. Nothing too interesting. Zapped it, since it had been offline for a while. Unfortunately, the install script hangs because it sets the Ethernet MTU to 9000; that 50-port switch apparently can't handle jumbo frames. We should be using a normal MTU even once we get our switch problems worked out. I dropped the MTU back to 1500 and changed the NFS settings to use default-sized reads and writes. NFS still didn't work over UDP, so I switched it to TCP (see the sketch below). Things should work for now, but rc20 and rc30 are hobbled: they have a 100Mbit connection to rcnfs now. We'll need to troubleshoot/RMA the HP switch on the first rack (RAM-445).
- Update August 9: I rebooted the HP switch, and port 9 works again. rc20 and rc30 are back on it. Works for now; return the switch if this keeps happening (RAM-445). -Diego
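For reference, a minimal sketch of that workaround, assuming the node's 1G interface is named eth0 and the rcnfs export is mounted at /home (both names are assumptions, not taken from this log); the actual install scripts may set these differently.

```python
#!/usr/bin/env python3
# Hedged sketch (run as root) of the rc20/rc30 workaround described above:
# drop the Ethernet MTU from 9000 back to 1500, then remount the rcnfs export
# over TCP with default-sized reads/writes. Interface name and mount point
# are assumptions for illustration only.
import subprocess

INTERFACE = "eth0"      # assumed name of the 1G NIC on the rc nodes
NFS_SERVER = "rcnfs"    # NFS server mentioned in the log
MOUNT_POINT = "/home"   # hypothetical mount point for the rcnfs export

def run(cmd):
    """Echo a command and run it, stopping on the first failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Drop the MTU back to the standard 1500; the spare 50-port switch
#    apparently does not handle jumbo frames (MTU 9000).
run(["ip", "link", "set", "dev", INTERFACE, "mtu", "1500"])

# 2. Remount the export over TCP. Omitting rsize/wsize leaves the
#    default-sized reads and writes; NFS over UDP was still failing.
run(["umount", MOUNT_POINT])
run(["mount", "-t", "nfs", "-o", "proto=tcp",
     f"{NFS_SERVER}:{MOUNT_POINT}", MOUNT_POINT])
```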
- August 2: rc79
- July 31: rc79
- July 30: rc20 (no response to ipmi, see below), rc79
- rc20: Port 9 on the 1G switch seems to be bad.
- July 23: rc20 (no response to ipmi), rc61, rc63, rc64, rc68 (these are likely due to operator error, but needed manual maintenance to fix)
- July 19: rc03 (reimage), rc04 (reimage), rc20 (no response to ipmi, see below), rc26 (reimage), rc37, rc38 (no response to ipmi, see below), rc39, rc78 (reimage), rc80
- rc20: 1G cable loose; doesn't seat well in the NIC. If the problem continues, we may need to bend the clip or try other cables.
- rc38: Port 8 on the 1G switch seems to be bad. Hopefully the switch's diagnostics can tell us more. Connected rc38 to port 48T, the 1/10G uplink port, since no other ports were free. Diagnostics didn't provide much info. In 4 days of uptime, port 8 never successfully detected a connected cable, even after trying cables from neighboring, known-working endpoints. It seems to be out of commission.
- June 15: rc37 (failed to reboot), rc38 (failed to reboot), rc79
As of May 14, 2012 the ConnectX2 cards have been replaced by ConnectX3 cards, and the cluster is up to its full strength.
- February 15: rc79
- February 6: rc80
- February 2: rc79 (2 am), rc79 (11 pm)
- January 30: rc06, rc24, rc65
- January 26: rc19, rc20
- January 12: rc33
- January 6: rc33
- January 5: rc32
- January 3: rc24, rc32
- December 23: rc32, rc32 (10 minutes later...)
- December 21: rc21, rc32
- December 16: rc23
- December 15: rc27
- December 8: rc03, rc04, rc33
As of Dec 7, 2011 the power has been restored to all nodes.
- December 7: rc10, rc20, rc32
- December 5: rc16, rc18, rc20, rc36
As of Dec 1, 2011 the power has been shut off on the odd nodes (rc01, rc03, ..., rc19).
- December 1: rc04, rc18
- November 30: rc04 (After disk replacement)
As of Nov 30, 2011 the flash disks in rc01-rc20 were replaced with the original hard disks.
- November 30: rc05, rc07
- November 28: rc01, rc07, rc16, rc20
- November 27: rc05, rc11, rc12, rc16, rc19, rc20, rc21
- November 24: rc04, rc16
- November 21: rc07, rc08, rc12, rc18, rc19, rc20, rc36
- November 19: rc16
As of Nov 18, 2011 (after rebooting the machines listed below), the fans were reset to max speed. (Ryan: I don't think this stuck for long, since the fans went back to normal speed on reboot. After a couple of days they were all back at normal levels.)
- November 18: rc10, rc20, rc33 (rc20 again had to be rebooted twice to restart it), rc10 (again)
- November 16: rc19
- November 15:
- November 14: rc08, rc10, rc11, rc16, rc19, rc33 (then rc19 crashed again)
As of Nov 11, 2011 the infiniband cables have been removed from rc01-rc20. These were removed after the failures of rc07, rc10, and rc16 were reported.
- November 11: rc07, rc10, rc16
- November 10: rc10, rc36
- November 9: rc03, rc04, rc07, rc10, rc11, rc33
- November 8: rc03, rc20 (Ryan: I've noticed over the past few days that rc20 must be rebooted twice to get it to come back up.)
As of Nov 7, 2011 the infiniband stacks (OFED) have been removed from rc01-rc20. Let's see if the IB drivers are causing the problem.
- November 7: rc02, rc03, rc08, rc16, rc18, rc20
- November 4: rc10, rc16, rc18
- November 3: rc18
- November 2: rc11, rc12, rc20
- October 20: rc04, rc20, rc35
- October 19: rc04, rc20
- October 18: rc01, rc05, rc11, rc16, rc24
- October 17: rc13, rc16, rc20
- October 13: rc16
- October 12: rc21, rc38
- October 11: rc03, rc08, rc10, rc16, rc19
- October 10: rc02, rc05, rc06, rc08, rc10, rc12, rc16, rc33, rc36
- rc02 and rc17 were up, but reported a read-only / (root) file system
- October 7: rc06, rc10, rc11, rc13, rc16, rc19
- October 6: rc04
- October 5: rc01
- October 4: rc08
- October 3: rc05, rc10, rc12, rc14, rc18, rc33
- September 30: rc07, rc10, rc11, rc13, rc19, rc20, rc36
- JO restarted all of them, and all came up except rc19 & rc20.
- rc19 & 20 back up (PSU decided to work again today?!)
- September 28: rc04, rc19, rc20, rc21, rc25, rc26
- rc04 was powered off
- rc19/20 appears to have a bad power supply
- rc21 was at Linux login prompt with cursor blinking, but didn't respond to keyboard
- rc25/26 was not plugged in (oops)
Schedule of cluster changes
Aug 28: John's email "Can't ssh rc08"
Aug 24: unplugged 1/3 of the memory
Aug 17: SSDs arrived and installed
Aug 4: new servers installed and running recoveries with magnetic disks
Compared the BIOS settings of several commonly failing machines (rc10, rc16, rc20) against some of the rc41+ machines, to little effect. Nothing too interesting, though there do seem to be a number of CMOS checksum errors logged on the ones that fail. This isn't consistent across failing nodes, nor have I looked widely enough to see whether the errors also occur on good machines. Otherwise, settings are pretty close across machines. A few are set to IDE rather than AHCI, but that doesn't appear to explain anything (some IDE nodes have failed a lot, but then so have many AHCI ones).
Histogram of failures as of Nov 11th:
There were 53 failures on even nodes, 31 on odd nodes.
"System Temp" as reported by ipmi (rc01 first, rc40 last):
Nov 7th: Trying to see if the OFED stack is causing the problem by removing it from rc01-rc20. I don't know why the lower 20 nodes are overrepresented compared to the upper 20.