Cluster Custodian

If you notice a down machine that does not respond to rcreboot, ping the IRC user listed below for the current week.

  • 7/22/12: stutsman
  • 7/29/12: ankitak
  • 8/5/12: ongardie
  • 8/12/12: daeschli
  • 8/19/12: mendel
  • 8/26/12: ouster
  • 9/2/12: syang0
  • 9/9/12: satoshi

The current custodian is responsible for restarting, debugging, and reimaging machines and generally keeping the cluster working.

Crashes

This page logs instances of dead machines; we are using it to track down the mysterious machine crashes that occurred starting in August 2011.

  • June 15: rc37 (failed to reboot), rc38 (failed to reboot), rc79

As of May 14, 2012 the ConnectX2 cards have been replaced by ConnectX3 cards, and the cluster is up to its full strength.

  • February 15: rc79
  • February 6: rc80
  • February 2: rc79 (2 am), rc79 (11 pm)
  • January 30: rc06, rc24, rc65
  • January 26: rc19, rc20
  • January 12: rc33
  • January 6: rc33
  • January 5: rc32
  • January 3: rc24, rc32
  • December 23: rc32, rc32 (10 minutes later...)
  • December 21: rc21, rc32
  • December 16: rc23
  • December 15: rc27
  • December 8: rc03, rc04, rc33

As of Dec 7, 2011, power has been restored to all nodes.

  • December 7: rc10, rc20, rc32
  • December 5: rc16, rc18, rc20, rc36

As of Dec 1, 2011, power has been shut off on the odd nodes (rc01, rc03, ..., rc19).

  • December 1: rc04, rc18
  • November 30: rc04 (After disk replacement)

As of Nov 30, 2011, the flash disks in rc01-rc20 were replaced with the original hard disks.

  • November 30: rc05, rc07
  • November 28: rc01, rc07, rc16, rc20
  • November 27: rc05, rc11, rc12, rc16, rc19, rc20, rc21
  • November 24: rc04, rc16
  • November 21: rc07, rc08, rc12, rc18, rc19, rc20, rc36
  • November 19: rc16

As of Nov 18, 2011 (after rebooting the machines listed below), the fans were reset to max speed. (Ryan: I don't think this stuck for long, since the fans went back to normal speed on reboot. After a couple of days they were all back at normal levels.)

  • November 18: rc10, rc20, rc33 (rc20 had to be rebooted twice again to restart it), rc10 (again)
  • November 16: rc19
  • November 15:
  • November 14: rc08, rc10, rc11, rc16, rc19, rc33 (then rc19 crashed again)

As of Nov 11, 2011, the InfiniBand cables have been removed from rc01-rc20. They were removed after the failures of rc07, rc10, and rc16 were reported.

  • November 11: rc07, rc10, rc16
  • November 10: rc10, rc36
  • November 9: rc03, rc04, rc07, rc10, rc11, rc33
  • November 8: rc03, rc20 (Ryan: Over a few days I've noticed that rc20 must be rebooted twice to come back up.)

As of Nov 7, 2011, the InfiniBand stacks (OFED) have been removed from rc01-rc20. Let's see if the IB drivers are causing the problem.

  • November 7: rc02, rc03, rc08, rc16, rc18, rc20
  • November 4: rc10, rc16, rc18
  • November 3: rc18
  • November 2: rc11, rc12, rc20
  • October 20: rc04, rc20, rc35
  • October 19: rc04, rc20
  • October 18: rc01, rc05, rc11, rc16, rc24
  • October 17: rc13, rc16, rc20
  • October 13: rc16
  • October 12: rc21, rc38
  • October 11: rc03, rc08, rc10, rc16, rc19
  • October 10: rc02, rc05, rc06, rc08, rc10, rc12, rc16, rc33, rc36
    • rc02 and rc17 were up, but reported a read-only / file system
  • October 7: rc06, rc10, rc11, rc13, rc16, rc19
  • October 6: rc04
  • October 5: rc01
  • October 4: rc08
  • October 3: rc05, rc10, rc12, rc14, rc18, rc33
  • September 30: rc07, rc10, rc11, rc13, rc19, rc20, rc36
    • JO restarted all of them, and all came up except rc19 & rc20.
      • rc19 & 20 back up (PSU decided to work again today?!)
  • September 28: rc04, rc19, rc20, rc21, rc25, rc26
    • rc04 was powered off
    • rc19/20 appear to have a bad power supply
    • rc21 was at the Linux login prompt with the cursor blinking, but didn't respond to the keyboard
    • rc25/26 were not plugged in (oops)

Notes

Schedule of cluster changes

Aug 28: John's email "Can't ssh rc08"

Aug 24: unplugged 1/3rd of memory

Aug 17: SSDs arrived and installed

Aug 4: new servers installed and running recoveries with magnetic disks


Nov 30th:

Compared the BIOS settings of several commonly failing machines (rc10, rc16, rc20) with those of some of the rc41+ machines, to little effect. Nothing too interesting turned up, though a number of CMOS checksum errors are logged on the machines that fail. This isn't consistent across failing nodes, and I haven't looked widely enough to know whether the errors also occur on good machines. Otherwise the settings are close across machines. A few are set to IDE rather than AHCI, but that doesn't appear to explain anything (some IDE nodes failed a lot, but so have many AHCI ones).

Nov 11th:

Histogram of failures as of Nov 11th:

rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09: 
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15: 
rc16: *********
rc17: 
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22: 
rc23: 
rc24: *
rc25: *
rc26: *
rc27: 
rc28: 
rc29: 
rc30: 
rc31: 
rc32: 
rc33: ***
rc34: 
rc35: *
rc36: ***
rc37: 
rc38: *
rc39: 
rc40:

There were 53 failures on even nodes, 31 on odd nodes.
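
The even/odd totals can be re-derived from the histogram itself. The following is a minimal sketch, assuming the histogram is pasted in as text with one `*` per recorded failure; the function names (`parse_histogram`, `parity_totals`, `range_totals`) are made up for this illustration and are not part of any cluster tooling.

```python
# Histogram copied from this page: one '*' per recorded failure.
HISTOGRAM = """
rc01: **
rc02: **
rc03: ****
rc04: *****
rc05: ***
rc06: **
rc07: ***
rc08: ****
rc09:
rc10: *********
rc11: *****
rc12: ***
rc13: ***
rc14: *
rc15:
rc16: *********
rc17:
rc18: ****
rc19: ****
rc20: ********
rc21: **
rc22:
rc23:
rc24: *
rc25: *
rc26: *
rc27:
rc28:
rc29:
rc30:
rc31:
rc32:
rc33: ***
rc34:
rc35: *
rc36: ***
rc37:
rc38: *
rc39:
rc40:
"""


def parse_histogram(text):
    """Return {node_number: failure_count} from 'rcNN: ***' lines."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, stars = line.partition(":")
        counts[int(name[2:])] = stars.count("*")  # 'rc07' -> 7
    return counts


def parity_totals(counts):
    """Return (failures on even-numbered nodes, failures on odd-numbered nodes)."""
    even = sum(c for n, c in counts.items() if n % 2 == 0)
    odd = sum(c for n, c in counts.items() if n % 2 == 1)
    return even, odd


def range_totals(counts, split):
    """Return (failures on nodes <= split, failures on nodes > split)."""
    lower = sum(c for n, c in counts.items() if n <= split)
    upper = sum(c for n, c in counts.items() if n > split)
    return lower, upper


if __name__ == "__main__":
    counts = parse_histogram(HISTOGRAM)
    print("even/odd:", parity_totals(counts))
    print("rc01-rc20 / rc21-rc40:", range_totals(counts, 20))
```

Run against the histogram above, this gives 53 even and 31 odd failures, and 71 failures on rc01-rc20 against 13 on rc21-rc40.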

 

"System Temp" as reported by ipmi (rc01 first, rc40 last):

System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 26 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 25 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 24 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 22 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 23 degrees C | ok
System Temp | 20 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 20 degrees C | ok
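
To compare readings like these over time, the pipe-separated lines can be parsed and summarized. This is a rough sketch that assumes the readings are saved as plain text in exactly the format shown above (it resembles `ipmitool sdr` output, but this page does not say which command produced it); `parse_temps` and `summarize` are hypothetical helper names, and the sample is a short excerpt from the list, not the full set.

```python
import re

# Matches lines such as "System Temp | 29 degrees C | ok".
READING = re.compile(
    r"^\s*(?P<sensor>[^|]+?)\s*\|\s*(?P<value>-?\d+(?:\.\d+)?)\s+degrees C"
    r"\s*\|\s*(?P<status>\S+)\s*$"
)

# Excerpt of the readings listed on this page (first three and last two).
SAMPLE = """
System Temp | 29 degrees C | ok
System Temp | 28 degrees C | ok
System Temp | 27 degrees C | ok
System Temp | 21 degrees C | ok
System Temp | 20 degrees C | ok
"""


def parse_temps(text, sensor="System Temp"):
    """Return the temperatures (in degrees C) for the named sensor, in input order."""
    temps = []
    for line in text.splitlines():
        match = READING.match(line)
        if match and match.group("sensor") == sensor:
            temps.append(float(match.group("value")))
    return temps


def summarize(temps):
    """Return (min, max, mean) of a non-empty list of temperatures."""
    if not temps:
        raise ValueError("no temperature readings found")
    return min(temps), max(temps), sum(temps) / len(temps)


if __name__ == "__main__":
    low, high, mean = summarize(parse_temps(SAMPLE))
    print(f"min={low} max={high} mean={mean:.1f}")
```

Lines that do not match the expected format are skipped rather than treated as errors, so blank lines or other sensors in the same dump do not break the summary.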

 

Nov 7th: Removed the OFED stack from rc01-rc20 to see whether it is causing the problem. I don't know why the lower 20 nodes are overrepresented compared to the upper 20.
