Failure Brainstorm

Failure Modes

  • Loss of machine
    • What does this mean? When do we label a misbehaving machine as toast?
  • Loss of rack
    • Common failure scenario - must be able to handle gracefully
  • Network Partitions
    • Should not lose sanity when these occur.

Failure Symptoms

  • RPC Timeout
    • A Client->Server or Server->Server request doesn't yield a timely response
  • RPC Error
    • A client or server node detects a corrupt RPC or corrupt payload, or the server explicitly indicates an error
  • Assertion failure
    • Internal software consistency violated; the process likely calls abort(3)
  • Error During Recovery
  • Full power loss
    • All lights go out.
  • Partial power loss/Network partition
    • Part(s) of the datacenter either go down, or appear to, while other part(s) stay up.
  • Missing heartbeat
    • Controller or master nodes expected a regular liveness ping but didn't get one.
  • Cluster coordinator dies/becomes unreachable
  • Client asks wrong server for object
    • Likely response: Not my data, not my problem.
  • Backup I/O Error
    • Backup disk/SSD/etc. returns an I/O error. Common enough to handle specially, or just SIITH?
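
As one illustration of the "Missing heartbeat" symptom above, here is a minimal Python sketch of how a controller might notice that an expected liveness ping did not arrive. Everything in it is an assumption made for illustration: the HeartbeatMonitor class, its method names, and the timeout value are hypothetical and do not come from this page or from any existing implementation. It only reports which nodes are overdue; deciding whether an overdue node is actually "toast" is left open, as in the brainstorm.

```python
import time


class HeartbeatMonitor:
    """Hypothetical helper: remembers when each node last pinged."""

    def __init__(self, timeout_s, clock=time.monotonic):
        # timeout_s is an assumed threshold; the page does not specify one.
        self.timeout_s = timeout_s
        # The clock is injectable so the logic can be tested without sleeping.
        self.clock = clock
        self.last_seen = {}

    def record_ping(self, node_id):
        """Note that node_id sent a liveness ping just now."""
        self.last_seen[node_id] = self.clock()

    def missing(self):
        """Return the sorted ids of nodes whose last ping is older than timeout_s."""
        now = self.clock()
        return sorted(
            node_id
            for node_id, seen_at in self.last_seen.items()
            if now - seen_at > self.timeout_s
        )
```

A caller would invoke record_ping() whenever a ping arrives and poll missing() periodically. Note that a missing heartbeat alone cannot distinguish a dead machine from a network partition, so the result is best treated as a symptom to investigate rather than a verdict.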