Failures

Failure Brainstorm

Failure Modes

  • Loss of machine
    • What does this mean? When do we label a misbehaving machine as toast?
    • Multiple machine types: client, master, backup (separate service colocated w/ master), coordinator
  • Loss of rack
    • Common failure scenario - must be able to handle gracefully
    • Implies we need # of racks >= # of copies of replicated data for durability
  • Network Partitions
    • Should not lose sanity when these occur.
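
A minimal sketch of the rack constraint above (# of racks >= # of copies), assuming each copy goes on a distinct rack. The names place_replicas, survives_rack_loss, and the rack/host ids are hypothetical, made up for illustration, and are not part of any existing code.

```python
def place_replicas(racks, copies):
    """Pick one backup host from each of `copies` distinct racks.

    `racks` maps a (hypothetical) rack id to the list of backup hosts in it.
    Raises ValueError when fewer non-empty racks exist than copies requested.
    """
    usable = sorted(rack for rack, hosts in racks.items() if hosts)
    if len(usable) < copies:
        raise ValueError(
            f"need {copies} racks for {copies} copies, only {len(usable)} usable"
        )
    # One host per rack, so no two copies share a rack.
    return [(rack, racks[rack][0]) for rack in usable[:copies]]


def survives_rack_loss(placement, lost_rack):
    """True if at least one copy lives outside the lost rack."""
    return any(rack != lost_rack for rack, _host in placement)
```

With racks {"r1": ["a", "b"], "r2": ["c"], "r3": ["d"]} and 3 copies, every copy lands on a different rack, so losing any single rack leaves two copies. Asking for 2 copies with only one rack raises ValueError instead of silently doubling up.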

Failure Symptoms

  • RPC Timeout
    • Client->Server or Server->Server request doesn't yield a quick response
  • RPC Error
    • Client or server node detects a corrupt RPC or corrupt payload, or the server explicitly indicates an error
  • Assertion failure
    • Internal software consistency violated; process likely calls abort(3)
  • Error During Recovery
    • Many possibilities: backup sources die, reconstitution destinations die (e.g., if distributed index reconstruction), reconstituting master dies, data missing/invalid/unsalvageable, etc.
  • Full power loss
    • All lights go out.
  • Partial power loss/Network partition
    • Part(s) of the datacenter either go down, or appear to, while other part(s) stay up.
  • Missing heartbeat
    • Controller or master nodes expected a regular liveness ping, but didn't get one.
  • Cluster coordinator dies/becomes unreachable
  • Client asks wrong server for object
    • Likely response: Not my data, not my problem.
  • Backup I/O Error
    • Backup disk/SSD/etc. gives an I/O error. Common enough to handle specially or just SIITH?
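
One possible way to turn the "missing heartbeat" symptom into a decision, sketched under assumptions: a node is only suspected after several consecutive missed pings, not on the first one. HeartbeatMonitor, the interval, and the missed_limit threshold are all hypothetical; nothing above fixes these values. Time is passed in explicitly (in seconds) so the logic is easy to test.

```python
class HeartbeatMonitor:
    """Tracks the last liveness ping seen from each node (times in seconds)."""

    def __init__(self, interval, missed_limit):
        self.interval = interval          # expected gap between pings
        self.missed_limit = missed_limit  # missed pings before suspecting a node
        self.last_seen = {}

    def record(self, node, now):
        """Note that `node` pinged at time `now`."""
        self.last_seen[node] = now

    def missed(self, node, now):
        """Whole heartbeat intervals elapsed since the node's last ping."""
        return int((now - self.last_seen[node]) // self.interval)

    def suspects(self, now):
        """Nodes that have missed at least `missed_limit` pings, sorted by name."""
        return sorted(
            node for node in self.last_seen
            if self.missed(node, now) >= self.missed_limit
        )
```

A suspect here is only a candidate for being declared dead. During a network partition a healthy node can look identical to a crashed one, so whatever acts on suspects() still has to decide what "toast" means.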