Failures

Failure Brainstorm

Failure Modes

  • Loss of machine

    • What does this mean? At what point do we declare a misbehaving machine dead?

    • Multiple machine types: client, master, backup (separate service colocated w/ master), coordinator

  • Loss of rack

    • Common failure scenario - must be able to handle gracefully

    • Implies we need # of racks >= # of copies of replicated data, with each copy on a different rack, for durability

  • Network Partitions

    • The system must stay consistent and keep behaving sensibly when these occur.
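
The rack constraint above can be sketched as a placement check. This is only an illustration: the function names, the rack/backup ids, and the "take the first backup in each rack" policy are assumptions, not anything decided here.

```python
def place_replicas(backups_by_rack, num_copies):
    """Pick one backup from each of num_copies distinct racks.

    backups_by_rack: dict of rack id -> list of backup ids (hypothetical names).
    Raises ValueError when there are fewer non-empty racks than copies,
    i.e. when "# of racks >= # of copies" does not hold.
    """
    usable = sorted(rack for rack, hosts in backups_by_rack.items() if hosts)
    if len(usable) < num_copies:
        raise ValueError(
            "need %d racks with backups, only %d available"
            % (num_copies, len(usable))
        )
    # Assumed policy: first backup in each chosen rack. A real system would
    # presumably balance load; that is out of scope for this sketch.
    return [(rack, backups_by_rack[rack][0]) for rack in usable[:num_copies]]


def survives_rack_loss(placement, lost_rack):
    """True if at least one copy lives outside the lost rack."""
    return any(rack != lost_rack for rack, _ in placement)
```

With copies on two or more distinct racks, losing any single rack still leaves a copy; with every copy in one rack, it does not.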

Failure Symptoms

  • RPC Timeout

    • Client->Server or Server->Server request doesn't yield a response within the expected time

  • RPC Error

    • Client or server node detects a corrupt RPC or corrupt payload, or the server explicitly indicates an error

  • Assertion failure

    • Internal software consistency violated; the process likely calls abort(3)

  • Error During Recovery

    • Many possibilities: backup sources die, reconstitution destinations die (e.g., if distributed index reconstruction), reconstituting master dies, data missing/invalid/unsalvageable, etc.

  • Full power loss

    • All lights go out.

  • Partial power loss/Network partition

    • Part(s) of the datacenter either go down, or appear to, while other part(s) stay up.

  • Missing heartbeat

    • Coordinator or master nodes expected a regular liveness ping, but didn't get one.

  • Cluster coordinator dies/becomes unreachable

  • Client asks wrong server for object

    • Likely response: Not my data, not my problem.

  • Backup I/O Error

    • Backup disk/SSD/etc. gives an I/O error. Common enough to handle specially or just SIITH?
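
As an illustration of the "missing heartbeat" symptom above, here is a minimal sketch of a timeout-based liveness tracker. The class name, the timeout value, and the node ids are hypothetical. It only reports suspects: a missed ping may mean a dead machine or a partition that merely makes the node appear down, so what to do with a suspect is left open.

```python
import time


class HeartbeatMonitor:
    """Tracks the last ping time per node and reports overdue nodes."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed tunable; no value is proposed here
        self.clock = clock          # injectable so tests need not sleep
        self.last_seen = {}

    def record(self, node_id):
        """Call whenever a liveness ping arrives from node_id."""
        self.last_seen[node_id] = self.clock()

    def suspects(self):
        """Nodes whose last ping is older than timeout_s, sorted by id."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

Usage would be something like calling record() from the ping handler and polling suspects() periodically. A monotonic clock is used so that wall-clock adjustments do not produce false timeouts.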