Failures
Failure Brainstorm
Failure Modes
Loss of machine
What does this mean? When do we label a misbehaving machine as toast?
Multiple machine types: client, master, backup (separate service colocated w/ master), coordinator
Loss of rack
Common failure scenario - must be able to handle gracefully
Implies we need # of racks >= # of copies of replicated data, so that each copy can sit on a different rack and losing one rack costs at most one copy
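The rack constraint above can be sketched as a placement check. This is only an illustration, not the system's actual placement policy: the function names, the rack/machine dictionary, and the "first machine in each rack" choice are all assumptions made for the example.

```python
def place_replicas(racks, num_copies):
    """Pick one machine from each of num_copies distinct racks.

    racks: dict mapping a (hypothetical) rack id to a list of machine names.
    Raises ValueError when there are fewer non-empty racks than copies.
    """
    usable = [rack for rack, machines in sorted(racks.items()) if machines]
    if len(usable) < num_copies:
        raise ValueError("need at least as many racks as copies")
    # Assumption: take the first machine in each rack; a real policy
    # would also weigh load and free space.
    return [(rack, racks[rack][0]) for rack in usable[:num_copies]]


def survives_rack_loss(placement, lost_rack):
    """True if at least one copy lives outside the lost rack."""
    return any(rack != lost_rack for rack, _ in placement)
```

With three racks and three copies, every copy lands on its own rack, so losing any single rack still leaves two copies reachable.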
Network Partitions
Should not lose sanity when these occur.
Failure Symptoms
RPC Timeout
Client->Server or Server->Server request doesn't yield quick response
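A minimal sketch of how a caller might observe this symptom, assuming a blocking RPC function wrapped with a deadline. `call_with_timeout` and `RpcTimeout` are hypothetical names for the example, not part of any existing API.

```python
import concurrent.futures


class RpcTimeout(Exception):
    """Raised when no response arrives within the deadline."""


def call_with_timeout(rpc, timeout_s, *args):
    """Run rpc(*args) in a worker thread and wait at most timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(rpc, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The request may still have executed on the server; the caller
        # only knows that no reply arrived in time.
        raise RpcTimeout("no response within %.3fs" % timeout_s) from None
    finally:
        # Do not block on the (possibly still running) worker thread.
        pool.shutdown(wait=False)
```

Note that a timeout alone cannot distinguish a dead server from a slow one or from a lost reply, which is why it is listed as a symptom rather than a failure mode.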
RPC Error
Client or server node detects a corrupt RPC or corrupt payload, or the server explicitly indicates an error
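One way a node could detect a corrupt payload is a checksum carried with each message. The sketch below assumes a made-up wire format (4-byte big-endian CRC32 followed by the payload); the notes do not specify a format, and `frame`, `unframe`, and `RpcCorrupt` are illustrative names.

```python
import struct
import zlib


class RpcCorrupt(Exception):
    """Raised when a received message fails its integrity check."""


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 (assumed wire format)."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def unframe(message: bytes) -> bytes:
    """Verify the CRC32 prefix and return the payload, or raise RpcCorrupt."""
    if len(message) < 4:
        raise RpcCorrupt("message shorter than checksum header")
    (expected,) = struct.unpack(">I", message[:4])
    payload = message[4:]
    if zlib.crc32(payload) != expected:
        raise RpcCorrupt("checksum mismatch")
    return payload
```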
Assertion failure
Internal software consistency check violated; the process likely calls abort(3)
Error During Recovery
Many possibilities: backup sources die, reconstitution destinations die (e.g., during distributed index reconstruction), the reconstituting master dies, data is missing/invalid/unsalvageable, etc.
Full power loss
All lights go out.
Partial power loss/Network partition
Part(s) of the datacenter either go down or appear to, while other part(s) stay up.
Missing heartbeat
Controller or master nodes expected a regular liveness ping, but didn't get one.
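The missing-heartbeat symptom can be illustrated with a small timeout-based monitor. This is a sketch under assumptions: the class name, the single fixed timeout, and passing timestamps in explicitly (rather than reading a clock) are choices made for the example.

```python
class HeartbeatMonitor:
    """Tracks the last liveness ping per node and flags overdue nodes."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of last ping, in seconds

    def record(self, node, now):
        """Note that a ping from node arrived at time now."""
        self.last_seen[node] = now

    def suspects(self, now):
        """Return nodes whose last ping is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A node that shows up in `suspects` has only missed a ping; whether it is actually dead, slow, or partitioned still has to be decided elsewhere.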
Cluster coordinator dies/becomes unreachable
Client asks wrong server for object
Likely response: Not my data, not my problem.
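A sketch of the "not my data" exchange, with one assumed follow-up: the client refreshes its key-to-server map and retries once. The retry step, the status strings, and all names here are assumptions for illustration; the notes only say the server rejects the request.

```python
class Server:
    """Toy server; as a simplification, it owns exactly the keys it stores."""

    def __init__(self, objects):
        self.objects = dict(objects)

    def read(self, key):
        if key not in self.objects:
            return ("NOT_MINE", None)  # not my data, not my problem
        return ("OK", self.objects[key])


def client_read(key, cached_map, fetch_map, servers):
    """Read key using a possibly stale key -> server-name map.

    fetch_map is a hypothetical callable returning a fresh map, e.g. from
    the coordinator. The client retries once after refreshing.
    """
    status, value = servers[cached_map[key]].read(key)
    if status == "NOT_MINE":
        cached_map.update(fetch_map())
        status, value = servers[cached_map[key]].read(key)
    return status, value
```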
Backup I/O Error
Backup disk/SSD/etc. gives an I/O error. Common enough to handle specially, or just SIITH?