Failure Brainstorm

Failure Modes

Loss of machine
- What does this mean? When do we label a misbehaving machine as toast?
- Multiple machine types: client, master, backup (separate service colocated w/ master), coordinator
Loss of rack
- Common failure scenario - must be able to handle gracefully
- Implies we need # of racks >= # of copies of replicated data for durability, so that each copy can sit on a different rack and losing one rack never takes out every copy
Network partitions
- The system should stay sane (i.e., remain consistent) when these occur.
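
The rack constraint above can be sketched in a few lines. This is only an illustration under assumptions: the function names, the rack-to-backup map, and the "first backup in each rack" choice are hypothetical, not part of any existing design.

```python
def place_replicas(backups_by_rack, copies):
    """Pick one backup from each of `copies` distinct racks.

    backups_by_rack: dict mapping rack id -> list of backup host names
    (hypothetical layout). Raises ValueError when there are fewer
    usable racks than copies, i.e. when # of racks < # of copies.
    """
    usable = sorted(rack for rack, hosts in backups_by_rack.items() if hosts)
    if len(usable) < copies:
        raise ValueError(
            f"need {copies} racks for {copies} copies, only {len(usable)} usable"
        )
    # Deterministic choice keeps the sketch easy to test; a real system
    # would presumably randomize or balance by load.
    return [(rack, backups_by_rack[rack][0]) for rack in usable[:copies]]


def survives_rack_loss(placement, lost_rack):
    """True if at least one copy lives outside the lost rack."""
    return any(rack != lost_rack for rack, _host in placement)
```

With two or more copies placed this way, any single rack can be lost and at least one copy remains; with a single copy, losing its rack loses the data.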

RPC Timeout
- Client->Server or Server->Server request doesn't get a response within the timeout
RPC Error
- Client or server node detects a corrupt RPC or corrupt payload, or the server explicitly indicates an error
Assertion failure
- Internal software consistency violated; the process likely calls abort(3)
Error during recovery
- Many possibilities: backup sources die, reconstitution destinations die (e.g., during distributed index reconstruction), the reconstituting master dies, data is missing/invalid/unsalvageable, etc.
Full power loss
- All lights go out.
Partial power loss/Network partition
- Part(s) of the datacenter either go down, or appear to, while other part(s) stay up.
Missing heartbeat
- Coordinator or master node expected a regular liveness ping, but didn't get one.
Cluster coordinator dies/becomes unreachable
Client asks wrong server for object
- Likely response: Not my data, not my problem.
Backup I/O Error
- Backup disk/SSD/etc. returns an I/O error. Is this common enough to handle specially, or just SIITH?
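
As a rough sketch of the missing-heartbeat case listed above, the detector below tracks the last ping time per node and reports nodes that have been silent for longer than a timeout. The class name, node names, and timeout value are all hypothetical. Timestamps are passed in explicitly so the logic is easy to test.

```python
class HeartbeatMonitor:
    """Tracks liveness pings and reports nodes that have gone quiet."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # assumed threshold, in seconds
        self.last_seen = {}          # node name -> time of last ping

    def record(self, node, now):
        """Note that `node` sent a liveness ping at time `now`."""
        self.last_seen[node] = now

    def suspects(self, now):
        """Return nodes whose last ping is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A node in the suspect list is not necessarily dead: a missing heartbeat looks the same whether the machine crashed or is merely unreachable, which is why the partition cases above need separate handling.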