Failures
Failure Brainstorm
Failure Modes
- Loss of machine
- What does this mean? At what point do we label a misbehaving machine as toast?
- Multiple machine types: client, master, backup (separate service colocated w/ master), coordinator
- Loss of rack
- Common failure scenario - must be able to handle gracefully
- Implies we need # of racks >= # of copies of replicated data for durability, so that each copy can sit on a different rack and losing one rack costs at most one copy
- Network Partitions
- The system should stay sane (consistent) when these occur.
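
The rack constraint above (# of racks >= # of copies) can be sketched as a placement check. This is only an illustrative sketch, not how the system actually places replicas: the names `place_replicas` and `survives_rack_loss`, the rack/server labels, and the "first server on each rack" choice are all assumptions made for the example.

```python
def place_replicas(servers_by_rack, num_copies):
    """Pick one server from each of num_copies distinct racks.

    servers_by_rack: dict mapping a rack name to a list of server names
    (hypothetical layout, for illustration only).
    """
    # Only racks that actually have a usable server count.
    racks = sorted(r for r, servers in servers_by_rack.items() if servers)
    if len(racks) < num_copies:
        # Fewer racks than copies: two copies would have to share a rack,
        # so one rack failure could destroy more than one copy.
        raise ValueError(
            "need at least %d racks, have %d" % (num_copies, len(racks)))
    # Deterministic choice for the example; a real system would likely
    # balance load or randomize here.
    return [(rack, servers_by_rack[rack][0]) for rack in racks[:num_copies]]


def survives_rack_loss(placement, lost_rack):
    """True if at least one copy lives outside the lost rack."""
    return any(rack != lost_rack for rack, _server in placement)


cluster = {
    "rack1": ["s1", "s2"],
    "rack2": ["s3"],
    "rack3": ["s4", "s5"],
}
placement = place_replicas(cluster, 3)
# Every copy is on its own rack, so any single rack can be lost.
all_ok = all(survives_rack_loss(placement, r) for r in cluster)
```

With only two racks and three copies, `place_replicas` raises `ValueError` rather than silently doubling up on a rack.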
Failure Symptoms
- RPC Timeout
- Client->Server or Server->Server request doesn't yield a response within the expected time
- RPC Error
- Client or Server node detects a corrupt RPC or corrupt payload, or the server explicitly indicates an error
- Assertion failure
- Internal software consistency check violated; the process likely calls abort(3)
- Error During Recovery
- Many possibilities: backup sources die, reconstitution destinations die (e.g. during distributed index reconstruction), reconstituting master dies, data missing/invalid/unsalvageable, etc.
- Full power loss
- All lights go out.
- Partial power loss/Network partition
- Part(s) of the datacenter either go down, or appear to, while other part(s) stay up.
- Missing heartbeat
- Controller or master nodes expected a regular liveness ping, but didn't get one.
- Cluster coordinator dies/becomes unreachable
- Client asks wrong server for object
- Likely response: Not my data, not my problem.
- Backup I/O Error
- Backup disk/SSD/etc. gives an I/O error. Common enough to handle specially, or just SIITH?
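
The "missing heartbeat" symptom above can be sketched as a simple timeout-based monitor. This is a minimal sketch under assumptions, not the system's actual detector: the class name `HeartbeatMonitor`, the `timeout_s` parameter, and the node names are hypothetical. Note that it only reports suspicion; a slow node or a network partition looks the same from here, which is exactly the "when do we call a machine toast?" question.

```python
class HeartbeatMonitor:
    """Tracks the last liveness ping per node and flags overdue ones."""

    def __init__(self, timeout_s):
        # Assumed policy: a node is suspect once it has been silent
        # for longer than timeout_s seconds.
        self.timeout_s = timeout_s
        self.last_seen = {}

    def record_heartbeat(self, node_id, now):
        # 'now' is passed in explicitly so the logic is easy to test;
        # a real caller would probably pass time.monotonic().
        self.last_seen[node_id] = now

    def suspected(self, now):
        """Return the sorted list of nodes whose heartbeat is overdue."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=1.0)
monitor.record_heartbeat("master-a", now=0.0)
monitor.record_heartbeat("master-b", now=0.5)

# At t=1.2 only master-a has been silent for more than 1.0s.
overdue_early = monitor.suspected(now=1.2)

# master-a pings again; by t=1.6 master-b is the overdue one.
monitor.record_heartbeat("master-a", now=1.3)
overdue_late = monitor.suspected(now=1.6)
```

What to do with a suspected node (retry, probe from another machine, start recovery) is a separate policy decision that this sketch leaves open.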