...

Loss of machine
- What does this mean? When do we label a misbehaving machine as toast?
- Multiple machine types: client, master, backup (separate service colocated w/ master), coordinator
Loss of rack
- Common failure scenario - must be able to handle gracefully
- Implies we need # of racks >= # of copies of replicated data for durability
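The rack constraint above can be sketched as a placement check. This is only an illustration, not the system's actual placement logic: the `racks` layout, `place_replicas`, and `survives_rack_loss` are hypothetical names, and the policy of taking the first host in each rack is an arbitrary simplification.

```python
def place_replicas(racks, num_replicas):
    """Pick one host from each of num_replicas distinct racks.

    racks: dict mapping a rack name to a list of host names (hypothetical layout).
    Raises ValueError when fewer racks with hosts exist than replicas requested.
    """
    usable = sorted(name for name, hosts in racks.items() if hosts)
    if len(usable) < num_replicas:
        raise ValueError(
            f"need {num_replicas} racks with hosts, only {len(usable)} available"
        )
    # One host per rack, so no single rack holds two copies of the same data.
    return [(name, racks[name][0]) for name in usable[:num_replicas]]


def survives_rack_loss(placement, lost_rack):
    """True if at least one replica lives outside the lost rack."""
    return any(rack != lost_rack for rack, _host in placement)
```

With one copy per rack, losing any single rack removes at most one copy, which is why the rack count has to be at least the replica count.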
Network Partitions
- Should not lose sanity when these occur.

...

RPC Timeout
- Client->Server or Server->Server request doesn't yield quick response
RPC Error
- Client or Server node detects a corrupt RPC or corrupt payload, or the server explicitly indicates an error
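One way a node might detect a corrupt payload is a checksum carried with each message. The sketch below assumes a made-up framing (a 4-byte CRC32 prefix, then the payload). `frame`, `unframe`, and `CorruptRpcError` are hypothetical names, not part of any existing RPC layer.

```python
import struct
import zlib


class CorruptRpcError(Exception):
    """Raised when a received message fails its integrity check."""


def frame(payload: bytes) -> bytes:
    # Prefix the payload with its CRC32 in network byte order.
    return struct.pack("!I", zlib.crc32(payload)) + payload


def unframe(message: bytes) -> bytes:
    # Recompute the CRC32 on receipt and compare it with the prefix.
    if len(message) < 4:
        raise CorruptRpcError("message shorter than checksum header")
    (expected,) = struct.unpack("!I", message[:4])
    payload = message[4:]
    if zlib.crc32(payload) != expected:
        raise CorruptRpcError("checksum mismatch")
    return payload
```

A receiver that catches `CorruptRpcError` can treat it like any other RPC error, for example by reporting the failure to the sender or retrying.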
Assertion failure
- Internal software consistency check violated; the process likely calls abort(3)
Error During Recovery
- Full power loss
  - All lights go out.
- Partial power loss/Network partition
  - Part(s) of the datacenter go down, or appear to, while other part(s) stay up.
- Missing heartbeat
  - Controller or master nodes expected a regular liveness ping, but didn't get one.
- Cluster coordinator dies/becomes unreachable
- Client asks wrong server for object
  - Likely response: Not my data, not my problem.
- Backup I/O Error
  - Backup disk/SSD/etc. returns an I/O error. Common enough to handle specially, or just SIITH?
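
For the missing-heartbeat case listed above, a minimal sketch of liveness tracking might look like the following. The `HeartbeatMonitor` class and its timeout policy are assumptions made for illustration. Timestamps are passed in explicitly so the logic can be exercised without a real clock.

```python
class HeartbeatMonitor:
    """Tracks the last ping time per node and reports overdue nodes."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def record(self, node, now):
        # Called whenever a liveness ping arrives from `node`.
        self.last_seen[node] = now

    def missing(self, now):
        # Nodes whose most recent ping is older than the timeout.
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A missed heartbeat alone cannot distinguish a dead node from a partitioned one, so the result of `missing` is better treated as a list of suspects than as a verdict.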
