Comment: Corrected links that should have been relative instead of absolute.

...

  • Loss of machine
    • What does this mean? Is it the point at which we label a misbehaving machine as toast?
    • Multiple machine types: client, master, backup (separate service colocated w/ master), coordinator
  • Loss of rack
    • Common failure scenario - must be able to handle gracefully
    • Implies we need # of racks >= # of copies of replicated data for durability
  • Network Partitions
    • Should not lose sanity when these occur.
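
The rack constraint above (# of racks >= # of copies) can be illustrated with a minimal sketch of rack-aware replica placement. This is not taken from any actual implementation; the names `place_replicas`, `survives_rack_loss`, and the `servers_by_rack` mapping are hypothetical. It assumes each copy is placed on a different rack, so that losing any single rack loses at most one copy.

```python
import random


def place_replicas(servers_by_rack, num_copies, rng=None):
    """Pick one server from each of num_copies distinct racks.

    servers_by_rack: hypothetical mapping of rack id -> list of server ids.
    Raises ValueError when there are fewer usable racks than copies,
    which is the "# of racks >= # of copies" requirement.
    """
    rng = rng or random.Random()
    usable_racks = [rack for rack, servers in servers_by_rack.items() if servers]
    if len(usable_racks) < num_copies:
        raise ValueError(
            "need at least %d racks, have %d" % (num_copies, len(usable_racks))
        )
    chosen_racks = rng.sample(usable_racks, num_copies)
    # One copy per chosen rack, on a randomly selected server in that rack.
    return [(rack, rng.choice(list(servers_by_rack[rack]))) for rack in chosen_racks]


def survives_rack_loss(placement, lost_rack):
    """True if at least one copy lives outside the lost rack."""
    return any(rack != lost_rack for rack, _server in placement)
```

With three racks and three copies, every single-rack loss leaves two copies intact; with only two racks, the placement request is rejected rather than silently doubling up copies on one rack.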

...

  • RPC Timeout
    • Client->Server or Server->Server request doesn't yield a timely response
  • RPC Error
    • Client or server node detects a corrupt RPC or corrupt payload, or the server explicitly indicates an error
  • Assertion failure
    • Internal software consistency violated; likely ends in abort(3)
  • Error During Recovery
    • Many possibilities: backup sources die, reconstitution destinations die (e.g., if index reconstruction is distributed), reconstituting master dies, data missing/invalid/unsalvageable, etc.
  • Full power loss
    • All lights go out.
  • Partial power loss/Network partition
    • Part(s) of the datacenter either go down or appear to, while other part(s) stay up.
  • Missing heartbeat
    • Controller or master nodes expected a regular liveness ping, but didn't get one.
  • Cluster coordinator dies/becomes unreachable
  • Client asks wrong server for object
    • Likely response: Not my data, not my problem.
  • Backup I/O Error
    • Backup disk/SSD/etc. gives an I/O error. Common enough to handle specially, or just SIITH?
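
The missing-heartbeat case above can be sketched as a small timeout-based detector. This is only an illustration under stated assumptions: the class name `HeartbeatMonitor`, the fixed ping `interval`, and the `max_missed` threshold are all hypothetical and do not come from the notes. Timestamps are passed in explicitly so the logic stays deterministic and easy to test.

```python
class HeartbeatMonitor:
    """Tracks the last liveness ping per node and flags overdue nodes.

    interval:   assumed time between expected pings (seconds).
    max_missed: assumed number of consecutive missed pings tolerated
                before a node is treated as suspect.
    """

    def __init__(self, interval, max_missed=3):
        self.interval = interval
        self.max_missed = max_missed
        self.last_seen = {}  # node id -> timestamp of most recent ping

    def record(self, node, now):
        """Note that a ping from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def suspects(self, now):
        """Return nodes whose last ping is older than the allowed window."""
        window = self.interval * self.max_missed
        return sorted(
            node for node, seen in self.last_seen.items() if now - seen > window
        )


monitor = HeartbeatMonitor(interval=1.0, max_missed=3)
monitor.record("master1", now=0.0)
monitor.record("backup1", now=2.5)
overdue = monitor.suspects(now=3.5)  # master1 has been silent for 3.5 s
```

A suspect node is not necessarily dead: under a network partition it may only appear to be down, so what to do with the result (retry, start recovery, or wait) is a separate policy decision.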