...

  • Make RPCs more robust.
    • Hosts should be able to recover from the other end of a session failing.
    • RPCs need to time out eventually (this should be reasonably aggressive).
  • Update the log cleaner.
  • Handle coordinator failures.
  • Handle backup failures.
  • Handle multiple master failures and other secondary failures.
  • Cold start.
    • Backups need a superblock.
  • Threading.
  • Overall reliability model: the system handles simple failures with no data loss and survives any failure, but more complex failures (such as a total power failure) will cause data loss. If at any point we get confused about what to do (e.g. during a network partition), we can shut the whole system down and do a cold start, with potential data loss.
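
As a rough illustration of the RPC items in the list above (timing out aggressively and recovering when the other end of a session fails), here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Session` class, `poll_reply`, `call_with_timeout`, `call_with_recovery`, and the timeout values are assumptions, not part of the actual system.

```python
import time


class RpcTimeout(Exception):
    """Raised when an RPC gets no reply before its deadline."""


class Session:
    """Hypothetical session; `alive` stands in for the remote host's state."""

    def __init__(self, alive=True):
        self.alive = alive

    def poll_reply(self, request):
        # Return a reply if the remote end answered, otherwise None.
        return ("ok", request) if self.alive else None


def call_with_timeout(session, request, timeout_s=0.05, poll_interval_s=0.005):
    """Wait for a reply, giving up once timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        reply = session.poll_reply(request)
        if reply is not None:
            return reply
        if time.monotonic() >= deadline:
            raise RpcTimeout(f"no reply to {request!r} within {timeout_s}s")
        time.sleep(poll_interval_s)


def call_with_recovery(open_session, request, attempts=3, timeout_s=0.05):
    """Retry on a fresh session whenever the previous one times out."""
    last_error = None
    for _ in range(attempts):
        session = open_session()  # assumed factory for a new session
        try:
            return call_with_timeout(session, request, timeout_s)
        except RpcTimeout as err:
            last_error = err  # drop the dead session and try a new one
    raise last_error
```

In this sketch the caller never blocks forever: a dead session surfaces as `RpcTimeout` after a short, fixed deadline, and recovery amounts to discarding that session and opening another. How aggressive the real timeout should be is left open here.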

Tasks deferred until later:

...