Note: Meant for personal use, so it may not be easily readable. Points in black have been done; points in grey are todos.
 
  1. Initial design phase:
    • Many discussions on how to make the Coordinator fault tolerant, i.e., how to make distributed state changes atomic in the face of failures.
    • Some of these are summarized on this page.
  2. Refactoring discussions: most of them are summarized on this page (up to, but not including, "update").
  3. Linked RAMCloud to LogCabin.
  4. Implementation: all methods related to server management:
    1. Refactoring (the required code was moved to a new CoordinatorServerManager).
    2. Each method logs its state to LogCabin (see the first sketch after this list).
    3. Implemented recovery paths that recover after a failure by reading the state for the corresponding operations back from LogCabin.
  5. A lot of the initial design was reworked during this implementation; the changes are summarized on this page.
  6. Implementation: a CoordinatorServiceRecovery module that replays the log from LogCabin and invokes the appropriate recovery methods implemented above (see the second sketch after this list).
  7. Ran a cluster with a real LogCabin instance (not just the client API used in unit testing), a single coordinator, and multiple masters. Coordinator recovery works!
  8. During the above, discovered bugs and some other weird behavior. Some of it is detailed here.
  9. Re-refactoring: summarized in "update" on this page.
    1. Implementation: moved the CoordinatorServerManager code into CoordinatorServerList.
  10. Implementation: all methods related to tablet management.
  11. Re-visit design: multiple coordinator nodes, with one leader and multiple followers; one of the followers takes over after a leader failure (see the third sketch after this list). Issues:
    1. Consensus for choosing the new leader.
    2. Followers follow the LogCabin log in real time so that they have less catching up to do on takeover; the recovery code is modified accordingly.
  12. Implementation of the above.
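
Below is a minimal sketch of the logging pattern from item 4, assuming a simplified append-only log: each coordinator operation records an intent entry in the shared log before mutating state, and a completion entry afterwards. The Log type, entry format, and enlistServer details are hypothetical stand-ins for illustration; they are not LogCabin's actual client API or RAMCloud's actual code.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical append-only log, standing in for a LogCabin client handle.
    struct Log {
        std::vector<std::string> entries;
        void append(const std::string& entry) { entries.push_back(entry); }
    };

    struct Coordinator {
        Log& log;
        uint64_t nextServerId = 1;

        // Enlist a new server. The intended state change is logged before any
        // other state is mutated, and a completion record is logged afterwards,
        // so a coordinator that crashes mid-operation leaves enough information
        // in the log for its successor to finish the operation.
        uint64_t enlistServer(const std::string& locator) {
            uint64_t id = nextServerId;
            log.append("ServerEnlisting " + std::to_string(id) + " " + locator);
            nextServerId = id + 1;  // ...update server list, notify cluster, etc.
            log.append("ServerEnlisted " + std::to_string(id));
            return id;
        }
    };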
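
The replay half (item 6) pairs intent records with completion records: an operation whose intent has no matching completion was interrupted and is re-driven. Again a sketch under the same hypothetical entry format, not the actual CoordinatorServiceRecovery code.

    #include <iostream>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    // Replay the log after a coordinator failure: any operation whose intent
    // record ("ServerEnlisting") lacks a matching completion record
    // ("ServerEnlisted") was interrupted, so its recovery method is invoked.
    void replay(const std::vector<std::string>& entries) {
        std::set<std::string> completed;
        std::vector<std::string> pendingIds;
        for (const std::string& entry : entries) {
            std::istringstream in(entry);
            std::string type, id;
            in >> type >> id;
            if (type == "ServerEnlisted")
                completed.insert(id);
            else if (type == "ServerEnlisting")
                pendingIds.push_back(id);
        }
        for (const std::string& id : pendingIds)
            if (!completed.count(id))
                std::cout << "re-driving interrupted enlistment of server "
                          << id << "\n";  // would invoke the recovery method
    }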
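
For item 11, a sketch of a follower tailing the log: a standby coordinator keeps applying new entries as they appear, so after a leader failure it only has to replay the short unapplied suffix before taking over. The SharedLog type and polling loop are assumptions for illustration; a real follower would read via the LogCabin client and would need proper synchronization.

    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <thread>
    #include <vector>

    struct SharedLog {
        std::vector<std::string> entries;  // stand-in for the replicated log
    };

    // Run on each follower: tail the log, applying entries as they appear, so
    // takeover after a leader failure only needs to replay a short suffix.
    void followLog(const SharedLog& log,
                   const std::function<bool()>& leaderAlive,
                   const std::function<void(const std::string&)>& apply) {
        std::size_t applied = 0;
        while (leaderAlive()) {
            while (applied < log.entries.size())
                apply(log.entries[applied++]);
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        // Leader failed: apply whatever remains, then take over as leader.
        while (applied < log.entries.size())
            apply(log.entries[applied++]);
    }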