Coordinator - Progress tracking page
Note: Meant for personal use, so it may not be easily readable. Points in black have been done; points in grey are todos.
- Initial design phase:
- Many discussions on how to make the Coordinator fault tolerant, i.e., how to make distributed state changes atomic in the face of failures.
- Some of these are summarized on this page.
- Refactoring discussions: most of them are summarized on this page (up to the point before "update").
- Linked RAMCloud to LogCabin.
- Implementation: All methods related to server management:
- Refactoring (required code moved to a new CoordinatorServerManager)
- These methods log their state to LogCabin.
- Implemented recovery paths that recover after a failure by reading the state for the corresponding operations back from LogCabin.
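The log-state-then-act pattern above can be sketched roughly as follows. This is an illustrative sketch, not RAMCloud's actual code: the `Log` class, the `enlist_server` operation, and the record formats are all made-up stand-ins for the real LogCabin log and coordinator methods.

```python
class Log:
    """Hypothetical stand-in for a LogCabin log: an ordered, durable record list."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)

def enlist_server(log, server_id):
    """Persist intent before acting, so a recovering coordinator can finish the operation."""
    log.append(("EnlistServer", server_id))      # 1. persist intent
    # 2. ... perform the actual distributed state change (RPCs etc.) ...
    log.append(("EnlistServerDone", server_id))  # 3. persist completion

def recover(log):
    """Find operations whose intent was logged but never completed."""
    started, done = set(), set()
    for op, sid in log.entries:
        (done if op.endswith("Done") else started).add(sid)
    return started - done  # these must be re-driven to completion

log = Log()
enlist_server(log, "server1")
log.append(("EnlistServer", "server2"))  # simulate a crash before completion
print(sorted(recover(log)))  # → ['server2']
```

The point of the intent/completion pair is that an operation interrupted by a coordinator crash is detectable and re-drivable from the log alone.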
- A lot of the initial design was reworked during this implementation. Summarized on this page.
- Implementation: a CoordinatorServiceRecovery module that replays the LogCabin log and invokes the appropriate recovery methods implemented above.
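The replay idea behind CoordinatorServiceRecovery can be sketched as a dispatch loop over log entries. The entry types, handler names, and entry layout below are assumptions for illustration only.

```python
def recover_enlist_server(entry):
    # Hypothetical recovery method for an interrupted enlistment.
    return f"re-driving enlistment of {entry['serverId']}"

def recover_server_down(entry):
    # Hypothetical recovery method for an interrupted server removal.
    return f"re-driving removal of {entry['serverId']}"

# Dispatch table: log entry type -> recovery method.
HANDLERS = {
    "EnlistServer": recover_enlist_server,
    "ServerDown": recover_server_down,
}

def replay(log_entries):
    """Read every entry in log order and invoke its recovery handler."""
    actions = []
    for entry in log_entries:
        handler = HANDLERS.get(entry["type"])
        if handler:  # entry types without handlers are skipped
            actions.append(handler(entry))
    return actions

log_entries = [
    {"type": "EnlistServer", "serverId": "1.0"},
    {"type": "ServerDown", "serverId": "1.0"},
]
print(replay(log_entries))
```

Replaying in log order matters: later entries (e.g. a server going down) may supersede earlier ones for the same server.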
- Ran a cluster with real LogCabin (not just the client API used in unit testing), a single coordinator, and multiple masters. Coordinator recovery works!
- During the above, discovered bugs and some other oddities. Some of them are detailed here.
- Re-refactoring: summarized here.
- Implementation: Moved CoordinatorServerManager code into CoordinatorServerList.
- Re-visit design: multiple coordinator nodes: one leader, multiple followers; one of the followers takes over after a leader failure. Discussion here.
- Implementation of the above (partially done).
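A minimal sketch of the leader/follower takeover idea, assuming a lease-based scheme (the lease mechanics and names here are my assumptions, not necessarily the design under discussion): the leader stays in charge while its lease is live, and a follower claims leadership once the lease expires.

```python
class Coordinator:
    def __init__(self, name):
        self.name = name

def elect(followers, lease_holder, lease_expired):
    """Return the acting leader: the lease holder while its lease is
    live, otherwise the first follower to claim leadership."""
    if lease_holder is not None and not lease_expired:
        return lease_holder
    return followers[0]  # first follower to notice the expiry takes over

leader = Coordinator("coord-A")
followers = [Coordinator("coord-B"), Coordinator("coord-C")]

assert elect(followers, leader, lease_expired=False) is leader
new_leader = elect(followers, leader, lease_expired=True)
print(new_leader.name)  # → coord-B
```

In a real deployment the "first follower to notice" race would itself be arbitrated through LogCabin (e.g. a conditional write), so only one follower can win.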
- Implementation: All methods related to tablet management.
- Maintain server list update version numbers across coordinator crashes.
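The version-number requirement can be sketched as: persist each server list update's version, and have a restarted coordinator resume numbering from the highest persisted version, so versions stay monotonic across crashes. The structure below is assumed for illustration.

```python
class Coordinator:
    def __init__(self, persisted_versions):
        # Resume from the largest version recorded before the crash,
        # so masters never see a duplicate or smaller version number.
        self.version = max(persisted_versions, default=0)

    def next_update(self):
        self.version += 1
        return self.version

crashed_log = [1, 2, 3]           # update versions persisted before the crash
coord = Coordinator(crashed_log)  # replacement coordinator after recovery
print(coord.next_update())  # → 4
```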
- Extensive system tests and benchmarking