Tasks for Design & Implementation of Fast Recovery
The following is a concise list of topics we will likely need to address this fall for the fast recovery paper.
NB: (?) denotes tasks that are eventually necessary, but probably not critical for demonstrating our prototype.
- New Log
  - checksumming
  - finding/identifying log head
  - simple cleaning
  - log per machine vs. log per partition
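Two of the items above (checksumming and finding the log head) fit together: if each segment carries a checksum and a monotonically increasing ID, the head can be identified at recovery time as the highest-ID segment that still verifies. A minimal sketch, assuming a hypothetical `(seg_id, crc, payload)` segment layout and CRC32 checksums (the real formats are undecided):

```python
import zlib

# Hypothetical segment layout: (seg_id, crc, payload). The CRC covers the
# payload; a torn or corrupt segment fails verification and is skipped.

def make_segment(seg_id, payload: bytes):
    """Build a (seg_id, checksum, payload) tuple as it might be written."""
    return (seg_id, zlib.crc32(payload), payload)

def find_log_head(segments):
    """Return the highest-ID segment whose checksum verifies, or None."""
    head = None
    for seg_id, crc, payload in segments:
        if zlib.crc32(payload) != crc:
            continue  # checksum mismatch: torn write or corruption, ignore
        if head is None or seg_id > head[0]:
            head = (seg_id, crc, payload)
    return head
```

The same checksum check would also let simple cleaning discard damaged segments rather than replaying them.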
- Backup Server
  - ops: open segment, write, commit segment
  - handle log discovery broadcasts
  - handle backup failures (?)
  - ensure durability of in-memory (non-committed) segments (?)
  - maintain utilization stats and respond to load queries (?)
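The three backup ops listed above can be sketched as an in-memory state machine; the names and signatures here are assumptions for illustration, not a proposed RPC interface, and "commit" stands in for whatever flush-to-stable-storage we settle on:

```python
# Hypothetical sketch of the backup ops: open segment, write, commit segment.
# Open segments are writable buffers; committed segments are immutable (in a
# real backup, commit is where the segment would reach stable storage).

class BackupServer:
    def __init__(self):
        self.open_segments = {}  # seg_id -> bytearray (still writable)
        self.committed = {}      # seg_id -> bytes (immutable)

    def open_segment(self, seg_id):
        if seg_id in self.open_segments or seg_id in self.committed:
            raise ValueError("segment already exists")
        self.open_segments[seg_id] = bytearray()

    def write(self, seg_id, offset, data: bytes):
        seg = self.open_segments[seg_id]  # KeyError if segment not open
        seg[offset:offset + len(data)] = data

    def commit_segment(self, seg_id):
        # Flush point: after this, the segment may no longer be written.
        self.committed[seg_id] = bytes(self.open_segments.pop(seg_id))
```

The open/committed split also frames the durability question marked (?) above: open segments exist only in memory until commit.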
- Cluster Coordinator
  - server monitoring and failure detection
  - new master selection
  - service location: resolve (table, key) -> master
  - coordinator failures (?)
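The service-location item above amounts to a lookup from (table, key) to the master serving that key. One possible shape, sketched under the assumption that the coordinator keeps a per-table list of key ranges (the actual tablet-map layout is an open design question):

```python
# Hypothetical coordinator-side service location: per table, a list of
# (start_key, end_key, master) ranges; resolve scans for the covering range.

class Coordinator:
    def __init__(self):
        self.tablet_map = {}  # table -> list of (start_key, end_key, master)

    def assign(self, table, start_key, end_key, master):
        """Record that `master` serves [start_key, end_key] of `table`."""
        self.tablet_map.setdefault(table, []).append(
            (start_key, end_key, master))

    def resolve(self, table, key):
        """Resolve (table, key) -> master, the op clients will cache."""
        for start, end, master in self.tablet_map.get(table, []):
            if start <= key <= end:
                return master
        raise KeyError((table, key))
```

Clients would cache `resolve` results in their "Config" table and fall back to the coordinator on a miss or stale entry.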
- Multi-server Clients
  - maintain "Config" table of service locators resolved from the cluster coordinator
- Client Applications
  - Need "real" applications to throw at the system
  - Need benchmark and stress-testing applications for performance measurement, load simulation, etc.
- Data Model
  - Is key/value storage sufficiently interesting? If not, should we tackle indexes? What are the other alternatives?
  - If indexes are needed, should we do them client-side or server-side?
- Master Recovery
  - replay log and rebuild hash table on new masters
  - eventually reconstitute the original master (?)
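The replay step above can be sketched simply: walk the recovered log entries in order and rebuild the object hash table, letting later entries (including delete tombstones) supersede earlier ones. The `(key, value)` entry layout and `None`-as-tombstone convention are assumptions for illustration:

```python
# Hypothetical log replay on a new master: later entries win; a value of
# None stands in for a tombstone (deleted object).

def replay_log(entries):
    """Rebuild the hash table from an ordered iterable of (key, value)."""
    hash_table = {}
    for key, value in entries:
        if value is None:
            hash_table.pop(key, None)  # tombstone: drop any earlier write
        else:
            hash_table[key] = value    # later writes overwrite earlier ones
    return hash_table
```

Note this assumes entries arrive in log order; if segments from many backups are replayed concurrently, version numbers (not shown) would be needed to order conflicting entries.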
- Simulation
  - Can we demonstrate with 10 to 100 machines and extrapolate to 10,000+?
- Machines
  - Need to acquire and set up an experimental cluster
  - How many machines are needed?
  - Where can we put them?
  - Can't afford long delays in acquisition.
- Load Balancing / Reconfiguration
  - moving partitions between machines, splitting tables, etc. (?)
- Partitioning
  - Figure out how we want to do partitioning within each master.
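One candidate answer to the question above, sketched purely for illustration: hash each key into a fixed number of partitions per master, so that on a failure each partition can be recovered independently on a different machine. The partition count and choice of hash are assumptions, not decisions:

```python
import hashlib

# Hypothetical within-master partitioning: a stable hash of the key picks
# one of num_partitions buckets, so recovery work can be split evenly.

def partition_of(key: bytes, num_partitions: int = 16) -> int:
    """Map a key to a partition index via a stable hash of the key."""
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

A range-based scheme is the main alternative; it keeps scans local but makes even load balancing harder, which ties this item back to the Load Balancing / Reconfiguration tasks above.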