Recovery Task List
Tasks for Design & Implementation of Fast Recovery
The following is a concise list of topics we will likely need to address this fall for the fast recovery paper.
NB: (?) denotes tasks that are eventually necessary, but probably not critical for demonstrating our prototype.
- New Log
- checksumming
- finding/identifying log head (know what to do, unimplemented)
- simple cleaning
- log per machine vs. log per partition
- Backup Server
- ops: open segment, write, commit segment
- handle log discovery broadcasts
- handle backup failures; there are 2 different cases: (?)
- recovery-side: backup failures during recovery (i.e. find another backup to pull from)
- master-side: backup failures affecting replication factor for various masters (write out segment(s) again to get R back up to desired value)
- ensure durability of in-memory (non-committed) segments (?)
- maintain utilisation stats and respond to load queries (?)
- add compression for segments written to disk (lzjb? lzo? zippy? http://denisy.dyndns.org/lzo_vs_lzjb/)
- Cluster Coordinator
- server monitoring and failure detection
- new master selection
- service location: resolve (table, key) -> master
- coordinator failures (?)
- Multi-server Clients
- maintain "Config" table of service locators resolved from the cluster coordinator
- Client Applications
- Need "real" applications to throw at the system
- Need benchmark and stress testing applications for performance measurement, load simulation, etc.
- Data Model
- Is key/value storage sufficiently interesting? If not, should we tackle indexes? What are the other alternatives?
- If indexes are needed, should we do them client-side or server-side?
- Master Recovery
- Replay log and rebuild hash table on new masters
- Eventually reconstitute the original master (?)
- Simulation
- Can we demonstrate with 10 to 100 machines and extrapolate to 10,000+?
- Machines Purchase Order Submitted Dec 1
- Need to acquire and set up an experimental cluster
- How many machines are needed?
- Where can we put them?
- Can't afford long delays in acquisition.
- Load Balancing / Reconfiguration
- Moving partitions between machines, splitting tables, etc. (?)
- Will
- Design and implement a will generation algorithm for the Master. Sync with coordinator.
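
To make the "checksumming" task under New Log and the "Replay log and rebuild hash table" task under Master Recovery more concrete, here is a minimal sketch of how the two might fit together. Everything in it is an assumption for illustration only: the entry layout, the use of CRC32, the per-object version number, and the names `encode_entry` and `replay` are hypothetical and are not taken from the prototype. Deletes (tombstones), cleaning, and multi-segment ordering are left out.

```python
import struct
import zlib

# Hypothetical entry layout (an assumption, not the project's format):
# crc32 (4 bytes) | table id (8) | key (8) | version (8) | value length (4) | value
HEADER = struct.Struct("<IQQQI")


def encode_entry(table, key, version, value):
    """Serialize one log entry, prefixing a CRC32 of everything after it."""
    body = struct.pack("<QQQI", table, key, version, len(value)) + value
    return struct.pack("<I", zlib.crc32(body)) + body


def replay(segment):
    """Rebuild {(table, key): (version, value)} from one segment's bytes.

    Replay stops at the first truncated or corrupt entry, treating it
    as the end of the usable log in this segment.
    """
    table_map = {}
    offset = 0
    while offset + HEADER.size <= len(segment):
        crc, table, key, version, length = HEADER.unpack_from(segment, offset)
        end = offset + HEADER.size + length
        if end > len(segment):
            break  # truncated entry
        if zlib.crc32(segment[offset + 4:end]) != crc:
            break  # checksum mismatch
        value = segment[offset + HEADER.size:end]
        current = table_map.get((table, key))
        # Keep only the newest version of each object.
        if current is None or version > current[0]:
            table_map[(table, key)] = (version, value)
        offset = end
    return table_map


if __name__ == "__main__":
    seg = encode_entry(1, 7, 1, b"old") + encode_entry(1, 7, 2, b"new")
    print(replay(seg))
```

Stopping at the first bad checksum is only one possible policy; a real recovery path would more likely fetch the segment from another backup, as in the recovery-side backup failure case listed above.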