Tasks for Design & Implementation of Fast Recovery
The following is a concise list of topics we will likely need to address this fall for the fast recovery paper.
NB: (?) denotes tasks that are eventually necessary, but probably not critical for demonstrating our prototype.
- New Log
  - checksumming
  - finding/identifying log head
  - simple cleaning
  - log per machine vs. log per partition
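Two of the items above (checksumming and finding the log head) fit together: if each segment carries a checksum and a monotonically increasing ID, the head can be identified at recovery time as the highest-ID segment that still verifies. A minimal sketch, assuming a hypothetical `(seg_id, crc, payload)` segment layout and CRC32 checksums (the real formats are undecided):

```python
import zlib

# Hypothetical segment layout: (seg_id, crc, payload). The CRC covers the
# payload; a torn or corrupt segment fails verification and is skipped.

def make_segment(seg_id, payload: bytes):
    """Build a (seg_id, checksum, payload) tuple as it might be written."""
    return (seg_id, zlib.crc32(payload), payload)

def find_log_head(segments):
    """Return the highest-ID segment whose checksum verifies, or None."""
    head = None
    for seg_id, crc, payload in segments:
        if zlib.crc32(payload) != crc:
            continue  # checksum mismatch: torn write or corruption, ignore
        if head is None or seg_id > head[0]:
            head = (seg_id, crc, payload)
    return head
```

The same checksum check would also let simple cleaning discard damaged segments rather than replaying them.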
- Backup Server
  - ops: open segment, write, commit segment
  - handle log discovery broadcasts
  - handle backup failures (?)
  - ensure durability of in-memory (non-committed) segments (?)
  - maintain utilization stats and respond to load queries (?)
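The three backup ops listed above can be sketched as an in-memory state machine; the names and signatures here are assumptions for illustration, not a proposed RPC interface, and "commit" stands in for whatever flush-to-stable-storage we settle on:

```python
# Hypothetical sketch of the backup ops: open segment, write, commit segment.
# Open segments are writable buffers; committed segments are immutable (in a
# real backup, commit is where the segment would reach stable storage).

class BackupServer:
    def __init__(self):
        self.open_segments = {}  # seg_id -> bytearray (still writable)
        self.committed = {}      # seg_id -> bytes (immutable)

    def open_segment(self, seg_id):
        if seg_id in self.open_segments or seg_id in self.committed:
            raise ValueError("segment already exists")
        self.open_segments[seg_id] = bytearray()

    def write(self, seg_id, offset, data: bytes):
        seg = self.open_segments[seg_id]  # KeyError if segment not open
        seg[offset:offset + len(data)] = data

    def commit_segment(self, seg_id):
        # Flush point: after this, the segment may no longer be written.
        self.committed[seg_id] = bytes(self.open_segments.pop(seg_id))
```

The open/committed split also frames the durability question marked (?) above: open segments exist only in memory until commit.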
- Cluster Coordinator
  - server monitoring and failure detection
  - new master selection
  - service location: resolve (table, key) -> master
  - coordinator failures (?)
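The service-location item above amounts to a lookup from (table, key) to the master serving that key. One possible shape, sketched under the assumption that the coordinator keeps a per-table list of key ranges (the actual tablet-map layout is an open design question):

```python
# Hypothetical coordinator-side service location: per table, a list of
# (start_key, end_key, master) ranges; resolve scans for the covering range.

class Coordinator:
    def __init__(self):
        self.tablet_map = {}  # table -> list of (start_key, end_key, master)

    def assign(self, table, start_key, end_key, master):
        """Record that `master` serves [start_key, end_key] of `table`."""
        self.tablet_map.setdefault(table, []).append(
            (start_key, end_key, master))

    def resolve(self, table, key):
        """Resolve (table, key) -> master, the op clients will cache."""
        for start, end, master in self.tablet_map.get(table, []):
            if start <= key <= end:
                return master
        raise KeyError((table, key))
```

Clients would cache `resolve` results in their "Config" table and fall back to the coordinator on a miss or stale entry.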
- Multi-server Clients
  - maintain "Config" table of service locators resolved from the cluster coordinator
- Client Applications
  - Need "real" applications to throw at the system
  - Need benchmark and stress-testing applications for performance measurement, load simulation, etc.
- Data Model
  - Is key/value storage sufficiently interesting? If not, should we tackle indexes? What are the other alternatives?
  - If indexes are needed, should we do them client-side or server-side?
- Master Recovery
  - replay log and rebuild hash table on new masters
  - eventually reconstitute the original master (?)
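The replay step above can be sketched simply: walk the recovered log entries in order and rebuild the object hash table, letting later entries (including delete tombstones) supersede earlier ones. The `(key, value)` entry layout and `None`-as-tombstone convention are assumptions for illustration:

```python
# Hypothetical log replay on a new master: later entries win; a value of
# None stands in for a tombstone (deleted object).

def replay_log(entries):
    """Rebuild the hash table from an ordered iterable of (key, value)."""
    hash_table = {}
    for key, value in entries:
        if value is None:
            hash_table.pop(key, None)  # tombstone: drop any earlier write
        else:
            hash_table[key] = value    # later writes overwrite earlier ones
    return hash_table
```

Note this assumes entries arrive in log order; if segments from many backups are replayed concurrently, version numbers (not shown) would be needed to order conflicting entries.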
- Simulation
  - Can we demonstrate with 10 to 100 machines and extrapolate to 10,000+?
- Machines
  - Need to acquire and set up an experimental cluster
  - How many machines are needed?
  - Where can we put them?
  - Can't afford long delays in acquisition.
- Load Balancing / Reconfiguration
  - moving partitions between machines, splitting tables, etc. (?)
- Partitioning
  - Figure out how we want to do partitioning within each master.
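One candidate answer to the question above, sketched purely for illustration: hash each key into a fixed number of partitions per master, so that on a failure each partition can be recovered independently on a different machine. The partition count and choice of hash are assumptions, not decisions:

```python
import hashlib

# Hypothetical within-master partitioning: a stable hash of the key picks
# one of num_partitions buckets, so recovery work can be split evenly.

def partition_of(key: bytes, num_partitions: int = 16) -> int:
    """Map a key to a partition index via a stable hash of the key."""
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

A range-based scheme is the main alternative; it keeps scans local but makes even load balancing harder, which ties this item back to the Load Balancing / Reconfiguration tasks above.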