Recovery Task List

Tasks for Design & Implementation of Fast Recovery

The following is a concise list of topics we will likely need to address this fall for the fast recovery paper.

NB: (?) denotes tasks that are eventually necessary, but probably not critical for demonstrating our prototype.

New Log
1. checksumming
2. finding/identifying log head (design known, not yet implemented)
3. simple cleaning
4. log per machine vs. log per partition
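
As a rough illustration of the checksumming item above, here is a minimal sketch in Python. The entry layout (a 4-byte length and a 4-byte CRC32 in front of each payload) and the names `append_entry` and `read_entries` are assumptions made for this example, not a decided format. The reader stops at the first entry whose checksum does not match, which is one plausible way to detect a torn or corrupt tail.

```python
import struct
import zlib

# Hypothetical entry header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_entry(log: bytearray, payload: bytes) -> None:
    """Append one checksummed entry to an in-memory log."""
    log += HEADER.pack(len(payload), zlib.crc32(payload))
    log += payload


def read_entries(log: bytes) -> list[bytes]:
    """Return payloads in order, stopping at the first invalid entry."""
    entries = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, checksum = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = bytes(log[start:start + length])
        # A short read or a CRC mismatch means the entry is truncated or corrupt.
        if len(payload) != length or zlib.crc32(payload) != checksum:
            break
        entries.append(payload)
        offset = start + length
    return entries
```

CRC32 is used here only because it ships with the standard library; which checksum the log should actually use is still open.
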
Backup Server
1. ops: open segment, write, commit segment
2. handle log discovery broadcasts
3. handle backup failures; there are two cases: (?)
  1. recovery-side: a backup fails during recovery (i.e., find another backup to pull from)
  2. master-side: a backup failure lowers the replication factor for various masters (write out the affected segment(s) again to bring R back up to the desired value)
4. ensure durability of in-memory (non-committed) segments (?)
5. maintain utilisation stats and respond to load queries (?)
6. add compression for segments written to disk (lzjb? lzo? zippy? http://denisy.dyndns.org/lzo_vs_lzjb/)
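
The three backup operations listed above (open segment, write, commit segment) could be sketched roughly as follows. This is an in-memory stand-in, not a proposed implementation: the class name `BackupServer`, the method signatures, the append-only rule on writes, and the `segments_for` helper (a guess at what answering a log discovery broadcast might need) are all assumptions for illustration.

```python
class BackupServer:
    """In-memory sketch of a backup; a real one would write committed segments to disk."""

    def __init__(self):
        self._open = {}       # (master_id, segment_id) -> bytearray being filled
        self._committed = {}  # (master_id, segment_id) -> immutable bytes

    def open_segment(self, master_id, segment_id):
        key = (master_id, segment_id)
        if key in self._open or key in self._committed:
            raise ValueError("segment already exists")
        self._open[key] = bytearray()

    def write(self, master_id, segment_id, offset, data):
        buf = self._open.get((master_id, segment_id))
        if buf is None:
            raise ValueError("segment is not open")
        # Assumption: writes are append-only, so the offset must match the current end.
        if offset != len(buf):
            raise ValueError("non-contiguous write")
        buf += data

    def commit_segment(self, master_id, segment_id):
        buf = self._open.pop((master_id, segment_id), None)
        if buf is None:
            raise ValueError("segment is not open")
        # Stand-in for flushing the segment to stable storage.
        self._committed[(master_id, segment_id)] = bytes(buf)

    def segments_for(self, master_id):
        """List committed segment ids held for one master."""
        return sorted(seg for (m, seg) in self._committed if m == master_id)
```

Segments that are open but not yet committed live only in memory in this sketch, which is exactly the durability gap item 4 refers to.
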
Cluster Coordinator
1. server monitoring and failure detection
2. new master selection
3. service location: resolve (table, key) -> master
4. coordinator failures (?)
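
One way to picture the (table, key) -> master resolution above is a per-table map from key ranges to service locators. The sketch below assumes integer keys and range partitioning, neither of which has been decided here; `Coordinator`, `assign`, and `resolve` are hypothetical names. Calling `assign` again for an existing range models pointing it at a newly selected master.

```python
import bisect


class Coordinator:
    """Sketch of service location: each table maps range start keys to a master locator."""

    def __init__(self):
        # table -> (sorted list of range start keys, parallel list of locators)
        self._tables = {}

    def assign(self, table, start_key, locator):
        starts, locators = self._tables.setdefault(table, ([], []))
        i = bisect.bisect_left(starts, start_key)
        if i < len(starts) and starts[i] == start_key:
            locators[i] = locator  # range already exists: point it at a new master
        else:
            starts.insert(i, start_key)
            locators.insert(i, locator)

    def resolve(self, table, key):
        starts, locators = self._tables[table]
        # Find the last range whose start key is <= key.
        i = bisect.bisect_right(starts, key) - 1
        if i < 0:
            raise KeyError((table, key))
        return locators[i]
```
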
Multi-server Clients
1. maintain "Config" table of service locators resolved from the cluster coordinator
Client Applications
1. Need "real" applications to throw at the system
2. Need benchmark and stress testing applications for performance measurement, load simulation, etc.
Data Model
1. Is key/value storage sufficiently interesting? If not, should we tackle indexes? What are the other alternatives?
  1. If indexes are needed, should we do them client-side or server-side?
Master Recovery
1. Replay log and rebuild hash table on new masters
2. Eventually reconstitute the original master (?)
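
The replay step above might look roughly like the sketch below. It assumes each log entry carries an operation, a key, a version number, and a value, and that entries can arrive in any order (for example, when segments are pulled from several backups), so the highest version per key wins. The entry shape and the name `replay` are illustrative assumptions, not the actual log format.

```python
def replay(entries):
    """Rebuild a key -> value hash table from (op, key, version, value) log entries."""
    latest = {}  # key -> (version, op, value) for the newest entry seen so far
    for op, key, version, value in entries:
        current = latest.get(key)
        # Skip entries that are not newer than what we already have for this key.
        if current is not None and current[0] >= version:
            continue
        latest[key] = (version, op, value)
    # Keys whose newest entry is a delete (tombstone) are left out of the table.
    return {key: value for key, (_, op, value) in latest.items() if op == "write"}
```

Tombstones are kept in `latest` until the end so that an older write replayed after a delete cannot bring the object back.
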
Simulation
1. Can we demonstrate with 10 to 100 machines and extrapolate to 10,000+?
Machines (purchase order submitted Dec 1)
1. Need to acquire and set up an experimental cluster
2. How many machines are needed?
3. Where can we put them?
4. Can't afford long delays in acquisition.
Load Balancing / Reconfiguration
1. Moving partitions between machines, splitting tables, etc. (?)
Will
1. Design and implement a will generation algorithm for the Master. Sync with coordinator.