Recovery Task List
Tasks for Design & Implementation of Fast Recovery
The following is a concise list of topics we will likely need to address this fall for the fast recovery paper.
NB: (?) denotes tasks that are eventually necessary, but probably not critical for demonstrating our prototype.
New Log
checksumming
finding/identifying log head (know what to do, unimplemented)
simple cleaning
log per machine vs. log per partition
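A minimal sketch of the checksumming and log-head items above, assuming a hypothetical entry layout of (length, CRC32, payload). The names `HEADER`, `append_entry` and `read_entries` are illustrative only and are not part of any existing code. Scanning stops at the first truncated or corrupt entry, which is one simple way to locate the usable end of a segment.

```python
import struct
import zlib

# Hypothetical entry header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")

def append_entry(segment, payload):
    """Append one checksummed entry to a bytearray segment."""
    segment += HEADER.pack(len(payload), zlib.crc32(payload))
    segment += payload

def read_entries(segment):
    """Return payloads up to the first truncated or corrupt entry."""
    entries = []
    offset = 0
    while offset + HEADER.size <= len(segment):
        length, checksum = HEADER.unpack_from(segment, offset)
        start = offset + HEADER.size
        payload = bytes(segment[start:start + length])
        if len(payload) != length or zlib.crc32(payload) != checksum:
            break  # torn write or corruption: treat this as the end of valid data
        entries.append(payload)
        offset = start + length
    return entries
```

A per-entry checksum keeps a single bad entry from invalidating the whole segment; a per-segment checksum would be cheaper but coarser.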
Backup Server
ops: open segment, write, commit segment
handle log discovery broadcasts
handle backup failures; there are two cases: (?)
recovery-side: backup failures during recovery (i.e., find another backup to pull from)
master-side: backup failures that reduce the replication factor for various masters (write out the affected segment(s) again to get R back up to the desired value)
ensure durability of in-memory (non-committed) segments (?)
maintain utilisation stats and respond to load queries (?)
add compression for segments written to disk (lzjb? lzo? zippy? http://denisy.dyndns.org/lzo_vs_lzjb/)
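The three backup ops above could be sketched as follows. This is an in-memory illustration under stated assumptions, not a design: the class name `BackupSketch`, the append-only write rule, and the use of zlib (a stand-in for lzjb/lzo/zippy, which are not in the Python standard library) are all assumptions.

```python
import zlib

class BackupSketch:
    """Toy backup: open segments live in memory, committed ones are compressed."""

    def __init__(self):
        self.open_segments = {}  # (master_id, segment_id) -> bytearray
        self.committed = {}      # (master_id, segment_id) -> compressed bytes ("disk")

    def open_segment(self, master_id, segment_id):
        key = (master_id, segment_id)
        if key in self.open_segments or key in self.committed:
            raise ValueError("segment already exists")
        self.open_segments[key] = bytearray()

    def write(self, master_id, segment_id, offset, data):
        buf = self.open_segments[(master_id, segment_id)]
        if offset != len(buf):
            # Assumption: masters only ever append to an open segment.
            raise ValueError("non-contiguous write")
        buf += data

    def commit_segment(self, master_id, segment_id):
        buf = self.open_segments.pop((master_id, segment_id))
        # zlib stands in for whichever compressor is eventually chosen.
        self.committed[(master_id, segment_id)] = zlib.compress(bytes(buf))

    def read_segment(self, master_id, segment_id):
        return zlib.decompress(self.committed[(master_id, segment_id)])
```

Note that in this sketch an open segment is lost if the backup crashes before commit, which is exactly the durability question flagged above.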
Cluster Coordinator
server monitoring and failure detection
new master selection
service location: resolve (table, key) -> master
coordinator failures (?)
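One possible shape for the (table, key) -> master lookup, assuming integer keys and range partitioning within a table. Both are assumptions made for illustration; the partitioning scheme is still open (see "log per machine vs. log per partition"). `CoordinatorSketch` and the locator strings are hypothetical.

```python
import bisect

class CoordinatorSketch:
    """Maps (table, key) to a master locator using sorted range start keys."""

    def __init__(self):
        # table_id -> (sorted list of range start keys, parallel list of locators)
        self.tables = {}

    def assign(self, table_id, start_key, locator):
        """Assign the range beginning at start_key to a master (or reassign it)."""
        starts, locators = self.tables.setdefault(table_id, ([], []))
        i = bisect.bisect_left(starts, start_key)
        if i < len(starts) and starts[i] == start_key:
            locators[i] = locator  # e.g. new master selected after a failure
        else:
            starts.insert(i, start_key)
            locators.insert(i, locator)

    def resolve(self, table_id, key):
        """Return the locator of the master whose range contains key."""
        starts, locators = self.tables[table_id]
        i = bisect.bisect_right(starts, key) - 1
        if i < 0:
            raise KeyError("no master covers this key")
        return locators[i]
```

New master selection then reduces to calling `assign` again for each range the failed master owned.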
Multi-server Clients
maintain "Config" table of service locators resolved from the cluster coordinator
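A rough sketch of a client-side "Config" table, assuming the coordinator lookup is available as a callable. `ConfigCache`, `lookup` and `invalidate` are hypothetical names, and caching per key rather than per partition is a simplification.

```python
class ConfigCache:
    """Caches locators resolved from the coordinator; entries can be invalidated."""

    def __init__(self, resolve):
        # resolve: callable (table, key) -> locator, standing in for a coordinator RPC.
        self._resolve = resolve
        self._entries = {}  # (table, key) -> locator

    def lookup(self, table, key):
        """Return the cached locator, asking the coordinator only on a miss."""
        k = (table, key)
        if k not in self._entries:
            self._entries[k] = self._resolve(table, key)
        return self._entries[k]

    def invalidate(self, table, key):
        """Drop a stale entry, e.g. after a request to that master fails."""
        self._entries.pop((table, key), None)
```

After a master failure the client would call `invalidate` and retry, which picks up the newly selected master on the next `lookup`.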
Client Applications
Need "real" applications to throw at the system
Need benchmark and stress testing applications for performance measurement, load simulation, etc.
Data Model
Is key/value storage sufficiently interesting? If not, should we tackle indexes? What are the other alternatives?
If indexes are needed, should we do them client-side or server-side?
Master Recovery
Replay log and rebuild hash table on new masters
Eventually reconstitute the original master (?)
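The replay step could look roughly like this, assuming each log entry carries a per-object version and that deletes are logged as tombstones (value `None`). Both are assumptions for the sketch, as is the entry tuple layout. With versions, segments can be replayed in any order, which matters if they arrive from different backups.

```python
def replay(segments):
    """Rebuild a (table, key) -> value hash table from log segments.

    Each segment is an iterable of (table, key, version, value) tuples;
    value is None for a tombstone. The highest version of each object wins.
    """
    latest = {}  # (table, key) -> (version, value)
    for segment in segments:
        for table, key, version, value in segment:
            current = latest.get((table, key))
            if current is None or version > current[0]:
                latest[(table, key)] = (version, value)
    # Objects whose newest entry is a tombstone are not restored.
    return {k: value for k, (_, value) in latest.items() if value is not None}
```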
Simulation
Can we demonstrate with 10 to 100 machines and extrapolate to 10,000+?
Machines (purchase order submitted Dec 1)
Need to acquire and set up an experimental cluster
How many machines are needed?
Where can we put them?
Can't afford long delays in acquisition.
Load Balancing / Reconfiguration
Moving partitions between machines, splitting tables, etc. (?)
Will
Design and implement a will generation algorithm for the Master. Sync with coordinator.