RAMCloud 1.0
This page tracks the work remaining to reach version 1.0 of RAMCloud. This version is intended to be a "Least Usable System": the smallest amount of functionality that could be sufficient to support actual applications.
Target Timeframe
July - October 2012
Issues
Progress
11/8/12
Risks: may need to add these.
Gathering statistics for balancing recovery.
LogCabin cleaning.
Log Cabin: Reconsidering API.
In progress
Coordinator: Tablet map persistence is still to do.
Cold Start: Try larger tests?
Tablet map: Stop using protocol buffers on masters.
Done
Coordinator: Server list management. Recovery works.
Log Cabin: Implementing consensus module.
Log Cleaner: Done.
Cold Start: Works.
Leases: Done.
8/30/12
Coordinator: Starting to test server list management; tablet map persistence is still to do.
Log Cabin: Reimplementing consensus module.
Client retry: Done.
Log Cleaner: A week out from merge.
Fault Tolerance: No progress; probably regressions due to churn in backup rewrite.
Cold Start: Needs to be reimplemented due to backup storage changes.
Leases: Officially a 1.0 feature; no other progress, though.
Synchronous write mode: Done and merged.
Tablet map: Awaiting code review fixes.
7/12/12
Coordinator: Implementation in progress, making basic state persistent
Log Cabin: Implementing consensus module; interface and durability already working; coordinator work is not blocked on it
Client retry: Should be done in about a week; converting existing RPCs to the new architecture
Table Enumeration: Functional, needs real-world testing
Log Cleaner: Redesign done; integrating with log refactoring, should be done in a week
Fault tolerance: Recovery can survive all sorts of failures (recovery master crashes, loss of backups); recovery of multiple hosts works; still smoking out bugs
Cold start: Awaiting client retry, but a hack allows some basic testing; have found and fixed a few bugs, but haven't been able to cold start successfully yet
New potential requirement: leases?
5/11/12
Fault-tolerant coordinator: new design in progress
Cold start attempted; fails on enlistment since CoordinatorServiceList isn't persisted
Enumerate: designed, coding
Fault tolerance: new Python class for scripting more interesting failure scenarios for RAMCloud
Log cleaner: gathering metrics
Goals
Support a high-volume website
Requires durability & availability
Support experimental applications
May not require durability, only minimal availability
Expect users to require serious hand-holding and interaction with RAMCloud team to develop, deploy, and support their application
Features
Fault-tolerant coordinator (Ankita)
Log cabin (Diego)
Cold start (Ryan)
Client retry (John)
Enumerate (Elliott?)
Synchronous backup write mode
Leases?
Stability and Testing
Fault-tolerance (Ryan)
Master recovery
Backup recovery
Cold start
Log cleaner (Steve)
Overload (Steve)
Deployment
Documentation for development and deployment (as much as the group can collectively generate in 1 day)
Client interface cleanup (as much as the group can collectively do in 1 day)
Packaging (make install)
Archival/Extraction via enumerate (see above)
Notes
Planned supported transports
TCP: Easy deployment on vanilla hardware, low performance
InfRc: Requires InfiniBand NICs/switches, high performance
Planned supported scale
80 nodes
Test scaling down so we can at least give a lower bound on usable cluster size
Deferred
Tablet migration
Supporting additional transports/10 Gb Ethernet
Performance testing
Scale up testing
Monitoring/Management
Additional bindings