RAMCloud 1.0

This page tracks the work remaining to reach version 1.0 of RAMCloud. This version is intended to be a "Least Usable System": the smallest amount of functionality that could suffice to support actual applications.

Target Timeframe

July - October 2012

Issues

Progress

  • 11/8/12
    • Risks (items that may need to be added):
      • Gathering statistics for balancing recovery.
      • LogCabin cleaning.
      • LogCabin: Reconsidering the API.
    • In progress
      • Coordinator: Tablet map persistence is still to do.
      • Cold Start: Try larger tests?
      • Tablet map: Stop using protocol buffers on masters.
    • Done
      • Coordinator: Server list management. Recovery works.
      • LogCabin: Consensus module implemented.
      • Log Cleaner: Done.
      • Cold Start: Works.
      • Leases: Done.
  • 8/30/12
    • Coordinator: Starting to test server list management; tablet map persistence is still to do.
    • LogCabin: Reimplementing the consensus module.
    • Client retry: Done.
    • Log Cleaner: A week out from merge.
    • Fault Tolerance: No progress; probably regressions due to churn in the backup rewrite.
    • Cold Start: Needs to be reimplemented due to backup storage changes.
    • Leases: Officially a 1.0 feature; no other progress yet, though.
    • Synchronous write mode: Done and merged.
    • Tablet map: Awaiting code review fixes.
  • 7/12/12
    • Coordinator: Implementation in progress, making basic state persistent.
    • LogCabin: Implementing the consensus module; interface and durability already working; coordinator work is not blocked on it.
    • Client retry: Should be done in about a week; converting existing RPCs to the new architecture.
    • Table Enumeration: Functional; needs real-world testing.
    • Log Cleaner: Redesign done; integrating with the log refactoring, should be done in a week.
    • Fault tolerance: Recovery can survive all sorts of failures (recovery master crashes, loss of backups); recovery of multiple hosts works; still smoking out bugs.
    • Cold start: Awaiting client retry, but a hack allows some basic testing; have found and fixed a few bugs, but haven't been able to cold start successfully yet.
    • New potential requirement: leases?
  • 5/11/12
    • Fault-tolerant coordinator: new design in progress.
    • Cold start attempted; fails on enlistment since CoordinatorServerList isn't persisted.
    • Enumerate: designed, coding.
    • Fault tolerance: new Python class for scripting more interesting failure scenarios for RAMCloud.
    • Log cleaner: gathering metrics.

Goals

  • Support a high-volume website
    • Requires durability & availability
  • Support experimental applications
    • May not require durability, only minimal availability
  • Expect users to require serious hand-holding and interaction with the RAMCloud team to develop, deploy, and support their applications

Features

  • Fault-tolerant coordinator (Ankita)
    • LogCabin (Diego)
  • Cold start (Ryan)
  • Client retry (John)
  • Enumerate (Elliott?)
  • Synchronous backup write mode
  • Leases?

Stability and Testing

  • Fault-tolerance (Ryan)
    • Master recovery
    • Backup recovery
    • Cold start
  • Log cleaner (Steve)
  • Overload (Steve)

Deployment

  • Documentation for development and deployment (as much as the group can collectively generate in 1 day)
  • Client interface cleanup (as much as the group can collectively do in 1 day)
  • Packaging (make install)
  • Archival/Extraction via enumerate (see above)
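
Archival via enumeration boils down to streaming every object in a table out to stable storage so the table can be re-created later by replaying the records. A minimal sketch of the idea in Python, with a hypothetical `enumerate_table` generator standing in for RAMCloud's table-enumeration RPC (not the real client API):

```python
import io
import json

def enumerate_table(table):
    # Hypothetical stand-in for table enumeration: yields every
    # (key, value) pair in the table, in no particular order.
    yield from table.items()

def archive_table(table, out):
    # Stream every object to an archive, one JSON record per line,
    # and return the number of records written.
    count = 0
    for key, value in enumerate_table(table):
        out.write(json.dumps({"key": key, "value": value}) + "\n")
        count += 1
    return count
```

Extraction is then the inverse: read the archive line by line and write each record back into a (possibly different) table.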

Notes

  • Planned supported transports
    • TCP: Easy deployment on vanilla hardware, low performance
    • InfRc: Requires InfiniBand NICs/switches, high performance
  • Planned supported scale
    • 80 nodes
    • Test scaling down so we can at least give a lower bound on usable cluster size
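
Transports are selected through RAMCloud's service-locator strings. Assuming the locator syntax used in RAMCloud's documentation (the host name and port below are placeholders, not real cluster values), the two planned transports would be addressed roughly like:

```
tcp:host=rc01,port=12246      vanilla TCP/IP; works on any hardware
infrc:host=rc01,port=12246    InfiniBand reliable-connected queue pairs
```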

Deferred

  • Tablet migration
  • Supporting additional transports/10G Ethernet
  • Performance testing
  • Scale-up testing
  • Monitoring/Management
  • Additional bindings