RAMCloud 1.0

This page tracks the work remaining to reach version 1.0 of RAMCloud. This version is intended to be a "Least Usable System": the smallest amount of functionality that could be sufficient to support actual applications.

Target Timeframe

July - October 2012

Issues

Progress

  • 11/8/12

    • Risks: may need to add these.

      • Gathering statistics for balancing recovery.

      • LogCabin cleaning.

      • Log Cabin: Reconsidering API.

    • In progress

      • Coordinator: Tablet map persistence is still to do.

      • Cold Start: Try larger tests?

      • Tablet map: Stop using protocol buffers on masters.

    • Done

      • Coordinator: Server list management. Recovery works.

      • Log Cabin: Implementing consensus module.

      • Log Cleaner: Done.

      • Cold Start: Works.

      • Leases: Done.

  • 8/30/12

    • Coordinator: Starting to test server list management; tablet map persistence is still to do.

    • Log Cabin: Reimplementing consensus module.

    • Client retry: Done.

    • Log Cleaner: A week out from merge.

    • Fault Tolerance: No progress; probably regressions due to churn in backup rewrite.

    • Cold Start: Needs to be reimplemented due to backup storage changes.

    • Leases: Officially a 1.0 feature; no other progress, though.

    • Synchronous write mode: Done and merged.

    • Tablet map: Awaiting code review fixes.

  • 7/12/12

    • Coordinator: Implementation in progress, making basic state persistent

    • Log Cabin: Implementing consensus module; interface and durability already working; coordinator work is not blocked on it

    • Client retry: Should be done in about a week; converting existing RPCs to the new architecture

    • Table Enumeration: Functional, needs real-world testing

    • Log Cleaner: Redesign done; integrating with log refactoring, should be done in a week

    • Fault tolerance: Recovery can survive all sorts of failures (recovery master crashes, loss of backups); recovery of multiple hosts works; still smoking out bugs

    • Cold start: Awaiting client retry, but a hack allows some basic testing; have found and fixed a few bugs, but haven't been able to successfully cold start yet

    • New potential requirement: leases?

  • 5/11/12

    • Fault-tolerant coordinator: new design in progress

    • Cold start attempted; fails on enlistment since CoordinatorServiceList isn't persisted

    • Enumerate: designed, coding

    • Fault tolerance: new Python class for scripting more interesting failure scenarios for RAMCloud

    • Log cleaner: gathering metrics

Goals

  • Support a high-volume website

    • Requires durability & availability

  • Support experimental applications

    • May not require durability, only minimal availability

  • Expect users to require serious hand-holding and interaction with the RAMCloud team to develop, deploy, and support their application

Features

  • Fault-tolerant coordinator (Ankita)

    • Log Cabin (Diego)

  • Cold start (Ryan)

  • Client retry (John)

  • Enumerate (Elliott?)

  • Synchronous backup write mode

  • Leases?

Stability and Testing

  • Fault-tolerance (Ryan)

    • Master recovery

    • Backup recovery

    • Cold start

  • Log cleaner (Steve)

  • Overload (Steve)

Deployment

  • Documentation for development and deployment (as much as the group can collectively generate in 1 day)

  • Client interface cleanup (as much as the group can collectively do in 1 day)

  • Packaging (make install)

  • Archival/Extraction via enumerate (see above)

Notes

  • Planned supported transports

    • TCP: Easy deployment on vanilla hardware, low performance

    • InfRc: Requires InfiniBand NICs/switches, high performance

  • Planned supported scale

    • 80 nodes

    • Test scaling down so we can at least give a lower bound on usable cluster size

Deferred

  • Tablet migration

  • Supporting additional transports/10 Gigabit Ethernet

  • Performance testing

  • Scale up testing

  • Monitoring/Management

  • Additional bindings