2-week Milestones

This page documents the goals and results for a series of milestones for the first implementation of RAMCloud. Each milestone lasts 2 weeks, ending on a Monday.

Milestone 17: ends October 4, 2010

  • Ryan goals:

    • Finish edits in response to code review.

    • File bugs on remaining important issues in FastTransport.

  • Ryan actual:

    •  

  • Steve goals:

    • Finish Infiniband Transport and Driver

    • Simple benchmarks of the above

    • Restart work on new Log and Segment code

  • Steve actual:

    •  

  • Diego goals:

    • Finish updates to Transport API.

    • Update Driver API.

    • Finish Ethernet packet driver.

  • Diego actual:

    •  

Milestone 16: ends September 21, 2010

  • Ryan goals:

    • Code review FastTransport, commit with review changes

    • Analyze performance of FastTransport.

  • Ryan actual:

    • Working through code review:

      • Done with everything except FastTransport*

      • Worked through easy comments in FastTransport.h

  • Steve goals:

    • Infiniband transport & driver

    • Continue working on performance discrepancy with Mellanox

  • Steve actual:

    • Driver mostly done (blocked on ServiceLocator); it copies data for now to avoid memory registration issues

    • Transport 80% coded

    • Mellanox: our hardware is slower, but we're not sure why (they get latency up to 40% better than us)

  • Diego goals:

    • Finish ServiceLocator.

    • Finish Ethernet packet driver.

    • Resolve monitor issue.

  • Diego actual:

    • Class/code review done, but integration still in progress.

    • No packet driver work.

    • Monitor #5 works

  • John goals:

    • TCPTransport performance improvements.

    • Class prep.

  • John actual:

    • No TCPTransport work yet.

Milestone 15: ends September 7, 2010

  • Ryan goals:

    • Tuning/code review for FastTransport

  • Ryan actual:

    • Code review imminent

  • Steve goals:

    • Gone one week

    • Infiniband transport working

  • Steve actual:

    • Gone one week

    • Partial Infiniband transport (waiting for ServiceLocators)

    • Talking with Mellanox about performance

  • Diego goals:

    • Working monitor

    • Driver to send raw Ethernet packets through the kernel

  • Diego actual:

    • Monitor still broken

    • Ethernet packet driver exists but not in proper form

    • Implemented ServiceLocator class; undergoing code review, not integrating cleanly

  • John goals:

    • Really really check in Client/Server changes.

    • Start on smaller issues.

  • John actual:

    • Committed Client/Server changes

    • Fixed several smaller issues (cpplint, etc.)

Milestone 14: ends August 23, 2010

  • Ryan goals:

    • Finish documentation and tests for FastTransport (1 week)

    • Possible next steps:

      • Redo sessions

      • Add UDP layer

  • Ryan actual:

    • A few more unit tests to write for FastTransport

    • More work needed on docs

    • FastTransport currently ~1300 lines

  • Steve goals:

    • Learn about Infiniband software layers

    • Start on Infiniband Transport class

  • Steve actual:

    • Still learning Infiniband, writing simple stand-alone app

    • Setup code for Infiniband is complex

    • Mellanox person coming tomorrow for training

  • Diego goals:

    • On vacation

  • Diego actual:

    • Vacation

  • John goals:

    • Finish Client/Server changes and check in (1 week)

    • Odds and ends

  • John actual:

    • Client/Server changes code-reviewed, working on followups

    • Switched to using exceptions instead of Status returns.

Milestone 13: ends August 9, 2010

  • Ryan:

    • New transport working without timers

    • Take basic performance measurements of new transport

    • Timers working with new transport

  • Ryan results:

    • C++ FastTransport implementation compiles and is showing signs of life, but has almost no tests or documentation yet.

    • It looks like the Session mechanism is going to need a redesign.

  • Steve goals:

    • Intel driver completely working in user space

    • Basic user-space 10G performance measurements (without fast transport)

    • Ideal: measure performance with user-space driver and new transport

  • Steve results:

    • Mellanox NICs and switch arrived

    • Got Infiniband and 10GE pings working using "out-of-the-box" software.

    • Starting to learn about Infiniband verbs

    • Intel work is on hold for now

  • Diego:

    • Finish translating ClientSession into C++.

    • On vacation

  • John goals:

    • Client converted to C++, upgraded for full use of Buffers, integrated with RAMCloud

  • John results:

    • Client converted to C++ and tested, along with Server.

    • Several new header files, such as Rpc.h and Table.h

    • All existing C++ code (including ClientMain and Bench) now uses the new Client.

    • Still need to convert the Python bindings and Python code.

Milestone 12: ends July 26, 2010

  • Ryan:

    • Implement a few specific benchmarks using Metrics counters (actual: no change)

    • Commit Metrics class (actual: done)

    • Aggregator classes (actual: started)

    • Additional: started translating Transport from Python to C++; coded InboundMessage; working on OutboundMessage.

  • Steve:

    • Rerun E1000 measurements on new machines (actual: done, but slightly slower than HiStar machines: 19 us for 64 bytes vs. 18.4 us)

    • 10G driver running in user space (actual: kernel pings work; using "HugeTLB" to map big pages, pin; wrote "HugeTLBPhys" driver to return physical addresses; all that's left is to map device registers into user space)

  • Diego goals:

    • Finish Python version of fast transport

    • Possibly start on logging mechanism

  • Diego results:

    • Python version done

    • Committed basic logging (prints to stderr)

    • Helping Ryan with Python->C++ translation of Transport

Milestone 11: ends July 12, 2010

  • Ryan:

    • 2 or 3 basic benchmarks (est: 3 days; actual: some simple benchmarks running, Metrics class ready for code review)

    • Integrate into Makefile (est: 1 day; actual: done, server started automatically for tests now)

  • Steve:

    • Set up machines, send RAMCloud packets to/from each machine (actual: done)

    • Experiment with 10G user-space driver (actual: works using standard in-kernel approach, not yet working from user space)

    • Rerun E1000 tests on new machines (actual: not done)

    • Log stuff (actual: not done)

  • Diego:

    • Finish buffer integration (actual: done, merged into master)

    • More transport work (actual: still working on Python prototype; feature complete)

    • Vacation July 28-Aug. 13

Milestone 10: ends June 28, 2010

  • Ryan:

    • Performance infrastructure (actual: 3 days, basic measurements working)

  • Steve:

    • Set up machines? (actual: machines just arrived)

    • Revisions to log code (actual: log code complete rewrite (mostly coded, no tests/docs), just starting on cleaner)

  • Diego:

    • Finish buffer integration (actual: partial progress, not done: .25 day)

    • Decompose fast transport into tasks (actual: task changed, see below)

    • Start on fast transport tasks (actual: task changed, see below)

    • Revised task: full working Python prototype on UDP (actual: mostly there: 6 days)

    • Additional task: Python mock script (reduce duplication of prototypes, other improvements: .3 day)

Milestone 9: ends June 14, 2010

  • Ryan:

    • Benchmark planning (actual: worked mostly on this; some initial end-to-end measurements working)

    • Try out simulator for performance analysis (actual: tried it, but not sure how interesting it is yet)

  • Steve:

    • 6 machines ordered (est: 1 day; actual: in purchasing)

    • Set machines up (est: 1 day; actual: N/A)

    • Clean up log code (actual: just getting started)

    • Understand existing drivers (w. Aravind) (actual: done)

  • Aravind:

    • Transfer knowledge

  • Diego:

    • Replicate Aravind's 11 usec numbers (est: 1 day) (actual: done)

    • Take over fast transport from Aravind (actual: done)

    • Integrate Buffer into code base (actual: partial progress, 1 day)

Milestone 8: ends May 31, 2010

  • Ryan:

    • Working on other projects/quals

  • Steve:

    • Working on other projects/quals

  • Aravind:

    • Finish testing/docs for RPC driver (est: 2 days; actual: 2 days, done)

    • Write RPC transport class (est: 5 days; actual: 1 day, not done)

    • GSRC talk (est: 0; actual: 2 days, done)

    • POMI poster (est: 0; actual: 0.5 day)

  • Diego:

    • Buffer chunk deallocation (est: 2 days; actual: 2.25 days, done)

    • Finish TCPTransport cleanup (est: 1 day; actual: .625 days, done)

    • Try out Google Mock framework (est: 1 day; actual: .875 days, done/discarded)

Milestone 7: ends May 17, 2010

  • Ryan:

    • Working on other projects/quals

  • Steve:

    • Working on other projects/quals

  • Aravind:

    • Fast RPC driver class (NIC, buffers) (est: 2 days; actual: 3 days, not done (no tests/doc))

    • Fast RPC transport class (mux/de-mux) (est: 5 days; actual: 0.5 day, not done)

    • Protocol design for retransmit/large packets (est: 1 day; actual: done, 1 day)

    • Web site for code reviews (est: 0.5 day; actual: done, 0.5 day)

  • Diego:

    • ZooKeeper measurements: latency/bandwidth of get/set (est: 1 day; actual: done, 1 day)

    • Buffer chunk deallocation (est: 1 day; actual: 1 day, coding not started)

    • Update Hash Table after code review (est: 1 day; actual: done, 1 day)

    • Try out Google Mock framework (est: 1 day; actual: not started)

    • TCPTransport code review cleanup (est: 1 day; actual: 0.5 day, not done)

Milestone 6: ends May 3, 2010

  • Ryan:

    • Working on other projects

  • Steve:

    • Working on other projects

  • Aravind:

    • Documentation and tests for RPC code (est: 2 days, actual: done, 1 day)

    • Properly integrate Buffer into the code; right now Buffer is used only just before sending an RPC. (est: 2 days, actual: done, 0.5 day)

    • Computer forum poster (est: 2 days, actual: N/A)

  • Diego:

    • Poster (est: 3 days, actual: done, 2 days)

    • Unit tests and docs for TCP transport (est: 2 days, actual: done, 2 days)

    • ZooKeeper measurements: latency/bandwidth of get/set (est: 1 day, actual: not started)

Milestone 5: ends April 19, 2010

  • Ryan:

    • Mostly working on other projects

    • Flesh out design of index recovery with partitions (est: 1 day, actual: no time to work)

  • Steve:

    • Working on other projects (actual: done!)

  • Aravind:

    • Full task list for TCP-based RPC (est: 1 day, actual: done, 1 day)

    • Other RPC tasks TBD

      • Port existing code to Buffer and RPC (est: was TBD, actual: done except for comments and tests, 2 days)

      • Hook up RPC system to use Diego's TCP Transport (est: was TBD; actual: done except for comments and tests, 1 day)

    • Mail Jeff Dean about ProtocolBuffer performance disconnect (est: 1 day, actual: done, 1 day)

  • Diego:

    • TCP transport for RPC (est: 3 days) (actual: not done, 2.5 days)

    • Get ZooKeeper installed and running (est: 1 day) (actual: done, 0.5 day)

    • ZooKeeper measurements: latency/bandwidth of get/set (est: 1 day) (actual: not started)

Milestone 4: ends March 29, 2010

  • Ryan:

    • Slides for review

  • Steve:

    • Slides for review

  • Aravind:

    • Slides for review

    • RPC refactoring (no estimate)

  • Diego:

    • Slides for review

Milestone 3: ends March 15, 2010

  • Aravind:

    • Implement RPC API on TCP (est: 3 days) (actual: done, 2 days)

    • Finish Protocol Buffer analysis (est: 2 days) (actual: "done", 1.5 days)

    • Deallocate Buffer memory (est: 2 days) (actual: not done)

    • Extra stuff: reworked Buffer class after code review

  • Diego:

    • RAM-33: Least-usable file system: small file ops (est: 2 days) (actual: done, 0.5 day)

    • RAM-51: RAMCloud build attempt with TCP and post-mortem analysis (est: 1 day) (actual: done, except unit test binary too large to fit, 0.5 day)

    • RAM-50: File-by-file opt in/out of "extra checks", add to pre-commit hooks (est: 2 days) (actual: done, 1 day)

    • RAM-41: Hash table cleanup/code review (est: 2 days) (actual: done, 3 days)

  • Ryan:

    • Multi-host backup (est: 1 day) (actual: not done, 0.5 day)

    • Save and collect segment location info for restore (est: 2 days) (actual: not done)

    • RPC design (actual: done, 0.5 day)

  • Steve:

    • Log/Segment refactoring (est: 2 days) (actual: not done)

    • Other code review issues (est: 2 days) (actual: not done)

    • Threading design (actual: done, 1 day)

Milestone 2: ends March 1, 2010

  • Aravind:

    • Analyze protocol buffer performance (est: 0.5 day) (actual: not done, 0.5 day)

    • Implement new RPC API on TCP (est: 4 days) (actual: 70% done?, 3 days)

    • Code review of RCBUF (est: 0.5 day) (actual: done, 3 days)

    • Quick and dirty user-level implementation of scatter-gather (est: 1 day) (actual: done, but not with Buffer, 1 day)

  • Diego (modified 2010-02-18 with John's approval, and somewhat changed my mind 2010-02-19):

    • RAM-36: Finish mini-transactions (est: 1 day) (actual: done, 0.5 day)

    • RAM-33: Least usable version of file system (est: 5 days) (actual: not done (only directories so far), 4 days)

    • RAM-26: Table enumeration (est: 2 days) or maybe instead RAM-38: C/C++ extension for Python bindings (est: 2 days) (actual: not started)

  • Ryan:

    • Propagate GetSegMetaData through server (est: 0.5 day) (actual: done, 0.5 day)

    • Boilerplate recovery routine (est: 0.5 day) (actual: in flux because of segment refactor, 1 day)

    • Creating shadow objects in log (est: 2 days) (actual: not done, depends on segment refactor)

  • Steve:

    • Log/tombstone tests (est: 1 day) (actual: done, 2 days)

    • Design for multi-threading (est: 2 days) (actual: not done, 1 day so far)

Milestone 1: ends February 15, 2010

  • Aravind:

    • RCBuf implementation (including docs and tests) (est: 3 days) (actual: done in 4 days, ~10 hours)

    • Use RCBufs in current RPC mechanism (est: 1 day) (actual: done in 2 half-days)

    • New RPC API implemented on TCP (est: 4 days) (actual: not done)

    • Extras: protocol buffer tests

  • Diego:

    • Mini-transactions returning Booleans (est: 3 of "7") (actual: not done (90%?), spent "5")

    • Mini-transactions returning results/error info (est: 4 of "7") (actual: not done (90%?), spent "2")

    • Extras: transaction discussions, functional tests

  • Ryan:

    • Finish GetSegMetaData (est: 2 days) (actual: done in 16 actual hours)

    • Support multiple segments in backup (est: 3 days) (actual: not done)

    • Support multiple backups/master (est: 3 days) (actual: not done)

    • Extras: cleanup (4 hours)