2-week Milestones
This page documents the goals and results for a series of milestones for the first implementation of RAMCloud. Each milestone lasts 2 weeks, ending on a Monday.
Milestone 17: ends October 4, 2010
Ryan goals:
Finish edits in response to code review.
File bugs on remaining important issues in FastTransport.
Ryan actual:
Steve goals:
Finish Infiniband Transport and Driver
Simple benchmarks of the above
Restart work on new Log and Segment code
Steve actual:
Diego goals:
Finish updates to Transport API.
Update Driver API.
Finish Ethernet packet driver.
Diego actual:
Milestone 16: ends September 21, 2010
Ryan goals:
Code review FastTransport, commit with review changes
Analyze performance of FastTransport.
Ryan actual:
Working through code review:
Done with everything except FastTransport*
Worked through easy comments in FastTransport.h
Steve goals:
Infiniband transport & driver
Continue working on performance discrepancy with Mellanox
Steve actual:
Driver mostly done (blocked on ServiceLocator); it currently copies data to avoid memory registration issues
Transport 80% coded
Mellanox: our hardware is slower, but we're not sure why (they get latency up to 40% better than us)
Diego goals:
Finish ServiceLocator.
Finish Ethernet packet driver.
Resolve monitor issue.
Diego actual:
Class/code review done, but integration still in progress.
No packet driver work.
Monitor #5 works
John goals:
TCPTransport performance improvements.
Class prep.
John actual:
No TCPTransport work yet.
Milestone 15: ends September 7, 2010
Ryan goals:
Tuning/code review for FastTransport
Ryan actual:
Code review imminent
Steve goals:
Gone one week
Infiniband transport working
Steve actual:
Gone one week
Partial Infiniband transport (waiting for ServiceLocators)
Talking with Mellanox about performance
Diego goals:
Working monitor
Driver to send raw Ethernet packets through the kernel
Diego actual:
Monitor still broken
Ethernet packet driver exists but not in proper form
Implemented ServiceLocator class; undergoing code review, not integrating cleanly
John goals:
Really really check in Client/Server changes.
Start on smaller issues.
John actual:
Committed Client/Server changes
Fixed several smaller issues (cpplint, etc.)
Milestone 14: ends August 23, 2010
Ryan goals:
Finish documentation and tests for FastTransport (1 week)
Possible next steps:
Redo sessions
Add UDP layer
Ryan actual:
A few more unit tests to write for FastTransport
More work needed on docs
FastTransport currently ~1300 lines
Steve goals:
Learn about Infiniband software layers
Start on Infiniband Transport class
Steve actual:
Still learning Infiniband, writing simple stand-alone app
Setup code for Infiniband is complex
Mellanox person coming tomorrow for training
Diego goals:
On vacation
Diego actual:
Vacation
John goals:
Finish Client/Server changes and check in (1 week)
Odds and ends
John actual:
Client/Server changes code-reviewed, working on followups
Switched to using exceptions instead of Status returns.
Milestone 13: ends August 9, 2010
Ryan goals:
New transport working without timers
Take basic performance measurements of new transport
Timers working with new transport
Ryan results:
C++ FastTransport implementation compiles and is showing signs of life, but has almost no tests or documentation yet.
It looks like the Session mechanism is going to need a redesign.
Steve goals:
Intel driver completely working in user space
Basic user-space 10G performance measurements (without fast transport)
Ideal: measure performance with user-space driver and new transport
Steve results:
Mellanox NICs and switch arrived
Got Infiniband and 10GE pings working using "out-of-the-box" software.
Starting to learn about Infiniband verbs
Intel work is on hold for now
Diego:
Finish translating ClientSession into C++.
On vacation
John goals:
Client converted to C++, upgraded for full use of Buffers, integrated with RAMCloud
John results:
Client converted to C++ and tested, along with Server.
Several new header files, such as Rpc.h and Table.h
All existing C++ code (including ClientMain and Bench) now uses the new Client.
Still need to convert the Python bindings and Python code.
Milestone 12: ends July 26, 2010
Ryan:
Implement a few specific benchmarks using Metrics counters (actual: no change)
Commit Metrics class (actual: done)
Aggregator classes (actual: started)
Additional: started translating Transport from Python to C++; coded InboundMessage; working on OutboundMessage.
Steve:
Rerun E1000 measurements on new machines (actual: done, but slightly slower than HiStar machines: 19 us for 64 bytes vs. 18.4 us)
10G driver running in user space (actual: kernel pings work; using "HugeTLB" to map big pages, pin; wrote "HugeTLBPhys" driver to return physical addresses; all that's left is to map device registers into user space)
Diego goals:
Finish Python version of fast transport
Possibly start on logging mechanism
Diego results:
Python version done
Committed basic logging (prints to stderr)
Helping Ryan with Python->C++ translation of Transport
Milestone 11: ends July 12, 2010
Ryan:
2 or 3 basic benchmarks (est: 3 days; actual: some simple benchmarks running, Metrics class ready for code review)
Integrate into Makefile (est: 1 day; actual: done, server started automatically for tests now)
Steve:
Set up machines, send RAMCloud packets to/from each machine (actual: done)
Experiment with 10G user-space driver (actual: works using standard in-kernel approach, not yet working from user space)
Rerun E1000 tests on new machines (actual: not done)
Log stuff (actual: not done)
Diego:
Finish buffer integration (actual: done, merged into master)
More transport work (actual: still working on Python prototype; feature complete)
Vacation July 28-Aug. 13
Milestone 10: ends June 28, 2010
Ryan:
Performance infrastructure (actual: 3 days, basic measurements working)
Steve:
Set up machines? (actual: machines just arrived)
Revisions to log code (actual: complete rewrite of log code, mostly coded but no tests/docs; just starting on cleaner)
Diego:
Finish buffer integration (partial progress, not done: .25 day)
Decompose fast transport into tasks (actual: task changed, see below)
Start on fast transport tasks (actual: task changed, see below)
Revised task: full working Python prototype on UDP (actual: mostly there: 6 days)
Additional task: Python mock script (reduce duplication of prototypes, other improvements: .3 day)
Milestone 9: ends June 14, 2010
Ryan:
Benchmark planning (actual: worked mostly on this; some initial end-to-end measurements working)
Try out simulator for performance analysis (actual: tried it, but not sure how interesting it is yet)
Steve:
6 machines ordered (est: 1 day; actual: in purchasing)
Set machines up (est: 1 day; actual: N/A)
Clean up log code (actual: just getting started)
Understand existing drivers (w. Aravind) (actual: done)
Aravind:
Transfer knowledge
Diego:
Replicate Aravind's 11 usec numbers (est: 1 day) (actual: done)
Take over fast transport from Aravind (actual: done)
Integrate Buffer into code base (actual: partial progress, 1 day)
Milestone 8: ends May 31, 2010
Ryan:
Working on other projects/quals
Steve:
Working on other projects/quals
Aravind:
Finish testing/docs for RPC driver (est: 2 days; actual: 2 days, done)
Write RPC transport class (est: 5 days; actual: 1 day, not done)
GSRC talk (est: 0; actual: 2 days, done)
POMI poster (est: 0; actual: 0.5 day)
Diego:
Buffer chunk deallocation (est: 2 days; actual: 2.25 days, done)
Finish TCPTransport cleanup (est: 1 day; actual: .625 days, done)
Try out Google Mock framework (est: 1 day; actual: .875 days, done/discarded)
Milestone 7: ends May 17, 2010
Ryan:
Working on other projects/quals
Steve:
Working on other projects/quals
Aravind:
Fast RPC driver class (NIC, buffers) (est: 2 days; actual: 3 days, not done (no tests/doc))
Fast RPC transport class (mux/de-mux) (est: 5 days; actual: 0.5 day, not done)
Protocol design for retransmit/large packets (est: 1 day; actual: done, 1 day)
Web site for code reviews (est: 0.5 day; actual: done, 0.5 day)
Diego:
ZooKeeper measurements: latency/bandwidth of get/set (est: 1 day; actual: done, 1 day)
Buffer chunk deallocation (est: 1 day; actual: 1 day, coding not started)
Update Hash Table after code review (est: 1 day; actual: done, 1 day)
Try out Google Mock framework (est: 1 day; actual: not started)
TCPTransport code review cleanup (est: 1 day; actual: 0.5 day, not done)
Milestone 6: ends May 3, 2010
Ryan:
Working on other projects
Steve:
Working on other projects
Aravind:
Documentation and tests for RPC code (est: 2 days, actual: done, 1 day)
Properly integrate Buffer into the code; right now Buffer is used only just before sending an RPC (est: 2 days, actual: done, 0.5 day)
Computer forum poster (est: 2 days, actual: N/A)
Diego:
Poster (est: 3 days, actual: done, 2 days)
Unit tests and docs for TCP transport (est: 2 days, actual: done, 2 days)
ZooKeeper measurements: latency/bandwidth of get/set (est: 1 day, actual: not started)
Milestone 5: ends April 19, 2010
Ryan:
Mostly working on other projects
Flesh out design of index recovery with partitions (est: 1 day, actual: no time to work)
Steve:
Working on other projects (actual: done!)
Aravind:
Full task list for TCP-based RPC (est: 1 day, actual: done, 1 day)
Other RPC tasks TBD
Port existing code to Buffer and RPC (est: was TBD, actual: done except for comments and tests, 2 days)
Hook up RPC system to use Diego's TCP Transport (est: was TBD; actual: done except for comments and tests, 1 day)
Mail Jeff Dean about ProtocolBuffer performance disconnect (est: 1 day, actual: done, 1 day)
Diego:
TCP transport for RPC (est: 3 days) (actual: not done, 2.5 days)
Get ZooKeeper installed and running (est: 1 day) (actual: done, 0.5 day)
ZooKeeper measurements: latency/bandwidth of get/set (est: 1 day) (actual: not started)
Milestone 4: ends March 29, 2010
Ryan:
Slides for review
Steve:
Slides for review
Aravind:
Slides for review
RPC refactoring (no estimate)
Diego:
Slides for review
Milestone 3: ends March 15, 2010
Aravind:
Implement RPC API on TCP (est: 3 days) (actual: done, 2 days)
Finish Protocol Buffer analysis (est: 2 days) (actual: "done", 1.5 days)
Deallocate Buffer memory (est: 2 days) (actual: not done)
Extra stuff: reworked Buffer class after code review
Diego:
RAM-33: Least-usable file system: small file ops (est: 2 days) (actual: done, 0.5 day)
RAM-51: RAMCloud build attempt with TCP and post-mortem analysis (est: 1 day) (actual: done, except unit test binary too large to fit, 0.5 day)
RAM-50: File-by-file opt in/out of "extra checks", add to pre-commit hooks (est: 2 days) (actual: done, 1 day)
RAM-41: Hash table cleanup/code review (est: 2 days) (actual: done, 3 days)
Ryan:
Multi-host backup (est: 1 day) (actual: not done, 0.5 day)
Save and collect segment location info for restore (est: 2 days) (actual: not done)
RPC design (actual: done, 0.5 day)
Steve:
Log/Segment refactoring (est: 2 days) (actual: not done)
Other code review issues (est: 2 days) (actual: not done)
Threading design (actual: done, 1 day)
Milestone 2: ends March 1, 2010
Aravind:
Analyze protocol buffer performance (est: 0.5 day) (actual: not done, 0.5 day)
Implement new RPC API on TCP (est: 4 days) (actual: 70% done?, 3 days)
Code review of RCBUF (est: 0.5 day) (actual: done, 3 days)
Quick and dirty user-level implementation of scatter-gather (est: 1 day) (actual: done, but not with Buffer, 1 day)
Diego (modified 2010-02-18 with John's approval, and somewhat changed my mind 2010-02-19):
RAM-36: Finish mini-transactions (est: 1 day) (actual: done, 0.5 day)
RAM-33: Least usable version of file system (est: 5 days) (actual: not done (only directories so far), 4 days)
RAM-26: Table enumeration (est: 2 days) or maybe instead RAM-38: C/C++ extension for Python bindings (est: 2 days) (actual: not started)
Ryan:
Propagate GetSegMetaData through server (est: 0.5 day) (actual: done, 0.5 day)
Boilerplate recovery routine (est: 0.5 day) (actual: in flux because of segment refactor, 1 day)
Creating shadow objects in log (est: 2 days) (actual: not done, depends on segment refactor)
Steve:
Log/tombstone tests (est: 1 day) (actual: done, 2 days)
Design for multi-threading (est: 2 days) (actual: not done, 1 day so far)
Milestone 1: ends February 15, 2010
Aravind:
RCBuf implementation (including docs and tests) (est: 3 days) (actual: done in 4 days, ~10 hours)
Use RCBufs in current RPC mechanism (est: 1 day) (actual: done in 2 half-days)
New RPC API implemented on TCP (est: 4 days) (actual: not done)
Extras: protocol buffer tests
Diego:
Mini-transactions returning Booleans (est: 3 of "7") (actual: not done (90%?) spent "5")
Mini-transactions returning results/error info (est: 4 of "7") (actual: not done (90%?), spent "2")
Extras: transaction discussions, functional tests
Ryan:
Finish GetSegMetaData (est: 2 days) (actual: done in 16 actual hours)
Support multiple segments in backup (est: 3 days) (actual: not done)
Support multiple backups/master (est: 3 days) (actual: not done)
Extras: cleanup (4 hours)