SEDCL Retreat 2013 - Industrial Feedback Session - Outline Scribe

Curt Kolovson, VMware:
What does this mean for us?
What would it take to support low-latency ops in a virtualized environment?
DCFT: Interesting, need clarity over methodology, present in conjunction with thesis work/overall work.
Dune: Great work, interested.
Raft: Education about how to do user studies and the complexities of understanding Paxos.
pFabric: Intuitive explanation
Copysets: Will have impact, clever interesting.
Graph processing: We are thinking about this too, "big data" applications; consider what the real needs for large memory are in applications. Attempts to convert algorithms for parallelism. May not need big memory to deal with large graphs.
Mutilate: Most applicable to VMware; how do you consolidate well? Take results and turn into a useable cookbook of concise set of recommendations. Very valuable, shame if limited to a handful of researchers.
Meta comments: Have this meeting more than once in a year. Invite more people, larger facility. Provide the slides, or condensed summaries.

Keith Klemba, SAP:
Two to follow up on: Dune and Raft.
Will watch the Raft videos.
Have many of the same problems as RAMCloud.
Wonders about: running software inside in-memory db. NUMA, coherence across threads/txs. Impact on latency. Measured down to lanes in QPI in HANA.

Daphne Chen, Inventec:
Will share findings about virtualization internally.

Kyle Nesbit, Google:
Focus on issues when you want to run RAMCloud as a service: isolation, quota, load balancing, security. Applications, want to understand what new applications it will enable and how it will impact existing applications.
Try natural graphs and mutable graphs on RAMCloud; girlfriend example, want strong consistency.
Teaching the social side of engineering; takes tens of engineers; didn't get enough experience with developing as a group in grad school.

Abdul Kabbani, Google:
Really likes talks that dig into systems and what's going on. Need more talks on datacenter networking next year.
Low-latency is a major issue at Google; above the fabric, application level. Problems are just as bad or worse above the fabric. May be microoptimizing fabrics while many of the problems are in the storage and app layers for latency.
More frequent shorter meetings, potentially, lower latency feedback loop between industry and students.

Hans Fugal, Facebook:
RAMCloud shows what can be done. What can we do in between RAMCloud and memcached? How can we fix TCP or replace TCP in a non-invasive way?
Big difference between the academic and industry papers: industry talks about operational complexities; academics talk about theoretical structure. How do we administer and run it, roll it out, how do failures happen? Continue to try to take real environments and make incremental impact.

Steve Muir, VMware:
Want to do more of these meetings if possible. Try to narrow gap between datasets industry is seeing and what we design with/test with.

Peter Newman, Cisco:
Worried about his job security. Been hearing we're going to fill the datacenter full of cheap switches, but it hasn't happened yet. Missing: load-balancing solution. How do you route? ECMP doesn't work on the spine. Still going to have elephants colliding. Clos wanted to route calls across networking; seems ironic we are moving back to 60 years ago. Why should TCP require ordering? Without that switches get much cheaper. What if we wanted to do a transport protocol from scratch? pFabric radically ahead of what Cisco is up to. Customers are still asking for buffers. DCTCP really solved a problem they had with a minimal tweak. Big buffers are the wrong solution. How long will it take for DCTCP to roll over in large? Starting to happen, big switches will go on for a few more years.

Kevin Deierling, Mellanox:
Interesting that people talk about low-latency and tail latency. RDMA, kernel-bypass: 15 years ago no one understood. Would be nice to hear more about virtualization, virtualization of the NIC.
Parallel evolution: people are running into the same problems at the same time. Want different solutions with the same API.

Diego Crupnicoff, Mellanox:
Would like to see a focus on finding a killer app for RAMCloud. So close to 1.0, begs for having the real life challenges of deployment. Will encounter serious problems that are interesting as it scales. Would like to cooperate on issues we run into as we scale. Evolve beyond key-value store would be interesting. Interested in simulator for the network, would love to help.

Cisco:
Works on NIC and MMU. pFabric very interesting. Starting to talk about power optimization; this is starting to dominate CPU design, will be moving into other aspects. Would like to see more work on that. Thinks the tide will turn, shift will be from latency to power.

Satoshi Yamakawa, NEC:
Interested in durability and availability in RAMCloud; important for users. Looking forward to RAMCloud 1.0.

Shankar Pasupathy, NetApp:
Things I liked about the retreat:
(1) It is clear that RAMCloud continues to make a lot of progress. We were happy to see the study of graph processing algorithms on RAMCloud
(2) The overview of RAMCloud by John had just the right amount of information to help us understand the rest of the retreat work
(3) Location, logistics, agenda were great. Appreciate the efficient way in which we are contacted and signed up for the retreat months in advance.
(4) The addition of two external speakers was a great idea. The Dremel talk was particularly interesting.
Logistics Suggestions:
(1) Some talks felt like they should go a bit longer or have more time for questions, e.g. Alizadeh's talk, so perhaps dropping one talk per day would provide more buffer time
(2) The poster session felt a bit rushed – largely because some posters had a lot of traffic and it was hard to talk to those students before the poster session was over. Perhaps a 10 minute preview of upcoming posters at the end of the day's session followed by the poster session itself would work.
Technical:
(1) Would have loved a talk on the Raft protocol and experiences teaching/implementing Raft and Paxos
(2) It would be great if you could share systems lessons learned in building RAMCloud -- anything from project scheduling, use of programming language, frameworks etc. Ryan Stutsman's talk was interesting but fairly abstract, so perhaps a deeper dive into the challenges faced would be interesting for us industry folks
(3) Much of the challenge of building a practical distributed system is cluster management, understanding performance bottlenecks rapidly, and, most important of all, cluster configuration – we'd love to see more research in these areas
(4) Have you thought about using RAMCloud as a write-back cache that is organized on a cluster of disk-based storage systems, and leverages memory on those systems? How would you fill this cache rapidly? What would you do with this cache?


Deepak Kenchammana, NetApp:
Overall, I like that the group is attacking problems with a clean slate (e.g., Paxos vs. Raft) and is bold enough to try new ways (e.g., user study of Raft). I will echo Shankar’s comment that the poster session was too short.
 
It is nice to see a spurt in the number of students working with RAMCloud as opposed to working on it. Exploring application development frameworks/tools that preserve the low latency while improving programmer productivity might be an interesting side track.