Final Comments Scribe

NetApp

Most interested in low-latency RPC
Don't care about commodity as much
Don't understand failure modes
Error in log? How do we recover? Scrubbing?
Think about instrumentation & logging early
John: Right amount of instrumentation?
(laughter)
Mogul: Check out Network Flight Recorder
Keith Adams: dtrace
Find interesting set of apps
Global timestamps useful
Expose location of objects to client

Shel

Chicken or egg problem: apps/architecture
Important to know which apps, but need new apps
Will make wrong choices for some cases, right choices for others
Will want some level of computation near the data
What storage nodes do and don't do is an important question
Fan of stored procedures
Agnostic on whether this is going to work
More info on failures
Worried about correlated failures, losing the 2 or 3 copies
Apps are going to want to use the same data
Isolation issues
Steal what works already, tweak for our assumptions (put in our networking)
It's research.

Michael Armbrust

Would love fast RPC
Glad we may build SQL on top
Conditional puts are important
Doesn't understand why single place for object
Constantly recovering
With this many machines some will be slow
If we have devs serialize with speed expectation
Why ids instead of keys
Likes central coordinator - that's the way to do it
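Conditional puts here mean roughly a compare-and-swap on a per-object version number. A minimal sketch, assuming a hypothetical store API (the `Store` class and version scheme are illustrative, not RC's actual interface):

```python
# Sketch of a version-checked conditional put: the write succeeds only
# if the object's current version matches what the caller last saw.
# Store, get, and conditional_put are hypothetical names.

class VersionMismatch(Exception):
    pass

class Store:
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def get(self, key):
        return self._data.get(key, (None, 0))

    def conditional_put(self, key, value, expected_version):
        _, current = self._data.get(key, (None, 0))
        if current != expected_version:
            raise VersionMismatch(f"expected {expected_version}, saw {current}")
        self._data[key] = (value, current + 1)
        return current + 1

store = Store()
v = store.conditional_put("user:1", "alice", expected_version=0)  # first write
try:
    store.conditional_put("user:1", "bob", expected_version=0)    # stale: rejected
except VersionMismatch:
    pass
```

This is the primitive that lets clients build read-modify-write and optimistic concurrency without server-side locks.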

Mike Franklin

Apps - need some apps in mind
Not clear we optimize for the right things if we don't have apps in mind
Locality disappears - hard to believe
Not going to get around locality
Compression on in-memory stuff, surprised we didn't mention it
Other in mem approaches are doing it
Locality in two ways: also push out to avoid load
Interesting choice: single master

Tim Kraska

White paper: believes in 20-80 rule
Why pay 80% of the price for the 20% of the data I don't use?
Transactions: need more isolation; how would it work? What primitives?
Need support for scanning, column stores, applying predicates on the server
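Applying predicates on the server can be sketched as a scan that filters rows where the data lives, so only matches cross the network (function and field names here are illustrative):

```python
# Sketch of server-side predicate pushdown: the storage server scans
# its local rows and ships back only the rows matching the predicate.
# All names are illustrative.

def server_scan(rows, predicate):
    """Filter where the data lives; return matching rows only."""
    return [row for row in rows if predicate(row)]

rows = [{"id": 1, "temp": 20}, {"id": 2, "temp": 35}, {"id": 3, "temp": 31}]
hot = server_scan(rows, lambda r: r["temp"] > 30)  # 2 of 3 rows returned
```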

Marcos

If we choose too narrow an app set, the system is not interesting
Choose too broad, and the system is not good at anything
Have tricky algorithms
provide proofs for them
did proofs on Sinfonia protocols
don't want to publish protocols that fail

NetApp

Not perfectly sure this is what needs to be done about disks getting slow
If plan is to show good of RAM as stable store
investigate InfiniBand etc., make networking peripheral since it's not the point of RC
Thinks we've nailed it with disks -> tape
Flash good enough if latencies don't need to be in the 10s of us
Same problem could be solved by dramatically changing apps
RC solving it without changing apps, which is a good thing

NEC

Big ISP operation in Japan
Have DC operation businesses
Concerned about latency in their DCs
3 control planes
apps, network, data storage
interested in how we integrate these
when combining apps + rc server have same amount of computation
virtualize resources on app and rc servers
Reminds him of HPC systems
All ops done on DSM
HPC message passing systems provided us range RPCs
Data access patterns depend on apps
sometimes disk + cache suffice

NEC

What would be a killer app for low latency?
Something like online search is interesting
Maybe translation
Controlling a robot
Can touch or traverse large data
Impressed by memcached; it is simple
RC as kernel module with special nics
They use memcached for storing query results
formidable enemy to RC
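The memcached usage NEC describes is the classic lookaside cache for query results: check the cache, run the query only on a miss, populate the cache afterward. A sketch with a dict standing in for a real memcached client (all names illustrative):

```python
# The memcached lookaside pattern for query results: check the cache
# first, run the query only on a miss, then populate the cache.
# A dict stands in for a real memcached client; names are illustrative.

cache = {}

def run_query(sql):
    # Stand-in for an expensive database query.
    return f"results-for:{sql}"

def cached_query(sql):
    if sql in cache:
        return cache[sql], True          # cache hit
    result = run_query(sql)
    cache[sql] = result                  # populate for next time
    return result, False                 # cache miss

r1, hit1 = cached_query("SELECT * FROM t")  # miss: runs the query
r2, hit2 = cached_query("SELECT * FROM t")  # hit: served from cache
```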

Bob English

Preference toward components rather than systems
Wants low-latency network access as a component
Not all apps need consistency so you're overbuilding for some apps
Compression and tight memory important for in-mem systems
You lose as much data to not compressing as to not replicating
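The compression point can be made concrete with stdlib zlib: for compressible data, compressing values recovers DRAM capacity on the same order as dropping a replica would. The ratio depends entirely on the data; this example is illustrative:

```python
import zlib

# Sketch: compress values before storing them in DRAM. The repetitive
# value below compresses very well; real savings depend on the data.

value = b"status=ok;" * 100               # 1000 bytes, highly repetitive
stored = zlib.compress(value)
assert zlib.decompress(stored) == value   # lossless round trip
saving = 1 - len(stored) / len(value)     # fraction of memory saved
```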

Jeff Mogul

Useful to have estimate for performance/watt
Want TCO
If DRAM is wrong and we have to switch to NVRAM, would that mean major changes?
Concerned that we're overdesigning; simplify and implement
Back off to 50 us as a thought experiment
A lot easier to get it
Still deliver a lot of research benefits
Not forced into heroics at the OS
Hard to instrument to find out why it's not working right
instrumentation may guarantee it doesn't work at scale

Keith Adams

Any amount of perf will get instantly absorbed by features
Laws of econ not suspended by this project
Doesn't find the "latency solves x or y" argument convincing
But thinks it's refreshing we're trying to hit it
Also greedy for 11 us RPC times
Failures - all correlations go to 1
bad things beget bad things until major problems
Humans messing with network infra; it fails a lot
Network partitions are real
Doesn't understand why that won't kill performance