Assumptions Scribe

Networks will improve?

Network latency: plausible within 5 years?

About 6 people, mostly from Google, think there's a better than 50% chance that latency across a data center will be within 5-10us.

Google: You should assume 2us. "That's the most boring goal you have."
Infiniband exists today.

The latency goal should be from the web app perspective. Google and Facebook want distributions, not goals.

2-second recovery is fast enough?

7 or so people think 2 second hiccups are OK. Think end-to-end: stupid DNS server, etc.

Facebook wants a tryGet() - try to get and if it takes too long, abort. Might be useful when a 2s hiccup is not ok.

How many apps are there nowadays that don't tolerate 2s hiccups?

High velocity financial trading.

Facebook: Crucial piece of metadata on some server, every web machine waits for 2s, systemic fallout from that.
tryGet() with exponential backoff, application-level replication.
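
A minimal sketch of the tryGet() idea with exponential backoff and application-level replication, in Python. Everything here is an assumption for illustration: the name try_get, its parameters, and the convention that each replica is a callable fetch(key, timeout_s) that raises TimeoutError when it takes too long. None of it is an actual RAMCloud API.

```python
import time


def try_get(replicas, key, timeout_s=0.002, max_rounds=3,
            base_backoff_s=0.001, sleep=time.sleep):
    """Try each replica in turn; back off exponentially between rounds.

    `replicas` is a list of hypothetical callables fetch(key, timeout_s)
    that either return a value or raise TimeoutError.
    Returns None if every attempt in every round timed out.
    """
    backoff = base_backoff_s
    for round_no in range(max_rounds):
        for fetch in replicas:
            try:
                return fetch(key, timeout_s)
            except TimeoutError:
                continue  # abort this attempt, move on to the next replica
        if round_no < max_rounds - 1:
            sleep(backoff)  # wait before retrying the whole replica set
            backoff *= 2    # exponential backoff
    return None
```

The caller decides what a None result means (serve stale data, degrade the page, and so on), which is the point of having the abort in the API rather than blocking for the full 2s.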

The faster you decide something has failed, the more hiccups you're likely to have.
It's the number of hiccups over a day that gets interesting here. SLAs.

How often does it happen? How global is the effect?

Per-server battery backups will become available?

Google: distinction (for recovery) between in-flight data and data that was written over half an hour ago. Much more important not to lose old data.

Question about cleaning or in general invariants on our persistent data structures, during a simultaneous failure.

Large-scale power outage experience, in terms of the quantity of data lost: stuff that had made it to disk you almost never lose.
Of the stuff that's in flight, you lose a much higher fraction.

Not knowing what you lost

Suppose you embedded a tree in these objects and you lose the root

Simultaneous failure is not a hardware problem, it's a software problem because you're designing your software to scale to all hosts.

Cascading failure scenarios?

Compare what you're doing with what people are doing today. You may have a higher probability of loss.
Losing something 5 days ago and not being able to roll forward might be a problem for some people.
Some people will be mad about losing even recent transactions if they were important.

Facebook: big red button on RAMCloud itself? You usually get some warning when these things happen.

Google:
Lots of things that make you lose 1% of machines, e.g., losing power
Things that knock down 5% of your machines may be more rare but might happen once or twice a year.

Google: Do you store RAMCloud state in RAMCloud itself?
Losing the pointers to, say, a partition may lose the entire partition.

Low latency will make a big difference?

Google: I think you're right. People that run their services today will say "Oh it's not going to help us"
"It will be a lot easier to develop applications that you could have developed with higher latency"

It might come down to the cost.

Low latency will enable stronger consistency at scale?

You have consistency problems with main memory, it's not like that problem is gone.
You're trying to allow people to increase R.

Berkeley: I think this is true if you're talking about isolation. If you're talking about more of the CAP theorem, I don't think this is right for that.

The people that want a single machine with more memory will be disappointed because of the higher latency.

John: Would people drive up R so far?

Aggregate bandwidth in terms of number of transactions raised.
To use the system effectively, you have to use it near the max capacity.
Then we will have more transactions.

Analytical models for 2 phase locking don't have the duration in time, only in number of resources.

Sinfonia: There's a cross-section: # object locked * amount of time should be fairly small

Locality is getting harder to find and exploit?

One view: in any system with 100M users, there will be low locality.

Google: I don't buy it. There's always going to be locality.
Cluster people by email domain in Facebook.

There's clearly locality on your wall.

Google: Worried about your API given cross-object locality.
Just knowing that if I give things sequential IDs they're likely to be stored on the same machine.

Google: Hard because they have a flat number space.
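
One way to get the "sequential IDs land on the same machine" property out of a flat number space is range partitioning. The sketch below is only an illustration of that idea, not how RAMCloud assigns objects; the class name, split points, and server names are all hypothetical.

```python
import bisect


class RangePartitioner:
    """Map a flat integer ID space onto servers by contiguous ranges."""

    def __init__(self, split_points, servers):
        # split_points[i] is the first ID owned by servers[i + 1].
        if list(split_points) != sorted(split_points):
            raise ValueError("split_points must be sorted")
        if len(servers) != len(split_points) + 1:
            raise ValueError("need exactly one more server than split points")
        self.split_points = list(split_points)
        self.servers = list(servers)

    def server_for(self, object_id):
        # bisect_right counts how many split points are <= object_id,
        # which is the index of the range that contains it.
        return self.servers[bisect.bisect_right(self.split_points, object_id)]


partitioner = RangePartitioner([100, 200], ["s0", "s1", "s2"])
# IDs 100..199 are neighbours in the number space and share server "s1".
```

A hash-based mapping would spread load more evenly but would scatter neighbouring IDs, which is the trade-off behind the API concern above.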

Graph algos have phases: local computation, then batch updates to other machines.

Not clear on optimizing the "local computation" phase; that's going to consume a lot of bandwidth.

This goes back to the full bisection assumption.

Berkeley: Workloads have hot spots. Sometimes you'll want to split these on different servers.
If you have to wait for the last guy, you're going to have bad latency for the worst-case guy.
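
The "wait for the last guy" effect can be put in rough numbers. This sketch assumes each server is slow independently with the same probability, which is a simplification (hot spots are correlated); the function name and the example figures are illustrative only.

```python
def prob_request_is_slow(p_slow_server, fanout):
    """Probability that at least one of `fanout` servers is slow.

    Assumes servers are slow independently with probability p_slow_server.
    A request that must wait for every server is slow whenever any one is.
    """
    return 1.0 - (1.0 - p_slow_server) ** fanout
```

Under that assumption, a 1% chance of a slow server means about a 63% chance that a request touching 100 servers sees at least one slow response.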

Google wants multi-get.
Suppose I want to do a linear pass through all the data. I'd like to go at RAM speeds, not network.
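
A minimal sketch of what a multi-get could look like on the client side: group the keys by owning server and issue one batched request per server instead of one request per key. The names multi_get, server_for, and fetch_batch are hypothetical stand-ins, not part of any existing RAMCloud interface.

```python
from collections import defaultdict


def multi_get(keys, server_for, fetch_batch):
    """Fetch many keys with one batched call per server.

    server_for(key) -> server name (hypothetical placement lookup).
    fetch_batch(server, keys) -> dict of key -> value (hypothetical RPC).
    """
    by_server = defaultdict(list)
    for key in keys:
        by_server[server_for(key)].append(key)

    results = {}
    for server, batch in by_server.items():
        # One round trip per server, regardless of how many keys it holds.
        results.update(fetch_batch(server, batch))
    return results
```

The per-server calls are issued sequentially here for clarity; a real client would presumably send them in parallel so the total latency is closer to one round trip.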

Google: Could say: Let's assume someone gives us a distributed file system to store all the log files. How can we get the latency down?

John: Is there such a filesystem in existence?
Google: Conceptually, I don't see what the difference is between your writes and those of hadoopfs or gfs.
They scatter things across.
It could be that you don't really want to deal with their implementations.

Other things like backups kind of come for free.

Google: I kind of sympathize with the position because you don't want to build layer upon layer upon layer. Builds up latency.

Another assumption: DRAM is going to get cheaper. Patterson says that's not the case.

Reasons behind correlated failures likely to change with e.g., phase change memory.

Another assumption: object size. Checksums and version sizes don't let you have extremely small objects.
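
To see why per-object metadata limits how small objects can usefully be, here is a back-of-the-envelope sketch. The 4-byte checksum and 8-byte version are assumed sizes chosen for illustration; the notes do not state the actual field sizes.

```python
def overhead_fraction(payload_bytes, checksum_bytes=4, version_bytes=8):
    """Fraction of an object's stored bytes that is metadata.

    checksum_bytes and version_bytes are assumed sizes, not measured ones.
    """
    metadata = checksum_bytes + version_bytes
    return metadata / (metadata + payload_bytes)
```

With these assumed sizes, a 12-byte payload spends half of its storage on metadata, while a payload of about 1 KB spends roughly 1%.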