Durability Scribe

Even if you could replicate, is data durable in DRAM alone, even with ECC?
Can DRAM hold data for 2 years?

Replication factor at Google:
3 is good for disk, given "targeted recovery".
Prioritize chunks for which you have lost 2 disks over ones where you've only lost 1 (see the sketch below).
For RAM he'd be worried: 2 would not be enough, and he's not sure about 3.
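
A minimal sketch of that prioritization (illustrative, not Google's actual recovery scheduler), assuming a hypothetical bookkeeping dict from chunk id to number of surviving replicas:

```python
import heapq

REPLICATION_FACTOR = 3  # the "3 is good for disk" target above

def recovery_order(chunk_replicas):
    """Yield chunks most in need of re-replication first.

    chunk_replicas: hypothetical bookkeeping dict, chunk id -> surviving replicas.
    """
    heap = []
    for chunk_id, alive in chunk_replicas.items():
        missing = REPLICATION_FACTOR - alive
        if missing > 0:
            # heapq is a min-heap, so push negative "missing" to pop chunks
            # that have lost 2 replicas before chunks that have lost only 1.
            heapq.heappush(heap, (-missing, chunk_id))
    while heap:
        neg_missing, chunk_id = heapq.heappop(heap)
        yield chunk_id, -neg_missing

# "b" has lost 2 replicas, so it is re-replicated before "a", which lost 1.
for chunk, lost in recovery_order({"a": 2, "b": 1, "c": 3}):
    print(chunk, "is missing", lost, "replica(s)")
```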

How much of a retreat is battery-backed memory from commodity hardware?
Google says:
"Good and tragic"
Batteries are finicky; you have to check them continuously.
With 2-3% battery failures we will lose data (see the sketch below).
10K machines where 5% have the same failure mode
Need very good operations to make sure these batteries live
Cost of hardware infrastructure is not a problem
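
Back-of-envelope for why correlated failures are scary: if a correlated event (a bad battery batch, say) takes out a fraction f of machines at once and each object's R replicas are placed independently at random, roughly f^R of all objects lose every copy. The numbers below are illustrative assumptions, not figures from the discussion.

```python
def fraction_lost(f, r):
    """Approximate fraction of objects whose r replicas all land in the failed
    set, assuming independent, uniformly random placement."""
    return f ** r

for f in (0.02, 0.03, 0.05):   # the 2-3% and 5% failure fractions above
    for r in (2, 3):           # replication factors under discussion
        print(f"failure fraction {f:.0%}, R={r}: "
              f"~{fraction_lost(f, r):.1e} of objects lose all copies")
```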

What about a UPS every few racks? It's expensive but is that a direction to go in?
Google says:
The cost issues of per-server vs. per-rack batteries are similar.
Operational costs are similar.
The batteries we buy are no better than the ones we build.

Facebook: Software failures are easily correlated.

A snapshot of the hash table would be a consistent view of the log.
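
A toy sketch of that idea (names are illustrative, not RAMCloud's API): objects are appended to a log, a hash table maps keys to log offsets, and copying the hash table at a given log position captures a consistent cut of the log.

```python
class Store:
    """Log-structured toy store: the hash table indexes an append-only log."""

    def __init__(self):
        self.log = []     # append-only list of (key, value) entries
        self.index = {}   # key -> offset of that key's latest entry in the log

    def write(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def snapshot(self):
        # Copying the index (plus the current log length) is a consistent cut:
        # every offset it references is already in the log at that point.
        return dict(self.index), len(self.log)

    def read_at(self, snap, key):
        index, log_len = snap
        off = index.get(key)
        return self.log[off][1] if off is not None and off < log_len else None

s = Store()
s.write("a", 1)
snap = s.snapshot()
s.write("a", 2)               # a later write is invisible through the snapshot
print(s.read_at(snap, "a"))   # -> 1
```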

Google: Missing locality. This whole business of ignoring locality is not something you can afford.
Even for main memory cache misses.
Example from BigTable: order things by URL, and a really simple compressor that
normally gets you 3:1 compression gets you 40:1 compression.
Need to push hard on that if you're storing things in RAM.
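
A toy illustration of the ordering effect with synthetic URLs and zlib (the 3:1 vs. 40:1 figures above are BigTable's, not this script's): sorting by key puts similar records next to each other, so even a simple compressor does much better.

```python
import random, zlib

# Synthetic URL-like keys; real BigTable rows are keyed by URL.
urls = [f"http://example.com/site{site:03d}/page{page:04d}.html"
        for site in range(100) for page in range(100)]

def ratio(records):
    """Compression ratio of the records concatenated in the given order."""
    raw = "\n".join(records).encode()
    return len(raw) / len(zlib.compress(raw))

shuffled = list(urls)
random.shuffle(shuffled)
print("shuffled order:", round(ratio(shuffled), 1),
      " sorted order:", round(ratio(sorted(urls)), 1))
```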

Facebook: Compression in a column store is really good; it's not clear you could do that on the app side.
Google: I think you need to rework the main memory organization.
Facebook: You might have extra cores.
Google: Library can't do cross-object compression.
Google: Decompression (or was that compression?) schemes can do 0.5 GB per second.
You could also work on different compression schemes.
Could make decompression incremental so you can jump into the middle of an object, but compress in big blocks (see the sketch below).
You'd have to do something custom here.
John: We could push decompression to client side.
Google: Not if you compress across objects. Not if it comes with a megabyte dictionary.
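
One possible shape for the custom scheme hinted at above (block size, names, and zlib are all assumptions): pack objects into large blocks that are compressed together, and keep a per-object index of (block, offset, length) so a read decompresses only the one block containing the object.

```python
import zlib

BLOCK_SIZE = 4096  # assumed uncompressed block size

class BlockStore:
    def __init__(self):
        self.blocks = []         # sealed, compressed blocks
        self.buf = bytearray()   # current uncompressed block being filled
        self.index = {}          # key -> (block number, offset, length)

    def put(self, key, data: bytes):
        if len(self.buf) + len(data) > BLOCK_SIZE:
            self._seal()
        self.index[key] = (len(self.blocks), len(self.buf), len(data))
        self.buf.extend(data)

    def _seal(self):
        if self.buf:
            # Cross-object compression: one zlib stream per block.
            self.blocks.append(zlib.compress(bytes(self.buf)))
            self.buf = bytearray()

    def get(self, key) -> bytes:
        block_no, off, length = self.index[key]
        if block_no == len(self.blocks):      # object is still in the open block
            raw = bytes(self.buf)
        else:                                 # decompress just that one block
            raw = zlib.decompress(self.blocks[block_no])
        return raw[off:off + length]

bs = BlockStore()
for i in range(1000):
    bs.put(f"user:{i}", f"name=user{i};status=active".encode())
print(bs.get("user:42"))
```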

What about deduplication?
It's a big issue in the storage world.
E.g., an email system: there are only 20M names in the world.

Much greater potential on the server side.
Cross-object compression should buy you a factor of 5 or 10.
You're checksumming these anyway.
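
A minimal content-addressed dedup sketch, leaning on the "you're checksumming these anyway" point: each unique body is stored once under its checksum, and names map to checksums. The email-attachment example is illustrative, not any particular system.

```python
import hashlib

class DedupStore:
    def __init__(self):
        self.by_hash = {}   # checksum -> body, stored once
        self.names = {}     # object name -> checksum

    def put(self, name, body: bytes):
        digest = hashlib.sha256(body).hexdigest()
        self.by_hash.setdefault(digest, body)   # duplicate bodies are not stored again
        self.names[name] = digest

    def get(self, name) -> bytes:
        return self.by_hash[self.names[name]]

store = DedupStore()
attachment = b"...the same large attachment sent to many recipients..."
for user in ("alice", "bob", "carol"):
    store.put(f"{user}/inbox/msg1/attachment", attachment)
print(len(store.by_hash), "unique bodies for", len(store.names), "names")   # 1 for 3
```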

HP: Concerned about maintaining latency for small objects while doing recovery. Maybe need 2 networks?

Can improve latency by using a bigger replication factor R and not waiting for all of the replicas to respond.
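
A sketch of that, with replica_read standing in for a real replica RPC: send the read to all R replicas concurrently and return whichever answers first, so one slow replica doesn't determine the latency.

```python
import concurrent.futures, random, time

def replica_read(replica_id, key):
    time.sleep(random.uniform(0.001, 0.050))    # simulated variable replica latency
    return f"value-of-{key}@replica{replica_id}"

def read_first(key, r=3):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=r)
    futures = [pool.submit(replica_read, i, key) for i in range(r)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)                   # don't wait for the slower replicas
    return next(iter(done)).result()

print(read_first("user:42"))
```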

HP: Handling peak writes and servicing low latency reads will be hard on a single network.
Google: Is TOS (IP type-of-service marking) enough? (See the sketch after this exchange.)
HP: Not commodity today.
Google: Neither is 2 networks.
HP: You could have plenty of bandwidth, but bursts of storage traffic will cause variance in latency.
HP: InfiniBand, etc, use separate lanes for separate traffic.
Facebook: With 30% writes, is there any bandwidth left for anything else?
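
For reference on the TOS question: a minimal sketch of marking bulk replication traffic differently from latency-sensitive reads using the IP TOS/DSCP bits. It assumes Linux-style sockets and switches that actually honor DSCP, which, as noted, commodity gear may not.

```python
import socket

DSCP_EF = 46     # expedited forwarding: latency-sensitive read path
DSCP_AF11 = 10   # a bulk class: replication / recovery traffic

def tagged_socket(dscp):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # IP_TOS takes the whole 8-bit field; DSCP occupies the top 6 bits.
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
    return s

read_sock = tagged_socket(DSCP_EF)      # small, latency-critical reads
bulk_sock = tagged_socket(DSCP_AF11)    # big recovery / replication transfers
```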

HP: A paper in yesterday's OS review measured flash wearout as a factor of 10 better than what people have been quoting.
Before things wear out, they get slower. It's unclear whether this makes failures predictable.
The infant-mortality problem is worse in flash than in DRAM; flash is sold to people with other expectations.

Google: Can flash do anything for us?

HP: There's a trend away from flash because it's not going to keep scaling past a couple more generations.

Facebook: Flash or PC (phase-change) RAM as main storage might be better in practice than DRAM.

Berkeley: Go to the first master that returns to avoid hitting variations in latency.
Facebook: If you want memcached, you know where to find it.

Could use (erasure) coding to avoid disks entirely.
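
Presumably "coding" means erasure coding: spread each object across machines as data fragments plus parity so a lost in-memory copy can be rebuilt from the survivors without a disk copy. Below is the simplest single-parity (RAID-5-style) case for illustration; a real system would likely use Reed-Solomon to survive multiple simultaneous losses.

```python
def encode(data: bytes, k: int):
    """Split data into k equal fragments (zero-padded) plus one XOR parity fragment."""
    frag_len = -(-len(data) // k)   # ceiling division
    frags = [data[i * frag_len:(i + 1) * frag_len].ljust(frag_len, b"\0")
             for i in range(k)]
    parity = bytearray(frag_len)
    for frag in frags:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return frags + [bytes(parity)]

def recover(fragments, missing_index):
    """Rebuild the single missing fragment by XOR-ing all the surviving ones."""
    frag_len = len(next(f for f in fragments if f is not None))
    out = bytearray(frag_len)
    for idx, frag in enumerate(fragments):
        if idx != missing_index:
            for i, b in enumerate(frag):
                out[i] ^= b
    return bytes(out)

frags = encode(b"an object stored only in DRAM across machines", k=4)
frags[2] = None                 # one machine (and its fragment) is lost
print(recover(frags, 2))        # XOR of the survivors restores it
```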