Recovery Scribe
Recovery Scribe Notes:
Hash table recovery worse for smaller objects? It's memory bandwidth that's limiting us when we randomly access the hash table; can we sort things by hash key value and then stream them in? We don't need to wait for it all, just batch some.
Why not have the hash table asynchronously replicated, with the client talking to multiple masters?
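A minimal sketch of the batch-and-sort idea (all names and sizes here are assumptions, not from the discussion): buffer a batch of recovered objects, sort by hash bucket, and insert in bucket order so the table is walked roughly sequentially instead of at random.

    NUM_BUCKETS = 1 << 20          # assumed hash table size
    BATCH_SIZE = 4096              # assumed batch size; tune for cache/bandwidth

    def bucket_of(key: bytes) -> int:
        return hash(key) % NUM_BUCKETS

    def recover_batched(incoming_objects, hash_table):
        """incoming_objects yields (key, value) pairs streamed from backups."""
        batch = []
        for key, value in incoming_objects:
            batch.append((bucket_of(key), key, value))
            if len(batch) >= BATCH_SIZE:
                _flush(batch, hash_table)
                batch.clear()
        _flush(batch, hash_table)  # no need to wait for everything; flush the tail

    def _flush(batch, hash_table):
        # Sorting by bucket turns scattered writes into a near-sequential sweep.
        for bucket, key, value in sorted(batch, key=lambda t: t[0]):
            hash_table.setdefault(bucket, {})[key] = value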
Checkpoint restore? Rebuild vs store approach for hash table.
Log per partition or log per master? Bigtable initially had one log per tablet and switched to one log per server; they don't have a better solution than that now. One log per server means more complexity during recovery: suppose we have 100 partitions sharing the same log. To recover one partition, we have to skim through the log to find that partition's entries, so instead of doing 100 passes, we sort the log by partition first and then do recovery. That's extra complexity. And when we get a data loss event, it's really difficult to find which data was lost, since everything is mixed into the same log.
If we can tell, for a given request, what data it needs, we can prioritize recovery so that we recover that data first.
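A rough sketch of the sort-then-replay step described above (the entry layout is an assumption): sort the shared log by (partition, sequence number) once, then replay each partition's records contiguously in a single pass.

    from itertools import groupby

    def recover_from_shared_log(log_entries, recover_partition):
        """log_entries: iterable of (partition_id, seq_no, record).
        recover_partition: callback applied to one partition's ordered records."""
        ordered = sorted(log_entries, key=lambda e: (e[0], e[1]))
        for partition_id, group in groupby(ordered, key=lambda e: e[0]):
            recover_partition(partition_id, [record for _, _, record in group])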
If the clients know about the backups, the clients could ask the backups who the new master is.
Can we hash the object and use a deterministic function to decide which backup to talk to?
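One way this could look (a hypothetical sketch, assuming a fixed, globally known backup list): derive the backups for an object purely from a hash of its key, so any client can recompute them without a lookup.

    import hashlib

    def backups_for(key: bytes, backup_list, replicas=3):
        digest = int.from_bytes(hashlib.sha1(key).digest(), "big")
        start = digest % len(backup_list)
        # Take `replicas` backups starting at the hashed position
        # (assumes replicas <= number of backups).
        return [backup_list[(start + i) % len(backup_list)] for i in range(replicas)]

    # Example: backups_for(b"object-42", ["backup0", "backup1", "backup2", "backup3"])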
How does the client know which server to talk to? The first time, you're going to talk to the wrong master, so that master tells the client where the object actually is.
What if there is a crash during phase 2 of partitioned recovery? Locality will worsen over time.
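A sketch of that redirect pattern (class and method names are hypothetical): the client guesses a master from a possibly stale location cache; a wrong master replies with the correct owner, and the client retries and remembers it.

    class WrongMaster(Exception):
        """Raised by a master that does not own the requested key."""
        def __init__(self, correct_master):
            self.correct_master = correct_master

    class Master:
        def __init__(self, name, owned, owner_hint):
            self.name, self.owned = name, owned        # owned: key -> value
            self.owner_hint = owner_hint               # key -> name of real owner
        def read(self, key):
            if key in self.owned:
                return self.owned[key]
            raise WrongMaster(self.owner_hint[key])    # redirect the client

    def client_read(masters, location_cache, key, first_guess):
        master_name = location_cache.get(key, first_guess)  # first guess may be wrong
        while True:
            try:
                value = masters[master_name].read(key)
                location_cache[key] = master_name      # remember the real owner
                return value
            except WrongMaster as redirect:
                master_name = redirect.correct_master  # follow the redirect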
Partitions themselves don't fragment.
What sort of locality do we look for in a partition? What are we really buying by reshuffling the partitions after recovery? If we had locality across partitions, we should try to preserve that after recovery as well, right? What operations span partitions in a way that they can take advantage of locality? Are we saying we get locality in chunks of 640 MB?
If we have variable-length data and things change, how do we keep objects from moving across partitions? What happens? That partition will have to get subdivided into two tablets, and the tablets will move to new partitions.
Is it OK that some reads take half a second to complete? If it's a really hot piece of data, it is unavailable for 0.5 s. Also, since the wait ends at the same moment for everyone, every application waiting for that read gets its reply at the same time, potentially resulting in an incast problem.
Hair trigger on the timeouts: we would have a lot of recovery traffic on the network at all times, which might increase latency arbitrarily for normal operations.
If the heartbeats between the master and coordinator are really short, a master could be declared failed without it knowing.
What about a network partition? The side without the coordinator cannot do anything; we shouldn't elect a new coordinator.
Jeff Dean's idea: the coordinator detects which partitions are dead and builds this up as a queue, with only k partitions recovering at any given time. If we get a lookup for one of these partitions, we move it up to the head of the queue. This solves the hair-trigger problem.
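A small sketch of that queued-recovery idea (the structure and names are assumptions, not a confirmed design): at most k partitions recover concurrently, and a lookup for a still-dead partition bumps it to the front of the queue.

    from collections import deque

    class RecoveryQueue:
        def __init__(self, k):
            self.k = k                      # max concurrent recoveries
            self.pending = deque()          # dead partitions waiting to recover
            self.in_progress = set()

        def partition_failed(self, pid):
            if pid not in self.in_progress and pid not in self.pending:
                self.pending.append(pid)
            self._maybe_start()

        def lookup(self, pid):
            # A client request for a dead partition moves it to the head.
            if pid in self.pending:
                self.pending.remove(pid)
                self.pending.appendleft(pid)
                self._maybe_start()

        def recovery_done(self, pid):
            self.in_progress.discard(pid)
            self._maybe_start()

        def _maybe_start(self):
            while self.pending and len(self.in_progress) < self.k:
                pid = self.pending.popleft()
                self.in_progress.add(pid)
                start_recovery(pid)         # assumed hook that kicks off recovery

    def start_recovery(pid):
        print(f"recovering partition {pid}")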
How is the physical topology discovered by the master? It's part of the metadata that the coordinator maintains: the master asks the coordinator for a list of available backup locations along with a physical rack number or something similar.
1-2 seconds may still be too long; it may take hours to get back to equilibrium after a failure.
SuperCaps: capacitors with a very high farad rating. They can supply 20-30 seconds of power. They don't wear out like batteries, though they do age a bit. If we do a shared log, we don't need 10-20 seconds, we only need half a second. Cost is not a problem; we just need a well-engineered solution.
What is the time needed to recover from the loss of an entire data center? We have a lot of data flying around; it could be hours or days. If we had enough battery backup, we could push the RAM to local disk before losing power, which may prevent unnecessary network traffic.
Assume that out of the R replicas, one is always written to your own local disk. If we don't have enough bisection bandwidth, we can recover from the local disk after a power failure.
We need backups for people to trust us. Offsite?