
Master Recovery

Goal: With high probability, no client can (overtly) detect through the specified client interface that a failure has occurred while fewer than f uncorrected failures are ongoing.
Goal: With high probability, a client can expect latency less than l; it can further expect latency bounded by l' while the system has fewer than f ongoing, uncorrected failures.

Implication: After a master failure, low-latency availability must be restored.
Implication: To maintain the ability to restore low-latency access with high probability, we must also restore durability.

Important Note: We've effectively "decided" that we'll need warm spare machines since we have eliminated shards. In other words, our minimum unit of recovery in the system is currently an entire machine.

Design considerations

Client Interface Invariants

Invariant: A read of a key following a write always sees the value associated with that write.

Invariant: A write to a key associates with the value a version number greater than that of any value previously written to that key location.

Invariant: An application always sees bounded latency.

Invariant: Writes are "k-durable" on return of success.
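As a sanity check on how these invariants interact, here is a minimal toy sketch (Python, all names made up); the real write path would obviously differ, but "success implies k-durable" and strictly increasing versions look roughly like this:

    # Toy sketch only: illustrates the invariants above, not the real interface.
    class Backup:
        def __init__(self):
            self.store = {}

        def write(self, key, version, value):
            self.store[key] = (version, value)  # pretend this is durable
            return True                         # acknowledge the write

    class ToyMaster:
        def __init__(self, backups, k):
            assert len(backups) >= k
            self.backups = backups
            self.k = k
            self.versions = {}  # key -> latest version number handed out
            self.values = {}    # key -> latest value

        def write(self, key, value):
            # Version invariant: strictly greater than any previous version.
            version = self.versions.get(key, 0) + 1
            acks = sum(1 for b in self.backups[:self.k]
                       if b.write(key, version, value))
            if acks < self.k:
                raise IOError("write did not reach k backups")
            self.versions[key] = version
            self.values[key] = value
            return version  # returning success implies the write is k-durable

        def read(self, key):
            # Read-after-write invariant: always the value of the latest write.
            return self.versions[key], self.values[key]

    m = ToyMaster([Backup(), Backup()], k=2)
    m.write("x", "hello")
    assert m.read("x") == (1, "hello")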

Pluggable storage backends

The user, on a table-by-table basis, should be able to select between a variety of durability strategies. These should include:

  • Disk k (default, probably with k = 2)
  • Flash k
  • RAM k
  • Null (no durability)

Many other, more interesting combinations are possible as well:

  • PRAM
  • RAM 2, Disk 1
  • Flash 1, Disk 1

etc.
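One way to express these per-table choices is a schedule of (medium, copies) pairs; the sketch below is purely illustrative and the table names and helper are made up:

    # Illustrative only: per-table durability strategies as (medium, copies) pairs.
    DURABILITY = {
        "orders":   [("disk", 2)],               # Disk 2 (the default)
        "sessions": [("ram", 3)],                # RAM 3
        "scratch":  [],                          # Null (no durability)
        "hot":      [("ram", 2), ("disk", 1)],   # RAM 2, Disk 1
    }

    def backups_required(table):
        # Total backup copies a write must reach before success is returned.
        return sum(copies for _medium, copies in DURABILITY.get(table, []))

    assert backups_required("hot") == 3
    assert backups_required("scratch") == 0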

Implication: Segment backup placement needs complete flexibility, meaning a "backup server schedule" likely isn't going to work for us.

Master Soft State

This is essentially an enumeration of all the things we must convince ourselves we can recover (hopefully quickly).

Segment Backup Locations
Restoration: Given to the new master by the cluster manager or, one nice possibility, by one of the former master's backups; it can be a list or a function. The current working assumption is that this isn't strictly needed and that contacting each of the servers in the cluster is practical.

Segment List
Restoration: We have to send a query to all outstanding backups. This is probably okay because we can piggyback it on the fetch of the metadata summaries; moreover, since we are already broadcasting to find segment backup locations, the same query can serve both purposes.
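A rough sketch of that combined broadcast, assuming a hypothetical Backup.query RPC that returns both the segment list and the per-segment (object id, version) summaries in one reply:

    # Sketch: one broadcast recovers both the segment list and the metadata.
    # Backup.query(master_id) is a hypothetical RPC returning
    # {segment_id: [(object_id, version), ...]} for every segment it holds.
    from concurrent.futures import ThreadPoolExecutor

    def recover_segment_map(backups, master_id, j=8):
        segment_map = {}  # segment_id -> {object_id: highest version seen}
        with ThreadPoolExecutor(max_workers=j) as pool:
            for reply in pool.map(lambda b: b.query(master_id), backups):
                for seg_id, entries in reply.items():
                    seg = segment_map.setdefault(seg_id, {})
                    for obj_id, version in entries:
                        seg[obj_id] = max(seg.get(obj_id, 0), version)
        return segment_map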

Master vector clock
Restoration: After re-reading segment metadata from the backups, fetch the most recent value written to the most recent segment and set the vector clock one higher than that value's version.
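A sketch of the same idea, except it takes the maximum version across the already-fetched metadata summaries, which gives the same bound as fetching the newest object:

    # Sketch: restore the clock strictly above any version the old master issued.
    def restore_clock(segment_map):
        newest = 0
        for entries in segment_map.values():
            if entries:
                newest = max(newest, max(entries.values()))
        return newest + 1  # first version the new master may hand out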

Id -> Ptr Hashtable
Restoration: After re-reading segment metadata from the backups, scan all object entries, populating the hashtable.
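Sketch of the rebuild from metadata alone, where a "ptr" is a stand-in for (segment id, version) and segment ids are assumed to reflect log order:

    # Sketch: rebuild the id -> ptr hashtable from metadata only.
    # No object data is fetched yet; only the newest location per id is kept.
    def rebuild_hashtable(segment_map):
        table = {}
        for seg_id in sorted(segment_map):  # forward replay in log order
            for obj_id, version in segment_map[seg_id].items():
                current = table.get(obj_id)
                if current is None or version > current[1]:
                    table[obj_id] = (seg_id, version)
        return table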

Objects
Restoration: Scan the hashtable, fetching objects on demand. This avoids fetching old versions of objects, for which only the metadata (which has already been fetched) is needed.
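The on-demand read path might then look roughly like this, with fetch_object standing in for a hypothetical backup RPC:

    # Sketch: serve reads before all object data has been refetched.
    # fetch_object(segment_id, obj_id, version) is a hypothetical RPC to any
    # backup holding that segment; cache maps obj_id -> value already in RAM.
    def read(obj_id, table, cache, fetch_object):
        if obj_id not in cache:
            seg_id, version = table[obj_id]  # newest version, known from metadata
            cache[obj_id] = fetch_object(seg_id, obj_id, version)
        return cache[obj_id]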

Cleaning information
We must rebuild the delete (tombstone) pointers or refcounts; they maintain the cleaning invariants needed to recover the next instance if this instance of the master crashes.
How do we decide to do this if the data in the segments is fetched on demand?
Maybe assert we can't clean until we've refetched all the data.
Seems reasonable anyway - just causes an increase in active disk space.
Also, we might be able to do this anyway without refetching the objects. But it seems like we'd need some kind of shadow layout in memory to maintain the pointer links. Perhaps refcounting doesn't have this problem.
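If refcounting works out, the counts can be derived during the same metadata replay, with no shadow layout and no object data; purely illustrative:

    # Sketch: per-segment live-object refcounts derived from metadata alone.
    # `table` is the id -> (segment_id, version) hashtable built above; a
    # segment's count is the number of objects whose newest version lives there.
    def rebuild_refcounts(table):
        refcounts = {}
        for seg_id, _version in table.values():
            refcounts[seg_id] = refcounts.get(seg_id, 0) + 1
        return refcounts  # a segment whose count reaches 0 holds no live objects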

Indexes
Since indexes have no guaranteed relationship with the data on the master, we may be better off rebuilding them elsewhere to mitigate the incast problem.

  • The master will have data that doesn't appear in its own indexes.
    • This implies we must refetch all of the index data from those masters to rebuild.
  • Other masters will have indexes in which this master's data appears.

Walkthrough

Total crash; no battery backup

Total crash; battery backup

Master only crash

m crashes

b notices failure of m

b notifies c

c chooses m', giving it the same name as m

c transfers the set of all machine names to m'

c notifies m'' that it is now responsible for the indexes formerly on m

c notifies m''' that it is responsible for ensuring that each segment of m is backed up at least k times

m' contacts each backup b in the machine list, j of them in parallel, asking for the list of segments that b has backed up for m' along with the object ids and versions in each

m'' rebuilds the indexes by fetching the metadata for all associated indexes; some of these calls may go to the new m' and so may block until it begins servicing requests

m''' fetches a list of which segments each server in the cluster holds; for the server it is responsible for, it checks which segments are no longer backed up at least k times and coordinates the duplication of those segments

m' rebuilds its hashtable by forward replay, simultaneously recovering tombstone pointers/refcounts

m' fetches the last object written by m, and restores its Master Vector Clock by adding one to the version of that object

m' begins servicing requests

m' fetches live objects in the background
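The same walkthrough collapsed into one control-flow sketch; every call below (choose_spare, notify_index_owner, recover, ...) is a placeholder, not a real RPC:

    # Sketch: coordinator-side control flow for a master-only crash.
    def handle_master_failure(cluster, m):
        c = cluster.coordinator
        m_new = c.choose_spare(name=m.name)              # m' takes over m's name
        c.send_machine_list(m_new)                       # set of all machine names
        c.notify_index_owner(cluster.pick_server(), m)   # m'': rebuild m's indexes
        c.notify_rereplicator(cluster.pick_server(), m)  # m''': restore k copies
        m_new.recover()       # metadata broadcast, hashtable replay, clock restore
        m_new.start_servicing()                          # objects fetched lazily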

Back of the Envelope

  • Time for an individual backup to fetch a single 8 MB chunk into RAM
    • 8 ms + (8 MB / 150 MB/s) = 8 ms + 53 ms = 61 ms
  • Metadata per segment
    • Segment number
      • 8 bytes for master id + 8 bytes for segment id
    • Object ids
      • 8 bytes per object
      • For 100 B objects filling 64 GB that is 5.12 GiB (≈ 5.49 GB) of object ids.
        • This is replicated k times on the backups.
        • So for k = 2 the backups hold 10.24 GiB of metadata in total; the new master needs to fetch only one copy, about 5.49 GB, which takes about 5.49 s at roughly 1 GB/s.
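The same arithmetic spelled out; the ~1 GB/s effective metadata-fetch rate is an assumption of this sketch:

    # Back-of-the-envelope numbers, spelled out.
    seek_s   = 0.008                      # 8 ms seek
    disk_bw  = 150e6                      # 150 MB/s
    chunk    = 8e6                        # 8 MB segment
    print("chunk fetch: %.0f ms" % ((seek_s + chunk / disk_bw) * 1000))  # ~61 ms

    ram      = 64 * 2**30                 # 64 GiB full of 100 B objects
    id_size  = 8                          # 8 B of metadata per object
    k        = 2
    one_copy = ram / 100 * id_size        # object ids for one copy of the log
    fetch_bw = 1e9                        # assumed ~1 GB/s effective fetch rate
    print("metadata per copy: %.2f GiB (%.2f GB)"
          % (one_copy / 2**30, one_copy / 1e9))                    # ~5.12 GiB
    print("metadata on backups, k=%d: %.2f GiB" % (k, one_copy * k / 2**30))
    print("fetch one copy: %.2f s" % (one_copy / fetch_bw))        # ~5.5 s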

TTR

In the ideal case, a server holding only tiny (100 B) objects cannot service requests for about 5.49 s.

Backup Recovery

Once a failed machine's master data has been restored, we also need to re-replicate, once each, the segments it was storing as a backup.

How does b' know which segments b had? Most likely broadcast or gossip.
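Once the cluster-wide segment lists have been collected (by broadcast or gossip), the under-replication check itself is simple; names below are hypothetical:

    # Sketch: find segments of the failed backup b that are now under-replicated.
    # cluster_segments maps backup name -> set of segment ids it stores; the map
    # is assumed to have been collected already by broadcast or gossip.
    def underreplicated(cluster_segments, failed_backup, k):
        lost = cluster_segments.get(failed_backup, set())
        survivors = [segs for name, segs in cluster_segments.items()
                     if name != failed_backup]
        return {seg for seg in lost
                if sum(seg in segs for segs in survivors) < k}

    # Each returned segment must be copied from a surviving replica to a new
    # backup until k copies exist again.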

Important back of the envelope times

Times of each of the recovery phases

Phase 1: Recover master metadata
Phase 2: Recover master data
Phase 1': Recover backup data

  • Might be worth having a verification phase here to verify that exactly k copies exist.

Phase 1'': Recover indexes