
Master Recovery

Goal: No client can (overtly) distinguish that a failure has occurred using the specified client interface while fewer than f uncorrected failures are ongoing - with high probability.
Goal: Clients can expect latency less than l - with high probability; they can further expect latency bounded by l' while the system has fewer than f ongoing, uncorrected failures.

Implication: After a master failure, low-latency availability must be restored quickly.
Implication: To maintain the ability to restore low-latency access with high probability we must restore durability.

Important Note: We've effectively "decided" that we'll need warm spare machines since we have eliminated shards. In other words, our minimum unit of recovery in the system is currently an entire machine.

Punted for the moment

  • Partitions
  • Nuclear attacks

Focus is single master failure (during any phase of operation) or single backup failure (during any phase of operation).

Design considerations

Client Interface Invariants

Invariant: A read of a key following a write always sees the value associated with that write (barring a power failure on a non-battery backed up host).

Invariant: A write to a key associates a version number with the value that is greater than that of any previous value written to that key location.

Invariant: An application always sees bounded latency.

Invariant: Writes are "k-durable" on return of success.
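
To make these invariants concrete, here is a minimal sketch of a client interface whose contract encodes them. The names and signatures are illustrative assumptions, not the actual client API.

    #include <cstdint>
    #include <string>

    // Hypothetical client-facing types; names are illustrative only.
    struct Value {
        std::string data;
        uint64_t version;   // Monotonically increasing per key (write invariant).
    };

    class Client {
      public:
        virtual ~Client() = default;

        // Returns the value of the most recent successful write to 'key'
        // (read-after-write invariant) within bounded time (latency invariant).
        virtual Value read(uint64_t table, uint64_t key) = 0;

        // Returns only after the object is "k-durable", i.e. recorded on k
        // backups. The returned version is greater than that of any value
        // previously written to this key.
        virtual uint64_t write(uint64_t table, uint64_t key,
                               const std::string& data) = 0;
    };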

Pluggable storage backends

The user, on a table-by-table basis, should be able to select between a variety of durability strategies. These should include:

  • k Disk (default, probably with k = 2)
  • k Flash
  • k RAM
  • Null (no durability)

and many other, more interesting combinations are possible as well (a policy sketch follows the list):

  • PRAM
  • 2 RAM, 1 Disk
  • 1 Flash, 1 Disk

etc.
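
One way the selection might be represented (an illustrative sketch only; the enum values mirror the list above) is a per-table durability policy that the backup placement code consults:

    #include <cstdint>
    #include <vector>

    // Illustrative sketch of a per-table durability policy.
    enum class Medium { Disk, Flash, Ram, Pram };

    struct Replica {
        Medium medium;
        uint32_t copies;
    };

    // A policy is a list of (medium, copies) pairs, so "2 RAM, 1 Disk" and
    // "1 Flash, 1 Disk" fall out naturally; an empty list is the Null
    // (no durability) backend.
    struct DurabilityPolicy {
        std::vector<Replica> replicas;

        static DurabilityPolicy defaultPolicy() {
            return {{{Medium::Disk, 2}}};   // k Disk with k = 2.
        }
    };

Because the policy can differ arbitrarily per table, backup placement has to be computed per segment rather than read off a fixed schedule, which is the implication below.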

Implication: Segment backup placement needs complete flexibility, meaning a "backup server schedule" likely isn't going to work for us.

Master Soft State

This is essentially an enumeration of all the things we must convince ourselves we can recover (hopefully quickly).

Segment Backup Locations - who has segments backed up for this master?
Restoration: Given to the new master by the cluster manager or by one of the former master's backups. Can be a list or a function. The current working assumption is that this isn't strictly needed and that contacting each of the servers in the cluster is practical. Some cleverness will be needed in the case of restarting an entire RAMCloud.

Segment List - which segments are live for a given master
Restoration: Have to send a query to all outstanding backups. This is probably okay because we can piggyback this with the fetch of the metadata summaries. Additionally, since we are doing a broadcast to find segment backup locations we can do this with the same query.
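
A sketch of the combined broadcast (hypothetical RPC and type names): the new master asks every server in the cluster which segments it holds for the crashed master and recovers both pieces of soft state from the replies in a single pass.

    #include <cstdint>
    #include <map>
    #include <vector>

    // Hypothetical reply from one backup: the segment ids it stores for the
    // crashed master.
    struct BackupReply {
        uint64_t backupId;
        std::vector<uint64_t> segmentIds;
    };

    // Aggregate the replies from a cluster-wide broadcast: where each segment
    // is backed up, and (via the key set) which segments are live.
    std::map<uint64_t, std::vector<uint64_t>>   // segmentId -> backupIds
    collectSegmentLocations(const std::vector<BackupReply>& replies)
    {
        std::map<uint64_t, std::vector<uint64_t>> locations;
        for (const auto& reply : replies)
            for (uint64_t segmentId : reply.segmentIds)
                locations[segmentId].push_back(reply.backupId);
        return locations;
    }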

Master vector clock - the version stamp for the next operation on this server
Restoration: After re-reading segment metadata from backups, fetch the most recently written value in the most recent segment and set the vector clock one higher than its version.
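
A minimal sketch of this restoration rule, assuming each recovered entry in the most recent segment carries its version number:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Hypothetical per-object metadata recovered from a backup.
    struct ObjectEntry {
        uint64_t objectId;
        uint64_t version;
    };

    // The restored clock must exceed every version the old master handed out,
    // so take the highest version in the newest segment plus one.
    uint64_t restoreVectorClock(const std::vector<ObjectEntry>& newestSegment)
    {
        uint64_t maxVersion = 0;
        for (const auto& entry : newestSegment)
            maxVersion = std::max(maxVersion, entry.version);
        return maxVersion + 1;
    }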

Id -> Ptr Hashtable - the location in RAM for each object id
Restoration: After re-reading segment metadata from backups, scan all object entries, populating the hashtable.

Objects - the actual data for live objects
Restoration: Scan the hashtable, fetching all objects on demand. This prevents fetching old versions of the objects, for which only the metadata (which has already been fetched) is needed.
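
A sketch of these two steps together, assuming the metadata fetched from backups tells us, for each object id, its latest version and the segment that holds it (types and names are illustrative):

    #include <cstdint>
    #include <optional>
    #include <string>
    #include <unordered_map>

    // Illustrative entry: where the newest version of an object lives.
    struct Location {
        uint64_t segmentId;
        uint64_t version;
        std::optional<std::string> data;   // Empty until fetched on demand.
    };

    class RecoveringMaster {
      public:
        // Populate the id -> location hashtable from recovered metadata,
        // keeping only the highest version seen for each object id.
        void addMetadata(uint64_t objectId, uint64_t segmentId, uint64_t version) {
            auto it = hashtable_.find(objectId);
            if (it == hashtable_.end() || it->second.version < version)
                hashtable_[objectId] = Location{segmentId, version, std::nullopt};
        }

        // Serve a read: fetch the object lazily the first time it is
        // requested. Because the hashtable pins the live version, stale
        // versions in older segments are never fetched.
        const std::string& read(uint64_t objectId) {
            Location& loc = hashtable_.at(objectId);
            if (!loc.data)
                loc.data = fetchFromBackup(loc.segmentId, objectId);
            return *loc.data;
        }

      private:
        // Placeholder: a real implementation would RPC a backup holding the segment.
        std::string fetchFromBackup(uint64_t /*segmentId*/, uint64_t /*objectId*/) {
            return std::string();
        }

        std::unordered_map<uint64_t, Location> hashtable_;
    };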

Cleaning information - tombstone liveness information to prevent resurrecting stale object versions on loss of hashtable
Restoration: Must rebuild the delete pointers or refcounts; they maintain the cleaning invariants needed to recover the next instance should this instance of the master also crash.
  • How do we decide to do this if the data in the segments is fetched on demand?
    • Maybe assert that we can't clean until we've refetched all the data.
    • This seems reasonable anyway - it just causes an increase in active disk space.
  • We might also be able to do this without refetching the objects, but it seems we'd need some kind of shadow layout in memory to maintain the pointer links. Perhaps refcounting doesn't have this problem.
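
One way to express the invariant that either scheme (delete pointers or refcounts) must maintain, sketched with illustrative names: a tombstone may be dropped only once the segment that held the deleted object is gone, otherwise a later recovery could resurrect the stale version.

    #include <cstdint>
    #include <unordered_set>

    // Illustrative tombstone: records which segment held the object it deletes.
    struct Tombstone {
        uint64_t objectId;
        uint64_t objectSegmentId;
    };

    class CleaningState {
      public:
        void segmentWritten(uint64_t segmentId) { liveSegments_.insert(segmentId); }
        void segmentCleaned(uint64_t segmentId) { liveSegments_.erase(segmentId); }

        // Cleaning invariant: keep the tombstone while the segment holding the
        // stale object still exists.
        bool tombstoneStillNeeded(const Tombstone& t) const {
            return liveSegments_.count(t.objectSegmentId) != 0;
        }

      private:
        std::unordered_set<uint64_t> liveSegments_;
    };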

Indexes
Since indexes have no guaranteed relationship with the data on the master we may be better off rebuilding them elsewhere to mitigate the incast problem.

  • The master will have indexes whose data lives on other masters.
    • Implies we must refetch all the indexed data from those masters to rebuild.
  • Other masters will have indexes in which this master's data appears (see the rebuild sketch below).
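
A sketch of the rebuild, under the assumption that each index entry can be regenerated from (secondary key, object id) pairs streamed back by the masters that own the indexed objects; because nothing here has to come from the crashed master, the rebuild can be placed on a different host to spread the incast.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical shape of the data needed to regenerate one index entry.
    using IndexEntry = std::pair<std::string, uint64_t>;   // secondary key -> object id

    // Rebuild an index from entries fetched from the masters holding the
    // indexed data.
    std::multimap<std::string, uint64_t>
    rebuildIndex(const std::vector<std::vector<IndexEntry>>& perMasterEntries)
    {
        std::multimap<std::string, uint64_t> index;
        for (const auto& entries : perMasterEntries)
            for (const auto& entry : entries)
                index.insert(entry);
        return index;
    }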

Walkthrough

Total crash; no battery backup

Backups will write a list of next segment frames to use to their superblock; from this they will be able to recover the list of active segment frames quickly (~60 ms). The superblock will also contain, for each master, a list of which segments are in which segment frames. The system will have to read the additional blocks in the "next segment frames to use" list to see if any of the master segment/segment-frame lists have been extended. The amount of time needed for this can be bounded arbitrarily, at the cost of having to fix "next write" locations further in advance.

Notice the last written segment may be partially corrupted. The backup is expected to consider such a segment as garbage.

Implication: We'll want at least two superblocks in case one ends up being partially written.
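
A sketch of what the superblock might look like (field names are assumptions): two copies are written alternately with a checksum, so a torn write of one copy is detectable and the newer intact copy wins.

    #include <cstdint>

    // Illustrative on-disk superblock layout.
    struct Superblock {
        uint64_t generation;       // Incremented on every superblock write.
        uint64_t nextFrames[32];   // Pre-assigned "next segment frames to use".
        // Per-master (master id, segment id, segment frame) lists would follow.
        uint64_t checksum;         // Over all preceding fields.
    };

    // After a crash, pick which of the two superblock copies to trust.
    const Superblock* chooseSuperblock(const Superblock& a, bool aValid,
                                       const Superblock& b, bool bValid)
    {
        if (aValid && bValid)
            return a.generation >= b.generation ? &a : &b;
        if (aValid) return &a;
        if (bValid) return &b;
        return nullptr;   // Both copies corrupt: frames must be rescanned.
    }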

Thought: It would be nice to say exactly what it is we are guaranteeing in the case of a power failure without battery backups. Is there any way for me to reason about the state of my application after such a failure?

Total crash; battery backup

No difference here, except we shouldn't have any partially written segments.

Master only crash

Master Data

Index Data

Backup Data

m crashes

b notices failure of m

b notifies c

c chooses m', giving it the same name as m

c transfers the set of all machine names to m'

c notifies m'' that it is now responsible for the indexes formerly on m

c notifies m''' that it is responsible for ensuring that each segment of m is backed up at least k times

m' contacts each b in machines, j of them in parallel, asking for a list of the segments that b has backed up for m', along with the object ids and versions in each

m'' rebuilds the indexes by fetching the metadata for all associated indexes; some of these calls may go to the new m', so they may block until it begins servicing requests

m''' fetches a list of which segments all the servers in the cluster have; for the server it is backing up, it checks which segments aren't replicated at least k times and coordinates the duplication of those segments

m' rebuilds its hashtable by forward replay, simultaneously recovering tombstone pointers/refcounts

m' fetches the last object written by m, and restores its Master Vector Clock by adding one to the version of that object

m' begins servicing requests

m' fetches live objects in the background
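
The same sequence expressed as a sketch of the cluster manager's coordination step (all names are hypothetical; RPCs, error handling, and parallelism are omitted):

    #include <string>
    #include <vector>

    // Hypothetical host handle.
    struct Host { std::string name; };

    // Placeholder steps, one per line of the walkthrough above (stub bodies).
    void giveName(Host& replacement, const Host& crashed) { replacement.name = crashed.name; }
    void transferMachineList(const Host&, const std::vector<Host>&) {}
    void assignIndexRecovery(const Host&, const Host&) {}
    void assignRereplication(const Host&, const Host&) {}

    // c's reaction once a backup b reports the failure of m.
    void onMasterFailure(const Host& m, std::vector<Host>& candidates,
                         const std::vector<Host>& machines)
    {
        Host& mPrime = candidates.at(0);        // m': warm spare taking over m's data
        Host& mDoublePrime = candidates.at(1);  // m'': rebuilds m's indexes elsewhere
        Host& mTriplePrime = candidates.at(2);  // m''': restores k-fold backup of m's segments

        giveName(mPrime, m);                    // m' gets the same name as m
        transferMachineList(mPrime, machines);  // so m' can contact every backup
        assignIndexRecovery(mDoublePrime, m);
        assignRereplication(mTriplePrime, m);
        // m' then queries the backups (j at a time), rebuilds its hashtable and
        // tombstone state by forward replay, restores its vector clock from the
        // newest object version, begins servicing requests, and fetches live
        // objects in the background.
    }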

Back of the Envelope

  • Time for an individual backup to fetch a single 8 MB chunk into RAM
    • 8 ms + 8 MB / (150 MB/s) = 8 ms + 53 ms = 61 ms
  • Metadata per segment
    • Segment number
      • 8 bytes for master id + 8 bytes for segment id
    • Object ids
      • 8 bytes per object
      • For 100 B objects filling 64 GB that is 5.12 GB of object ids.
        • This is replicated k times.
        • So for k = 2 roughly 10.24 GB is stored in total; fetching a single copy (~5.12 GB) at ~1 GB/s takes roughly 5 s.

Times of each of the recovery phases

Phase 1: Recover master metadata: ~5 s

Phase 2: Recover master data: up to ~10 min

Phase 1': Recover backup data: negligible

  • It might be worth having a verification phase here to ensure exactly k replicas for every segment.

Phase 1'': Recover indexes: extremely variable; for a table of 1 billion entries the time in seconds is roughly the average index key length in bytes (1 B entries x L-byte keys at ~1 GB/s is about L seconds)
