
Master Recovery

Goal: No client can (overtly) detect through the specified client interface that a failure has occurred while fewer than f uncorrected failures are ongoing - with high probability.
Goal: Clients can expect latency less than l - with high probability; they can further expect latency bounded by l' while the system has fewer than f ongoing, uncorrected failures.

Implication: After a master failure, low-latency availability must be restored.
Implication: To maintain the ability to restore low-latency access with high probability, we must also restore durability.

Important Note: We've effectively "decided" that we'll need warm spare machines since we have eliminated shards. In other words, our minimum unit of recovery in the system is currently an entire machine.

Client Interface Invariants

  • A read of a key following a write always sees the value associated with that write.
  • A write to a key associates a version number with the value that is greater than that of any previous value written to that key location.
  • Bounded latency.
  • Writes are "k-durable" on return.

Master Soft State

This is essentially an enumeration of all the things we must convince ourselves we can recover (hopefully quickly).

Segment Backup Locations
Restoration: Given to the new master by the cluster manager or by one of the former master's backups. One nice possibility: a simple seed value to a function. This requires a cluster manager to slot machines into some dense index, though.

Segment List
Restoration: Have to send a query to all outstanding backups. This is probably okay because we can piggyback it on the fetch of the metadata summaries. All we really need here is where the most recent write went - we then have the total segment list. The minimum would suffice as well. We'll be able to tell easily if we are missing segments.
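A minimal sketch of the segment-list recovery described above. All names here are illustrative, not the real API: each backup reports the set of segment numbers it holds, and since segment numbers are dense, the highest number reported tells us the full expected list; any gap in the union of the reports is a missing segment.

```python
# Sketch (hypothetical API): derive the master's total segment list from
# per-backup reports piggybacked on the metadata-summary fetch.

def recover_segment_list(backup_reports):
    """backup_reports: iterable of sets of segment numbers held by each backup.

    Returns (expected, missing): the reconstructed total segment list and
    any segments no surviving backup reported."""
    held = set()
    for report in backup_reports:
        held |= report
    newest = max(held)                 # where the most recent write went
    expected = set(range(newest + 1))  # segment numbers are dense
    return sorted(expected), sorted(expected - held)

# Example: three backups each hold a subset of segments 0..5;
# segment 3 is reported by nobody, so it is easy to spot as missing.
reports = [{0, 1, 5}, {2, 4}, {0, 4, 5}]
expected, missing = recover_segment_list(reports)
```

The same check works from the minimum if segment numbers wrap; the key property is only that the numbering is dense.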

Master vector clock
Restoration: After re-reading segment metadata from the backups, fetch the most recent value written to the most recent segment and set the vector clock one higher.

Id -> Ptr Hashtable
Restoration: After re-reading segment metadata from the backups, scan all objects, populating the hashtable. This could be done in reverse, except that we'd then have to be able to replay the actual data in the segments in reverse as well, which is a pain.
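The forward-scan rebuild above can be sketched as follows (names are illustrative). Scanning segments in forward log order means a later header for the same id simply overwrites the earlier entry, leaving each id pointing at its newest version; a reverse scan would avoid the overwrites but, as noted, requires replaying the segment data in reverse too.

```python
# Sketch: rebuild the id -> ptr hashtable from backed-up segment metadata.

def rebuild_hashtable(segments):
    """segments: list of (segment_number, headers) in forward log order.
    Each header is (object_id, version, offset_in_segment)."""
    table = {}
    for seg_no, headers in segments:
        for obj_id, version, offset in headers:
            # Later log entries supersede earlier ones for the same id.
            table[obj_id] = (seg_no, offset, version)
    return table

segments = [
    (0, [(10, 1, 0), (11, 1, 132)]),
    (1, [(10, 2, 0)]),   # id 10 rewritten; supersedes its entry in segment 0
]
table = rebuild_hashtable(segments)
```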

Objects
Restoration: Scan the hashtable, fetching all objects on demand. This avoids fetching old versions of objects, for which only the metadata (which has already been fetched) is needed.

Cleaning information
Must rebuild the delete pointers or refcounts; they maintain the cleaning invariants needed to recover the next instance in case this instance of the master crashes.
How do we decide to do this if the data in the segments is fetched on demand?
Maybe assert we can't clean until we've refetched all the data.
Seems reasonable anyway - just causes an increase in active disk space.

Indexes
Since indexes have no guaranteed relationship with the data on the master, we may be better off rebuilding them elsewhere to mitigate the incast problem.

  • A master will have data that doesn't appear in its own indexes.
    • Implies we must refetch all the index data from those masters to rebuild.
    • Likewise, other masters will have indexes in which this master's data appears.

Walkthrough

Total crash; no battery backup

Total crash; battery backup

Master only crash

# Detection
** Possibly some kind of heartbeat to a subset of backups
   After m missed heartbeats, either recover or initiate a vote with the
   other backups

# New master is chosen
** Most likely through a cluster coordinator

# Coordinator sends a message to the new master
** "You are recovering master r and your name is r"

# r computes S_i = (H(r + i) % s), i in [0..k), where s is the total number of active masters
** This is the seed value for the ith backup of r
# Whenever a new segment is started on r
## r computes b_{i,j} = (H(S_i + j) % s), i in [0..k), j is the next
   segment number
## r writes and commits the jth segment to b_{i,j} for i in [0, k)
* For any segment that is cleaned, a free message is sent to b_{i,j}
!!! Serious problem - storing b_{i,j} is not ideal, and we'll need it
    later during cleaning, but calculating it is proportional to j
--- For that reason we'll want to calculate this function upfront and
    then loop over some fixed set - IOW calculate a small table once
--- Size: 4 bytes * 512 * k; for k = 2 that's 4 KB
** Call these "addresses" B = {b_0, ..., b_{k - 1}}
# r contacts all b in B and asks for metadata from any segments that
  were written by the former r
** Metadata
*** Segment number
*** Object headers
**** Currently about 32 bytes per object; for an 8 MB segment with only
     100 byte objects that works out to about 2 MB per segment, times
     k backups per segment, with about 16K segments per machine -
     roughly 16 GB; we can remove the factor of k if we force the
     k backups to agree before one transmits data; about 137 s at 1 Gbps
**** This can be worse depending on the size of the keys stored in
     indexes
*** Object index key payloads
**** Optimization: ONLY for indexes which formerly resided on this
     server - for a former master serving a partial index we might
     want to send a range within which backups report values.
** Rewrite "shadow" versions of all objects with "not present" bit
# Resume normal operation, in parallel
** Next backup chosen based on next segment number
** Run invalidator
*** Looks up objects in the hashtable and fetches them clearing NP bit
* Resume cleaning
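The placement computation in the walkthrough can be sketched as below. This is a reading under stated assumptions, not the real implementation: H stands in for the unspecified hash function, the formula for b_{i,j} is taken to use the segment number j, and the small precomputed table is sized at 512 entries per replica (4 bytes * 512 * k = 4 KB for k = 2, matching the note above).

```python
import hashlib

TABLE_SIZE = 512  # entries per replica stream; placements repeat mod this

def H(x):
    # Stand-in for the unspecified hash function H in the walkthrough.
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:4], "big")

def placement_table(r, k, s):
    """r: id of the master being recovered, k: replication factor,
    s: number of active masters.

    S_i = H(r + i) % s seeds the ith replica stream; the backup for
    segment j, replica i is then read from table[i][j % TABLE_SIZE]."""
    table = []
    for i in range(k):
        S_i = H(r + i) % s
        table.append([H(S_i + j) % s for j in range(TABLE_SIZE)])
    return table

table = placement_table(r=7, k=2, s=100)
# Backup machine index for segment j, replica i: table[i][j % TABLE_SIZE]
```

Precomputing the table once sidesteps the "proportional to j" cost flagged above: cleaning and recovery both just index into it.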

TTR

Can't service requests in this system for at least 137 s at 1 Gbps, or 13.7 s at 10 Gbps, because we have to wait for the full metadata to stream back to us.

Plus we have to scan the full log, which is at least another 21 s. During this time we cannot do anything except service writes with no index data or with multi-index data only.
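A quick check of the streaming figures, assuming the ~16 GB metadata total from the earlier estimate (treated here as 16 GiB):

```python
def stream_time_s(nbytes, link_gbps):
    # Time to stream nbytes over a link of link_gbps (10^9 bits/s).
    return nbytes / (link_gbps * 1e9 / 8)

metadata = 16 * 2**30              # ~16 GiB of segment metadata (assumed)
t1 = stream_time_s(metadata, 1)    # ~137 s at 1 Gbps
t10 = stream_time_s(metadata, 10)  # ~13.7 s at 10 Gbps
```

The 137 s and 13.7 s figures above are consistent with a 16 GiB transfer at full line rate; any protocol overhead only lengthens them.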

Backup Recovery

Once a master's data has been restored, we still need to re-replicate, once more, the backup segments the failed machine held for other masters.

How does b' know which segments b had? Most likely broadcast or gossip.

Notes from meeting with John

  • Pluggable storage backends
  • Recover indexes on different server to mitigate incast
  • Broadcast recovery
    • For restoring backups, servers just walk through the list of all machines, asking who has data for that master and telling them to replicate the backups they held for that master once more.

PROBLEM: This actually creates too many replicas - this'll have to be coordinated by some server.
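A small worked example of the over-replication problem and the coordinator fix (numbers are illustrative): if every surviving replica holder independently makes one more copy, a segment that already has k - 1 survivors ends up with 2(k - 1) replicas, whereas a coordinator can count survivors and request exactly the shortfall.

```python
def replicas_needed(survivors, k):
    """Copies a coordinator should request to restore exactly k replicas."""
    return max(0, k - survivors)

k = 3  # illustrative replication factor

# Broadcast scheme: each of the k - 1 survivors copies itself once.
broadcast_total = (k - 1) * 2                            # 4 replicas: one too many

# Coordinated scheme: survivors plus exactly the shortfall.
coordinated_total = (k - 1) + replicas_needed(k - 1, k)  # exactly k = 3
```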

Important back of the envelope times

Times of each of the recovery phases

Phase 1: Recover master metadata
Phase 2: Recover master data
Phase 1': Recover backup data

  • Might be worth having a verification phase here to ensure exactly k replicas.

Phase 1'': Recover indexes