Master Recovery

Goal: No application can (overtly) distinguish that a failure has occurred using the specified client interface while fewer than f uncorrected failures are ongoing - with high probability.
Goal: Clients can expect latency less than l - with high probability; they can further expect latency bounded by l' while the system has fewer than f ongoing, uncorrected failures.

Implication: Low-latency availability must be restored after a master failure.
Implication: To maintain the ability to restore low-latency access with high probability, we must also restore durability.

Important Note: We've effectively "decided" that we'll need warm spare machines since we have eliminated shards. In other words, our minimum unit of recovery in the system is currently an entire machine.

Punted for the moment

The focus is a single master failure or a single backup failure, during any phase of operation.

Design considerations

Client Interface Invariants

Invariant: A read of a key following a write always sees the value associated with that write (barring a power failure on a non-battery-backed-up host).

Invariant: A write to a key associates a version number with the value that is greater than that of any previous value written to that key location.

Invariant: An application always sees bounded latency.

Invariant: Writes are "k-durable" on return of success.
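
A minimal C++ sketch of how the version-number and k-durability invariants constrain the write path. Every name and signature here is an assumption for illustration, not the actual client or master interface, and the replication call is a stub.

  #include <cassert>
  #include <cstddef>
  #include <cstdint>
  #include <utility>
  #include <vector>

  // Illustrative only: none of these names come from the real system.
  struct ObjectRecord {
      uint64_t version;              // must grow monotonically per key
      std::vector<char> data;
  };

  // Stand-in for the replication layer: appends the record to the log and
  // blocks until backups acknowledge it; returns the number of acks gathered.
  static size_t replicateToBackups(uint64_t /*key*/, const ObjectRecord& /*rec*/,
                                   size_t k)
  {
      return k;                      // stub: pretend k backups acknowledged
  }

  // The write path expressed in terms of the invariants above: the returned
  // version exceeds every earlier version of the key, and success is not
  // reported until the record is k-durable.
  uint64_t writeObject(uint64_t key, std::vector<char> data,
                       uint64_t latestVersion, size_t k)
  {
      ObjectRecord rec{latestVersion + 1, std::move(data)};
      size_t acks = replicateToBackups(key, rec, k);
      assert(rec.version > latestVersion);   // version-number invariant
      assert(acks >= k);                     // k-durability invariant
      return rec.version;                    // a subsequent read now sees rec
  }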

Pluggable storage backends

The user, on a table-by-table basis, should be able to select between a variety of durability strategies. These should include:

and a lot of other more interesting ones are possible as well.

etc.

Implication: Segment backup placement needs complete flexibility, meaning a "backup server schedule" likely isn't going to work for us.

Master Soft State

This is essentially an enumeration of all the things we must convince ourselves we can recover (hopefully quickly).

Segment Backup Locations - who has segments backed up for this master?
Restoration: Given to the new master from the cluster manager or from one of the former master's backups. Can be a list or a function. The current working assumption is that this isn't strictly needed and that contacting each of the servers in the cluster is practical. Some cleverness will be needed in the case of restarting an entire RAMCloud.

Segment List - which segments are live for a given master
Restoration: Have to send a query to all outstanding backups. This is probably okay because we can piggyback this with the fetch of the metadata summaries. Additionally, since we are doing a broadcast to find segment backup locations we can do this with the same query.
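
A hedged sketch of that combined broadcast: one query to every server recovers both pieces of soft state (which backups hold segments for this master, and which segments those are). All types are assumptions, and the per-backup RPC is passed in as a callable rather than tied to any real interface.

  #include <cstdint>
  #include <functional>
  #include <map>
  #include <string>
  #include <vector>

  // Sketch with assumed types.
  struct SegmentSummary {
      uint64_t segmentId;
      // Object ids and versions for the segment can piggyback here.
  };

  using QueryBackup =
      std::function<std::vector<SegmentSummary>(const std::string& backup,
                                                uint64_t masterId)>;

  // Ask every server in the cluster what it stores for 'masterId' and merge
  // the answers into segmentId -> list of backup locations. A real version
  // would issue the queries j at a time in parallel.
  std::map<uint64_t, std::vector<std::string>>
  findSegmentBackups(const std::vector<std::string>& allServers,
                     uint64_t masterId, const QueryBackup& query)
  {
      std::map<uint64_t, std::vector<std::string>> locations;
      for (const std::string& backup : allServers)
          for (const SegmentSummary& seg : query(backup, masterId))
              locations[seg.segmentId].push_back(backup);
      return locations;
  }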

SafeVersion (Master vector clock) - the version number for the next object creation on this server
Restoration: Read the special object 'objectSafeVersion' appended at the beginning of each log segment; it records the safeVersion at the time the segment was created. Then, during log replay, safeVersion is updated so that it remains greater than the version numbers of any replayed tombstones in the restored open segment. Since the log cleaner deletes tombstones in closed segments, 'objectSafeVersion' preserves the safeVersion from when the head segment was opened.
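
A small sketch of that rule, with assumed record types (this is not the actual log format): the recorded objectSafeVersion is a floor, and every replayed tombstone can only push safeVersion higher.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Assumed, simplified view of an entry replayed from the open segment.
  struct ReplayedEntry {
      bool isTombstone;
      bool isSafeVersionRecord;
      uint64_t version;            // tombstone version or recorded safeVersion
  };

  uint64_t restoreSafeVersion(const std::vector<ReplayedEntry>& openSegment)
  {
      uint64_t safeVersion = 0;
      for (const auto& entry : openSegment) {
          if (entry.isSafeVersionRecord)
              // Value recorded when the head segment was opened.
              safeVersion = std::max(safeVersion, entry.version);
          else if (entry.isTombstone)
              // Never reuse a version at or below a replayed tombstone.
              safeVersion = std::max(safeVersion, entry.version + 1);
      }
      return safeVersion;
  }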

Id -> Ptr Hashtable - the location in RAM for each object id
Restoration: After re-reading the segment metadata from the backups, scan all objects, populating the hashtable.

Objects - the actual data for live objects
Restoration: Scan the hashtable, fetching all objects on demand. This prevents fetching old versions of the objects, for which only the metadata (which has already been fetched) is needed.
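
A sketch of both steps together, with assumed types (tombstone handling is omitted for brevity): the hashtable is rebuilt from the per-segment metadata summaries, keeping only the newest version of each id, and the object bytes are pulled from backups lazily so stale versions are never fetched.

  #include <cstdint>
  #include <functional>
  #include <optional>
  #include <unordered_map>
  #include <vector>

  // Assumed metadata summary entry shipped by a backup in Phase 1.
  struct ObjectMeta {
      uint64_t objectId;
      uint64_t version;
      uint64_t segmentId;          // where the live bytes can be fetched from
  };

  struct HashtableEntry {
      uint64_t version;
      uint64_t segmentId;
      std::optional<std::vector<char>> bytes;   // empty until fetched on demand
  };

  using Hashtable = std::unordered_map<uint64_t, HashtableEntry>;

  // Keep only the newest version seen for each object id.
  Hashtable rebuildHashtable(const std::vector<ObjectMeta>& allMetadata)
  {
      Hashtable table;
      for (const auto& meta : allMetadata) {
          auto it = table.find(meta.objectId);
          if (it == table.end() || it->second.version < meta.version)
              table[meta.objectId] = {meta.version, meta.segmentId, std::nullopt};
      }
      return table;
  }

  // Fetch the bytes the first time the object is read (Phase 2 does the same
  // in the background for everything that hasn't been touched yet).
  const std::vector<char>&
  readObject(Hashtable& table, uint64_t objectId,
             const std::function<std::vector<char>(uint64_t segId,
                                                   uint64_t id)>& fetchFromBackup)
  {
      HashtableEntry& e = table.at(objectId);
      if (!e.bytes)
          e.bytes = fetchFromBackup(e.segmentId, objectId);
      return *e.bytes;
  }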

Cleaning information - tombstone liveness information to prevent resurrecting stale object versions on loss of hashtable
Must rebuild the delete pointers or refcounts; they maintain the cleaning invariants needed to recover the next instance if this instance of the master crashes.
How do we decide to do this if the data in the segments is fetched on demand?
Maybe assert we can't clean until we've refetched all the data.
Seems reasonable anyway - just causes an increase in active disk space.
Also, we might be able to do this anyway without refetching the objects. But it seems like we'd need some kind of shadow layout in memory to maintain the pointer links. Perhaps refcounting doesn't have this problem.

Indexes
Since indexes have no guaranteed relationship with the data on the master we may be better off rebuilding them elsewhere to mitigate the incast problem.

Walkthrough

Total crash; no battery backup

Backups will write a list of the next segment frames to use to their superblock, so they will be able to recover the list of active segment frames quickly (~60 ms). The superblock will also contain lists of which segments are in which segment frames for each master. The system will have to read the additional blocks in the "next segment frames to use" list to see if any of the master segment/segment-frame lists have been extended. The amount of time needed for this can be bounded arbitrarily, at the cost of having to fix "next write" locations further in advance.
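
A conceptual sketch of what such a superblock might contain; the field names, counts, and checksum are assumptions for illustration, not an actual on-disk format.

  #include <cstdint>

  // Assumed bounds: the second constant limits how many extra frames recovery
  // must re-read to find segments written after the superblock was flushed.
  constexpr uint32_t MAX_FRAMES_PER_SUPERBLOCK = 1024;
  constexpr uint32_t MAX_PRECOMMITTED_FRAMES = 64;

  struct SegmentFrameRecord {
      uint64_t masterId;     // which master the segment belongs to
      uint64_t segmentId;    // which of that master's segments it holds
      uint32_t frameIndex;   // on-disk segment frame number
  };

  struct Superblock {
      uint64_t version;      // newer valid copy wins between the two superblocks
      uint32_t checksum;     // detects a partially written superblock
      uint32_t frameCount;
      // Segment -> frame map for every master this backup serves.
      SegmentFrameRecord frames[MAX_FRAMES_PER_SUPERBLOCK];
      // Frames pre-committed for future writes; recovery re-reads just these
      // frames to learn about segments written after the superblock was flushed.
      uint32_t nextFramesToUse[MAX_PRECOMMITTED_FRAMES];
      uint32_t nextFrameCount;
  };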

Note that the last written segment may be partially corrupted. The backup is expected to treat such a segment as garbage.

Implication: We'll want at least two superblocks in case one ends up being partially written.

Thought: It would be nice to say exactly what it is we are guaranteeing in the case of a power failure without battery backups. Is there any way for me to reason about the state of my application after such a failure?

Total crash; battery backup

No difference here, except we shouldn't have any partially written segments.

Single server (Master + Backup) crash

Phase 1

Master Data

Index Data

Backup Data

m crashes

b notices failure of m

b notifies c

c chooses m', giving it the same name as m

c transfers the set of all machine names to m'

c notifies m'' that it is now responsible for the indexes formerly on m

c broadcasts that m crashed and the masters are responsible for rewriting their own backups to new locations

m' contacts each b in machines, j of them in parallel, asking for a list of segments that b has backed up for m', along with the object ids and versions on each

m'' rebuilds the indexes by fetching the metadata for all associated indexes; some of these calls may go to the new m', so they may block until it begins servicing requests

m' rebuilds a Hashtable by forward replay, simultaneously recovering tombstone pointers/refcounts

m' fetches the last object written by m, and restores its Master Vector Clock by adding one to the version of that object

m' begins servicing requests
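
A sketch of the coordinator's (c's) part of the sequence above. Every name here is a placeholder and the RPC helpers are reduced to empty stubs; this is only meant to show the ordering of c's actions, not a real interface.

  #include <string>
  #include <vector>

  // Empty stand-ins for the coordinator's RPCs.
  static void assignName(const std::string& /*host*/, const std::string& /*name*/) {}
  static void sendMachineList(const std::string& /*host*/,
                              const std::vector<std::string>& /*machines*/) {}
  static void assignIndexes(const std::string& /*host*/,
                            const std::string& /*failedMaster*/) {}
  static void broadcastRereplicate(const std::string& /*failedMaster*/) {}

  // c's reaction once b reports that m has crashed.
  void handleMasterFailure(const std::vector<std::string>& machines,
                           const std::string& m,             // crashed master
                           const std::string& mPrime,        // warm spare, becomes m'
                           const std::string& mDoublePrime)  // takes over m's indexes
  {
      assignName(mPrime, m);              // m' inherits m's name
      sendMachineList(mPrime, machines);  // m' needs the machine list for its broadcast
      assignIndexes(mDoublePrime, m);     // m'' rebuilds the indexes formerly on m
      broadcastRereplicate(m);            // surviving masters rewrite backups they had on m
      // m' then runs the rest of Phase 1 on its own: query the backups, rebuild
      // its hashtable and version state, and begin servicing requests.
  }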

Phase 2

Master Data

m' fetches live objects in the background, simultaneously generating the tombstone cleaning metadata

m' moves into normal operation

Back of the Envelope

Times of each of the recovery phases

Phase 1: Recover master metadata: ~5 s

Phase 2: Recover master data: up to ~10 min

Phase 1': Recover backup data: negligible

Phase 1'': Recover indexes: extremely variable
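
One hedged way to arrive at figures of this order (the memory size and link speed here are illustrative assumptions, not numbers from this design): Phase 1 only moves per-segment metadata summaries, so it is dominated by the broadcast and a handful of reads; Phase 2 must re-fetch all live data, and pulling on the order of 64 GB into a single new master over a ~1 Gbit/s link takes roughly 64 GB x 8 / 1 Gbit/s ≈ 512 s, i.e. on the order of 10 minutes.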

Some notes from the Recovery Meeting

Master Broadcast in Phase 1

Split the "which segments do servers have?" query from the "what ids are on segments?" query. This helps with incast and lets the new master control incoming requests.

Fast Index Recovery

Goal: Index recovery needs to complete (nearly) as fast as Phase 1

Newer material is at IndexRecovery.

Problem: Incast + processing time is substantially longer than Phase 1.

  1. Client-side
  2. Index sharding
  3. Index persistence
    1. Copy-on-write
    2. Mutable
      1. Index entry log: Log each index entry remove and index entry add operation. Use cleaning to bound recovery time.
        If we make the simplifying assumption that a (search key, oid) uniquely identifies an index entry within an index (e.g., "foo" -> 1423 next to "foo" -> 1423 is not allowed), it is safe to clean an index entry operation from the log when that index entry is not present in the current state of the index. Unfortunately, there's a snag: in add "foo" -> 1423, remove "foo" -> 1423, add "foo" -> 1423, the cleaner will not know that it is safe to purge the first "foo" -> 1423 because it still appears live.
        • Maybe that's rare enough that we just don't care. We could clean this odd case on recovery/server moves.
        • Maybe we have an occasional expensive cleaning script that checks for this case.
        • If it's really not acceptable, store a segment number with every single index entry in the HT/tree. It's mostly a waste of space and more trash in the cache lines in the usual path, though. (See the sketch after this list.)
      2. Batch log: Log several index entries at a time. Use cleaning to bound recovery time.
        If we log a batch of index entries at a time, the cost of keeping a segment number in the HT/tree is amortized by storing it once in metadata for the whole batch. This allows the cleaner to know whether a log entry is live.
      3. Snapshots: Log each index entry remove and index entry add operation. Use snapshots to bound recovery time.
        Periodically dump the index contents out to disk and log updates with no cleaning.
        • This will suck if we don't spread out the cost over time.
        • If it just takes too long to replay even a fully cleaned log, this option should be faster.
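
The sketch referenced under the index-entry-log option above: a direct transcription of the stated cleaning rule, using the "store a segment number with every index entry" refinement, with all types assumed for illustration.

  #include <cstdint>
  #include <map>
  #include <string>
  #include <utility>

  // Assumed index shape: (search key, oid) -> segment id of the ADD record
  // that currently backs the entry.
  using IndexKey = std::pair<std::string, uint64_t>;   // (search key, oid)
  using Index    = std::map<IndexKey, uint64_t>;

  struct IndexLogEntry {
      enum Op { ADD, REMOVE } op;
      std::string searchKey;
      uint64_t oid;
      uint64_t segmentId;          // segment this log record lives in
  };

  // A log record may be cleaned unless it is the ADD that currently backs the
  // entry in the index. Without the stored segment id, the first ADD in
  // "add, remove, add" of the same (key, oid) would wrongly look live.
  bool canClean(const Index& index, const IndexLogEntry& entry)
  {
      auto it = index.find({entry.searchKey, entry.oid});
      if (it == index.end())
          return true;             // entry absent from the current index state
      return entry.op == IndexLogEntry::REMOVE || it->second != entry.segmentId;
  }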

Older, broken idea

Idea: Just as we harness the speed of many spindles, we can reuse those same backup hosts to quickly rebuild miniature index shards.

  1. After all backups for a master copy their segments into RAM
  2. "Sort" (key, id) pairs for each of the objects they have in those segments
  3. Bucket the pairs
  4. Ship the pairs to a single host corresponding to that bucket
  5. The host for a particular bucket services requests for that key range

At this point the master can begin servicing requests for that index by doing distributed queries against these bucket servers as follows:

  1. For a key/range determine the buckets involved
  2. Send lookups to appropriate bucket servers
  3. Get back a list of ids
  4. Fetch all the objects in the normal way

This could mean two round trips between the backups and the master for a single indexed lookup during this phase.
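
A sketch of that lookup path, with assumed types and the RPCs passed in as callables (the backups and the new master must agree on the bucket mapping used by bucketsFor).

  #include <cstdint>
  #include <functional>
  #include <string>
  #include <vector>

  struct KeyRange { std::string lo, hi; };   // assumed key-range type

  std::vector<std::vector<char>>
  indexedLookup(const KeyRange& range,
                // Map a key range onto the bucket servers that cover it.
                const std::function<std::vector<std::string>(const KeyRange&)>& bucketsFor,
                // First round trip: ask one bucket server for matching oids.
                const std::function<std::vector<uint64_t>(const std::string&,
                                                          const KeyRange&)>& lookupOnBucket,
                // Second round trip: fetch each object via the normal read path.
                const std::function<std::vector<char>(uint64_t oid)>& fetchObject)
  {
      std::vector<std::vector<char>> results;
      for (const std::string& bucket : bucketsFor(range))
          for (uint64_t oid : lookupOnBucket(bucket, range))
              results.push_back(fetchObject(oid));
      return results;
  }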

Problem #1: How are the key ranges for the bucket determined?

Note, of course, all the backups and the new master need to agree on this mapping.

Problem #2: The data restored on a master crash doesn't necessarily correspond to the data indexed by the master. That is, data beyond what is already live on the backups may be needed.

Possibilities:

Back-of-the-Envelope for Index Recovery

Virtually 0

  1. 8192 machines read 8 MB - this is required by Phase 1 anyway so it's free
  2. It takes virtually no time to extract the (key, id) pairs from an 8 MB segment
  3. If indexes take up 50% of that 8 MB, it takes about 3 ms to shift it to another machine
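
To see where the ~3 ms comes from (the 10 Gbit/s link speed is an assumption, not stated above): 50% of an 8 MB segment is 4 MB ≈ 32 Mbit, and 32 Mbit / 10 Gbit/s ≈ 3.2 ms per machine.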

TODO: TTR for complete recovery using some kind of hierarchical merge

One thing to notice - recovery time using this method is so fast that it replaces any optimizations we'd thought of before (e.g. logging indexes, hoping to rebuild indexes based on the reconstructed hash table, fetching keys during restore, etc.)

Older thoughts from earlier meetings

Frequently it will be the case that a table was on a single machine and so was its index. If it crashes the new index server will:

Sometimes the index and/or the table will be across multiple machines. In that case:

Note that since we are limited by network bandwidth here we might (likely?) prefer to rebuild each index on a separate server. The naive way of scanning the hashtable on each 'other' server once per index being rebuilt might be fine since that should proceed much more quickly than sending the actual keys.

What ideas do we have to reduce the amount of time requests will stall while the system is recovering on indexed requests?