Contents:

Master Recovery

Goal: No application can (overtly) distinguish that a failure has occurred using the specified client interface while less than f uncorrected failures are ongoing - with high probability.
Goal: Client can expect latency less than l - with high probability; they can further expect latency bounded by l' while the system has less than f ongoing, uncorrected failures.

Implication: On master failure low-latency availability must be restored after a failure.
Implication: To maintain the ability to restore low-latency access with high probability we must restore durability.

Important Note: We've effectively "decided" that we'll need warm spare machines since we have eliminated shards. In other words, our minimum unit of recovery in the system is currently an entire machine.

Punted for the moment

Partitions
Nuclear attacks

Focus is single master failure (during any phase of operation) or single backup failure (during any phase of operation).

Design considerations

Client Interface Invariants

Invariant: A read of a key following a write always sees the value associated with that write (barring a power failure on a non-battery backed up host).

Invariant: A write to a key associates a version number with the value that is greater than that of any previous value written to that key location.

Invariant: An application always sees bounded latency.

Invariant: Writes are "k-durable" on return of success.

Pluggable storage backends

The user, on a table-by-table basis, should be able to select between a variety of durability strategies. These should include:

k Disk (default, probably with k = 2)
k Flash
k RAM
Null (no durability)

and a lot of other more interesting ones are possible as well.

PRAM
2 RAM, 1 Disk
1 Flash, 1 Disk

etc.

Implication: Segment backup placement needs complete flexibility, meaning a "backup server schedule" likely isn't going to work for us.

Master Soft State

This is essentially an enumeration of all the things we must convince ourselves we can recover (hopefully quickly).

Segment Backup Locations - who has segments backed up for this master?
Restoration: Given to new master from the cluster manager or from one of the former master's backups. Can be a list or a function. The current working assumption that this isn't strictly needed and contacting each of the servers in the cluster is practical. Some cleverness will be needed in the case of restarting an entire RAMCloud.

Segment List - which segments are live for a given master
Restoration: Have to send a query to all outstanding backups. This is probably okay because we can piggyback this with the fetch of the metadata summaries. Additionally, since we are doing a broadcast to find segment backup locations we can do this with the same query.

SafeVersion (Master vector clock) - the version number for the next object creation on this server
Restoration: Reading special object 'objectSafeVersion' appended at the beginning of log segment to restore safeVersion when the segment was created. Then at the log replay, safeVersion is updated to maintain safeVersion is bigger than the version numbers of any replayed tomb stones in the restored open segment. Since log cleaner deletes tomb stones in closed segments, 'objectSafeVersion' retains the safeVersion when the head segment is opened.

Id -> Ptr Hashtable - the location in RAM for each object id
Restoration: After re-reading segment metadata from backup scan all objects, populating the hashtable.

Objects - the actual data for live objects
Restoration: Scan the hashtable, fetching all objects on demand. This prevents fetching old versions of the objects, for which only the metadata (which has already been fetched) is needed.

Cleaning information - tombstone liveness information to prevent resurrecting stale object versions on loss of hashtable
Must rebuild the delete pointers or refcounts; they maintain the cleaning invariants needed in case this instance of the master crashes to recover the next instance.
How do we decide to do this if the data in the segments is fetched on demand?
Maybe assert we can't clean until we've refetched all the data.
Seems reasonable anyway - just causes an increase in active disk space.
Also, we might be able to do this anyway without refetching the objects. But it seems like we'd need some kind of shadow layout in memory to maintain the pointer links. Perhaps refcounting doesn't have this problem.

Indexes
Since indexes have no guaranteed relationship with the data on the master we may be better off rebuilding them elsewhere to mitigate the incast problem.

Master will have data that doesn't appear in its indexes.
- Implies we must refetch all the index data from those master's to rebuild.
Other master's will have data in which this master's data appears.

Walkthrough

Total crash; no battery backup

Backups will a write list of next segment frames to use to their superblock, on they will be able to recover the list of active segment frames quickly (~60 ms). The super block will also contain lists of which segments are in which segment frames for each master. The system will have to read the additional blocks in the "next segment frames to use" list to see if any of the master segment/segment frame lists have been extended. The amount of time needed for this can be bounded arbitrarily, at the cost of having to fix "next write" locations further in advance.

Notice the last written segment may be partially corrupted. The backup is expected to consider such a segment as garbage.

Implication: We'll want at least two superblocks in case one ends up being partially written.

Thought: It would be nice to say exactly what it is we are guaranteeing in the case of a power failure without battery backups. Is there any way for me to reason about the state of my application after such a failure?

Total crash; battery backup

No difference here, except we shouldn't have any partially written segments.

Single server (Master + Backup) crash

Phase 1
Master Data	Index Data	Backup Data
m crashes
b notices failure of m
b notifies c
c chooses m' giving it the same name a m
c transfers the set of all machine names to m	c notifies m'' it is now responsible the indexes formerly on m	c broadcasts that m crashed and the masters are responsible for rewriting their own backups to new locations
m' contacts for b in machines, j of them in parallel, asking for a list of segments that b has backed up for m' along with the object ids and versions on each	m'' rebuilds the indexes by fetching the metadata for all associated indexes, some of these calls may go to the new m' so some calls may block until it begins servicing requests
m' rebuilds a Hashtable by forward replay, simultaneously recovering tombstone pointers/refcounts
m' fetches the last object written by m, and restores its Master Vector Clock by adding one to the version of that object
m' begins servicing requests

Phase 2
Master Data
m' fetches live objects in the background, simultaneously generating the tombstone cleaning metadata
m' moves into normal operation

Back of the Envelope

Time for an individual backup to fetch a single 8 MB chunk into RAM
- 8 ms + (8 MB / 150 MB / s) = 8 ms + 53 ms = 61 ms

Metadata per segment
- Segment number
  - 8 bytes for master id + 8 bytes for segment id
- Object ids
  - 8 bytes per object
  - For 100 b objects filling 64 GB that is 5.12 GB.
    - This is replicated by k.
    - So we can expect for k = 2, to get 10.24 GB of data @ 1 Gbps = 5.49 s.

Times of each of the recovery phases

Phase 1: Recovery master metadata: ~5 s

Phase 2: Recovery master data: up to ~10 min

Expected performance: writes full speed, read of unrestored object ~2x slowdown, full speed otherwise.

Phase 1': Recover backup data: negligible

Might be worth having a verification phase here to ensure k precisely for all backups.

Phase 1'': Recover indexes: extremely variable

Some notes from the Recovery Meeting

Master Broadcast in Phase 1

Split the "which segments do servers have?" query from the "what ids are on segments?" query. Helps with incast, lets new master control incoming requests.

Fast Index Recovery

Goal: Index recovery needs to complete (nearly) as fast as Phase 1

Reduced performance is even ok because otherwise we cannot service requests at all during Phase 2.

newer stuff at IndexRecovery

Problem: Incast + processing time is substantially longer than Phase 1.

Client-side
- How do you implement a
  1. Hashtable?
    - Pretty easy
      - Piggy back on the real hashtable
    - Knobs
      - buckets per object
      - open addressing or chaining
  2. Tree?
    - Straight up 245 B+Tree
    - Better algs for var-sized blocks
    - Probably lots of tweaks
- Optimizations from moving to master
  - Cheaper to reorganize tree
  - Fewer network round-trips/processing an RPC
    - Large constant factor less time by number of ops on the data structure
    - How many ops do we expect typical (atypical) changes to use?
  - No client-side cache
Index sharding
- Index size bounded and then sharded, likewise for table data
  - A few index shards, quick recovery because they must be small, many shards quick recovery because of sharding
- Bounds recovery roughly to a constant
- Revists the question of whether table data locality is actually a good thing, particularly if we were considering it to make indexes easier/faster
Index persistent
1. Copy-on-write
2. Mutable
  1. Index entry log: Log each index entry remove and index entry add operation. Use cleaning to bound recovery time.
    If we make the simplifying assumption that a (search key, oid) uniquely identifies an index entry within an index (e.g., "foo" -> 1423 next to "foo" -> 1423 is not allowed), it is safe to clean an index entry operation from the log when that index entry is not present in the current state of the index. Unfortunately, there's a snag: in add "foo" -> 1423, remove "foo" -> 1423, add "foo" -> 1423, the cleaner will not know that it is safe to purge the first "foo" -> 1423 because it still appears live.
    - Maybe that's rare enough that we just don't care. We could clean this odd case on recovery/server moves.
    - Maybe we have an occasional expensive cleaning script that checks for this case.
    - If it's really not acceptable, store a segment number with every single index entry in the HT/tree. It's mostly a waste of space and more trash in the cache lines in the usual path, though.
  2. Batch log: Log several index entries at a time. Use cleaning to bound recovery time.
    If we log a batch of index entries at a time, it amortizes the cost of storing a segment number in the HT/tree in some metadata for that batch. This allows the cleaner to know whether a log entry is live.
  3. Snapshots: Log each index entry remove and index entry add operation. Use snapshots bound recovery time.
    Periodically dump the index contents out to disk and log updates with no cleaning.
    - This will suck if we don't spread out the cost over time.
    - If it just takes too long to replay even a fully cleaned log, this option should be faster.

Older, broken idea

Idea: Just as we harness the speed of many spindles we can reuse those same hosts to quickly rebuild miniature index shards on those same backup hosts.

After all backups for a master copy their segments into RAM
"Sort" (key, id) pairs for each of the objects they have in those segments
Bucket the pairs
Ship the pairs to a single host corresponding to that bucket
Hosts for a particular bucket services requests to that key range

At this point the master can begin servicing requests for that index by doing distributed queries against these bucket servers as follows:

For a key/range determine the buckets involved
Send lookups to appropriate bucket servers
Get back a list of ids
- May want to return key that matched as well for range queries so that the master can aggregate the results without fetching the objects themselves.
Fetch all the objects in the normal way
- Look in log
- If a thunk, dereference it
  Is this to pre-fetch the objects under the assumption that the client will ask for them soon? -Diego

Could be two round trips between the backups and master for a single indexed lookup during this phase.

Problem #1: How are the key ranges for the bucket determined?

Fixed and equal?
Based on past index load?
Dynamically rebalanced at restore?

Note, of course, all the backups and the new master need to agree on this mapping.

Problem #2: The data restored on a master crash doesn't necessarily correspond to the data indexed by the master. Meaning, other data may be needed than just that which is live on the backups already.

Possibilities:

Get said data from the proper masters
- Potentially incast/rebuild time issues again
Have backups for masters that have data indexed with a crashed master reload their segments into RAM as well and participate in the "sort"
- Potentially unbounded amount of data may have to be read from disk on backups into backups' RAM
- Can collate results only onto backups that are truly restoring the master so the role of these other backup servers may be quick and limited

Back-of-the-Envelope for Index Recovery

Virtually 0

8192 machines read 8 MB - this is required by Phase 1 anyway so it's free

(8 MB / 150 MB/s) = 53 ms
with disk latencies ~65 ms

Takes virtually no time to extract the (key, id) pairs from an 8 MB segment
If indexes take up 50% of that 8 MB it takes about 3 ms to shift it to another machine

TODO TTR complete recovery using some kind of hierarchical merge

One thing to notice - recovery time using this method is so fast it replaces any optimizations we'd thought of before (e.g. logging indexes, hoping to rebuild indexes based on reconstructed hash table, fetching keys during restore, etc.)

Older thoughts from earlier meetings

Frequently it will be the case that a table was on a single machine and so was its index. If it crashes the new index server will:

Scan the segment list
Fetch keys from the backup servers which should have this data in RAM by now
- The backup should have a temporary hashtable for the data which is can walk to do this instead of a full scan
Rebuild the index by inserting this information

Sometimes the index and/or the table will be across multiple machines. In that case:

If some objects are local to this master proceed as above
For all 'other' servers serving part of this table
- Walk the entire hashtable for that table on that server
- Return all ids/keys for that index
Rebuild the index by inserting this information

Note that since we are limited by network bandwidth here we might (likely?) prefer to rebuild each index on a separate server. The naive way of scanning the hashtable on each 'other' server once per index being rebuilt might be fine since that should proceed much more quickly than sending the actual keys.

What ideas do we have to reduce the amount of time requests will stall while the system is recovering on indexed requests?

Storing/logging indexes in some way?
Constructing indexes that are usable in some partially restored fashion
- B-Trees in RAMCloud tables out of objects
- Distributed data structures