This page builds on contents from the SOSP 2011 paper. In particular, readers should be comfortable with sections 3.5.1 Finding Log Segment Replicas and 3.5.2 Detecting Incomplete Logs.
The SOSP 2011 paper does not mention how to detect incomplete logs during recovery in the face of backup failures. Here's an example of a situation in which RAMCloud would behave incorrectly:
- Suppose master M is writing object 100 to segment X on backups B1, B2, and B3 when backup B1 crashes.
- Master M re-replicates X to B4 and moves on, filling in more of X with objects 101, 102, etc.
- Then, a massive failure occurs, taking all servers offline.
- Backup B1 comes back online.
- The recovery/cold start for M only finds B1's replica of X, which looks like the head of the log.
- Recovery completes, with objects 101, 102, etc nowhere to be found.
The only correct behavior is to wait for B2, B3, or B4 to come back online with an up-to-date replica of segment X.
We think that to solve this problem, it is necessary to either:
- have a reliable external storage system,
- implement some form of consensus, or
- settle on only probabilistic correctness
We already plan to have a reliable external storage system that internally uses consensus for some critical data that the coordinator stores. So the best solution is one that balances the following goals:
- stores the least amount of information externally,
- updates the information stored externally least frequently,
- is simplest to implement, and
- is most obviously correct
Some of these goals are at odds with each other. For example, Ryan has a cute way to store only one integer per log externally: store the checksum of the entire log. Unfortunately, this requires frequent updating.
Here's the best solution we know of so far (2012-01-18):
- Master M detects that B1 has crashed and confirms with the coordinator that B1 is unavailable. (It still makes sure object 100 has reached B2 and B3.)
- M opens segment Y on a new set of backups, with Y's log digest including segment X.
- M closes segment X on B2 and B3.
- M re-replicates X to a new backup, B4.
- M asks the coordinator to update its minimum acceptable open segment ID to Y. This is stored reliably in external storage.
- M acknowledges the client's write for object 100 and continues accepting writes.
- Step 2 must happen before step 3: the log must always contain at least one open segment (to be the head).
- Steps 3 and 4 must happen before step 5: we want R replicas of closed segment X before declaring replicas of open segment X invalid. This guarantees we can make progress if the rest of the backups temporarily fail and the master crashes.
- Step 5 must happen before step 6: object 100 must be present in all ways the log could be recovered before its durability is confirmed to the client. This only happens once replicas of open segment X are reliably refused during subsequent recoveries.
If master recovery were to occur before step 5 is complete, these are the interesting ways the log may be recovered as. Some include object 100 and some don't, but they are all correct (since the write to object 100 hasn't been acknowledged to the client). For notation, a segment ending in a square bracket is closed, and a segment ending in a parenthesis is open.
[Segment A] ... [Segment W] [Segment X without object 100)
[Segment A] ... [Segment W] [Segment X without object 100) [Segment Y with no data)
[Segment A] ... [Segment W] [Segment X with object 100)
[Segment A] ... [Segment W] [Segment X with object 100) [Segment Y with no data)
[Segment A] ... [Segment W] [Segment X with object 100] [Segment Y with no data)
After step 5, there is only one way the log will be recovered (which will include object 100).
[Segment A] ... [Segment W] [Segment X with object 100] [Segment Y)