Distributed Leases

The problem

In order to maintain strong consistency of data, RAMCloud must ensure that dead servers are really dead. Here's the potential problem:

The coordinator decides that a server is dead and initiates recovery. Let's refer to such a master as condemned.
The condemned server isn't actually dead (perhaps it was temporarily partitioned, or was unable to respond to "ping" requests because of a temporary heavy load, etc.). Let's use the term zombie to refer to such a master; not all condemned masters are zombies but, by definition, all zombies are condemned.
After recovery, a recovery master begins accepting requests for some of the zombie's data, including writes.
Meanwhile, the zombie continues to serve requests, causing clients to see stale data.

In order to prevent this problem RAMCloud must ensure that a zombie will not service any requests once its data has been moved to recovery masters.

Centralized leases

One approach is to use traditional leases:

Each master must acquire a lease from the coordinator with a relatively short expiration time (e.g. 750ms).
If a master fails to renew its lease before it expires, the master must stop serving any requests until it has contacted the coordinator to renew the lease.
As long as recovery takes more than 750ms, the coordinator can be certain that the condemned master is no longer servicing requests once recovery finishes.

The disadvantage of this approach is that it generates significant lease renewal traffic for the coordinator. Assuming that masters renew their leases every 500ms (this leaves a bit of extra time in case of problems reaching the coordinator), then in a large RAMCloud system with 100k masters there would be 200k lease renewals per second, which would consume at least 20% of the coordinator (assuming the coordinator can handle 1M requests/second).

Distributed leases

An alternative approach is to decentralize the lease mechanism and combine it with the failure detection mechanism. RAMCloud already has a mechanism for detecting server failures that works like this:

Every 10ms every server picks another server in the cluster at random (each server has a complete list of all the other servers) and sends it a PING request.
If it doesn't receive a response then it notifies the cluster manager, which begins the recovery process for the nonresponsive server.

This mechanism can be extended to implement something akin to distributed leases, which will ensure that condemned masters stop servicing requests before recovery completes:

At the beginning of recovery the coordinator contacts every server in the cluster. This is already necessary for the proper operation of recovery; it means that once recovery has begun every server in the cluster knows that a master has been condemned.
If the condemned master is not really dead and then it will continue to send PING requests to other servers every 10ms.
When another server receives a PING from a condemned server it responds with a special "Dude, you're a zombie!" response. If a server receives such a response it places itself in a limbo state, which means it will not service any requests from clients until it has contacted the coordinator. This is analogous to expiration of a lease.
When a limbo server contacts the coordinator, the coordinator can respond in one of two ways:
- It can tell the condemned master that it no longer owns any tablets.
- It can pardon the condemned master, tell it to continue normal operation, and abort any recovery that was in progress for the master. This is analogous to renewing the master's lease. In this case the master removes itself from limbo state and continues business as usual.
If a server fails to receive a response from a PING request, it must contact the coordinator as described above so that the coordinator can decide whether the PING target should be recovered. But, in addition, whenever a PING times out the PING initiator must also place itself into limbo state until it has contacted the coordinator. This is necessary to handle partitions where a zombie loses communication with the rest of the cluster. In most cases, the server will respond by saying "continue normally" as described above.
Finally, if a server receives a PING request when it is in limbo state it indicates that fact in its response, and the PING initiator must put itself in limbo and contact the coordinator. This ensures that, in the event of a partition, information about the partition spreads rapidly throughout a disconnected group of servers (see below for an example).

This mechanism should be sufficient to stop zombies very quickly:

First, consider the case of a single zombie, with all other servers operating normally. The zombie will cease operation after its first PING after the coordinator has initiated recovery (and notified all the other servers): either the PING response will put the zombie into limbo, or the PING will timeout, in which case the zombie will put itself into limbo.

Second, consider a pathological case where the cluster is divided in half by a partition. The "main" partition, containing the coordinator, will condemn all of the servers in the disconnected partition. During the first PING round in the disconnected partition, roughly half of the servers will attempt to contact servers in the main partition; these PINGs will all time out, so all of those servers will place themselves in limbo (and they will stay that way, since they won't be able to contact the coordinator). During the next PING round in the disconnected partition, 75% of the PINGs will be directed either at servers in the main partition or limbo servers in the disconnected partition, so most of the remaining zombies will enter limbo. With each PING round there are fewer servers in the disconnected partition that have not yet entered limbo state, so the system will rapidly converge to a point where all servers in the disconnected partition are in limbo.

For example, suppose a cluster has 1000 servers and 50% become partitioned. After the first round of PINGs only 250 zombies will remain; after rounds 2, 3, and 4 the expected number of zombies will be 63, 4, and 0.16. If the cluster has 100,000 servers the expected number of zombies after successive rounds will be (25000, 6250, 390, 1.5, 2.0e-05): increasing the number of servers by 100x only results in 1 extra round of PINGs to converge.

The distributed approach has the advantage of creating no load whatsoever on the coordinator during normal operation, and it piggy-backs on the existing mechanism for failure detection so no extra messages are required for lease renewal, except when there are node failures or partitions.

Potential issues

Could the viral spread of limbo state paralyze the cluster? For example, suppose a server goes into limbo, but before it gets a "continue normally" response from the coordinator it receives a PING from another server. Then that server will also go into limbo state; what if each server infects another server before it gets a response back from the coordinator? In the worst case it might be impossible to get the cluster out of limbo state. Assuming that the time to contact the coordinator is significantly less than the PING interval, limbo spread should not be a serious problem. However, an alternative would be to delay limbo infection for a small period of time to give the server a chance to contact the coordinator:
- If a server receives a PING response from another limbo server, it doesn't immediately put itself in limbo.
- It does immediately attempt to contact the coordinator, though. In the normal case where it gets a quick "continue normally" response back from the coordinator it never enters limbo state.
- If it does not get a quick response from the coordinator, then it places itself in limbo.
- This approach would increase the amount of time it takes a disconnected partition to converge on limbo state, but it would reduce the likelihood of limbo paralysis.

Another potential issue is writes received by a zombie after recovery starts, but before the zombie goes into limbo state (this is also an issue with the "traditional lease" approach). For example, suppose recovery starts but it takes a while for zombie to enter limbo state. It might receive write requests, which it accepts. In the worst case, these write requests could arrive after recovery masters have read the backup data for the segment, in which case the writes will be lost. One solution is for backups to refuse to accept data for a condemned master, returning a "Dude, you're a zombie" response in the same way as for PINGs. This would guarantee that, once recovery has progressed past the initial phase where the coordinator contacts every server, no more writes will be processed by the condemned master. But, what if the zombie and some or all of the backups are in a separate partition? If the coordinator is able to contact any of the backups during recovery, then that will prevent the zombie from writing new data. If all of the backups are in the same partition as the zombie then the zombie will continue to accept write requests unaware of the partition. However, in this case the coordinator will not be able to recover since none of the replicas will be available, so recovery will be deferred until at least one of the replicas becomes available, at which point writes will no longer be possible.

OK, here's another scenario relating to zombies and writes: is it possible that in a partition a zombie could select backups for a new segment that are entirely within the disconnected partition? If this happens, is it possible that the coordinator might be unaware of the existence of that additional segment (it might think that the log ended with the previous segment) and complete recovery "successfully", unaware of the additional segment? If this situation were to occur, it would result in lost writes. I think our approach to managing the log head will prevent this situation:
- A master cannot start writing segment N+1 until it has closed segment N.
- If it is condemned before it closes segment N then it will detect its condemned state during the close.
- If it closes segment N before condemnation is complete, then the coordinator will not treat segment N as the head of the log, and it will not recover until it can find segment N+1. Once it finds segment N+1 the zombie will no longer be able to write to that segment.