The problem

In order to maintain strong consistency of data, RAMCloud must ensure that dead servers are really dead. Here's the potential problem: the coordinator may decide that a master has crashed (for example, because it cannot communicate with the master) and initiate recovery, even though the master is actually still alive, perhaps because of a temporary network outage. Such a server is called a zombie. Once recovery completes, responsibility for the zombie's data has been transferred to recovery masters, but the zombie doesn't know this; if it continues servicing requests, reads can return stale data and writes can be silently lost, violating consistency.

In order to prevent this problem, RAMCloud must ensure that a zombie will not service any requests once its data has been moved to recovery masters.

Centralized leases

One approach is to use traditional leases: each master must hold a lease, issued by the coordinator, in order to service requests. The master renews its lease periodically; if it cannot renew the lease before it expires, it must stop servicing requests until it can reestablish contact with the coordinator. The coordinator, for its part, does not begin recovery of a master until that master's lease has expired, so the old master is guaranteed to have stopped serving before its data moves to recovery masters.

The disadvantage of this approach is that it generates significant lease renewal traffic for the coordinator. Assuming that masters renew their leases every 500ms (renewing well before expiration leaves some slack in case of problems reaching the coordinator), a large RAMCloud system with 100k masters would generate 200k lease renewals per second, which would consume at least 20% of the coordinator's capacity (assuming the coordinator can handle 1M requests/second).
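
As a quick sanity check on this estimate, the arithmetic is shown below; the cluster size, renewal interval, and coordinator throughput are the assumptions stated above, not measured values.

    # Back-of-the-envelope estimate of lease-renewal load on the coordinator.
    NUM_MASTERS = 100_000              # masters in a large cluster (assumed above)
    RENEWAL_INTERVAL = 0.5             # seconds between renewals for each master
    COORDINATOR_CAPACITY = 1_000_000   # requests/second the coordinator can handle

    renewals_per_second = NUM_MASTERS / RENEWAL_INTERVAL            # 200,000/s
    coordinator_load = renewals_per_second / COORDINATOR_CAPACITY   # 0.20

    print(f"{renewals_per_second:,.0f} renewals/s = "
          f"{coordinator_load:.0%} of coordinator capacity")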

Distributed leases

An alternative approach is to decentralize the lease mechanism and combine it with the failure detection mechanism. RAMCloud already has a mechanism for detecting server failures that works like this: at regular intervals, each server picks another server in the cluster at random and sends it a PING request. If the PING times out, the sender reports the problem to the coordinator, which decides whether to declare the unresponsive server dead and initiate recovery.
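
A minimal sketch of this kind of randomized ping loop is shown below, in Python for brevity; the cluster and coordinator objects and their methods (servers, ping, report_possible_failure) are illustrative stand-ins, not actual RAMCloud interfaces.

    # Illustrative sketch of the randomized failure detector (not RAMCloud code).
    import random
    import time

    def failure_detector(my_id, cluster, coordinator, ping_interval=0.1):
        # Each server repeatedly PINGs one randomly chosen peer; a PING that
        # times out is reported to the coordinator, which decides whether to
        # declare the unresponsive server dead and start recovery.
        while True:
            peers = [s for s in cluster.servers() if s != my_id]
            target = random.choice(peers)
            if cluster.ping(target, timeout=0.05) is None:   # PING timed out
                coordinator.report_possible_failure(target)
            time.sleep(ping_interval)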

This mechanism can be extended to implement something akin to distributed leases, which will ensure that condemned masters stop servicing requests before recovery completes: when the coordinator condemns a server and starts recovery, it notifies all of the other servers in the cluster. If a server's PING times out, or if the PING response indicates either that the sender has been condemned or that the target is itself in limbo, then the sender places itself in limbo: it stops servicing all requests and remains in that state until it can contact the coordinator and confirm that it is still a legitimate member of the cluster.
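
The rules above can be sketched in the same illustrative Python style; the server, cluster, and coordinator interfaces are again placeholders, and the details are inferred from the behavior described in this section rather than taken from RAMCloud code.

    # Illustrative sketch of one PING round with the "limbo" rules (not RAMCloud code).
    import random

    def ping_round(server, cluster, coordinator):
        target = random.choice([s for s in cluster.servers() if s != server.id])
        response = cluster.ping(target, timeout=0.05)

        if response is None:
            # PING timed out: we may be cut off from the cluster, so stop
            # servicing requests until the coordinator says otherwise.
            server.enter_limbo()
        elif response.sender_condemned or response.target_in_limbo:
            # The target has been told by the coordinator that we are being
            # recovered, or the target is itself in limbo; either way we must
            # stop servicing requests immediately.
            server.enter_limbo()

        if server.in_limbo and coordinator.confirm_membership(server.id):
            # A limbo server resumes service only after the coordinator
            # confirms it is still a legitimate member of the cluster.
            server.leave_limbo()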

This mechanism should be sufficient to stop zombies very quickly:

First, consider the case of a single zombie, with all other servers operating normally. The zombie will cease operation at its first PING after the coordinator has initiated recovery (and notified all the other servers): either the PING response will put the zombie into limbo, or the PING will time out, in which case the zombie will put itself into limbo.

Second, consider a pathological case where the cluster is divided in half by a partition. The "main" partition, containing the coordinator, will condemn all of the servers in the disconnected partition. During the first PING round in the disconnected partition, roughly half of the servers will attempt to contact servers in the main partition; these PINGs will all time out, so all of those servers will place themselves in limbo (and they will stay that way, since they won't be able to contact the coordinator). During the next PING round in the disconnected partition, 75% of the PINGs will be directed either at servers in the main partition or at limbo servers in the disconnected partition, so most of the remaining zombies will enter limbo. With each PING round, fewer servers in the disconnected partition remain outside the limbo state, so the system will rapidly converge to a point where all servers in the disconnected partition are in limbo.

For example, suppose a cluster has 1000 servers and 50% become partitioned. After the first round of PINGs only 250 zombies will remain; after rounds 2, 3, and 4 the expected number of zombies will be roughly 63, 4, and 0.015. If the cluster has 100,000 servers the expected number of zombies after successive rounds will be roughly 25000, 6250, 390, 1.5, and 2e-05: increasing the number of servers by 100x costs only about one extra round of PINGs to converge.
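
The expected-case figures above follow from a simple recurrence: if z zombies remain at the start of a round in a cluster of N servers, a zombie stays a zombie only if its randomly chosen PING target is another remaining zombie, so roughly z*z/N zombies remain after the round. The few lines of Python below (a model of the expected case, not a simulation of RAMCloud) reproduce the numbers quoted above.

    # Expected number of zombies remaining after each PING round, assuming every
    # zombie PINGs a uniformly random server and stays a zombie only if that
    # target is another zombie that has not yet entered limbo.
    def expected_zombies(num_servers, initial_zombies, rounds):
        z = float(initial_zombies)
        history = []
        for _ in range(rounds):
            z = z * z / num_servers        # P(target still a zombie) ~= z / N
            history.append(z)
        return history

    print(expected_zombies(1_000, 500, 4))       # -> [250.0, 62.5, ~3.9, ~0.015]
    print(expected_zombies(100_000, 50_000, 5))  # -> [25000, 6250, ~390, ~1.5, ~2e-05]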

The distributed approach has the advantage of creating no load whatsoever on the coordinator during normal operation. Because it piggybacks on the existing failure detection mechanism, no extra messages are required for lease renewal except when there are node failures or partitions.

Potential issues