
This mechanism should be sufficient to stop condemned master zombies very quickly:

First, consider the case of a single zombie. The zombie will cease operation after its first PING following the coordinator's initiation of recovery (and notification of all the other servers): either the PING response will put the zombie into limbo, or the PING will time out, in which case the zombie will put itself into limbo.
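The single-zombie behavior above can be sketched as follows. This is an illustrative sketch only; the class, method, and message names (`Server`, `handle_ping`, `"condemned"`) are hypothetical, not taken from any real implementation, and it assumes that a limbo server also reports condemnation to any zombie that pings it:

```python
class Server:
    def __init__(self):
        self.in_limbo = False          # limbo servers stop serving requests
        self.condemned_peers = set()   # ids the coordinator told us to condemn

    def handle_ping(self, sender_id):
        """Reply to a PING; report condemnation if we know the sender is
        condemned, or if we are ourselves in limbo (an assumption: pinging
        a limbo server also drives the pinger into limbo)."""
        if sender_id in self.condemned_peers or self.in_limbo:
            return "condemned"
        return "ok"

def ping_round(zombie_id, zombie, target, reachable):
    """One failure-detection PING from a condemned-but-unaware master."""
    if not reachable:
        # The PING timed out: the zombie cannot distinguish a partition
        # from its own condemnation, so it conservatively enters limbo.
        zombie.in_limbo = True
        return
    if target.handle_ping(zombie_id) == "condemned":
        zombie.in_limbo = True
```

Either path, a "condemned" response or a timeout, takes the zombie out of service within one PING round, which is the property the argument below relies on.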

Second, consider a pathological case where the cluster is divided in half by a partition. The "main" partition, containing the coordinator, will condemn all of the servers in the disconnected partition. During the first PING round in the disconnected partition, roughly half of the servers will attempt to contact servers in the main partition; these PINGs will all time out, so all of those servers will place themselves in limbo (and they will stay that way, since they won't be able to contact the coordinator). During the next PING round in the disconnected partition, 75% of the PINGs will be directed either at servers in the main partition or limbo servers in the disconnected partition, so most of the remaining zombies will enter limbo. With each PING round there are fewer servers in the disconnected partition that have not yet entered limbo state, so the system will rapidly converge to a point where all servers in the disconnected partition are in limbo.

For example, suppose a cluster has 1000 servers and 50% become partitioned. After the first round of PINGs only 250 zombies will remain; after rounds 2, 3, and 4 the expected number of zombies will be 63, 4, and 0.015. If the cluster has 100,000 servers the expected number of zombies after successive rounds will be (25000, 6250, 390, 1.5, 2.3e-05): increasing the number of servers by 100x results in only 1 extra round of PINGs to converge.
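These figures follow from a simple recurrence: a zombie survives a round only if its uniformly random PING target is another not-yet-limbo zombie, which happens with probability z/N, so the expected count satisfies z' = z²/N. A short sketch (the function name is illustrative) that reproduces the numbers:

```python
def expected_zombies(n_servers, zombies, rounds):
    """Expected zombie count after each PING round.

    Each zombie picks a uniformly random PING target. Pings to the main
    partition time out and pings to limbo servers reveal the condemnation,
    so a zombie survives a round only with probability z / n_servers.
    """
    counts = []
    z = float(zombies)
    for _ in range(rounds):
        z = z * (z / n_servers)
        counts.append(z)
    return counts

print(expected_zombies(1000, 500, 4))
print(expected_zombies(100_000, 50_000, 5))
```

Because the recurrence squares z each round (relative to N), convergence is doubly exponential, which is why a 100x larger cluster needs only about one extra round.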

The distributed approach has the advantage of creating no load on the coordinator during normal operation, and it piggybacks on the existing failure-detection mechanism, so no extra messages are required for lease renewal except when there are node failures or partitions.