Page Comparison

...

The distributed approach has the advantage of creating no load whatsoever on the coordinator during normal operation, and it piggy-backs on the existing mechanism for failure detection so no extra messages are required for lease renewal, except when there are node failures or partitions.

Potential issues

Could the viral spread of limbo state paralyze the cluster? For example, suppose a server goes into limbo, but before it gets a "continue normally" response from the coordinator it receives a PING from another server. Then that server will also go into limbo state; what if each server infects another server before it gets a response back from the coordinator? In the worst case it might be impossible to get the cluster out of limbo state. Assuming that the time to contact the coordinator is significantly less than the PING interval, limbo spread should not be a serious problem. However, an alternative would be to delay limbo infection for a small period of time to give the server a chance to contact the coordinator:
- If a server receives a PING response from another limbo server, it doesn't immediately put itself in limbo.
- It does immediately attempt to contact the coordinator, though. In the normal case where it gets a quick "continue normally" response back from the coordinator it never enters limbo state.
- If it does not get a quick response from the coordinator, then it places itself in limbo.
  This approach would increase the amount of time it takes a disconnected partition to converge on limbo state, but it would reduce the likelihood of limbo paralysis.

Another potential issue is writes received by a zombie after recovery starts, but before the zombie goes into limbo state (this is also a problem with the "traditional lease" approach). For example, suppose recovery starts but it takes a while for zombie to enter limbo state. It might receive write requests, which it accepts. In the worst case, these write requests could arrive after recovery masters have read the backup data for the segment, in which case the writes will be lost. One solution is for backups to refuse to accept data for a condemned master, returning a "Dude, you're a zombie" response in the same way as for PINGs. This would guarantee that, once recovery has progressed past the initial phase with the coordinator contacts every server, no more writes will be processed by the condemned master. But, what if the zombie and some or all of the backups are in a separate partition? If the coordinator is able to contact any of the backups during recovery, then that will prevent the zombie from writing new data. If all of the backups are in the same partition as the zombie then the zombie will continue to accept write requests unaware of the partition. However, in this case the coordinator will not be able to recover since none of the replicas will be available, so recovery will be deferred until at least one of the replicas becomes available, at which point writes will no longer be possible.

OK, here's another scenario relating to zombies and writes: is it possible that in a partition a zombie could select backups for a new segment that are entirely within the disconnected partition? If this happens, is it possible that the coordinator might be unaware of the existence of that additional segment (it might think that the log ended with the previous segment) and complete recovery "successfully", unaware of the additional segment? If this situation were to occur, it would result in lost writes. Do our techniques for finding the head of the log prevent this situation from occurring?

Versions Compared

Old Version 8

New Version 9

Key

Potential issues