ALPO stands for "ALPO is Like Paxos, except (hopefully) more Obvious". This is an experiment to try to define a protocol for managing a replicated log using distributed consensus, in a way that is easier to understand and more complete than Paxos.

Problems with Paxos

Paxos has existed for more than 20 years, is generally believed to be correct, and has been implemented numerous times. Thus it might seem silly to think about alternative algorithms. However, Paxos suffers from the following problems:

Thus, I set out to devise a Paxos-like protocol that I could understand and convince myself to be correct. If that worked, the next step is to see if I could describe the protocol in a way that others could easily understand.

Goals

Leader election

The first part of the ALPO protocol manages the election of leaders so that (a) there is at most one leader at a time and (b) if the current leader crashes, a new leader will be elected to take its place. Note: the description below is slightly incomplete, since it does not describe the interaction between leader election and log management. The protocol will be extended slightly once log management has been introduced.

This protocol is safe because the server will never declare itself leader for a term unless it has received votes from a majority of the servers. Thus it is impossible for more than one server to be elected in a given term.

The defer mechanism ensures that the protocol will converge rapidly even if there are initially many candidates. Once a candidate has deferred, it will not become a candidate again until the timeout period has elapsed, and the timeout period will be large enough to get through several election cycles (furthermore, the timeout period is also reset whenever a passive machine receives a message from a candidate, which further reduces the likelihood that a candidate will reenter an election once it has deferred). Thus it is unlikely to take more than two election cycles to select a new leader.

Log management

This section describes how the log is managed during a particular term, and how log consistency is preserved  when leadership changes.

Clients append to the log by making a request through the leader. The leader adds the new entry to its log, then sends a request containing that log entry to each of the other servers. Each server appends the entry to its log and also writes the data to durable secondary storage; once this is done, the server is said to have accepted the log entry. Once a majority of the cluster has accepted the new log entry, the leader can respond to the client.  At this point the durability of the new entry is guaranteed; the only event that could cause permanent data loss is simultaneous catastrophic failures of more than half the servers in the cluster, causing them to lose their secondary storage.

If a passive server crashes then it will not be able to accept new data from the leader. The leader need not wait for the crashed server to restart before responding to client requests: as long as a majority of the cluster is responsive, the cluster can continue operation. When a server restarts after a crash, it enters passive mode (it does not attempt to contact the leader). If the leader does not receive an acceptance from a server when it sends a new log entry, it continues trying at regular intervals; eventually the server will restart, at which point the leader will "catch it up" on the log entries it has not yet received. This mechanism guarantees that all servers will eventually mirror all log entries, in the absence of leader figures.

Leader failures are more interesting. At the time of a leader failure, there may be one or more log entries that have been partially accepted by the cluster (i.e., the leader has not yet responded to the requesting client). There may also be any number of log entries that have been accepted by the cluster, but are not yet fully replicated on all servers. For the partially accepted entries, the new leader must guarantee that these entries are either fully replicated and accepted, or completely expunged from all logs. The entry that is expunged must never be returned to a client in a read operation: the system must behave as if the client never made its original request. For entries that have been accepted but not fully replicated, the new leader must guarantee that these entries are eventually fully replicated.

 

Linearizability for clients

State for each server

Additional state kept by leader

Contents of a log entry

Important parameters