The ALPO consensus protocol

ALPO stands for "ALPO is Like Paxos, except (hopefully) more Obvious". This is an experiment to try to define a protocol for managing a replicated log using distributed consensus, in a way that is easier to understand and more complete than Paxos.

Problems with Paxos

Paxos has existed for more than 20 years, is generally believed to be correct, and has been implemented numerous times. Thus it might seem silly to think about alternative algorithms. However, Paxos suffers from the following problems:

In its purest form, as described by Lamport, it is incomplete. Basic Paxos guarantees safety, but (a) it does not deal with liveness issues (i.e. the algorithm may never terminate), (b) its data model consists of a single value, whereas real applications need to store a sequence of values, such as a log or state machine, and (c) it does not handle listeners (i.e., consensus may be reached but some of the parties may not realize that).
Other people have extended Paxos to be more complete, but there seems to be no agreed-upon way to do this and the various descriptions are quite complicated and difficult to understand.
Systems implementing Paxos all have reputations for being flaky or hard to use (though it's not clear whether this is because of Paxos).
Even in its simplest form the algorithm is hard to understand; the extended versions are nearly impossible to understand, even for experts. There does not seem to exist a description of Paxos that is both complete and relatively easy for a wide audience to understand. For example, most graduate students are not capable of understanding the full Paxos algorithm as part of a graduate course in operating systems.

Thus, I set out to devise a Paxos-like protocol that I could understand and convince myself to be correct. If that worked, the next step is to see if I could describe the protocol in a way that others could easily understand.

Goals

Create a log that is replicated across a cluster of servers. Each log entry will be identical on each of the servers.
The log is sequentially ordered: each entry has a unique integer id, and ids are assigned in ascending order.
If a server has accepted a given log entry, then it also has accepted all log entries with smaller ids than the given one.
If a majority of the servers have accepted a particular log entry, then that log entry is called accepted: it will eventually be accepted by all servers, and it will not be lost unless a majority of the servers suffer simultaneous catastrophic failures that lose all of their persistent data.
One of the servers in the cluster is designated the leader; all client requests that modify the log must be processed by the leader.
Clients can read log entries from any server, though there will usually be a time delay between when data can be read from the leader and when it can be read from other servers.
Leadership can move among the servers in the cluster in response to failures.
There is at most one leader in the cluster at a time. Once a new leader has been elected, it will not be possible for the previous leader to make noticeable updates to the log until it has been reelected.

Leader election

The first part of the ALPO protocol manages the election of leaders so that (a) there is at most one leader at a time and (b) if the current leader crashes, a new leader will be elected to take its place. Note: the description below is slightly incomplete, since it does not describe the interaction between leader election and log management. The protocol will be extended slightly once log management has been introduced.

ALPO servers divide into three categories: passive, leader, and candidate. Most servers at any given time are passive: they respond to requests from the leader but take no actions on their own. Passive servers become candidates during election cycles as described below.
In normal operation one of the servers in the cluster is the leader and all the other servers are passive. The leader must contact each of the passive servers at regular intervals, either by passing them new law entries or with a no-op heartbeat request. Each passive server keep track of the last time it received a message from the leader (or from candidates during an election); if a long period elapses with no such message (the timeout interval) then the server assumes that the leader has crashed and no one else is attempting to replace it; it converts itself from passive to candidate and begins an election cycle.
Time is divided up into terms, where terms have integer ids that increase monotonically. During each term there will be at most one leader in the cluster. The term starts with a single leader election and may be followed by a reign for the election winner. If the election produced a split vote then it is possible that there may be no winner, and hence no reign during this term. Each server stores id for the current term.
When a server becomes a candidate, it increments its current term, marks its own vote to indicate that it is voting for itself, and then contacts each of the other servers in the cluster to request their vote. The other servers increment their current term to match the new value, and respond to the request in one of the following ways:
- "You have my vote": this means that the server has not given its vote to any other candidate in this term. In returning this response, the server promises not to give its vote to any other candidate for the current term, and it will not become a candidate in this term.
- If the server has already given its vote to another candidate, then it returns a rejection that includes the id of the candidate that received its vote.
- If the server is down it may not respond at all.
The candidate continues in this phase (retrying with nonresponsive servers) until one of the following things happens:
- It receives votes from the majority of the servers in the cluster. At this point it declares itself the new leader and begins regular communication with the other servers in the cluster.
- It receives a message from a server claiming to be leader for this term. In this case the candidate accepts the new leader and returns to passive state.
- It receives one or more vote rejections. For each vote rejection the candidate compares its own rank with the data of the candidate that received the vote; for now, rank is determined purely by server id (this will be extended slightly below). If the candidate outranks the vote receiver, then it issues a defer request to the vote receiver. If the vote receiver already has enough votes to become leader, then it responds with that indication, in which case the requesting candidate will accept the new leader and return to passive state. If the vote receiver has not yet become leader, then it defers to the (higher-ranked) requesting candidate by returning itself to passive state.
- The sum of votes received plus votes owned by other candidates who have deferred represents the majority of the cluster. In this case the candidate increments the term and starts a new election cycle. It is likely to win during this cycle, because the competing candidates have all returned to passive state and will not become candidates again until the timeout period elapses.
- It receives a defer message from some other candidate with higher rank. In this case the candidate returns to passive state.

This protocol is safe because the server will never declare itself leader for a term unless it has received votes from a majority of the servers. Thus it is impossible for more than one server to be elected in a given term.

The defer mechanism ensures that the protocol will converge rapidly even if there are initially many candidates. Once a candidate has deferred, it will not become a candidate again until the timeout period has elapsed, and the timeout period will be large enough to get through several election cycles (furthermore, the timeout period is also reset whenever a passive machine receives a message from a candidate, which further reduces the likelihood that a candidate will reenter an election once it has deferred). Thus it is unlikely to take more than two election cycles to select a new leader.

Log management

This section describes how the log is managed during a particular term, and how log consistency is preserved when leadership changes.

Clients append to the log by making a request through the leader. The leader adds the new entry to its log, then sends a request containing that log entry to each of the other servers. Each server appends the entry to its log and also writes the data to durable secondary storage; once this is done, the server is said to have accepted the log entry. Once a majority of the cluster has accepted the new log entry, the leader can respond to the client. At this point the durability of the new entry is guaranteed; the only event that could cause permanent data loss is simultaneous catastrophic failures of more than half the servers in the cluster, causing them to lose their secondary storage.

If a passive server crashes then it will not be able to accept new data from the leader. The leader need not wait for the crashed server to restart before responding to client requests: as long as a majority of the cluster is responsive, the cluster can continue operation. When a server restarts after a crash, it enters passive mode (it does not attempt to contact the leader). If the leader does not receive an acceptance from a server when it sends a new log entry, it continues trying at regular intervals; eventually the server will restart, at which point the leader will "catch it up" on the log entries it has not yet received. This mechanism guarantees that all servers will eventually mirror all log entries, in the absence of leader figures.

Leader failures are more interesting. At the time of a leader failure, there may be one or more log entries that have been partially accepted by the cluster (i.e., the leader has not yet responded to the requesting client). There may also be any number of log entries that have been accepted by the cluster, but are not yet fully replicated on all servers. For the partially accepted entries, the new leader must guarantee that these entries are either fully replicated and accepted, or completely expunged from all logs. The entry that is expunged must never be returned to a client in a read operation: the system must behave as if the client never made its original request. For entries that have been accepted but not fully replicated, the new leader must guarantee that these entries are eventually fully replicated.

Linearizability for clients

State for each server

Term number: number of the most recent term seen by this server (updated when notified by a leader or a candidate, or when the server becomes a candidate).
Vote: the server id of the candidate who has received this server's vote for the current term (if any).
Server id of this server.
Log entries that has been accepted by the server.
Id of the most recent log entry that has been accepted by all servers.
Time of receipt of the last request from the leader.
Cluster map: id and location of each server in the cluster, whether dead or alive.

Additional state kept by leader

For each other server, id of most recent log entry accepted by that server.
For each client: serial # of most recent request received from the client.

Contents of a log entry

Id: integer that serializes this entry within log: 1 for first entry, 2 for next, etc.
Client id: identifies the client that created this entry.
Client serial: serial # of the client request that created this entry (used to handle duplicate client requests).

Important parameters

Timeout interval: if a passive server receives no communication from a leader or candidate within this time period, then the server will convert to candidacy and initiate an election. This parameter should be an order of magnitude larger than the normal time it takes for one server to contact all of the other servers in the cluster.