ALPO stands for "ALPO is Like Paxos, except (hopefully) more Obvious". It is an experiment to try defining a protocol for managing a replicated log using distributed consensus, in a way that is easier to understand and more complete than Paxos.

Problems with Paxos

Paxos has existed for more than 20 years, is generally believed to be correct, and has been implemented numerous times. Thus it might seem silly to think about alternative algorithms. However, Paxos suffers from the following problems:

  • In its purest form, as described by Lamport, it is incomplete. Basic Paxos guarantees safety, but (a) it does not deal with liveness issues (i.e. the algorithm may never terminate), (b) its data model consists of a single value, whereas real applications need to store a sequence of values, such as a log or state machine, and (c) it does not handle listeners (i.e., consensus may be reached but some of the parties may not realize that).
  • Other people have extended Paxos to be more complete, but there seems to be no agreed-upon way to do this and the various descriptions are quite complicated and difficult to understand.
  • Systems implementing Paxos all have reputations for being flaky or hard to use (though it's not clear whether this is because of Paxos).
  • Even in its simplest form the algorithm is hard to understand; the extended versions are nearly impossible to understand, even for experts. There does not seem to exist a description of Paxos that is both complete and relatively easy for a wide audience to understand. For example, most students in graduate operating systems classes are not capable of understanding the full Paxos algorithm.

Thus, I set out to devise a Paxos-like protocol that I could understand and convince myself to be correct. If that worked, the next step was to see whether I could describe the protocol in a way that others could easily understand.

Goals

  • Create a log that is replicated across a cluster of servers. Each log entry will be identical on each of the servers.
  • The log is sequentially ordered: each entry has a unique integer id, and ids are assigned in ascending order.
  • Clients will make requests to append entries to the log, and to read log entries in order.
  • If a server has accepted a given log entry, then it also has accepted all log entries with smaller ids than the given one.
  • If a majority of the servers have accepted a particular log entry, then that log entry is called guaranteed: it will eventually be accepted by all servers, and it will not be lost unless a majority of the servers suffer simultaneous catastrophic failures that lose all of their persistent data.
  • One of the servers in the cluster is designated the leader; all client requests that modify the log must be processed by the leader.
  • Clients can read log entries from any server, though there will be a time delay between when data can be read from the leader and when it can be read from other servers.
  • Leadership can move among the servers in the cluster in response to failures.
  • There is at most one leader in the cluster at a time. Once a new leader has been elected, it will not be possible for the previous leader to make updates to the log.

Leader election

The first part of the ALPO protocol manages the election of leaders so that (a) there is at most one leader at a time and (b) if the current leader crashes, a new leader will be elected to take its place. Note: the description below is slightly incomplete, since it does not describe the interaction between leader election and log management. The protocol is extended later in this document once log management has been introduced.

  • Each ALPO server is in one of three states: passive, leader, or candidate. Most servers at any given time are passive: they respond to requests from the leader but take no actions on their own (a passive server never issues an RPC request). A candidate is a server that is attempting to become leader. Passive servers become candidates during elections as described below.
  • In normal operation one of the servers in the cluster is the leader and all the other servers are passive. The leader must contact each of the passive servers at regular intervals, either by passing them new log entries or with a no-op heartbeat request. Each passive server keeps track of the last time it received a message from the leader (or from candidates during an election); if a long period elapses with no such message (the timeout interval), then the server assumes that the leader has crashed and no one else is attempting to replace it; it converts itself from passive to candidate and begins an election cycle.
  • Time is divided up into terms, where terms have integer ids that increase monotonically. During each term there will be at most one leader in the cluster. The term starts with a single leader election and may be followed by a reign for the election winner. If the election produced a split vote then it is possible that there will be no winner, and hence no reign during this term. Each server stores the id for the current term; different servers may have different ideas of the current term, but the ids will converge over time, and no leader can be elected unless a majority of the servers have (at some time) reached a particular term.
  • When a server becomes a candidate, it increments its current term, records a vote for itself, and then contacts each of the other servers in the cluster to request their votes. The other servers update their current term if needed to match the new value, and respond to the request in one of the following ways (a rough sketch of this vote-handling logic follows the list):
    • "You have my vote": this means that the server has not given its vote to any other candidate in this term. In returning this response, the responder promises not to give its vote to any other candidate for the current term, and it will not become a candidate in this term.
    • If the server has already given its vote to another candidate, then it returns a rejection that includes the id of the candidate that received its vote.
    • If the server is down it may not respond at all.
  • The candidate continues in this phase (retrying with nonresponsive servers) until one of the following things happens:
    • It receives votes from the majority of the servers in the cluster. At this point it becomes the leader and begins regular communication with the other servers in the cluster.
    • It receives a message from another server claiming to be leader for this term. In this case the candidate accepts the new leader and returns to passive state.
    • It receives one or more vote rejections. For each vote rejection the candidate compares its own rank with the rank of the candidate that received the vote; for now, rank is determined purely by server id (this will be extended slightly below). If the candidate outranks the vote receiver, then it issues a defer request to the vote receiver. If the vote receiver already has enough votes to become leader, then it responds with that indication, in which case the requesting candidate will accept the new leader and return to passive state. If the vote receiver has not yet become leader, then it defers to the (higher-ranked) requesting candidate by returning itself to passive state; this will make it possible for the higher-ranked candidate to win a future election. If the candidate's rank is less than that of the vote receiver then the candidate passivates.
    • The votes it has received, plus the votes held by lower-ranked candidates that have deferred to it, add up to a majority of the cluster. In this case the candidate increments the term and starts a new election cycle (at this point no one can win the current election cycle).  The candidate is likely to win during this cycle, because the competing candidates have all returned to passive state and will not become candidates again until the timeout period elapses.
    • It receives a defer request from some other candidate with higher rank. In this case the candidate returns to passive state.
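
Here is a rough sketch (in Python) of how a server might handle an incoming vote request. The names (Server, current_term, voted_for, handle_vote_request) are mine, not part of the protocol; this is only meant to illustrate the rules above:

    # Illustrative only: how a server might decide whether to grant its vote.
    class Server:
        def __init__(self, server_id):
            self.id = server_id
            self.current_term = 0
            self.voted_for = None      # candidate that holds our vote for current_term

        def handle_vote_request(self, candidate_id, candidate_term):
            # Update our current term if the candidate's is newer.
            if candidate_term > self.current_term:
                self.current_term = candidate_term
                self.voted_for = None
            if candidate_term < self.current_term:
                return {"granted": False, "reason": "stale term"}
            if self.voted_for is None or self.voted_for == candidate_id:
                # "You have my vote": promise not to vote for anyone else, and
                # not to become a candidate ourselves, in this term.
                self.voted_for = candidate_id
                return {"granted": True}
            # Vote already given away: report who has it, so the candidate can
            # decide whether to issue a defer request.
            return {"granted": False, "voted_for": self.voted_for}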

This protocol is safe because a server will never declare itself leader for a term unless it has received votes from a majority of the servers. Thus it is impossible for more than one server to be elected in a given term.  Furthermore, once a new leader has been elected, the old leader will be unable to create guaranteed log entries (due to the term management protocol described below).

The defer mechanism ensures that the protocol will converge rapidly even if there are initially many candidates. Once a candidate has deferred, it will not become a candidate again until the timeout period has elapsed, and the timeout period will be large enough to get through several election cycles. In addition, the timeout period is reset whenever a passive machine receives a message from a candidate, which further reduces the likelihood that a candidate will reenter an election once it has deferred. Thus it is unlikely to take more than two election cycles to select a new leader.
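
To make the defer mechanism concrete, here is a sketch of the candidate's reaction to a vote rejection. Rank here is just the server id; I have arbitrarily assumed that a larger id outranks a smaller one (the text above only says that ids determine rank), and send_defer stands in for the actual defer request:

    # Illustrative only: what a candidate does with a rejection that names the
    # server currently holding the vote.
    def react_to_vote_rejection(my_id, vote_holder_id, send_defer):
        if my_id < vote_holder_id:
            return "passivate"                     # outranked: drop out of the race
        reply = send_defer(my_id, vote_holder_id)  # ask the lower-ranked holder to step aside
        if reply == "already_leader":
            return "passivate"                     # too late: accept the new leader
        # The vote holder has passivated; its voters become available in a later
        # election cycle, which this candidate is likely to win.
        return "keep_campaigning"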

Log management

This section describes how the log is managed during a particular term, and how log consistency is preserved when leadership changes.

Clients append to the log by making a request through the leader. The leader adds the new entry to its log, then sends a request containing that log entry to each of the other servers. Each server appends the entry to its log and also writes the data to durable secondary storage; once this is done, the server is said to have accepted the log entry. Once a majority of the cluster has accepted the new log entry, the leader can respond to the client.  At this point the entry is called guaranteed because its durability is assured; the only event that could cause it to be lost is simultaneous catastrophic failures of more than half the servers in the cluster, causing them to lose their secondary storage.
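
As a rough sketch of the leader's side (the names leader_append, log, and accept are mine, not part of the protocol):

    # Illustrative only: replicate a new entry and report whether it is guaranteed.
    def leader_append(leader, followers, entry_data):
        cluster_size = len(followers) + 1
        entry = {"id": len(leader.log) + 1, "data": entry_data}   # ids assigned in order
        leader.log.append(entry)                  # the leader accepts its own entry
        accepted = 1
        for follower in followers:
            if follower.accept(entry):            # append + write to durable storage
                accepted += 1
        # Once a majority of the cluster has accepted the entry it is guaranteed,
        # and the leader may respond to the client.
        return accepted > cluster_size // 2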

If a passive server crashes then it will not be able to accept new data from the leader. The leader need not wait for the crashed server to restart before responding to client requests: as long as a majority of the cluster is responsive, the cluster can continue operation. When a server restarts after a crash, it enters passive mode (it does not attempt to contact the leader). If the leader does not receive an acceptance from a server when it sends a new log entry, it continues trying at regular intervals; eventually the server will restart, at which point the leader will "catch it up" on the log entries it has not yet received. This mechanism ensures that all servers will eventually mirror all log entries (in the absence of leader failures).

Leader failures are more interesting. At the time of a leader failure, there may be one or more log entries that have been partially accepted by the cluster (i.e., the leader has not yet responded to the requesting client). There may also be any number of log entries that have been guaranteed by the cluster, but are not yet fully replicated on all servers. For the partially accepted entries, the new leader must guarantee that these entries are either fully replicated and accepted, or completely expunged from all logs. Any entry that is expunged must never be returned to a client in a read operation: the system must behave as if the client never made its original request. For entries that have been guaranteed but not fully replicated, the new leader must make sure that these entries are eventually fully replicated.

First, let's handle the case of guaranteed but not fully replicated entries. ALPO makes sure that the new leader is chosen from among those servers that have accepted all of the guaranteed entries. It does this by modifying the notion of rank during leader election. When candidates request votes they include in the request the term for the election, their server id, and the log id of the most recent entry they have accepted. One candidate automatically outranks another if its "last accepted id" is higher than that of the other candidate. If they both have the same "last accepted id", then rank is determined by server id. Furthermore, a server will automatically reject a request for its vote if its "last accepted id" is higher than that of the requesting candidate, even if its vote is still available; when a candidate receives this form of rejection it immediately drops out of the election and returns to passive state.
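
Expressed as code, the modified rank is a simple tuple comparison. The sketch below assumes that a larger server id outranks a smaller one, since the text does not say which direction the tie-break goes:

    # Illustrative only: rank for leader election, with log position folded in.
    def outranks(a, b):
        """True if candidate a outranks candidate b."""
        return (a["last_accepted_id"], a["server_id"]) > (b["last_accepted_id"], b["server_id"])

    def must_reject_vote(voter_last_accepted_id, candidate_last_accepted_id):
        """A voter rejects a candidate whose log is behind its own, even if its
        vote is still available; the candidate then drops out of the election."""
        return voter_last_accepted_id > candidate_last_accepted_id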

Since a candidate requires votes from a majority of the cluster to become leader, and since it has accepted all of the log entries that were accepted by any of the other servers that voted for it, and since any guaranteed log entry must have been accepted by a majority of the servers in the cluster, the new leader is certain to store all of the guaranteed log entries. As it communicates with the other servers in the cluster it can update any that are running behind. In particular, when the leader sends a new log entry to a passive server, the passive server will reject the request unless it already stores all of the log entries with smaller ids. When this happens, the passive server indicates to the leader the highest id that it stores, so the leader can then send it all of the missing entries.

The second problem is the case of partially accepted (but not yet guaranteed) log entries. The algorithm described in the previous paragraph makes it likely that the new leader will store these entries, in which case they will get fully replicated by the new leader. However, if an entry has only been accepted by a small number of servers then it is possible that a new leader can be elected without storing the entry. In this case the new leader must make sure that the entry is expunged by all other servers. The way it does this is by initiating a log append for a new entry indicating leadership change. The id for this entry will be the next one in order on the leader. When this entry arrives at a passive server, the passive server deletes any existing entries with this id or higher before it accepts the entry. This is the only situation where a passive server receives a log entry whose id it has already accepted.
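
The sketch below combines the two passive-server behaviors just described: the consistency check that drives catch-up, and the deletion of conflicting entries when a new leader's leadership-change entry arrives. The names are illustrative:

    # Illustrative only: how a passive server might process an entry from the leader.
    def handle_append(server, entry):
        last_id = server.log[-1]["id"] if server.log else 0
        if entry["id"] > last_id + 1:
            # Missing earlier entries: reject, and report how far we got so the
            # leader can send the entries we are missing.
            return {"accepted": False, "last_id": last_id}
        if entry["id"] <= last_id:
            # A new leader is expunging partially accepted entries: delete any
            # existing entries with this id or higher before accepting.
            server.log = [e for e in server.log if e["id"] < entry["id"]]
        server.log.append(entry)
        server.write_to_durable_storage(entry)    # the entry is now accepted
        return {"accepted": True, "last_id": entry["id"]}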

Leader failures also introduce the potential for zombie leaders. A zombie leader is a leader that has been replaced but does not yet know it. ALPO must make sure that zombie leaders cannot modify the log. To do this, each request issued by the leader includes the leader's term number. If a passive server receives a request whose term is lower than the server's current term, then it rejects the request. If a leader receives such a rejection then it knows it has been deposed, so it returns to passive state. Before a server gives its vote to a candidate it must increase its term number to that of the new term. This guarantees that by the time the new leader knows it has been elected, it is impossible for the previous leader to communicate with a majority of the cluster, so it cannot create guaranteed log entries. Furthermore, if a leader receives any election-related requests from candidates with a higher term number, this also indicates that the leader has been deposed, so it returns to passive state.

One final issue related to log management is log cleaning. ALPO allows each server to perform cleaning (or any other form of log truncation) on its log independently of the other servers. However, there is one restriction on log cleaning: a server must not delete a log entry until it has been fully replicated. Otherwise the server could become leader and need that entry to update a lagging passive server. To ensure this property, the leader keeps track of the highest log id that has been fully replicated and includes this value in any requests that it makes to other servers. The other servers use this information to restrict cleaning; in most cases the fully-replicated-id will be at or near the head of the log, so this will not impose much of a restriction on cleaning.
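
The bookkeeping is simple: the fully-replicated id is just the smallest of the per-server "last accepted" ids that the leader already tracks (see the leader state listed later), and each server compares entry ids against it before cleaning. A sketch, with illustrative names:

    # Illustrative only: the cleaning restriction.
    def fully_replicated_id(last_accepted_ids):
        """last_accepted_ids: id of the most recent entry accepted by each other server."""
        return min(last_accepted_ids.values())

    def may_clean(entry_id, fully_replicated):
        """A server must not delete an entry until it has been fully replicated."""
        return entry_id <= fully_replicated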

Clients: exactly-once semantics

In ALPO, clients must send any requests that result in log modifications to the leader. If such a request arrives at a passive server (for example, because it used to be leader but has been deposed) the passive server rejects the request; in most cases it will be able to tell the client who is currently the leader. Clients can send read requests to any server in the cluster. However, passive servers may not be able to return the most recent log entries, for two reasons. First, the server might not have accepted the most recent log entries yet. Second, only guaranteed log entries can be returned to clients, and a passive server may not know whether its most recent log entries are guaranteed. One way to handle this is for the leader to include the highest guaranteed id in each request to other servers; in most cases, the append for entry N would indicate that entry N-1 is now guaranteed, so the passive server would lag at most one log entry behind the leader. Thus, if clients can tolerate a small amount of lag they can issue log reads to any server; if they want to be assured of getting all the most recent data, then they must send requests to the leader.

ALPO can provide exactly-once semantics to clients, meaning that if the leader fails while processing a log append request from a client, the client library package can automatically retry the request once a new leader has been elected, and the new entry will be guaranteed to be appended to the log exactly once (regardless of whether the original request completed before the leader crashed). In order to implement exactly-once semantics, clients must provide a unique serial number in each request; this serial number, along with the client identifier, must be included in every log entry so that it is seen by every server. Using this information, each server can keep track of the most recent serial number that has been successfully completed for each client. When a client retries a request because of a leader failure, the new leader can use this information to skip the request if it has already been successfully completed.
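
Here is a sketch of the duplicate check. completed_serials is the per-client table described above; I am assuming that each client's serial numbers increase monotonically:

    # Illustrative only: skip a client request that has already been completed.
    def apply_client_request(completed_serials, client_id, serial, append_entry):
        if serial <= completed_serials.get(client_id, 0):
            return "already applied"            # a retry of a completed request
        append_entry(client_id, serial)         # append and replicate as usual
        completed_serials[client_id] = serial
        return "applied"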

Managing terms

This section contains more detailed information on managing terms. Terms are used to distinguish votes from different election cycles, and also to help servers detect when they are out of date with respect to the rest of the cluster.  In general, if a candidate or leader finds itself out of date it immediately passivates, under the assumption that someone else will take over leadership.

  • Each server stores a term number called currentTerm. This indicates the most recent term that has been used by this server or received in a message from another server.
  • When a server starts, currentTerm is set to 0.
  • Every message from server to server contains the term of the sender, plus an indication whether the sender is a candidate or leader (passive servers never issue requests). This value (call it senderTerm) is used to update currentTerm and to detect out-of-date servers (a sketch of these rules follows the list):
    • senderTerm < currentTerm: reject the message with an indication that the sender is out of date; the rejection also includes currentTerm. When the sender receives the rejection it updates its currentTerm to match the one in the response, then passivates. This makes sense for leaders because it means their term of leadership is over. It also makes sense for candidates because it means there is already a new election cycle with other active candidates; there is no need for the sender to participate.
    • senderTerm == currentTerm: if the sender is a leader, then set lastReign to senderTerm.
    • senderTerm > currentTerm: set currentTerm to senderTerm. If the recipient is currently a leader or candidate, then it passivates.
  • When a server switches from passive to candidate, it increments currentTerm to force a new election cycle.
  • I do not believe that currentTerm needs to be saved on disk. When a server restarts, it sets currentTerm to 0. Servers start out in passive state; most likely the leader will contact them (which will update currentTerm) before the new server times out and becomes a candidate. If all servers crash simultaneously, it should be OK for the term numbers to restart at 1. If all servers but one crash and restart, and for some reason the remaining server is disconnected from the others, it's possible that the new servers will choose a leader with term 1, without contacting the remaining server. At some point they will eventually contact it, which will cause the currentTerms to update. The existing leader will passivate, and a new election cycle will eventually occur; at this point everyone will be caught up to the term number of the server that didn't crash.
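
Here is a sketch of the senderTerm rules above. state is a dictionary standing in for a server's state, and role is an illustrative name for the passive/candidate/leader distinction:

    # Illustrative only: updating currentTerm when a message arrives.
    def handle_sender_term(state, sender_term, sender_is_leader):
        if sender_term < state["currentTerm"]:
            # The sender is out of date: reject, and return our currentTerm so
            # that the sender can update it and passivate.
            return {"reject": True, "currentTerm": state["currentTerm"]}
        if sender_term > state["currentTerm"]:
            state["currentTerm"] = sender_term
            if state["role"] in ("leader", "candidate"):
                state["role"] = "passive"       # we are out of date: passivate
        if sender_is_leader:
            state["lastReign"] = sender_term    # the sender is reigning in this term
        return {"reject": False}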

State for each server

  • currentTerm: number of the most recent term seen by this server.
  • Vote: the server id and log length of the candidate who has received this server's vote for currentTerm (if any).
  • Server id of this server.
  • Log entries that have been accepted by the server.
  • Id of the most recent log entry that has been accepted by all servers.
  • Time of receipt of the last request from the leader.
  • Cluster map: id and location of each server in the cluster, whether dead or alive.

Additional state kept by leader

  • For each other server, id of most recent log entry accepted by that server.
  • For each client: serial # of most recent request received from the client.

Contents of a log entry

  • Id: integer that serializes this entry within the log: 1 for first entry, 2 for next, etc.
  • Client id: identifies the client that created this entry.
  • Client serial: serial # of the client request that created this entry (used to handle duplicate client requests).
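
For concreteness, here is the state listed above gathered into Python dataclasses. The field names are mine; the actual representation is of course up to the implementation:

    # Illustrative only: per-server state, leader-only state, and log entry contents.
    from dataclasses import dataclass, field
    from typing import Dict, List, Optional, Tuple

    @dataclass
    class LogEntry:
        id: int                # 1 for the first entry, 2 for the next, etc.
        client_id: int         # client that created this entry
        client_serial: int     # serial # of the client request (duplicate detection)

    @dataclass
    class ServerState:
        server_id: int
        current_term: int = 0
        vote: Optional[Tuple[int, int]] = None   # (candidate id, log length) voted for this term
        log: List[LogEntry] = field(default_factory=list)
        fully_replicated_id: int = 0             # most recent entry accepted by all servers
        last_leader_contact: float = 0.0         # time of the last request from the leader
        cluster_map: Dict[int, str] = field(default_factory=dict)   # server id -> location

    @dataclass
    class LeaderState:
        last_accepted: Dict[int, int] = field(default_factory=dict)    # server id -> last accepted log id
        client_serials: Dict[int, int] = field(default_factory=dict)   # client id -> last completed serial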

Important parameters

  • Timeout interval: if a passive server receives no communication from a leader or candidate within this time period, then the server will convert to candidacy and initiate an election. This parameter should be an order of magnitude larger than the normal time it takes for one server to contact all of the other servers in the cluster.