Content Comparison

...

This section describes how the log is managed during a particular term, and how log consistency is preserved when leadership changes.

Replicating new entries. Clients append to the log by making requests through the leader. The leader adds the new entry to its log, then sends a request containing that log entry to each of the other servers. Each server appends the entry to its log and also writes the data to durable secondary storage; once this is done, the server is said to have accepted the log entry. Once a majority of the cluster has accepted the new log entry, the leader can respond to the client. At this point the entry is called guaranteed because its durability is assured; the only event that could cause to be lost is simultaneous catastrophic failures (i.e. they lose their secondary storage permanently) of more than half the servers in the cluster.

Passive server crashes. If a passive server crashes then it will not be able to accept new data from the leader. The leader need not wait for the crashed server to restart before responding to client requests: as long as a majority of the cluster is responsive, the cluster can continue operation. When a server restarts after a crash, it enters passive mode (it does not attempt to contact the leader). If the leader does not receive an acceptance from a server when it sends a new log entry, it continues trying at regular intervals; eventually the server will restart, at which point the leader will "catch it up" with the log entries it has not yet accepted. This mechanism ensures that all servers will eventually mirror all log entries.

Leader crashes. Leader failures are more interesting. At the time of a leader failure, there may be one or more log entries that have been partially accepted by the cluster (i.e., the leader has not yet responded to the requesting client). There may also be any number of log entries that have been guaranteed by the cluster, but are not yet fully replicated on all servers. For the partially accepted entries, the new leader must guarantee that these entries are either fully replicated, or completely expunged from all logs. Any entry that is expunged must never be returned to a client in a read operation: the system must behave as if the client never made its original request. For entries that have been guaranteed but not fully replicated, the new leader must make sure that these entries are eventually fully replicated.

First, let's handle the case of guaranteed but not fully replicated entries. ALPO makes sure that the new leader is chosen from among those servers whose log is complete (i.e. it includes all of the guaranteed entries). It does this by modifying extending the notion of rank during leader election. When candidates request votes they include in the request the log id of the most recent entry they have accepted. One candidate automatically outranks another if its "last accepted id" is higher than that of the other candidate. If they both have the same "last accepted id", then rank is determined by server id. Furthermore, a server will automatically reject a request for its vote if its "last accepted id" is higher than that of the requesting candidate, even if its vote is still available; when a candidate receives this form of rejection it immediately drops out of the election and returns to passive state.

...

The second problem is the case of partially accepted (but not yet guaranteed) log entries. The algorithm described in the previous paragraph makes it likely that the new leader will store these entries, in which case they will get fully replicated by the new leader. However, if an entry has only been accepted by a small number of servers, it is possible that a new leader can be elected without storing the entry. In this case the new leader must make sure that the entry is expunged by all other servers. It does this by appending a new log entry indicating leadership change. The id for this entry will be the next one in order on the leader. When this entry arrives at a passive server, the passive server deletes any existing entries with this id or higher before it accepts the entry. This is the only situation where a passive server receives a log entry whose id it has already accepted. Problem: passive server P stores a not-yet-guaranteed entry E when the leader goes down. The new leader creates the entry indicating leadership change (C) and starts replicating it; the entry gets rotated enough to be guaranteed, but the new leader crashes before C is stored on P. The next leader will now create a new leadership change entry, but at log position C+1; it's not clear how entry E will get overwritten with C.Leader failures also introduce the potential for zombie The next paragraph describes how this happens.

Restoring log consistency after a leader change. When a new leader takes over, its log is completely up-to-date, but logs on other servers may be inconsistent in two ways:

A log may be missing one or more of the most recent entries (if there are missing entries, they will be the most recent entries in the log).
The most recent entries in the log may be extraneous ones that are not present on the leader because they were not guaranteed at the time of the leader changeover. These entries must be deleted.

Both of these problems are handled in the same way. Whenever the leader sends a log entry to other servers for replication, it includes the term of the previous entry in the log (in addition to the term of the new entry and its log id, which are included in the entry itself). The recipient accepts the new entry only if (a) the id for the new entry is one greater than the id for the most recent entry currently in its log and (b) the term of the most recent entry currently in the log matches the "previous term" included in the request. Test (a) will detect missing entries, and test (b) will detect extraneous entries. If either of these tests fails, the recipient rejects the new entry and returns the id and term for its most recent entry. When the leader receives such a rejection it knows it must reply one or more older entries to the passive server:

The leader looks up its log the entry corresponding to the id returned from the passive server. If this entry exists and its term matches the term returned from the passive server, then the leader replays to the passive server all of its log entries after this one.
If the term does not match, or the entry does not exist, it means that the passive server has extraneous entries that were created at the end of the given term but never got deleted. In this situation the leader replays all of the log entries whose terms are greater than the one returned from the passive server.

Zombie leaders. A zombie leader is a leader that has been replaced but does not yet know it. ALPO must make sure that zombie leaders cannot modify the log. To do this, each request issued by the leader includes the leader's term number. If a passive server receives a request whose term is lower than the server's current term, then it rejects the request. If a leader receives such a request then it knows it has been deposed, so it returns to passive state. Before a server gives its vote to a candidate it must increase its term number to that of the new term. This guarantees that by the time new leader knows that it elected, it is impossible for the previous leader to communicate with a majority of the cluster, so it cannot create guaranteed log entries.

Log cleaning. One final issue related to log management is log cleaning. ALPO allows each server to perform cleaning (or any other form of log truncation) on its log independently of the other servers. However, there is one restriction on log cleaning: a server must not delete a log entry until it has been fully replicated. Otherwise the server could become leader and need that entry to update a lagging passive server. To ensure this property, the leader keeps track of the highest log id that has been fully replicated and includes this value in any requests that make to other servers. The other servers use this information to restrict cleaning; in most cases the fully-replicated-id will be at or near the head of the log, so this will not impose much of a restriction on cleaning.

...

In ALPO, clients must send any requests that result in log modifications to the leader. If such a request arrives at a passive server (for example, because it used to be leader but has been deposed) the passive server rejects the request; in most cases it will be able to tell the client who is currently the leader. Clients can send read requests to any server in the cluster. However, passive servers may not it will be able to return the most recent log entries, for two reasons. First, the server might not have accepted the most recent log entries yet. Second, only guaranteed log entries can be returned to clients, and a passive server may not know whether its most recent log entries are guaranteed. One way to handle this is for the leader to include the highest guaranteed id in each request to other servers; in most cases, the append for entry N would indicate that entry N-1 is now guaranteed, so the passive server would would lag at most one log entry in comparison to the leader. Thus, if clients can tolerate a small amount of lag they can issue log reads to any server; if they want to be assured of getting all the most recent data, then they must send requests to the leader.

ALPO can provide exactly-once semantics to clients, meaning that if the leader fails while processing a log append request from a client, the client library package can automatically retry the request once a new leader has been elected, and the new entry will be guaranteed to be appended to the log exactly once (regardless of whether the original request completed before the leader crashed). In order to implement exactly-once semantics, clients must provide a unique serial number in each request; this serial number, along with the client identifier, must be included in every log entry so that it is seen by every server. Using this information, each server can keep track of the most recent serial number that has been successfully completed for each client. When a client retries a request because of leader failure, the new leader can use this information to skip the request if it has already been successfully completed.

Managing terms

This section contains more detailed information on managing terms. Terms are used to distinguish votes from different election cycles, and also to help servers detect when they are out of date with respect to the rest of the cluster. In general, if a candidate or leader finds itself out of date it immediately passivates, under the assumption that someone else is leader or will become leader soon.

...

senderTerm < currentTerm: reject the message with an indication that the sender is out of date; the rejection also includes currentTerm. When the sender receives the rejection it updates its currentTerm to match the one in the response, then passivates. This makes sense for leaders because it means their term of leadership is over. This also make sense for candidates because it means there is already a new election cycle with other active candidates; there is no need for the sender to participate.
senderTerm == currentTerm: process the request normally.
senderTerm > currentTerm: set currentTerm to senderTerm. If the recipient is currently a leader or candidate, then it passivates. Finally, it processes the request.

...

tell the client who is currently the leader. Clients can send read requests to any server in the cluster. However, passive servers may not be able to return the most recent log entries, for two reasons. First, the server might not have accepted the most recent log entries yet. Second, only guaranteed log entries can be returned to clients, and a passive server may not know whether its most recent log entries are guaranteed. One way to handle this is for the leader to include the highest guaranteed id in each request to other servers; in most cases, the append for entry N would indicate that entry N-1 is now guaranteed, so the passive server would would lag at most one log entry in comparison to the leader. Thus, if clients can tolerate a small amount of lag they can issue log reads to any server; if they want to be assured of getting all the most recent data, then they must send requests to the leader.

ALPO can provide exactly-once semantics to clients, meaning that if the leader fails while processing a log append request from a client, the client library package can automatically retry the request once a new leader has been elected, and the new entry will be guaranteed to be appended to the log exactly once (regardless of whether the original request completed before the leader crashed). In order to implement exactly-once semantics, clients must provide a unique serial number in each request; this serial number, along with the client identifier, must be included in every log entry so that it is seen by every server. Using this information, each server can keep track of the most recent serial number that has been successfully completed for each client. When a client retries a request because of leader failure, the new leader can use this information to skip the request if it has already been successfully completed.

Managing terms

This section contains more detailed information on managing terms. Terms are used to distinguish votes from different election cycles, and also to detect out-of-date servers and extraneous log entries. In general, if a candidate or leader finds itself out of date it immediately passivates, under the assumption that another server is leader or will become leader soon.

Each server stores a term number called currentTerm. This indicates the most recent term that has been seen by this server.
Every log entry contains the term in which the entry was created. When a server starts up, it sets currentTerm to the term from the last entry in its log; if the log is empty, currentTerm is initialized to 0.
Every message from server to server contains the term of the sender, plus an indication whether the sender is a candidate or leader (passive servers never issue requests). This value (call it senderTerm) is used to update currentTerm and to detect out-of-data servers:
- senderTerm < currentTerm: reject the message with an indication that the sender is out of date; the rejection also includes currentTerm. When the sender receives the rejection it updates its currentTerm to match the one in the response, then passivates. This makes sense for leaders because it means their term of leadership is over. This also make sense for candidates because it means there is already a new election cycle with other active candidates; there is no need for the sender to participate.
- senderTerm == currentTerm: process the request normally.
- senderTerm > currentTerm: set currentTerm to senderTerm. If the recipient is currently a leader or candidate, then it passivates. Finally, it processes the request.
When a server switches from passive to candidate, it increments currentTerm to force a new election cycle.

State for each server

currentTerm: number of the most recent term seen by this server.
Vote: the server id and log length of the candidate who has received this server's vote for currentTerm (if any).
Server id of this server.
Log entries that have been accepted by the server.
Id of the most recent log entry that has been fully replicated.
Time of receipt of the last request from the leader.
Cluster map: id and location of each server in the cluster, whether dead or alive.

...

Id: integer that serializes this entry within log: 1 for first entry, 2 for next, etc.
Client id: identifies the client that created this entry.
Client serial: serial # of the client request that created this entry (used to handle duplicate client requests).
Term number in which this entry was created (used to detect stale entries from old terms that should be discarded)

Important parameters

Timeout interval: if a passive server receives no communication from a leader or candidate within this time period, then the server will convert to candidacy and initiate an election. This parameter should be an order of magnitude larger than the normal time it takes for one server to contact all of the other servers in the cluster.

Version	Old Version 15	New Version 16
Changes made by	John Ousterhout	John Ousterhout
Saved on	Apr 17, 2012	Apr 17, 2012

Versions Compared

Key

Managing terms

Managing terms

State for each server

Important parameters