
ALPO stands for "ALPO is Like Paxos, except (hopefully) more Obvious". It is an experiment to try defining a protocol for managing a replicated log state machine using distributed consensus, in a way that is easier to understand and more complete than Paxos.

...

  • Create a log that is replicated across a cluster of servers. Each log entry is identical on each of the servers.
  • The replicated log can serve as the basis for a replicated state machine: log entries correspond to commands (and their arguments) to be executed by the state machines; if each state machine executes the same commands in the same order, it will reach the same final state.
  • The log is sequentially ordered: each entry has a unique integer id, and ids are assigned in ascending order.
  • Clients make requests to append entries to the log, and to read log entries in order.
  • If a server has accepted a given log entry, then it also has accepted all log entries with smaller ids than the given one.
  • If an append operation returns successfully, it means that the entry was appended with the next available id; the new entry will be durable as long as secondary storage is intact on a majority of the servers, and available as long as a majority of the servers are up.
  • Once a majority of the servers have accepted a particular log entry, that log entry is called committed: it will eventually be accepted by all servers, and it will not be lost unless a majority of the servers suffer simultaneous catastrophic failures that lose all of their persistent data.
  • At any given time one of the servers in the cluster is designated the leader; all client requests that modify the log must be processed by the leader.
  • Clients can read log entries from any server, though there will be a time delay between when data can be read from the leader and when it can be read from other servers.
  • Leadership can move among the servers in the cluster in response to failures.
  • At any given time there is at most one leader capable of committing new log entries. Once a new leader has been elected, it will not be possible for the previous leader to commit updates to the log.
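The "committed" property in the list above can be made concrete with a small sketch (the names `LogEntry`, `is_committed`, and `majority` are illustrative, not part of ALPO's specification). Because acceptance is prefix-closed, tracking only each server's highest accepted id is enough to decide whether a given entry is committed.

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    id: int          # unique, ascending integer id
    term: int        # term in which the entry was created
    command: bytes   # state-machine command to execute

def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1

def is_committed(entry_id: int, accepted_up_to: list[int], cluster_size: int) -> bool:
    """An entry is committed once a majority of servers have accepted it.

    accepted_up_to[i] is the highest log id server i has accepted; because
    acceptance is prefix-closed, accepting id N implies accepting all ids <= N.
    """
    acceptors = sum(1 for last_id in accepted_up_to if last_id >= entry_id)
    return acceptors >= majority(cluster_size)
```

For example, in a 5-server cluster where the servers have accepted up through ids [7, 7, 5, 3, 0], entry 5 is committed (three acceptors) but entry 7 is not.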

Leader election

The first part of the ALPO protocol manages the election of leaders so that (a) there is at most one leader that can commit new log entries at any given time and (b) if the current leader crashes, a new leader will be elected to take its place. Note: the description below is slightly incomplete, since it does not describe the interaction between leader election and log management. The protocol is extended later in this document once the issue of log consistency has been introduced.

  • Each ALPO server is in one of three states: follower, leader, or candidate. Most servers at any given time are followers: they respond to requests from the leader but take no actions on their own (a follower never issues an RPC request). A candidate is a server that is attempting to become leader. Followers become candidates during elections as described below.
  • In normal operation one of the servers in the cluster is the leader and all the other servers are followers. The leader must contact each of the followers at regular intervals by issuing either appendEntry or heartbeat requests. Each follower keeps track of the last time it received a message from the leader (or from candidates during an election). If a long period elapses with no such message (the timeout interval) then the server assumes that the leader has crashed and no one else is attempting to replace it; it converts itself from follower to candidate and begins an election cycle.
  • Time is divided up into terms, where terms have integer ids that increase monotonically. During each term there will be at most one leader in the cluster. The term starts with a single leader election and may be followed by a reign for the election winner. If the election produced a split vote then it is possible that there will be no winner, and hence no reign during this term. Each server stores the id for the current term; different servers may have different ideas of the current term, but the ids will converge over time, and no leader can be elected unless a majority of the servers have (at some time) reached a particular term. The current term on any given server increases monotonically.
  • Each server has a rank, which is used to resolve conflicts between candidates during elections. For now, rank is determined purely by server id (this will be extended slightly in the section on log consistency).
  • When a server becomes a candidate, it increments its current term, marks its own vote to indicate that it is voting for itself, and then contacts each of the other servers in the cluster to request their votes. The other servers update their current term if needed to match the new value, and respond to the request in one of the following ways:
    • "You have my vote": this means that the server has not given its vote to any other candidate in this term. In returning this response, the responder promises not to give its vote to any other candidate for the current term, and it will not become a candidate in this term.
    • If the server has already given its vote to another candidate, then it returns a rejection that includes the id of the candidate that received its vote as well as some additional information described below.
    • If the server is down it may not respond at all.
  • The candidate continues requesting votes (retrying with non-responsive servers) until one of the following things happens:
    • It receives votes from the majority of the servers in the cluster. At this point it becomes the leader and sends a heartbeat message to each of the other servers in the cluster to announce its leadership.
    • It receives a message from another server claiming to be leader for this term. In this case the candidate accepts the new leader and passivates (returns to follower state).
    • It discovers the existence of another candidate with higher rank, in which case the lower-ranked candidate defers to the higher-ranked one by returning to follower state. Higher-ranking candidates can be discovered in two ways:
      • A candidate receives a request for its vote and discovers that the requester has higher rank.
      • A candidate requests the vote of another server and receives a response indicating that its vote has already been given to a different candidate; if the vote-receiving candidate has higher rank, then the lower-ranked candidate returns to follower state.
    • The sum of votes received plus votes owned by lower-ranked candidates known to have passivated represents the majority of the cluster (a candidate knows that a lower-ranked candidate has passivated if either of the servers has requested the other's vote). At this point no candidate can win the current election cycle. The candidate increments its term and starts a new election cycle. The candidate is likely to win during the new cycle, because the competing candidates have returned to follower state and will not become candidates again until the timeout period elapses.
    • The timeout period elapses with no communication with other servers (this can happen if there is a split vote and some of the candidates receiving votes subsequently crash, so that other candidates cannot tell whether they have passivated). In this case the candidate increments its term and starts a new election cycle.
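The possible outcomes listed above can be folded into a single decision rule for the candidate. The sketch below is illustrative (the function and parameter names are assumptions, not ALPO's specification); it decides the candidate's next action given what it has learned so far in this election cycle.

```python
def candidate_outcome(cluster_size: int,
                      votes_for_me: int,
                      votes_held_by_passivated: int,
                      saw_leader_for_term: bool,
                      saw_higher_ranked_candidate: bool) -> str:
    """Decide the candidate's next action in the current election cycle."""
    quorum = cluster_size // 2 + 1
    if saw_leader_for_term or saw_higher_ranked_candidate:
        return "passivate"        # defer: return to follower state
    if votes_for_me >= quorum:
        return "become_leader"    # majority of votes: announce via heartbeats
    if votes_for_me + votes_held_by_passivated >= quorum:
        return "new_cycle"        # no candidate can win: increment term, retry
    return "keep_requesting"      # keep retrying non-responsive servers
```

For instance, in a 5-server cluster a candidate holding 2 votes that learns a passivated rival held 1 more vote knows that 3 of 5 votes are spoken for, so no one can win this cycle and it starts a new one.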

This protocol is safe because a candidate will never declare itself leader for a term unless it has received votes from a majority of the servers. Thus it is impossible for more than one server to be elected in a given term. Furthermore, once a new leader has been elected, the old leader will be unable to create committed log entries (due to the term management protocol described below).

The deferral mechanism ensures that the protocol will converge rapidly even if there are initially many candidates. Once a candidate has deferred, it will not become a candidate again until the timeout period has elapsed, and the timeout period is large enough to get through several election cycles. In addition, the timeout period is restarted whenever a follower receives a message from a candidate, which further reduces the likelihood that a candidate will reenter an election once it has deferred. Thus it is unlikely to take more than two election cycles to select a new leader.

Log management

This section describes how the log is managed during a particular term, and how log consistency is preserved when leadership changes.

Replicating new entries. Clients append to the log by making requests through the leader. The leader adds the new entry to its log, then sends an appendEntry request containing that log entry to each of the other servers. Each server appends the entry to its log and also writes the data to secondary storage; once this is done, the server is said to have accepted the log entry. Once a majority of the cluster has accepted the new log entry, the leader can respond to the client. At this point the entry is called committed because its durability is assured; the only event that could cause it to be lost is simultaneous catastrophic failures of more than half the servers in the cluster (i.e. they lose their secondary storage permanently).
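One common way to realize "a majority of the cluster has accepted the new entry" on the leader is to track, per server, the highest log id known to be accepted there and take the id at the majority position as the highest committable id. This is a sketch under that assumption (the function name is illustrative):

```python
def highest_committed_id(accepted_up_to: list[int]) -> int:
    """Highest log id accepted by a majority of the cluster.

    accepted_up_to holds one entry per server (including the leader's own
    last log id). Sorting in descending order, the id at position
    quorum-1 has been accepted by at least a majority of servers, and by
    the prefix-acceptance property so has everything before it.
    """
    quorum = len(accepted_up_to) // 2 + 1
    return sorted(accepted_up_to, reverse=True)[quorum - 1]
```

For example, with five servers whose accepted ids are [9, 9, 7, 3, 0], entries through id 7 are committed; the leader can respond to clients for those entries even though two servers lag behind.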

Follower crashes. If a follower crashes then it will not be able to accept new data from the leader. The leader need not wait for the crashed server to restart before responding to client requests: as long as a majority of the cluster is responsive, the cluster can continue operation. When a server restarts after a crash, it enters follower mode (it does not attempt to contact the leader). If the leader does not receive an acceptance from a server when it sends a new log entry, it continues trying at regular intervals; eventually the server will restart, at which point the leader will "catch it up" with the log entries it has not yet accepted. This mechanism ensures that all servers will eventually mirror all log entries.

Leader crashes. Leader failures are more interesting. At the time of a leader failure, there may be one or more log entries that have been accepted by a few nodes but not yet committed (i.e., the leader has not yet responded to the requesting client). There may also be any number of log entries that have been committed by the cluster, but are not yet fully replicated on all servers. For the partially accepted entries, the new leader must guarantee that these entries are either fully replicated, or completely expunged from all logs. Any entry that is expunged must never be returned to a client in a read operation: the system must behave as if the client never made its original request. For entries that have been committed but not fully replicated, the new leader must make sure that these entries are eventually fully replicated.

ALPO uses a two-step approach to handling leader crashes.  First, it makes sure that the new leader's log is complete (it includes all of the committed entries).  It does this by choosing the new leader from among those servers whose log is complete. ALPO extends the notion of rank during leader election to include log length. When candidates request votes they include in the request the log id of the most recent entry they have accepted. One candidate automatically outranks another if its "last accepted id" is higher than that of the other candidate. If they both have the same "last accepted id", then rank is determined by server id. Furthermore, a server will automatically reject a request for its vote if its "last accepted id" is higher than that of the requesting candidate, even if its vote is still available; when a candidate receives this form of rejection it drops out of the election and returns to follower state. These rules guarantee that the new leader's log is complete: the new leader must have contacted a majority of servers in the cluster, and its log must be at least as complete as any other log in that majority, so all committed entries must be in the leader's log.
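The extended rank rule amounts to a lexicographic comparison: "last accepted id" first, then server id as the tie-breaker. A minimal sketch, assuming higher server id wins ties (the document does not specify the direction) and with illustrative helper names:

```python
def outranks(a_last_id: int, a_server_id: int,
             b_last_id: int, b_server_id: int) -> bool:
    """True if candidate a outranks candidate b: longer log wins,
    with server id (assumed higher-wins) breaking ties."""
    return (a_last_id, a_server_id) > (b_last_id, b_server_id)

def grant_vote(voter_last_id: int, voted_this_term: bool,
               candidate_last_id: int) -> bool:
    """A voter rejects any candidate whose log is less complete than
    its own, even if its vote for this term is still available."""
    if voter_last_id > candidate_last_id:
        return False
    return not voted_this_term
```

Note that `grant_vote` refuses on log length before even considering whether the vote has been given away; the rejected candidate drops out and returns to follower state.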

The second step in handling leader crashes is ensuring log consistency of the followers. When a new leader takes over, its log is up-to-date, but logs on other servers may be inconsistent in either or both of two ways:

  • A log may be missing one or more of the most recent entries (if there are missing entries, they will be the most recent entries in the log). If a follower has been down for a long time, its log could be missing several entire terms worth of data.
  • The most recent entries in the log may be extraneous ones that are not present on the leader because they were not committed at the time of the leader changeover. These entries could, in some cases, span several terms. These entries must be deleted.

In order to guarantee consistency of the log on all nodes, a follower only accepts a new log entry from the leader if it can be sure that all of its entries preceding the new one match exactly the corresponding entries in the leader's log. If accepting the new entry would violate this rule, then the follower rejects the appendEntry request and returns a log id to the leader; if the leader sends the entry indicated by the follower, followed by all entries after that one, the follower's log will eventually become identical to the leader's log. It's possible that in some cases the follower may reject several appendEntry requests in a row, but eventually this process will identify the log prefix where the leader and follower are identical, and the leader will then send the entries after this prefix.

In order to implement this rule, each log entry stores the term in which it was accepted by the leader. Each appendEntry request includes the term of the previous entry in its log (in addition to the term of the new entry and its log id). When a follower receives an appendEntry request, it first removes from its log any entries with ids greater than or equal to the id of the new entry from the leader, since these are clearly extraneous. Next, the follower makes the following checks:

  • Does the log id of the new entry follow immediately after the current end of the follower's log? If not, the follower is missing some entries; it rejects the appendEntry request and returns the id of the next entry after its current last entry.
  • Does the term of its last log entry match the "previous term" in the appendEntry request? If not, the follower's last log entry is extraneous; the follower deletes this entry, rejects the appendEntry request, and returns the id of the entry it just deleted. It's possible that the new last entry is also extraneous; this will be discovered when the leader makes its next appendEntry request. Eventually all extraneous entries will be deleted.
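The truncation step and the two checks above can be sketched as one follower-side handler (a sketch only; the names and the `dict` log representation, mapping id to term, are assumptions):

```python
def handle_append_entry(log: dict[int, int],
                        new_id: int, new_term: int, prev_term: int):
    """Follower-side appendEntry check.

    log maps entry id -> term for this follower's entries (ids 1..N).
    Returns ("accept", None) or ("reject", id_the_leader_should_send).
    """
    # Entries at or past the new id are clearly extraneous: remove them.
    for extraneous in [i for i in log if i >= new_id]:
        del log[extraneous]
    last_id = max(log, default=0)
    # Check 1: the new entry must follow immediately after our last entry.
    if new_id != last_id + 1:
        return ("reject", last_id + 1)   # we are missing entries
    # Check 2: our last entry's term must match the leader's "previous term".
    if last_id > 0 and log[last_id] != prev_term:
        del log[last_id]                 # extraneous entry: delete it
        return ("reject", last_id)       # ask the leader to resend from here
    log[new_id] = new_term
    return ("accept", None)
```

On a rejection the leader resends starting at the returned id, so repeated rounds walk backward to the matching prefix and then forward again, exactly as described above.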

Zombie leaders. A zombie leader is a leader that has been replaced but does not yet know it. ALPO must make sure that zombie leaders cannot modify the log. To do this, each request issued by the leader includes the leader's term number. If a follower receives a request whose term is lower than the server's current term, then it rejects the request. If a leader receives such a rejection then it knows it has been deposed, so it returns to follower state. Before a server gives its vote to a candidate it must increase its term number to that of the new term. This guarantees that by the time the new leader knows that it has been elected, it is impossible for the previous leader to communicate with a majority of the cluster, so it cannot create committed log entries.

Log cleaning. One final issue related to log management is log cleaning. ALPO allows each server to perform cleaning (or any other form of log truncation) on its log independently of the other servers. However, there is one restriction on log cleaning: a server must not delete a log entry until it has been fully replicated. Otherwise the server could become leader and need that entry to update a lagging follower. To ensure this property, the leader keeps track of the highest log id that has been fully replicated and includes this value in any requests that it makes to other servers. The other servers use this information to restrict cleaning; in most cases the fully-replicated-id will be at or near the head of the log, so this will not impose much of a restriction on cleaning.

Clients: exactly-once semantics

In ALPO, clients must send any requests that result in log modifications to the leader. If such a request arrives at a follower (for example, because it used to be leader but has been deposed) the follower rejects the request; in most cases it will be able to tell the client who is currently the leader. Clients can send read requests to any server in the cluster. However, followers may not be able to return the most recent log entries, for two reasons. First, the server might not have accepted the most recent log entries yet. Second, only committed log entries can be returned to clients, and a follower may not know whether its most recent log entries are committed. One way to handle this is for the leader to include the highest committed id in each request to other servers; in most cases, the append for entry N would indicate that entry N-1 is now committed, so the follower would lag at most one log entry in comparison to the leader. Thus, if clients can tolerate a small amount of lag they can issue log reads to any server; if they want to be assured of getting all the most recent data, then they must send requests to the leader.

ALPO can provide exactly-once semantics to clients, meaning that if the leader fails while processing a log append request from a client, the client library package can automatically retry the request once a new leader has been elected, and the new entry is guaranteed to be appended to the log exactly once (regardless of whether the original request completed before the leader crashed). In order to implement exactly-once semantics, clients must provide a unique serial number in each request; this serial number, along with the client identifier, must be included in every log entry so that it is seen by every server. Using this information, each server can keep track of the most recent serial number that has been successfully completed for each client. When a client retries a request because of leader failure, the new leader can use this information to skip the request if it has already been successfully completed.
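The per-client deduplication described above might look like the following sketch (class and method names are assumptions; the document only specifies the bookkeeping, not an interface). Each server remembers the highest serial completed per client; a retried request whose serial is not new is skipped rather than re-applied.

```python
class DedupFilter:
    """Tracks, per client, the most recent serial number completed."""

    def __init__(self) -> None:
        self.latest_serial: dict[str, int] = {}  # client id -> last completed serial

    def should_apply(self, client_id: str, serial: int) -> bool:
        """True if this (client, serial) has not already been completed,
        so the request must be executed rather than skipped."""
        return serial > self.latest_serial.get(client_id, -1)

    def mark_completed(self, client_id: str, serial: int) -> None:
        """Record that the request with this serial has been executed."""
        self.latest_serial[client_id] = max(
            serial, self.latest_serial.get(client_id, -1))
```

Because the (client id, serial) pair travels inside every log entry, every server rebuilds the same table as it applies the log, so the filter survives leader changes.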

Managing terms

This section contains more detailed information on managing terms. Terms are used to distinguish votes from different election cycles, and also to detect out-of-date servers and extraneous log entries.  In general, if a candidate or leader finds itself out of date it immediately passivates, under the assumption that another server is leader or will become leader soon.

  • Each server stores a term number called currentTerm. This indicates the most recent term that has been seen by this server.
  • Every log entry contains the term in which the entry was created. When a server starts up, it sets currentTerm to the later of (a) the term from the last entry in its log and (b) the term in its most recent vote; if neither of these values is present, currentTerm is initialized to 0.
  • Every message from server to server contains the term of the sender. This value (call it senderTerm) is used to update currentTerm and to detect out-of-date servers:
    • senderTerm < currentTerm: reject the message with an indication that the sender is out of date; the rejection also includes currentTerm. When the sender receives the rejection it updates its currentTerm to match the one in the response, then passivates. This makes sense for leaders because it means their term of leadership is over. This also makes sense for candidates because it means there is already a new election cycle with other active candidates; there is no need for the sender to participate.
    • senderTerm == currentTerm: process the request normally.
    • senderTerm > currentTerm: set currentTerm to senderTerm. If the recipient is currently a leader or candidate, then it passivates. Finally, it processes the request.
  • When a server switches from follower to candidate, it increments currentTerm to force a new election cycle.
  • As long as a majority of the servers are up, the system is live, which means that new entries can be appended and existing entries can be read.
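The three senderTerm cases above can be combined into a single handler (a sketch; the function signature and the string-valued states are illustrative):

```python
def on_message_term(current_term: int, state: str, sender_term: int):
    """Handle the term carried by an incoming message.

    state is "follower", "candidate", or "leader". Returns
    (new_current_term, new_state, action) where action is
    "reject_out_of_date" or "process".
    """
    if sender_term < current_term:
        # Sender is out of date. The rejection carries currentTerm so
        # the sender can update its own term and passivate.
        return (current_term, state, "reject_out_of_date")
    if sender_term > current_term:
        # Recipient is out of date: adopt the term; a leader or
        # candidate passivates before processing the request.
        return (sender_term, "follower", "process")
    return (current_term, state, "process")
```

Note that a deposed leader learns of its deposition through the "reject_out_of_date" path on the other side: its own request comes back rejected with a higher term, at which point it passivates.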

Properties

In implementing the above goals, ALPO restricts the behavior of the system to have the following properties:

  • If a server has accepted a given log entry, then it also has accepted all log entries with smaller ids than the given one.
  • If a majority of the servers have accepted a particular log entry, then that log entry is called committed: it will eventually be accepted by all servers, and it will not be lost unless a majority of the servers suffer simultaneous catastrophic failures that lose all of their persistent data. If a given log entry has been committed, then all preceding log entries have also been committed.
  • At any given time one of the servers in the cluster is designated the leader; all client requests that modify the log must be processed by the leader.
  • The leader's log is always complete, meaning that it contains all committed entries.
  • Clients can read log entries from any server, though there will be a time delay between when data can be read from the leader and when it can be read from other servers.
  • Leadership can move among the servers in the cluster in response to failures.
  • At any given time there is at most one leader capable of committing new log entries (there may exist other servers that think they are still leader, but they will not be able to commit new log entries).
  • There are only three requests between servers in ALPO: appendEntry, heartbeat, and requestVote; these are described above.

Leader election

The first part of the ALPO protocol manages the election of leaders so that (a) there is at most one leader that can commit new log entries at any given time and (b) if the current leader crashes, a new leader will be elected to take its place. Note: the description below is slightly incomplete, since it does not describe the interaction between leader election and log management. The protocol is extended later in this document once the issue of log consistency has been introduced.

  • Each ALPO server is in one of three states: follower, leader, or candidate. Most servers at any given time are followers: they respond to requests from the leader but take no actions on their own (a follower never issues an RPC request). A candidate is a server that is attempting to become leader. Followers become candidates during elections as described below.
  • In normal operation one of the servers in the cluster is the leader and all the other servers are followers. The leader must contact each of the followers at regular intervals by issuing either appendEntry or heartbeat requests. Each follower keeps track of the last time it received a message from the leader (or from candidates during an election). If a long period elapses with no such message (the timeout interval) then the server assumes that the leader has crashed and no one else is attempting to replace it; it converts itself from follower to candidate and begins an election cycle.
  • Time is divided up into terms, where terms have integer ids that increase monotonically. During each term there will be at most one leader in the cluster. The term starts with a single leader election and may be followed by a reign for the election winner. If the election produced a split vote then it is possible that there will be no winner, and hence no reign during this term. Each server stores the id for the current term; different servers may have different ideas of the current term, but the ids will converge over time, and no leader can be elected unless a majority of the servers have (at some time) reached a particular term. The current term on any given server increases monotonically.
  • Each server has a rank, which is used to resolve conflicts between candidates during elections. For now, rank is determined purely by server id (this will be extended slightly in the section on log consistency).
  • When a server becomes a candidate, it increments its current term, marks its own vote to indicate that it is voting for itself, and then contacts each of the other servers in the cluster to request their votes. The other servers update their current term if needed to match the new value, and respond to the request in one of the following ways:
    • "You have my vote": this means that the server has not given its vote to any other candidate in this term. In returning this response, the responder promises not to give its vote to any other candidate for the current term, and it will not become a candidate in this term.
    • If the server has already given its vote to another candidate, then it returns a rejection that includes the id of the candidate that received its vote as well as some additional information described below.
    • If the server is down it may not respond at all.
  • The candidate continues requesting votes (retrying with non-responsive servers) until one of the following things happens:
    • It receives votes from the majority of the servers in the cluster. At this point it becomes the leader and sends a heartbeat message to each of the other servers in the cluster to announce its leadership.
    • It receives a message from another server claiming to be leader for this term. In this case the candidate accepts the new leader and passivates (returns to follower state).
    • It discovers the existence of another candidate with higher rank, in which case the lower-ranked candidate defers to the higher-ranked one by returning to follower state. Higher-ranking candidates can be discovered in two ways:
      • A candidate receives a request for its vote and discovers that the requester has higher rank.
      • A candidate requests the vote of another server and receives a response indicating that its vote has already been given to a different candidate; if the vote-receiving candidate has higher rank, then the lower-ranked candidate returns to follower state.
    • The sum of votes received plus votes owned by lower-ranked candidates known to have passivated represents the majority of the cluster (a candidate knows that a lower ranked candidate has passivated if either of the servers has requested the other's vote). At this point no candidate can win the current election cycle. The candidate increments its term and starts a new election cycle.  The candidate is likely to win during the new cycle, because the competing candidates have returned to follower state and will not become candidates again until the timeout period elapses.
    • The timeout period elapses with no communication with other servers (this can happen if there is a split vote and some of the candidates receiving votes subsequently crash, so that other candidates cannot tell whether they have passivated). In this case the candidate increments its term and starts a new election cycle.

This protocol is safe because a candidate will never declare itself leader for a term unless it has received votes from a majority of the servers. Thus it is impossible for more than one server to be elected in a given term.  Furthermore, once a new leader has been elected, the old leader will be unable to create committed log entries (due to the term management protocol described below).

The deferral mechanism ensures that the protocol will converge rapidly even if there are initially many candidates. Once a candidate has deferred, it will not become a candidate again until the timeout period has elapsed, and the timeout period is large enough to get through several election cycles. In addition, the timeout period is restarted whenever a follower receives a message from a candidate, which further reduces the likelihood that a candidate will reenter an election once it has deferred. Thus it is unlikely to take more than two election cycles to select a new leader.

Log management

This section describes how the log is managed during a particular term, and how log consistency is preserved  when leadership changes.

Replicating new entries. Clients append to the log by making requests through the leader. The leader adds the new entry to its log, then sends an appendEntry request containing that log entry to each of the other servers. Each server appends the entry to its log and also writes the data to secondary storage; once this is done, the server is said to have accepted the log entry. Once a majority of the cluster has accepted the new log entry, the leader can respond to the client.  At this point the entry is called committed because its durability is assured; the only event that could cause it to be lost is simultaneous catastrophic failures of more than half the servers in the cluster (i.e. they lose their secondary storage permanently).

Follower crashes. If a follower crashes then it will not be able to accept new data from the leader. The leader need not wait for the crashed server to restart before responding to client requests: as long as a majority of the cluster is responsive, the cluster can continue operation. When a server restarts after a crash, it enters follower mode (it does not attempt to contact the leader). If the leader does not receive an acceptance from a server when it sends a new log entry, it continues trying at regular intervals; eventually the server will restart, at which point the leader will "catch it up" with the log entries it has not yet accepted. This mechanism ensures that all servers will eventually mirror all log entries.

Leader crashes. Leader failures are more interesting. At the time of a leader failure, there may be one or more log entries that have been accepted by a few nodes but not yet committed (i.e., the leader has not yet responded to the requesting client). There may also be any number of log entries that have been committed by the cluster, but are not yet fully replicated on all servers. For the partially accepted entries, the new leader must guarantee that these entries are either fully replicated, or completely expunged from all logs. Any entry that is expunged must never be returned to a client in a read operation: the system must behave as if the client never made its original request. For entries that have been committed but not fully replicated, the new leader must make sure that these entries are eventually fully replicated.

ALPO uses a two-step approach to handling leader crashes.  First, it makes sure that the new leader's log is complete (it includes all of the committed entries).  It does this by choosing the new leader from among those servers whose log is complete. ALPO extends the notion of rank during leader election to include log length. When candidates request votes they include in the request the log id of the most recent entry they have accepted. One candidate automatically outranks another if its "last accepted id" is higher than that of the other candidate. If they both have the same "last accepted id", then rank is determined by server id. Furthermore, a server will automatically reject a request for its vote if its "last accepted id" is higher than that of the requesting candidate, even if its vote is still available; when a candidate receives this form of rejection it drops out of the election and returns to follower state. These rules guarantee that the new leader's log is complete: the new leader must have contacted a majority of servers in the cluster, and its log must be at least as complete as any other log in that majority, so all committed entries must be in the leader's log.

The second step in handling leader crashes is ensuring log consistency of the followers. When a new leader takes over, its log is up-to-date, but logs on other servers may be inconsistent in either or both of two ways:

  • A log may be missing one or more of the most recent entries (if there are missing entries, they will be the most recent entries in the log). If a follower has been down for a long time, its log could be missing several entire terms worth of data.
  • The most recent entries in the log may be extraneous ones that are not present on the leader because they were not committed at the time of the leader changeover. These entries could, in some cases, span several terms. These entries must be deleted.

In order to guarantee consistency of the log on all nodes, a follower only accepts a new log entry from the leader if it can be sure that all of its entries preceding the new one match exactly the corresponding entries in the leader. If accepting the new entry would violate this rule, then the follower rejects the appendEntry request and returns a log id to the leader; if the leader sends the entry indicated by the follower, followed by all entries after that one, the follower's log will eventually become identical to the leader's log. It's possible that in some cases the follower may reject several appendEntry requests in a row, but eventually this process will identify the log prefix where the leader and follower are identical, and the leader will then send the entries after this prefix.

In order to implement this rule, each log entry stores the term in which it was accepted by the leader. Each appendEntry request includes the term of the previous entry in its log (in addition to the term of the new entry and its log id). When a follower receives an appendEntry request, it first removes from its log any entries with ids greater than or equal to the id of the new entry from the leader, since these are clearly extraneous. Next, the follower makes the following checks:

  • Does the log id of the new entry follow immediately after the current end of the follower's log? If not, the follower is missing some entries; it rejects the appendEntry request and returns the id of the next entry after its current last entry.
  • Does the term of its last log entry match the "previous term" in the appendEntry request? If not, the follower's last log entry is extraneous; the follower deletes this entry, rejects the appendEntry request, and returns the id of the entry it just deleted. It's possible that the new last entry is also extraneous; this will be discovered when the leader makes its next appendEntry request. Eventually all extraneous entries will be deleted.
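The checks above can be sketched as a single follower-side handler. This is a simplified model, not the actual protocol code: the log is a list of (log_id, term, entry) tuples and the names are invented for illustration.

```python
def handle_append_entry(log, new_id, new_term, prev_term, entry):
    """Follower-side appendEntry handler; log is a list of
    (log_id, term, entry) tuples in ascending id order."""
    # Remove any entries with ids >= the new entry's id; these are
    # clearly extraneous.
    while log and log[-1][0] >= new_id:
        log.pop()
    last_id = log[-1][0] if log else 0
    last_term = log[-1][1] if log else 0
    # Check 1: does the new entry follow immediately after our log?
    if new_id != last_id + 1:
        # Missing entries: tell the leader where to resume.
        return ("reject_missing", last_id + 1)
    # Check 2: does our last entry's term match the leader's prevTerm?
    if prev_term != last_term:
        # Our last entry is extraneous: delete it and ask again.
        if log:
            log.pop()
        return ("reject_mismatch", last_id)
    log.append((new_id, new_term, entry))
    return ("accepted", new_id)
```

Each rejection either backs the leader up by one entry or identifies the gap to fill, so repeated requests converge on the shared prefix and then extend it.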

Managing terms

This section contains more detailed information on managing terms. Terms are used to distinguish votes from different election cycles, and also to detect out-of-date servers and extraneous log entries.  In general, if a candidate or leader finds itself out of date it immediately passivates, under the assumption that another server is leader or will become leader soon.

  • Each server stores a term number called currentTerm. This indicates the most recent term that has been seen by this server.
  • Every log entry contains the term in which the entry was originally received by a leader.
  • When a server starts up, it sets currentTerm to the later of (a) the term from the last entry in its log and (b) the term in its most recent vote; if neither of these values is present, currentTerm is initialized to 0.
  • Every message from server to server contains the term of the sender. This value (call it senderTerm) is used to update currentTerm and to detect out-of-date servers:
    • senderTerm < currentTerm: reject the message with an indication that the sender is out of date; the rejection also includes currentTerm. When the sender receives the rejection it updates its currentTerm to match the one in the response, then passivates. This makes sense for leaders because it means their term of leadership is over. It also makes sense for candidates because it means there is already a new election cycle with other active candidates; there is no need for the sender to participate.
    • senderTerm == currentTerm: process the request normally.
    • senderTerm > currentTerm: set currentTerm to senderTerm. If the recipient is currently a leader or candidate, then it passivates. Finally, it processes the request.
  • When a server switches from follower to candidate, it increments currentTerm to force a new election cycle.
  • Terms are also used when comparing log lengths during elections: if the term of a server's last log entry is greater than the term of another server's last log entry, then the first server's log is considered to be more complete; if the terms are identical, then the longer of the two logs is considered to be more complete.
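The three-way term comparison applied to every incoming message can be sketched as below; "passivate" here means return to passive follower state. The names (check_term, ServerState) are invented for the example.

```python
class ServerState:
    """Minimal state for the term-handling sketch."""
    def __init__(self, current_term=0, state="follower"):
        self.current_term = current_term
        self.state = state

def check_term(server, sender_term):
    # Sender is out of date: reject, and include our term so the
    # sender can catch up and passivate.
    if sender_term < server.current_term:
        return ("reject_stale", server.current_term)
    # Sender is ahead of us: adopt its term, and passivate if we
    # were a leader or candidate.
    if sender_term > server.current_term:
        server.current_term = sender_term
        if server.state in ("leader", "candidate"):
            server.state = "follower"
    # Equal terms (or after catching up): process the request normally.
    return ("ok", server.current_term)
```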

Clients: exactly-once semantics

In ALPO, clients must send any requests that result in log modifications to the leader. If such a request arrives at a follower (for example, because the follower used to be leader but has been deposed) the follower rejects the request; in most cases it will be able to tell the client who is currently the leader. Clients can send read requests to any server in the cluster. However, followers may not be able to return the most recent log entries, for two reasons. First, the server might not have accepted the most recent log entries yet. Second, only committed log entries can be returned to clients, and a follower may not know whether its most recent log entries are committed. One way to handle this is for the leader to include the highest committed id in each request to other servers; in most cases, the append for entry N would indicate that entry N-1 is now committed, so the follower would lag at most one log entry behind the leader. Thus, if clients can tolerate a small amount of lag they can issue log reads to any server; if they want to be assured of getting all the most recent data, then they must send requests to the leader.

ALPO can provide exactly-once semantics to clients, meaning that if the leader fails while processing a log append request from a client, the client library package can automatically retry the request once a new leader has been elected, and the new entry will be committed to the log exactly once (regardless of whether the original request completed before the leader crashed). In order to implement exactly-once semantics, clients must provide a unique serial number in each request; this serial number, along with the client identifier, must be included in every log entry so that it is seen by every server. Using this information, each server can keep track of the most recent serial number that has been successfully completed for each client. When a client retries a request because of leader failure, the new leader can use this information to skip the request if it has already been successfully completed.
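The per-client duplicate detection described above can be sketched as a small table keyed by client id. The names (apply_if_new, completed) are invented for the example, and real serial numbers would need to be assigned in increasing order by each client.

```python
def apply_if_new(completed, client_id, serial, apply_fn):
    """Execute apply_fn only if (client_id, serial) has not already
    been completed; completed maps client_id -> highest serial done."""
    if completed.get(client_id, 0) >= serial:
        # A retry of a request that already committed: skip it.
        return "duplicate"
    apply_fn()
    completed[client_id] = serial
    return "applied"
```

Because the (client id, serial) pair travels inside the log entry itself, every server rebuilds the same table as it applies the log, so the check survives leader changes.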

Log cleaning

One final issue related to log management is log cleaning. ALPO allows each server to perform cleaning (or any other form of log truncation) on its log independently of the other servers. However, there is one restriction on log cleaning: a server must not delete a log entry until it has been fully replicated. Otherwise the server could become leader and need that entry to update a lagging follower. To ensure this property, the leader keeps track of the highest log id that has been fully replicated and includes this value in any requests that it makes to other servers. The other servers use this information to restrict cleaning; in most cases the fully-replicated-id will be at or near the head of the log, so this will not impose much of a restriction on cleaning.
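The restriction can be sketched in two lines: the fully-replicated id is just the minimum of the servers' last accepted ids, and an entry is cleanable only at or below that point. The function names are invented for illustration.

```python
def fully_replicated_id(last_accepted_ids):
    # The highest id present on every server is the minimum of the
    # servers' last accepted ids.
    return min(last_accepted_ids)

def may_clean(entry_id, replicated_id):
    # A server may only clean entries that are fully replicated.
    return entry_id <= replicated_id
```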

Requests

There are only three requests sent between servers in ALPO:

appendEntry(entry, prevTerm): add entry to the local log, assuming that its id is one greater than that of the last entry currently in the log, and that the term of the last entry in the log matches prevTerm.

heartbeat(leaderId, term, lastLogId, lastLogTerm): used to inform followers that there is a leader and that the leader is alive and well. LeaderId is the id of the leader. The lastLogId and lastLogTerm arguments specify the id and term from the leader's last log entry; if these don't match the last log entry in the recipient, then it returns the id and term from its last log entry so the leader can replay missing entries.

requestVote(serverId, term, lastLogId, lastLogTerm): used by a candidate to request another server's vote for term. ServerId is the id of the requesting candidate, and lastLogId and lastLogTerm specify the id and term from the requestor's last log entry; if either of the recipient's corresponding values is larger, then the requestor's log is not complete enough for it to become the new leader, so the request is rejected.
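The log-completeness comparison used by requestVote (term of the last entry dominates; log id breaks ties) can be sketched as below. The function name is invented for the example.

```python
def log_at_least_as_complete(cand_last_term, cand_last_id,
                             my_last_term, my_last_id):
    # A log whose last entry has a higher term is more complete;
    # if the last terms are equal, the longer log is more complete.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_id >= my_last_id
```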

...