Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Create a log that is replicated across a cluster of servers. Each log entry will be identical on each of the servers.
  • The log is sequentially ordered: each entry has a unique integer id, and ids are assigned in ascending order.
  • If a server has accepted a given log entry, then it also has accepted all log entries with smaller ids than the given one.
  • If a majority of the servers have accepted a particular log entry, then that log entry is called accepted: it will be durable: it will eventually be accepted by all servers, and it will not be lost unless a majority of the servers suffer simultaneous catastrophic failures that lose all of their persistent data.
  • One of the servers in the cluster is designated the leader; all client requests that modify the log must be processed by the leader.
  • Clients can read log entries from any server, though there will usually be a time delay between when data can be read from the leader and when it can be read from other servers.
  • Leadership can move among the servers in the cluster in response to failures.
  • There is at most one leader in the cluster at a time. Once a new leader has been elected, it will not be possible for the previous leader to make noticeable updates to the log until it has been reelected.

...

The defer mechanism ensures that the protocol will converge rapidly even if there are initially many candidates. Once a candidate has deferred, it will not become a candidate again until the timeout period has elapsed, and the timeout period will be large enough to get through several election cycles (furthermore, the timeout period is also reset whenever a passive machine receives a message from a candidate, which further reduces the likelihood that a candidate will reenter an election once it has deferred). Thus it is unlikely to take more than two election cycles to select a new leader.

Log management

This section describes how the log is managed during a particular term, and how log consistency is preserved  when leadership changes.

Clients append to the log by making a request through the leader. The leader adds the new entry to its log, then sends a request containing that log entry to each of the other servers. Each server appends the entry to its log and also writes the data to durable secondary storage; once this is done, the server is said to have accepted the log entry. Once a majority of the cluster has accepted the new log entry, the leader can respond to the client.  At this point the durability of the new entry is guaranteed; the only event that could cause permanent data loss is simultaneous catastrophic failures of more than half the servers in the cluster, causing them to lose their secondary storage.

If a passive server crashes then it will not be able to accept new data from the leader. The leader need not wait for the crashed server to restart before responding to client requests: as long as a majority of the cluster is responsive, the cluster can continue operation. When a server restarts after a crash, it enters passive mode (it does not attempt to contact the leader). If the leader does not receive an acceptance from a server when it sends a new log entry, it continues trying at regular intervals; eventually the server will restart, at which point the leader will "catch it up" on the log entries it has not yet received. This mechanism guarantees that all servers will eventually mirror all log entries, in the absence of leader figures.

Leader failures are more interesting. At the time of a leader failure, there may be one or more log entries that have been partially accepted by the cluster (i.e., the leader has not yet responded to the requesting client). There may also be any number of log entries that have been accepted by the cluster, but are not yet fully replicated on all servers. For the partially accepted entries, the new leader must guarantee that these entries are either fully replicated and accepted, or completely expunged from all logs. The entry that is expunged must never be returned to a client in a read operation: the system must behave as if the client never made its original request. For entries that have been accepted but not fully replicated, the new leader must guarantee that these entries are eventually fully replicated.

 

Linearizability for clients

State for each server

  • Term number: number of the most recent term seen by this server (updated when notified by a leader or a candidate, or when the server becomes a candidate).
  • Vote: the server id of the candidate who has received this server's vote for the current term (if any).
  • Server id of this server.
  • Log entries that has been accepted by the server.
  • Id of the most recent log entry that has been accepted by all servers.
  • Time of receipt of the last request from the leader.
  • Cluster map: id and location of each server in the cluster, whether dead or alive.

Additional state kept by leader

...