Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Create a log that is replicated across a cluster of servers. Each log entry will be identical on each of the servers.
  • The log is sequentially ordered: each entry has a unique integer id, and ids are assigned in ascending order.
  • Clients will make request requests to append entries to the law, and to read log entries in order.
  • If a server has accepted a given log entry, then it also has accepted all log entries with smaller ids than the given one.
  • If a majority of the servers have accepted a particular log entry, then that log entry is called guaranteed: it will eventually be accepted by all servers, and it will not be lost unless a majority of the servers suffer simultaneous catastrophic failures that lose all of their persistent data.
  • One of the servers in the cluster is designated the leader; all client requests that modify the log must be processed by the leader.
  • Clients can read log entries from any server, though there will usually be a time delay between when data can be read from the leader and when it can be read from other servers.
  • Leadership can move among the servers in the cluster in response to failures.
  • There is at most one leader in the cluster at a time. Once a new leader has been elected, it will not be possible for the previous leader to make updates to the log until it has been reelected.

Leader election

The first part of the ALPO protocol manages the election of leaders so that (a) there is at most one leader at a time and (b) if the current leader crashes, a new leader will be elected to take its place. Note: the description below is slightly incomplete, since it does not describe the interaction between leader election and log management. The protocol will be extended slightly is extended later in this document once log management has been introduced.

  • Each ALPO server is in one of three states: passive, leader, orcandidateor candidate. Most servers at any given time are passive: they respond to requests from the leader but take no actions on their own (a passive server never issues an RPC request). A candidate is a server that is attempting to become leader. Passive servers become candidates during elections as described below.
  • In normal operation one of the servers in the cluster is the leader and all the other servers are passive. The leader must contact each of the passive servers at regular intervals, either by passing them new log entries or with a no-op heartbeat request. Each passive server keep track of the last time it received a message from the leader (or from candidates during an election); if a long period elapses with no such message (the timeout interval) then the server assumes that the leader has crashed and no one else is attempting to replace it; it converts itself from passive to candidate and begins an election cycle.
  • Time is divided up into terms, where terms have integer ids that increase monotonically. During each term there will be at most one leader in the cluster. The term starts with a single leader election and may be followed by a reign for the election winner. If the election produced a split vote then it is possible that there will be no winner, and hence no reign during this term. Each server stores the id for the current term; these ids may differ slightly from server to serverdifferent servers may have different ideas of the current term, but the ids will converge over time, and no leader can be elected unless a majority of the servers have (at some time) reached a particular term.
  • When a server becomes a candidate, it increments its current term, marks its own vote to indicate that it is voting for itself, and then contacts each of the other servers in the cluster to request their votevotes. The other servers increment update their current term if needed to match the new value, and respond to the request in one of the following ways:
    • "You have my vote": this means that the server has not given its vote to any other candidate in this term. In returning this response, the responder promises not to give its vote to any other candidate for the current term, and it will not become a candidate in this term.
    • If the server has already given its vote to another candidate, then it returns a rejection that includes the id of the candidate that received its vote.
    • If the server is down it may not respond at all.
  • The candidate continues in this phase (retrying with nonresponsive servers) until one of the following things happens:
    • It receives votes from the majority of the servers in the cluster. At this point it becomes the leader and begins regular communication with the other servers in the cluster.
    • It receives a message from another server claiming to be leader for this term. In this case the candidate accepts the new leader and returns to passive state.
    • It receives one or more vote rejections. For each vote rejection the candidate compares its own rank with the rank of the candidate that received the vote; for now, rank is determined purely by server id (this will be extended slightly below). If the candidate outranks the vote receiver, then it issues a defer request to the vote receiver. If the vote receiver already has enough votes to become leader, then it responds with that indication, in which case the requesting candidate will accept the new leader and return to passive state. If the vote receiver has not yet become leader, then it defers to the (higher-ranked) requesting candidate by returning itself to passive state; this will make it possible for the higher-ranked candidate to win a future election. If the candidate's rank is less than that of the vote receiver then the candidate passivates.
    • The sum of votes received plus votes owned by lower-ranked candidates who have deferred represents the majority of the cluster. In this case the candidate increments the term and starts a new election cycle (at this point no-one can win the current election cycle).  It The candidate is likely to win during this cycle, because the competing candidates have all returned to passive state and will not become candidates again until the timeout period elapses.
    • It receives a defer message request from some other candidate with higher rank. In this case the candidate returns to passive state.

This protocol is safe because the server will never declare itself leader for a term unless it has received votes from a majority of the servers. Thus it is impossible for more than one server to be elected in a given term.  Furthermore, once a new leader has been elected, the old leader will be unable to created guaranteed log entries (due to the term management protocol described below).

The defer mechanism ensures that the protocol will converge rapidly even if there are initially many candidates. Once a candidate has deferred, it will not become a candidate again until the timeout period has elapsed, and the timeout period will be large enough to get through several election cycles. In addition, the timeout period is also reset whenever a passive machine receives a message from a candidate, which further reduces the likelihood that a candidate will reenter an election once it has deferred. Thus it is unlikely to take more than two election cycles to select a new leader.

...