Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Each ALPO server is in one of three states: passive, leader, or candidate. Most servers at any given time are passive: they respond to requests from the leader but take no actions on their own (a passive server never issues an RPC request). A candidate is a server that is attempting to become leader. Passive servers become candidates during elections as described below.
  • In normal operation one of the servers in the cluster is the leader and all the other servers are passive. The leader must contact each of the passive servers at regular intervals, either by passing them new log entries or with a no-op heartbeat request. Each passive server keep track of the last time it received a message from the leader (or from candidates during an election); if a long period elapses with no such message (the timeout interval) then the server assumes that the leader has crashed and no one else is attempting to replace it; it converts itself from passive to candidate and begins an election cycle.
  • Time is divided up into terms, where terms have integer ids that increase monotonically. During each term there will be at most one leader in the cluster. The term starts with a single leader election and may be followed by a reign for the election winner. If the election produced a split vote then it is possible that there will be no winner, and hence no reign during this term. Each server stores the id for the current term; different servers may have different ideas of the current term, but the ids will converge over time, and no leader can be elected unless a majority of the servers have (at some time) reached a particular term.
  • Each server has a rank, which is used to resolve conflicts between candidates during elections. For now, rank is determined purely by server id (this will be extended slightly in the section on log management).
  • When a server becomes a candidate, it increments its current term, marks its own vote to indicate that it is voting for itself, and then contacts each of the other servers in the cluster to request their votes. The other servers update their current term if needed to match the new value, and respond to the request in one of the following ways:
    • "You have my vote": this means that the server has not given its vote to any other candidate in this term. In returning this response, the responder promises not to give its vote to any other candidate for the current term, and it will not become a candidate in this term.
    • If the server has already given its vote to another candidate, then it returns a rejection that includes the id of the candidate that received its vote.
    • If the server is down it may not respond at all.
  • The candidate continues in this phase requesting votes (retrying with nonresponsive servers) until one of the following things happens:
    • It receives votes from the majority of the servers in the cluster. At this point it becomes the leader and begins regular communication with sends a heartbeat message to each of the other servers in the cluster to announce its leadership.
    • It receives a message from another server claiming to be leader for this term. In this case the candidate accepts the new leader and returns to passive state.
    • It receives one or more vote rejections. For each vote rejection the candidate compares its own rank with the rank of the candidate that received the vote; for now, rank is determined purely by server id (this will be extended slightly below). If the candidate outranks the vote receiver, then it issues a defer request to the vote receiver. If the vote receiver already has enough votes to become leader, then it responds with that indication, in which case the requesting candidate will accept the new leader and return to passive state. If the vote receiver has not yet become leader, then it defers to the (higher-ranked) requesting candidate by returning itself to passive state; this will make it possible for the higher-ranked candidate to win a future election. If the candidate's rank is less than that of the vote receiver then the discovers the existence of another candidate with higher rank, in which case the lower-ranked candidate defers to the higher-ranked one by returning to passive state. Higher-ranking candidates can be discovered in two ways:
      • A candidate receives a request for its vote and discovers that the requester has higher rank.
      • A candidate requests the vote of another server and receives a response indicating that the vote has already been given to a different candidate; if the vote-receiving candidate has higher rank, then the lower-ranked candidate passivates.
    • The sum of votes received plus votes owned by lower-ranked candidates who have deferred represents the majority of the cluster. In this case the passivated represents the majority of the cluster (the candidate knows that a lower ranked candidate has passivated if has responded to a request for its vote). At this point no-one can win the current election cycle. The candidate increments the term and starts a new election cycle (at this point no-one can win the current election cycle).  The candidate is likely to win during this cycle, because the competing candidates have all returned to passive state and will not become candidates again until the timeout period elapses.
    • It receives a defer request from some other candidate with higher rank. In this case the candidate returns to passive state.

This protocol is safe because the server will never declare itself leader for a term unless it has received votes from a majority of the servers. Thus it is impossible for more than one server to be elected in a given term.  Furthermore, once a new leader has been elected, the old leader will be unable to created guaranteed log entries (due to the term management protocol described below).

The defer deferral mechanism ensures that the protocol will converge rapidly even if there are initially many candidates. Once a candidate has deferred, it will not become a candidate again until the timeout period has elapsed, and the timeout period will be large enough to get through several election cycles. In addition, the timeout period is reset restarted whenever a passive machine receives a message from a candidate, which further reduces the likelihood that a candidate will reenter an election once it has deferred. Thus it is unlikely to take more than two election cycles to select a new leader.

...