Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Each ALPO server is in one of three states: passive, leader, orcandidate. Most servers at any given time are passive: they respond to requests from the leader but take no actions on their own. A candidate is a server that is attempting to become leader. Passive servers become candidates during elections as described below.
  • In normal operation one of the servers in the cluster is the leader and all the other servers are passive. The leader must contact each of the passive servers at regular intervals, either by passing them new log entries or with a no-op heartbeat request. Each passive server keep track of the last time it received a message from the leader (or from candidates during an election); if a long period elapses with no such message (the timeout interval) then the server assumes that the leader has crashed and no one else is attempting to replace it; it converts itself from passive to candidate and begins an election cycle.
  • Time is divided up into terms, where terms have integer ids that increase monotonically. During each term there will be at most one leader in the cluster. The term starts with a single leader election and may be followed by a reign for the election winner. If the election produced a split vote then it is possible that there will be no winner, and hence no reign during this term. Each server stores the id for the current term; these ids may differ slightly from server to server.
  • When a server becomes a candidate, it increments its current term, marks its own vote to indicate that it is voting for itself, and then contacts each of the other servers in the cluster to request their vote. The other servers increment their current term if needed to match the new value, and respond to the request in one of the following ways:
    • "You have my vote": this means that the server has not given its vote to any other candidate in this term. In returning this response, the responder promises not to give its vote to any other candidate for the current term, and it will not become a candidate in this term.
    • If the server has already given its vote to another candidate, then it returns a rejection that includes the id of the candidate that received its vote.
    • If the server is down it may not respond at all.
  • The candidate continues in this phase (retrying with nonresponsive servers) until one of the following things happens:
    • It receives votes from the majority of the servers in the cluster. At this point it becomes the leader and begins regular communication with the other servers in the cluster.
    • It receives a message from another server claiming to be leader for this term. In this case the candidate accepts the new leader and returns to passive state.
    • It receives one or more vote rejections. For each vote rejection the candidate compares its own rank with the rank of the candidate that received the vote; for now, rank is determined purely by server id (this will be extended slightly below). If the candidate outranks the vote receiver, then it issues a defer request to the vote receiver. If the vote receiver already has enough votes to become leader, then it responds with that indication, in which case the requesting candidate will accept the new leader and return to passive state. If the vote receiver has not yet become leader, then it defers to the (higher-ranked) requesting candidate by returning itself to passive state; this will make it possible for the higher-ranked candidate to win a future election.
    • The sum of votes received plus votes owned by lower-ranked candidates who have deferred represents the majority of the cluster. In this case the candidate increments the term and starts a new election cycle (at this point no-one can win the current election cycle).  It is likely to win during this cycle, because the competing candidates have all returned to passive state and will not become candidates again until the timeout period elapses.
    • It receives a defer message from some other candidate with higher rank. In this case the candidate returns to passive state.

...

ALPO can provide exactly-once semantics to clients, meaning that if the leader fails while processing a log append request from a client, the client library package can automatically retry the request once a new leader has been elected, and the new entry will be guaranteed to be appended to the log exactly once (regardless of whether the original request completed before the leader crashed). In order to implement exactly-once semantics, clients must provide a unique serial number in each request; this serial number, along with the client identifier, must be included in every log entry so that it is seen by every server. Using this information, each server can keep track of the most recent serial number that has been successfully completed for each client. When a client retry the request because of leader failure, the new leader can use this information to skip the request if it has already been successfully completed.

State for each server

...

Managing terms

This section contains more detailed information on managing terms. Terms are used to distinguish different election cycles, and also to help servers detect when they are out of date with respect to the rest of the cluster.

  • Each server stores two term identifiers:
    • lastReign: the identifier of the highest-numbered term for which the server received a message from the leader.
    • currentTerm: identifier of the current term. This is often the same as lastReign, but may be greater than lastReign if an election is underway. CurrentTerm indicates the lowest-numbered term in which this server will accept a leader.
  • CurrentTerm and lastReign must be stored persistently on disk, so they can survive restarts of the server. Is this necessary? Suppose all servers go down simultaneously; is it okay if the terms reset to 0 when they come back up?
  • When a server starts, it uses the saved value to initialize the current term; if there is no saved value then the current term starts at 0.
  • Every message from server to server contains the term of the sender, plus an indication whether the sender is a candidate or leader (passive servers never issue requests). This value (call it senderTerm) is used to update lastReign and currentTerm and to detect out-of-data servers:
    • senderTerm < currentTerm: reject the message with an indication that the sender is out of date; the rejection also includes currentTerm. When the sender receives the rejection it updates its currentTerm to match the one in the response, then passivates. This makes sense for leaders because it means their term of leadership is over. This also make sense for candidates because it means there is already a new election cycle with other active candidates; there is no need for the sender to participate.
    • senderTerm == currentTerm: if the sender is a leader, then set lastReign to senderTerm.
    • senderTerm > currentTerm: set currentTerm to senderTerm. If the message is from a leader, then also set lastReign to senderTerm. If the recipient is currently a leader or candidate, then it passivates.
  • When a server switches from passive to candidate, it increments currentTerm to force a new election cycle.

State for each server

  • currentTerm: number of the most recent term seen by this server.
  • lastReign: number of the most recent term in which this server accepted a message from a leader.
  • Vote: the server id and log length of the candidate who has received this server's vote for the current term currentTerm (if any).
  • Server id of this server.
  • Log entries that have been accepted by the server.
  • Id of the most recent log entry that has been accepted by all servers.
  • Time of receipt of the last request from the leader.
  • Cluster map: id and location of each server in the cluster, whether dead or alive.

...