Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Leader crashes. Leader failures are more interesting. At the time of a leader failure, there may be one or more log entries that have been accepted by a few nodes but not yet guaranteed (i.e., the leader has not yet responded to the requesting client). There may also be any number of log entries that have been guaranteed by the cluster, but are not yet fully replicated on all servers. For the partially accepted entries, the new leader must guarantee that these entries are either fully replicated, or completely expunged from all logs. Any entry that is expunged must never be returned to a client in a read operation: the system must behave as if the client never made its original request. For entries that have been guaranteed but not fully replicated, the new leader must make sure that these entries are eventually fully replicated.

First, let's handle the case of guaranteed but not fully replicated entries. ALPO ALPO uses a two-step approach to handling leader crashes.  First, it makes sure that the new leader is chosen from among those servers whose log is complete (i.e. it includes all of the guaranteed entries). It does this by extending the notion of rank during leader election to include log length. When candidates request votes they include in the request the log id of the most recent entry they have accepted. One candidate automatically outranks another if its "last accepted id" is higher than that of the other candidate. If they both have the same "last accepted id", then rank is determined by server id. Furthermore, a server will automatically reject a request for its vote if its "last accepted id" is higher than that of the requesting candidate, even if its vote is still available; when a candidate receives this form of rejection it drops out of the election and returns to passive state.

Since a candidate requires votes from a majority of the cluster to become leader, and since it has accepted all of the log entries that were accepted by any of the other servers that voted for it, and since any guaranteed log entry must have been accepted by a majority of the servers in the cluster, the new leader is certain to store all of the guaranteed log entries. As it communicates with the other servers in the cluster it can update any that are running behind. In particular, when the leader sends a new log entry to a passive server, the passive server will reject the request unless it already stores all of the log entries with smaller ids. When this happens, the passive server indicates to the leader the highest id that it stores, so the leader can then send it all of the missing entries.

The second problem is the case of partially accepted (but not yet guaranteed) log entries. The algorithm described in the previous paragraph makes it likely that the new leader will store these entries, in which case they will get fully replicated by the new leader. However, if an entry has only been accepted by a small number of servers, it is possible that a new leader can be elected without storing the entry. In this case the new leader must make sure that the entry is expunged by all other servers. The next paragraph describes how this happens.

Restoring log consistency after a leader changeis higher than that of the requesting candidate, even if its vote is still available; when a candidate receives this form of rejection it drops out of the election and returns to passive state.

The second step in handling leader crashes is ensuring log consistency. When a new leader takes over, its log is completely up-to-date, but logs on other servers may be inconsistent in two ways:

...

Both of these problems are handled in the same way. Whenever the leader sends a log entry to other servers for replication, it includes the term of the previous entry in the log (in addition to the term of the new entry and its log id, which are included in the entry itself). The recipient accepts the new entry only if (a) the id for the new entry is one greater than the id for the most recent entry currently in its log and (b) the term of the most recent entry currently in the log matches the "previous term" included in the request. Test (a) will detect missing entries, and test (b) will detect extraneous entries. If either of these tests fails, the recipient rejects the new entry and returns the id and term for its most recent entry. When the leader receives such a rejection it knows it must reply replay one or more older entries to the passive server:

  • The leader looks up its log the entry corresponding to the id returned from the passive server. If this entry exists and its term matches the term returned from the passive server, then the leader replays to the passive server all of its log entries after this one.
  • If the term does not match, or the entry does not exist, it means that the passive server has extraneous entries that were created at the end of the given term but never got deletednot present on the leader. In this situation the leader replays all of the log entries whose terms are greater than the one returned from the passive server. When a passive server receives an entry whose id is already present in its log, this indicates to the passive server that the particular entry, and all that follow it, are extraneous; it deletes them all.

Zombie leaders. A zombie leader is a leader that has been replaced but does not yet know it. ALPO must make sure that zombie leaders cannot modify the log. To do this, each request issued by the leader includes the leader's term number. If a passive server receives a request whose term is lower than the server's current term, then it rejects the request. If a leader receives such a request then it knows it has been deposed, so it returns to passive state. Before a server gives its vote to a candidate it must increase its term number to that of the new term. This guarantees that by the time new leader knows that it elected, it is impossible for the previous leader to communicate with a majority of the cluster, so it cannot create guaranteed log entries.

...

appendEntry(entry, prevTerm): add entry to the local log, assuming that its id is one greater than that of the last entry currently in the log, and that the term of the last entry in the log matches prevTerm.

heartbeat(serverIdleaderId, term, lastLogId, lastLogTerm): used to inform passive servers that there is a leader and that the leader is alive and well. ServerId LeaderId is the id of the leader. The lastlogId and lastLogTerm arguments specify the id and term from the leader's last log entry; if these don't match the last log entry in the recipient, then it returns the id and term from its last log entry so that the leader can replay missing entries.

requestVote(serverId, term, lastLogId, lastLogTerm): used by a candidate to request another server's vote for termServerId is the id of the requestorrequesting candidate, and lastLogId and lastLogTerm specify the id and term from the requestor's last log entry; if either of the recipient's corresponding items is larger then it means the requestor's log is not complete enough for it to become the new leader, so the request is rejected.

State for each server

  • currentTerm: number of the most recent term seen by this server.
  • Vote: the server id of the candidate who has received this server's vote for currentTerm (if any), plus the id and term from its most recent log entry. This information must be persisted on disk.
  • Server id of this server.
  • Log entries that have been accepted by the server.
  • Id of the most recent log entry that has been fully replicated.
  • Time of receipt of the last request from the leader.
  • Cluster map: id and location of each server in the cluster, whether dead or alive.

...