Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 4 Next »

ALPO stands for "ALPO is Like Paxos, except (hopefully) more Obvious". This is an experiment to try to define a protocol for managing a replicated log using distributed consensus, in a way that is easier to understand and more complete than Paxos.

Problems with Paxos

Paxos has existed for more than 20 years, is generally believed to be correct, and has been implemented numerous times. Thus it might seem silly to think about alternative algorithms. However, Paxos suffers from the following problems:

  • In its purest form, as described by Lamport, it is incomplete. Basic Paxos guarantees safety, but (a) it does not deal with liveness issues (i.e. the algorithm may never terminate), (b) its data model consists of a single value, whereas real applications need to store a sequence of values, such as a log or state machine, and (c) it does not handle listeners (i.e., consensus may be reached but some of the parties may not realize that).
  • Other people have extended Paxos to be more complete, but there seems to be no agreed-upon way to do this and the various descriptions are quite complicated and difficult to understand.
  • Systems implementing Paxos all have reputations for being flaky or hard to use (though it's not clear whether this is because of Paxos).
  • Even in its simplest form the algorithm is hard to understand; the extended versions are nearly impossible to understand, even for experts. There does not seem to exist a description of Paxos that is both complete and relatively easy for a wide audience to understand. For example, most graduate students are not capable of understanding the full Paxos algorithm as part of a graduate course in operating systems.

Thus, I set out to devise a Paxos-like protocol that I could understand and convince myself to be correct. If that worked, the next step is to see if I could describe the protocol in a way that others could easily understand.

Goals

  • Create a log that is replicated across a cluster of servers. Each log entry will be identical on each of the servers.
  • The log is sequentially ordered: each entry has a unique integer id, and ids are assigned in ascending order.
  • If a server has accepted a given log entry, then it also has accepted all log entries with smaller ids than the given one.
  • If a majority of the servers have accepted a particular log entry, then that log entry is called accepted: it will be durable: it will eventually be accepted by all servers, and it will not be lost unless a majority of the servers suffer simultaneous catastrophic failures that lose all of their persistent data.
  • One of the servers in the cluster is designated the leader; all client requests that modify the log must be processed by the leader.
  • Clients can read log entries from any server, though there will usually be a time delay between when data can be read from the leader and when it can be read from other servers.
  • Leadership can move among the servers in the cluster in response to failures.
  • Once a new leader has been elected, it will not be possible for the previous leader to make noticeable updates to the log until it has been reelected.

State for each server

  • Term number: number of the most recent term seen by this server (updated when notified by a leader or a candidate, or when the server becomes a candidate).
  • Server id.
  • Log entries.
  • Id of the most recent log entry that has been accepted by all servers.

Additional state kept by leader

  • For each other server, id of most recent log entry accepted by that server.
  • For each client: serial # of most recent request received from the client.

Contents of a log entry

  • Id: integer that serializes this entry within log: 1 for first entry, 2 for next, etc.
  • Client id: identifies the client that created this entry.
  • Client serial: serial # of the client request that created this entry (used to handle duplicate client requests).
  • No labels