In distributed systems, ensuring a set of machines are able to reach consensus on a decision is extremely important. Database systems, for example, need to ensure that all machines agree whether to commit or rollback a transaction, so that each machine holds a consistent view of its data (imagine the problems your bank would have if one machine thought you had $1000 in your account, but another thought that you had $0).
Consensus is difficult to achieve, because messages between machines can be lost or indefinitely delayed, or the machines themselves can fail — how does one machine know whether another machine has processed a message?
Two-phase commit is commonly used to provide consensus, but it suffers from a single point of failure — if the node coordinating a transaction fails, the system must block until this node restarts. The three-phase commit protocol removes the blocking problem, but is still reliant on a single coordinator.
This post discusses Paxos, a distributed alternative to two- and three-phase commit. Paxos guarantees that nodes will only ever choose a single value (meaning it guarantees safety), but does not guarantee that a value will be chosen if a majority of nodes are unavailable (progress). Consensus algorithms aim for safety because it doesn’t matter whether we commit or rollback — neither is more correct than the other — but it is of critical importance that only one answer is chosen.
A Paxos node can take on any or all of three roles: proposer, acceptor, and learner. A proposer proposes a value that it wants agreement upon. It does this by sending a proposal containing a value to the set of all acceptors, which decide whether to accept the value. Each acceptor chooses a value independently — it may receive multiple proposals, each from a different proposer — and sends its decision to learners, which determine whether any value has been accepted. For a value to be accepted by Paxos, a majority of acceptors must choose the same value. In practice, a single node may take on many or all of these roles, but in the examples in this section each role is run on a separate node, as illustrated below.
Paxos By Example
In the standard Paxos algorithm proposers send two types of messages to acceptors: prepare and accept requests. In the first stage of this algorithm a proposer sends a prepare request to each acceptor containing a proposed value, v, and a proposal number, n. The proposed value can be commit or rollback, as in the previous database example, or it can be any other arbitrary value. In this post the Paxos nodes are attempting to reach consensus on an integer value. Each proposer’s proposal number must be a positive, monotonically increasing, unique, natural number, with respect to other proposers’ proposal numbers.
In the example illustrated below, there are two proposers, both making prepare requests. The request from proposer A reaches acceptors X and Y before the request from proposer B, but the request from proposer B reaches acceptor Z first.
If the acceptor receiving a prepare request has not seen another proposal, the acceptor responds with a prepare response which promises never to accept another proposal with a lower proposal number. This is illustrated in Figure 3 below, which shows the responses from each acceptor to the first prepare request they receive.
Figure 3: Paxos. Each acceptor responds to the first prepare request message that it receives.
Eventually, acceptor Z receives proposer A’s request, and acceptors X and Y receive proposer B’s request. If the acceptor has already seen a request with a higher proposal number, the prepare request is ignored, as is the case with proposer A’s request to acceptor Z. If the acceptor has not seen a higher numbered request, it again promises to ignore any requests with lower proposal numbers, and sends back the previous highest proposal number that it has seen along with the value of that proposal. This is the case with proposer B’s request to acceptors X and Y, as illustrated below:
Once a proposer has received prepare responses from a majority of acceptors it can issue an accept request. Since proposer A only received responses indicating that there were no previous proposals, it sends an accept request to every acceptor with the same proposal number and value as its initial proposal (n=2, v=8). However, these requests are ignored by every acceptor because they have all promised not to accept requests with a proposal number lower than 4 (in response to the prepare request from proposer B).
Proposer B sends an accept request to each acceptor containing the proposal number it previously used (n=4) and the value associated with the highest proposal number among the prepare response messages it received (v=8). Note that this is not the value that proposer B initially proposed, but the highest value from the prepare response messages it saw.
If an acceptor receives an accept request for a higher or equal proposal number than it has already seen, it accepts and sends a notification to every learner node. A value is chosen by the Paxos algorithm when a learner discovers that a majority of acceptors have accepted a value, as is illustrated below:
Once a value has been chosen by Paxos, further communication with other proposers cannot change this value. If another proposer, proposer C, sends a prepare request with a higher proposal number than has previously been seen, and a different value (for example, n=6, v=7), each acceptor responds with the previous highest proposal (n=4, v=8). This requires proposer C to send an accept request containing [n=6, v=8], which only confirms the value that has already been chosen. Furthermore, if some minority of acceptors have not yet chosen a value, this process ensures that they eventually reach consensus on the same value.
Various efficiency improvements to the standard Paxos algorithm are discussed in the papers by Lamport and Baker et al.. For example, a prepare request is not necessary if the proposer knows that it is the first to suggest a value. The proposal for such a request is numbered 0, so that it will be ignored if any higher numbered requests have been received.
Paxos has often been criticized for its complexity, particularly with respect to the challenge of implementing it in a functional form. In spite of this, it’s an interesting example of a particularly challenging distributed systems problem, and a clever, conceptually-clean solution.
If you’re interested in this topic, i’d recommend reading about real-world examples of Paxos and two- and three-phase commit, starting with the references below.
L. Lamport, “Paxos Made Simple” in ACM SIGACT News, vol. 32, no. 4, pp. 18–25, 2001.
Baker, J., Bond, C., Corbett, J. C., Furman, J., Khorlin, A., Larson, J., Léon, J. M., “Megastore: Providing Scalable, Highly Available Storage for Interactive Services” in Proceedings of the Conference on Innovative Data Systems Research, pp. 223-234, 2011.
T. D. Chandra, R. Griesemer, and J. Redstone, “Paxos made live: an engineering perspective”, in Proceedings of the twenty-sixth annual ACM Symposium on Principles of Distributed Computing, 2007, pp. 398–407.
M. Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems”, in Proceedings of OSDI 2006.
 The method of ensuring the uniqueness of proposal numbers when there are multiple proposers is not specified in the Paxos algorithm itself.
 It may not, but the algorithm is resilient to this.
 Note that this is the highest proposal number that it received from prepare response messages. In this example, proposer B has a higher numbered proposal (n=4) than proposer A (n=2), but it has only received proposer A’s proposal in response to its prepare request. If no previous proposals were returned by the prepare response messages, proposer B would use its own proposal (n=4).