Distributed Consensus: Paxos by Example

In distributed systems, ensuring that a set of machines can reach consensus on a decision is extremely important. Database systems, for example, need to ensure that all machines agree whether to commit or roll back a transaction, so that each machine holds a consistent view of its data (imagine the problems your bank would have if one machine thought you had $1000 in your account, but another thought that you had $0).

Consensus is difficult to achieve, because messages between machines can be lost or indefinitely delayed, or the machines themselves can fail — how does one machine know whether another machine has processed a message?

Two-phase commit is commonly used to provide consensus, but it suffers from a single point of failure — if the node coordinating a transaction fails, the system must block until this node restarts. The three-phase commit protocol removes the blocking problem, but is still reliant on a single coordinator.

This post discusses Paxos, a distributed alternative to two- and three-phase commit. Paxos guarantees that nodes will only ever choose a single value (meaning it guarantees safety), but does not guarantee that a value will be chosen if a majority of nodes are unavailable (progress). Consensus algorithms aim for safety because it doesn’t matter whether we commit or roll back — neither is more correct than the other — but it is of critical importance that only one answer is chosen.

General Approach

A Paxos node can take on any or all of three roles: proposer, acceptor, and learner. A proposer proposes a value that it wants agreement upon. It does this by sending a proposal containing a value to the set of all acceptors, which decide whether to accept the value. Each acceptor chooses a value independently — it may receive multiple proposals, each from a different proposer — and sends its decision to learners, which determine whether any value has been accepted. For a value to be accepted by Paxos, a majority of acceptors must choose the same value. In practice, a single node may take on many or all of these roles, but in the examples in this section each role is run on a separate node, as illustrated below.

Figure 1: Basic Paxos architecture. A number of proposers make proposals to acceptors. When an acceptor accepts a value it sends the result to learner nodes.
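To make these roles concrete, here is a minimal sketch (my own illustration, not part of the algorithm’s specification) of the four message types the rest of this post walks through. The names and fields are assumptions; note that in this post’s presentation a prepare request carries a value as well as a number.

```python
# Illustrative message types for the Paxos exchange described below.
# The names and fields are my own shorthand, not a standard API.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Prepare:            # proposer -> acceptors (phase 1)
    n: int                # proposal number
    v: int                # proposed value (carried in this post's examples)

@dataclass(frozen=True)
class Promise:            # acceptor -> proposer (phase 1 response)
    n: int                                  # number the acceptor promises
    previous: Optional[Tuple[int, int]]     # highest earlier proposal (n, v), if any

@dataclass(frozen=True)
class AcceptRequest:      # proposer -> acceptors (phase 2)
    n: int
    v: int

@dataclass(frozen=True)
class Accepted:           # acceptor -> learners
    n: int
    v: int
```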

Paxos By Example

In the standard Paxos algorithm proposers send two types of messages to acceptors: prepare and accept requests. In the first stage of this algorithm a proposer sends a prepare request to each acceptor containing a proposed value, v, and a proposal number, n. The proposed value can be commit or rollback, as in the previous database example, or it can be any other arbitrary value. In this post the Paxos nodes are attempting to reach consensus on an integer value. Each proposer’s proposal numbers must be positive, monotonically increasing natural numbers, unique with respect to other proposers’ proposal numbers[1].
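As footnote [1] notes, Paxos itself does not specify how proposers keep their numbers unique. One common convention, sketched below as an illustration rather than as part of the algorithm, is to give each of N proposers its own residue class of the natural numbers:

```python
# One common way to generate unique, monotonically increasing proposal
# numbers: proposer i of N draws i, i+N, i+2N, ... so that two proposers
# can never collide. This is a convention, not part of Paxos itself.
class ProposalNumbers:
    def __init__(self, proposer_id: int, num_proposers: int):
        assert 1 <= proposer_id <= num_proposers
        self._next = proposer_id
        self._step = num_proposers

    def next(self) -> int:
        n = self._next
        self._next += self._step   # stay in this proposer's residue class
        return n

# With two proposers: proposer 1 draws 1, 3, 5, ... and proposer 2 draws
# 2, 4, 6, ...; every number is positive and globally unique.
```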

In the example illustrated below, there are two proposers, both making prepare requests. The request from proposer A reaches acceptors X and Y before the request from proposer B, but the request from proposer B reaches acceptor Z first.

Figure 2: Paxos. Proposers A and B each send prepare requests to every acceptor. In this example proposer A’s request reaches acceptors X and Y first, and proposer B’s request reaches acceptor Z first.

If the acceptor receiving a prepare request has not seen another proposal, the acceptor responds with a prepare response which promises never to accept another proposal with a lower proposal number. This is illustrated in Figure 3 below, which shows the responses from each acceptor to the first prepare request they receive.

Figure 3: Paxos. Each acceptor responds to the first prepare request message that it receives.

Eventually, acceptor Z receives proposer A’s request[2], and acceptors X and Y receive proposer B’s request. If the acceptor has already seen a request with a higher proposal number, the prepare request is ignored, as is the case with proposer A’s request to acceptor Z. If the acceptor has not seen a higher numbered request, it again promises to ignore any requests with lower proposal numbers, and sends back the previous highest proposal number that it has seen along with the value of that proposal. This is the case with proposer B’s request to acceptors X and Y, as illustrated below:

Figure 4: Paxos. Acceptor Z ignores proposer A’s request because it has already seen a higher numbered proposal (4 > 2). Acceptors X and Y respond to proposer B’s request with the previous highest request that they acknowledged, and a promise to ignore any lower numbered proposals.
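The acceptor behaviour shown in Figures 3 and 4 can be captured in a few lines. The sketch below follows this post’s simplified presentation (where prepare requests carry values); the class and method names are my own:

```python
# Sketch of an acceptor's prepare handling, following Figures 3 and 4.
# `highest` tracks the highest-numbered proposal seen so far.
from typing import Optional, Tuple

class Acceptor:
    def __init__(self):
        self.highest: Optional[Tuple[int, int]] = None   # (n, v) of highest proposal

    def on_prepare(self, n: int, v: int):
        if self.highest is not None and n < self.highest[0]:
            return None                   # ignore: promised a higher number already
        previous = self.highest           # the previous highest proposal, if any
        self.highest = (n, v)             # promise to ignore anything below n
        return ("promise", previous)      # previous is None if none was seen
```

Replaying the figures: acceptors X and Y promise proposer A’s [n=2, v=8] with no previous proposal to report, then answer proposer B’s higher-numbered prepare with ("promise", (2, 8)); acceptor Z, having promised n=4 first, ignores proposer A’s n=2 entirely.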

Once a proposer has received prepare responses from a majority of acceptors it can issue an accept request. Since proposer A only received responses indicating that there were no previous proposals, it sends an accept request to every acceptor with the same proposal number and value as its initial proposal (n=2, v=8). However, these requests are ignored by every acceptor, because they have all promised not to accept requests with a proposal number lower than 4 (in response to the prepare request from proposer B).

Proposer B sends an accept request to each acceptor containing the proposal number it previously used (n=4) and the value associated with the highest proposal number among the prepare response messages it received (v=8)[3]. Note that this is not the value that proposer B initially proposed, but the highest value from the prepare response messages it saw.

Figure 5: Paxos. Proposer B sends an accept request to each acceptor, with its previous proposal number (4), and the value of the highest numbered proposal it has seen (8, from [n=2, v=8]).
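Proposer B’s rule can be written as a small helper (the name is mine, purely for illustration): keep your own value only if no promise reported an earlier proposal; otherwise adopt the value of the highest-numbered one.

```python
# How a proposer picks the value for its accept request. `promises`
# holds the "previous proposal" field of each prepare response:
# None, or an (n, v) pair.
from typing import List, Optional, Tuple

def choose_accept_value(own_value: int,
                        promises: List[Optional[Tuple[int, int]]]) -> int:
    reported = [p for p in promises if p is not None]
    if not reported:
        return own_value          # no acceptor saw an earlier proposal
    _, v = max(reported)          # highest-numbered earlier proposal wins
    return v

# Proposer B's responses were [None, (2, 8), (2, 8)], so it must send
# accept(n=4, v=8) rather than its own initial value.
```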

If an acceptor receives an accept request with a proposal number equal to or higher than the highest it has already seen, it accepts the proposal and sends a notification to every learner node. A value is chosen by the Paxos algorithm when a learner discovers that a majority of acceptors have accepted a value, as is illustrated below:

Figure 6: Paxos. Each acceptor accepts proposer B’s request ([n=4, v=8]) and notifies the learners, which determine that a majority of acceptors have accepted the value 8.
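Here is a sketch of this final step, under the same assumptions as the earlier snippets: an acceptor accepts any request numbered at least as high as its promise, and a learner declares a value chosen once it has reports from a majority of acceptors.

```python
# Sketch of the accept phase and of learning. A value is chosen once a
# majority of acceptors have reported accepting the same proposal.
from collections import Counter
from typing import Dict, Optional, Tuple

def should_accept(highest: Optional[Tuple[int, int]], n: int) -> bool:
    """Acceptor side: accept iff n >= the highest proposal number seen."""
    return highest is None or n >= highest[0]

class Learner:
    def __init__(self, num_acceptors: int):
        self.num_acceptors = num_acceptors
        self.reports: Dict[str, Tuple[int, int]] = {}   # acceptor id -> (n, v)

    def on_accepted(self, acceptor_id: str, n: int, v: int) -> Optional[int]:
        self.reports[acceptor_id] = (n, v)
        (_, value), count = Counter(self.reports.values()).most_common(1)[0]
        if count > self.num_acceptors // 2:
            return value       # a majority accepted this proposal: chosen
        return None            # no majority yet

# In Figure 6 all three acceptors report (n=4, v=8); a Learner with
# num_acceptors=3 returns 8 as soon as the second report arrives.
```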

Once a value has been chosen by Paxos, further communication with other proposers cannot change this value. If another proposer, proposer C, sends a prepare request with a higher proposal number than has previously been seen, and a different value (for example, n=6, v=7), each acceptor responds with the previous highest proposal (n=4, v=8). This requires proposer C to send an accept request containing [n=6, v=8], which only confirms the value that has already been chosen. Furthermore, if some minority of acceptors have not yet accepted a value, this process ensures that they eventually reach consensus on the same value.
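Replaying proposer C’s exchange with the choose_accept_value helper sketched earlier makes this concrete:

```python
# Proposer C's prepare(n=6, v=7) draws promises that all report the
# chosen proposal (n=4, v=8), so the rule forces C to confirm 8.
promises = [(4, 8), (4, 8), (4, 8)]           # from acceptors X, Y, Z
assert choose_accept_value(7, promises) == 8  # C sends accept(n=6, v=8)
```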

Various efficiency improvements to the standard Paxos algorithm are discussed in the papers by Lamport and Baker et al. For example, a prepare request is not necessary if the proposer knows that it is the first to suggest a value. The proposal for such a request is numbered 0, so that it will be ignored if any higher numbered requests have been received.

General Utility

Paxos has often been criticized for its complexity, particularly with respect to the challenge of turning it into a working implementation. In spite of this, it’s an interesting example of a particularly challenging distributed systems problem, and a clever, conceptually clean solution.

If you’re interested in this topic, I’d recommend reading about real-world examples of Paxos and two- and three-phase commit, starting with the references below.

This post is an updated version of one that first appeared on angusmacdonald.me. It was in turn based on work done during my Ph.D.

References

L. Lamport, “Paxos Made Simple” in ACM SIGACT News, vol. 32, no. 4, pp. 18–25, 2001.

J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, and J. M. Léon, “Megastore: Providing Scalable, Highly Available Storage for Interactive Services”, in Proceedings of the Conference on Innovative Data Systems Research, pp. 223–234, 2011.

T. D. Chandra, R. Griesemer, and J. Redstone, “Paxos made live: an engineering perspective”, in Proceedings of the twenty-sixth annual ACM Symposium on Principles of Distributed Computing, 2007, pp. 398–407.

M. Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems”, in Proceedings of OSDI 2006.


[1] The method of ensuring the uniqueness of proposal numbers when there are multiple proposers is not specified in the Paxos algorithm itself.

[2] It may not, but the algorithm is resilient to this.

[3] Note that this is the highest proposal number that it received from prepare response messages. In this example, proposer B has a higher numbered proposal (n=4) than proposer A (n=2), but it has only received proposer A’s proposal in response to its prepare request. If no previous proposals were returned by the prepare response messages, proposer B would use its own proposal (n=4).

What We Do

If you’re looking at one of our job ads, chances are you want to know more about what we do — what does a software engineering job at AetherWorks actually involve?

We’re currently building AetherStore, a distributed data store. AetherStore runs over the workstations and servers in your organization, harnessing their unused capacity to create a shared, virtual file system. To a user, we appear as any other networked file system, without requiring any extra hardware.

It’s a wonderfully engaging project to work on because we see such a diverse range of activities every day. From the low-level handling of calls from Windows Explorer[1], to the task of breaking up[2], encrypting[3], and dispersing files across the network[4], I’ve covered most of my Computer Science education in some form while working on AetherStore.

Why I Work Here

My own background is in distributed systems (primarily distributed database systems), so the decentralized and fault-tolerant design of AetherStore has obvious appeal. I love the challenge of creating these systems because you are so constrained by the fundamental properties of distribution[5], and yet it’s still possible to create systems that scale to many hundreds of thousands of machines.

With AetherStore our challenge is in creating a shared file system where every piece of data doesn’t have to reside on every machine. We want to spread user data across machines in proportion to their capacity, but we don’t have a centralized authority to lean on when deciding how to do this. And we don’t even know how many machines should be in the system[7]. It’s brilliantly limiting[8]!

For me, it’s hard to imagine a better job. I love the intellectual challenge of creating a new feature or component, and the satisfaction of being able to craft this into a complete, stable software product. It’s truly an exciting thing to be a part of.

Photo: Angus’s desk. The working environment isn’t bad either!

So, while it’s not easy competing with so many other great New York companies, I think we’ve got a lot to offer. Consider applying!

Our Interviews

My biggest problem with job listings (ours included) is that we specify a set of requirements that invariably turn into clichés, and we don’t explain why we need them or how we test for them. So let’s look at a few, and see why they actually matter more than you might think.

“Software Engineer.”

The job title may seem meaningless, but I love the distinction it draws between software engineers and programmers. We want to know that you craft code to a high standard, and that you understand why ‘it just works’ isn’t enough.

In an interview we’ll ask you to review some (bad) code we’ve written, to gauge your code literacy. We’re looking for someone who has an appreciation for clean code and a quick eye for bugs.

“A solid understanding of object-oriented programming.”

We’re building a complex system and we need to make sure that you’re the type of person who can structure code in a logical and maintainable way. We’ll ask you to do a short programming assignment to get a feel for your general abilities and experience.

“Fundamental Computer Science Background.”

The work I have described in the previous section is challenging, and it requires that you know the relative efficiency of, say, a linked list and an array, but also that you’re capable of creating your own data structures from time to time. For us, the best indicator of this skill set is an undergraduate degree in Computer Science. In an interview we’ll ask you an algorithmic question that gives you a chance to demonstrate the breadth of your knowledge.

If you do well on these questions, we’ll invite you in for a longer interview and ask you to solve a real problem that we’re actually working on in the office.

To Apply

If the idea of working at AetherWorks appeals to you, I’d urge you to check out our available positions. Alternatively, if you have any questions about this post or our interviews, please feel free to email me (first initial, last name[9]).



[1] And I mean every single call. Every time you open a directory, right-click on a file, or save a document, Windows Explorer is providing us with a constant stream of calls asking for information and telling us what to update. Since we’re pretending to be a network mount, we have to handle each of these calls, giving responses either from the local machine or a remote copy. This fascinates me more than it probably should, but it gives you some brief insight into the complexity of the operating systems we use every day without thought.

[2] When you store a file we break it up into chunks, both to make it easier to spread data across the network and to increase de-duplication. There are entire classes of research dedicated to finding ways of doing this efficiently. Content-based chunking, in particular, has some really clever uses for hashing (fingerprinting) algorithms and sliding windows, which dramatically improve de-duplication.
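To illustrate the idea (as a toy, not AetherStore’s actual implementation): slide a window over the data, keep a rolling hash of it, and cut a chunk wherever the hash matches a pattern. Because boundaries depend only on content, identical regions produce identical chunks even when bytes shift earlier in the file, which is what makes the de-duplication work.

```python
# Toy content-based chunking with a polynomial rolling hash. All the
# constants are illustrative; real systems use tuned parameters and
# Rabin fingerprints or similar.
def chunk(data: bytes, window: int = 48, mask: int = (1 << 12) - 1) -> list:
    BASE, MOD = 257, (1 << 31) - 1
    top = pow(BASE, window - 1, MOD)     # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= window:          # window full: slide off the oldest byte
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD      # new byte enters the window
        if i - start + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])   # hash hit the pattern: cut here
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])            # trailing bytes form the last chunk
    return chunks

# Each position past the window cuts with probability ~1/4096, so the
# average chunk is roughly window + 4096 bytes.
```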

[3] We have to encrypt data at rest and in transit, but this is more challenging than in most systems, where you have a central authoritative server. Without one, our encryption architecture represents a trade-off between security and usability.

[4] Deciding where to place data is particularly challenging, since we don’t have a central coordinator that can make this decision. All machines must be in agreement as to where data is placed (so that it can be accessed), but it is expensive to allow them to coordinate to make this happen.
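For a flavor of what decentralized placement can look like, here is rendezvous (highest-random-weight) hashing, a standard technique shown as an illustration only, not necessarily what AetherStore does. It also presumes the machines agree on the membership list, which footnote [7] explains is itself hard here.

```python
# Rendezvous (highest-random-weight) hashing: every machine computes the
# same placement from a chunk id and the membership list, with no
# coordinator. Illustrative only, not AetherStore's actual scheme.
import hashlib

def place(chunk_id: str, machines: list, replicas: int = 3) -> list:
    def weight(machine: str) -> int:
        digest = hashlib.sha256(f"{chunk_id}:{machine}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # the `replicas` highest-weighted machines store this chunk
    return sorted(machines, key=weight, reverse=True)[:replicas]

# Any node evaluating place("chunk-42", ["m1", "m2", "m3", "m4"], 2)
# gets the same two machines, as long as it sees the same member list.
```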

[5] Constraints can be catalysts for creativity.

[7] Since we don’t know how many machines are ever in the system, we can’t use distributed consensus protocols such as Paxos. These require that a majority of nodes agree on a decision, but if you don’t know how many nodes exist, you don’t know how many nodes form a majority.

[8] The CAP theorem is my favorite (trivial) example of this. Imagine you have 3 machines, and have a copy of some data on each machine. How do you handle an update to that data?

How we handle this update (and anything else in distributed systems) is determined by our response to network partitions – when one set of machines is unable to contact another. If we use a central lock manager to stop conflicting updates we ensure that the data is consistent, but that it will be unavailable if the lock manager cannot be contacted. If we use a majority consensus protocol, we can update our data in the event of a partition, but only if we are in the partition with a majority of nodes. If we assume that neither of these cases is acceptable, we can do away with consistency altogether, allowing updates to each individual copy even when the others are inaccessible. The fundamental properties of a distributed system limit us in each of these options — it’s up to us to decide which is the most appropriate in any given case.

[9] This is our interview puzzle question!