Zero-Conf: Bootstrapping the Network Layer

When you connect a printer to your local network, how does your computer find it?

It has to be addressable, which means it needs an IP address (and ideally a hostname), and it needs to be discoverable so that you can find it from your computer. When that works, you see a window like this:

[Image: Searching for a Printer]

These tasks are covered by zero-configuration networking (zero-conf), which describes how to make this work even when the DHCP and DNS servers that normally assign IP addresses and hostnames are unavailable.

The zero-conf specification covers three broad goals, which I discuss in this post:

  1. Allow devices to obtain an IP address even in the absence of a DHCP server.
  2. Allow devices to obtain a hostname, even in the absence of a DNS server.
  3. Allow devices to advertise and search for (discover) services on the local link.

Obtaining an IP Address

For a device to route messages on a network, it needs an IP address. This is relatively trivial if a DHCP server is available, but when this isn’t the case, zero-conf uses link-local addressing to obtain one.

The address assigned by link-local addressing is in the 169.254.0.0/16 block, which is only useful within the link (because routers won’t forward packets from devices in this address space[1]).

To obtain an address, the device sends ARP requests to establish whether its desired IP address (chosen at random from the 169.254 range[2]) is available. This is a two-step process:

  1. First, an ARP probe is sent asking for the MAC address of the machine with a given IP. This probe is sent a number of times to confirm that the given IP address is not in use (there will be no response if it isn’t).
  2. If no reply is received, the device sends out an ARP announcement saying that it is now the machine with the given IP.

There are various protocols for conflict resolution of addresses that I won’t discuss here[3].

Obtaining a Hostname

Once a device has an IP address it can be contacted through the network, but IP addresses are volatile, and for the device to be consistently discoverable it needs a hostname (which is less likely to change). Assigning a hostname is simple if you have a DNS server, but, if you don’t, zero-conf can assign hostnames using Multicast DNS.

Multicast DNS (mDNS) uses IP multicast to send broadcasts on the local link (something we wrote about recently).

To claim a local hostname with mDNS, a device sends DNS messages to probe for the uniqueness of the hostname[4]. Three queries are sent at 250 ms intervals, and if no device reports using this hostname, the requesting device then sends an announcement message to claim ownership.
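
To make this concrete, here is a rough sketch of the probe step in Java, sent to the well-known mDNS group 224.0.0.251 on port 5353. The hostname and the stripped-down packet encoding are illustrative only, and the sketch ignores real-world details such as the port already being claimed by a running mDNS responder (Bonjour, Avahi):

import java.io.ByteArrayOutputStream;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;

public class MdnsProbe {
    public static void main(String[] args) throws Exception {
        byte[] query = buildQuery("myhost.local"); // hypothetical hostname we want to claim
        InetAddress group = InetAddress.getByName("224.0.0.251"); // the mDNS group
        try (MulticastSocket socket = new MulticastSocket(5353)) {
            socket.joinGroup(group);
            for (int i = 0; i < 3; i++) { // three probes, 250 ms apart
                socket.send(new DatagramPacket(query, query.length, group, 5353));
                Thread.sleep(250);
            }
            // A real implementation listens for responses between probes and
            // only announces ownership if no conflicting answer arrives.
        }
    }

    // Encodes a minimal DNS query: a 12-byte header (ID=0, flags=0, QDCOUNT=1),
    // the QNAME as length-prefixed labels, then QTYPE=ANY (255), QCLASS=IN (1).
    static byte[] buildQuery(String name) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(new byte[] {0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0});
        for (String label : name.split("\\.")) {
            byte[] bytes = label.getBytes(StandardCharsets.UTF_8);
            out.write(bytes.length);
            out.write(bytes);
        }
        out.write(0); // end of QNAME
        out.write(new byte[] {0, (byte) 255, 0, 1});
        return out.toByteArray();
    }
}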

In the case of a conflict (where two devices believe they own the same hostname), lexicographic ordering is used to determine a winner.

mDNS hostnames must use a local top-level domain to distinguish them from globally accessible hosts (those under .com, .org, .gov, etc.). Apple devices and many others use the .local domain.

Browsing for Services

Once a device has an IP address and hostname, it can be contacted, but only if we know the name of the specific device we are looking for. If we don’t, DNS Service Discovery (also known as DNS-SD) can be used to search for services available on the local link. Rather than looking for a printer called ‘jose’, we can look for all devices supporting a print protocol (and then select ‘jose,’ if available).

What this looks like

Services are advertised in the form:

ServiceType.Domain

For example, _ipp._tcp.example.com advertises devices supporting the Internet Printing Protocol in the example.com domain.

Individual services are identified by a specific instance name:

InstanceName.ServiceType.Domain

So our printer, ‘jose,’ would be identified as:

jose._ipp._tcp.local[5]

How this works

DNS-SD uses mDNS to announce the presence of services. To achieve this, it uses DNS PTR and SRV records.

PTR records (pointer records) are used in lookups[6] to search for a specific service type (_ipp._tcp.local in the above example); the lookup returns the name of the SRV record for each service supporting the specified protocol (there is one PTR record per SRV record).

A query for _ipp._tcp.local will return jose._ipp._tcp.local and all other printers supporting the IPP protocol in the local domain.

SRV records (service records) record the protocol a service supports and its address. Where the PTR record is used to find devices supporting a protocol, the SRV record is used to find a specific device’s hostname. For the printer ‘jose,’ the SRV record would contain the service type and domain, and the hostname for the printer itself:

_ipp._tcp.local | jose.local

At this point we have discovered and can connect to our printer.
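
In practice you rarely issue these lookups by hand. As an illustration (my choice of library for this sketch, not necessarily what any given product uses), here is how browsing for IPP printers might look with the open-source JmDNS library, assuming a 3.4+ version on the classpath. The PTR query and SRV resolution described above happen behind addServiceListener and requestServiceInfo:

import java.net.InetAddress;
import javax.jmdns.JmDNS;
import javax.jmdns.ServiceEvent;
import javax.jmdns.ServiceListener;

public class PrinterBrowser {
    public static void main(String[] args) throws Exception {
        JmDNS jmdns = JmDNS.create(InetAddress.getLocalHost());
        // Browsing the service type issues the PTR query described above.
        jmdns.addServiceListener("_ipp._tcp.local.", new ServiceListener() {
            public void serviceAdded(ServiceEvent event) {
                // Ask for the SRV (and TXT) records behind the PTR result.
                event.getDNS().requestServiceInfo(event.getType(), event.getName());
            }
            public void serviceRemoved(ServiceEvent event) {
                System.out.println("Lost: " + event.getName());
            }
            public void serviceResolved(ServiceEvent event) {
                // The SRV record gives us the host and port to connect to.
                System.out.println("Found " + event.getName() + " at "
                        + event.getInfo().getHostAddresses()[0] + ":"
                        + event.getInfo().getPort());
            }
        });
        Thread.sleep(10_000); // browse for ten seconds, then clean up
        jmdns.close();
    }
}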

In addition to this, there are various extensions to zero-conf that I don’t describe here. These include:

  • TXT records, which allow extra attributes to be recorded in the service announcement (for example, extra information needed to make a proper connection to a service).
  • Subtypes, which allow the same service to advertise different levels or types of support within a service.
  • Flagship Service Types, which enable applications to determine when the same device is supporting and announcing multiple protocols. This makes it possible to remove duplicate entries in a listing, where a device supports multiple protocols that perform the same function.

Implementations

The most commonly used implementation of zero-conf (or at least the most written about) is Apple’s Bonjour.

We have used this library as the basis for our own implementation inside AetherStore, linking in with the provided dns_sd.jar wrapper. There are various native implementations that I haven’t yet tried.

If you’d like to read more on zero-configuration, I’d recommend the O’Reilly Zero Configuration Networking book, which provides an exhaustive description of everything I’ve touched on in this post.



[1] This is good because it stops local information from potentially clogging upstream networks.

[2] Unless it has already been on this network, in which case it uses its previously chosen address.

[3] Edgar Danielyan discusses this in an article in the Internet Protocol Journal.

[4] This is the same process as with regular unicast DNS, but with some optimizations. For example, individual queries can include multiple requests and can receive multiple responses, to limit the number of multicast requests being made.

Unicast DNS packets are limited to 512 bytes, whereas mDNS packets can be up to 9,000 bytes (though only 1,472 bytes can be transmitted without fragmenting the message across multiple Ethernet frames).

[5] The _tcp label specifies that the service runs over TCP (rather than UDP).

[6] PTR records are similar to CNAME records but they only return the name, rather than performing further processing (source).

Meet Our Summer Interns

AetherWorks offers internships each summer to take full advantage of the fresh perspective and innovative ideas interns bring to the office. In exchange, we want to provide a hands-on learning experience where interns can be fully involved in all aspects of working at a start-up. This summer is no different, and we spent a long time reviewing resumes before finding the two exceptional interns you’ll meet below.

Laney Caldwell, Marketing and Communications

Originally from Birmingham, Alabama, Laney will be a senior at Brown University this fall, where she’s studying Political Science. She spent the past two summers working in Boston for nonprofits, where she focused on marketing and operations. She also works for the Maddock Alumni Center as a student liaison for large-scale alumni events. Laney is a Division 1 Varsity Fencer at Brown and a member of Brown Women in Business.


“I want to work for a start-up after graduation, and was particularly interested in working in the tech industry. I think AetherWorks is the right place to get experience because they’ve created an environment that simultaneously supports, challenges, and motivates the entire team. From brushing up on my technical knowledge to working on new marketing strategies, I’m most excited about the sheer breadth of experience I will gain during the next two and a half months watching AetherStore develop.”   

 

Fredrik Kjellberg, Venture Development

Fredrik is from Stockholm, Sweden, and will be a senior at the University of Michigan this fall, where he is completing a double major in Economics and Informatics. He is a co-founder of the Michigan International Student Society and an active member of the Global Investment Club at the Ross School of Business. He has over 5 years of sales and marketing experience working for OKQ8 in Sweden, where he also had the chance to run a number of small entrepreneurial projects. However, it was not until he was accepted to the University of Michigan’s start-up incubator last summer that he really had the opportunity to test out and launch a number of tech ventures with his team. Fredrik spent the beginning of the summer working with an intelligence-centric digital agency in Stockholm.


“I know I want to work in the intersection of business and technology, and what better way to start than working as a Venture Development Associate for an R&D firm that has both a smart team and some really cool, advanced products and ideas in the pipeline. I am most excited to put my interdisciplinary background to the test working with AetherStore, and also look forward to gaining exposure in the NYC start-up and venture capital world.”

 

 

If you’re interested in working at AetherWorks, check out our careers page here!

AetherStore Early Adopters Program

If you’ve been keeping up with the AetherWorks Blog, you probably know we’ve been developing software that will allow users to make the most efficient use of their storage resources. It’s called AetherStore, and as we approach beta release we’re looking for Early Adopters to be some of the first to benefit from the technology.

Sign up here to become an Early Adopter!

What is AetherStore?

AetherStore works by pooling unused space on your machines’ hard drives to create a shared, distributed storage network. The software chunks your data and saves those chunks many times over multiple machines, distributing the burden of data and removing the central point of failure. AetherStore stores data intelligently based on the amount of space machines have available, so you’re never limited to the smallest hard drive. All of your data is encrypted, too, so its storage complies with the requirements of even the most heavily regulated industries, like healthcare, finance, and education.

With remote access and BYOD policies as widely embraced as they are, AetherStore can also be coupled with a cloud solution for remote access and long-term backup.


Why Sign Up for the Early Adopters Program?

While there are certain concerns (security, scalability, latency, availability) that all organizations consider when developing a data storage strategy, differences in infrastructure and business processes mean the storage needs of each organization vary widely. The information we receive from our Early Adopters allows us to address highly specific use cases and tailor our technology to maximize its effectiveness for different organizations. In return, we’re offering you the technology necessary to make the most efficient use of the storage resources you’re already paying for.

If you’re interested in learning more about AetherStore, sign up to be an Early Adopter here!

PhD: Perfect Preparation for a Start-up?

In late 2011 I moved from life as a Computer Science PhD student to my current role at AetherWorks. Going from a town of seventeen thousand people to a city of eight million was a big change, but going from a life in academia to life in a start-up was much easier than you might think.

This post is a response to two sentiments that I’ve often heard expressed, which I’ve paraphrased below:

“It must be a major change going from a PhD to the world of work.” – The “transition” problem.

“A PhD doesn’t prepare you for life in the world of work.” – The “relevance” problem.

Clearly I don’t think either is true. Still, it’s surprising how easy my transition was, and how useful my past experience has been[1].

Transition

Is there a chasm between the roles of a CS PhD student and a start-up founder? I don’t believe so – in fact, when writing down the stages of each role, the job descriptions are strikingly similar:

Find the Problem: In the initial stages your focus is on finding a problem. You search for a gap in existing work that is ready to be exploited.

Initial Research: You spend the initial months of your work scoping out the identified area, examining existing work and figuring out what related areas have potential relevance. In this stage, you work to find out where the problems are, and start to think about the best ways to fix them.

Development: Once you have established the problem space, you start developing your idea. This should be fast, because you want to show the initial results of your efforts as quickly as possible. You start developing your MVP.

The earlier you get your work out, the quicker you receive the feedback that is critical to your progress: Where can you take the idea? Has it been done before? Does it need to be refined?

Pivot: At some point in these processes you realize that your initial idea wasn’t quite as original as you thought, or you find a related area and a problem in much greater need of a solution. You pivot.

Titles: While this is happening you keep a healthy disdain for your job title, because it seems to mean very little. When you work with such a small group of people, you have to wear many hats — the developer, system administrator, network engineer, and barista are all the same person[2].

Collaboration: You spend days with a very tightly-knit group of collaborators, working through ideas and combing over documents[3]. When the work is presented one person might get the credit, but it’s a team effort.

Self-Motivation: When you are one of only a few people working on a project, a week of unproductive work can seriously damage a project’s velocity. You don’t have a boss looking over your shoulder, so you need to motivate yourself.

Networking: You don’t want to develop your idea in a vacuum, so you go out and meet others in your field at conferences and meet-ups. You learn the importance of a good elevator pitch.

Relevance

Every PhD is different, but I find it hard to reconcile my experience with the view that it produces students with an intense focus but few transferable skills[4].

There are periods of confined study, but as with most jobs, there are opportunities for personal development. My ‘job’ happened to be the work of a PhD student, but that didn’t prevent me from developing other interests.

I tutored, lectured, and presented where possible. I even designed coffee mugs[5] and indulged in pointless side-projects, all of which produced a well-roundedness that has served me well in a start-up. The notion that a PhD creates people that aren’t prepared for work outside of academia is not an inherent truth. As with any job, it is what you make it.

Primary Difference

The primary difference between a PhD and a start-up is the end goal. In my PhD I created software with the goal of evaluating it and publishing results, so there were bugs and usability issues that I wasn’t focused on fixing. Conversely my current job is to make the most stable and usable product possible, which leads to a much stricter view of development, tighter schedules, and a more stringent QA process. The end goal of a start-up is to create a useful product that people will pay for, whereas the end goal of a PhD is to create something that advances the state of the art.

Some of the skills I developed during my PhD have proved to be incredibly useful for life at a start-up. I improved my time management and learned what it takes to motivate myself to do my best work, even on days when no-one is looking — skills that aren’t as necessary in a deadline-driven undergraduate degree, and which aren’t necessarily easily obtained in an entry-level job. It also made me a better researcher, and a more analytical thinker.

Ultimately, very few jobs can fully prepare you for life at a start-up, but a CS PhD can come surprisingly close.

 


[1] Major disclaimer: what I’m going to describe is and was true for me. I had a very practical Computer Science PhD, I had to write a lot of code, and I did a lot of work on software architectures. This makes for an easier transition than many other types of PhD even within Computer Science.

[2] Arguably the greatest output of my PhD was a talent for making smooth lattes. You won’t find this on my resume.

[3] A PhD is typically thought of as a very solitary endeavor, but it’s easy to forget the countless hours of discussions and input from supervisors and colleagues. Your name is on the document, but it wasn’t created in isolation.

[4] This is a view that I’ve heard from many people, but sadly most of the references I could find online were from PhD students themselves [a] [b] [c].

[5] This is the second greatest output of my PhD. Sadly, my most innovative designs were banned by the university’s image guidelines.

The Multicast Revival: LAN Multicast with Code Example

The rise of datacenters and cloud computing is bringing about a resurgence in the use of Multicast. Previously neglected by networking courses and confined to LANs, Multicast now has new relevance. As a brief introduction, this post covers the basic theory behind IPv4 and Ethernet Multicast and provides a code sample showing how it can be used in practice.

Multicast sits on a spectrum between Unicast and Broadcast (and can operate as Broadcast, but with a higher overhead). Unicast, point-to-point messaging is the dominant communication paradigm in networks today. In future networking environments, however, we may need to consider alternative communication paradigms.

When we think of computer networks, we tend to envisage point-to-point Unicast communication. For example, consider HTTP requests to a website: every host wishing to see the page makes an individual request for that page and receives an individual copy. Thus, if 1000 people wish to view the web page, 1000 copies are sent across the network. This duplication is acceptable for web-pages, as web content is generally static and web requests are created, and arrive, independently of one another[i].

In contrast, imagine a scenario in which requested data goes out of date quickly, or many hosts wish to receive the data simultaneously. Consider, for example, the live stream of a sports event or the delivery of timely stock market information to multiple hosts in a datacenter. In both of these cases, sending out multiple copies of the data is inefficient. Ideally we would address a message to multiple recipients and rely on the network to distribute the message accordingly.

Multicast

Multicast is a form of networking in which messages can be delivered to one or more recipients. The Internet Protocol (IP) is used to deliver data across network boundaries[ii], and we use IP addresses to identify a host across multiple networks (see the IP header figure below). Given that an IP header only has space for one destination address, how does Multicast provide support for multiple recipients[iii]?

[Figure: Example Internet Datagram Header[iv]]

The solution is to map a single IP address to multiple recipients. Routers (and intelligent switches) have built-in support for maintaining these Multicast mappings. I will focus on behavior within a single LAN, rather than between routers.

When a packet destined for one of these Multicast addresses (which is mapped to multiple recipients) arrives at a router or switch, the packet is forwarded such that each link only receives one copy of the message at most. All hosts on a link receive all frames on that link and hosts simply filter out those frames not intended for them.

The key point is that as few copies as possible of the messages are sent. For instance, at the link layer, a frame only needs to be sent to links where group members reside[v].

IGMP

In IPv4, the Class D addresses 224.0.0.0 through 239.255.255.255 are reserved for Multicast. Each address represents a group of recipients, giving space for 2^28 = 268,435,456 different sets of recipients (Multicast groups). In practice, however, routers and switches generally do not have enough memory to support this number of groups simultaneously, and there are a number of reserved Multicast addresses used by various protocols[vi].

These Multicast mappings are maintained using the Internet Group Management Protocol (IGMP). When a recipient (host) wishes to join a group, it sends a membership report message to the “all routers” group (224.0.0.2). When the router receives the report, it maps the interface on which it received the report to the group IP address, and any messages sent to the group IP address are forwarded to the host via that interface. Thus, the router only needs to know of one group member on each interface.

The router periodically (roughly every minute) sends out a query to the “all hosts” group (224.0.0.1), to check whether there are any hosts still wishing to receive Multicast messages. All hosts in the network receive this query, but only those hosts wishing to remain members reply with a membership report. Hosts can also leave at any time by sending an explicit leave group message.

All IGMP messages are IP datagrams with the following format:

[Figure: IGMP Packet Format[vii]]

There are several types of IGMP message:

  • 0x11 = Membership Query:
    • General Query, used to learn which groups have members on an attached network.
    • Group-Specific Query, used to learn if a particular group has any members on an attached network.
  • 0x16 = Version 2 Membership Report: used by hosts to declare membership.
  • 0x17 = Leave Group: used by hosts to leave a group.

The Data Link Layer (Layer 2)

As with all IP traffic, Multicast relies on the data link layer (layer 2) for delivery. So far, I have only covered the network layer (layer 3) implementation, but it is important to also understand the layer 2 implementation.

To deliver to an end host, an IP Multicast address must be converted into an Ethernet address because Ethernet addresses are required for link layer delivery. Ethernet has its own rules for handling Multicast frames, as it can be used independently of IPv4 Multicast[viii].

An Ethernet address consists of 6 bytes, split into two three-octet identifiers. The first three bytes are the Organizationally Unique Identifier (OUI), which is generally that of the manufacturer of the network card. The second three bytes are the Network Interface Controller (NIC)-specific identifier[ix]. Any frame with an OUI of 01-00-5E carries an IPv4 Multicast address, meaning that Ethernet Multicast addresses fall into the range 01-00-5E-00-00-00 to 01-00-5E-7F-FF-FF.

Generally, a switch will only forward frames on an interface where it knows the destination Ethernet address resides. A Multicast destination address in an Ethernet frame, however, does not map to a known host in the network, so the switch needs to treat a Multicast frame differently from a standard Ethernet frame. A basic switch forwards Multicast-addressed frames on all interfaces, because the low-order bit of the first octet (the 01 in the OUI) marks the frame as a group address, which simple switches flood in the same way as Broadcast frames.

By making use of this Broadcast-style flooding, Multicast can be supported on both vanilla and feature-rich switches. A dumb bridge might just forward all packets on all interfaces; a standard off-the-shelf switch with no knowledge of Multicast will forward Multicast packets on all interfaces due to the group bit. An intelligent switch, however, will maintain its own Multicast group mappings to use for forwarding.

By maintaining a mapping of Multicast addresses and only forwarding Multicast frames on links where group members are present, an intelligent switch can avoid needlessly congesting links in the layer 2 network with Multicast frames. Intelligent switches are therefore highly valued in datacenter Multicast deployments.

Intelligent switches can snoop IGMP messages as they pass through. They maintain their own mapping of interfaces to groups (in some cases MAC addresses to groups), and send IGMP reports upstream. By performing this IGMP snooping, switches avoid the congestion caused by many IGMP reports being sent from hosts to the router (this has partly been addressed in IGMPv3 which supports multiple groups per report).

A sharp observer will notice that Ethernet has a 48-bit address space, while IPv4 has a 32-bit address space. What does this mean for IPv4-to-Ethernet address conversion? The first four bits of the IPv4 address are fixed for Class D addresses, leaving 28 bits. The lower 23 bits of the IPv4 address become the lower 23 bits of the Ethernet address (the first bit of the NIC-specific bytes of the Ethernet address must be 0, hence 23 bits rather than 24). This leaves five (28 – 23) bits of the IPv4 address unused, meaning each Multicast Ethernet address can represent 2^5 = 32 IPv4 Multicast groups. On receipt of a Multicast frame, the host analyzes the address in the IP header (if present) and filters out messages for groups of which it is not a member[x].
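
The conversion itself is mechanical. Here is a small Java sketch of the mapping just described: keep the fixed 01-00-5E prefix and copy in the low 23 bits of the group address.

import java.net.InetAddress;

public class MulticastMac {
    // Fixed 01-00-5E prefix plus the low 23 bits of the IPv4 group address.
    static byte[] toEthernet(InetAddress group) {
        byte[] ip = group.getAddress();
        return new byte[] {
            0x01, 0x00, (byte) 0x5E, // Multicast OUI
            (byte) (ip[1] & 0x7F),   // mask off the 24th bit (it must be 0)
            ip[2],
            ip[3]
        };
    }

    public static void main(String[] args) throws Exception {
        byte[] mac = toEthernet(InetAddress.getByName("224.0.0.251")); // the mDNS group
        StringBuilder s = new StringBuilder();
        for (byte b : mac) s.append(String.format("%02X-", b));
        System.out.println(s.substring(0, s.length() - 1)); // prints 01-00-5E-00-00-FB
    }
}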

Multicast in Action (code example)

Generally, best-effort protocols such as UDP and RTP are used with Multicast, as re-transmission causes duplicate packet receipt for the other recipients. All hosts must share the ability to decrypt and encrypt traffic, meaning many applications that wish to use Multicast have to generate a shared key for all recipients.

Multicast is not supported on the public Internet, due to security concerns largely stemming from fear of Denial of Service attacks[xi]. It is, however, generally supported out of the box on off-the-shelf routers like the ones found in the average home. This code example demonstrates rudimentary Multicast communication between nodes within a network.

To compile and run, execute the following command on two or more machines on your LAN:

javac Node.java && java Node;

You should see messages being sent and received by each node using Multicast!

For the full code listing, see the following Multicast gist.
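
If you’d rather not leave the page, here is a minimal sketch of the same idea (not the gist itself; the group address and port are arbitrary choices): each node joins a group, multicasts a hello message every two seconds, and prints whatever it receives.

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;

public class Node {
    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName("224.0.1.100");
        int port = 4446;
        try (MulticastSocket socket = new MulticastSocket(port)) {
            socket.joinGroup(group); // triggers an IGMP membership report

            // Sender: multicast a hello message every two seconds.
            new Thread(() -> {
                try {
                    while (true) {
                        byte[] msg = ("hello from " + InetAddress.getLocalHost())
                                .getBytes(StandardCharsets.UTF_8);
                        socket.send(new DatagramPacket(msg, msg.length, group, port));
                        Thread.sleep(2000);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }).start();

            // Receiver: print every datagram sent to the group.
            byte[] buf = new byte[1024];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                System.out.println(new String(packet.getData(), 0,
                        packet.getLength(), StandardCharsets.UTF_8));
            }
        }
    }
}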

Summary

Multicast allows us to distribute data to multiple recipients simultaneously. Hosts join a group, and from that point onwards they are sent copies of all messages destined for that group. Any host on the network can attempt to join one of these groups by sending a special IGMP message to the network’s router to receive that group’s messages.

There may be efficiency benefits to using Multicast in environments where large numbers of sensors or sensor-enabled devices communicate, such as Ubiquitous (and Personal Area) Networks (UPANs). Imagine, for example, a controller device Multicasting information to a set of sensor devices worn around the body.

In the short term, Big Data and Cloud environments, characterized by large numbers of co-located servers housed in datacenters, can take advantage of Multicast communication. This need for timely data analysis by large numbers of machines is currently driving the resurgence of Multicast.

If you are interested in the specifics of Multicast, please see the references below.


[viii] There are two types of Multicast: layer 3 and layer 2 Multicast. Layer 2 Multicast can be used independently of layer 3 Multicast. Both implementations use IGMP, necessitating the conversion from IP address to MAC address. Layer 3 Multicast has too large a scope to discuss here.

Distributed Consensus: Paxos by Example

In distributed systems, ensuring a set of machines are able to reach consensus on a decision is extremely important. Database systems, for example, need to ensure that all machines agree whether to commit or rollback a transaction, so that each machine holds a consistent view of its data (imagine the problems your bank would have if one machine thought you had $1000 in your account, but another thought that you had $0).

Consensus is difficult to achieve, because messages between machines can be lost or indefinitely delayed, or the machines themselves can fail — how does one machine know whether another machine has processed a message?

Two-phase commit is commonly used to provide consensus, but it suffers from a single point of failure — if the node coordinating a transaction fails, the system must block until this node restarts. The three-phase commit protocol removes the blocking problem, but is still reliant on a single coordinator.

This post discusses Paxos, a distributed alternative to two- and three-phase commit. Paxos guarantees that nodes will only ever choose a single value (meaning it guarantees safety), but does not guarantee that a value will be chosen if a majority of nodes are unavailable (progress). Consensus algorithms aim for safety because it doesn’t matter whether we commit or rollback — neither is more correct than the other — but it is of critical importance that only one answer is chosen.

General Approach

A Paxos node can take on any or all of three roles: proposer, acceptor, and learner. A proposer proposes a value that it wants agreement upon. It does this by sending a proposal containing a value to the set of all acceptors, which decide whether to accept the value. Each acceptor chooses a value independently — it may receive multiple proposals, each from a different proposer — and sends its decision to learners, which determine whether any value has been accepted. For a value to be accepted by Paxos, a majority of acceptors must choose the same value. In practice, a single node may take on many or all of these roles, but in the examples in this section each role is run on a separate node, as illustrated below.

Figure 1: Basic Paxos architecture. A number of proposers make proposals to acceptors. When an acceptor accepts a value it sends the result to learner nodes.

Paxos By Example

In the standard Paxos algorithm proposers send two types of messages to acceptors: prepare and accept requests. In the first stage of this algorithm a proposer sends a prepare request to each acceptor containing a proposed value, v, and a proposal number, n. The proposed value can be commit or rollback, as in the previous database example, or it can be any other arbitrary value. In this post the Paxos nodes are attempting to reach consensus on an integer value. Each proposer’s proposal numbers must be positive, monotonically increasing natural numbers, unique with respect to other proposers’ proposal numbers[1].
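
As footnote [1] notes, the algorithm does not specify how to generate these numbers. One common scheme (an assumption on my part, not something the papers mandate) is to interleave rounds by proposer ID, so each proposer draws from its own disjoint, increasing sequence:

final class ProposalNumbers {
    private final int proposerId;   // unique per proposer: 0 <= proposerId < numProposers
    private final int numProposers;
    private long round = 0;

    ProposalNumbers(int proposerId, int numProposers) {
        this.proposerId = proposerId;
        this.numProposers = numProposers;
    }

    // With 3 proposers, proposer 1 generates 4, 7, 10, ...; proposer 2 generates 5, 8, 11, ...
    long next() {
        return (++round) * numProposers + proposerId;
    }
}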

In the example illustrated below, there are two proposers, both making prepare requests. The request from proposer A reaches acceptors X and Y before the request from proposer B, but the request from proposer B reaches acceptor Z first.

Figure 2: Paxos. Proposers A and B each send prepare requests to every acceptor. In this example proposer A’s request reaches acceptors X and Y first, and proposer B’s request reaches acceptor Z first.

If the acceptor receiving a prepare request has not seen another proposal, the acceptor responds with a prepare response which promises never to accept another proposal with a lower proposal number. This is illustrated in Figure 3 below, which shows the responses from each acceptor to the first prepare request they receive.

Figure 3: Paxos. Each acceptor responds to the first prepare request message that it receives.

Eventually, acceptor Z receives proposer A’s request[2], and acceptors X and Y receive proposer B’s request. If the acceptor has already seen a request with a higher proposal number, the prepare request is ignored, as is the case with proposer A’s request to acceptor Z. If the acceptor has not seen a higher numbered request, it again promises to ignore any requests with lower proposal numbers, and sends back the previous highest proposal number that it has seen along with the value of that proposal. This is the case with proposer B’s request to acceptors X and Y, as illustrated below:

Figure 4: Paxos. Acceptor Z ignores proposer A’s request because it has already seen a higher numbered proposal (4 > 2). Acceptors X and Y respond to proposer B’s request with the previous highest request that they acknowledged, and a promise to ignore any lower numbered proposals.

Once a proposer has received prepare responses from a majority of acceptors it can issue an accept request. Since proposer A only received responses indicating that there were no previous proposals, it sends an accept request to every acceptor with the same proposal number and value as its initial proposal (n=2, v=8). However, these requests are ignored by every acceptor, because they have all promised not to accept requests with a proposal number lower than 4 (in response to the prepare request from proposer B).

Proposer B sends an accept request to each acceptor containing the proposal number it previously used (n=4) and the value associated with the highest proposal number among the prepare response messages it received (v=8)[3]. Note that this is not the value that proposer B initially proposed, but the highest value from the prepare response messages it saw.

Figure 5: Paxos. Proposer B sends an accept request to each acceptor, with its previous proposal number (4), and the value of the highest numbered proposal it has seen (8, from [n=2, v=8]).

If an acceptor receives an accept request with a proposal number greater than or equal to the highest it has promised, it accepts the value and sends a notification to every learner node. A value is chosen by the Paxos algorithm when a learner discovers that a majority of acceptors have accepted a value, as is illustrated below:

Figure 6.

Once a value has been chosen by Paxos, further communication with other proposers cannot change this value. If another proposer, proposer C, sends a prepare request with a higher proposal number than has previously been seen, and a different value (for example, n=6, v=7), each acceptor responds with the previous highest proposal (n=4, v=8). This requires proposer C to send an accept request containing [n=6, v=8], which only confirms the value that has already been chosen. Furthermore, if some minority of acceptors have not yet chosen a value, this process ensures that they eventually reach consensus on the same value.
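
Pulling the acceptor’s behavior from the figures together: it promises on a prepare request numbered higher than any it has promised, and accepts on an accept request numbered at least as high. Here is a sketch in Java (in-memory only; a real acceptor must persist this state before replying, and the response type is invented for illustration):

class PrepareResponse {
    final boolean promised;   // did the acceptor make a promise?
    final long prevN;         // highest previously accepted proposal number, or -1
    final Integer prevV;      // its value, or null if nothing has been accepted yet
    PrepareResponse(boolean promised, long prevN, Integer prevV) {
        this.promised = promised; this.prevN = prevN; this.prevV = prevV;
    }
}

class Acceptor {
    private long promisedN = -1;   // highest proposal number promised
    private long acceptedN = -1;   // proposal number of the accepted value, if any
    private Integer acceptedV = null;

    // Phase 1: prepare request carrying proposal number n.
    synchronized PrepareResponse onPrepare(long n) {
        if (n > promisedN) {
            promisedN = n; // promise to ignore anything numbered below n
            return new PrepareResponse(true, acceptedN, acceptedV);
        }
        return new PrepareResponse(false, acceptedN, acceptedV);
    }

    // Phase 2: accept request carrying proposal number n and value v.
    synchronized boolean onAccept(long n, int v) {
        if (n >= promisedN) { // higher or equal: accept it
            promisedN = n;
            acceptedN = n;
            acceptedV = v;
            return true;      // the caller then notifies every learner
        }
        return false;
    }
}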

Various efficiency improvements to the standard Paxos algorithm are discussed in the papers by Lamport and Baker et al. For example, a prepare request is not necessary if the proposer knows that it is the first to suggest a value. The proposal for such a request is numbered 0, so that it will be ignored if any higher numbered requests have been received.

General Utility

Paxos has often been criticized for its complexity, particularly with respect to the challenge of implementing it in a functional form. In spite of this, it’s an interesting example of a particularly challenging distributed systems problem, and a clever, conceptually-clean solution.

If you’re interested in this topic, I’d recommend reading about real-world examples of Paxos and two- and three-phase commit, starting with the references below.

This post is an updated version of one that first appeared on angusmacdonald.me. It was in turn based on work done during my Ph.D.

References

L. Lamport, “Paxos Made Simple” in ACM SIGACT News, vol. 32, no. 4, pp. 18–25, 2001.

J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, and J. M. Léon, “Megastore: Providing Scalable, Highly Available Storage for Interactive Services”, in Proceedings of the Conference on Innovative Data Systems Research, pp. 223–234, 2011.

T. D. Chandra, R. Griesemer, and J. Redstone, “Paxos made live: an engineering perspective”, in Proceedings of the twenty-sixth annual ACM Symposium on Principles of Distributed Computing, 2007, pp. 398–407.

M. Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems”, in Proceedings of OSDI 2006.


[1] The method of ensuring the uniqueness of proposal numbers when there are multiple proposers is not specified in the Paxos algorithm itself.

[2] It may not, but the algorithm is resilient to this.

[3] Note that this is the highest proposal number that it received from prepare response messages. In this example, proposer B has a higher numbered proposal (n=4) than proposer A (n=2), but it has only received proposer A’s proposal in response to its prepare request. If no previous proposals were returned by the prepare response messages, proposer B would use its own proposal (n=4).

Endorsement Spamming

Recently LinkedIn added a feature that allows you to endorse people for certain skills. Simply put, this is the ability to ‘plus 1’ a skill for any connection.

Most would agree that the ability to gauge someone’s perceived expertise at a glance is useful.  My problem with endorsements is that the lack of restrictions on handing them out undermines the value of the process. For example, I have one colleague, who shall remain anonymous, who simply endorses everyone for everything (more on this later). The consequence of this approach is that people are happy about receiving endorsements and then feel obliged to endorse him/her back.  My colleague is now a perceived expert in an astonishing number of areas!

[Image: This guy is good!]

Problem 1 – Reciprocal Endorsements

There is no cap on the number of endorsements one person can give. With the general encouragement LinkedIn gives to ‘pass endorsements forward,’ we arrive at a situation where it is beneficial to endorse everyone for everything in the hope that the favor will be returned. Since there is no way of viewing how many endorsements any particular user has given, a heavy endorser will likely be made to look far more impressive than they actually are – the most effective strategy is to endorse indiscriminately. As this trend continues, we tend toward a scenario where honest endorsements are marginalized.

[Image: Should you ‘pay it forward?’]

The obvious preventative solution is to limit the number of endorsements a user can give. This would limit the range of the tool, but increase the value of each endorsement, while still using a metric that is easy to understand.

If they didn’t limit endorsements, LinkedIn could display how many endorsements a user has given next to how many that user has received. A user viewing a profile could then gauge how meaningful someone’s received endorsements are. The downside to this approach is that the metric cannot pinpoint reciprocal endorsements, and thus leaves a large margin of error for different types of users. Interestingly enough, LinkedIn already does this for written recommendations, although since these are far fewer in number, the metric is easier for users to assess.

Personally, I think the best solution would be a dynamic cap: one endorsement granted per connection on your profile, with no restrictions on their use. Make people consider what their use case is and how to disperse their endorsements in a meaningful way.

Problem 2 – Suggested Skill Box

In a bid to encourage endorsements, LinkedIn displays an endorsement box on all profile pages. It offers four suggestions:

[Image: Who do you want to endorse?]

First, the skills generated in the skill box are not always the skills specified on the relevant user’s profile. The LinkedIn algorithm often suggests skills that are completely irrelevant. Overnight you can become a perceived expert in something you have no experience in at all. My personal favorites this week were Bash and Python, for which I received 3 endorsements each. I have never, in my life, held a developer/engineer position or written a line of Python!

Second, these questions are far too suggestive. Being offered a name and an associated skill is simply leading the user down a path, and as discussed above, this is often the wrong path. “Who would you recommend for Java?” would be a far better question. The user should at least have to think about the skill set and their connections.  Even better, how about limiting the suggested endorsements to those that you have noted as your own skills? Maybe this wouldn’t be enforced if the user were to go to a connection’s page to endorse, as this requires more effort than a skill box endorsement.

Finally, the interface for the endorsement box just encourages inaccuracies. The most prominent (and highlighted) button is, of course, ‘Endorse all 4’. If you close one recommended endorsement, another appears. If you endorse one connection, another will appear. If you ‘endorse all 4’, you can do four more! You can endorse 100 connections in your network for a random skill in less than 30 seconds. Now that’s efficiency. I have been told by my anonymous source that if you continue to ‘endorse all 4’ the suggestions do eventually dry up…

Wrap Up

I’ve had quite a lot of fun on LinkedIn this week, adding new connections and endorsing people I know well, a little, and (I must confess to being part of the system I’m complaining about) really not at all. I’ve had a variety of responses, from thank-you emails and return endorsements, to the dreaded no response (disclaimer: I have since deleted all the extra skills that were created and the related endorsements!).

As attractive as it is to have endorsements coming out of your ears, with the current approach LinkedIn is taking to encourage their use, it is only a matter of time before people disregard what could be an incredibly useful feature. You can see the inevitable happening already – try it.

Why are software downloads so large?

I’ve never understood why software installers are so large. If you can package 100,000 lines of code into 4 MB, why does any program I’m downloading need 50 MB?

When we started developing AetherStore, I wanted to ensure that our installer didn’t turn out this way. But, after one year of development, our latest release is 50 MB… and I’m okay with that. Here’s why:

Don’t Repeat Yourself

A good programmer has the ‘DRY’ mantra firmly engrained in their head — re-writing an algorithm that someone has already written and tested is a waste of time and an easy source of error. But it’s easy to take this too far when you start out on a project.

For example, if you’ve got a collections problem, Google Guava probably has a solution, but it adds 2.5 MB.

Similarly, there are numerous Apache projects that look small at first glance, but actually require a whole other set of dependencies to run correctly. Once you’ve added the commons-io, commons-logging, mail, and activation JARs, you’ve used a few more megabytes in space.

So before you’ve even started writing your own code, you can end up with a 5 MB release. We attempted to avoid this pitfall by not using any external libraries unless they were absolutely necessary. With roughly 35,000 lines of our own code, we had a compact release.

Current size:                Code: 2.5 MB    Total: 2.5 MB

Graphics

As development continued, we started adding graphics and other assets that increased the size of the project. For some programs, graphics take up a significant chunk of space, but because AetherStore relies on the existing Windows Explorer interface, the space needed for these assets is relatively small.

Current size:                Code: 2.5 MB    Graphics: 600 KB          Total: 3.1 MB

Bundling

To create our first full AetherStore release, we bundled this code and these graphics into an installer, along with a handful of JARs and the native libraries needed for file system integration:

Code: 2.5 MB

Graphics: 600 KB

Packaged Libraries: 1.8 MB

Native Libraries: 2 MB

Installer overhead: 600 KB

Total size: 7.2 MB

This is where I’d like our story to have ended, but sadly it doesn’t. While the installer works perfectly, the user experience can be terrible. If a user doesn’t have Java installed on their machine (a prerequisite), the installer provides a link to the Oracle download site. This takes a relatively short and simple installation process and adds a few extra steps that require the user to navigate an entirely separate installation process. Moreover, requiring a hidden 41 MB download (Java SE 7u21, Windows x64) doesn’t make our program any more compact.

The alternative is to abandon our aim of providing a small download and bundle Java with the AetherStore installer. This balloons the size of our installer, but massively improves the user experience. It also means we can use the latest version of Java without worrying about compatibility on user machines.

AetherStore without bundling: 7.2 MB

AetherStore with bundling: 51.2 MB

How Important is Size?

All of this leads to a simple question. In a trade-off between size and simplicity, how much do we care about size?

I’m just about old enough to remember when a 5 KB/s downlink was cause for mass celebration, and a 10 MB download represented a significant commitment. But given the ubiquity of fast internet connections, should the size of your software even factor into development or release decisions?

I’m increasingly, and somewhat reluctantly, starting to think that it shouldn’t. The cost of an extra few megabytes is insignificant compared to the additional cognitive complexity of an extra installation step, or the time required to emulate functionality already found in widely used, well-tested utilities. We can still provide an un-bundled installer, but it won’t be the most common option.

Ultimately, the size of an application isn’t as important as it once was.

The Waiting Game: Fast-Food Queuing Theory

At AetherWorks, we leave the office every day to get lunch from one of the many dining establishments in the vicinity of our Bryant Park HQ, and we probably spend most of this time stuck in line.

Different restaurants offer different queue strategies, and unfortunately, not all queue strategies are created equal. But which of the restaurants offers the most efficient queue, so that I can eat there more frequently and waste less time stuck in line?

This post compares the performance of the queues at our favorite lunch spots using a branch of mathematics known as queuing theory.

Generally we can think of purchasing food in a take-out restaurant as involving a series of workstations, each with a separate task that takes some time (e.g. ordering food, instructing workers, collecting food, paying). These stations are generally attended in sequence, and each station takes some time to process one customer. The sequence of stations is known as a pipeline.

When a customer is waiting to be served by a workstation (because the workstation is still dealing with the customer in front), they are idling, which counts as wasted time. Note that idling is distinct from waiting for a station to complete the task associated with your order. For example, processing your payment constitutes an action toward your food order.

I am interested in the proportion of time wasted relative to the time being served, as opposed to the total amount of time wasted.

There are several properties of a queue that we can measure:

  • Total time in the shop
  • Time spent waiting for service (that is, in queues between workstations)
  • Number of people waiting in each queue
  • Total number of people waiting across the whole system
  • Utilization of workstations

Interestingly, high workstation utilization results in long waiting times for customers. Thus, the queue functions better, from the customer’s perspective, with higher capacity and servers who are not as busy.

If the proportion of time spent in the queue is high, the restaurant could do things better and improve their system. You have to balance queue efficiency, however, with the cognitive abilities of customers — some queuing systems are so complicated that people can have a hard time figuring out where to go next!

Methodology

Take-out restaurants have several different approaches to their queuing systems:

  • Single channel / single station: buying a cupcake
    • Select a cake and pay, all the while dealing with the same staff member
  • Single channel / multiple stations: fast-food drive-through
    • Order
    • Pay/pickup
  • Multi-channel / single station: Toll booth
    • Multiple stations each with a separate queue
  • Multi-channel / multiple stations: Lunch deli
    • Multiple types of food, each from a separate counter
    • Multiple tills, each with a separate queue

Each restaurant will generally follow one of the above approaches, with the workstations arranged such that a customer moves from one queue to the next. Using a set of known equations derived from the behavior of queues, it’s possible to analyze the properties of a given system of queues. In this analysis I use what is known as an M/M/c queue.

An M/M/c queue assumes that the customer arrival rate and job completion rate follow a Poisson distribution, and it also accounts for the number of servers at each station. For example, multiple baristas may be making coffee simultaneously for a single queue of waiting customers.

I won’t go into the detail about the equations here, but you can read the linked articles and look at my code provided. The remainder of this post looks at the queuing systems for several of our favorite restaurants. For each restaurant I will list the stations and their respective processing times. Each station has its own queue and a number of servers associated with it (assume one server unless explicitly stated otherwise).
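
For the curious, the core of the M/M/c math fits in a few lines. Here is a sketch in Java of the Erlang C formula for the probability that an arriving customer must wait, and the resulting mean queue wait; the numbers in main are illustrative only (one arrival every 200 seconds, a 92-second mean service time, three servers, echoing the Starbucks coffee station below):

public class MMcQueue {
    // Erlang C: probability that an arriving customer has to wait,
    // for arrival rate lambda, per-server service rate mu, and c servers.
    static double erlangC(double lambda, double mu, int c) {
        double a = lambda / mu;   // offered load
        double rho = a / c;       // utilization; must be < 1 for a stable queue
        double sum = 0.0;
        double term = 1.0;        // a^k / k!, starting at k = 0
        for (int k = 0; k < c; k++) {
            sum += term;
            term = term * a / (k + 1);
        }
        // After the loop, term == a^c / c!.
        return term / ((1 - rho) * sum + term);
    }

    // Mean time spent waiting in the queue, excluding service itself.
    static double meanWait(double lambda, double mu, int c) {
        return erlangC(lambda, mu, c) / (c * mu - lambda);
    }

    public static void main(String[] args) {
        double lambda = 1.0 / 200.0; // one customer every 200 s
        double mu = 1.0 / 92.0;      // 92 s mean service time per barista
        System.out.printf("P(wait) = %.4f, mean wait = %.2f s%n",
                erlangC(lambda, mu, 3), meanWait(lambda, mu, 3));
    }
}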

Single-pipeline restaurants

Chipotle, Subway, and Starbucks all have queuing systems through which there is only one path. All customers follow the same path.

Chipotle (burrito restaurant)

There is one pipeline, and each customer must wait for the previous customer to finish at a station before progressing to that station.

  • 5 stations:
    • Burrito/bowl/tacos – 5 seconds
    • Rice and meat – 8 seconds
    • Salsa and salad – 12 seconds
    • Wrapping and pricing – 9 seconds
    • Paying – 10 seconds

With this model, a single customer can block the entire chain. This is most noticeable when someone pays by check or cash, holding up everyone behind them.

[Image: Mr. Zaccardo holding up the line.]

Subway (sandwich restaurant)

Subway is similar to Chipotle in that there is a single pipeline. The main difference, however, is the presence of a toaster. Customers can opt to have their sandwich toasted, which adds 30 seconds to the service time; customers who skip the toaster can overtake earlier customers who are waiting for it. I have estimated the probability of using the toaster at 80%.

  • 5 stations:
    • Choice of bread and slicing – 8 seconds
    • Meat and cheese selection – 17 seconds
    • Toaster – 32 seconds – probability of use 80%
    • Salad and dressing – 17 seconds
    • Paying and wrapping – 15 seconds

Starbucks (coffee shop)

Different Starbucks shops use different models, but in this analysis I assume that there is a single employee who operates the till, taking payments, and that they are backed up by three baristas who make coffee in parallel, fulfilling the orders.

  • 2 stations:
    • Order and pay – 22 seconds
    • Wait for coffee – 92 seconds – 3 servers

Multi-pipeline queue system

Some restaurants allow several parallel queues of customers, which includes cases where customers can switch between stations in chains of queues. This means that we have to take into account multiple queues of customers, as opposed to the previous section that had single queues being serviced by multiple employees.

McDonalds (Burger restaurant)[1]

McDonalds provides several queues in parallel, the first for ordering and paying, and the second, an (invisible) station where customers wait while their food is gathered and served. The time it takes to cook the food is accounted for in the time taken to gather the food items.

  • 2 stations:
    • Order and pay – 42 seconds – 5 parallel queues
    • Wait while food is gathered – 102 seconds – 5 servers

Chop’t (Salad restaurant)

Chop’t is a restaurant providing takeout salads. They use the most complicated queuing strategy of all the chains analyzed here. First, customers wait in a line to be placed into one of several queues for workstations that will place ingredients in a bowl. Second, the customer is placed into one of several queues for having their ingredients chopped and a dressing applied. Finally, customers queue in a single line for one of several payment stations[2].

  • 4 Stations:
    • Wait for queue placement – 4 seconds
    • Collect ingredients – 32 seconds – 4 parallel queues
    • Chop ingredients and apply dressing – 47 seconds – 6 parallel queues
    • Pay for salad – 17 seconds – 3 servers

Thoughts/assumptions

In a single-pipeline system, it is good to minimize the time spent at each station. For customers, it is also better when later stations can’t block them from getting their items (e.g. waiting to pay at Chipotle while your food cools down). In this sense, Starbucks has a better approach, since they take your money first. Many places can’t do this, because customers customize their order while in the pipeline: for a simple order like coffee it makes sense to order and pay at the beginning, but this is much harder somewhere like Chop’t.

Theoretical Analysis

To compare the stores with a theoretical analysis we need a common arrival rate of customers, so the results below use an arrival rate of 1 customer every 200 seconds. This is unrealistically low, but if customers arrive faster than they can be served, the queue size tends to infinity and the equations will not produce a result.

The code for this analysis is available at: https://gist.github.com/gbigwood/5304126

Store     | Waiting Time (seconds) | Total Time in System (seconds) | Proportion of Time Wasted | Customers Waiting to be Served
Chipotle  | 2.18                   | 46.18                          | 0.05                      | 0.01
Subway    | 9.41                   | 98.41                          | 0.09                      | 0.04
Starbucks | 2.72                   | 116.72                         | 0.02                      | 0.01
McDonalds | 4.88e-14               | 164.00                         | 2.97e-16                  | 2.439e-16
Chop't    | 3.33                   | 103.33                         | 0.03                      | 0.004

According to this analysis, you can see that Subway wastes the largest proportion of its customers’ time, and that McDonalds is the most efficient.

However, this doesn’t account for cases such as a limited initial queue size (because people won’t queue if the line is too long), or where a customer leaves the restaurant halfway through the pipeline. For this more advanced behavior we need to run a simulation.

Simulation Analysis

The theoretical model is under-powered due to the low arrival rate required for a direct comparison of restaurants — customers arrive at too low a rate to allow for meaningful comparison of waiting times.

A simulation allows us to perform a much deeper analysis of these queues, including stopping customers from entering the shop if the wait time is too long. For this analysis, if the wait time of customers is larger than 6 minutes, customers do not join a shop’s pipeline.
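
To give a flavor of how such a simulation works, here is a minimal sketch (an illustration, not the simulator used for the results below): for a single pipeline with one server per station, FIFO ordering, infinite buffers, and no balking, each customer’s progress can be computed from when each station next becomes free. It uses the Chipotle service times from earlier, fixed rather than randomly drawn for simplicity:

import java.util.Random;

public class PipelineSim {
    public static void main(String[] args) {
        double[] serviceTimes = {5, 8, 12, 9, 10}; // Chipotle's five stations, in seconds
        double meanInterarrival = 5.0;             // one customer every 5 s on average
        int customers = 10_000;
        Random rng = new Random(42);

        double arrival = 0;
        double[] stationFree = new double[serviceTimes.length]; // when each server frees up
        double totalService = 0;
        for (double t : serviceTimes) totalService += t;

        double totalWait = 0;
        for (int i = 0; i < customers; i++) {
            // Exponential interarrival times give Poisson arrivals.
            arrival += -meanInterarrival * Math.log(1 - rng.nextDouble());
            double clock = arrival;
            for (int s = 0; s < serviceTimes.length; s++) {
                double start = Math.max(clock, stationFree[s]); // idle until the server is free
                stationFree[s] = start + serviceTimes[s];
                clock = stationFree[s];
            }
            totalWait += (clock - arrival) - totalService; // time in system minus service time
        }
        System.out.printf("Average wait: %.1f s over %d customers%n",
                totalWait / customers, customers);
    }
}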

I simulated 10,000 customers arriving at the restaurants with the same arrival distribution, assuming an arrival rate of one customer every five seconds[3].

Shop Name | Average Waiting Time (seconds) | Average Service Time (seconds) | Proportion of Time Wasted | Customers Served
Chipotle  | 914.2                          | 41.6                           | 0.96                      | 201
Subway    | 2182.4                         | 77.1                           | 0.97                      | 225
Starbucks | 2367.8                         | 122.5                          | 0.95                      | 169
McDonalds | 1731.8                         | 147.7                          | 0.92                      | 216
Chop't    | 591.1                          | 99.0                           | 0.86                      | 273

In these results the Chop’t queue is the most efficient, as a lower proportion of time is wasted. This is most likely because customers can swap between several queues, rather than being funneled down a single pipeline prone to blocking. In theory, however, a single-queue, multi-server model is the most efficient.

Improvements

With this knowledge, is it possible to improve the queues of any of the test restaurants?

For example, if we assume that the Subway sandwich toaster is blocking customers, what happens if we add more toasters into the pipeline?

Shop Name            | Average Waiting Time (seconds) | Average Service Time (seconds) | Proportion of Time Wasted | Customers Served
Subway, 1 toaster    | 2182.4                         | 77.1                           | 0.97                      | 225
Subway, 2 toasters   | 1814.4                         | 81.9                           | 0.95                      | 201
Subway, 100 toasters | 1885.3                         | 82.57                          | 0.95                      | 246

If we have 100 toasters, the number of customers served increases to 246, but the average waiting time also increases. This happens because the waiting time is calculated when a customer leaves the store: the calculation is equivalent to asking people as they leave, “How long did you spend waiting?” It is not a calculation of the waiting time of people currently in the pipeline, as not all customers are at the same stage within the pipeline. Because the toaster is the largest time sink, it eventually fills to maximum capacity and blocks, increasing waiting times. A higher number of toasters allows more customers to enter the shop before the average waiting time exceeds 6 minutes, meaning these customers end up waiting longer than they otherwise would have with a single toaster. It’s probably not in Subway’s best interest to run the extra toasters for the marginal improvement in customer turnover[4].

Thoughts

The models I have used in this study do not necessarily reflect the actual times taken in the store, but my estimates are based on experience and are roughly proportionate. If you disagree, feel free to play with the numbers in the code and let us know what you find. It’s also worth noting that these models allow large queues to develop at workstations, which is unlikely to occur in real life.

I’d be interested in seeing some meta-analysis on the economics of the potential increase in revenue from introducing additional workstations, given a particular customer arrival rate and the cost of the additional workstations.

As always, please feel free to comment below with suggestions or thoughts on improvements to the system.



[1] Editor’s note: we don’t actually go to McDonalds for lunch.

[2] This can be difficult for customers to understand, so someone is employed to manage customers in the queue!

[3] Each station has an infinite queue, but I simulate a finite number of customers.

[4] They, and the other companies I’ve discussed, have probably all done some variant of this analysis already.

What We Do

If you’re looking at one of our job ads, chances are you want to know more about what we do — what does a software engineering job at AetherWorks actually involve?

We’re currently building AetherStore, a distributed data store. AetherStore runs over the workstations and servers in your organization, harnessing their unused capacity to create a shared, virtual file system. To a user, we appear as any other networked file system, without requiring any extra hardware.

It’s a wonderfully engaging project to work on because we see such a diverse range of activities every day. From the low-level handling of calls from Windows Explorer[1] to the task of breaking up[2], encrypting[3], and dispersing files across the network[4], I’ve covered most of my Computer Science education in some form while working on AetherStore.

Why I Work Here

My own background is in distributed systems (primarily distributed database systems), so the decentralized and fault-tolerant design of AetherStore has obvious appeal. I love the challenge of creating these systems: you are so constrained by the fundamental properties of distribution[5], and yet it’s still possible to create systems that scale to many hundreds of thousands of machines[6].

With AetherStore our challenge is to create a shared file system in which every piece of data doesn’t have to reside on every machine. We want to spread user data across machines in proportion to their capacity, but we have no centralized authority to lean on when deciding how to do this. We don’t even know how many machines should be in the system[7]. It’s brilliantly limiting[8]!

For me, it’s hard to imagine a better job. I love the intellectual challenge of creating a new feature or component, and the satisfaction of crafting it into a complete, stable software product. It’s truly an exciting thing to be a part of.

[Photo of Angus’ desk]
The working environment isn’t bad either!

So, while it’s not easy competing with so many other great New York companies, I think we’ve got a lot to offer. Consider applying!

Our Interviews

My biggest problem with job listings (ours included) is that we specify a set of requirements that invariably turn into clichés, without explaining why we need them or how we test for them. So let’s look at a few, and see why they actually matter more than you might think.

“Software Engineer.”

The job title may seem meaningless, but I love the distinction between software engineers and programmers. We want to know that you craft code to a high standard, and that you understand why ‘it just works’ isn’t enough.

In an interview we’ll ask you to review some (bad) code we’ve written, to gauge your code literacy. We’re looking for someone who has an appreciation for clean code and a quick eye for bugs.

“A solid understanding of object-oriented programming.”

We’re building a complex system and we need to make sure that you’re the type of person who can structure code in a logical and maintainable way. We’ll ask you to do a short programming assignment to get a feel for your general abilities and experience.

“Fundamental Computer Science Background.”

The work I’ve described in the previous section is challenging: it requires that you know the relative efficiency of, say, a linked list and an array, but also that you’re capable of creating your own data structures from time to time. For us, the best indicator of this skill set is an undergraduate degree in Computer Science. In an interview we’ll ask you an algorithmic question that gives you a chance to demonstrate the breadth of your knowledge.

If you do well in these questions, we’ll invite you in for a longer interview and ask you to solve a real problem that we’re actually working on in the office.

To Apply

If the idea of working at AetherWorks appeals to you, I’d urge you to check out our available positions. Alternatively, if you have any questions about this post or our interviews, please feel free to email me (first initial, last name[9]).

 


[1] And I mean every single call. Every time you open a directory, right-click on a file, or save a document, Windows Explorer is providing us with a constant stream of calls asking for information and telling us what to update. Since we’re pretending to be a network mount, we have to handle each of these calls, giving responses either from the local machine or a remote copy. This fascinates me more than it probably should, but it gives you some brief insight into the complexity of the operating systems we use every day without thought.

[2] When you store a file we break it up into chunks, both to make it easier to spread data across the network and to increase de-duplication. There are entire lines of research dedicated to finding ways of doing this efficiently. Content-based chunking, in particular, makes some really clever use of hashing (fingerprinting) algorithms and sliding windows, which dramatically improves de-duplication.
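As a rough sketch of the idea (this is not AetherStore’s actual algorithm, and the window size, boundary mask, and hash parameters are all made-up), a rolling hash slides over the file and a chunk boundary is cut wherever the hash matches a fixed bit pattern, so identical content produces identical chunks regardless of where it sits in the file:

```python
# Illustrative content-defined chunking with a polynomial rolling hash.
# WINDOW, MASK, BASE, and MOD are assumptions, not AetherStore's values.
WINDOW = 48                        # bytes in the sliding window
MASK = (1 << 13) - 1               # boundary pattern: ~8 KiB average chunks
BASE, MOD = 257, (1 << 61) - 1     # polynomial base and a Mersenne prime

def chunks(data: bytes):
    """Yield chunks, cutting wherever the window hash matches MASK."""
    h, start = 0, 0
    pow_out = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:                    # slide: drop the oldest byte
            h = (h - data[i - WINDOW] * pow_out) % MOD
        h = (h * BASE + b) % MOD           # slide: add the newest byte
        if (h & MASK) == MASK:             # boundary chosen by content, not offset
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]                 # final partial chunk

pieces = list(chunks(b"The quick brown fox jumps over the lazy dog. " * 2000))
print(len(pieces), "chunks")
```

Real implementations also enforce minimum and maximum chunk sizes; this sketch omits that for brevity.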

[3] We have to encrypt data at rest and in transit, which is more challenging for us than for most systems, where a central authoritative server is available. Without one, our encryption architecture represents a trade-off between security and usability.

[4] Deciding where to place data is particularly challenging, since we don’t have a central coordinator to make this decision. All machines must agree on where data is placed (so that it can be accessed later), but it is expensive to let them coordinate to make this happen.
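One classic technique for this kind of coordination-free agreement is consistent hashing, where every machine derives the same placement from a piece of data’s key alone. The sketch below is purely illustrative (the node names are hypothetical, and this is not necessarily how AetherStore places data):

```python
# Illustrative consistent-hashing ring (not AetherStore's placement logic).
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets several virtual points, smoothing the load.
        self._points = sorted(
            (_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def owner(self, data_key: str) -> str:
        """The node responsible for data_key: the first point clockwise."""
        i = bisect.bisect(self._keys, _hash(data_key)) % len(self._keys)
        return self._points[i][1]

# Every machine builds the same ring from the same membership list, so they
# all agree on placement without talking to each other.
ring = Ring(["alpha", "beta", "gamma"])    # hypothetical node names
print(ring.owner("chunk-42"))
```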

[5] Constraints can be catalysts for creativity.

[7] Since we don’t know how many machines are in the system at any given time, we can’t use distributed consensus protocols such as Paxos. These require that a majority of nodes agree on a decision, but if you don’t know how many nodes exist, you don’t know how many nodes form a majority.

[8] The CAP theorem is my favorite (trivial) example of this. Imagine you have three machines, each holding a copy of some data. How do you handle an update to that data?

How we handle this update (and anything else in distributed systems) is determined by our response to network partitions, where one set of machines is unable to contact another. If we use a central lock manager to prevent conflicting updates, we ensure that the data is consistent, but it will be unavailable whenever the lock manager cannot be contacted. If we use a majority consensus protocol, we can update our data during a partition, but only from the partition containing a majority of nodes. If neither of these is acceptable, we can do away with consistency altogether, allowing updates to each individual copy even when the others are inaccessible. The fundamental properties of a distributed system limit us in each of these options; it’s up to us to decide which is the most appropriate in any given case.

[9] This is our interview puzzle question!