Zero-Conf: Bootstrapping the Network Layer

When you connect a printer to your local network, how does your computer find it?

It has to be addressable, which means it needs an IP address (and ideally a hostname), and it needs to be discoverable so that you can find it from your computer. When that works, you see a window like this:

Searching for a Printer

These tasks are covered by zero-configuration networking (zero-conf), which describes how to make this work even when the DHCP and DNS servers that would normally assign IP addresses and hostnames are not available.

The zero-conf specification covers three broad goals, which I discuss in this post:

  1. Allow devices to obtain an IP address even in the absence of a DHCP server.
  2. Allow devices to obtain a hostname, even in the absence of a DNS server.
  3. Allow devices to advertise and search for (discover) services on the local link.

Obtaining an IP Address

For a device to route messages on a network, it needs an IP address. This is relatively trivial if a DHCP server is available, but when this isn’t the case, zero-conf uses link-local addressing to obtain one.

The address assigned by link-local addressing is in the 169.254.0.0/16 block, which is only useful within the link (because routers won’t forward packets from devices in this address space[1]).

To obtain an address the device sends ARP requests to establish whether its desired IP address (chosen at random from the 169.254 range[2]) is available. This is a two-step process:

  1. First, an ARP probe is sent asking for the MAC address of the machine with a given IP. This probe is sent a number of times to confirm that the given IP address is not in use (there will be no response if it isn’t).
  2. If no reply is received, the device sends out an ARP announcement saying that it is now the machine with the given IP.

There are various protocols for conflict resolution of addresses that I won’t discuss here[3].
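To make the probe-and-announce flow concrete, here is a rough sketch in Java. The ARP calls are hypothetical helpers (standard Java cannot send raw ARP packets, so a real implementation would sit at a lower level); only the control flow mirrors the two steps above.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Random;

// Sketch of link-local address selection. sendArpProbe() and sendArpAnnouncement()
// are hypothetical helpers standing in for raw ARP packets, which plain Java cannot send.
public class LinkLocalSketch {

    private static final Random RANDOM = new Random();

    public static InetAddress claimAddress() throws UnknownHostException {
        while (true) {
            InetAddress candidate = randomLinkLocalAddress();

            // 1. Probe several times; no reply means nobody is using the address.
            boolean inUse = false;
            for (int i = 0; i < 3 && !inUse; i++) {
                inUse = sendArpProbe(candidate); // true if another host answers
            }

            // 2. If unused, announce that we are now the machine with this address.
            if (!inUse) {
                sendArpAnnouncement(candidate);
                return candidate;
            }
            // Otherwise pick another candidate and try again.
        }
    }

    private static InetAddress randomLinkLocalAddress() throws UnknownHostException {
        // Pick a host at random from the 169.254.0.0/16 block.
        return InetAddress.getByName("169.254." + RANDOM.nextInt(256) + "." + RANDOM.nextInt(256));
    }

    // Hypothetical: send an ARP probe and report whether anyone replied.
    private static boolean sendArpProbe(InetAddress candidate) { return false; }

    // Hypothetical: broadcast an ARP announcement claiming the address.
    private static void sendArpAnnouncement(InetAddress candidate) { }
}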

Obtaining a Hostname

Once a device has an IP address it can be contacted through the network, but IP addresses are volatile, and for the device to be consistently discoverable it needs a hostname (which is less likely to change). Assigning a hostname is simple if you have a DNS server, but, if you don't, zero-conf can assign hostnames using Multicast DNS.

Multicast DNS (mDNS) uses IP multicast to send broadcasts on the local link (something we wrote about recently).

To claim a local hostname with mDNS, a device sends DNS messages to probe for the uniqueness of the hostname[4]. Three queries are sent at 250ms intervals, and if no other device reports that it is using the hostname, the requesting device then sends an announcement message to claim ownership.

In the case of a conflict (where two devices believe they own the same hostname), lexicographic ordering is used to determine a winner.
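A rough sketch of this probe-and-announce procedure is below. The multicast send/receive calls are hypothetical helpers; a real implementation would use an mDNS library such as Bonjour's dns_sd or the open-source JmDNS.

import java.util.concurrent.TimeUnit;

// Sketch of the mDNS hostname-claim procedure described above. The multicast
// send/receive calls are hypothetical helpers standing in for real mDNS messages.
public class MdnsHostnameSketch {

    public static boolean claimHostname(String hostname) throws InterruptedException {
        // Probe three times, 250ms apart, asking whether anyone already owns this name.
        for (int i = 0; i < 3; i++) {
            sendMdnsQuery(hostname + ".local");
            TimeUnit.MILLISECONDS.sleep(250);
            if (responseReceived()) {
                // Conflict: lexicographic ordering of the competing records decides
                // the winner in a real implementation; here we simply give up.
                return false;
            }
        }

        // No responses: announce ownership of the name.
        sendMdnsAnnouncement(hostname + ".local");
        return true;
    }

    // Hypothetical helpers standing in for real multicast DNS messages.
    private static void sendMdnsQuery(String name) { }
    private static boolean responseReceived() { return false; }
    private static void sendMdnsAnnouncement(String name) { }
}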

mDNS hostnames must use a local pseudo-top-level domain, rather than a global one (.com, .org, .gov, etc.), to distinguish them from globally accessible hosts. Apple devices and many others use the .local domain.

Browsing for Services

Once a device has an IP address and hostname, it can be contacted, but only if we know the name of the specific device we are looking for. If we don’t, DNS Service Discovery (also known as DNS-SD) can be used to search for services available on the local link. Rather than looking for a printer called ‘jose’, we can look for all devices supporting a print protocol (and then select ‘jose,’ if available).

What this looks like

Services are advertised in the form:

ServiceType.Domain

For example, _ipp._tcp.example.com advertises devices supporting the Internet Printing Protocol (over TCP) in the example.com domain.

Individual services are identified by a specific instance name:

InstanceName.ServiceType.Domain

So our printer, ‘jose,’ would be identified as:

jose._ipp._tcp.local[5]

How this works

DNS-SD uses mDNS to announce the presence of services. To achieve this, it uses DNS PTR and SRV records.

PTR records  (pointer records) are used in lookups[6] to search for a specific service type (_ipp._tcp.local in the above example) and return the name of the SRV record for each service supporting the specified protocol (there is one PTR record for each SRV record).

A query for _ipp._tcp.local will return jose._ipp._tcp.local and all other printers supporting the IPP protocol in the local domain.

SRV records (service records) record the protocol a service supports and the hostname of the device providing it. If the PTR record is used to find devices supporting a protocol, the SRV record is used to find a specific device's hostname. For the printer 'jose,' the SRV record would contain the service type and domain, and the hostname for the printer itself:

_ipp._tcp.local | jose.local

At this point we have discovered and can connect to our printer.
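As an illustration of how little code this takes, here is a browse-and-resolve sketch using the open-source JmDNS library (a different implementation from the dns_sd.jar wrapper mentioned below; the ten-second browse window is arbitrary):

import java.io.IOException;
import javax.jmdns.JmDNS;
import javax.jmdns.ServiceEvent;
import javax.jmdns.ServiceListener;

// Sketch of browsing for IPP printers on the local link with JmDNS.
public class PrinterBrowser {

    public static void main(String[] args) throws IOException, InterruptedException {
        JmDNS jmdns = JmDNS.create();

        // The PTR query: ask for every instance of _ipp._tcp on the local link.
        jmdns.addServiceListener("_ipp._tcp.local.", new ServiceListener() {
            public void serviceAdded(ServiceEvent event) {
                // A PTR record was seen (e.g. "jose"); ask for its SRV/TXT details.
                jmdns.requestServiceInfo(event.getType(), event.getName());
            }

            public void serviceResolved(ServiceEvent event) {
                // SRV record resolved: the printer's hostname and port.
                System.out.println(event.getName() + " -> "
                        + event.getInfo().getServer() + ":" + event.getInfo().getPort());
            }

            public void serviceRemoved(ServiceEvent event) {
                System.out.println(event.getName() + " is gone");
            }
        });

        Thread.sleep(10_000); // browse for ten seconds
        jmdns.close();
    }
}

The serviceAdded callback fires once per PTR record found, and serviceResolved supplies the SRV data (hostname and port) for a specific instance.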

In addition to this, there are various extensions to zero-conf that I don’t describe here. These include:

  • TXT records, which allow extra attributes to be recorded in the service announcement (for example, extra information needed to make a proper connection to a service).
  • Subtypes, which allow the same service to advertise different levels or types of support within a service.
  • Flagship Service Types, which enable applications to determine when the same device is supporting and announcing multiple protocols. This makes it possible to remove duplicate entries in a listing, where a device supports multiple protocols that perform the same function.

Implementations

The most commonly used implementation of zero-conf (or at least the most written about) is Apple’s Bonjour.

We have used this library as the basis for our own implementation inside AetherStore, linking in with the provided dns_sd.jar wrapper. There are various native implementations that I haven’t yet tried.

If you’d like to read more on zero-configuration, I’d recommend the O’Reilly Zero Configuration Networking book, which provides an exhaustive description of everything I’ve touched on in this post.

Other sources are included below:


[1] This is good because it stops local information from potentially clogging upstream networks.

[2] Unless it has already been on this network, in which case it uses its previously chosen address.

[3] Edgar Danielyan discusses this in an article in The Internet Protocol Journal.

[4] This is the same process as with regular unicast DNS, but with some optimizations. For example, individual queries can include multiple requests and can receive multiple responses, to limit the number of multicast requests being made.

Unicast DNS packets are limited to 512 bytes, whereas mDNS packets can be up to 9,000 bytes (though only 1,472 bytes can be transmitted without fragmenting the message into multiple Ethernet packets).

[5] The _tcp label specifies that the service runs over TCP (rather than UDP).

[6] PTR records are similar to CNAME records but they only return the name, rather than performing further processing (source).

Distributed Consensus: Paxos by Example

In distributed systems, ensuring a set of machines are able to reach consensus on a decision is extremely important. Database systems, for example, need to ensure that all machines agree whether to commit or rollback a transaction, so that each machine holds a consistent view of its data (imagine the problems your bank would have if one machine thought you had $1000 in your account, but another thought that you had $0).

Consensus is difficult to achieve, because messages between machines can be lost or indefinitely delayed, or the machines themselves can fail — how does one machine know whether another machine has processed a message?

Two-phase commit is commonly used to provide consensus, but it suffers from a single point of failure — if the node coordinating a transaction fails, the system must block until this node restarts. The three-phase commit protocol removes the blocking problem, but is still reliant on a single coordinator.

This post discusses Paxos, a distributed alternative to two- and three-phase commit. Paxos guarantees that nodes will only ever choose a single value (meaning it guarantees safety), but does not guarantee that a value will be chosen if a majority of nodes are unavailable (progress). Consensus algorithms aim for safety because it doesn’t matter whether we commit or rollback — neither is more correct than the other — but it is of critical importance that only one answer is chosen.

General Approach

A Paxos node can take on any or all of three roles: proposer, acceptor, and learner. A proposer proposes a value that it wants agreement upon. It does this by sending a proposal containing a value to the set of all acceptors, which decide whether to accept the value. Each acceptor chooses a value independently — it may receive multiple proposals, each from a different proposer — and sends its decision to learners, which determine whether any value has been accepted. For a value to be accepted by Paxos, a majority of acceptors must choose the same value. In practice, a single node may take on many or all of these roles, but in the examples in this section each role is run on a separate node, as illustrated below.

Figure 1: Basic Paxos architecture. A number of proposers make proposals to acceptors. When an acceptor accepts a value it sends the result to learner nodes.

Paxos By Example

In the standard Paxos algorithm proposers send two types of messages to acceptors: prepare and accept requests. In the first stage of this algorithm a proposer sends a prepare request to each acceptor containing a proposed value, v, and a proposal number, n. The proposed value can be commit or rollback, as in the previous database example, or it can be any other arbitrary value. In this post the Paxos nodes are attempting to reach consensus on an integer value. Each proposer's proposal numbers must be positive, monotonically increasing natural numbers, unique with respect to other proposers' proposal numbers[1].
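The algorithm itself does not prescribe how these numbers are generated[1]; one common scheme, sketched below purely as an illustration, interleaves a per-proposer counter with a unique proposer ID:

// One common scheme (not prescribed by Paxos itself) for generating proposal
// numbers that are unique across proposers and increasing at each proposer:
// interleave a local round counter with the proposer's unique ID.
public class ProposalNumbers {

    private final int proposerId;      // unique per proposer, 0 <= proposerId < numProposers
    private final int numProposers;    // total number of proposers in the system
    private long round = 0;

    public ProposalNumbers(int proposerId, int numProposers) {
        this.proposerId = proposerId;
        this.numProposers = numProposers;
    }

    // Each call returns a number no other proposer will ever generate.
    public long next() {
        round++;
        return round * numProposers + proposerId;
    }
}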

In the example illustrated below, there are two proposers, both making prepare requests. The request from proposer A reaches acceptors X and Y before the request from proposer B, but the request from proposer B reaches acceptor Z first.

Figure 2: Paxos. Proposers A and B each send prepare requests to every acceptor. In this example proposer A’s request reaches acceptors X and Y first, and proposer B’s request reaches acceptor Z first.

If the acceptor receiving a prepare request has not seen another proposal, the acceptor responds with a prepare response which promises never to accept another proposal with a lower proposal number. This is illustrated in Figure 3 below, which shows the responses from each acceptor to the first prepare request they receive.

Figure 3: Paxos. Each acceptor responds to the first prepare request message that it receives.

Eventually, acceptor Z receives proposer A’s request[2], and acceptors X and Y receive proposer B’s request. If the acceptor has already seen a request with a higher proposal number, the prepare request is ignored, as is the case with proposer A’s request to acceptor Z. If the acceptor has not seen a higher numbered request, it again promises to ignore any requests with lower proposal numbers, and sends back the previous highest proposal number that it has seen along with the value of that proposal. This is the case with proposer B’s request to acceptors X and Y, as illustrated below:

Figure 4: Paxos. Acceptor Z ignores proposer A’s request because it has already seen a higher numbered proposal (4 > 2). Acceptors X and Y respond to proposer B’s request with the previous highest request that they acknowledged, and a promise to ignore any lower numbered proposals.

Once a proposer has received prepare responses from a majority of acceptors it can issue an accept request. Since proposer A only received responses indicating that there were no previous proposals, it sends an accept request to every acceptor with the same proposal number and value as its initial proposal (n=2, v=8). However, these requests are ignored by every acceptor because they have all promised not to accept requests with a proposal number lower than 4 (in response to the prepare request from proposer B).

Proposer B sends an accept request to each acceptor containing the proposal number it previously used (n=4) and the value associated with the highest proposal number among the prepare response messages it received (v=8)[3]. Note that this is not the value that proposer B initially proposed, but the highest value from the prepare response messages it saw.

Figure 5: Paxos. Proposer B sends an accept request to each acceptor, with its previous proposal number (4), and the value of the highest numbered proposal it has seen (8, from [n=2, v=8]).

If an acceptor receives an accept request with a proposal number greater than or equal to the highest it has already promised, it accepts the value and sends a notification to every learner node. A value is chosen by the Paxos algorithm when a learner discovers that a majority of acceptors have accepted a value, as is illustrated below:

Figure 6.

Once a value has been chosen by Paxos, further communication with other proposers cannot change this value. If another proposer, proposer C, sends a prepare request with a higher proposal number than has previously been seen, and a different value (for example, n=6, v=7), each acceptor responds with the previous highest proposal (n=4, v=8). This requires proposer C to send an accept request containing [n=6, v=8], which only confirms the value that has already been chosen. Furthermore, if some minority of acceptors have not yet chosen a value, this process ensures that they eventually reach consensus on the same value.
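The acceptor's side of these rules is compact enough to sketch directly. The class below is illustrative only (message transport, learners, and persistence are all omitted):

// Illustrative sketch of a Paxos acceptor's state machine, following the rules
// described above. Networking, persistence, and the proposer/learner sides are omitted.
public class Acceptor {

    private long highestPromised = -1;  // highest proposal number we have promised to honour
    private long acceptedNumber = -1;   // number of the proposal we last accepted, if any
    private Integer acceptedValue = null;

    // Phase 1: prepare request with proposal number n.
    // Returns null if the request is ignored, otherwise a promise carrying any
    // previously accepted proposal (so the proposer can adopt its value).
    public synchronized Promise prepare(long n) {
        if (n <= highestPromised) {
            return null; // already promised a higher-numbered proposal
        }
        highestPromised = n;
        return new Promise(acceptedNumber, acceptedValue);
    }

    // Phase 2: accept request [n, v]. Accepted unless we have promised a higher number.
    public synchronized boolean accept(long n, int v) {
        if (n < highestPromised) {
            return false;
        }
        highestPromised = n;
        acceptedNumber = n;
        acceptedValue = v;
        return true; // in a full implementation, notify the learners here
    }

    public static class Promise {
        public final long previousNumber;   // -1 if nothing accepted yet
        public final Integer previousValue; // null if nothing accepted yet

        Promise(long previousNumber, Integer previousValue) {
            this.previousNumber = previousNumber;
            this.previousValue = previousValue;
        }
    }
}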

Various efficiency improvements to the standard Paxos algorithm are discussed in the papers by Lamport and by Baker et al. For example, a prepare request is not necessary if the proposer knows that it is the first to suggest a value. The proposal for such a request is numbered 0, so that it will be ignored if any higher numbered requests have been received.

General Utility

Paxos has often been criticized for its complexity, particularly with respect to the challenge of turning the algorithm into a working implementation. In spite of this, it's an interesting example of a particularly challenging distributed systems problem, and a clever, conceptually clean solution.

If you’re interested in this topic, i’d recommend reading about real-world examples of Paxos and two- and three-phase commit, starting with the references below.

This post is an updated version of one that first appeared on angusmacdonald.me. It was in turn based on work done during my Ph.D.

References

L. Lamport, “Paxos Made Simple” in ACM SIGACT News, vol. 32, no. 4, pp. 18–25, 2001.

J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, and J.-M. Léon, “Megastore: Providing Scalable, Highly Available Storage for Interactive Services”, in Proceedings of the Conference on Innovative Data Systems Research (CIDR), pp. 223–234, 2011.

T. D. Chandra, R. Griesemer, and J. Redstone, “Paxos made live: an engineering perspective”, in Proceedings of the twenty-sixth annual ACM Symposium on Principles of Distributed Computing, 2007, pp. 398–407.

M. Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems”, in Proceedings of OSDI 2006.


[1] The method of ensuring the uniqueness of proposal numbers when there are multiple proposers is not specified in the Paxos algorithm itself.

[2] It may not, but the algorithm is resilient to this.

[3] Note that this is the highest proposal number that it received from prepare response messages. In this example, proposer B has a higher numbered proposal (n=4) than proposer A (n=2), but it has only received proposer A’s proposal in response to its prepare request. If no previous proposals were returned by the prepare response messages, proposer B would use its own proposal (n=4).

Why are software downloads so large?

I’ve never understood why software installers are so large. If you can package 100,000 lines of code into 4 Mb, why does any program I’m downloading need 50 Mb?

When we started developing AetherStore, I wanted to ensure that our installer didn’t turn out this way. But, after one year of development, our latest release is 50 MB… and I’m okay with that. Here’s why:

Don’t Repeat Yourself

A good programmer has the ‘DRY’ mantra firmly ingrained in their head — rewriting an algorithm that someone has already written and tested is a waste of time and an easy source of error. But it’s easy to take this too far when you start out on a project.

For example, if you’ve got a collections problem, Google Guava probably has a solution, but it adds 2.5 Mb.

Similarly, there are numerous Apache projects that look small at first glance, but actually require a whole other set of dependencies to run correctly. Once you’ve added the commons-io, commons-logging, mail, and activation JARs, you’ve used a few more megabytes of space.

So before you’ve even started writing your own code, you can end up with a 5 Mb release. We attempted to avoid this pitfall by not using any external libraries unless they were absolutely necessary. With roughly 35,000 lines of our own code, we had a compact release.

Current size:                Code: 2.5 MB    Total: 2.5 MB

Graphics

As development continued, we started adding graphics and other assets that increased the size of the project. For some programs, graphics take up a significant chunk of space, but because AetherStore relies on the existing Windows Explorer interface, the space needed for these assets is relatively small.

Current size:                Code: 2.5 MB    Graphics: 600 KB          Total: 3.1 MB

Bundling

To create our first full AetherStore release, we bundled this code and graphics into an installer, along with a handful of JARs and the native libraries needed for file system integration:

Code: 2.5 MB

Graphics: 600 KB

Packaged Libraries: 1.8 MB

Native Libraries: 2 MB

Installer overhead: 600 KB

Total size: 7.2 MB

This is where I’d like our story to have ended, but sadly it doesn’t.  While the installer works perfectly, the user experience can be terrible. If a user doesn’t have Java installed on their machine (a pre-requisite), the installer provides a link to the Oracle download site. This takes a relatively short and simple installation process and adds a few extra steps that require the user to navigate an entirely separate installation process. Moreover, requiring a hidden 41 Mb download (Java SE 7u21, Windows x64) doesn’t make our program any more compact.

The alternative is to abandon our aim of providing a small download and bundle Java with the AetherStore installer. This balloons the size of our installer, but massively improves the user experience. It also means we can use the latest version of Java without worrying about compatibility on user machines.

AetherStore without bundling: 7.2 MB

AetherStore with bundling: 51.2 MB

How Important is Size?

All of this leads to a simple question. In a trade-off between size and simplicity, how much do we care about size?

I’m just about old enough to remember when a 5 Kb/s down link was cause for mass celebration, and a 10 Mb download represented a significant commitment. But given the ubiquity of fast internet connections, should the size of your software even factor into any development or release decisions?

I’m increasingly, and somewhat reluctantly, starting to think that it shouldn’t  The cost of having an extra few megabytes is insignificant compared to the additional cognitive complexity of an extra installation step, or the time required to emulate functionality already found in widely used, well tested utilities. We can still provide an un-bundled installer, but it won’t be the most common option.

Ultimately, the size of an application isn’t as important as it once was.

Plan For Disruptions: Networking in a Disconnected World

Modern networked applications are generally developed under the assumption of ubiquitous, high availability networks to afford communication between computing devices. This assumption is based on the tenets that all nodes in the network are addressable at all times, and all nodes in the network are contactable at all times. But, what if we consider a network environment where not all present devices are contactable? What can we learn from building software that operates in a low-availability environment?

The word “networking” can refer both to a set of communicating computing devices (a TCP/IP computer network) and to a set of people meeting to build interpersonal connections (an NYC start-up networking event, for example).

If we frame these two concepts together, we can gain a better understanding of how to build applications that tolerate low-availability environments.

We should consider the properties of each of these networking concepts. What makes them similar? What makes them different? In the typical office environment, desktop machines are always plugged into the office backbone network, which is always present and always has internet access through the office’s connection. Regular computing networks are designed to afford communication in such an environment. IP/Ethernet addresses of network members are assumed to resolve to a machine, and we assume that machines will not change Ethernet addresses (generally they do not change IP addresses either). All functional machines, therefore, are assumed to be addressable and contactable, at all times.

This is clearly quite a simplistic example. We can, however, contrast it with the human networking example. In an NYC start-up event, we consider the communication network to be human discussion. When the attendees are mingling over drinks, they are moving in and out of audible range of one another. Not everyone present at the event is immediately contactable by every other attendee, despite being in the same room (network), because not all attendees are part of the same group conversation. All attendees are addressable, however, because everyone is wearing their name badge!

I like to think of these types of networking as lying at opposite ends of a spectrum. At one end we have a “solid” networking state, where computers do not change location, routing paths are assumed to be static (or effectively static), and all connected machines are assumed to be addressable and contactable by all other machines on the network.

At the other end of the spectrum we have a sort of “gaseous” network where meet-up attendees are in small, disparate networks.  Members are available to communicate locally, but are aware of all other attendees who, while they cannot be communicated with at present, are addressable and assumed to be contactable at some point in the future (as attendees mingle).[i]

Most of my work in academia focused on routing protocols designed for these low-connectivity environments, similar to the human tech meet-up. In these types of networks, a node may be aware of the address of the device that it is attempting to contact, but there may be no routing path for communication. In this case, a full path will never be available, and therefore the node will never successfully communicate with the intended destination. Nodes must therefore pass messages through intermediate nodes that store and forward messages on each other’s behalf. These types of networks are called opportunistic networks, as nodes may pass messages opportunistically whenever the opportunity to do so arises.[ii]

A good example of an opportunistic network would be an SMS network of everyone in NYC, where messages could only be passed between phones using Bluetooth, with messages forwarded on one another’s behalf as soon as phones came into range. By exploiting six degrees of separation, messages could travel throughout the city, producing an approximation of a free SMS delivery network, albeit a rather slow one!

Manhattan provides a great location for an opportunistic network due to its high population density and small geographic area.

It’s not hard for me to see the link between this and my new job at AetherWorks, where we develop distributed application software. Consider AetherStore, our distributed peer-to-peer storage platform that presents the free disk space of multiple machines as a single address space. Data can be written to any subset of machines, and the data and backups are automatically synced and managed by the nodes themselves. Like most modern software, it is designed to operate in a heterogeneous network environment.

AetherStore uses a decentralized network where nodes may join and leave at any time, so it operates in a problem space at the convergence of well-connected TCP/IP networks and a disconnection-prone environment. Consider a customer using AetherStore to sync files between two desktop machines and a laptop at their office.  They may wish to take their laptop out of the office and work on the stored files in a park.  If someone else in the office decides to modify “the big presentation.ppt” simultaneously with our user in the park, when the two devices sync there will undoubtedly be conflicts.

This synchronization may seem like a trivial problem, but it is not. Time stamps cannot be trusted. How do you know which machine has the correct time? Furthermore, how do you know how to construct the differences between the files? One way to quickly determine if conflicts are present is to construct a tree of changes to files, similar to the approach of modern Distributed Version Control Software (e.g. Git, Mercurial). These change-trees are then compared when machines can once again communicate. In our example of the laptop taken to the park, we can immediately draw some parallels to our disconnected network environment. The two nodes in the office and the laptop in the park are part of the same AetherStore network, and yet not all nodes are contactable or addressable by all other nodes.
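As a loose illustration of the change-tree idea (not AetherStore’s actual data structures), one way to detect a conflict is to record each file’s history as an ordered list of version identifiers and check whether one machine’s history is simply a prefix of the other’s:

import java.util.List;

// Loose illustration of detecting a conflict between two update histories for the
// same file (not AetherStore's actual data structures). Each history is an ordered
// list of version IDs; if neither history is a prefix of the other, the machines
// made divergent changes and the conflict must be resolved.
public class ChangeHistory {

    public static boolean inConflict(List<String> historyA, List<String> historyB) {
        return !isPrefix(historyA, historyB) && !isPrefix(historyB, historyA);
    }

    private static boolean isPrefix(List<String> shorter, List<String> longer) {
        if (shorter.size() > longer.size()) {
            return false;
        }
        return longer.subList(0, shorter.size()).equals(shorter);
    }

    public static void main(String[] args) {
        List<String> office = List.of("v1", "v2", "v3"); // edited in the office
        List<String> laptop = List.of("v1", "v2", "v4"); // edited in the park
        System.out.println(inConflict(office, laptop));  // true: divergent histories
    }
}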

By building a system that can handle the difficulty of the disconnected environment, a state that may not occur often but must be accounted for, we are necessarily able to cope with the well-connected, high-availability network environment.

I present no answers here. I will, however, leave you with a few questions:

  • Should nodes queue change-trees to be exchanged between devices when they meet? How big should we set this limit? Can we discard old change sets?
  • Should certain machines always defer to other machines’ versions of files when conflicts occur?
  • Can we develop a system in which timings of changes can be trusted?
  • Should we take human factors into account? Should we consider employee hierarchy? Is it a good idea to always accept your boss’s changes?
  • Is there a “plasma” state for networks somewhere past “gaseous” on my spectrum? I may not know who is going to turn up (non-addressable), or for how long they will be part of the network (address lifetime).
  • Should we allow users and addresses to be separated?
  • Perhaps we could predict addresses that might be usable in future on such a network?
  • Should we use address ranges instead of unique addresses?
  • Can nodes share addresses for certain points in time?[iii]

 


[i] Somewhere in between solid and gaseous networks we have ‘liquid’ state Mobile Ad hoc NETworks (MANETs). In a MANET the routing paths between nodes may change frequently, but all nodes are addressable and contactable at any given point.
http://en.wikipedia.org/wiki/Mobile_ad_hoc_network

[ii] Note the distinction between opportunistic networks and Delay Tolerant Networks. A DTN may assume some periodicity to connections, which gives rise to a different set of routing algorithms. http://www.dtnrg.org/

[iii] For discussion of phase transitions in opportunistic networks: http://dx.doi.org/10.1145/1409985.1409999

Fast Reads in a Replicated Storage System

People don’t like slow systems. Users don’t want to wait to read or write to their files, but they still need guarantees that their data is backed up and available. This is a fundamental trade-off in any distributed system.

For a storage system like AetherStore this is especially true, so we attach particular importance to speed. More formally, we have the following goals:

  1. Perform writes as fast as your local disk will allow.
  2. Perform reads as fast as your local disk will allow.
  3. Ensure there are at least N backups of every file.

With these goals the most obvious solution is to back up all data to every machine so that reads and writes are always local. This, however, effectively limits our storage capacity to the size of the smallest machine (which is unacceptable), so we’ll add another goal.

  4. N can be less than the total number of machines.

Here we have a problem. If we don’t have data stored locally, how can we achieve reads and writes as fast as a local drive? If we store data on every machine how do we allow N to be less than that? With AetherStore, goal 4 is a requirement, so we have to achieve this while attempting to meet goals 1 & 2 as much as possible.

This proves to be relatively trivial for writes, but harder for reads.

Write Fast

When you write to the AetherStore drive your file is first written to the local disk, at which point we indicate successful completion of the write — the operating system returns control to the writing application. This allows for the write speed of the local disk.

AetherStore then asynchronously backs up the file to a number of other machines in the network, providing the redundancy of networked storage.

Files are written locally, then asynchronously backed up to remote machines.

It is, therefore, possible for users to write conflicting updates to files, a topic touched on in a previous post.
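In outline, the write path looks something like the sketch below; the disk and network calls are hypothetical helpers rather than AetherStore’s real API:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the write path described above. The write returns as soon as the
// local copy is on disk; replication to other machines happens in the background.
public class WritePath {

    private final ExecutorService replicator = Executors.newSingleThreadExecutor();

    public void write(String path, byte[] data) {
        writeToLocalDisk(path, data);                           // synchronous: local-disk speed
        replicator.submit(() -> replicateToPeers(path, data));  // asynchronous backup
        // Control returns to the calling application here.
    }

    // Hypothetical helpers standing in for the real disk and network code.
    private void writeToLocalDisk(String path, byte[] data) { }
    private void replicateToPeers(String path, byte[] data) { }
}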

Read Fast

Ensuring that reads are fast is much more challenging, because the data you need may not be stored locally. To ensure that this data is available on the local machine as often as possible, we pre-allocate space for AetherStore and use it to aggressively cache data beyond our N copies.

Pre-Allocating Space

When you start AetherStore on your PC you must pre-allocate a certain amount of space for AetherStore to use, both for primary storage and caching. As an example, a 1 GB AetherStore drive on a single workstation machine can be visualized like this when storing 400 MB:

Replicated data takes up a portion of allocated space.

The 400 MB of data indicated in orange is so-called replicated data, which counts towards the required N copies of the data we are storing across the whole system. Each of these files is also stored on N-1 other machines, though not necessarily on the same N-1 machines[1].

Caching Data

The remaining 600 MB on the local drive is unused, but allocated for AetherStore. We use this space to cache as much data as we can, ensuring that we get the read speed of the local drive as often as possible. Cached data is treated differently than replicated data (as illustrated in our example below), as it must be evicted from the local store when there is insufficient space available for newly incoming replicated data. At this point some read speeds will be closer to those of a networked server.

Cached data takes up space not used by replicated data.

The advantage of this approach is that it allows us to store data in as many places as possible when we have the space, but then store as much data as possible as the drive fills up. In the worst case we’re as good as a networked server; in the best case we’re as good as your local file system.
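In outline (again, a sketch rather than the real implementation), the allocation rule is simply that replicated copies always win over cached copies, with cached data evicted oldest-first when space runs short:

import java.util.ArrayDeque;
import java.util.Deque;

// Outline of the caching rule described above (not the real implementation):
// replicated data is never evicted, while cached data is evicted oldest-first
// whenever newly incoming replicated data needs the space.
public class StoreSpace {

    private final long capacityBytes;
    private long replicatedBytes = 0;
    private long cachedBytes = 0;
    private final Deque<long[]> cacheEntries = new ArrayDeque<>(); // {id, size}, oldest first

    public StoreSpace(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    public void storeReplicated(long sizeBytes) {
        // Evict cached entries until the replicated copy fits.
        while (replicatedBytes + cachedBytes + sizeBytes > capacityBytes && !cacheEntries.isEmpty()) {
            cachedBytes -= cacheEntries.removeFirst()[1];
        }
        replicatedBytes += sizeBytes;
    }

    public void cache(long id, long sizeBytes) {
        // Only cache if there is free space left after replicated data.
        if (replicatedBytes + cachedBytes + sizeBytes <= capacityBytes) {
            cacheEntries.addLast(new long[] {id, sizeBytes});
            cachedBytes += sizeBytes;
        }
    }
}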

Of greater importance is that we are able to store data across only a subset of machines. This allows the capacity of the AetherStore drive to outgrow the capacity of any individual machine, while ensuring that users of every machine can still access all of the data on the drive.



[1] Replica locations are determined on a file-by-file basis, rather than for an entire set of files, so it is unlikely that large groups of files will be backed up to the same set of machines. This allows us to make use of low-capacity machines by replicating fewer files to them than to larger machines.

Challenges and Rewards of P2P Software

I love the challenge of creating peer-to-peer systems and the flexibility that they give us.

A well-constructed peer-to-peer system allows us to create applications that work just as well with one hundred machines as they do with two, all without predetermined co-ordination or configuration; applications that don’t rely on a single machine or a specific network topology to run correctly.

With AetherStore, this is precisely what we need. We are creating a software system that eliminates the need for a storage server, instead allowing you to make use of the capacity you already have. If you have ten machines each with 1TB of free storage, AetherStore allows you to combine this capacity to create 10TB [1] of networked, shared storage, without any additional hardware.

With no shared server, we want to avoid making any one machine more important than the others, because we don’t want a single point of failure. We can’t designate a machine to manage locks for file updates or to determine where data should be stored. Instead we need a system that is able to run without any central co-ordination, and that dynamically scales up or down as machines start up or fail.

This post discusses one of the ways in which AetherStore achieves this as a peer-to-peer system.

Conflict Resolution

As we have no central server and no guarantee that any one machine will always be active, we have no way of locking out files for update — two users can update the same file at the same time and we have no way of stopping them. Instead we need to resolve the resulting conflict.

Consider the following example. When two users decide to concurrently update the same file, we have a conflict. These updates are gossiped to the other machines in the network [2], which must independently decide how to resolve the conflict and make the same decision regardless of the order in which the updates were received.


This independent evaluation of conflicts is critical to the scalability of the system and to peer-to-peer architectures in general. If each node makes the ‘correct’ decision without having to contact any other nodes, the system is able to scale without introducing any bottlenecks [3]. This is the advantage of the peer-to-peer architecture, but it is also the challenge.

In the case of AetherStore, to deterministically resolve file conflicts we have only two pieces of information available to us: the time of the file update and the identity of the machine making the update. Time is an imperfect comparison, however, because the system clocks of each machine in the network are unlikely to be synchronized. Using machine ID for comparison is even less suitable because it results in an ordering of updates entirely determined by a user’s choice of machine [4].

Both options are imperfect, but they are the only choices we have without resorting to some form of central co-ordination. Consequently, we use the time of the update — the lesser of two evils — to determine which update takes precedence, with the other, conflicting update being added into a renamed copy of the file. If each update occurred at precisely the same time, we use the machine ID as a tiebreaker [5].
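This ordering is easy to express as a comparator; the sketch below uses illustrative field names rather than AetherStore’s actual classes:

import java.util.Comparator;

// Sketch of the deterministic ordering described above (field names are
// illustrative). Later timestamps win; machine ID breaks exact ties, so every
// node reaches the same decision regardless of the order updates arrive in.
public class UpdateOrdering {

    public static class FileUpdate {
        final long timestampMillis;
        final String machineId;

        FileUpdate(long timestampMillis, String machineId) {
            this.timestampMillis = timestampMillis;
            this.machineId = machineId;
        }
    }

    public static final Comparator<FileUpdate> WINNER_LAST =
            Comparator.comparingLong((FileUpdate u) -> u.timestampMillis)
                      .thenComparing(u -> u.machineId);

    // The update that sorts last takes precedence; the other becomes a renamed copy.
    public static FileUpdate winner(FileUpdate a, FileUpdate b) {
        return WINNER_LAST.compare(a, b) >= 0 ? a : b;
    }
}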

Truly Peer-to-Peer

The advantage of this approach is that every machine is an equal peer to every other machine. The failure of one machine doesn’t disproportionately affect the operation of the system, and we haven’t had to add a special ‘server’ machine to our architecture. Also, because each node resolves updates independently, we can easily scale out the system without fear of overloading a single machine.

Machines can be temporarily disconnected, users can take laptops home, a lab can be shut down at night, and the system remains operational [6].

Contrast this with a more traditional setup, where users are reliant on continued connectivity to a single server to have any chance of access to their data.

The key point here is that the removal of any central co-ordination greatly increases the flexibility of the system and its tolerance of failures. In AetherStore we have a system that is resilient to the failure of individual machines and one that seamlessly scales, allowing you to add or reintegrate machines into your network without configuration or downtime.

There is no central point of failure, no bottleneck, and no server maintenance.

And, for this, I love peer-to-peer systems.

 


[1] You probably want to keep multiple copies of this data, so the total space available may be slightly less.

[2] Rather than sending updates to all machines immediately, they are sent to random subsets of machines, eventually reaching them all. This allows us to scale.

[3] This is beautifully illustrated in Chord, which can scale to thousands of nodes, with each node only having to know about a handful of other nodes to participate in the ring.

[4] Tom’s update will always override Harry’s.

[5] This approach is similar to, among other things, the conflict resolution used by CouchDB.

[6] Provided we have provisioned enough copies of user data. This is the topic for another blog post.