Modern networked applications are generally developed under the assumption that ubiquitous, high-availability networks afford communication between computing devices. This assumption rests on two tenets: all nodes in the network are addressable at all times, and all nodes in the network are contactable at all times. But what if we consider a network environment where not all present devices are contactable? What can we learn from building software that operates in a low-availability environment?
The word “networking” can refer both to a set of communicating computing devices (a TCP/IP computer network) and to a set of people meeting to build inter-personal connections (a NYC start-up networking event, for example).
If we frame these two concepts together, we can gain understanding of how to build applications tolerant of low-availability environments.
We should consider the properties of each of these networking concepts. What makes them similar? What makes them different? In the typical office environment, desktop machines are always plugged in to the office backbone network, which is always present and always has access to the internet using the office’s connection. Regular computing networks are designed to afford communication in such an environment. IP/Ethernet addresses of network members are assumed to resolve to a machine, and we assume that machines will not change Ethernet addresses (generally they do not change IP addresses either). All functional machines, therefore, are assumed to be addressable and contactable, at all times.
This is clearly quite a simplistic example. We can, however, contrast it with the human networking example. In an NYC start-up event, we consider the communication network to be human discussion. When the attendees are mingling over drinks, they are moving in and out of audible range of one another. Not everyone present at the event is immediately contactable by every other attendee, despite being in the same room (network), because not all attendees are part of the same group conversation. All attendees are addressable, however, because everyone is wearing their name badge!
I like to think conceptually that these types of networking lie on opposite ends of a spectrum. On one end we have a “solid” networking state, where computers do not change location, routing paths are assumed to be static (or effectively static) and all connected machines are assumed to be addressable and contactable by all other machines on the network.
At the other end of the spectrum we have a sort of “gaseous” network where meet-up attendees are in small, disparate networks. Members are available to communicate locally, but are aware of all other attendees who, while they cannot be communicated with at present, are addressable and assumed to be contactable at some point in the future (as attendees mingle).[i]
Most of my work in academia focused on routing protocols designed for low-connectivity environments similar to the human tech meet-up. In these types of networks, a node may know the address of the device it is attempting to contact, but there may be no routing path over which to communicate. In some cases a full end-to-end path will never be available, so a node that waits for one will never reach its intended destination. Nodes must therefore pass messages through intermediate nodes that store and forward messages on each other’s behalf. These types of networks are called opportunistic networks, as nodes pass messages whenever the opportunity to do so arises.[ii]
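To make the store-and-forward idea concrete, here is a minimal Python sketch. It assumes the simplest possible policy, where every contact copies every undelivered message (often called epidemic forwarding); the `Node` and `Message` classes and their fields are hypothetical names invented for illustration, not taken from any real routing protocol.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    msg_id: str
    destination: str
    payload: str


@dataclass
class Node:
    name: str
    buffer: dict = field(default_factory=dict)     # msg_id -> Message carried for others
    delivered: dict = field(default_factory=dict)  # msg_id -> Message addressed to this node

    def send(self, message):
        """Queue a message locally; no path to the destination is required."""
        self.buffer[message.msg_id] = message

    def meet(self, other):
        """Exchange messages in both directions while two nodes are in range."""
        self._forward_to(other)
        other._forward_to(self)

    def _forward_to(self, other):
        for msg_id, message in list(self.buffer.items()):
            if message.destination == other.name:
                # The contact is the destination: hand over and stop carrying.
                other.delivered[msg_id] = message
                del self.buffer[msg_id]
            elif msg_id not in other.buffer and msg_id not in other.delivered:
                # Otherwise leave a copy with the contact to carry onward.
                other.buffer[msg_id] = message


# Alice never meets Carol directly; Bob carries the message between them.
alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.send(Message("m1", "carol", "hello"))
alice.meet(bob)
bob.meet(carol)
```

After the two meetings Carol holds the message, even though no path between Alice and Carol ever existed at a single moment. Note that Alice still carries her own copy: this sketch has no acknowledgements or buffer limits, which a practical protocol would need.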
A good example of an opportunistic network would be an SMS network of everyone in NYC, where messages could only be passed between phones using Bluetooth, with messages forwarded on each other’s behalf as soon as they came into range. By exploiting six degrees of separation, messages could travel throughout the city, producing an approximation of a free SMS delivery network, albeit a rather slow one!
It’s not hard for me to see the link between this and my new job at AetherWorks, where we develop distributed application software. Consider AetherStore, our distributed peer-to-peer storage platform that presents the free disk space of multiple machines as a single address space. Data can be written to any subset of machines, and the data and backups are automatically synced and managed by the nodes themselves. Like most modern software, it is designed to operate in a heterogeneous network environment.
AetherStore uses a decentralized network where nodes may join and leave at any time, so it operates in a problem space at the convergence of well-connected TCP/IP networks and a disconnection-prone environment. Consider a customer using AetherStore to sync files between two desktop machines and a laptop at their office. They may wish to take their laptop out of the office and work on the stored files in a park. If someone else in the office decides to modify “the big presentation.ppt” simultaneously with our user in the park, when the two devices sync there will undoubtedly be conflicts.
This synchronization may seem like a trivial problem, but it is not. Time stamps cannot be trusted. How do you know which machine has the correct time? Furthermore, how do you know how to construct the differences between the files? One way to quickly determine if conflicts are present is to construct a tree of changes to files, similar to the approach of modern Distributed Version Control Software (e.g. Git, Mercurial). These change-trees are then compared when machines can once again communicate. In our example of the laptop taken to the park, we can immediately draw some parallels to our disconnected network environment. The two nodes in the office and the laptop in the park are part of the same AetherStore network, and yet not all nodes are contactable or addressable by all other nodes.
By building a system that can handle the difficulty of the disconnected environment, a state that may not occur often but must be accounted for, we should also be able to cope with the well-connected, high-availability network environment.
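As a rough illustration of comparing change-trees without relying on clocks, the Python sketch below assumes each file revision records a single parent revision, so there are no merge revisions. The function names and revision ids are hypothetical, and this is not a description of how AetherStore or Git actually implements the check.

```python
def ancestors(parents, head):
    """Return head plus every revision reachable by following parent links."""
    seen = set()
    current = head
    while current is not None and current not in seen:
        seen.add(current)
        current = parents.get(current)
    return seen


def compare_histories(parents, head_a, head_b):
    """Classify two replicas by their latest revisions, without timestamps."""
    if head_a == head_b:
        return "in sync"
    if head_a in ancestors(parents, head_b):
        return "a is behind"  # b only added changes on top of a
    if head_b in ancestors(parents, head_a):
        return "b is behind"  # a only added changes on top of b
    return "conflict"         # both changed the file after a shared revision


# Union of the revisions known to both machines: revision id -> parent id.
parents = {
    "r1": None,
    "r2": "r1",
    "r3-office": "r2",  # edit made on the office desktop
    "r3-park": "r2",    # edit made on the laptop in the park
}

stale_desktop = compare_histories(parents, "r2", "r3-office")
diverged = compare_histories(parents, "r3-office", "r3-park")
```

Here `stale_desktop` is `"a is behind"`, so that machine can simply take the newer revision, while `diverged` is `"conflict"` because both edits descend from `r2` and neither contains the other. Detecting the conflict is the easy part; deciding how to resolve it is still open.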
I present no answers here. I will, however, leave you with a few questions:
- Should nodes queue change-trees to be exchanged between devices when they meet? How big should we set this limit? Can we discard old change sets?
- Should certain machines always defer to others’ versions of files when conflicts occur?
- Can we develop a system in which timings of changes can be trusted?
- Should we take human factors into account? Should we consider employee hierarchy? Is it a good idea to always accept your boss’s changes?
- Is there a “plasma” state for networks somewhere past “gaseous” on my spectrum? I may not know who is going to turn up (non-addressable), or for how long they will be part of the network (address lifetime).
- Should we allow users and addresses to be separated?
- Perhaps we could predict addresses that might be usable in future on such a network?
- Should we use address ranges instead of unique addresses?
- Can nodes share addresses for certain points in time?[iii]
[i] Somewhere in between solid and gaseous networks we have ‘liquid’ state Mobile Ad hoc NETworks (MANETs). In a MANET the routing paths between nodes may change frequently, but nodes are all addressable and contactable at any given point.
[ii] Note the distinction between opportunistic networks and Delay Tolerant Networks. A DTN may assume some periodicity to connections, which gives rise to a different set of routing algorithms. http://www.dtnrg.org/