Towards Pervasive Personal Data

This week we welcomed our newest software engineer, Lewis Headden, to AetherWorks.  In this post he discusses his Bachelors dissertation, “Towards Pervasive Personal Data”, which looked at issues in Distributed File Synchronization. This is an active area of research for his supervisor at the University of St Andrews, Dr. Graham Kirby.

Prior to AetherWorks, Lewis worked for Amazon Development Centre (Scotland) and spent a number of years freelancing – primarily developing web applications.

Pervasive Personal Data is the vision of a world where data flows autonomously to exactly where a user needs it.

Pervasive Personal Data is defined by intelligent systems taking proactive actions to attempt to maximize the availability and consistency of files even in a partitioned network. Systems learn what files the user wants and where they want them, and utilize every available connection to deliver it to the appropriate devices. The world of Pervasive Personal Data is the post-user configured, post-file synchronization world.

Why consider Pervasive Personal Data?

As the number of devices in use by the average consumer grows, the media they are consuming becomes fragmented across all their devices. Word processing files may be easily synchronized over the internet, but larger files – such as high resolution photos and video – struggle even on high bandwidth connections. These large media files are increasingly lacking in redundancy (or require a cumbersome backup process) and are often not available when people want them most. Yet if devices and machines worked autonomously to replicate this media then this problem would disappear. Files would be replicated on other devices, providing redundancy, and appear on the user’s other machines as they moved around, increasing their availability.

How is distribution of files in Pervasive Personal Data achieved?

Data transfer in such a system could be achieved in three ways.

First, the system could use the Local Area Network (LAN) to quickly transfer files. Obviously this is limited in usefulness, as the system cannot spread files outside of the network the device is in. Device portability, however, could help solve this. As laptops are often moved between home, work, and other locations, they could synchronize files to computers on all of the networks that they visit.

Second, the system could determine where it has seen high bandwidth connections and log them. By understanding past encounters with these connections, and predicting when it will be connected to them again, it can utilize the excess bandwidth to spread files without congesting the user’s connection. Of course, this is slower than using the LAN in most situations. However, Pervasive Personal Data is largely about synchronizing a user’s files for their personal use. In most cases, if the user is working with a device then it is likely to be connected either to the same LAN as another of their devices or to be used infrequently. In these cases, the system will either be able to quickly copy on the LAN or tolerate the slow transfer (as it is unlikely the machine will be used frequently).

Finally, the system can use a Sneakernet to distribute files. As devices like phones, tablets, and USB drives are moved between devices and/or networks, they all contain excess space that could be used to disseminate the user’s data.

How does a Pervasive Personal Data system learn preferences?

I previously introduced the idea that a system could make determinations about what files to synchronize. The simplest way to do this would be to bootstrap the system by asking the user questions about their preferences. These questions could include “I frequently look at recent photos at work”, “I do not want files created at work on my home network”, or “I do not watch my videos on my tablet”. Through this, some simple preferences could be worked out and used to determine the priority of data transfers.

Users could tweak these with more complex rules if they found that the desired data was not being synchronized.

On a larger scale, data about what files users actually interact with, and where, could be mined. Through this, a generic set of rules could be developed using machine learning, and adjusted over time with input from the user.

How are networks like “Home” and “Work” determined?

By building a network map based on the computers that a given device in the system interacts with, the system quickly partitions devices into distinct networks. Devices that travel between networks are also identified as key routes.

The system then labels these networks by combining user input with other information such as time of day or the types of modified files. Once it has determined the types of encountered networks, it uses that data to filter the files that are spread between devices.

What are the problems for Pervasive Personal Data?

There are a number of components that would make up such a system, and few of them are simple to engineer.  A starting point for thinking about these problems would be to ask questions. For example, how does a system:

  • Work out what to synchronize?
  • Decide which route (Sneakernet/LAN/WAN) to use?
  • Predict network availability and the transit of devices?
  • Deal with changing networks and new devices?
  • Simplify the configuration process for users but still meet their demands?

These problems, however numerous, are not insurmountable. A distributed storage system is a key part of the problem; analyzing the network and predicting availability is another. Furthermore, the system would need to learn user preferences in file usage, evaluate route selection when high latency paths like Sneakernets exist, and deal with the unpredictable whims of human beings.

This is an area of research that is still relatively unexplored, with some big challenges to overcome in order to deliver a solution that satisfies users.

Zero-Conf: Bootstrapping the Network Layer

When you connect a printer to your local network, how does your computer find it?

It has to be addressable, which means it needs an IP address (and ideally a hostname), and it needs to be discoverable so that you can find it from your computer. When that works, you see a window like this:

Searching for a Printer

These tasks are covered by the zero-configuration protocol, which describes how to make this work even when DHCP and DNS servers, which assign IP addresses and hostnames, are not available.

The zero-conf specification covers three broad goals, which I discuss in this post:

  1. Allow devices to obtain an IP address even in the absence of a DHCP server.
  2. Allow devices to obtain a hostname, even in the absence of a DNS server.
  3. Allow devices to advertise and search for (discover) services on the local link.

Obtaining an IP Address

For a device to route messages on a network, it needs an IP address. This is relatively trivial if a DHCP server is available, but when this isn’t the case, zero-conf uses link-local addressing to obtain one.

The address assigned by link-local addressing is in the 169.254.0.0/16 block, which is only useful within the link (because routers won’t forward packets from devices in this address space[1]).

To obtain an address the device sends ARP requests to establish whether its desired IP address (chosen at random from the 169.254 range[2]) is available. This is a two-step process:

  1. First, an ARP probe is sent asking for the MAC address of the machine with a given IP. This probe is sent a number of times to confirm that the given IP address is not in use (there will be no response if it isn’t).
  2. If no reply is received, the device sends out an ARP announcement saying that it is now the machine with the given IP.

There are various protocols for conflict resolution of addresses that I won’t discuss here[3].

Obtaining a Hostname

Once a device has an IP address it can be contacted through the network, but IP addresses are volatile and for the device to be consistently discoverable it needs a hostname (which is less likely to change). Assigning as hostname is simple if you have a DNS server, but, if you don’t, zero-conf can assign hostnames using Multicast DNS.

Multicast DNS (mDNS) uses IP multicast to send broadcasts on the local link (something we wrote about recently).

To claim a local hostname with mDNS, a device sends DNS messages to probe for the uniqueness of the hostname[4]. Three queries are sent in 250ms intervals, and if no device reports using this hostname, the requesting device then sends an announce message to claim ownership.

In the case of a conflict (where two devices believe they own the same hostname), lexicographic ordering is used to determine a winner.

mDNS hostnames must use a local top-level domain (.com, .org, .gov, etc.) to distinguish them from globally accessible hosts. Apple devices and many others use the .local domain.

Browsing for Services

Once a device has an IP address and hostname, it can be contacted, but only if we know the name of the specific device we are looking for. If we don’t, DNS Service Discovery (also known as DNS-SD) can be used to search for services available on the local link. Rather than looking for a printer called ‘jose’, we can look for all devices supporting a print protocol (and then select ‘jose,’ if available).

What this looks like

Services are advertised in the form:

ServiceType.Domain

For example, _ipp.example.com advertises devices supporting the Internet Printing Protocol in the example.com domain.

Individual services are identified by a specific instance name:

InstanceName.ServiceType.Domain

So our printer, ‘jose,’ would be identified as:

jose._ipp._tcp.local[5]

How this works

DNS-SD uses mDNS to announce the presence of services. To achieve this, it uses DNS PTR and SRV records.

PTR records  (pointer records) are used in lookups[6] to search for a specific service type (_ipp._tcp.local in the above example) and return the name of the SRV record for each service supporting the specified protocol (there is one PTR record for each SRV record).

A query for _ipp._tcp.local will return jose._ipp._tcp.local and all other printers supporting the IPP protocol in the local domain.

SRV records (service records), record the protocol a service supports and its address. If the PTR record is used to find devices supporting a protocol, the SRV record is used to find a specific device’s hostname.  For the printer ‘jose,’ the SRV record would contain the service type and domain, and the hostname for the printer itself:

_ipp._tcp.local | jose.local

At this point we have discovered and can connect to our printer.

In addition to this, there are various extensions to zero-conf that I don’t describe here. These include:

  • TXT records, which allow extra attributes to be recorded in the service announcement (for example, extra information needed to make a proper connection to a service).
  • Subtypes, which allow the same service to advertise different levels or types of support within a service.
  • Flagship Service Types, which enable applications to determine when the same device is supporting and announcing multiple protocols. This makes it possible to remove duplicate entries in a listing, where a device supports multiple protocols that perform the same function.

Implementations

The most commonly used implementation of zero-conf (or at least the most written about) is Apple’s Bonjour.

We have used this library as the basis for our own implementation inside AetherStore, linking in with the provided dns_sd.jar wrapper. There are various native implementations that I haven’t yet tried.

If you’d like to read more on zero-configuration, I’d recommend the O’Reilly Zero Configuration Networking book, which provides an exhaustive description of everything I’ve touched on in this post.

Other sources are included below:


[1] This is good because it stops local information from potentially clogging upstream networks.

[2] Unless it has already been on this network, in which case it uses its previously chosen address.

[3] Edgar Danielyan discusses this in an article of the Internet Protocol Journal.

[4] This is the same process as with regular unicast DNS, but with some optimizations. For example, individual queries can include multiple requests and can receive multiple responses, to limit the number of multicast requests being made.

Unicast DNS packets are limited to 512 bytes, whereas mDNS packets can have 9,000 bytes (though you can only transmit 1,472 bytes without fragmenting the message into multiple Ethernet packets).

[5] The _tcp line specifies the service runs over TCP (rather than UDP).

[6] PTR records are similar to CNAME records but they only return the name, rather than performing further processing (source).