Towards Pervasive Personal Data

This week we welcomed our newest software engineer, Lewis Headden, to AetherWorks. In this post he discusses his Bachelor’s dissertation, “Towards Pervasive Personal Data”, which looked at issues in Distributed File Synchronization. This is an active area of research for his supervisor at the University of St Andrews, Dr. Graham Kirby.

Prior to AetherWorks, Lewis worked for Amazon Development Centre (Scotland) and spent a number of years freelancing – primarily developing web applications.

Pervasive Personal Data is the vision of a world where data flows autonomously to exactly where a user needs it.

Pervasive Personal Data is defined by intelligent systems that act proactively to maximize the availability and consistency of files, even in a partitioned network. Systems learn which files the user wants and where they want them, and use every available connection to deliver those files to the appropriate devices. The world of Pervasive Personal Data is one that has moved past user configuration and past file synchronization.

Why consider Pervasive Personal Data?

As the number of devices in use by the average consumer grows, the media they are consuming becomes fragmented across all their devices. Word processing files may be easily synchronized over the internet, but larger files – such as high resolution photos and video – struggle even on high bandwidth connections. These large media files are increasingly lacking in redundancy (or require a cumbersome backup process) and are often not available when people want them most. Yet if devices and machines worked autonomously to replicate this media then this problem would disappear. Files would be replicated on other devices, providing redundancy, and appear on the user’s other machines as they moved around, increasing their availability.

How is distribution of files in Pervasive Personal Data achieved?

Data transfer in such a system could be achieved in three ways.

First, the system could use the Local Area Network (LAN) to quickly transfer files. Obviously this is limited in usefulness, as the system cannot spread files outside of the network the device is in. Device portability, however, could help solve this. As laptops are often moved between home, work, and other locations, they could synchronize files to computers on all of the networks that they visit.

Second, the system could determine where it has seen high bandwidth connections and log them. By understanding past encounters with these connections, and predicting when it will be connected to them again, it can use the excess bandwidth to spread files without congesting the user’s connection. Of course, this is slower than using the LAN in most situations. However, Pervasive Personal Data is largely about synchronizing a user’s files for their personal use. In most cases, a device the user is working with is likely either to be on the same LAN as another of their devices or to be used infrequently. In the first case the system can copy files quickly over the LAN; in the second it can tolerate the slow transfer, because the machine is rarely used.
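To make the logging-and-prediction idea concrete, here is a minimal sketch. It is only an illustration under stated assumptions, not the approach taken in the dissertation: the class `ConnectionLog` and its methods are hypothetical names, and the "prediction" is nothing more than the fraction of logged days, for a given weekday, on which a connection was seen during a given hour.

```python
from collections import defaultdict
from datetime import datetime


class ConnectionLog:
    """Hypothetical log of when high bandwidth connections were seen."""

    def __init__(self):
        # (network, weekday, hour) -> set of dates on which it was seen
        self._sightings = defaultdict(set)
        # weekday -> set of dates that have at least one log entry
        self._dates = defaultdict(set)

    def record(self, network: str, when: datetime) -> None:
        """Note that `network` was reachable at time `when`."""
        self._sightings[(network, when.weekday(), when.hour)].add(when.date())
        self._dates[when.weekday()].add(when.date())

    def probability(self, network: str, weekday: int, hour: int) -> float:
        """Fraction of logged days of this weekday on which the network
        was seen during this hour. Assumption: days with no log entry
        at all are ignored rather than counted as misses."""
        total = len(self._dates.get(weekday, ()))
        if total == 0:
            return 0.0
        seen = len(self._sightings.get((network, weekday, hour), ()))
        return seen / total

    def best_hour(self, network: str, weekday: int) -> int:
        """Hour of the day with the highest estimated probability."""
        return max(range(24), key=lambda h: self.probability(network, weekday, h))


log = ConnectionLog()
log.record("office", datetime(2024, 1, 1, 9))   # a Monday
log.record("office", datetime(2024, 1, 8, 9))   # the following Monday
log.record("home", datetime(2024, 1, 15, 20))   # a third Monday, at home
```

A real system would need far richer signals (location, calendar, connection quality), but even a frequency table like this is enough to defer a large transfer until a fast connection is likely to be available.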

Finally, the system can use a Sneakernet to distribute files. Phones, tablets, and USB drives are carried between machines and networks, and they all contain spare space that could be used to disseminate the user’s data.

How does a Pervasive Personal Data system learn preferences?

Earlier I introduced the idea that the system could decide which files to synchronize. The simplest way to do this would be to bootstrap the system by asking the user about their preferences. The user could be asked to confirm statements such as “I frequently look at recent photos at work”, “I do not want files created at work on my home network”, or “I do not watch my videos on my tablet”. Through this, some simple preferences could be worked out and used to determine the priority of data transfers.

Users could tweak these with more complex rules if they found that the desired data was not being synchronized.
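As a rough sketch of how such answers might become transfer priorities, the three example statements can be encoded as weighted rules. Everything here is an assumption made for illustration: the `FileInfo` and `Rule` types, the field names, the 30-day meaning of "recent", and the weights themselves are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FileInfo:
    kind: str       # e.g. "photo", "video", "document"
    origin: str     # label of the network where the file was created
    age_days: int   # days since the file was created


@dataclass(frozen=True)
class Rule:
    weight: int                          # positive = wanted, negative = unwanted
    kind: Optional[str] = None           # None means "any"
    origin: Optional[str] = None
    target: Optional[str] = None         # destination network or device label
    max_age_days: Optional[int] = None

    def matches(self, f: FileInfo, target: str) -> bool:
        return (
            (self.kind is None or self.kind == f.kind)
            and (self.origin is None or self.origin == f.origin)
            and (self.target is None or self.target == target)
            and (self.max_age_days is None or f.age_days <= self.max_age_days)
        )


def priority(f: FileInfo, target: str, rules: Iterable[Rule]) -> int:
    """Sum the weights of every rule that applies to sending `f` to `target`."""
    return sum(r.weight for r in rules if r.matches(f, target))


RULES = [
    # "I frequently look at recent photos at work"
    Rule(weight=10, kind="photo", target="work", max_age_days=30),
    # "I do not want files created at work on my home network"
    Rule(weight=-100, origin="work", target="home"),
    # "I do not watch my videos on my tablet"
    Rule(weight=-10, kind="video", target="tablet"),
]
```

With these rules, a three-day-old photo bound for the work network scores 10, while any file created at work scores -100 for the home network. The "more complex rules" a user might add later would simply be further entries in the list.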

On a larger scale, data about what files users actually interact with, and where, could be mined. Through this, a generic set of rules could be developed using machine learning, and adjusted over time with input from the user.

How are networks like “Home” and “Work” determined?

By building a network map based on the computers that a given device in the system interacts with, the system quickly partitions devices into distinct networks. Devices that travel between networks are also identified as key routes.
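A simplified sketch of this partitioning follows. It assumes, purely for illustration, that every sighting of a device comes tagged with some identifier for the network it was seen on (for example, the gateway's address). The function `build_network_map` and the identifiers used are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_network_map(
    sightings: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Group devices by the network they were seen on.

    `sightings` is an iterable of (device, network_id) pairs.
    Returns (members, carriers):
      members  maps each network_id to the devices seen on it;
      carriers maps each device seen on more than one network to the
               networks it visits, i.e. the "key routes" between them.
    """
    members: Dict[str, Set[str]] = defaultdict(set)
    visited: Dict[str, Set[str]] = defaultdict(set)
    for device, network_id in sightings:
        members[network_id].add(device)
        visited[device].add(network_id)
    carriers = {d: nets for d, nets in visited.items() if len(nets) > 1}
    return dict(members), carriers


members, carriers = build_network_map([
    ("desktop", "gw-a"),
    ("laptop", "gw-a"),
    ("laptop", "gw-b"),
    ("workstation", "gw-b"),
])
```

Here the laptop is the only device seen on both networks, so it is the only candidate for carrying files between them.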

The system then labels these networks by combining user input with other information such as time of day or the types of modified files. Once it has determined the types of encountered networks, it uses that data to filter the files that are spread between devices.

What are the problems for Pervasive Personal Data?

There are a number of components that would make up such a system, and few of them are simple to engineer. A starting point for thinking about these problems would be to ask questions. For example, how does a system:

  • Work out what to synchronize?
  • Decide which route (Sneakernet/LAN/WAN) to use?
  • Predict network availability and the transit of devices?
  • Deal with changing networks and new devices?
  • Simplify the configuration process for users but still meet their demands?

These problems, however numerous, are not insurmountable. A distributed storage system is a key part of the problem; analyzing the network and predicting availability is another. Furthermore, the system would need to learn user preferences in file usage, evaluate route selection when high-latency paths like Sneakernets exist, and deal with the unpredictable whims of human beings.
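To see why route selection is not trivial when a Sneakernet is one of the options, consider a deliberately simple cost model: the estimated delivery time is the wait before transfer can begin plus the file size divided by throughput. The `Route` type, the function names, and all of the figures below are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Route:
    name: str
    latency_s: float                      # wait before data can start arriving
    bandwidth_bps: float                  # sustained throughput, bytes per second
    capacity_bytes: float = float("inf")  # e.g. free space on a USB drive


def delivery_time(route: Route, size_bytes: float) -> float:
    """Estimated seconds until the whole file has arrived via `route`."""
    return route.latency_s + size_bytes / route.bandwidth_bps


def pick_route(routes: Iterable[Route], size_bytes: float) -> Optional[Route]:
    """Fastest route that can carry the file, or None if none can."""
    usable = [r for r in routes if r.capacity_bytes >= size_bytes]
    if not usable:
        return None
    return min(usable, key=lambda r: delivery_time(r, size_bytes))


ROUTES = [
    # Assumed home broadband upload: about 1 MB/s, available immediately.
    Route("WAN", latency_s=0.1, bandwidth_bps=1_000_000),
    # Assumed USB drive that reaches the office in 8 hours, copying at 30 MB/s.
    Route("Sneakernet", latency_s=8 * 3600, bandwidth_bps=30_000_000,
          capacity_bytes=64_000_000_000),
]
```

Under these assumed numbers a 10 MB document is best sent over the WAN (about ten seconds), whereas a 50 GB video collection arrives sooner on the USB drive (roughly 8.5 hours against nearly 14). A 100 GB transfer falls back to the WAN because it does not fit on the drive. The hard part in practice is that the Sneakernet's latency is itself a prediction about human behaviour.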

This is an area of research that is still relatively unexplored, with some big challenges to overcome in order to deliver a solution that satisfies users.