Security vs. Privacy in the Modern World

This week we welcomed our newest software engineer, Isabel Peters, to AetherWorks.  In this post she discusses the tradeoff between security and privacy, an area related to her recent Masters project at Imperial College London. She previously studied at the University of St Andrews.

As we store more information digitally, the dangers of data theft are increasingly serious. Do you know how your data is secured?

We live in a digitized world where almost every kind of transaction requires some form of computational data handling. For example, virtually any online purchase requires a user to enter personal information including their name, home address, and credit card information. This information is often necessary – whether for online purchases, for gaining access to restricted buildings, or for border control.

The movement of data into a digital format in some ways allows companies to greatly improve our lives, as we can process information more effectively and efficiently. On the other hand, this also means that our privacy can be breached in many new ways, such as identity fraud, Internet scams and other types of cyber-crime.

The threat from cyber-crime requires new levels of security. To better protect sensitive data sent through the network, security weaknesses have been fixed and message encryption has been added to standard protocols such as IPSec and TLS, though these protocols still need to be revised and updated periodically.

In the courts, extensive digital disclosure of personal data has emerged as a new challenge to the notion of privacy. The main issue is that ‘technology is fast and the law is slow’. For example, the term ‘privacy’ is nowhere to be found in the US Constitution to date, and non-disclosure policies are mostly voluntary for organizations. This is a dilemma for the consumer who wants to use online services but may be subject to unwanted collection and usage of personal data. For these reasons many people are wary of storing data in the cloud.

How is your data secured?

There are different types of protection mechanisms for information stored on a computer or sent through a network. First of all, data that is stored on disk or sent through a network can be protected with cryptography. This is the application of mathematical techniques to design ciphers and encrypt plaintext in such a way that it is hidden to anyone without the cipher key attempting to read it. The degree of security largely depends on the ‘strength’ of the cryptographic cipher, which is determined by both the cryptographic algorithm and the size of the cryptographic key.

For instance, DES (Data Encryption Standard) was initially believed to be sufficiently secure with a key size of 56 bits, but increased computational power left it vulnerable to brute-force attacks. Triple-DES responds by encrypting each data block three times using DES, and the US Government now recommends key sizes between 80 and 120 bits. Other encryption algorithms currently approved by the US government (National Institute of Standards and Technology) include SKIPJACK (80 bits) and AES (128, 192, 256 bits).
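To make the link between key size and brute-force resistance concrete, here is a minimal sketch that counts the possible keys for a few key sizes and estimates how long an attacker would need to try them all. The guessing rate is an assumption picked purely for illustration, not a measured figure.

```python
# Assumed attacker speed: a made-up figure for illustration only.
GUESSES_PER_SECOND = 1e12
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def keyspace(bits: int) -> int:
    """Number of distinct keys for a key of the given length in bits."""
    return 2 ** bits


def years_to_exhaust(bits: int, rate: float = GUESSES_PER_SECOND) -> float:
    """Worst-case time, in years, to try every key at the given rate."""
    return keyspace(bits) / rate / SECONDS_PER_YEAR


for bits in (56, 80, 128):
    print(f"{bits:>3}-bit key: {keyspace(bits):.3e} keys, "
          f"{years_to_exhaust(bits):.3e} years at the assumed rate")
```

Every extra bit doubles the work, so at this assumed rate a 56-bit keyspace can be searched in under a day while a 128-bit keyspace is far out of reach.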

Ultimately the factors that determine the optimal choice of the encryption technique and the size of the key depend on its purpose, the sensitivity of the data, and also on the computational resources and available time of the organization and/or individual who will use it [1].

How is your data accessed?

If a user has managed to encrypt their data, they still need a means of accessing it, while still ensuring that no-one else is capable of doing the same thing. Access to data can be managed through various mechanisms. For example, a database can be restricted by an authentication system that uses a type of ‘authenticator’. Authenticators can be:

  • Knowledge-based, such as a password or a ‘secret’ question such as “What is your pet’s name?”
  • Object-based, such as physical possession of a token that grants access to a resource, for example a metal key to an apartment
  • ID-based, referring to authenticators that are unique to the individual such as a passport. For example, a special type of ID-based authenticator is an individual’s biometric, such as the human iris.
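As an illustration of the knowledge-based case, the sketch below checks a password without storing it: the system keeps a random salt and a derived digest, then recomputes the digest at login. This is one common approach and not a description of any particular product; the function names and the iteration count are assumptions for the example.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed work factor, chosen for illustration


def enroll(password: str) -> tuple[bytes, bytes]:
    """Return a (salt, digest) pair to store in place of the password."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest for the attempt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)


salt, stored = enroll("correct horse")
print(verify("correct horse", salt, stored))  # matching attempt
print(verify("wrong guess", salt, stored))    # non-matching attempt
```

A stored digest can be replaced by choosing a new password, which is exactly the property a biometric lacks.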

What makes biometrics extremely secure is its complete uniqueness to every human individual [5], but many people have doubts about such technology because it requires them to give away a very personal, unique, and unchangeable identifier. Whereas a compromised password can be reset, a stolen biometric is irreplaceable, making it particularly sensitive and vulnerable. This represents a key trade-off between privacy and security.

The dilemma: Privacy vs. Security

As attackers become more sophisticated and begin to understand security systems, and computational power increases, our security requirements must become more stringent. Recent trends show that people are relinquishing more private information in favor of security and self-protection [6]. Consequently, the continuously growing connectivity and the need for data storage are in constant and increasing conflict with people’s privacy concerns. This constitutes a predicament highlighted by President Obama: that “We can’t have 100% security and also then have 100% privacy and zero inconvenience” [4].

What are the implications of this shift?

As the demand for personal information increases, people will face increasingly stark choices between security and privacy. As a user of a system it’s important to figure out what is important to you.

New storage solutions must be designed to store sensitive information, and to follow the laws and standards on security and data encryption of the present day. Despite this, legal frameworks are still lagging behind, and there is an urgent need for new laws and standards that restrict the voluntary collection and abuse of user data, and protect users from the illegal distribution of such information on the internet.

However, we also have to accept and understand that in order to provide a certain level of security (e.g. at the airport) sensitive data must be captured and stored. In recent years interesting projects, such as the PrimeLife [2] project, have emerged, introducing new concepts and developments with regard to privacy and identity management, but there is still work to be done [3].

There is an onus on individuals to understand and accept how their data is protected, and on companies to produce products with an adequate degree of protection against modern threats to data security.


[5] The iris, for instance, is one of the most widely deployed biometrics due to the stability of the human eye over a lifetime, its good protection from the environment and, most importantly, its great mathematical advantage, given its excess of up to 266 degrees of freedom (the number of parameters that may vary independently).

[6] In a survey performed by the Joseph Rowntree Reform Trust, 65% of the respondents said that collecting information about citizens on large computer systems is a bad idea. 83% did not approve of government access to phone, mail and Internet browsing records (source).


Workstation Resource Utilization Goes Local (Again)

For much of the last 25 years, Computer Scientists have looked for ways to make use of the unused resources of workstation machines. During this time the capacities of machines have greatly increased, as has the number that are available and under-utilized. But while the focus of this work gradually shifted from local area networks to web-scale systems, the last few years have seen a shift back to the small scale. Here’s why.

The Workstation Revolution

In the late eighties, a group of researchers from the University of Wisconsin were trying to parallelize computation over workstation machines.

They observed large collections of machines, connected but under-utilized, and many users whose demand for computation far exceeded their current capacity. If the unused capacity could be harnessed, it would provide a greater degree of parallelism than was previously available, and could do so with existing resources.

Their approach worked, and the resulting system, Condor [1], is still used today. Indeed, the two fundamental problems of distributed desktop computing are the same now as they were 25 years ago:

  1. The desktop is run for the benefit of its interactive user, so background processes shouldn’t affect that user’s activities.
  2. The desktop is unreliable, and can become unavailable at any time.

Condor was designed such that computations could be paused, making it possible to minimize impact on user activity during busy periods. It was also designed such that computations could be re-run from a previous checkpoint, ensuring the failure of a machine didn’t critically impact results.
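The pause-and-resume idea can be sketched in a few lines. This is not Condor’s actual mechanism, only a toy illustration: a long-running sum saves its progress after each step, stops when a (hypothetical) `user_is_active` check reports that the interactive user is back, and later picks up from the saved checkpoint.

```python
import json
import os
import tempfile


def run_with_checkpoints(numbers, checkpoint_path, user_is_active=lambda: False):
    """Sum `numbers`, saving progress so the job can pause and later resume."""
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the last saved step

    for i in range(state["next_index"], len(numbers)):
        if user_is_active():
            return None  # paused: yield the machine to its interactive user
        state["total"] += numbers[i]
        state["next_index"] = i + 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)  # checkpoint after every completed step
    return state["total"]


# Demo: the user "returns" after three steps, then the job is resumed.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
calls = {"n": 0}


def busy_after_three():
    calls["n"] += 1
    return calls["n"] > 3


first = run_with_checkpoints([1, 2, 3, 4, 5], path, busy_after_three)
second = run_with_checkpoints([1, 2, 3, 4, 5], path)
print(first, second)
```

The second call does not redo the first three additions; it continues from the checkpoint, which is also what protects the result if the machine fails mid-run.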

It created a pool of computational capacity which users could tap into, which was perfect for local-area networks, because long running computations could be run on remote machines and paused whenever the machine was in use. Since these machines typically resided within a single organization, administrators could ensure that the machines rarely became unavailable for reasons other than user activity.

The Internet Generation

The nineties saw an explosion in the number of workstations that were available, connected, and unused. With the advent of the World Wide Web, the problem of resource utilization had gone global.

The characteristics of this network were vastly different to the LANs used by Condor. Most importantly, with machines outwith the control of any one organization, they were less reliable than ever. Worse still, results could not be trusted. For example, a user with a home workstation may have had a lot of unused capacity, but unlike the lab machines typically used by Condor, there was no guarantee that this machine would be available for any length of time. Even if it was available, with the machine being at a remote site, we now had to consider that results could be faked or corrupted.

More positively, there were now so many of these machines available that any system able to use a tiny fraction of their resources would be able to harness great computational power. In fact, there were now so many machines available that systems such as [2] and Seti@Home [3] could take advantage of redundant computation.

Both systems broke computations down into relatively small problems, and sent duplicates of these problems to many machines. The problems were small enough that even a short period of activity would allow the remote machine to complete, and the number of duplicate computations was great enough to ensure that (a) some of the machines would eventually finish the computation, and (b) enough of these machines would finish that their results could be compared, ensuring that corrupted results were ignored.
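The comparison step in (b) can be sketched as a simple vote over the answers returned for one work unit. This is an illustrative assumption about how such a check might look, not the validation logic of either system; the `quorum` parameter is hypothetical.

```python
from collections import Counter


def accept_result(results, quorum=2):
    """Return the most common result if at least `quorum` machines agree, else None."""
    if not results:
        return None  # no machine has finished yet
    value, votes = Counter(results).most_common(1)[0]
    return value if votes >= quorum else None


print(accept_result([42, 42, 7]))   # two machines agree, the outlier is ignored
print(accept_result([1, 2, 3]))     # no agreement, so nothing is accepted
```

A corrupted or faked answer only gets through if enough machines return the same wrong value, which becomes less likely as more duplicates are sent out.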

Forging beyond computation, P2P systems such as Napster [4] and BitTorrent [5] allowed people to use the storage capacity of their workstations to share and distribute data. As Seti@Home did for computation, these systems relied on storing redundant copies of data across many machines, meaning the system’s operation was not dependent on the continued availability of only a few workstations.

Local Again

Until recently, this was as far as these systems came. But the ubiquity of multi-machine home networks and an ever increasing demand for storage has created a new focus — storage on local area networks.

Napster and BitTorrent work well on a large scale, when the sheer number of users makes it unlikely that popular items will be inaccessible, but poorly on a small scale where they provide no guarantees that all data will be backed up.

Workstations and LANs now have the capacity to be used for storage, without affecting the activities of the user of a machine (problem 1). New storage systems are capable of ensuring that there are always enough copies of data to prevent data loss (problem 2).
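To illustrate the second point, here is a minimal sketch of a replication check: given where each file is stored and which machines are currently online, it finds files with too few live copies and adds replicas on other online machines. The data layout, machine names and `min_copies` threshold are assumptions for the example, not a description of any particular product.

```python
def under_replicated(placement, online, min_copies=2):
    """Return files with fewer than `min_copies` replicas on online machines."""
    return sorted(
        name for name, hosts in placement.items()
        if len(set(hosts) & set(online)) < min_copies
    )


def repair(placement, online, min_copies=2):
    """Add replicas on online machines until each file has enough live copies."""
    fixed = {name: list(hosts) for name, hosts in placement.items()}
    for name in under_replicated(placement, online, min_copies):
        live = [h for h in fixed[name] if h in online]
        if not live:
            continue  # no surviving copy to replicate from
        candidates = [m for m in sorted(online) if m not in fixed[name]]
        fixed[name].extend(candidates[:min_copies - len(live)])
    return fixed


placement = {"a.doc": ["pc1", "pc2"], "b.jpg": ["pc2", "pc3"]}
online = {"pc1", "pc3", "pc4"}  # pc2 has gone offline
print(under_replicated(placement, online))
print(repair(placement, online))
```

Running the check whenever a machine disappears keeps the number of live copies above the threshold, as long as at least one copy survives.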

Companies such as AetherStore (disclaimer: this is us)[6], AeroFS [7], and BitTorrent Sync [8] are working to make use of this under-utilized storage space.

Why do we do this? Many small companies have storage needs, but no desire for a server (and regulatory compliance issues with the cloud), while many others have one or two servers but no redundancy. The ability to provide backup without additional capital expenditure makes that additional backup or server a much more realistic prospect.

Resource utilization has gone local again. This time the focus is on storage.


[1] Condor: A Hunter of Idle Workstations


[3] Seti@Home

[4] Napster

[5] BitTorrent

[6] AetherStore

[7] AeroFS

[8] BitTorrent Sync

Towards Pervasive Personal Data

This week we welcomed our newest software engineer, Lewis Headden, to AetherWorks.  In this post he discusses his Bachelors dissertation, “Towards Pervasive Personal Data”, which looked at issues in Distributed File Synchronization. This is an active area of research for his supervisor at the University of St Andrews, Dr. Graham Kirby.

Prior to AetherWorks, Lewis worked for Amazon Development Centre (Scotland) and spent a number of years freelancing – primarily developing web applications.

Pervasive Personal Data is the vision of a world where data flows autonomously to exactly where a user needs it.

Pervasive Personal Data is defined by intelligent systems taking proactive actions to attempt to maximize the availability and consistency of files even in a partitioned network. Systems learn what files the user wants and where they want them, and utilize every available connection to deliver them to the appropriate devices. The world of Pervasive Personal Data is the post-user configured, post-file synchronization world.

Why consider Pervasive Personal Data?

As the number of devices in use by the average consumer grows, the media they are consuming becomes fragmented across all their devices. Word processing files may be easily synchronized over the internet, but larger files – such as high resolution photos and video – struggle even on high bandwidth connections. These large media files are increasingly lacking in redundancy (or require a cumbersome backup process) and are often not available when people want them most. Yet if devices and machines worked autonomously to replicate this media then this problem would disappear. Files would be replicated on other devices, providing redundancy, and appear on the user’s other machines as they moved around, increasing their availability.

How is distribution of files in Pervasive Personal Data achieved?

Data transfer in such a system could be achieved in three ways.

First, the system could use the Local Area Network (LAN) to quickly transfer files. Obviously this is limited in usefulness, as the system cannot spread files outside of the network the device is in. Device portability, however, could help solve this. As laptops are often moved between home, work, and other locations, they could synchronize files to computers on all of the networks that they visit.

Second, the system could determine where it has seen high bandwidth connections and log them. By understanding past encounters with these connections, and predicting when it will be connected to them again, it can utilize the excess bandwidth to spread files without congesting the user’s connection. Of course, this is slower than using the LAN in most situations. However, Pervasive Personal Data is largely about synchronizing a user’s files for their personal use. In most cases, if the user is working with a device then it is likely either to be connected to the same LAN as another of their devices or to be used infrequently. In these cases, the system will either be able to quickly copy on the LAN or tolerate the slow transfer (as it is unlikely the machine will be used frequently).

Finally, the system can use a Sneakernet to distribute files. As devices like phones, tablets, and USB drives are moved between devices and/or networks, they all contain excess space that could be used to disseminate the user’s data.

How does a Pervasive Personal Data system learn preferences?

I previously introduced the idea that a system could make determinations about what files to synchronize. The simplest way to do this would be to bootstrap the system by asking the user questions about their preferences. These questions could include “I frequently look at recent photos at work”, “I do not want files created at work on my home network”, or “I do not watch my videos on my tablet”. Through this, some simple preferences could be worked out and used to determine the priority of data transfers.

Users could tweak these with more complex rules if they found that the desired data was not being synchronized.
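As a rough sketch of how bootstrap answers might become transfer priorities, the example below encodes the three sample statements as rules and looks up a priority for a given file and destination. The `Rule` fields, the priority scale (0 meaning "never send") and the first-match-wins policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    kind: Optional[str]         # file kind, or None for any kind
    origin: Optional[str]       # network where the file was created, or None
    destination: Optional[str]  # target device or network, or None
    priority: int               # 0 means never synchronize


def priority_for(rules, kind, origin, destination, default=1):
    """Return the priority of the first rule matching the transfer, else `default`."""
    for rule in rules:
        wanted = (rule.kind, rule.origin, rule.destination)
        actual = (kind, origin, destination)
        if all(w is None or w == a for w, a in zip(wanted, actual)):
            return rule.priority
    return default


RULES = [
    Rule("photo", None, "work", 3),    # "I frequently look at recent photos at work"
    Rule(None, "work", "home", 0),     # "I do not want files created at work on my home network"
    Rule("video", None, "tablet", 0),  # "I do not watch my videos on my tablet"
]

print(priority_for(RULES, "photo", "home", "work"))
print(priority_for(RULES, "document", "work", "home"))
```

More complex user tweaks would then amount to adding or reordering rules in this list.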

On a larger scale, data about what files users actually interact with, and where, could be mined. Through this, a generic set of rules could be developed using machine learning, and adjusted over time with input from the user.

How are networks like “Home” and “Work” determined?

By building a network map based on the computers that a given device in the system interacts with, the system quickly partitions devices into distinct networks. Devices that travel between networks are also identified as key routes.
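One way to picture this, under the assumption that the network map is a graph of devices that have seen each other, is to treat a device as a key route if removing it splits the map, and to read the networks off what remains. The sketch below uses hypothetical device names and a brute-force check that is only suitable for small maps.

```python
from collections import defaultdict


def components(nodes, edges):
    """Group `nodes` into connected sets, using only edges between those nodes."""
    adj = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(adj[node] - group)
        seen |= group
        groups.append(group)
    return groups


def key_routes(nodes, edges):
    """Devices whose removal splits the map into more pieces."""
    nodes = set(nodes)
    base = len(components(nodes, edges))
    return sorted(n for n in nodes if len(components(nodes - {n}, edges)) > base)


devices = {"home-pc", "nas", "laptop", "work-pc", "work-server"}
seen_together = [
    ("home-pc", "nas"), ("home-pc", "laptop"), ("nas", "laptop"),
    ("laptop", "work-pc"), ("laptop", "work-server"), ("work-pc", "work-server"),
]
routes = key_routes(devices, seen_together)
networks = components(devices - set(routes), seen_together)
print(routes, networks)
```

Here the laptop is the only device linking the two clusters, so it is flagged as the route and the remaining devices fall into two networks that can then be labelled.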

The system then labels these networks by combining user input with other information such as time of day or the types of modified files. Once it has determined the types of encountered networks, it uses that data to filter the files that are spread between devices.

What are the problems for Pervasive Personal Data?

There are a number of components that would make up such a system, and few of them are simple to engineer.  A starting point for thinking about these problems would be to ask questions. For example, how does a system:

  • Work out what to synchronize?
  • Decide which route (Sneakernet/LAN/WAN) to use?
  • Predict network availability and the transit of devices?
  • Deal with changing networks and new devices?
  • Simplify the configuration process for users but still meet their demands?

These problems, however numerous, are not insurmountable. A distributed storage system is a key part of the problem; analyzing the network and predicting availability is another. Furthermore, the system would need to learn user preferences in file usage, evaluate route selection when high latency paths like Sneakernets exist, and deal with the unpredictable whims of human beings.
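Route selection with a high latency path can be sketched as a simple cost comparison: the estimated delivery time is the delay before any data arrives plus the file size divided by throughput. The figures below (an eight hour wait before a USB drive next travels, 1 MB/s over the WAN) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    latency_s: float        # delay before the first byte can arrive
    bandwidth_mb_s: float   # sustained throughput once data is flowing

    def transfer_time(self, size_mb: float) -> float:
        """Estimated seconds to deliver a file of `size_mb` megabytes."""
        return self.latency_s + size_mb / self.bandwidth_mb_s


def best_route(routes, size_mb):
    """Pick the available route with the lowest estimated delivery time."""
    return min(routes, key=lambda r: r.transfer_time(size_mb))


# Assumed figures for the routes available between two networks.
AVAILABLE = [
    Route("WAN", latency_s=1.0, bandwidth_mb_s=1.0),
    Route("Sneakernet", latency_s=8 * 3600, bandwidth_mb_s=30.0),
]

print(best_route(AVAILABLE, size_mb=5).name)        # small document
print(best_route(AVAILABLE, size_mb=100_000).name)  # large video collection
```

With these numbers the small file goes over the WAN and the large collection waits for the Sneakernet. A real system would also have to predict the latency term, which is the hard part.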

This is an area of research that is still relatively unexplored, with some big challenges to overcome in order to deliver a solution that satisfies users.

Are you a good estimator?

Producing accurate estimations for software projects is notoriously challenging, but why? It all starts with understanding what it takes to make a good estimate.

What is a good estimate?

An estimate is an approximation of something, implying that it is based on uncertainty. Clearly a good estimate is accurate, but since this isn’t always possible, it’s more useful if it at least encodes how uncertain we are.

If I say that a project will be completed in 4 months, it removes an important piece of information: my confidence in the estimate. It’s unlikely that the project will take exactly 4 months, but is it a low-risk project which might take between 3 and 5 months, or is it based on so many unknowns that it could take over a year? The estimate isn’t more useful with a narrow range if it is based on little to no understanding of the problem.

This is the point made by Steve McConnell in “Software Estimation: Demystifying the Black Art”, where he argues that the illusion of accuracy can be more dangerous for project estimation than a wide estimate. If we can acknowledge that the estimate is not solid, then we can at least start to improve our knowledge of the problem and begin to make it more accurate.

“Estimates don’t need to be perfectly accurate as much as they need to be useful.” – Steve McConnell.

How good are your estimates?

Perhaps unsurprisingly, most people overestimate their own ability to make accurate estimations.

To show this, McConnell provides a test (which you can try for yourself here), where you have to estimate the answer to 10 questions with a 90% confidence that the correct answer is in the range of your estimation.

Try it, and come back here. How did you do?

Very few people get 90% of these questions right, partly because we are conditioned to believe that a good estimate is a narrow estimate.

In fact, a lot of the comments on the answers page argue that the questions are poor, because you’d have to be an expert to produce any meaningful (accurate, narrow) estimates. But this is precisely the point!

If you can answer with 90% confidence, but with a very wide range, then you are at least acknowledging that you don’t have enough knowledge to accurately answer the question.
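You can score yourself on this kind of test with a few lines of code: count how many of your ranges actually contain the true value. The numbers below are invented for illustration and are not McConnell’s questions or answers.

```python
def hit_rate(ranges, truths):
    """Fraction of (low, high) ranges that contain the corresponding true value."""
    if not ranges or len(ranges) != len(truths):
        raise ValueError("need one non-empty range per true value")
    hits = sum(low <= truth <= high for (low, high), truth in zip(ranges, truths))
    return hits / len(ranges)


# Hypothetical true answers to ten questions.
truths = [10, 200, 35, 4000, 7, 60, 950, 12, 330, 80]

# A confident estimator who gives narrow ranges.
narrow = [(9, 11), (100, 150), (30, 33), (1000, 2000), (6, 8),
          (70, 90), (400, 500), (20, 30), (300, 320), (85, 95)]

# An estimator who widens each range to reflect their uncertainty.
wide = [(5, 20), (50, 500), (10, 100), (500, 10000), (1, 20),
        (20, 200), (100, 2000), (5, 50), (100, 1000), (90, 300)]

print(hit_rate(narrow, truths), hit_rate(wide, truths))
```

In this made-up data the narrow ranges look precise but contain the answer only 20% of the time, while the wide ranges are honest about the uncertainty and reach 90%.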

And that’s the first step to fixing the problem.

Lessons Learned as a Start-up Intern

For my final contribution to the AetherWorks blog, I thought I’d share some thoughts on what I’ve learned during my past two and a half months interning at AetherWorks.  In addition to learning more about the software and data storage industry, I had a handful of personal goals for my internship that concentrated on my own professional development.

  1. Ask questions.

    There is always an opportunity to learn. Whether it was asking the developers tech questions or impromptu lunchtime conversations about life after college, people tend to be pretty willing to answer questions.  I found that to be a valuable resource and tried to tap into it frequently. To me, part of being an intern is trying to absorb the entire experience.

  2. Don’t be scared to speak up.

    It can be intimidating to contribute during a meeting or to ask for help, but I did it anyway. I also learned how to share and present my own ideas.  One of the last projects I worked on was an industry specific analysis of data we received from surveying hundreds of IT experts.  My involvement was a direct result of me bringing up that idea and then offering to take on the project. The results will influence the direction of the AetherStore beta, and I’m proud to have that project to call my own.

  3. Embrace the agile work environment.

    Job descriptions are really just a starting point for start-ups.  As the goals and priorities of the company evolve, job responsibilities might change, too. That requires you to be flexible and willing to adjust along with that. Understanding why and how to best align your functions with those new priorities is critical to working at a start-up.  Plus, if you speak up, you can influence the direction of your internship and be sure you are working on things that you find interesting and are productive for the company.

  4. Take on different types of projects.

    Internships are unique because they give you a chance to try out different types of roles before the real thing. I hadn’t done much with market analysis or competitor research before, so I was excited to take on those projects and gain exposure to that side of a start-up. In addition to that, I was able to work on more typical marketing and communications projects, as well as business and customer development.

  5. Brush up on CS knowledge.

    While I want to work on the business side of the tech start-up world, this summer taught me how valuable it would be to have a stronger technical background. The ability to communicate well with developers and engineers is definitely a necessity, so I hope to find some time in my senior schedule to take another CS class or two to facilitate that.

I’m headed back to Providence this weekend to start off my senior year, but before I go, I’d like to say thank you to everyone at AetherWorks for making my summer in New York so valuable and fun!


Infographic: IT Pros on Storage Pricing

We’ve just released a beta version of our software-only data storage solution, AetherStore, so we’re trying to gather as much information about the storage market as possible. We teamed up with Spiceworks to survey 250 IT Pros from their “Voice of IT” panel on storage strategies. The infographic below is Part II in a series of snapshots we’ve been releasing to share the valuable data we collected.

Perhaps unsurprisingly, our survey results revealed that IT experts prioritize cost over anything else in a storage solution. See below to find out what else respondents value in a solution, how they prefer to pay, and what the buying process is like when they need to purchase new technology.

[Infographic: Storage Pricing]

Keep checking back with the AetherWorks Blog for more results from our survey. If you haven’t signed up to be an AetherStore Early Adopter yet, do so here!

Infographic: IT Pros on Local Storage

As we recently released an AetherStore beta, it’s been a huge priority to get feedback of all kinds from experts in the data storage industry. In addition to the information we were gathering from phone interviews with our Early Adopters, we worked with Spiceworks to survey 250 IT Pros from their “Spiceworks Voice of IT” panel on their storage strategies.

We covered everything from preferred pricing plans to compatibility requirements, and are looking forward to sharing the results through a series of snapshots like the one below. Because we’re developing storage software that makes use of latent hard drive space, we’re particularly interested in organizations’ available workstation capacity and how IT Pros feel about local storage vs. the cloud. Here’s what we found:

[Infographic: Local Storage]

We’ll be continuing to release the results of our Spiceworks survey over the coming days and weeks, so check back with the blog to stay in the loop! If you haven’t signed up to be an AetherStore Early Adopter yet, you can do so here.


Four Reasons Not to Forget About Local File Sharing

Increasingly, organizations need a file sharing system that offers remote access. But building your entire solution around the cloud to accommodate that need may neglect some of the most important benefits that local file sharing provides.

Considering a report published by the U.S. Census Bureau last year found that only 6.6% of workers regularly work from home, there’s clearly still a need for an effective way to share files in-office. Remote access may be a necessary addition, but here’s why cloud-first solutions shouldn’t replace local file sharing:

1. You won’t be completely reliant on your internet connection.

If you want to deploy a cloud-based storage solution, a constant, fast internet connection is a prerequisite. Any interruption in internet service can paralyze an office that’s depending on it to store and share files; you can only ever work as fast as your connection. Local networks remove the internet as a hindrance to productivity.

2. It’s faster.

Local file sharing means no network-clogging requests to the cloud. There’s simply no faster way to open a file than from your own local drive.

3. It’s more secure.

If you have security concerns (and considering data breaches cost an average of $5M per organization in the US last year, you probably do), there’s nothing safer than your local network. Even encrypted data is more at risk on the internet, and many cloud-based solutions mean storing your data with a third party, where you can’t be responsible for its security.

4. It’s often cheaper than cloud-based solutions.

Chances are you already have local hard drives with spare capacity you could take advantage of to deploy a file sharing solution. Monthly premiums on cloud subscriptions can add up, and it’s hard to argue you couldn’t cut costs by adopting a local-first solution that uses hard drive space you’ve already paid for.

Of course, if you require remote access, a local file share alone won’t cover the scope of your needs. But moving your entire solution to the cloud prioritizes flexibility, often at the cost of the availability, speed, security and price offered by local solutions. Why not consider a local-first, cloud-second approach instead? Use a local file sharing system in-office, and sync with the cloud for remote access when necessary. We’re developing AetherStore, and see it as a solution that allows you to reap the benefits of both. Cloud certainly isn’t going anywhere, but the benefits of local file sharing are too great to ignore.


What Does Business Development Mean at a Startup?

To reiterate what the Marketing and Communications Intern, Laney, explained in the last blog post – interning at AetherWorks does not consist of coffee-fetching and paper-sorting.

My title at AetherWorks is Venture Development Summer Associate, and this summer I’ve been devoting my time entirely to their current venture, AetherStore. My focus is on Business Development for AetherStore, but what exactly does someone do in this position?

The definition not only varies greatly depending on the size and type of startup, but it has become a catch-all phrase that seems to change depending on who you talk to. For me, business development means continued, methodical innovation with the goal of growing business opportunities. I work with members of the product, marketing and engineering teams to track key tasks, identify customers, manage the deal process, and align roadmaps and launch strategies.

For an early stage startup like AetherStore, this is broken down into 3 main objectives:

1. Hypothesize

99% of all successful startup ideas start with an itch, or in the best-case scenario, with a problem that needs to be solved. Dropbox Founder, Drew Houston, was tired of having all his files scattered across his devices. Mark Zuckerberg wanted an easier way of connecting with other students at Harvard. This is where my work starts: identifying a problem or “need” big enough that a customer is willing to pay for a solution, and finding that customer. If the problem is not worth solving, I create a new hypothesis and start testing that. This might sound like a gross oversimplification of Steve Blank’s “The Four Steps to the Epiphany,” but at the end of the day what I do is a lot of customer development.

2. Analyze

Last summer, working with an education-focused startup on campus and surrounded by thousands of students, I could simply walk out the door and start interviewing people about their problems and solutions. Working with a B2B product, customer development has definitely required more creativity and hard work: spending hours tracking people down on LinkedIn, cold emailing, and running around to NYC tech meetups. However, once you move past the first call, you begin to establish a working relationship with a customer. You understand their job, their needs, and their problems. We highlight the key problems and analyze them to filter out the noise, so the development team can focus on the right features and best integrations, and we can focus on the right partnerships and channels to deliver the best product experience possible.

3. Focus and Implement

Once we have collected enough information about a consumer segment and their problems, we can start to analyze the data and validate or invalidate our hypotheses, target markets, product features and partnerships. We can create validated strategies for taking AetherStore to market. The kryptonite for any startup is a lack of focus. This hypothesis testing process ensures that the business team is always focused and doesn’t waste time building partnerships that are not adding value to our business or the consumer, and that the development team doesn’t waste time building features that customers don’t want.

As AetherStore is about to release a beta we are all excited to start delivering the product to Early Adopters, and I can’t wait to see what the next half of this summer will bring!

Intern Update: Marketing and Communications

This week marks the halfway point of my internship at AetherWorks. With the end of the summer now closer than the beginning, the realities of senior year are more pressing and require me to think seriously about what I’ll do once I graduate from Brown next May.

In my last post, I mentioned wanting to work for a start-up after graduation. For several months now, that has been my go-to response to questions about my future plans, encouraged by this article I read last summer and the positive experiences of many friends working at start-ups, but unsupported by any personal experience. I’m happy to report that my five weeks of working at AetherWorks have properly validated that statement. It truly is a great working environment.

Being an intern is often associated with mundane tasks, such as coffee runs or stuffing envelopes, but that doesn’t describe my responsibilities as the Marketing and Communications intern. With the AetherStore beta release on the horizon, today’s to-do list is always different than yesterday’s. Though the variety is certainly exciting and exposes me to the complexity of running a start-up, I have most enjoyed working on some of the larger projects because they give my days more continuity and offer a real-time perspective on the process of getting a product to market.

One major project I’ve been working on is the AetherStore Early Adopters Program. A main goal of this project is to collect information from our Early Adopters to help us address specific use cases and tailor the technology for different industries. As AetherStore is still in development, the type of information we collect can significantly influence the direction we take. As a result, my involvement, from promoting the program to setting up calls, has been a great learning experience.

Moreover, my involvement in this program serves as a great tool to measure both the company’s evolution and my personal progress. The opportunity to be put on this type of project as an intern would be unheard of at most places, but with only 9 people in the office, formalities and hierarchy are noticeably absent.

As AetherStore approaches a public beta, things will be picking up quite a bit here. I’m expecting the second half of the summer to be even busier than the first, and I’m excited to see what adjustments will be made to AetherStore as more people try out the software!