Fast Reads in a Replicated Storage System

People don’t like slow systems. Users don’t want to wait to read or write to their files, but they still need guarantees that their data is backed up and available. This is a fundamental trade-off in any distributed system.

For a storage system like AetherStore this is especially true, so we attach particular importance to speed. More formally, we have the following goals:

  1. Perform writes as fast as your local disk will allow.
  2. Perform reads as fast as your local disk will allow.
  3. Ensure there are at least N backups of every file.

With these goals in mind, the most obvious solution is to back up all data to every machine so that reads and writes are always local. This, however, effectively limits our storage capacity to the size of the smallest machine (which is unacceptable), so we’ll add another goal.

  4. N can be less than the total number of machines.

Here we have a problem. If we don’t have data stored locally, how can we achieve reads and writes as fast as a local drive? If we store data on every machine, how do we allow N to be less than the number of machines? With AetherStore, goal 4 is a requirement, so we have to satisfy it while meeting goals 1 and 2 as much as possible.

This proves to be relatively trivial for writes, but harder for reads.

Write Fast

When you write to the AetherStore drive, your file is first written to the local disk, at which point we indicate successful completion of the write — the operating system returns control to the writing application. This gives writes the speed of the local disk.

AetherStore then asynchronously backs up the file to a number of other machines in the network, providing the redundancy of networked storage.

Files are written locally, then asynchronously backed up to remote machines.

It is, therefore, possible for users to write conflicting updates to files, a topic touched on in a previous post.
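To make the ordering concrete, here is a minimal sketch of a write path like this in Python. The helper names (`pick_backup_peers`, `replicate_to`) are illustrative placeholders rather than AetherStore’s actual API; the point is simply that the caller only ever waits for the local disk write, while replication to the other N-1 machines happens on a background thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Background pool for replication, so writes never wait on the network.
replicator = ThreadPoolExecutor(max_workers=4)

def write_file(path, data, peers, n_copies):
    # 1. Write to the local disk; this is the only step the calling
    #    application waits for, so writes run at local-disk speed.
    with open(path, "wb") as f:
        f.write(data)

    # 2. Schedule asynchronous backups to N-1 other machines; the local
    #    copy counts as the first of the N copies.
    for peer in pick_backup_peers(peers, n_copies - 1):
        replicator.submit(replicate_to, peer, path, data)

def pick_backup_peers(peers, count):
    """Placeholder: choose which machines receive backup copies."""
    return peers[:count]

def replicate_to(peer, path, data):
    """Placeholder: send the file to a remote machine over the network."""
    pass  # network transfer goes here
```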

Read Fast

Ensuring that reads are fast is much more challenging, because the data you need may not be stored locally. To ensure that this data is available on the local machine as often as possible, we pre-allocate space for AetherStore and use it to aggressively cache data beyond our N copies.

Pre-Allocating Space

When you start AetherStore on your PC, you must pre-allocate a certain amount of space for it to use, both for primary storage and for caching. As an example, a 1 GB AetherStore drive on a single workstation machine can be visualized like this when storing 400 MB:

Replicated data takes up a portion of allocated space.

The 400 MB of data indicated in orange is so-called replicated data, which counts towards the required N copies of the data we are storing across the whole system. Each of these files is also stored on N-1 other machines, though not necessarily the same N-1 machines for every file [1].

Caching Data

The remaining 600 MB on the local drive is unused, but allocated for AetherStore. We use this space to cache as much data as we can, ensuring that we get the read speed of the local drive as often as possible. Cached data is treated differently from replicated data (as illustrated in our example below): it must be evicted from the local store when there is insufficient space for newly incoming replicated data. At that point some reads will be closer to the speed of a networked server.

Cached data takes up space not used by replicated data.
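As a rough sketch of this space accounting (the class and method names are assumptions, not AetherStore’s implementation): cached copies are the first thing evicted when incoming replicated data needs the room. With the 1 GB example above, up to 600 MB can hold cached copies until replicated data claims the space.

```python
from collections import OrderedDict

class StoreSpace:
    """Toy model of a fixed pre-allocation shared by replicated and cached data."""

    def __init__(self, allocated_bytes):
        self.allocated = allocated_bytes
        self.replicated = {}        # file_id -> size; counts toward the N copies
        self.cache = OrderedDict()  # file_id -> size; least recently used first

    def free(self):
        used = sum(self.replicated.values()) + sum(self.cache.values())
        return self.allocated - used

    def add_replicated(self, file_id, size):
        # Evict cached copies until the new replicated data fits.
        while self.free() < size and self.cache:
            # A later read of the evicted file falls back to fetching from a
            # peer, i.e. roughly the speed of a networked server.
            self.cache.popitem(last=False)
        if self.free() < size:
            raise IOError("store is full of replicated data")
        self.replicated[file_id] = size

    def add_cached(self, file_id, size):
        # Cache opportunistically; skip if it simply doesn't fit.
        if self.free() >= size:
            self.cache[file_id] = size
            self.cache.move_to_end(file_id)
```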

The advantage of this approach is that we can store data in as many places as possible while space is plentiful, and still fit as much data as possible as the drive fills up. In the worst case we’re as good as a networked server; in the best case we’re as good as your local file system.

More importantly, we are able to store data across only a subset of machines. This allows the capacity of the AetherStore drive to outgrow the capacity of any individual machine, while ensuring that users of every machine can still access all of the data on the drive.



[1] Replica locations are determined on a file-by-file basis, rather than for an entire set of files, so it is unlikely that large groups of files will be backed up to the same set of machines. This also allows us to make use of low-capacity machines by replicating fewer files to them than to larger machines.
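The footnote above says replica locations are chosen file by file and weighted toward larger machines, but not how. Purely as an illustration of one way to get that behaviour, here is a sketch using weighted rendezvous (highest-random-weight) hashing; the names and the hashing scheme are assumptions, not AetherStore’s implementation.

```python
import hashlib
import math

def placement(file_id, machines, n_copies):
    """Choose which machines hold the copies of a single file.

    Every machine gets a per-file score, and the copies live on the
    n_copies machines with the highest scores. `machines` maps
    machine_id -> capacity weight, so larger machines end up holding
    proportionally more files.
    """
    def score(machine_id, weight):
        digest = hashlib.sha256(f"{file_id}:{machine_id}".encode()).digest()
        # Map the hash to a value in (0, 1), then apply the capacity weight.
        h = (int.from_bytes(digest[:8], "big") + 0.5) / 2**64
        return -weight / math.log(h)

    ranked = sorted(machines, key=lambda m: score(m, machines[m]), reverse=True)
    return ranked[:n_copies]

# Example: place three copies of one file across five machines of mixed size.
machines = {"alpha": 2.0, "beta": 1.0, "gamma": 1.0, "delta": 0.5, "epsilon": 2.0}
print(placement("reports/q3.xlsx", machines, n_copies=3))
```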

Creating Our Ideal Office Space

Office view.
Location, location, location!

For the first of my posts on the operations of AetherWorks, where better to begin than our first real challenge: finding the perfect office for our company.

When we started our search in 2010, we had three options:

1) Setting Up Within an Incubator. The tech scene in NYC is booming and there are plenty of options here. If we had applied and been successful, an incubator would have provided managed office space alongside a range of other services. These vary from organization to organization but often include access to mentors and advisors, strategic help with partnering, and access to loans and grants. Being in an environment focused on the development and longevity of start-ups, with bundled expertise on hand, was a very attractive proposition. See Dogpatch Labs as an example.

2) Fully Managed Facilities. These are fully furnished and equipped offices, already in move-in condition. They provide dedicated support staff and are available on short- or long-term leases. In addition to reducing your upfront costs, they let work begin immediately. This is where we were based when we started our search. See Regus as an example.

3) Traditional/Custom Office Space. With traditional office space you must first decide on the grade of building you are willing to settle in. Gradings are based on attributes ranging from location strengths (public amenities, transportation, etc.) to building efficiencies (energy management, exit routing, parking, etc.), and they give a quick overall assessment of the quality of the building as well as a clear idea of how much you will spend per square foot. If you can find a suite, room, or floor that has been occupied in the past and suits your needs, your calculations can stop there.

Alternatively, you may find a raw space within a building you like that requires a build-out. The landlord may offer to do this for you, with some degree of flexibility over the space, or you can opt to take care of it yourself. If you build out the space yourself, there are significant costs for contractors, as well as the delay in entering the space while work is ongoing. There is also a great deal of setup involved for services that can be taken for granted with the other options. Leases typically run 5+ years, and as a young company with limited credit you may face large security deposits, returned at the conclusion of the lease. But this does allow any company to tailor the space completely to its specific needs. See Abramson Brothers as an example.

We found the perfect raw custom office space at 501 Fifth Avenue. It would involve a comprehensive build-out, requiring a significant amount of input from a number of our employees. But, with long-term investment in place, we felt that long-term planning was the way forward. With a clear vision of where we wanted to be, this would ensure we had the perfect space to operate from and expand into. Little did we know everything else that this would entail. In April’s blog post I’ll write in detail about the build-out and everything you’ll need if you ever consider going through with one!

Requirements

The only way we could successfully tackle this project was to focus on who we were, what our goals were, and what we needed to achieve those goals. As an R&D firm, we felt we had a lot of default requirements for an office space that would not be as highly prioritized within a more standard software development company. Yes, we still had to ensure that each individual felt comfortable in their space and had the room they needed and the tools to assist them, but we also needed to consider the research in R&D. Our longevity as a company hinges not only on designing and delivering software of the highest quality, but also on continuing to create intellectual property and ensuring our research cycles are efficient and rewarding.

Office Pre-Build-Out
Our office before the build-out.

There are many ideas that claim to contribute to random stimulation and creativity within an office or work space: visual stimulation through artwork and colors, magazines and journals lying around, games and toys for breaks. Here, though, we had the chance to go a step further. We had the opportunity to create an environment where all of these extras would add to the creativity within our space. We knew the pressure was on to build a great space for creative, cooperative working, and within this requirement we had to cater for research groups that would often vary in size. We had to ensure that both large and small groups would feel motivated and comfortable enough to spend prolonged periods of creative time (occasional overnights if necessary…) dealing with hard problems – all within our space restrictions in NYC.

Furniture Build-Out
With the walls of the office up, all that’s left to do is make the space our own.

Being rather inexperienced in the world of build-outs, we had yet to discover how important it was to have designers and contractors that bought into our ethos and fully understood what we needed.  We got lucky… the second time.

The office post-build out.
The finished article. Lots of natural light and open spaces.

As I mentioned before, I’ll save the specifics of the design and construction, which inevitably went on far longer than planned and caused an incredible amount of stress, for another post.

Was it the right option?

Building out our space was a lengthy process; the duration from our initial viewing to absolute completion was nearly 18 months. But for all the phone calls, emails, drama, irritation and stress, we now have a fantastic, light, open, creative office that suits us perfectly. We do have a long-term lease, and we do have to manage our own services, but having the freedom and flexibility to create our own environment was far more important to us. Everyone now has their own personal space, we have ample room for our research activities, and we also have a great environment for welcoming customers and visitors. Most importantly, if you speak to any of our employees, one of the first things they will say about working here is that we have a phenomenal space to do our thing.

We knew what AetherWorks was, we built what it needed, and now we have an office that shows people who we are. That’s a result.

We value our open spaces, and room.
With plenty of open spaces, we have ample room for creative thought.

‘Where is the Server?’

This is the first post of our new series answering some of your most frequently asked questions. We’ll start with the most common query.

Where is the server?

This is easy to answer: there isn’t one.

Other systems that make use of the spare capacity on your servers or workstation machines require either a constant connection to the cloud or that your machines be connected to a dedicated central server.

In AetherStore, the software running on each of your workstation machines manages and co-ordinates access to, and backup of, your data. The machines co-ordinate among themselves, so there is no need for a central server.

AetherStore's Serverless Architecture vs Server-Centric Architecture

Challenges and Rewards of P2P Software

I love the challenge of creating peer-to-peer systems and the flexibility that they give us.

A well-constructed peer-to-peer system allows us to create applications that work just as well with one hundred machines as they do with two, all without predetermined co-ordination or configuration; applications that don’t rely on a single machine or a specific network topology to run correctly.

With AetherStore, this is precisely what we need. We are creating a software system that eliminates the need for a storage server, instead allowing you to make use of the capacity you already have. If you have ten machines each with 1TB of free storage, AetherStore allows you to combine this capacity to create 10TB [1] of networked, shared storage, without any additional hardware.

With no shared server, we want to avoid making any one machine more important than the others, because we don’t want a single point of failure. We can’t designate a machine to manage locks for file updates or to determine where data should be stored. Instead we need a system that is able to run without any central co-ordination, and that dynamically scales up or down as machines start up or fail.

This post discusses one of the ways in which AetherStore achieves this as a peer-to-peer system.

Conflict Resolution

As we have no central server and no guarantee that any one machine will always be active, we have no way of locking files for updates — two users can update the same file at the same time and we have no way of stopping them. Instead we need to resolve the resulting conflict.

Consider the following example: two users concurrently update the same file, creating a conflict. The updates are gossiped to the other machines in the network [2], each of which must independently decide how to resolve the conflict and reach the same decision regardless of the order in which the updates were received.

Two users concurrently update the same file, creating a conflict.

This independent evaluation of conflicts is critical to the scalability of the system and to peer-to-peer architectures in general. If each node makes the ‘correct’ decision without having to contact any other nodes, the system is able to scale without introducing any bottlenecks [3]. This is the advantage of the peer-to-peer architecture, but it is also the challenge.

In the case of AetherStore, to resolve file conflicts deterministically we have only two pieces of information available to us: the time of the file update and the identity of the machine making the update. Time is an imperfect comparison, however, because the system clocks of each machine in the network are unlikely to be synchronized. Using machine ID for comparison is even less suitable because it results in an ordering of updates entirely determined by a user’s choice of machine [4].

Both options are imperfect, but they are the only choices we have without resorting to some form of central co-ordination. Consequently, we use the time of the update — the lesser of two evils — to determine which update takes precedence, with the other, conflicting update preserved in a renamed copy of the file. If both updates occurred at precisely the same time, we use the machine ID as a tiebreaker [5].
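A sketch of that rule, with illustrative names (the renaming step is summarized in a comment rather than implemented): every machine applies the same comparison, so they all pick the same winner regardless of the order in which the updates arrive.

```python
from dataclasses import dataclass

@dataclass
class Update:
    path: str
    timestamp: float   # wall-clock time of the write; clocks may be skewed
    machine_id: str    # tiebreaker when timestamps are identical
    content: bytes

def resolve_conflict(a: Update, b: Update):
    """Deterministically pick a winning update for the same file.

    Later timestamp wins; identical timestamps fall back to machine ID
    (here the lexicographically greater ID wins; any fixed rule works as
    long as every machine applies the same one). The loser is not thrown
    away: its content is kept in a renamed copy of the file.
    """
    if (a.timestamp, a.machine_id) >= (b.timestamp, b.machine_id):
        winner, loser = a, b
    else:
        winner, loser = b, a
    return winner, loser
```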

Truly Peer-to-Peer

The advantage of this approach is that every machine is an equal peer to every other machine. The failure of one machine doesn’t disproportionately affect the operation of the system, and we haven’t had to add a special ‘server’ machine to our architecture. Also, because each node resolves updates independently, we can easily scale out the system without fear of overloading a single machine.

Machines can be temporarily disconnected, users can take laptops home, a lab can be shut down at night, and the system remains operational [6].

Contrast this with a more traditional setup, where users are reliant on continued connectivity to a single server to have any chance of access to their data.

The key point here is that the removal of any central co-ordination greatly increases the flexibility of the system and its tolerance of failures. In AetherStore we have a system that is resilient to the failure of individual machines and one that seamlessly scales, allowing you to add or reintegrate machines into your network without configuration or downtime.

There is no central point of failure, no bottleneck, and no server maintenance.

And, for this, I love peer-to-peer systems.



[1] You probably want to keep multiple copies of this data, so the total usable space will be less (roughly the raw capacity divided by the number of copies you keep).

[2] Rather than sending updates to all machines immediately, each machine sends them to random subsets of its peers, and they eventually reach every machine. This allows us to scale.
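For readers unfamiliar with gossip, here is a deliberately simplified push-based sketch of one round (illustrative names only; this is not AetherStore’s protocol): each machine forwards updates it hasn’t yet gossiped to a small random subset of peers, receivers do the same, and an update spreads through the whole network within a few rounds with high probability.

```python
import random

FANOUT = 3  # peers contacted per round; small and independent of cluster size

def gossip_round(pending_updates, peers, send):
    """Forward each not-yet-gossiped update to a few random peers.

    `pending_updates` is the set of updates this machine has received but
    not yet forwarded; `send(peer, updates)` is a placeholder network call.
    Receiving machines add anything new to their own pending set, so the
    update keeps spreading until every machine has seen it.
    """
    if not pending_updates:
        return
    for peer in random.sample(peers, min(FANOUT, len(peers))):
        send(peer, list(pending_updates))
    pending_updates.clear()  # forwarded once; peers carry it onward
```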

[3] This is beautifully illustrated in Chord, which can scale to thousands of nodes, with each node needing to know about only a handful of other nodes to participate in the ring.

[4] Tom’s update will always override Harry’s.

[5] This approach is similar to, among other things, the conflict resolution used by CouchDB.

[6] Provided we have provisioned enough copies of user data. This is the topic for another blog post.

Who We Are

Welcome to the new blog. I’ll use this first post to give you a brief idea of who we are and what we do.

At our heart we are a software R&D firm with a particular expertise in distributed systems. We were founded in 2010 with the goal of developing intellectual property assets for the enterprise, and have since focused our efforts on bringing our first product, AetherStore, to market.

We are now a seven-person team, built from a core of experts with the knowledge and experience needed to make AetherStore a reality.

Over time everyone will be contributing their perspective to the blog, sharing their experiences at AetherWorks. Personally, I’m looking forward to contributing my own experiences as COO in the coming weeks and months.

The Team at Work.
The Team at Work.


Welcome

Hello and welcome to the AetherWorks blog!

2013 promises to be an exciting year for us, with AetherStore reaching Alpha this week and further releases upcoming.

Over time, we plan to use this blog to discuss some of the more interesting aspects of our work, both technical and operational.

Stay tuned!

The View over Bryant Park, New York.
Our view over Bryant Park, New York.