January Cleanweb Meetup At AetherWorks

We’re excited to be hosting this month’s CleanwebNYC meetup at AetherWorks! The event kicks off Tuesday, January 28th at 6:30pm at the AetherWorks office, located at Bryant Park in Manhattan. We’ll have pizza and beer, and anyone is welcome! RSVP here.

If you’re not familiar, Cleanweb leverages technology to tackle resource problems in energy, water, food, waste, transit and beyond. Check out the CleanwebNYC meetup page for more information on their monthly meetings and how you can get involved, or cleanweb.co to see how the Cleanweb initiative is addressing resource challenges on a global scale.

There will be two presentations Tuesday on solutions that use computer hardware resources more efficiently:

AetherStore

  • We’ll be presenting the storage software we’re developing, AetherStore, which turns unused space on machine hard drives into a shared storage network. AetherStore requires no new hardware and allows organizations to use their existing storage space more efficiently.

Revivn

  • Revivn re-purposes unused technology for social and environmental impact, creating “a new and greater purpose for unused computers.” They use outdated electronics from companies to build out various initiatives helping people gain computer access.

It should be a fun night, so join us to see what’s going on in the CleanwebNYC community – and of course to enjoy some free food and drinks! RSVP now! 

Storage Capacity That Increases With Your Demand

Over recent years we have seen the growth of three seemingly orthogonal trends: green computing, small tech businesses, and data analytics. Together they have driven the rise of simple local file storage and sharing applications that provide easy access to critical business data while minimizing cost.

As businesses grow in size, they employ more people, purchase more workstations, and see an overall increase in demand for storage capacity.

These businesses can either provision a server, which is subject to sawtooth-like changes in utilization as demand increases, or they can use a cloud or local storage synchronization application.

Most of these local storage applications require data to be stored on every machine running the storage application. This conflates the issues of backup (multiple copies required for redundancy) and availability (data accessible from every machine). If all machines have to store a copy of the data, then a new machine can only join if it has enough space to hold all of the existing data, which means that machines with small hard disks may not be able to join and access all data. Worse, adding a new machine to your system doesn’t increase your capacity at all, even though your demand keeps growing.

Figure 1: While demand increases with the number of machines, server capacity is a step function. The local capacity is almost entirely ignored in favor of server storage.

What if it were possible to have copies of data available on all machines, while not requiring all machines to keep a copy? This would allow us to support machines with low capacities as well as high capacities, and would ensure that our storage capacity increases as the number of machines available increases.

With AetherStore, we can do this.

Rather than storing every file on every machine’s local file system, AetherStore creates a virtual network drive that lets all machines access all data without any single machine having to hold it all. When a user saves a file to the AetherStore drive, the file is backed up across a number of machines on the local network. AetherStore ensures that multiple copies of each file are available, but it stores those copies on a subset of machines rather than on every one.

AetherStore abstracts the physical location of file data so all machines in the local AetherStore network can see the files without necessarily having to store a local copy. AetherStore’s storage scales linearly with the number of machines in the network, so more machines means more storage capacity. Your storage capacity grows with the size of your business!

AetherStore splits files into chunks and stores these chunks across the members of the local AetherStore network. By default each chunk is replicated onto four machines, though this is customizable. This separation between visibility (every machine can see every file) and locality (each file’s chunks live on only a subset of machines) is what allows AetherStore to scale efficiently.
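As a rough sketch of that chunk-and-replicate step (illustrative only, not AetherStore’s actual code; the chunk size, machine names, and random placement strategy here are invented for the example):

```python
import random

CHUNK_SIZE = 4 * 1024 * 1024   # illustrative 4 MB chunks; the real chunk size may differ
REPLICAS = 4                   # the default replication factor mentioned above

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a file's bytes into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_chunk(machines: list[str], replicas: int = REPLICAS) -> list[str]:
    """Pick a subset of machines to hold one chunk, one replica per machine."""
    if len(machines) < replicas:
        raise ValueError("not enough machines for the requested replication factor")
    return random.sample(machines, replicas)   # sampling never picks the same machine twice

# Example: a 10 MB file spread over a six-machine network
machines = [f"workstation-{i}" for i in range(1, 7)]    # hypothetical machine names
file_bytes = bytes(10 * 1024 * 1024)
for index, chunk in enumerate(split_into_chunks(file_bytes)):
    print(f"chunk {index} ({len(chunk)} bytes) -> {place_chunk(machines)}")
```

Because each chunk lands on only four of the six machines, no machine needs space for the whole drive, yet every machine can still reach every chunk over the network.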

So what does this mean for your available storage capacity? The graph below plots storage capacity against the number of machines, comparing AetherStore to “full-copy” systems that require every machine to store a replica of every file.

Figure 2: The storage capacity available on each machine is the same in both systems. AetherStore’s capacity grows linearly with the number of machines, so when using AetherStore your storage capacity grows with your business.

Taking Advantage of All Capacity

Each machine in an AetherStore network can have a different-sized hard disk and a different amount of capacity allocated for AetherStore’s use. For example, consider six machines running AetherStore on a local network, where each node can store some whole number of units of data. We can visualize this example network:

Figure 3: Not all of the machines have the same number of available units. Each cylinder represents one unit of storage space. Note that no data is stored at this point.

Let’s now save a small file to AetherStore. In this example we assume each file takes up exactly one unit of storage, and that there are 23 units’ worth of storage across all machines in the system. Since AetherStore keeps four replicas by default, saving a single one-unit file results in four copies spread across the system.

Figure 4: After saving a one-unit file, the file takes up four units across the system. Note that the algorithm never places more than one replica on the same machine, for redundancy’s sake.

Because files are split into chunks that are each replicated four times, every chunk effectively “costs” (in terms of space required) four times its actual size.

So how much data can our system store? Let’s save two more single-unit files to AetherStore:

Figure 5: It is easier to grasp AetherStore’s data allocation after three files have been saved.

In a system requiring a full copy of all data on every machine, the machine with the smallest capacity would be unable to store, and therefore access, every file!

Now that we have seen the advantage of this form of data allocation, let’s make the leap to using real units. If a one unit file requires four units of storage in an AetherStore network, a 1MB file “costs” 4MB to store.

In a full-copy system, where the file needs a copy on every machine, that same 1MB file would cost 6MB in our six-machine example. Moreover, once a machine’s capacity is full, that machine can no longer view all of the files. The full-replica system uses more of your space and then limits access once a machine fills up!

The short equation for the capacity of your AetherStore is:

Local store capacity = Space allocated / 4

The same equation applies to all nodes in the network.  Thus the capacity of the total AetherStore system is:

Total AetherStore system capacity = Sum of allocated space across all nodes / 4
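To make that arithmetic concrete, here is a minimal sketch; the per-node allocations are a hypothetical split of the 23 units used in the figures above:

```python
REPLICATION_FACTOR = 4   # default number of replicas per chunk

def aetherstore_capacity(allocated_per_node: list[float]) -> float:
    """Usable capacity when replicas are spread over a subset of machines."""
    return sum(allocated_per_node) / REPLICATION_FACTOR

def full_copy_capacity(allocated_per_node: list[float]) -> float:
    """Usable capacity when every machine must hold a copy of every file."""
    return min(allocated_per_node)

allocations = [6, 5, 4, 4, 2, 2]                # hypothetical per-machine units summing to 23
print(aetherstore_capacity(allocations))        # 23 / 4 = 5.75 units
print(full_copy_capacity(allocations))          # limited by the smallest node: 2 units
```

In the full-copy case the smallest node is the ceiling for the whole system, while with subset replication every additional machine adds its allocation divided by the replication factor.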

Remember that behind the scenes AetherStore breaks files into chunks, so our example was both a visual metaphor AND a model of AetherStore itself!

In practice, AetherStore is more complicated than this because we utilize caching on all nodes so that repeated access of a file only makes local requests, reducing network traffic and increasing speed (discussed in a previous post).

Infographic: IT Pros on Local Storage

As we recently released an AetherStore beta, it’s been a huge priority to get feedback of all kinds from experts in the data storage industry. In addition to the information we were gathering from phone interviews with our Early Adopters, we worked with Spiceworks to survey 250 IT Pros from their “Spiceworks Voice of IT” panel on their storage strategies.

We covered everything from preferred pricing plans to compatibility requirements, and are looking forward to sharing the results through a series of snapshots like the one below. Because we’re developing storage software that makes use of latent hard drive space, we’re particularly interested in organizations’ available workstation capacity and how IT Pros feel about local storage vs. the cloud. Here’s what we found:

Infographic: Local Storage

We’ll be continuing to release the results of our Spiceworks survey over the coming days and weeks, so check back with the blog to stay in the loop! If you haven’t signed up to be an AetherStore Early Adopter yet, you can do so here.

 

Four Reasons Not to Forget About Local File Sharing

Increasingly, organizations need a file sharing system that offers remote access. But building your entire solution around the cloud to accommodate that need may neglect some of the most important benefits that local file sharing provides.

Considering that a report published by the U.S. Census Bureau last year found only 6.6% of workers regularly work from home, there’s clearly still a need for an effective way to share files in-office. Remote access may be a necessary addition, but here’s why cloud-first solutions shouldn’t replace local file sharing:

1. You won’t be completely reliant on your internet connection.

If you want to deploy a cloud-based storage solution, a constant, fast internet connection is a prerequisite. Any interruption in internet service can paralyze an office that’s depending on it to store and share files; you can only ever work as fast as your connection. Local networks remove the internet as a hindrance to productivity.

2. It’s faster.

Local file sharing means no network-clogging requests to the cloud. There’s simply no faster way to open a file than from your own local drive.

3. It’s more secure.

If you have security concerns (and considering data breaches cost an average of $5M per organization in the US last year, you probably do), there’s nothing safer than your local network. Even encrypted data is more at risk on the internet, and many cloud-based solutions mean storing your data with a third party, where you can’t be responsible for its security.

4. It’s often cheaper than cloud-based solutions.

Chances are you already have local hard drives with spare capacity you could take advantage of to deploy a file sharing solution. Monthly premiums on cloud subscriptions can add up, and it’s hard to argue you couldn’t cut costs by adopting a local-first solution that uses hard drive space you’ve already paid for.

Of course, if you require remote access a local file share alone won’t cover the scope of your needs. But moving your entire solution to the cloud prioritizes flexibility, often at the cost of the availability, speed, security and price offered by local solutions. Why not consider a local-first, cloud-second approach instead? Use a local file sharing system in-office, and sync with the cloud for remote access when necessary. We’re developing AetherStore, and see it as a solution that allows you to reap the benefits of both. Cloud certainly isn’t going anywhere, but the benefits of local file sharing are too great to ignore.

 

AetherStore Early Adopters Program

If you’ve been keeping up with the AetherWorks Blog, you probably know we’ve been developing software that will allow users to make the most efficient use of their storage resources. It’s called AetherStore, and as we approach beta release we’re looking for Early Adopters to be some of the first to benefit from the technology.

Sign up here to become an Early Adopter!

What is AetherStore?

AetherStore works by pooling unused space on machine hard drives to create a shared, distributed storage network. The software chunks your data and saves multiple copies of each chunk across different machines, distributing the burden of data and removing any central point of failure. AetherStore places data intelligently based on the amount of space each machine has available, so you’re never limited to the smallest hard drive. All of your data is encrypted, too, making it suitable for even the most heavily regulated industries, like healthcare, finance, and education.
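As a rough illustration of that chunk-and-encrypt idea (this is not AetherStore’s actual code or cipher choice; the Fernet recipe from the third-party cryptography package and the SHA-256 fingerprint are stand-ins chosen for the example):

```python
import hashlib
from cryptography.fernet import Fernet   # illustrative cipher choice only

key = Fernet.generate_key()   # in a real deployment, key management is the hard part
cipher = Fernet(key)

def prepare_chunks(data: bytes, chunk_size: int = 1024 * 1024) -> list[tuple[str, bytes]]:
    """Chunk, fingerprint, and encrypt a file before it is spread across machines."""
    prepared = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()          # fingerprint used to identify the chunk
        prepared.append((digest, cipher.encrypt(chunk)))    # nothing leaves the machine in the clear
    return prepared

for digest, blob in prepare_chunks(b"example file contents"):
    print(digest[:12], len(blob), "encrypted bytes")
```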

With remote access and BYOD policies as widely embraced as they are, AetherStore can also be coupled with a cloud solution for remote access and long-term backup.


Why Sign Up for the Early Adopters Program?

While there are certain concerns (security, scalability, latency, availability) that all organizations consider when developing a data storage strategy, differences in infrastructure and business processes mean the storage needs of each organization vary widely. The information we receive from our Early Adopters allows us to address highly specific use cases and tailor our technology to maximize its effectiveness for different organizations. In return, we’re offering you the technology necessary to make the most efficient use of the storage resources you’re already paying for.

If you’re interested in learning more about AetherStore, sign up to be an Early Adopter here!

What We Do

If you’re looking at one of our job ads, chances are you want to know more about what we do — what does a software engineering job at AetherWorks actually involve?

We’re currently building AetherStore, a distributed data store. AetherStore runs over the workstations and servers in your organization, harnessing their unused capacity to create a shared, virtual file system. To a user, we appear as any other networked file system, without requiring any extra hardware.

It’s a wonderfully engaging project to work on because we see such a diverse range of activities every day. From the low level handling of calls from Windows Explorer[1], to the task of breaking up[2], encrypting[3], and dispersing files across the network[4], I’ve covered most of my Computer Science education in some form while working on AetherStore.

Why I Work Here

My own background is in distributed systems (primarily distributed database systems), so the de-centralized and fault-tolerant design of AetherStore has obvious appeal. I love the challenge of creating these systems because you are so constrained by the fundamental properties of distribution[5], and yet it’s still possible to create systems that scale to many hundreds of thousands of machines[6].

With AetherStore our challenge is creating a shared file system in which not every piece of data has to reside on every machine. We want to spread user data across machines in proportion to their size, but we don’t have a centralized authority to lean on when deciding how to do this. We don’t even know how many machines should be in the system[7]. It’s brilliantly limiting[8]!

For me, it’s hard to imagine a better job. I love the intellectual challenge of creating a new feature or component, and the satisfaction of being able to craft this into a complete, stable software product. It’s  truly an exciting thing to be a part of.

Photo of Angus' Desk
The working environment isn’t bad either!

So, while it’s not easy competing with so many other great New York companies, I think we’ve got a lot to offer. Consider applying!

Our Interviews

My biggest problem with job listings (ours included) is that we specify a set of requirements that invariably turn into clichés, and we don’t explain why we need them or how we test for them. So let’s look at a few, and see why they actually matter more than you might think.

“Software Engineer.”

The job title may seem meaningless, but I love this distinction between software engineers and programmers. We want to know that you craft code to a high standard, and that you understand why ‘it just works’ isn’t enough.

In an interview we’ll ask you to review some (bad) code we’ve written, to gauge your code literacy. We’re looking for someone that has an appreciation for clean code and a quick eye for bugs.

“A solid understanding of object-oriented programming.”

We’re building a complex system and we need to make sure that you’re the type of person that can structure code in a logical and maintainable way.  We’ll ask you to do a short programming assignment to get a feel for your general abilities and experience.

“Fundamental Computer Science Background.”

The work I have described in the previous section is challenging, and it requires that you know the relative efficiency of, say, a linked list and an array, but also that you’re capable of creating your own data structures from time to time. For us, the best indicator of this skill-set is an undergraduate degree in Computer Science. In an interview we’ll ask you an algorithmic question that gives you a chance to demonstrate the breadth of your knowledge.

If you do well enough in these questions then we’ll invite you in for a longer interview, asking you to solve a real problem that we’re actually working on in the office.

To Apply

If the idea of working at AetherWorks appeals to you, I’d urge you to check out our available positions. Alternatively, if you have any questions about this post or our interviews, please feel free to email me (first initial, last name[9]).

 


[1] And I mean every single call. Every time you open a directory, right-click on a file, or save a document, Windows Explorer is providing us with a constant stream of calls asking for information and telling us what to update. Since we’re pretending to be a network mount, we have to handle each of these calls, giving responses either from the local machine or a remote copy. This fascinates me more than it probably should, but it gives you some brief insight into the complexity of the operating systems we use every day without thought.

[2] When you store a file we break it up into chunks, both to make it easier to spread data across the network and to increase de-duplication. There are entire classes of research dedicated to finding ways of doing this efficiently. Content-based chunking, in particular, has some really clever uses for hashing (fingerprinting) algorithms and sliding windows, which dramatically improve de-duplication.
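To give a flavor of content-based chunking (a generic, textbook-style sketch rather than AetherStore’s algorithm; the window size, mask, and simple polynomial hash stand in for a proper Rabin fingerprint):

```python
import os

def content_defined_chunks(data: bytes, window: int = 48,
                           mask: int = 0x1FFF, min_size: int = 2048) -> list[bytes]:
    """Cut a chunk wherever a rolling hash of the last `window` bytes hits a target pattern.

    Because boundaries depend on the content itself, inserting bytes near the start of a
    file only shifts nearby boundaries; later chunks, and their fingerprints, are unchanged,
    which is what makes de-duplication effective.
    """
    BASE, MOD = 257, (1 << 61) - 1
    drop = pow(BASE, window, MOD)          # coefficient of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i - start + 1 > window:
            h = (h - data[i - window] * drop) % MOD
        if i - start + 1 >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Chunk 1 MB of random data and show the first few chunk sizes
print([len(c) for c in content_defined_chunks(os.urandom(1 << 20))][:5])
```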

[3] We have to encrypt data at rest and in transit, but this is more challenging than in most systems, where you have a central authoritative server. Without one, our encryption architecture represents a trade-off between security and usability.

[4] Deciding where to place data is particularly challenging, since we don’t have a central coordinator that can make this decision. All machines must be in agreement as to where data is placed (so that it can be accessed), but it is expensive to allow them to co-ordinate to make this happen.

[5] Constraints can be catalysts for creativity.

[7] Since we don’t know how many machines are ever in the system, we can’t use distributed consensus protocols such as Paxos. These require that a majority of nodes agree on a decision, but if you don’t know how many nodes exist, you don’t know how many nodes form a majority.

[8] The CAP theorem is my favorite (trivial) example of this. Imagine you have 3 machines, and have a copy of some data on each machine. How do you handle an update to that data?

How we handle this update (and anything else in distributed systems) is determined by our response to network partitions – when one set of machines is unable to contact another. If we use a central lock manager to stop conflicting updates we ensure that the data is consistent, but that it will be unavailable if the lock manager cannot be contacted. If we use a majority consensus protocol, we can update our data in the event of a partition, but only if we are in the partition with a majority of nodes. If we assume that neither of these cases is acceptable, we can do away with consistency altogether, allowing updates to each individual copy even when the others are inaccessible. The fundamental properties of a distributed system limit us in each of these options — it’s up to us to decide which is the most appropriate in any given case.

[9] This is our interview puzzle question!

Plan For Disruptions: Networking in a Disconnected World

Modern networked applications are generally developed under the assumption of ubiquitous, high-availability networks to afford communication between computing devices. This assumption rests on the tenets that all nodes in the network are addressable at all times, and that all nodes in the network are contactable at all times. But what if we consider a network environment where not all of the devices present are contactable? What can we learn from building software that operates in a low-availability environment?

The word “networking” can refer both to a set of communicating computing devices (a TCP/IP computer network) and to a set of people meeting to build interpersonal connections (an NYC start-up networking event, for example).

If we frame these two concepts together, we can gain an understanding of how to build applications that tolerate low-availability environments.

We should consider the properties of each of these networking concepts. What makes them similar? What makes them different? In the typical office environment, desktop machines are always plugged into the office backbone network, which is always present and always has access to the internet through the office’s connection. Regular computing networks are designed to afford communication in such an environment. IP/Ethernet addresses of network members are assumed to resolve to a machine, and we assume that machines will not change Ethernet addresses (generally they do not change IP addresses either). All functional machines, therefore, are assumed to be addressable and contactable at all times.

This is clearly quite a simplistic example. We can, however, contrast it with the human networking example. In an NYC start-up event, we consider the communication network to be human discussion. When the attendees are mingling over drinks, they are moving in and out of audible range of one another. Not everyone present at the event is immediately contactable by every other attendee, despite being in the same room (network), because not all attendees are part of the same group conversation. All attendees are addressable, however, because everyone is wearing their name badge!

I like to think conceptually that these types of networking lie on opposite ends of a spectrum. On one end we have a “solid” networking state, where computers do not change location, routing paths are assumed to be static (or effectively static) and all connected machines are assumed to be addressable and contactable by all other machines on the network.

At the other end of the spectrum we have a sort of “gaseous” network where meet-up attendees are in small, disparate networks.  Members are available to communicate locally, but are aware of all other attendees who, while they cannot be communicated with at present, are addressable and assumed to be contactable at some point in the future (as attendees mingle).[i]

Most of my work in academia focused on routing protocols designed for low-connectivity environments similar to the human tech meet-up. In these types of networks, a node may be aware of the address of the device it is attempting to contact, yet no complete routing path between the two may ever exist, so the node can never communicate with the intended destination directly. Nodes must therefore pass messages through intermediate nodes that store and forward messages on each other’s behalf. These types of networks are called opportunistic networks, as nodes pass messages opportunistically whenever the opportunity to do so arises.[ii]

A good example of an opportunistic network would be an SMS network of everyone in NYC, where messages could only be passed between phones over Bluetooth, with phones forwarding messages on each other’s behalf as soon as they came into range. By exploiting six degrees of separation, messages could travel throughout the city, producing an approximation of a free SMS delivery network, albeit a rather slow one!
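A toy simulation of that store-and-forward behaviour (purely illustrative; the phone names, the naive “give everything to everyone you meet” epidemic policy, and the random encounter schedule are all invented for the example):

```python
import random

class Phone:
    """A device that stores messages and forwards them whenever it meets another device."""
    def __init__(self, name: str):
        self.name = name
        self.carried = {}   # message id -> message, carried on others' behalf

    def meet(self, other: "Phone") -> None:
        """An encounter: both devices end up carrying the union of their messages."""
        merged = {**self.carried, **other.carried}
        self.carried = dict(merged)
        other.carried = dict(merged)

phones = [Phone(f"phone-{i}") for i in range(20)]
phones[0].carried[1] = {"dst": "phone-19", "body": "hello from downtown"}

rng = random.Random(42)
for step in range(1, 201):          # random pairwise encounters as people mingle
    a, b = rng.sample(phones, 2)
    a.meet(b)
    if any(msg["dst"] == phones[19].name for msg in phones[19].carried.values()):
        print(f"delivered after {step} encounters")
        break
else:
    print("not delivered within 200 encounters")
```

Epidemic flooding like this tends to deliver quickly but wastes storage and bandwidth, which is why much of the routing-protocol research mentioned above is about deciding which messages are worth carrying, and for how long.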

Manhattan provides a great location for an opportunistic network due to its high population density and small geographic area.

It’s not hard for me to see the link between this and my new job at AetherWorks, where we develop distributed application software. Consider AetherStore, our distributed peer-to-peer storage platform that presents the free disk space of multiple machines as a single address space. Data can be written to any subset of machines, and the data and backups are automatically synced and managed by the nodes themselves. Like most modern software, it is designed to operate in a heterogeneous network environment.

AetherStore uses a decentralized network where nodes may join and leave at any time, so it operates in a problem space at the convergence of well-connected TCP/IP networks and disconnection-prone environments. Consider a customer using AetherStore to sync files between two desktop machines and a laptop at their office. They may wish to take the laptop out of the office and work on the stored files in a park. If someone in the office modifies “the big presentation.ppt” at the same time as our user in the park, there will undoubtedly be conflicts when the devices next sync.

This synchronization may seem like a trivial problem, but it is not. Time stamps cannot be trusted: how do you know which machine has the correct time? Furthermore, how do you construct the differences between the files? One way to quickly determine whether conflicts are present is to build a tree of changes to each file, similar to the approach of modern distributed version control software (e.g. Git, Mercurial). These change-trees are then compared when machines can once again communicate. In our example of the laptop taken to the park, we can immediately draw some parallels with our disconnected network environment: the two machines in the office and the laptop in the park are part of the same AetherStore network, and yet not all nodes are contactable or addressable by all other nodes.
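One simple way to compare histories without trusting timestamps is to record each file’s history as a chain of change identifiers and check whether one chain is a prefix of the other. The sketch below is in the spirit of those DVCS tools, not AetherStore’s actual conflict-detection code:

```python
def compare_histories(local: list[str], remote: list[str]) -> str:
    """Compare two per-file change chains, oldest change first."""
    common = 0
    while common < len(local) and common < len(remote) and local[common] == remote[common]:
        common += 1
    if common == len(local) and common == len(remote):
        return "in sync"
    if common == len(local):
        return "fast-forward: take the remote changes"
    if common == len(remote):
        return "fast-forward: push our changes"
    return "conflict: both sides edited since the last common change"

# "the big presentation.ppt": the office machine and the laptop both edited it while apart
office = ["c1", "c2", "c3"]
laptop = ["c1", "c2", "c4"]
print(compare_histories(office, laptop))   # -> conflict
```

Only genuinely divergent histories need a resolution policy, which is where the questions below come in.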

By building a system that can handle the difficulties of the disconnected environment, a state that may not occur often but must be accounted for, we necessarily also cope with the well-connected, high-availability network environment.

I present no answers here. I will, however, leave you with a few questions:

  • Should nodes queue change-trees to be exchanged when devices meet? How large should such a queue be allowed to grow? Can we discard old change sets?
  • Should certain machines always defer to other machines’ versions of files when conflicts occur?
  • Can we develop a system in which timings of changes can be trusted?
  • Should we take human factors into account? Should we consider employee hierarchy? Is it a good idea to always accept your boss’s changes?
  • Is there a “plasma” state for networks somewhere past “gaseous” on my spectrum? I may not know who is going to turn up (non-addressable), or for how long they will be part of the network (address lifetime).
  • Should we allow users and addresses to be separated?
  • Perhaps we could predict addresses that might be usable in future on such a network?
  • Should we use address ranges instead of unique addresses?
  • Can nodes share addresses for certain points in time?[iii]

 


[i] Somewhere in between solid and gaseous networks we have ‘liquid’ state Mobile Ad hoc NETworks (MANETs). In a MANET the routing paths between nodes may change frequently, but all nodes are addressable and contactable at any given point.
http://en.wikipedia.org/wiki/Mobile_ad_hoc_network

[ii] Note the distinction between opportunistic networks and Delay Tolerant Networks. A DTN may assume some periodicity to connections, which gives rise to a different set of routing algorithms. http://www.dtnrg.org/

[iii] For discussion of phase transitions in opportunistic networks: http://dx.doi.org/10.1145/1409985.1409999

Fast Reads in a Replicated Storage System

People don’t like slow systems. Users don’t want to wait to read or write to their files, but they still need guarantees that their data is backed up and available. This is a fundamental trade-off in any distributed system.

For a storage system like AetherStore this is especially true, so we attach particular importance to speed. More formally, we have the following goals:

  1. Perform writes as fast as your local disk will allow.
  2. Perform reads as fast as your local disk will allow.
  3. Ensure there are at least N backups of every file.

With these goals the most obvious solution is to back up all data to every machine so that reads and writes are always local. This, however, effectively limits our storage capacity to the size of the smallest machine (which is unacceptable), so we’ll add another goal.

  4. N can be less than the total number of machines.

Here we have a problem. If we don’t have data stored locally, how can we achieve reads and writes as fast as a local drive? If we store data on every machine how do we allow N to be less than that? With AetherStore, goal 4 is a requirement, so we have to achieve this while attempting to meet goals 1 & 2 as much as possible.

This proves to be relatively trivial for writes, but harder for reads.

Write Fast

When you write to the AetherStore drive your file is first written to the local disk, at which point we indicate successful completion of the write — the operating system returns control to the writing application. This allows for the write speed of the local disk.

AetherStore then asynchronously backs up the file to a number of other machines in the network, providing the redundancy of networked storage.

Files are written locally, then asynchronously backed up to remote machines.

It is, therefore, possible for users to write conflicting updates to files, a topic touched on in a previous post.
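A simplified sketch of that write path (illustrative only; the directory layout, peer list, and replication helper are invented, and the real system replicates chunks to other machines rather than copying whole files to directories):

```python
import shutil
import threading
from pathlib import Path

REPLICAS = 4   # default number of backup copies

def replicate(path: Path, peers: list[Path]) -> None:
    """Copy the file to a subset of peer locations in the background."""
    for peer in peers[:REPLICAS]:
        peer.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, peer / path.name)

def write_file(store_dir: Path, name: str, data: bytes, peers: list[Path]) -> None:
    """Write locally and return immediately; redundancy happens off the critical path."""
    local_path = store_dir / name
    local_path.write_bytes(data)                                   # local-disk write speed
    threading.Thread(target=replicate, args=(local_path, peers)).start()

# Hypothetical layout: peer directories stand in for remote machines
store = Path("/tmp/aetherstore-demo/local")
store.mkdir(parents=True, exist_ok=True)
peers = [Path(f"/tmp/aetherstore-demo/peer-{i}") for i in range(5)]
write_file(store, "report.docx", b"quarterly numbers", peers)
```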

Read Fast

Ensuring that reads are fast is much more challenging, because the data you need may not be stored locally. To ensure that this data is available on the local machine as often as possible, we pre-allocate space for AetherStore and use it to aggressively cache data beyond our N copies.

Pre-Allocating Space

When you start AetherStore on your PC you must pre-allocate a certain amount of space for AetherStore to use, both for primary storage and caching. As an example, a 1 GB AetherStore drive on a single workstation machine can be visualized like this when storing 400 MB:

Replicated data takes up a portion of allocated space.

The 400MB of data indicated in orange is so-called replicated data, which counts towards the required N copies of the data we are storing across the whole system. Each of these files is also stored on N-1 other machines, though not necessarily on the same N-1 machines[1].

Caching Data

The remaining 600MB on the local drive is unused, but allocated for AetherStore. We use this space to cache as much data as we can, ensuring that we get the read speed of the local drive as often as possible. Cached data is treated differently from replicated data (as illustrated in our example below), as it must be evicted from the local store when there is insufficient space for newly incoming replicated data. At that point some reads will be closer to the speed of a networked server.

Cached data takes up space not used by replicated data.

The advantage of this approach is that we keep copies of data in as many places as possible while space is plentiful, and still store as much distinct data as possible as the drive fills up. In the worst case we’re as good as a networked server; in the best case we’re as good as your local file system.
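A sketch of that read-side behaviour (illustrative only; the cache sizing and least-recently-used eviction here are simplifications, and fetch_remote stands in for whatever network lookup the real system performs):

```python
from collections import OrderedDict

class NodeStore:
    """One node's local view: replicated data it is responsible for, plus a bounded cache."""
    def __init__(self, cache_limit_bytes: int, fetch_remote):
        self.replicated = {}               # chunk id -> bytes this node must keep
        self.cache = OrderedDict()         # extra chunks kept only while space allows
        self.cache_limit = cache_limit_bytes
        self.fetch_remote = fetch_remote   # callback that pulls a chunk over the network

    def read(self, chunk_id: str) -> bytes:
        if chunk_id in self.replicated:    # best case: local-disk speed
            return self.replicated[chunk_id]
        if chunk_id in self.cache:         # cached copy: still local-disk speed
            self.cache.move_to_end(chunk_id)
            return self.cache[chunk_id]
        data = self.fetch_remote(chunk_id) # worst case: about as fast as a networked server
        self._cache(chunk_id, data)
        return data

    def _cache(self, chunk_id: str, data: bytes) -> None:
        self.cache[chunk_id] = data
        while sum(len(v) for v in self.cache.values()) > self.cache_limit:
            self.cache.popitem(last=False) # evict the least-recently-used cached chunk

store = NodeStore(cache_limit_bytes=600 * 1024 * 1024,
                  fetch_remote=lambda cid: b"chunk bytes from a remote machine")
print(store.read("chunk-abc"))   # fetched over the network once...
print(store.read("chunk-abc"))   # ...then served from the local cache
```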

More importantly, we are able to store data across only a subset of machines. This allows the capacity of the AetherStore drive to outgrow the capacity of any individual machine, while ensuring that users of every machine can still access all of the data on the drive.



[1] Replica locations are determined on a file-by-file basis rather than for the entire set of files, so it is unlikely that large groups of files will be backed up to the same set of machines. This also lets us make use of low-capacity machines, which simply hold fewer replicas than larger machines.

‘Where is the Server?’

This is the first post of our new series answering some of your most frequently asked questions. We’ll start with the most common query.

Where is the server?

This is easy to answer: there isn’t one.

Other systems that make use of the spare capacity on your servers or workstation machines require either a constant connection to the cloud or a dedicated, centralized server that your machines connect to.

In AetherStore, the software that runs on each of your workstation machines manages and coordinates access to, and backup of, your data. These machines coordinate among themselves, so there is no need for a central server.

AetherStore's Serverless Architecture vs Server-Centric Architecture