Eradicating Service Outages, Once and For All

Throughout the day, news sites and bloggers have been covering a glitch in Google’s Personalized Home Page service that apparently deletes all the customized content from users’ pages.

For nearly 7 hours at the beginning of April 2006, Amazon’s S3 storage service returned nothing but service unavailable messages. It happened again in January 2007, ruining Amazon’s 99.99% uptime goal.

For much of the morning of April 12th, Microsoft’s Hotmail and MSN services were unavailable to many of its users.

In February, Apple’s .Mac online service was reported to be mostly unavailable for many customers over a 5-day period.

Five months ago, Microsoft’s Windows Genuine Advantage service (which is used to disable your copy of Windows if it is deemed to be pirated) experienced a temporary outage, resulting in erroneous flagging of many valid systems as non-genuine.

A couple of months ago, Google irretrievably lost the email archives and address books of 65+ users of its Gmail service.

Typepad, Blogger, Bloglines, del.icio.us, Google, Yahoo — all have experienced high-profile outages that have made their services unavailable, and in some cases lost customer data.

We all know that both hardware and software are unreliable. Given this fact, all of the above services have created counter-measures to protect your data and provide continuous service, yet none have been able to deliver.

And yet, we have created systems that never go down. The best known example is the Internet itself. Certainly, parts of the Internet fail all the time: fiber optic cables are severed, core routers malfunction, ISPs drop service to their end customers, servers blow up. But the modern Internet has never “gone down,” not even for a brief moment. As you may recall, this was part of its design goal: ARPA wanted a network that could survive multiple nuclear strikes without failing. Why is the Internet resilient? Because it is decentralized.

We’ve also designed resilient applications that run on top of this robust network. Email is the best known example. Email never fails. Yes, email servers may disintegrate, network outages may prevent specific routes of email from being delivered, individual email providers might disappear. And these problems do occasionally cause a few individuals to be unable to access or send email for brief periods of time. But the Internet’s “email service” has never served up a “service unavailable” message. There has never been a time when the vast majority of email users on the Internet could not send each other email. Why is email resilient? Because it is decentralized.

Why, then, doesn’t this type of reliability show up in the commercial sector of the Internet? Why aren’t the Amazon S3s, the Apple .Macs, the Microsoft WGAs, et al., more reliable?

The answer is surprisingly simple: centrality is vital to many business models. So even though we know that robust services eschew centrality while non-robust ones cling to it, most commercial services tether themselves to their centralized architectures not for technical reasons, but because collecting money from users requires the ability to centrally ‘turn off’ the service for those who don’t pay.

The smartest centralized services, such as Amazon’s EC2 or Google’s search engine, try to get as far away from centrality as possible, implementing things like geographically diverse data replication and roaming virtual machine instances. But even the most decentralized commercial services remain chained to central components. Amazon’s S3 and EC2 hardware, for example, remain under the complete control of Amazon, and are centrally administered by Amazon staff. Amazon centrally regulates what users of the service are allowed to do through license agreements. When spikes in usage arise, Amazon must acquire and install more hardware to keep up with demand, presumably passing purchase orders through their centralized purchasing department. A single mistake by a systems administrator has the potential to bring the entire service down. The same can be said for any of the other services controlled by a single entity.

Contrast this with email, with BitTorrent, or with flŭd, where anyone can instantiate nodes that provide their respective services, where no single entity can start or shutter the service, where no committee or organization can censor content or control the flow of data and information. Supply of the service organically grows out of demand for the service. A single decision to turn off these decentralized services is virtually impossible; in order for that to happen, each current participant must simultaneously decide to turn off, and stay off, and all future potential participants must decide to remain, well, potential.

True decentralization has proven that there is a better way. So why do we still build and rely on centralized services that are susceptible to outages? In the future, perhaps we won’t.

Posted in decentralization, emergence, resilience

How Comes this Unity?

In the second chapter of ‘Out of Control’, Kevin Kelly describes the surprising unity that emerges in a large flock of mallard ducks:

At dawn, on a weedy Michigan lake, ten thousand mallards fidget. In the soft pink glow of morning, the ducks jabber, shake out their wings, and dunk for breakfast. Ducks are spread everywhere. Suddenly, cued by some imperceptible signal, a thousand birds rise as one thing. They lift themselves into the air in a great thunder. As they take off they pull up a thousand more birds from the surface of the lake with them, as if they were all but part of a reclining giant now rising. The monstrous beast hovers in the air, swerves to the east sun, and then, in a blink, reverses direction, turning itself inside out. A second later, the entire swarm veers west and away, as if steered by a single mind. In the 17th century, an anonymous poet wrote: “…and the thousands of fishes moved as a huge beast, piercing the water. They appeared united, inexorably bound to a common fate. How comes this unity?”

Just as each duck has some set of criteria by which it decides when to swim and when to eat and when to fly off with the crowd, each individual in the flŭd network has some simple rules for when and where to store bits of its data, and for when and from whom to accept bits of data. The trick is to tease out an emergent behavior where each node in this massive flock of computers cooperates to provide extremely robust protection against data loss. The local rules employed to produce this behavior are not so different from those used by the mallard; each individual in the flock must be free to act for itself, each decision must have self-interest driving its outcome and must be made only with local stimulus from the part of the world that it perceives directly, and there is no central leader, chain-of-command, or hierarchical control. The behavior that emerges from the flŭd backup network also shares some common characteristics with the flock. Both survive even when a large number of individuals leave or die, and both continue to function as new individuals join. The system is extremely robust to radically changing conditions and rogue individuals.
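To make the claim about surviving departures concrete, here is a small Python sketch. It assumes plain replication, with each block copied to several distinct peers. That is a simplification chosen for the example, since the description above does not say how flŭd actually distributes data. The names `place_blocks` and `recoverable` are hypothetical.

```python
def place_blocks(blocks, peers, copies):
    """Assign each block to `copies` distinct peers, round-robin (assumed policy).

    Requires copies <= len(peers) so that the chosen peers are distinct.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {peers[(i + k) % len(peers)] for k in range(copies)}
    return placement


def recoverable(placement, departed):
    """True if every block still has at least one holder that has not left."""
    return all(holders - departed for holders in placement.values())


peers = [f"node-{n}" for n in range(10)]
placement = place_blocks(["b0", "b1", "b2", "b3"], peers, copies=3)

# Two of ten peers vanish: every block still has a surviving copy.
survives_two = recoverable(placement, {"node-0", "node-1"})
# All three holders of block b0 vanish: that block is lost.
survives_three = recoverable(placement, {"node-0", "node-1", "node-2"})
print(survives_two, survives_three)
```

Note that no node in this sketch consults a coordinator: placement and recovery depend only on which peers happen to hold copies, so losing any individual peer is survivable as long as some holder of each block remains.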

In the initial flŭd network, all the nodes are identical — clones programmed to make the same decisions given the same input, over and over again. But flŭd has been designed to encourage the richness of diversity. The software is open. Alternative clients with new behaviors can be created and distributed. No one can anticipate where such diversity will take the network, but the internal mechanisms of flŭd plan for such events in a robust way. All interactions are based on trust, and trust is a value that each individual node determines for itself. Nodes increase their trust for others that behave in ways that they view as beneficial, and decrease their trust for nodes that they deem as misbehaving. Nodes prefer to exchange data with other nodes that they trust. This internal trust mechanism regulates the entire network, brokering how resources can be consumed and dispersed, ensuring that you’ll always be able to recover the data that you store to it.
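The trust mechanism described above can be sketched in a few lines of Python. This is only an illustration, not flŭd’s actual implementation: the class name `TrustTable`, the neutral starting score, the reward and penalty sizes, and the method names are all assumptions made for the example. Each node keeps its own private scores, raises them for peers that behave well, lowers them for peers that misbehave, and prefers the highest-scoring peers when it needs to exchange data.

```python
from collections import defaultdict


class TrustTable:
    """One node's private view of how much it trusts each peer (hypothetical)."""

    def __init__(self, initial=0.0, reward=1.0, penalty=2.0):
        # Assumed values: unknown peers start neutral, and misbehaviour
        # costs more trust than good behaviour earns.
        self.reward = reward
        self.penalty = penalty
        self.scores = defaultdict(lambda: initial)

    def record_success(self, peer):
        """Peer did something this node considers beneficial."""
        self.scores[peer] += self.reward

    def record_failure(self, peer):
        """Peer did something this node considers misbehaviour."""
        self.scores[peer] -= self.penalty

    def preferred_peers(self, candidates, count):
        """Return up to `count` candidates, most trusted first."""
        ranked = sorted(candidates, key=lambda p: self.scores[p], reverse=True)
        return ranked[:count]


trust = TrustTable()
trust.record_success("node-a")   # e.g. returned a stored block intact
trust.record_success("node-a")
trust.record_success("node-b")
trust.record_failure("node-c")   # e.g. failed to return a block it accepted

best = trust.preferred_peers(["node-a", "node-b", "node-c", "node-d"], count=2)
print(best)
```

Because every node holds its own table, there is no global reputation to attack or administer; two nodes may legitimately disagree about the same peer.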

We believe that no model is more robust to failure than those that nature has so generously provided. flŭd’s goal is to provide the world’s easiest-to-manage and most resilient backup system, and it pursues that goal by leveraging these natural models.

Posted in emergence, resilience