Eradicating Service Outages, Once and For All

Throughout the day, news sites and bloggers have been covering a glitch in Google’s Personalized Home Page service that apparently deletes all the customized content from users’ pages.

For nearly 7 hours at the beginning of April 2006, Amazon’s S3 storage service returned nothing but service unavailable messages. It happened again in January 2007, ruining Amazon’s 99.99% uptime goal.

For much of the morning of April 12th, Microsoft’s Hotmail and MSN services were unavailable to many of its users.

In February, Apple’s .Mac online service was reported to be mostly unavailable for many customers over a 5-day period.

Five months ago, Microsoft’s Windows Genuine Advantage service (which is used to disable your copy of Windows if it is deemed to be pirated) experienced a temporary outage, resulting in erroneous flagging of many valid systems as non-genuine.

A couple of months ago, Google irretrievably lost the email archives and address books of 65+ users of its GMail service.

Typepad, Blogger, Bloglines, del.icio.us, Google, Yahoo — all have experienced high-profile outages that have made their services unavailable, and in some cases lost customer data.

We all know that both hardware and software are unreliable. Given this fact, all of the above services have created counter-measures to protect your data and provide continuous service, yet none have been able to deliver.

And yet, we have created systems that never go down. The best known example is the Internet itself. Certainly, parts of the Internet fail all the time: fiber optic cables are severed, core routers malfunction, ISPs drop service to their end customers, servers blow up. But the modern Internet has never “gone down,” not even for a brief moment. As you may recall, this was part of its design goal: ARPA wanted a network that could survive multiple nuclear strikes without failing. Why is the Internet resilient? Because it is decentralized.

We’ve also designed resilient applications that run on top of this robust network. Email is the best known example. Email never fails. Yes, email servers may disintegrate, network outages may prevent specific routes of email from being delivered, individual email providers might disappear. And these problems do occasionally cause a few individuals to be unable to access or send email for brief periods of time. But the Internet’s “email service” has never served up a “service unavailable” message. There has never been a time when the vast majority of email users on the Internet could not send each other email. Why is email resilient? Because it is decentralized.

Why, then, doesn’t this type of reliability show up in the commercial sector of the Internet? Why aren’t the Amazon S3s, the Apple .Macs, the Microsoft WGAs, et. al more reliable?

The answer is surpisingly simple: centrality is vital to many business models. So even though we know that robust services eschew centrality while non-robust ones cling to it, most commercial services tether themselves to their centralized architectures not for technical reasons, but because collecting money from users requires the ability to centrally ‘turn off’ the service for those who don’t pay.

The smartest centralized services, such as Amazon’s EC2 or Google’s search engine, try to get as far away from centrality as possible, implementing things like geographically diverse data replication and roaming virtual machine instances. But even the most decentralized commercial services remain chained to central components. Amazon’s S3 and EC2 hardware, for example, remain under the complete control of Amazon, and are centrally administered by Amazon staff. Amazon centrally regulates what users of the service are allowed to do through license agreements. When spikes in usage arise, Amazon must acquire and install more hardware to keep up with demand, presumably passing purchase orders through their centralized purchasing department. A single mistake by a systems administrator has the potential to bring the entire service down. The same can be said for any of the other services controlled by a single entity.

Contrast this with email, with bittorrent, or with flŭ­d, where anyone can instantiate nodes that provide their respective services, where no single entity can start or shutter the service, where no committee or organization can censor content or control the flow of data and information. Supply of the service organically grows out of demand for the service. A single decision to turn off these decentralized services is virtually impossible; in order for that to happen, each current participant must simultaneously decide to turn off, and stay off, and all future potential participants must decide to remain, well, potential.

True decentralization has proven that there is a better way. So why do we still build and rely on centralized services that are susceptible to outages? In the future, perhaps we won’t.

Posted in decentralization, emergence, resilience | 9 Comments »