Eradicating Service Outages, Once and For All

Throughout the day, news sites and bloggers have been covering a glitch in Google’s Personalized Home Page service that apparently deletes all the customized content from users’ pages.

For nearly 7 hours at the beginning of April 2006, Amazon’s S3 storage service returned nothing but “service unavailable” messages. It happened again in January 2007, ruining Amazon’s 99.99% uptime goal.

For much of the morning of April 12th, Microsoft’s Hotmail and MSN services were unavailable to many of its users.

In February, Apple’s .Mac online service was reported to be mostly unavailable for many customers over a 5-day period.

Five months ago, Microsoft’s Windows Genuine Advantage service (which is used to disable your copy of Windows if it is deemed to be pirated) experienced a temporary outage, resulting in erroneous flagging of many valid systems as non-genuine.

A couple of months ago, Google irretrievably lost the email archives and address books of 65+ users of its GMail service.

Typepad, Blogger, Bloglines, del.icio.us, Google, Yahoo — all have experienced high-profile outages that have made their services unavailable, and in some cases lost customer data.

We all know that both hardware and software are unreliable. Given this fact, all of the above services have created counter-measures to protect your data and provide continuous service, yet none have been able to deliver.

And yet, we have created systems that never go down. The best known example is the Internet itself. Certainly, parts of the Internet fail all the time: fiber optic cables are severed, core routers malfunction, ISPs drop service to their end customers, servers blow up. But the modern Internet has never “gone down,” not even for a brief moment. As you may recall, this was part of its design goal: ARPA wanted a network that could survive multiple nuclear strikes without failing. Why is the Internet resilient? Because it is decentralized.

We’ve also designed resilient applications that run on top of this robust network. Email is the best known example. Email never fails. Yes, email servers may disintegrate, network outages may prevent specific routes of email from being delivered, individual email providers might disappear. And these problems do occasionally cause a few individuals to be unable to access or send email for brief periods of time. But the Internet’s “email service” has never served up a “service unavailable” message. There has never been a time when the vast majority of email users on the Internet could not send each other email. Why is email resilient? Because it is decentralized.

Why, then, doesn’t this type of reliability show up in the commercial sector of the Internet? Why aren’t the Amazon S3s, the Apple .Macs, the Microsoft WGAs, et al. more reliable?

The answer is surprisingly simple: centrality is vital to many business models. So even though we know that robust services eschew centrality while non-robust ones cling to it, most commercial services tether themselves to their centralized architectures not for technical reasons, but because collecting money from users requires the ability to centrally ‘turn off’ the service for those who don’t pay.

The smartest centralized services, such as Amazon’s EC2 or Google’s search engine, try to get as far away from centrality as possible, implementing things like geographically diverse data replication and roaming virtual machine instances. But even the most decentralized commercial services remain chained to central components. Amazon’s S3 and EC2 hardware, for example, remain under the complete control of Amazon, and are centrally administered by Amazon staff. Amazon centrally regulates what users of the service are allowed to do through license agreements. When spikes in usage arise, Amazon must acquire and install more hardware to keep up with demand, presumably passing purchase orders through their centralized purchasing department. A single mistake by a systems administrator has the potential to bring the entire service down. The same can be said for any of the other services controlled by a single entity.

Contrast this with email, with bittorrent, or with flŭd, where anyone can instantiate nodes that provide their respective services, where no single entity can start or shutter the service, where no committee or organization can censor content or control the flow of data and information. Supply of the service grows organically out of demand for the service. Turning off these decentralized services with a single decision is virtually impossible; for that to happen, every current participant would have to simultaneously decide to turn off, and stay off, and every future potential participant would have to decide to remain, well, potential.
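
To make the email case concrete: nothing stops anyone from running their own mail server and handing a message directly to a recipient’s MX hosts; there is no central service to ask for permission. A rough sketch in Python (assuming the dnspython package is installed and outbound port 25 isn’t blocked):

    import smtplib
    import dns.resolver   # dnspython; assumed installed

    def deliver_direct(sender, recipient, message):
        """Hand a message straight to the recipient's domain; no central relay."""
        domain = recipient.split("@", 1)[1]
        # Any of the domain's MX hosts will do; try them in preference order.
        mx_hosts = sorted(dns.resolver.resolve(domain, "MX"),
                          key=lambda rec: rec.preference)
        for mx in mx_hosts:
            host = str(mx.exchange).rstrip(".")
            try:
                with smtplib.SMTP(host, 25, timeout=30) as smtp:
                    smtp.sendmail(sender, [recipient], message)
                    return True
            except (OSError, smtplib.SMTPException):
                continue   # that MX is down; the next one may not be
        return False

Every mail server on the Internet is just another node speaking the same protocols, which is why no single party can switch the service off.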

True decentralization has proven that there is a better way. So why do we still build and rely on centralized services that are susceptible to outages? In the future, perhaps we won’t.

9 Responses to “Eradicating Service Outages, Once and For All”

  1. Interesting post and I agree that decentralization allows one to eliminate the types of service outages mentioned. I’m wondering if you know of any performance data that is collected against things like S3, Google, .Mac, etc. to try to come up with a pragmatic view on how often they are out and for how long. It also would be interesting to open up the storage fabric to allow for storage of data on all of those centralized stores, thus hedging a bit against any one failure.

    Thanks for posting and I look forward to reading more,
    Peter

  2. Alen Peacock Says:

    I don’t know of any external performance data / monitoring that is provided by these services. But you bring up an interesting idea; it would certainly be useful if someone monitored availability externally. I don’t know if there is any sort of a business opportunity there, but wouldn’t it be cool to have a disinterested third party provide service level statistics for all these online offerings?
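
    In the meantime, it wouldn’t take much to roll your own external monitor. A rough sketch in Python (the endpoint URLs and polling interval here are just placeholders, not the services’ official status URLs):

        import time
        import urllib.error
        import urllib.request

        # Hypothetical endpoints to probe; a real monitor would hit each
        # service's actual public URLs, from several network locations.
        ENDPOINTS = {
            "S3":    "https://s3.amazonaws.com/",
            "GMail": "https://mail.google.com/",
        }

        def is_up(url, timeout=10):
            """True if the endpoint answers at all (even with an error page)."""
            try:
                urllib.request.urlopen(url, timeout=timeout)
                return True
            except urllib.error.HTTPError:
                return True          # an HTTP error is still an answer
            except Exception:
                return False         # timeout, DNS failure, connection refused

        def monitor(interval=60):
            """Print one up/down sample per endpoint every `interval` seconds."""
            while True:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                for name, url in ENDPOINTS.items():
                    print(stamp, name, "up" if is_up(url) else "DOWN")
                time.sleep(interval)

    Collect those samples from a few vantage points for a few months and you’d have exactly the kind of disinterested service-level statistics described above.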

    As for using some of these services as an aggregate storage fabric, it probably wouldn’t be too difficult to cobble something together with the likes of S3’s API, GmailFS, and to a lesser degree WikipediaFS, and others. With the cost of doing this ranging from expensive (.mac), to reasonable (S3), to free (gmailfs), I wonder about the economics of aggregation and whether the cost justifies the architecture, especially given that the free services may not condone this type of use. It is a neat idea though, and I’d love to see someone implement it as a proof of concept — if done cleverly, there are also lots of non-traditional resources that could be leveraged for storage (myspace, youtube, flickr, etc.).
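
    For what it’s worth, the core of such an aggregator would be small. A rough sketch of the idea in Python, with in-memory stand-ins where real code would wrap the S3 API, GmailFS, and friends:

        class MemoryBackend:
            """Stand-in for a real provider wrapper (S3, GmailFS, ...)."""
            def __init__(self, name):
                self.name = name
                self._store = {}
            def put(self, key, data):
                self._store[key] = data
            def get(self, key):
                return self._store[key]

        class AggregateStore:
            """Replicate every blob across independent backends so that no
            single provider outage makes the data unavailable."""
            def __init__(self, backends):
                self.backends = backends

            def put(self, key, data):
                stored = 0
                for backend in self.backends:
                    try:
                        backend.put(key, data)
                        stored += 1
                    except Exception:
                        pass        # that provider is down; the others still hold it
                if stored == 0:
                    raise IOError("every backend failed")

            def get(self, key):
                for backend in self.backends:
                    try:
                        return backend.get(key)
                    except Exception:
                        continue    # fall through to the next provider
                raise KeyError(key)

        fabric = AggregateStore([MemoryBackend("s3"), MemoryBackend("gmailfs")])
        fabric.put("backup-2007-04-12.tgz", b"...")
        assert fabric.get("backup-2007-04-12.tgz") == b"..."

    Everything interesting lives in the real backend wrappers, of course, but the hedge against any single provider’s outage is just this: write everywhere, read from whoever still answers.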

  3. Starting by saying that GMail fails because it’s centralised and then claiming that e-mail never fails because it’s decentralised is meaningless. I don’t care if (on average) there’s always someone who can send email to someone. All I care about is if I can send and receive email with all the people I care about contacting – and no decentralised service can make that claim.

    “The Service” may exist as some platonic ideal with no connection to the real world and that may be breathtaking poetical beauty, but the bit where “The Service” meets the bit of the world where I live actually matters as well.

  4. “All I care about is if I can send and receive email with all the people I care about contacting – and no decentralised service can make that claim.”

    No, but it can claim to allow you to send and receive email with at least some (likely most) of the people you care about contacting, at all times. If you had instead used some centralized email-like system, you, and every single user of the system, would be completely out of luck whenever it suffered an outage.

    This isn’t theoretical; there is a huge difference between an outage that affects 100% of users, and one that affects a fraction of a percent of users.

    Look, I’m not claiming that decentralization is magic pixie dust, simply that it is much better than centralization, especially when it comes to reliability and scalability. And that’s exactly why you are using email today, instead of some siloed electronic message system.

  5. “Over a 24 hour period between last Friday and Saturday, millions of Microsoft customers who attempted to download software updates from the company’s Web site were erroneously accused of running pirated Windows versions, thanks to a glitch in Microsoft’s reviled Windows Genuine Advantage (WGA) system” — (from Forum of Incident and Response Team)

    Microsoft claims that fewer than 12,000 systems were affected, and that the glitch was due to human error — (from arstechnica)

    Even though WGA is necessarily centralized for business purposes (though certainly distributed among many data centers and servers), this is a good example of how having one entity control such a system can directly result in the system failing dramatically.

    On a funnier note, Microsoft is claiming this was not an “outage” because the servers were all up but they were just not working correctly (it just disabled a bunch of systems remotely, that’s all!).

  6. [...] decentralization in the wild. But these techniques are not limited to malicious software. Indeed, major components of the Internet itself use the same techniques. flŭd backup is 100% decentralized for the same reasons: the data that [...]

  7. [...] Most software is not designed to have this type of resiliency, and the excuse is simple: most software does not face adversarial forces. Or, at least, its designers think that it won’t (an assumption that many times leads to disaster). [...]

  8. [...] Cloud computing might not be able to deliver on its high-availability promises? Who would’ve thunk it? [...]

  9. Thanks to the author!