Cloud Durability?

Cloud computing might not be able to deliver on its high-availability promises? Who would’ve thunk it?

The future is not to rely on a single entity to provide hosting, scaling, and reliability, but rather to rely on a multitude of distinct entities. That way, if one or two go down — or even if many go down — service continues uninterrupted, as if by magic.


Decentralized Resilience

I’ve written at some length about the importance of resilience in flŭd’s design, and how complete decentralization is a key component of that resiliency. One of flŭd’s hallmark goals is to back up data in such a way that it would be virtually impossible to lose it — even if a very powerful adversary (including an oppressive government regime or an extensive natural disaster) disables large portions of the flŭd network.

Most software is not designed to have this type of resiliency, and the excuse is simple: most software does not face adversarial forces. Or, at least, its designers assume that it won't (an assumption that often leads to disaster).

There is one class of software, however, which meets adversity as part of its raison d'être: malware. Now, of course, flŭd's purposes are the polar opposite of software like the Storm botnet, but I can't help but admire, at least from a technological standpoint, some of the self-preservation techniques that malware such as Nugache employs to avoid eradication. It seems that many anti-malware researchers share my reluctant admiration. From SearchSecurity.com:

Dittrich, one of the top botnet researchers in the world, has been tracking botnets for close to a decade and has seen it all. But this new piece of malware, which came to be known as Nugache, was a game-changer. With no [centralized command-and-control] server to target, bots capable of sending encrypted packets and the possibility of any peer on the network suddenly becoming the de facto leader of the botnet, Nugache, Dittrich knew, would be virtually impossible to stop.


Eternal Storm

If you haven't heard of the Storm Botnet yet, chances are you will soon. With an estimated one to 50 million infected nodes, this trojan arguably forms the world's most powerful supercomputer. Criminal elements are believed to control Storm, but that's not why it is interesting.

The Storm Botnet is fascinating because of its resilience. As Bruce Schneier points out, antivirus companies haven’t figured out a way to put a dent in its propagation or effectiveness, even though they have known about it for almost a year, and even though it has already been used to send millions (perhaps billions) of spam emails and carry out several high-profile DDoS attacks. Getting rid of such a beast would clearly be very lucrative work for the antivirus industry, yet no one has devised a successful solution.

What makes this thing so indestructible? Schneier explains a key component:

Rather than having all hosts communicate to a central server or set of servers, Storm uses a peer-to-peer network for C2. This makes the Storm botnet much harder to disable. The most common way to disable a botnet is to shut down the centralized control point. Storm doesn’t have a centralized control point, and thus can’t be shut down that way.

This technique has other advantages, too. Companies that monitor net activity can detect traffic anomalies with a centralized C2 point, but distributed C2 doesn’t show up as a spike. Communications are much harder to detect.

One standard method of tracking root C2 servers is to put an infected host through a memory debugger and figure out where its orders are coming from. This won’t work with Storm: An infected host may only know about a small fraction of infected hosts — 25-30 at a time — and those hosts are an unknown number of hops away from the primary C2 servers.

And even if a C2 node is taken down, the system doesn’t suffer. Like a hydra with many heads, Storm’s C2 structure is distributed.

In combination with some of its other novel features, Storm’s decentralization gives it a level of immunity to shutdown previously unseen anywhere.

It's a shame that a malicious tool like Storm is providing such a stunning demonstration of the advantages of decentralization in the wild. But these techniques are not limited to malicious software. Indeed, major components of the Internet itself use the same techniques. flŭd backup is 100% decentralized for the same reasons: the data that you back up with flŭd should be as indestructible as possible. Decentralization is the most effective path to that goal, and Storm has provided ample evidence that such a scheme can be extremely effective.
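
To make the idea concrete, here's a minimal sketch of the kind of gossip-style overlay described above. The node counts and peer-list sizes are invented, and this is an illustration of the principle rather than Storm's actual protocol (or flŭd's): each node knows only a small random handful of peers, yet a message injected anywhere still reaches nearly every survivor even after most of the network has been knocked out.

```python
import random
from collections import deque

# A toy gossip overlay. Every node knows only a small, random subset of its
# peers (Storm's bots reportedly knew about 25 to 30 at a time); there is no
# central relay to seize. The sizes below are illustrative, not real figures.

def build_overlay(n_nodes=1000, peers_per_node=25, seed=1):
    rng = random.Random(seed)
    node_ids = list(range(n_nodes))
    return {
        node: rng.sample([p for p in node_ids if p != node], peers_per_node)
        for node in node_ids
    }

def flood(overlay, alive, origin):
    """Breadth-first gossip from origin, restricted to surviving nodes."""
    reached = {origin}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for peer in overlay[node]:
            if peer in alive and peer not in reached:
                reached.add(peer)
                queue.append(peer)
    return reached

if __name__ == "__main__":
    overlay = build_overlay()
    rng = random.Random(2)
    survivors = set(rng.sample(sorted(overlay), 400))  # adversary disables 60%
    origin = next(iter(survivors))
    reached = flood(overlay, survivors, origin)
    print(f"message still reached {len(reached)} of {len(survivors)} survivors")
```

With each survivor still knowing, on average, a handful of live peers, the overlay almost always stays connected; that connectivity, not any central server, is what an adversary would have to destroy.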


P2P’s Skype-induced Blackeye, or Why Diversity is Good

On August 16th, Skype went down for two days. The company blamed the outage on its p2p network: nodes worldwide received software updates at the same time, restarted, and overwhelmed the system when they all tried to log back in at once.

Rather than blaming the decentralized nature of its p2p network, Skype should be pointing the finger at the true causes of this extended outage: Skype's own centralized control and distribution of its software, and Skype's closed, proprietary protocol, which ensures that no clients other than the official Skype client exist.

The Skype ecosystem is homogeneous, and as such it faces the same problem seen in the Irish potato famine, or in today's genetically uniform banana crops: a population that lacks diversity lacks resilience to disease and adversity. When every individual in a population is a clone or close relative of every other, a single disease can infect them all, and it spreads like wildfire.

In Skype's case, the massive simultaneous restart of nodes apparently sparked a firestorm of log-in requests, exposing another weakness in Skype's architecture: because Skype must centrally manage accounts for business reasons, all account management, including log-in authentication, is performed centrally. Skype's authentication servers are distributed, but not decentralized. Centralized components do not scale well, especially during sharp spikes in usage.

Contrast this with the gnutella or bittorrent networks, in which no single entity controls all the clients. There are perhaps dozens of different active clients for each network, and if there is a fatal bug in one version of one client, it only affects the portion of the population running that version of that client. Diversity is good for the ecosystem, and always will be.
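
To put numbers on it, here's a trivial back-of-the-envelope sketch (the client names and market shares are invented) of what a single fatal bug costs a monoculture versus a diverse population of clients.

```python
# Monoculture vs. diversity, in one line of arithmetic. The client names and
# market shares below are invented; the point is only that the blast radius of
# a single fatal bug is the market share of the client that carries it.

def surviving_fraction(shares, buggy_client):
    """Fraction of the network unaffected when one client has a fatal bug."""
    return 1.0 - shares.get(buggy_client, 0.0)

homogeneous = {"official-client": 1.0}
diverse = {"client-a": 0.4, "client-b": 0.3, "client-c": 0.2, "client-d": 0.1}

print(surviving_fraction(homogeneous, "official-client"))  # 0.0: total outage
print(surviving_fraction(diverse, "client-a"))             # 0.6: degraded, not dead
```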

I’ve claimed before that decentralization can eradicate service outages. But simply decentralizing some portion of a system clearly does not make it immune to serious outage problems. Decentralization without open protocols (and usually by corollary, open software) is not decentralization at all. Decentralization is not made of magic pixie dust, but it will overcome this type of catastrophe if applied correctly.

This is why the flŭd protocol is open, and why the software is open source; we hope that others will eventually write clients not only in other languages and for other platforms, but also clients that implement different strategies for maximizing their efficiency and their trading relationships.

Hamstringing a p2p application with centralized SPOFs (single points of failure), and then blaming decentralization as the cause of problems that really stem from centralized bottlenecks is, at the very least, disingenuous. But I understand Skype’s predicament — it’s much harder to say, “our system failed because our business model fundamentally weakened our technology” than it is to say “we had a bug in our p2p application.”


Eradicating Service Outages, Once and For All

Throughout the day, news sites and bloggers have been covering a glitch in Google’s Personalized Home Page service that apparently deletes all the customized content from users’ pages.

For nearly 7 hours at the beginning of April 2006, Amazon’s S3 storage service returned nothing but service unavailable messages. It happened again in January 2007, ruining Amazon’s 99.99% uptime goal.

For much of the morning of April 12th, Microsoft's Hotmail and MSN services were unavailable to many of their users.

In February, Apple’s .Mac online service was reported to be mostly unavailable for many customers over a 5-day period.

Five months ago, Microsoft's Windows Genuine Advantage service (which is used to disable your copy of Windows if it is deemed pirated) experienced a temporary outage, erroneously flagging many legitimate systems as non-genuine.

A couple of months ago, Google irretrievably lost the email archives and address books of 65+ users of its Gmail service.

Typepad, Blogger, Bloglines, del.icio.us, Google, Yahoo — all have experienced high-profile outages that have made their services unavailable, and in some cases lost customer data.

We all know that both hardware and software are unreliable. Given this fact, all of the above services have created counter-measures to protect your data and provide continuous service, yet none have been able to deliver.

And yet, we have created systems that never go down. The best known example is the Internet itself. Certainly, parts of the Internet fail all the time: fiber optic cables are severed, core routers malfunction, ISPs drop service to their end customers, servers blow up. But the modern Internet has never “gone down,” not even for a brief moment. As you may recall, this was part of its design goal: ARPA wanted a network that could survive multiple nuclear strikes without failing. Why is the Internet resilient? Because it is decentralized.
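
A toy example makes the point. In the little router mesh below, the topology is invented and the breadth-first search stands in for real routing protocols like OSPF or BGP, but the behavior is the same in spirit: knock out routers and the traffic simply flows around the damage.

```python
from collections import deque

# A small mesh of routers with redundant links. The topology is invented;
# real Internet routing (BGP, OSPF) is far more sophisticated, but the
# resilience comes from the same place: multiple independent paths.
links = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "E"},
    "D": {"B", "E", "F"},
    "E": {"C", "D", "F"},
    "F": {"D", "E"},
}

def find_route(src, dst, failed=frozenset()):
    """Breadth-first search for any path from src to dst avoiding failed routers."""
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in links[path[-1]]:
            if nxt not in visited and nxt not in failed:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # the destination is truly unreachable

print(find_route("A", "F"))                     # e.g. ['A', 'B', 'D', 'F']
print(find_route("A", "F", failed={"B", "D"}))  # reroutes: ['A', 'C', 'E', 'F']
```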

We’ve also designed resilient applications that run on top of this robust network. Email is the best known example. Email never fails. Yes, email servers may disintegrate, network outages may prevent specific routes of email from being delivered, individual email providers might disappear. And these problems do occasionally cause a few individuals to be unable to access or send email for brief periods of time. But the Internet’s “email service” has never served up a “service unavailable” message. There has never been a time when the vast majority of email users on the Internet could not send each other email. Why is email resilient? Because it is decentralized.
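
Here's a rough sketch of the mechanics (the host names, priorities, and failure checks are all invented; a real mail transfer agent would look up MX records in DNS and speak SMTP): every domain publishes several mail exchangers, a sender tries them in priority order, and anything it can't deliver right now it simply queues and retries later.

```python
# Why a single dead mail server doesn't break email: each domain publishes
# several MX hosts in priority order, often run by different parties, and a
# sending server that can't reach any of them queues the message and retries.
# Everything below is invented for illustration; no DNS or SMTP is involved.

mx_records = [
    (10, "mx1.example.org"),
    (20, "mx2.example.org"),
    (30, "backup-mx.example.net"),  # commonly hosted by a different provider
]

def try_deliver(message, down_hosts):
    """Attempt delivery to each exchanger in priority order; report the outcome."""
    for _priority, host in sorted(mx_records):
        if host not in down_hosts:
            return f"delivered via {host}"
    return "all exchangers unreachable: queued, will retry"  # store and forward

print(try_deliver("hello", down_hosts={"mx1.example.org"}))
print(try_deliver("hello", down_hosts={"mx1.example.org", "mx2.example.org",
                                       "backup-mx.example.net"}))
```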

Why, then, doesn't this type of reliability show up in the commercial sector of the Internet? Why aren't the Amazon S3s, the Apple .Macs, the Microsoft WGAs, et al. more reliable?

The answer is surprisingly simple: centrality is vital to many business models. So even though we know that robust services eschew centrality while non-robust ones cling to it, most commercial services tether themselves to centralized architectures not for technical reasons, but because collecting money from users requires the ability to centrally 'turn off' the service for those who don't pay.

The smartest centralized services, such as Amazon’s EC2 or Google’s search engine, try to get as far away from centrality as possible, implementing things like geographically diverse data replication and roaming virtual machine instances. But even the most decentralized commercial services remain chained to central components. Amazon’s S3 and EC2 hardware, for example, remain under the complete control of Amazon, and are centrally administered by Amazon staff. Amazon centrally regulates what users of the service are allowed to do through license agreements. When spikes in usage arise, Amazon must acquire and install more hardware to keep up with demand, presumably passing purchase orders through their centralized purchasing department. A single mistake by a systems administrator has the potential to bring the entire service down. The same can be said for any of the other services controlled by a single entity.

Contrast this with email, with bittorrent, or with flŭd, where anyone can instantiate nodes that provide their respective services, where no single entity can start or shutter the service, and where no committee or organization can censor content or control the flow of data and information. Supply of the service grows organically out of demand for the service. No single decision can turn off these decentralized services; for that to happen, every current participant would have to decide, simultaneously, to turn off and stay off, and every future potential participant would have to decide to remain, well, potential.

True decentralization has proven that there is a better way. So why do we still build and rely on centralized services that are susceptible to outages? In the future, perhaps we won’t.


The Death of the Datacenter

Jonathan Schwartz weighed in on the future of the datacenter today, saying:

where’s computing headed? …into the real world, certainly.

Perhaps a more interesting question should be – why bother with datacenters at all? Surely it’s time we all started revisiting some basic assumptions…

Schwartz's message is clear: the datacenter as we know it will die. That's a pretty shocking prognostication to come from the lips of the CEO of Sun Microsystems, a company whose bread and butter has always been the traditional datacenter. But of course there is little new in this prophecy. Many of us have been predicting the demise, or at least the decline, of the datacenter for years now. And it's not hard to see why.

The center of gravity of computing has been moving away from centralization for several decades. Mainframes gave way to minicomputers. Minicomputers gave way to personal computers. Desktops proliferated. Portable devices such as the Palm, the iPod, dashboard GPS navigators, and cell phones clearly grow more important every day.

And it's not just the hardware. Software has followed the same trend, from sendmail's gluing together of diverse electronic mail systems into what we now think of simply as email, to the creation of the Internet itself, to the incredible popularity and seeming indestructibility of today's most widely used p2p filesharing applications. All of these decentralized computing resources provide an incredibly diverse and resilient platform for building the next generation of software and services.

Will the datacenter as we know it today cease to exist in ten years? Probably not. But will important services continue to move to the edges of the network? Yes. Will decentralization be able to provide services at costs that blow datacenter solutions out of the water? Yes. Will these new applications provide a new level of reliability? Yes.
