Cloud reliability

Reliability is often cited as one of the reasons some organizations are wary of the cloud.

Last week, Amazon, Rackspace and IBM had to “reboot” their clouds to deal with maintenance issues with the Xen hypervisor. Details were scarce, but it was quickly established that an unspecified vulnerability in the Xen hypervisor was the cause.

The vulnerability, discovered by researcher Jan Beulich, concerned the Xen hypervisor, the open-source technology that cloud service providers use to create and run virtual machines. If exploited, the vulnerability would have allowed malicious virtual machines to read data from, or crash, other virtual machines as well as the host server.

Not all providers had to reboot their clouds for upgrades or maintenance. Google and EMC VMware support live migration, which keeps internal changes invisible to users and avoids reboots like these. Microsoft uses a (customized) Hyper-V, so it was not exposed to this vulnerability.

It is interesting to see what “uptime” means in this context. In many reports of this nature, “uptime” doesn’t take “scheduled downtime” into account, and that could very well be the case here. If one does a little bit of math:

  • 99.9% uptime is 8.77 hours of downtime per year
  • 99.99% uptime is 52.60 minutes of downtime per year
  • 99.999% uptime is 5.26 minutes of downtime per year
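The figures above can be reproduced with a short Python sketch. It assumes a year of 365.25 days (8,766 hours), which is the basis that matches the numbers in the list. The function names and the `excluded_hours` parameter are illustrative assumptions, not taken from any provider's SLA or reporting tool.

```python
# Assumption: a year averages 365.25 days, i.e. 8766 hours.
HOURS_PER_YEAR = 365.25 * 24


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime, in hours, allowed by an uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def reported_uptime_percent(downtime_hours: float, excluded_hours: float = 0.0) -> float:
    """Return the uptime percentage over one year.

    excluded_hours is downtime the report chooses not to count,
    for example scheduled maintenance (a hypothetical parameter).
    """
    counted_hours = downtime_hours - excluded_hours
    return 100 * (1 - counted_hours / HOURS_PER_YEAR)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime: {hours:.2f} hours ({hours * 60:.2f} minutes) per year")
```

As a made-up illustration of why the definition matters: `reported_uptime_percent(10)` gives roughly 99.89%, while `reported_uptime_percent(10, excluded_hours=9)` gives roughly 99.99% for the same ten hours of actual unavailability.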

Although some users complained about the outage, most were complaining about the providers’ communication, or the lack of it.

Cloud providers can no longer be treated as a black box. As architects, we need to know the limitations of the architectural components the provider uses, such as Xen. We need to know how often these kinds of reboots have occurred and how the provider handles transparent maintenance.

We also need to consider the lines of communication. Providers often drop the ball here. People are often unhappy because they didn’t get much (or any) heads-up about the reboot, not because of the reboot itself.

We should remember that outages and other disruptions are few and far between these days, so these rare events get extra media attention.
