Wednesday, April 4, 2012

The cloud is not invincible

We tend to think of cloud-based systems as more reliable than our current server-based applications. And they can be, if they are designed properly.

Cloud-based systems use designs that distribute work to multiple servers. Instead of a single web server, you have multiple web servers with some form of load balancing. Instead of a single database server, you have multiple database servers with some form of synchronization. At first glance, it may seem quite similar to a sophisticated system hosted on your in-house servers.

But cloud-based systems are different. Cloud systems rest on several assumptions:


  • the system is distributed among multiple servers
  • any layer can be served by multiple servers (there are no special, "magical" servers with unique data)
  • any one server may go off-line at any moment
  • the cloud can "spin up" a new instance of a server quickly
  • requests between servers are queued and can be re-routed to other servers

As long as all of these assumptions hold, we have a reliable system.
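To make the last assumption concrete, here is a minimal sketch of re-routing a request when a server has gone off-line. The function name `call_with_failover`, the `send` callable, and the server names are all hypothetical, made up for illustration; a real cloud platform would normally do this for you in its load balancer or message queue.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def call_with_failover(servers: Iterable[str], send: Callable[[str], T]) -> T:
    """Try each server in turn and return the first successful response.

    `send` is a hypothetical callable that delivers the request to one
    server and raises OSError (for example ConnectionError or
    TimeoutError) when that server is unavailable.
    """
    last_error: Optional[Exception] = None
    for server in servers:
        try:
            return send(server)
        except OSError as error:
            # This server is off-line; remember why and try the next one.
            last_error = error
    # No server answered. Report the failure instead of looping forever.
    raise RuntimeError("all servers failed") from last_error
```

The sketch only works because no server is "magical": any server in the list can handle the request, so losing one of them is not fatal.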

Yet the cloud is not invincible. Here are a few ways to build a fragile system:

Require a service all the time (that is, fail if something is not available)

It is easy to fail due to a simple missing service. For example, a web app may have a home page with some information and some widgets. Let's say that one of the widgets is a weather display, showing the current temperature and weather conditions for the user. (We can assume that the widget is informed of the user's location, so that it can request the local weather from the general service.)

If the weather web service is down (that is, not responding, or responding with invalid data), what does your web application do? Does it skip over the weather information (and provide sensible HTML)? Or does it lock in a loop, waiting for a valid response? Or worse, does the page builder throw an exception and terminate?

This problem can occur with internal or external services. Any server can go off-line, any service can become unavailable. How does your system survive the loss of services?
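One way to survive a missing service is to treat the widget as optional. The sketch below is only an illustration: `render_weather_widget`, `render_home_page`, and the `fetch_weather` callable are hypothetical names, and the response is assumed to be a dictionary with `temperature` and `conditions` keys. It also assumes that `fetch_weather` enforces its own timeout, so the page builder never waits indefinitely.

```python
import html
from typing import Callable


def render_weather_widget(fetch_weather: Callable[[str], dict], location: str) -> str:
    """Return HTML for the weather widget, or "" if the service fails."""
    try:
        data = fetch_weather(location)
        # Missing or malformed fields count as a failed response too.
        temperature = float(data["temperature"])
        conditions = str(data["conditions"])
    except Exception:
        # Service down, timed out, or returned invalid data: skip the widget.
        return ""
    return '<div class="weather">{:.0f}&deg; {}</div>'.format(
        temperature, html.escape(conditions)
    )


def render_home_page(fetch_weather: Callable[[str], dict], location: str) -> str:
    """Build the home page from whichever widgets are available."""
    widgets = [
        "<h1>Welcome</h1>",
        render_weather_widget(fetch_weather, location),
    ]
    # Drop empty widgets so the page is still sensible HTML.
    return "\n".join(widget for widget in widgets if widget)
```

With this arrangement, a failed weather service costs the user one widget, not the whole page.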

Assume that the cloud will provide servers to meet demand

A big advantage of the cloud is scalability: you get more servers when you need them. And this is true, for the most part. While cloud infrastructure does "scale up" and "scale down" to meet your processing load, the processing power is not guaranteed.

For example, you may have a contract that allows for scaling up to a specified limit. (Such limits are put in place to ensure that the monthly bill will remain within some agreed-upon figure.) If your demand exceeds the contractual limits, your systems will be constrained and your customers may see poor performance.

What warnings will you get about nearing your processing limit? What warnings will your system provide when performance starts to degrade?
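A simple early warning could compare the number of running servers against the contractual limit. This is a rough sketch under stated assumptions: the function `capacity_status`, its parameters, and the 80% threshold are invented for illustration, and how you obtain the instance count depends entirely on your provider.

```python
def capacity_status(active_instances: int, instance_limit: int,
                    warn_ratio: float = 0.8) -> str:
    """Classify current usage against a contractual instance limit.

    Returns "ok", "warning" (usage at or above warn_ratio of the limit),
    or "at-limit" (no room left to scale up).
    """
    if instance_limit <= 0:
        raise ValueError("instance_limit must be positive")
    usage = active_instances / instance_limit
    if usage >= 1.0:
        return "at-limit"
    if usage >= warn_ratio:
        return "warning"
    return "ok"
```

A monitoring job might call this every few minutes and alert someone on "warning", well before customers notice the slowdown that comes with "at-limit".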

Assume that the cloud is the only thing that needs to scale

Even if the cloud infrastructure (servers) scales, does the network capacity? For the big cloud providers, the answer is yes. But what about your own network?

Designing for the cloud is different from designing for in-house web systems. But not that much different; there is a lot of overlap between them. Use your experience from your existing systems. Think about the problems that cloud computing solves. Learn about the assumptions that no longer hold. Those changes work both ways: some remove old problems, and some introduce new risks.

You can build reliable cloud-based systems. But don't think that it happens "for free".
