Tuesday, January 6, 2015

The cloud brings change

Perhaps the greatest effect that cloud computing has on operations is change -- or more specifically, a move away from the strict policies against change.

Before cloud technology, operations viewed hardware and software as things that should change rarely and under strictly controlled conditions. Hardware was upgraded only when needed, and only after long planning sessions to review the new equipment and ensure it would work. Software was updated only during specified "downtime windows", typically early in the morning on a weekend when demand would be low.

The philosophies of "change only when necessary" and "only when it won't affect users" were driven by the view of hardware and software. Before cloud computing, most people had a mindset that I call the "mainframe model".

In this "mainframe model," there is one and only one computer. In the early days of computing, this was indeed the case. Turning the computer off to install new hardware meant that no one could use it. Since the entire company ran on it, a hardware upgrade (or a new operating system) meant that the entire company had to "do without". Therefore, updates had to be scheduled to minimize their effect and they had to be carefully planned to ensure that everything would work upon completion.

Later systems, especially web systems, used multiple web servers and often multiple data centers with failover, but people kept the mainframe model in their head. They carefully scheduled changes to avoid affecting users, and they carefully planned updates to ensure everything would work.

Cloud computing changes that. With cloud computing, there is not a single computer. There is not a single data center (if you're building your system properly). By definition, computers (usually virtualized) can be "spun up" on demand. Taking one server offline does not affect your customers; if demand increases you simply create another server to handle the additional workload.

The ability to take servers offline means that you can relax the scheduling of upgrades. You do not need to wait until the wee hours of the morning. You do not need to upgrade all of your servers at once. (Although running servers with multiple versions of your software can cause other problems. More on that later.)

Netflix has taken the idea further, creating tools that deliberately break individual servers. By doing so, Netflix can examine the failures and design a more robust system. Once a single server fails, other servers take over the work -- or so Netflix would like. If servers don't pick up the workload, Netflix has a problem and changes their code.

Cloud technology lets us use a new model, what I call the "cloud model". This model allows for failures at any time -- not just from servers failing but from servers being taken offline for upgrades or maintenance. Those upgrades could be hardware, virtualized hardware, operating systems, database schemas, or application software.

The cloud model allows for change. It requires a different set of rules. Instead of scheduling all changes for a single downtime window, it distributes changes over time. It mandates that newer versions of applications and databases work with older versions, which probably means smaller, more incremental changes. It also encourages (perhaps requires) software to administer the changes and ensure that all servers get the changes. Instead of growing a single tree you are growing a forest.

The fact that cloud technology brings change, or changes our ideas of changes, should not be a surprise. Other technologies and techniques (agile development, dynamic languages) have been moving in the same direction. Even Microsoft and Apple are releasing products more quickly. Change is upon us, whether we want it or not.

No comments: