Showing posts with label distributed computing. Show all posts

Sunday, June 21, 2015

More and smaller data centers for cloud

We seem to repeat lessons of technology.

The latest lesson is one from the 1980s: The PC revolution. Personal computers introduced the notion of smaller, numerous computers. Previously, the notion of computers revolved around mainframe computers: large, centralized, and expensive. (I'm ignoring minicomputers, which were smaller, less centralized, and less expensive.)

The PC revolution was less a change from mainframes to PCs and more a change in mindset. The revolution made the notion of small computers a reasonable one. After PCs arrived, the "space" of computing expanded to include mainframes and PCs. Small computers were considered legitimate.

That lesson -- that computing can come in small packages as well as large ones -- can be applied to cloud data centers. The big cloud providers (Amazon.com, Microsoft, IBM, etc.) have built large data centers. And large is an apt description: enormous buildings containing racks and racks of servers, power distribution units, air conditioning... and more. The facilities vary between the players: the hypervisors, operating systems, and administration systems all differ. But the one factor they have in common is that they are all large.

I'm not sure that data centers have to be large. They certainly don't have to be monolithic. Cloud providers maintain multiple centers ("regions", "zones", "service areas") to provide redundancy in the event of physical disasters. But aside from the issue of redundancy, it seems that the big cloud providers are thinking in mainframe terms. They build large, centralized, (and expensive) data centers.

Large, centralized mainframe computers make sense for large, centralized mainframe programs.

Cloud systems are different from mainframe programs. They are not large, centralized programs. A properly designed cloud system consists of small, distinct programs tied together by data stores and message queues. A cloud system becomes big by scaling -- by increasing the number of copies of web servers and applications -- and not by growing a single program or single database.
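As a rough sketch of that idea (not any particular provider's API -- the worker and its doubling "work" are invented here for illustration), scaling such a system means starting more copies of the same small program, all fed from a shared queue:

```python
import queue
import threading

def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """One small, distinct program: take a task, process it, emit a result."""
    while True:
        item = tasks.get()
        if item is None:          # sentinel: shut this worker down
            tasks.task_done()
            break
        results.put(item * 2)     # stand-in for real work
        tasks.task_done()

def run(num_workers: int, items: list[int]) -> list[int]:
    tasks: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    # "Scaling" is simply starting more identical workers; no single
    # program grows larger.
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)           # one sentinel per worker
    tasks.join()
    return sorted(results.get() for _ in items)

print(run(4, [1, 2, 3]))          # prints [2, 4, 6]
```

The workers are interchangeable, so the system gets bigger by raising `num_workers` -- and nothing in the design requires those workers to run in the same building.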

A large cloud system can exist on a cloud platform that lives in one large data center. For critical systems, we want redundancy, so we arrange for multiple data centers. This is easy with cloud systems, as the system can expand by creating new instances of servers, not necessarily in the same data center.

A large cloud system doesn't need a single large data center, though. A large cloud system, with its many instances of small servers, can just as easily live in a set of small data centers (provided that there are enough servers to host the virtual servers).

I think we're in for an expansion of mindset, the same expansion that we saw with personal computers. Cloud providers will supplement their large facilities with small- and medium-sized data centers.

I'm ignoring two aspects here. One is communications: network transfers are faster within a single data center than across multiple centers. But how many applications are that sensitive to time? The other aspect is the efficiency of smaller data centers. It is probably cheaper, on a per-server basis, to build large data centers. Small data centers will have to take advantage of something, like an existing small building that requires no major construction.

Cloud systems, even large cloud systems, don't need large data centers.

Thursday, October 10, 2013

Hadoop shows us a possible future of computing

Computing has traditionally been processor-centric. The classic model of computing has a "central processing unit" which performs computations. The data is provided by "peripheral devices", processed by the central unit, and then routed back to peripheral devices (the same as the original devices or possibly others). Mainframes, minicomputers, and PCs all use this model. Even web applications use this model.

Hadoop changes this model. It is designed for Big Data, and the size of data requires a new model. Hadoop stores your data in segments spread across a number of servers, with redundancy to prevent loss; each segment is 64MB to 2GB. If your data is smaller than 64MB, moving to Hadoop will gain you little. But that's not important here.

What is important is Hadoop's model. Hadoop moves away from the traditional computing model. Instead of a central processor that performs all calculations, Hadoop leverages servers that can hold data and also perform calculations.

Hadoop makes several assumptions:

  • The code is smaller than the data (or a segment of data)
  • Code is transported more easily than data (because of size)
  • Code can run on servers
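A toy sketch of that model (this is the idea, not Hadoop's actual API; the "nodes" here are just strings standing in for servers holding data blocks): the small piece of code travels to each segment, runs against local data, and only the small partial results travel back to be merged.

```python
from collections import Counter

# Each "node" holds one segment of a larger text. (The segments are tiny
# here; in Hadoop they would be 64MB+ blocks on separate servers.)
nodes = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "the dog barks",
]

def count_words(segment: str) -> Counter:
    """The code that is shipped to each node and run against local data."""
    return Counter(segment.split())

# "Map": run the code where the data lives.
partials = [count_words(segment) for segment in nodes]

# "Reduce": merge the small partial results centrally.
total = sum(partials, Counter())

print(total["the"])   # "the" appears once per segment, so prints 3
```

Note what moved over the (simulated) network: a small function and three small `Counter` objects -- never the full text. That is the whole point of the assumptions listed above.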

With these assumptions, Hadoop builds a new model of computing. (To be fair, Hadoop may not be the only package that builds this new model of distributed processing -- or even the first. But it has a lot of interest, so I will use it as the example.)

All very interesting. But here is what I find more interesting: the distributed processing model of Hadoop can be applied to other systems. Hadoop's model makes sense for Big Data, and systems with Little (that is, not Big) data should not use Hadoop.

But perhaps smaller systems can use the model of distributed processing. Instead of moving data to the processor, we can store data with processors and move code to the data. A system could be constructed from servers holding data, connected with a network, and mobile code that can execute anywhere. The chief tasks then become identifying the need for code and moving code to the correct location.
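Those chief tasks -- knowing where the data lives, and moving code to that location -- can be sketched in a few lines. This is a hypothetical design (the `Node`, `Cluster`, `put`, and `apply` names are invented here), not an existing system:

```python
class Node:
    """A server that stores data and can also run code against it."""
    def __init__(self, name: str):
        self.name = name
        self.store: dict[str, object] = {}

    def execute(self, key: str, code) -> object:
        """Run mobile code against locally held data."""
        return code(self.store[key])

class Cluster:
    """Tracks which node holds which data, and routes code to it."""
    def __init__(self, nodes: list[Node]):
        self.nodes = nodes
        self.location: dict[str, Node] = {}

    def put(self, key: str, value: object) -> None:
        node = self.nodes[hash(key) % len(self.nodes)]
        node.store[key] = value
        self.location[key] = node

    def apply(self, key: str, code) -> object:
        # The two chief tasks: identify the location, move the code there.
        return self.location[key].execute(key, code)

cluster = Cluster([Node("a"), Node("b")])
cluster.put("readings", [3, 1, 4, 1, 5])
print(cluster.apply("readings", sum))   # the code (sum) goes to the data; prints 14
```

The data in `"readings"` never leaves its node; only the function travels. A real version would serialize the code and send it over a network, but the shape of the design is the same.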

That would give us a very different approach to system design.