Tuesday, July 17, 2012

How big is "big"?

A recent topic of interest in IT has been "big data", sometimes spelled with capitals: "Big Data". We have no hard and fast definition of big data, no specific threshold to cross from "data" to "big data". Does one terabyte constitute "big data"? If not, what about one petabyte?

This puzzle is similar to the question of "real time". Some systems must perform actions in "real time", yet we do not have a truly standard definition of the term. If I design a dashboard system for an automobile and equip the automobile with sensors that report data every two seconds, then a real-time dashboard system must process all of the incoming data as it arrives, by definition. If I replace the sensors with units that report data every half second and the dashboard cannot keep up with the faster rate, then the system is no longer "real time".

But this means that the definition of "real time" depends not only on the design of the processing unit, but also on the devices with which it communicates. The system may be considered "real time" until we change a component; then it is not.
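To make that dependence concrete, here is a minimal sketch in Python using the dashboard example above. The 0.8-second processing time per reading is an assumed figure chosen purely for illustration; the point is only that the same processor passes or fails the "real time" test depending on how fast the sensors report.

PROCESSING_TIME_PER_READING = 0.8  # seconds the dashboard needs per reading (assumed for illustration)

def is_real_time(reporting_interval_seconds: float) -> bool:
    """The dashboard keeps up only if it finishes each reading
    before the next one arrives."""
    return PROCESSING_TIME_PER_READING <= reporting_interval_seconds

print(is_real_time(2.0))   # True  -> "real time" with sensors reporting every 2 seconds
print(is_real_time(0.5))   # False -> the same dashboard misses deadlines at half-second reports

Nothing about the dashboard changes between the two calls; only the peripheral device does, and that alone shifts the label.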

I think that the same logic holds for "big data" systems. Today, we consider multiple petabytes to be "big data". Yet in 1990, when PCs had disks of 30 megabytes, a data set of one gigabyte would have been considered "big data". And in the 1960s, a data set of one megabyte would have been "big data".

I think that, in the end, the best we can say is that "big" is as big as we want to define it, and "real time" is as fast as we want to define it. "Big data" will always be larger than the average organization can comfortably handle, and "real time" will always be fast enough to process the incoming transactions.

Which means that we will always have some systems that handle big data (and some that do not), and some systems that run in real time (and some that do not). Using the terms properly will rely not on the capabilities of the core components alone, but on our knowledge of the core and peripheral components. We must understand the whole system to declare it to be "big data" or "real time".
