Sunday, September 8, 2013

The coming problem of legacy Big Data

With all the fuss about Big Data, we seem to have forgotten about the problems of legacy Big Data.

You may think that Big Data is too new to have legacy problems. Legacy problems affect old systems, systems that were designed and built by Those Who Came Before And Did Not Know How To Plan For The Future. Big Data cannot possibly have those kinds of problems, because 1) the systems are new, and 2) they have been built by us.

Big Data systems are new, which is why I say that the problems are coming. The problems are not here now. But they will arrive, in a few years.

What kind of problems? I can think of several.

Data formats: Newer tools (or newer versions of existing tools) change the formats of data and cannot read old formats. (For example, recent versions of Microsoft Excel cannot read Lotus 1-2-3 files.)

Data value codes: The values used in data to encode specific ideas change over time. These might be account codes, or product categories, or status codes. The problem is not that you cannot read the files, but that the values mean things other than what you think.

Missing or lost data: Non-Big Data (should that be "Small Data"?) can be easily stored in version control systems or other archiving systems. Big Data, by its nature, doesn't fit well in these systems. Without an easy way to back up or archive Big Data, many shops will take the easy way and simply not make copies.

Inconsistent data: Data sets of any size can hold inconsistencies. Keeping traditional data sets consistent requires discipline and proper tools. Finding inconsistencies in larger data sets is a larger problem, requiring the same discipline and mindset but perhaps more capable tools.
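To make the value-code problem concrete, here is a minimal sketch in Python. It assumes a shop keeps a dated history of what each code meant; the table STATUS_CODE_HISTORY, its dates, and the function decode_status are all invented for illustration and do not come from any real system.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history of status codes: code -> [(effective_from, meaning)].
# Each list must be sorted by effective date, oldest first.
STATUS_CODE_HISTORY = {
    "A": [(date(2005, 1, 1), "active"), (date(2011, 7, 1), "archived")],
    "C": [(date(2005, 1, 1), "closed")],
}


def decode_status(code, record_date):
    """Return the meaning a code had on the date the record was written."""
    history = STATUS_CODE_HISTORY.get(code)
    if history is None:
        raise KeyError(f"unknown status code: {code!r}")
    effective_dates = [start for start, _ in history]
    # Find the last definition that took effect on or before record_date.
    index = bisect_right(effective_dates, record_date) - 1
    if index < 0:
        raise ValueError(f"code {code!r} was not defined on {record_date}")
    return history[index][1]
```

With this made-up table, a record from 2009 carrying code "A" decodes as "active", while a record from 2012 with the same code decodes as "archived". A reader that knows only the current meaning would silently misread the older record.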

In short, the problems of legacy Big Data are the same problems as legacy Small Data.

The savvy shops will be prepared for these problems. They will put the proper checks in place to identify inconsistencies. They will plan for changes to formats. They will ensure that data is protected with backup and archive copies.
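One way to plan for format changes is to stamp every record with a format version and keep a reader for each version. The Python sketch below shows the idea; the field names (format_version, payload, first_name, and so on) and the two layouts are assumptions made up for this example, not a standard.

```python
import json


def _read_v1(payload):
    # Hypothetical version 1 layout: a single "name" field, no country.
    return {"name": payload["name"], "country": None}


def _read_v2(payload):
    # Hypothetical version 2 layout: name split in two, country added.
    full_name = f'{payload["first_name"]} {payload["last_name"]}'
    return {"name": full_name, "country": payload["country"]}


# One reader per known format version; old readers are never deleted.
READERS = {1: _read_v1, 2: _read_v2}


def read_record(line):
    """Parse one JSON record and normalize it to the current layout."""
    raw = json.loads(line)
    version = raw.get("format_version")
    reader = READERS.get(version)
    if reader is None:
        # Fail loudly rather than guess at an unknown layout.
        raise ValueError(f"unsupported format_version: {version!r}")
    return reader(raw["payload"])
```

The design choice here is that old data is never rewritten. New tools carry the old readers forward, and a record with an unrecognized version is rejected instead of being misread.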

In short, the solutions to the problems of legacy Big Data are the same as the solutions to the problems of legacy Small Data.
