Sunday, February 17, 2013

Losing data in the cloud of big data

NoSQL databases have several advantages over traditional SQL databases -- in certain situations. I think most folks agree that NoSQL databases are better for some tasks and SQL databases are better for others. And most discussions of Big Data agree that NoSQL is the tool of choice for Big Data stores.

One aspect that I have not seen discussed is auditing. That is, knowing that we have all of the data we expect to have. Traditional data processing systems (accounting, insurance, banking, etc.) have lots of checks in place to ensure that all transactions are processed and none are lost.
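The kind of check those traditional systems rely on is often a simple "control total": the sender records how many transactions it sent and what they sum to, and the receiver verifies both after loading. Here is a minimal sketch of that idea in Python; the function name, field names, and sample batches are all illustrative, not taken from any particular system.

```python
def control_totals(transactions):
    """Return (record count, sum of amounts) for a batch of transactions."""
    count = 0
    total = 0
    for txn in transactions:
        count += 1
        total += txn["amount"]  # amounts in cents, to avoid floating-point error
    return count, total

# Sender side: the batch as exported from the source system.
sent = [{"id": 1, "amount": 1500}, {"id": 2, "amount": 250}, {"id": 3, "amount": 700}]

# Receiver side: the same batch, but one record was silently dropped in transit.
received = [{"id": 1, "amount": 1500}, {"id": 3, "amount": 700}]

sent_count, sent_total = control_totals(sent)
recv_count, recv_total = control_totals(received)

# A mismatch in either figure means the batch is incomplete.
batch_ok = (sent_count == recv_count) and (sent_total == recv_total)
```

The point is not the arithmetic but the discipline: both sides compute the same figures independently, so a lost record is detected rather than silently absorbed.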

These checks and audits were put in place over a long time. I suspect that each error, when detected, was reviewed and a check was added to prevent such errors, or at least detect them early.

Do we have these checks in our Big Data databases? Is it even possible to build such checks for accountability? Big Data is, by definition, big. Bigger than normal, and bigger than one can conveniently inventory. Big Data can also contain things that are not easily auditable. We have the techniques to check bank accounts, but how can we check non-numeric items such as photographs, tweets, and Facebook posts?
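One partial answer, sketched below under assumptions of my own: even when an item has no meaningful sum, it still has a content hash. A manifest of digests lets us detect a missing or altered photo or tweet without interpreting the content itself. All of the sample data here is illustrative.

```python
import hashlib

def digest(item: bytes) -> str:
    """Content fingerprint of one opaque item (photo, tweet, post)."""
    return hashlib.sha256(item).hexdigest()

def manifest(items):
    """Map each item's content hash to its count (duplicates are legal)."""
    m = {}
    for item in items:
        h = digest(item)
        m[h] = m.get(h, 0) + 1
    return m

stored = [b"photo-1-bytes", b"tweet: hello", b"post: big data"]
loaded = [b"photo-1-bytes", b"post: big data"]  # one item lost along the way

loaded_manifest = manifest(loaded)
missing = {h: n for h, n in manifest(stored).items()
           if loaded_manifest.get(h, 0) < n}
# 'missing' now holds the digest of the lost item
```

This only verifies presence and integrity, not correctness of content, but it is a start toward the kind of inventory the question above asks for.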

On the other hand, there are real risks in losing data, or subsets of data. Incomplete datasets may introduce bias, a problem for sampling and projections. How can you trust your data if you don't have the checks in place?
