Sunday, October 13, 2013

Unstructured data isn't really unstructured

The introduction of NoSQL databases has brought along another concept: unstructured data. Advocates of NoSQL are quick to point out that relational databases are limited to structured data, and NoSQL data stores can handle unstructured (as well as structured) data.

I think that word does not mean what you think it means.

I've seen lots of data, and all of it has been structured. I have yet to meet unstructured data. All data has structure -- the absence of structure implies random data. Even the output of a pseudo-random number generator is structured; it is a series of numeric values.

When people say "structured data", they really mean "a series of objects, each conforming to a specific structure, known in advance". Tables in a relational database certainly follow this rule, as do records in a COBOL data declaration.

NoSQL data stores relax these constraints, but still require that data be structured. If a NoSQL data store uses JSON notation, then each object must be stored in a manner consistent with JSON. The objects in a set may contain different properties, so that one object has a structure quite different from the next object, but each object must be structured.
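A small Python sketch of that point, using only the standard json module. The documents here are invented for illustration and do not come from any particular data store: two objects with nothing in common both parse cleanly, while text that does not follow JSON's rules is rejected.

```python
import json

# Hypothetical documents, made up for this example.
raw_documents = [
    '{"name": "Alice", "email": "alice@example.com"}',
    '{"sku": "A-100", "price": 9.99, "tags": ["new", "sale"]}',
]

# Each document parses into an object, even though the two
# objects share no properties.
objects = [json.loads(text) for text in raw_documents]
field_sets = [sorted(obj.keys()) for obj in objects]


def is_valid_json(text):
    """Return True if the text can be parsed as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Text with no JSON structure cannot be stored as a JSON object.
rejected = is_valid_json("name: Alice, email")
```

The two objects differ from each other, but neither is unstructured: each has to satisfy the JSON grammar before it can be treated as an object at all.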

This notion is not new. While COBOL and FORTRAN were efficient at processing homogeneous records, Pascal allowed for "variant" records (it used a tag field in the record to identify the record type and layout).
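The variant-record idea can be imitated in Python, as a rough sketch. This is not Pascal's actual storage format; the tag values, the layouts, and the function names are all assumptions chosen for the example. A one-byte tag says which layout the rest of the record follows.

```python
import struct

# Hypothetical layouts: the tag byte selects how to read the payload.
LAYOUTS = {
    1: ("circle", struct.Struct("<d")),      # radius
    2: ("rectangle", struct.Struct("<dd")),  # width, height
}


def pack_record(tag, *values):
    """Build a record: one tag byte followed by the payload for that tag."""
    _, layout = LAYOUTS[tag]
    return bytes([tag]) + layout.pack(*values)


def unpack_record(data):
    """Read the tag byte, then decode the payload with the matching layout."""
    tag = data[0]
    name, layout = LAYOUTS[tag]
    return name, layout.unpack(data[1:])
```

For example, unpack_record(pack_record(2, 3.0, 4.0)) gives back the rectangle's name and its two values. Note that every layout is listed in the program itself; a record with an unknown tag cannot be read.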

What is new is that the object layout is not known in advance. Earlier languages and database systems required the design of the data up front. A COBOL program would know about customer records, for example, and a FORTRAN program would know about data observations. The structure of the data was "baked in" to the program. A new type of customer, or a new type of data set, would require a new version of the program or database schema.

NoSQL lets us create new structures without changing the program or schema. We can add new fields and create new objects for storage and processing, without changing the code.
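As a minimal sketch of what "without changing the code" can look like, assume documents arrive as JSON text. The summarize function and the field names below are hypothetical. The function never names a field, so a document with a field nobody planned for is handled by the same code.

```python
import json


def summarize(document_text):
    """Describe a JSON object using whatever fields it happens to contain."""
    obj = json.loads(document_text)
    return ", ".join(f"{key}={obj[key]!r}" for key in sorted(obj))


# The original kind of customer record.
old_style = summarize('{"name": "Alice"}')

# A later record with a new field; summarize() is unchanged.
new_style = summarize('{"name": "Alice", "loyalty_tier": "gold"}')
```

The structure travels with each object instead of being declared in the program, which is why the new field needs no new version of the code.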

So as I see it, it's not that data is unstructured. The idea is that we have reduced the coupling between the data and the program. Data is still structured, but the structure is not part of the code.
