Showing posts with label structured data. Show all posts

Sunday, October 13, 2013

Unstructured data isn't really unstructured

The introduction of NoSQL databases has brought along another concept: unstructured data. Advocates of NoSQL are quick to point out that relational databases are limited to structured data, and NoSQL data stores can handle unstructured (as well as structured) data.

I think that word does not mean what you think it means.

I've seen lots of data, and all of it has been structured. I have yet to meet unstructured data. All data has structure -- the absence of structure implies random data. Even the output of a pseudo-random number generator is structured; it is a series of numeric values.

When people say "structured data", they really mean "a series of objects, each conforming to a specific structure, known in advance". Tables in a relational database certainly follow this rule, as do records in a COBOL data declaration.

NoSQL data stores relax these constraints, but still require that data be structured. If a NoSQL data store uses JSON notation, then each object must be stored in a manner consistent with JSON. The objects in a set may contain different properties, so that one object has a structure quite different from the next object, but each object must be structured.
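As a small illustration (a sketch in Python, with made-up records and field names that belong to no real system), two objects in the same collection can have quite different properties, yet each one must still be well-formed JSON:

```python
import json

# Two hypothetical records in one collection; the field names are invented.
raw_records = [
    '{"name": "Alice", "email": "alice@example.com"}',
    '{"sku": "A-100", "price": 19.95, "tags": ["tools", "sale"]}',
]

# Each record has its own shape, but each one parses cleanly as JSON.
parsed = [json.loads(text) for text in raw_records]
shapes = [sorted(record.keys()) for record in parsed]


def is_structured(text):
    """Return True if the text is well-formed JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

Here `shapes` holds a different list of keys for each record, while `is_structured` rejects text that does not follow the JSON rules at all. The store is flexible about which fields appear, not about whether the data has a structure.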

This notion is not new. While COBOL and FORTRAN were efficient at processing homogeneous records, Pascal allowed for "variant" records (it used a key field at the beginning of the record to identify the record type and layout).
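The variant-record idea can be sketched in Python (this is not Pascal syntax, and the shape types and the `decode` helper are invented for the example). The first field acts as the key that selects the layout of the rest of the record:

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Circle:
    radius: float


@dataclass
class Rectangle:
    width: float
    height: float


# A "variant" is one of several layouts known to the program.
Shape = Union[Circle, Rectangle]


def decode(fields):
    """Build a record from a flat list whose first item is the key field."""
    kind = fields[0]
    if kind == "circle":
        return Circle(radius=fields[1])
    if kind == "rectangle":
        return Rectangle(width=fields[1], height=fields[2])
    raise ValueError(f"unknown record type: {kind}")
```

Note that every layout still has to be listed in the code; a record with an unrecognized key is rejected.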

What is new is that the object layout is not known in advance. Earlier languages and database systems required the design of the data up front. A COBOL program would know about customer records, for example, and a FORTRAN program would know about data observations. The structure of the data was "baked in" to the program. A new type of customer, or a new type of data set, would require a new version of the program or database schema.

NoSQL lets us create new structures without changing the program or schema. We can add new fields and create new objects for storage and processing, without changing the code.
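A minimal sketch of what this looks like in practice, assuming records arrive as JSON objects (the customer fields and the `describe` function are hypothetical): the code below never names a field, so a record with a new property passes through it unchanged.

```python
import json


def describe(record):
    """Summarize any JSON object without knowing its fields in advance."""
    return ", ".join(f"{key}={value!r}" for key, value in sorted(record.items()))


# An older record, and a newer one that carries an extra field.
old_customer = json.loads('{"id": 1, "name": "Alice"}')
new_customer = json.loads('{"id": 2, "name": "Bob", "loyalty_tier": "gold"}')

old_summary = describe(old_customer)
new_summary = describe(new_customer)
```

The `loyalty_tier` field shows up in `new_summary` even though `describe` was written before that field existed.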

So as I see it, it's not that data is unstructured. The idea is that we have reduced the coupling between the data and the program. Data is still structured, but the structure is not part of the code.

Sunday, March 27, 2011

Structure thy data

In the 1980s movie "Labyrinth", there is one scene that shows an image of David Bowie carved in rock. The camera shifts, and you see that the image was actually an illusion. It is not a single rock with the image carved, but three different rocks, each with a piece of the image. When viewed from the exact right angle, the three rocks line up and you see the image. When viewed from a different angle, the image disappears.

The movie "The Incredibles" has a scene that does the same thing, but in reverse. Mr. Incredible, in a cave, is looking at a set of rocks. As he shifts his position, writing carved into the rocks forms the word "Kronos". The word is visible from only one position in the cave.

In both movies, data is visible from a specific position, but not from others. This effect can happen to real-life data too.

For example, document files can contain headings, paragraphs, and tables. The data contained within the document often looks good to a human viewing the document, but is difficult (or perhaps impossible) to read by another computer system.

I'm not talking about file formats, and the difficulties in parsing .DOC and .PDF formats. Difficulties in parsing certain file types do exist, and have been reduced by the use of XML-based formats. But beyond the reading of formats, there is the challenge of extracting the structure of the data.

Documents contain headers, paragraphs, and tables. To interpret and process a document properly, you must identify these different parts. You can identify them only if the information that marks them is present in the file. And with our current tools and techniques, there is no way to ensure that your document has this structure, or to verify that another author's document contains this information.

The "proper" way to create a document with this information (in, say, Microsoft Word) is to use styles to mark different parts of the text. Marking the document heading as "Heading" and section and sub-section headings as "Heading 1" and "Heading 2" places this meta-information in the document.

The "improper" way to create a document is to ignore styles and use low-level font commands to change the appearance of text. One can change the typeface, the weight, and the size of text. For each section heading, one can set specific options for typeface attributes. It's less efficient than using styles, but perhaps easier to understand.

I put the words "proper" and "improper" in quotes because there is no standard for using Microsoft Word (or other word processors). One is free to use styles or the low-level commands. And this is part of the problem. But I will ignore the need for best practices and focus on the aspect of data alignment.

Documents using the "proper" method (styles) contain meta-information that other applications can use to interpret the documents. Documents using the "improper" method offer no certain way to interpret their data. (One can make assumptions based on certain patterns, but that relies on consistency in the application of low-level formatting commands.) The data is unstructured.
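To make the difference concrete, here is a rough sketch in Python. It does not read real word-processor files; the two lists are a hypothetical in-memory model of the same document, once built with styles and once with low-level font commands, and the size threshold in the guessing function is an arbitrary assumption.

```python
# Hypothetical model: each paragraph is a dict. This one uses named styles.
styled_doc = [
    {"text": "Quarterly Report", "style": "Heading 1"},
    {"text": "Sales rose in March.", "style": "Normal"},
    {"text": "Regional Results", "style": "Heading 2"},
]

# The same document built with font commands: no style names at all.
font_doc = [
    {"text": "Quarterly Report", "font": "Arial", "size": 18, "bold": True},
    {"text": "Sales rose in March.", "font": "Times", "size": 11, "bold": False},
    {"text": "Regional Results", "font": "Arial", "size": 14, "bold": True},
]


def headings_by_style(paragraphs):
    """Headings are exactly the paragraphs marked with a heading style."""
    return [p["text"] for p in paragraphs
            if p.get("style", "").startswith("Heading")]


def headings_by_guess(paragraphs, min_size=14):
    """Heuristic: treat large bold text as a heading. This is only a guess."""
    return [p["text"] for p in paragraphs
            if p.get("bold") and p.get("size", 0) >= min_size]
```

With styles, the program simply reads the author's markings. With font commands, it can only guess: `headings_by_style(font_doc)` finds nothing, and `headings_by_guess` works only as long as the author formats every heading the same way and never uses large bold text for anything else.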

Unstructured data is like the rocks in "Labyrinth" and words on rocks in "The Incredibles". From the exact right angle, the image is visible. But from any other angle, the image is not. The unstructured data in a document is visible and sensible to a human looking at the text rendered by the word processor on a screen (or on a printed page), but is not sensible to a computer program.

We have achieved much in the development of computer programs. Not just word processors, but spreadsheets, databases, accounting systems, instant messaging, image manipulation, and many more applications. We are at a point (and possibly past it) where we should no longer tolerate unstructured data. We should use structured data and encourage it when structured options are available.

Data endures, often long beyond the life of the applications that created the data. To rely on the original programs to read the data is foolish. To assume that only humans will read the data is arrogant.