Sunday, March 27, 2011

Structure thy data

In the 1980s movie "Labyrinth", one scene shows an image of David Bowie carved in rock. The camera shifts, and you see that the image was actually an illusion. It is not a single carved rock, but three different rocks, each with a piece of the image. Viewed from exactly the right angle, the three rocks line up and you see the image. Viewed from a different angle, the image disappears.

The movie "The Incredibles" has a scene that does the same thing, but in reverse. Mr. Incredible, in a cave, is looking at a set of rocks. As he shifts his position, writing carved into the rocks forms the word "Kronos". The word is visible from only one position in the cave.

In both movies, data is visible from a specific position, but not from others. This effect can happen to real-life data too.

For example, document files can contain headings, paragraphs, and tables. The data contained within the document often looks good to a human viewing the document, but is difficult (or perhaps impossible) for another computer system to read.

I'm not talking about file formats, and the difficulties in parsing .DOC and .PDF formats. Difficulties in parsing certain file types do exist, and have been reduced by the use of XML-based formats. But beyond the reading of formats, there is the challenge of extracting the structure of the data.

Documents contain headers, paragraphs, and tables. To properly interpret and process a document, you must identify these different parts. To identify them, the information that marks them must be present in the file. And with our current tools and techniques, there is no way to ensure that your document has this structure, or to verify that another author's document contains this information.

The "proper" way to create a document with this information (in, say, Microsoft Word) is to use styles to mark different parts of the text. Marking the document heading as "Heading" and section and sub-section headings as "Heading 1" and "Heading 2" places this meta-information in the document.

The "improper" way to create a document is to ignore styles and use low-level font commands to change the appearance of text. One can change the typeface, the weight, and the size of text. For each section heading, one can set specific options for typeface attributes. It's less efficient than using styles, but perhaps easier to understand.

I put the words "proper" and "improper" in quotes because there is no standard for using Microsoft Word (or other word processors). One is free to use styles or the low-level commands. And this is part of the problem. But I will ignore the need for best practices and focus on the aspect of data alignment.

Documents created with the "proper" method (styles) contain meta-information that other applications can use to interpret them. Documents created with the "improper" method offer no certain way to interpret their data. (One can make assumptions based on certain patterns, but that relies on the low-level formatting commands having been applied consistently.) The data is unstructured.
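To make the difference concrete, here is a small Python sketch. It does not read real word-processor files. The Paragraph class, its fields, and the two helper functions are invented for illustration, and the "bold and at least 14 points" rule is just one guess a program might make.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Paragraph:
    """A hypothetical, simplified paragraph (not a real file format)."""
    text: str
    style: Optional[str] = None  # e.g. "Heading 1"; None when no style was applied
    bold: bool = False
    size_pt: int = 11


def headings_from_styles(paragraphs):
    """Read headings directly from the style names."""
    return [p.text for p in paragraphs
            if p.style and p.style.startswith("Heading")]


def headings_from_appearance(paragraphs, min_size_pt=14):
    """Guess headings from appearance alone: bold and large (an assumed rule)."""
    return [p.text for p in paragraphs
            if p.bold and p.size_pt >= min_size_pt]


# The "proper" document: every paragraph carries a style name.
styled = [
    Paragraph("Introduction", style="Heading 1"),
    Paragraph("Some body text.", style="Normal"),
    Paragraph("Results", style="Heading 1"),
    Paragraph("More body text.", style="Normal"),
]

# The "improper" document: the same idea, formatted by hand.
hand_formatted = [
    Paragraph("Introduction", bold=True, size_pt=16),
    Paragraph("Some body text."),
    Paragraph("Results", bold=True, size_pt=13),  # author chose a smaller size here
    Paragraph("Warning: read this first.", bold=True, size_pt=14),  # emphasis, not a heading
]

print(headings_from_styles(styled))
# ['Introduction', 'Results']
print(headings_from_appearance(hand_formatted))
# ['Introduction', 'Warning: read this first.']
```

With styles, the program finds both headings. With hand formatting, the guess misses "Results" (formatted a little smaller than the other heading) and mistakes an emphasized warning for a heading. A human reading the rendered page would not be fooled by either.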

Unstructured data is like the rocks in "Labyrinth" and words on rocks in "The Incredibles". From the exact right angle, the image is visible. But from any other angle, the image is not. The unstructured data in a document is visible and sensible to a human looking at the text rendered by the word processor on a screen (or on a printed page), but is not sensible to a computer program.

We have achieved much in the development of computer programs. Not just word processors, but spreadsheets, databases, accounting systems, instant messaging, image manipulation, and many more applications. We are at a point (and possibly past it) at which we should no longer tolerate unstructured data. We should use structured data, and encourage it, whenever structured options are available.

Data endures, often long beyond the life of the applications that created the data. To rely on the original programs to read the data is foolish. To assume that only humans will read the data is arrogant.
