Monday, September 21, 2009

Your data are not nails

Data comes in different shapes and sizes, and with different levels of structure. The containers we select for data should respect those shapes and sizes, not force the data into a different form. But all too often, we pick one form and force-fit all types of data into it. The result is data that is hard to understand, because the natural form has been replaced with the imposed form.

This post, for example, is small and has little structure (beyond paragraphs, sentences, and words). The "natural" form is the one you're reading now. Forcing the text into another form, such as XML, would reduce our comprehension of the data. (Unless we converted the text back into "plain" format.)
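To make the point concrete, here is a minimal sketch that forces one sentence of this post into XML. The element names ("sentence", "word") are my own invention for illustration; any imposed schema would show the same thing.

```python
import xml.etree.ElementTree as ET

text = "Data comes in different shapes."

# Hypothetical schema: one <word> element per word inside a <sentence>.
sentence = ET.Element("sentence")
for word in text.rstrip(".").split():
    ET.SubElement(sentence, "word").text = word

as_xml = ET.tostring(sentence, encoding="unicode")

print(text)     # the natural form
print(as_xml)   # the imposed form

# Getting back to something readable means undoing the conversion.
restored = " ".join(w.text for w in ET.fromstring(as_xml)) + "."
print(restored == text)
```

The XML version carries the same words, but a reader has to strip the markup away (mentally or with code) before the sentence is readable again.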

One poor choice that I saw (and later changed) was the selection of XML for build scripts. It was a system that I inherited, one that was used by a development team to perform the compile and packaging steps for a large C++/MFC application. 

The thinking behind the choice of XML was twofold: XML allowed for some structure (it was thought there would be some) and XML was the shiny new thing. (There were some other shiny new things in the system, including Java, a web server, RMI, EJB, and reflection. I eventually got rid of all of the shiny things, and the build system still worked.)

I can't blame the designers for succumbing to XML. Even Microsoft has gone a bit XML-happy with their configuration files for projects in Visual Studio.

It's easy to pick a single form and force all data into that form. It's also comfortable. You know that a single tool (or application) will serve your needs. But anyone who has used both word processors and spreadsheets knows that the form of the data affects how well we understand it.

Some data is structured, some is free-flowing. Some data is large, some is small. Some data consists of one structure repeated many times; other data consists of many items, each with a structure of its own.

For build scripts, we found that text files were the most understandable, most flexible, and most useful form. Scripts are (typically) of moderate size. Converting the XML scripts to text shrank them from 20,000 lines to about 2,200 lines. The smaller scripts were much easier to maintain, and the time for simple changes dropped from weeks to hours. (Most of those hours went to testing. The script changes themselves took minutes.)

Small data sets with little or no structure fit well in text files, or possibly in INI files, which have a little more structure to them.
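As a rough sketch of how little machinery an INI file needs, here is a small set of settings read with Python's standard `configparser` module. The section and key names are made up for illustration; they are not from the build system described above.

```python
import configparser

# Hypothetical settings; the sections and keys are illustrative only.
INI_TEXT = """
[build]
configuration = Release
target = app.exe

[package]
output_dir = dist
"""

def load_settings(text):
    """Parse INI text into a plain dict of dicts."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}

settings = load_settings(INI_TEXT)
print(settings["build"]["configuration"])
```

The file stays readable in any text editor, and the light structure (sections and key/value pairs) is all the parser has to know about.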

Small to medium data sets with heavy structure fit into XML files.

Large data sets with homogeneous items fit well in relational databases.

Large data sets with heterogeneous items fit better into network databases or graph databases. (The "NoSQL" movement can give you information about these databases.)
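The rules of thumb above can be written down as a small lookup. This is only a sketch: the function name, the size and structure labels, and the choice to return None for combinations the rules do not cover are all my own assumptions.

```python
from typing import Optional

def suggest_container(size: str, structure: str = "light",
                      homogeneous: bool = True) -> Optional[str]:
    """Map a rough description of a data set to a container.

    size:        "small", "medium", or "large"   (hypothetical labels)
    structure:   "none", "light", or "heavy"     (hypothetical labels)
    homogeneous: whether all items share one structure
    Returns None when the rules of thumb say nothing about the combination.
    """
    if size == "large":
        if homogeneous:
            return "relational database"
        return "network or graph database"
    if size in ("small", "medium") and structure == "heavy":
        return "XML file"
    if size == "small" and structure in ("none", "light"):
        return "text file (possibly INI)"
    return None  # not covered by the rules of thumb

print(suggest_container("small", "light"))
print(suggest_container("medium", "heavy"))
print(suggest_container("large", homogeneous=False))
```

Treat the result as a starting point for a conversation with your team, not as a verdict.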

Don't think of all data as a set of nails, with your One True Format as the hammer. Use forms that make your team effective. Respect the data, and it will respect you.

