Monday, August 24, 2015

The file format wars are over -- and text won

When I first started with computers, files were simple things. Most of them were source code, and a few of them were executables. The source files (BASIC, FORTRAN, and assembly) were all plain text. The executables were binary, since they contained machine instructions.

That simple world changed with the PC revolution and the plethora of applications that it brought. WordStar used a format that was almost text: ASCII characters, with the end of each word marked by setting the 8th bit of an otherwise regular character. Lotus 1-2-3 used a special file format for its worksheets. dBase II (and dBase III, and dBase IV) used a special format for its data.
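To show how close to text such a format was, here is a minimal sketch in Python. It assumes only what is described above (a regular ASCII character with its 8th bit set at the end of each word) and ignores everything else a real WordStar file contained. The function name and the sample bytes are my own, made up for illustration.

```python
def strip_high_bit(data: bytes) -> str:
    """Clear the 8th bit of every byte and decode the result as ASCII."""
    # 0x7F keeps the low seven bits, turning a marked character
    # back into its plain ASCII form.
    return bytes(b & 0x7F for b in data).decode("ascii")


# Hypothetical sample: "Hello world" with the last letter of each word marked.
# 'o' is 0x6F, so marked it becomes 0xEF; 'd' is 0x64, so marked it becomes 0xE4.
sample = b"Hell\xef worl\xe4"
plain = strip_high_bit(sample)
print(plain)
```

One masking operation recovers readable text, which is why I call the format "almost text". The formats that followed were not so forgiving.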

There was a "Cambrian explosion" of binary formats. Each and every application had its own format. Binary formatted data was smaller to store, easier to parse, and somewhat proprietary. The last was important for the commercial market; once customers had lots of data locked in a proprietary format, they were unwilling to change to a competitor's product.

The conversion from DOS to Windows changed little. Applications kept their proprietary, binary formats.

Yet recently (that is, with the rise of web services and mobile computing) binary formats have declined. The new favorites are text-based formats: XML, JSON, and YAML.

I have seen no new proprietary, binary formats lately. The new formats I have seen are all built on one of the text-based formats. Even Microsoft has changed its Office applications (Word, Excel, PowerPoint, and others) to use an XML-based set of files.

This is a big change. Why did it happen?

I can think of several reasons:

First is the existence of the text formats themselves. In the "age of binary formats", a binary format was simply how one stored data. Everyone did it. Today XML, JSON, and YAML are established and ready to use.

Second is the abundance of storage. With limited storage space, a binary format is smaller and a better fit. With today's available storage that pressure does not exist.

Third is the availability of libraries to parse and construct the text formats. We can easily read and write XML (or JSON, or YAML) with commonly-available, tested, working libraries. A proprietary format requires a new (untested) library.
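A small sketch in Python makes the difference concrete. The record and the binary layout below are hypothetical, invented for this example. The JSON half needs nothing beyond the standard library's json module. The binary half uses the standard struct module, but the layout string is something the reader and the writer must agree on privately.

```python
import json
import struct

# A hypothetical record, used only for illustration.
record = {"id": 7, "name": "widget", "price": 2.5}

# Text format: the standard library handles both directions,
# and the field names travel with the data.
text = json.dumps(record)
restored = json.loads(text)

# Made-up binary layout: little-endian unsigned 32-bit id,
# name padded with NUL bytes to 8 bytes, 64-bit float price.
LAYOUT = "<I8sd"
packed = struct.pack(
    LAYOUT, record["id"], record["name"].encode("ascii"), record["price"]
)

# Reading it back requires knowing LAYOUT and the padding rule.
ident, raw_name, price = struct.unpack(LAYOUT, packed)
unpacked = {
    "id": ident,
    "name": raw_name.rstrip(b"\x00").decode("ascii"),
    "price": price,
}

print(text)
print(len(packed), "bytes in the binary form")
```

The binary form is compact, but anyone who wants to read it needs the layout and the padding convention, and in practice that means writing and testing a new library for each format.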

Fourth is the pressure of legislation. Some countries (and some large companies) have mandated the use of open formats, to prevent the lock-in of proprietary data formats.

All of these are good reasons, yet I think there is another factor.

In the past, a file format served the application program. In the data processing world, our mindsets considered applications to "own" the data, with files being nothing more than a convenient holding space to be used when the application was not running (or when it was processing data from a different file). Programs did not share data -- or on the rare occasions when they did, it was through databases or plain text files.

Today, our mobile device apps share data with cloud-based systems. The cloud-based systems are collections of independent applications performing coordinated work. The nature of mobile/cloud is to share data from one application to another. This sharing between programs (sometimes written in different languages) is easier with standard formats and difficult with proprietary formats.

New systems will be developed with open (text) formats for storage and exchange. That means that our existing systems, the dinosaurs of the processing world with their proprietary formats, will fall out of favor.

I don't expect them to vanish completely. They work, which is an important virtue. Replacing them with a new system (or simply modifying them to use text formats) would be expensive with little apparent return on investment. Yet continuing to use them implies that some amount of data (a significant amount) will be locked within proprietary non-text formats.

Expect calls for people with skills in these file formats.

* * * * *

The recent Supreme Court decision about Java's API (in which the Court decided not to hear an appeal) means that for now APIs and file formats can be considered intellectual property. It may be difficult to reverse-engineer the formats for old systems without the express permission of the vendor. (And if the vendor is out of business or sold to a larger company, it may be very difficult to obtain such permission.)

Companies may want to evaluate the risk of their data formats.
