Sunday, April 17, 2016

After the spreadsheet

The spreadsheet is a wonderful thing. It performs several functions: it holds and organizes data, it specifies calculations, and it presents results. All of these functions, wrapped into a single package, provides convenience to users. Yet that convenience of the single package will be the spreadsheet's undoing.

For the individual, the spreadsheet is a useful tool. But for the enterprise, the spreadsheet creates perhaps more problems than it solves. Since a spreadsheet file contains the data, formulas, and presentation of data, they are often replicated to share with co-workers (usually via e-mail) and duplicated to process different sets of data (the spring sales figures and then the summer sales figures, for example).

The replication of spreadsheets via e-mail can me mitigated by the use of shared file locations ("network drives") and by online versions of spreadsheets which allow for multiple users. But the bigger problem is the duplication of spreadsheets with minor changes.

The duplication of spreadsheet means the duplication of not only the data (which is often changed) but also the duplication of the formulas and the presentation (which often do not change). Since a spreadsheet contains all three components, a new version of data requires a new copy of all components. There is no way to share only one component, no way to share formulas against new data, or different presentations against data and formulas. This means that, over time, an enterprise of any size accumulates multiple spreadsheets with different data and duplicate formulas and macros -- at least you hope that they are duplicate copies.

The design of spreadsheets -- containing the data, formulas, and presentation in one package -- is a holdover from the days of Visicalc and Lotus 1-2-3. Those programs were developed for the Apple II and the IBM PC. With their ability to run only one program at a time, putting everything into one program made sense -- using one program for data, another for calculation, and a third for presentation was awkward and time-consuming. But that applies to the old single-tasking operating systems. Windows and Mac OS and Linux allow for multiple programs to run at the same time, and windowing systems allow for multiple programs to present information to the user at the same time.

If spreadsheets were being invented now in the age of web services and cloud systems and multi-window displays, their design would probably be quite different. Instead of a single program that performed all functions and a single file that contained data, formulas and presentation, we might have something very different. We might create a system of web services, providing data with some and performing calculations with others. The results could be displayed by yet other functions in other windows, possibly published for co-workers to view.

Such a multi-component system would follow the tenets of Unix, which recommends small, independent programs that read data, perform some processing, and provide data. The data and computations could be available via web services. A central service could "fan out" requests to collect data from one or more services, send that data through one or more computing services, and the provide the data to a presentation mechanism such as a graph in a window or even a printed report.

By separating the formulas and macros from the data, we can avoid needless duplication of both. (While most cases see the duplication of formulas to handle different data sets, sometimes different formulas can be applied to the same data.)

Providing data via web services is easy -- web services do that today. There are even web services to convert data into graphs. What about calculations? What language can be used to perform computations on data sets?

The traditional languages of C# and Java are not the best here; we're replacing spreadsheets with something equally usable by non-programmers (or at least similarly usable). The best candidate may be R, the statistical-oriented language. R is established, cross-platform, and capable. It's also a high-level language, close the the formulas of spreadsheets (and more powerful that Microsoft's VBA, which is used for macros in Excel).

Replacing spreadsheets with a trio of data management, computation, and presentation tools will not be easy. The advantages of the spreadsheet include convenience and familiarity. The advantages of separate components are better integration in cloud systems, leveraging of web services, and easier audits of formulas. It may not happen soon, but I think it will happen eventually.

No comments: