Wednesday, September 19, 2012

How to improve spreadsheets

Spreadsheets are the sharks of the IT world. They evolved before the IBM PC (the first spreadsheet ran on an Apple II) and have, for the most part, remain unchanged. They have migrated from the 6502 processor to the 8080, the Z-80, the 8086, and today's set. They have added graphs and fonts and extravagant file formats. They have grown from 256 rows to a rather large number. But they remain engines of data and formulas, with the data in well-defined grids and formulas with well-defined characteristics. They are sharks, evolved to a level of efficiency that has yet to be surpassed.

Spreadsheets get a number of things right:

- The syntax is easy to learn and consistent
- The data types are limited: numeric values, string values, and formulas
- Feedback for changes is immediate (no compiling or test scripts)
- Cells can only read values from other cells; they cannot assign a value to another cell

Immediate feedback is important. Teams using dynamic languages are at risk of a slew of problems; they use automated tests to identify errors. Agile processes emphasize the frequent use of tests. Spreadsheets provide this feedback without the need for tests.

The last item is most important. Cells can assign values to themselves, but they cannot assign values to other cells. This means that spreadsheets have no shared mutable state, no update collisions, no race conditions. That gives them stability.

Yet spreadsheets get a number of things wrong:

- All data is global
- Organization of data is dependent on the composer's skills and discipline

Our current spreadsheets are like the C language: fast, powerful, and dangerous. In C, one can do just about anything with the underlying machine. The C language lets you convert data from one form to another, and point to just about anything. It is quite powerful, but you have to know what you are doing.

Spreadsheets are not that dangerous. They don't have pointers, and the only things you can reference are (type safe because they are all of one type) cells within the spreadsheet.

But spreadsheets have the element of "you have to know what you are doing". The global nature of the data allows for formulas to refer to any cell (initialized or not) with little or no warning about nonsensical operations. In this sense, spreadsheet programming (in formulas) is much like C.

At first, I thought that the concepts of structured programming would improve spreadsheets. This is a false lead. Structured programming organizes code into sequences, iterations, and alternate paths. It clarifies code, but the formulas in spreadsheets are not a Turing-complete programming language. Structured programming can offer little to spreadsheets.

Instead, I think the concept of data encapsulation (from object-oriented programming) may help us advance the spreadsheet.

Spreadsheet authors tend to organize data into ranges. They may provide names for these ranges or leave them unnamed, but they will cluster their data into areas. Subsections of the grid will be used for specific types of data (for example, contact names in one range, regional sales in another).

For small spreadsheets, the "everything on one grid" concept works. Larger spreadsheets can see data split across pages (or tabs, depending on your spreadsheet manufacturer).

The problem with the spreadsheet grid is that it is, up to a point, infinite. We can add data to it without concern for the organization or structure. This becomes a problem over time; after a number of updates and revisions the effort to keep data organized becomes large.

An advanced spreadsheet would recognize that data is not stored in grids but in ranges, and would provide ranges as the key building block. Current spreadsheets let you define ranges, but the ability to operate on ranges is limited. In my new species of spreadsheet, the range would be the organizational unit, and ranges would not be infinite, empty grids. (They could expand, but only as a result of conscious action.)

Ranges are closer to tables in a database. Just as one can define a table and provide data for that table, one could define a range and provide data for that range. Unlike databases, ranges can be easily extended horizontally (more columns), re-sequenced, formatted, and edited. Unlike grids, ranges can be separated or re-combined to build new applications. Ranges must provide for local addresses (within the range) and external addresses (data within other ranges). Formulas must be able to read values from the current range and also from other grids.

If we do it right, ranges will be able to live in the cloud, being called in when needed and stored when not. Ranges will also be members of one application (or multiple applications), serving data to whatever application needs it.

Any improved spreadsheet will have to retain the advantages of the current tools. The immediacy of spreadsheets is a big advantage, allowing users of all skill levels to become proficient in a short time. Changing from grid-based spreadsheets to range-based spreadsheets must allow for this immediacy. This is a function of the UI, something that must be designed carefully.

I think that this new form of spreadsheet it possible, and offers some advantages. Now all I need is some time to implement it.

No comments: