Thursday, July 16, 2020
Low code and no code still require thinking
Sunday, January 22, 2017
The spreadsheet is a dinosaur
Could we improve spreadsheets as we have improved programming languages?
Let's begin by recognizing that improvements are subjective, for both spreadsheets and programming languages. Pascal's adherence to structured programming concepts was lauded as progress by some and decried as oppressive by others. Users of spreadsheets are probably just as opinionated as programmers, so let's avoid the term "improvement" and instead focus on "rigor": Can we improve the rigor of spreadsheets, and assume that improved rigor is accepted as a good thing?
Here are some possible ways to add rigor to spreadsheets:
No forward references Current spreadsheets allow for formulas to reference any cell in the sheet. A formula may use values that are calculated "later" in the sheet, below or to the right. Spreadsheets are relatively clever at determining the proper sequence of calculation, so this is not necessarily a problem. It can be, if a sequence of calculations is self-referencing or "cyclic". Spreadsheets also have logic to identify cyclic calculations, but the work of fixing them is left to the human.
Removing forward references prevents cyclic calculations. By removing forward references, we limit the cells which can be used by a formula. Instead of using any cell, a formula may use only cells above and to the left. (Thus the top left cell may contain a value but not a formula.) With such limits in place, any formula can use only those items that have already been defined, and none of those items can use the current formula.
Not everyone may want to consider the top left corner the "origin". We could allow for each sheet to have an "origin corner" (top left, top, right, bottom left, or bottom right) and require formulas to use cells in the direction of the origin.
Smaller sheets Current spreadsheets allow for large numbers of rows and columns. Large spreadsheets were nice before they could be linked together. Once spreadsheets could be linked, the need for very large sheets evaporated. (Although we humans still too often think that bigger is better.) Smaller sheets force one to organize data. I once worked with a spreadsheet that allowed 52 columns and 128 rows per sheet. At first it was difficult, but with time I learned to work within the restrictions, and my sheets had better structure. Also, it was easier to find and resolve errors.
No absolute coordinates Absolute coordinates, as opposed to relative coordinates, are a hack for the original spreadsheets. They are useful when replicating a formula across multiple cells, and you want to override the default behavior of adjusting cell references.
Instead of absolute coordinates, I find it better to use a named range. (Even for a single cell.) The effect on calculations is the same, and the name of the range provides better information to the reviewer of the spreadsheet.
No coordinates in formulas Extending the last idea, force the use of named ranges for all calculations. (Perhaps this is the programmer in me, familiar with variable names.) Don't use cell references ("A4" or "C15") but require a range name for every source to the formula.
Better auditing The auditing capabilities of Excel are nice, but I find them frustrating and difficult to use. Microsoft chose a visual method for auditing, and I would like an extraction of all formulas for analysis.
Import and export controls on sheets This is an expansion of the "no forward references". It is easy to retrieve values from other sheets, perhaps too easy. One can set of cyclic dependencies across sheets, with sheets mutually dependent on their calculations. Specifying the values that may be retrieved from a spreadsheet (similar to an "export" declaration in some languages) limits the values the values exposed and forces the author to think about each export.
Of course, it would be easy to simply export everything. This avoids thinking and making decisions. To discourage this behavior, we would need a cost mechanism, some penalty for each exposed value. The more values you expose, the more you have to pay. (Rather than a dollar penalty, it may be a quality rating on the spreadsheet.)
None of these changes come for free. All of these changes have the potential to break existing spreadsheets. Yet I think we will see some movement towards them. We rely on spreadsheets for critical calculations, and we need confidence that the computations are correct. Improved rigor builds that confidence.
Thursday, June 2, 2016
The big improvement in programming forty years ago
Programming has been around since the beginning of computers, and seen lots of improvements: symbolic assembly, high-level compilers (COBOL and FORTRAN), structured programming (Pascal), object-oriented programming (Smalltalk, C++), virtual machines (Java, C#), scripting languages (Perl, Python, Ruby)... the list goes on.
Yet a significant improvement in programming occurred forty years ago. It made programming simple -- so simple that a non-programmer could do it. And it was ignored by the programming community.
That improvement was... the electronic spreadsheet.
Programming, at its core, is the organization of data and the processing of that data with a sequence of instructions. The niceties of data structures, objects, and just-in-time compilation are just that: niceties. They are there for the convenience of the programmer.
So how do spreadsheets come into it? Spreadsheets, at their core, organize data and process that data with a series of instructions. (Sound familiar?)
Spreadsheets -- the basic grid of numbers and formulas, without the charts, pivot tables, and VBA code -- are programs. Any spreadsheet can be converted into just about any language, from Fortran or BASIC to Java or Python. (The reverse is not true; only a few simple programs in BASIC or Python can be converted into spreadsheets.)
The improvement that spreadsheets made to programming was immediacy. The "programmer" could see the results of a change right after making a change. That immediate feedback was not available in compiled languages, which require the programmer to save the file, compile the program, and then run it. (IDEs like Turbo Pascal and Visual Studio make those steps easy, but there is still a delay.) Even interpreted languages like BASIC or Ruby require the steps of saving and running.
This improvement in programming, the immediate results of a change in the program, went unnoticed by the programming community. Visicalc was created in 1979, almost forty years ago. At the time, popular programming languages were BASIC, COBOL, Fortran, and Pascal.
Instead of building on the innovation of the spreadsheet, programmers have gone in other directions. Programmers focused on maintainability (structured programming), larger programs (object-oriented programming), version control, automated testing, and response to changing requirements (agile methods).
There has been no (or very little) effort for the immediate feedback that we get with spreadsheets.
For forty years.
At some point, we are going to invent a new programming language, one that provides immediate feedback. (Perhaps a language, editor, and run-time environment, which is what a spreadsheet is.) The advantages are great, as anyone who works with a spreadsheet can attest.
Sunday, April 17, 2016
After the spreadsheet
For the individual, the spreadsheet is a useful tool. But for the enterprise, the spreadsheet creates perhaps more problems than it solves. Since a spreadsheet file contains the data, formulas, and presentation of data, they are often replicated to share with co-workers (usually via e-mail) and duplicated to process different sets of data (the spring sales figures and then the summer sales figures, for example).
The replication of spreadsheets via e-mail can me mitigated by the use of shared file locations ("network drives") and by online versions of spreadsheets which allow for multiple users. But the bigger problem is the duplication of spreadsheets with minor changes.
The duplication of spreadsheet means the duplication of not only the data (which is often changed) but also the duplication of the formulas and the presentation (which often do not change). Since a spreadsheet contains all three components, a new version of data requires a new copy of all components. There is no way to share only one component, no way to share formulas against new data, or different presentations against data and formulas. This means that, over time, an enterprise of any size accumulates multiple spreadsheets with different data and duplicate formulas and macros -- at least you hope that they are duplicate copies.
The design of spreadsheets -- containing the data, formulas, and presentation in one package -- is a holdover from the days of Visicalc and Lotus 1-2-3. Those programs were developed for the Apple II and the IBM PC. With their ability to run only one program at a time, putting everything into one program made sense -- using one program for data, another for calculation, and a third for presentation was awkward and time-consuming. But that applies to the old single-tasking operating systems. Windows and Mac OS and Linux allow for multiple programs to run at the same time, and windowing systems allow for multiple programs to present information to the user at the same time.
If spreadsheets were being invented now in the age of web services and cloud systems and multi-window displays, their design would probably be quite different. Instead of a single program that performed all functions and a single file that contained data, formulas and presentation, we might have something very different. We might create a system of web services, providing data with some and performing calculations with others. The results could be displayed by yet other functions in other windows, possibly published for co-workers to view.
Such a multi-component system would follow the tenets of Unix, which recommends small, independent programs that read data, perform some processing, and provide data. The data and computations could be available via web services. A central service could "fan out" requests to collect data from one or more services, send that data through one or more computing services, and the provide the data to a presentation mechanism such as a graph in a window or even a printed report.
By separating the formulas and macros from the data, we can avoid needless duplication of both. (While most cases see the duplication of formulas to handle different data sets, sometimes different formulas can be applied to the same data.)
Providing data via web services is easy -- web services do that today. There are even web services to convert data into graphs. What about calculations? What language can be used to perform computations on data sets?
The traditional languages of C# and Java are not the best here; we're replacing spreadsheets with something equally usable by non-programmers (or at least similarly usable). The best candidate may be R, the statistical-oriented language. R is established, cross-platform, and capable. It's also a high-level language, close the the formulas of spreadsheets (and more powerful that Microsoft's VBA, which is used for macros in Excel).
Replacing spreadsheets with a trio of data management, computation, and presentation tools will not be easy. The advantages of the spreadsheet include convenience and familiarity. The advantages of separate components are better integration in cloud systems, leveraging of web services, and easier audits of formulas. It may not happen soon, but I think it will happen eventually.
Wednesday, May 13, 2015
The other shadow IT
In the good old days before the open source movement and when software had to be purchased, it was easy to control purchases of software: the Purchasing Department would verify all purchase requests with the IT Department; any unauthorized requests would be refused.
Even with open source and "free" software, the IT Department could set policies on individual PCs and prevent people from installing software. IT remained the gatekeepers of software.
With cloud computing, those controls can be bypassed. Software can now be used in the browser. Point your browser at a web site, register, supply a credit card, and you're ready to go. No Purchasing Department, no administrator rights. This is the situation that most people associate with the term "shadow IT".
Yet there is another technology that is used without the knowledge of the IT Department. A technology that has been used by many people, to perform many tasks within the organization. Programs have been written, databases have been designed, and systems have been implemented without the involvement of the IT Department. Worse, these systems have been used in production without extensive testing (perhaps no testing), have not been audited, and have no backups or disaster recover plans.
I'm talking about spreadsheets.
Specifically, Microsoft Excel spreadsheets.
Microsoft Excel is a standard among corporate computing. Just about every "business" PC (as opposed to "developer" PC or "sysadmin" PC) runs Windows and has Excel. The technology is available, often mandated by the IT Department as a standard configuration.
Millions of people have access to Excel, and they use it. And why not? Excel is powerful, flexible, and useful. There are tutorials for it. There are web pages with hints and tips. Microsoft has made it easy to use. There is little work needed to use Excel to perform calculations and store data. One can even connect Excel to external data sources (to avoid re-typing data) and program Excel with macros in VBA.
Excel, in other words, is a system platform complete with programming language. It is used by millions of people in thousands (hundreds of thousands?) of organizations. Some small businesses may run completely on Excel. Larger business may run on a combination of "properly" designed and supported systems and Excel.
This is the other shadow IT. The spreadsheets used by people to perform mundane (or perhaps not-so-mundane) tasks. The queries to corporate databases. The programs in VBA that advertise themselves as "macros". All operating without IT's knowledge or support.
Comparing two programming languages is difficult. Different languages have different capabilities and different amounts of programming "power". One line of COBOL can do the work of many lines of assembly. Ten lines of Python can do more work than ten lines of Java.
I suspect that if we could compare Excel to the corporate-approved languages of C# and Java, we would find that there is more Excel code that corporate-approved code. That is a lot of code! It means that Excel is the "dark matter" of the IT universe: existing but not observed. (I realize that this amount is speculation. We have no measurements for Excel code.)
Excel is the shadow technology to watch. Don't ignore file-sharing and browser-based apps; they are risks too. But keep an eye on the technology we already have and use.
Monday, July 14, 2014
Spreadsheets can help us learn functional programming
First, by "spreadsheets" I mean the cell grid and its formulas. I am omitting Visual Basic for Applications (VBA) code which can accompany Microsoft Excel sheets.
Spreadsheets as a programming environment are capable and flexible. They let one assemble a set of data and formulas into a meaningful arrangement. They let you format the data. They provide immediate feedback, with the results of changes displayed immediately.
Spreadsheets also violate a lot of the generally accepted principals of program design. They mix input, data, calculation, and output. They have no mechanisms for structuring calculations or encapsulating data. They have no way to isolate data; everything is "global" and any cell can be used by any other cell.
The lack of structural elements means that spreadsheets tend to "scale up" poorly. A small set of data is easily handled. A somewhat larger set of data (if it is the same type of data) is also manageable. A larger collection of different types of data becomes a challenge. Even with multi-page spreadsheets, one starts allocating regions of a sheet for certain data and certain calculations. These regions become problematic as they grow -- especially if they grow at different rates.
There is no way to condense similar calculations. If ten cells (or one hundred cells) all perform the same operation, they must all contain the same formula. Internally, the spreadsheet may optimize memory usage, but from the "programmer's" point of view, the formulas are repeated. If the general formula must change, it must change in all the cells. (While it is easy to change the formula in one cell and then replicate it to the other cells, it is not always easy to identify which other cells use that formula.)
Spreadsheets offer nothing in the way of a high-level view. Everything is viewed at the cell level: to examine a formula, you must look at the specific cell that contains the formula.
So spreadsheets offer power and immediate feedback, two important aspects of programming. Yet they lack the concepts of structured programming (subroutines, control blocks) and the concepts of object-oriented programming (custom types, encapsulation, inheritance).
With all of these omissions, how can spreadsheets be a good way to learn the next programming style?
The answer is functions.
The next wave of programming (as I see it) is functional programming. With functional programming, one defines and uses functions, and functions are first-class constructs of the language. Functions can be passed as arguments to other functions. They can be constructed by functions, and evaluated by functions. The change from object-oriented programming to functional programming is as large (and maybe larger) than the change from structured programming to object-oriented programming.
Spreadsheets can help us learn functional programming because spreadsheets (the core, non-VBA version of spreadsheets) are all about functions. Every cell contains the result of a function. Once a cell's value is defined, it does not change. (Changing cells in the spreadsheet and pressing the "recalc" button is, in essence, modifying the program an re-executing it.)
Now, the comparison is not complete. Functional programming lets you pass functions as arguments to other functions and lets you build functions "on the fly", and spreadsheets let you do neither. So designing a spreadsheet is not the same as programming in a functional language.
But programming spreadsheets is a start. It is a jumping-off point. It is an introduction to some of the concepts of functional programming.
If you want to learn functional programming, perhaps a good place to start is with your local spreadsheet. Turn off (or ignore) the VBA or macro programming. Stick with cells, values, and functions. Avoid the "optimize" or "search for result" capabilities. Design spreadsheets that compute things that are easy in "real" programming languages. You may be stuck at first, given the constraints of spreadsheet calculations. But keep at it. You will learn techniques that can help you with the next wave of programming.
Thursday, April 18, 2013
Excel is the new BASIC
Some have quipped that "those whom the gods would destroy... they first teach BASIC".
COBOL may be disparaged, but only to a limited extent. People, deep down, know that COBOL is running many useful system. (Things like banking, airline reservations, and perhaps most importantly payroll.) COBOL does work, and we respect it.
BASIC, on the other hand, tried to be useful but never really made it. Despite Microsoft's attempt with its MBASIC product, Digital Research with its CBASIC compiler, Digital Equipment Corporation with its various implementations of BASIC, and others, BASIC was always second place to other programming languages. For microcomputers, those languages were assembly language, Pascal, and C.
(I'm limiting this to the interpreter BASIC, the precursor to Visual Basic. Microsoft's Visual Basic was capable and popular. It was used for many serious applications, some of which are probably still running today.)
BASIC's challenge was its design. It was a language for learning the concepts of programming, not building large, serious programs. The name itself confirms this: Beginner's All-purpose Symbolic Instruction Code.
More than the name, the constructs of the programming language are geared for small programs. This is due to the purpose of BASIC (a better FORTRAN for casual users) and the timing of BASIC (the nascent "structured programming" movement had yet to prove itself).
Without the constructs structured programming ("while" loops and "if/then/else" statements), programmers must either build their programs with structured concepts made of smaller elements, or build unstructured programs. BASIC allows you to build structured programs, but provides no assistance. Worse, BASIC relies on GOTO to build most control flows.
In contrast, modern programming languages such as Java, C#, Python, and Ruby provide the constructs for structured programming and don't offer the GOTO statement.
The people who learned to program in BASIC (and I am one of them) learned to program poorly, and we have paid a heavy price for it.
But what does this have to do with Microsoft Excel?
Excel is the application taught to people for managing data. (Microsoft Word is suitable for documents, and Powerpoint is suitable for presentations, but Excel is *the* application for data. I suspect more people know and use Excel than all of the people using Word, Powerpoint, and Access.)
Excel offers the same undisciplined approach to applications. Spreadsheets contain data and formulas (and VBA macros, but I will ignore those for now).
One might argue that Excel is a spreadsheet, different from a programming language such as BASIC. Yet the differences are small. Excel, with its formulas alone, is a programming system if not a language.
The design of Excel (and other spreadsheets, going back to Visicalc) provides no support for structure or discipline. Formulas can collect data from anywhere in the spreadsheet. There is no GOTO keyword, but one can easily build a tangled mess.
Microsoft Excel is the new BASIC: useful, popular, and undisciplined. Worse than BASIC, since Excel is the premier tool for manipulating data. BASIC, for all of its flaws, was always second to some other language.
In one way, Excel is not as bad as BASIC. Formulas may collect data from any location in the spreadsheet, but they (for the most part) modify only their own contents. This provides a small amount of order to spreadsheet-programs.
We need a new paradigm for data management. Just as programming had its "structured programming" movement which lead to the use of constructs that improved the reliability and readability of programs, spreadsheets need a new approach to the organization of data and the types of formulas that can be used on that data.
Wednesday, September 19, 2012
How to improve spreadsheets
Spreadsheets get a number of things right:
- The syntax is easy to learn and consistent
- The data types are limited: numeric values, string values, and formulas
- Feedback for changes is immediate (no compiling or test scripts)
- Cells can only read values from other cells; they cannot assign a value to another cell
Immediate feedback is important. Teams using dynamic languages are at risk of a slew of problems; they use automated tests to identify errors. Agile processes emphasize the frequent use of tests. Spreadsheets provide this feedback without the need for tests.
The last item is most important. Cells can assign values to themselves, but they cannot assign values to other cells. This means that spreadsheets have no shared mutable state, no update collisions, no race conditions. That gives them stability.
Yet spreadsheets get a number of things wrong:
- All data is global
- Organization of data is dependent on the composer's skills and discipline
Our current spreadsheets are like the C language: fast, powerful, and dangerous. In C, one can do just about anything with the underlying machine. The C language lets you convert data from one form to another, and point to just about anything. It is quite powerful, but you have to know what you are doing.
Spreadsheets are not that dangerous. They don't have pointers, and the only things you can reference are (type safe because they are all of one type) cells within the spreadsheet.
But spreadsheets have the element of "you have to know what you are doing". The global nature of the data allows for formulas to refer to any cell (initialized or not) with little or no warning about nonsensical operations. In this sense, spreadsheet programming (in formulas) is much like C.
At first, I thought that the concepts of structured programming would improve spreadsheets. This is a false lead. Structured programming organizes code into sequences, iterations, and alternate paths. It clarifies code, but the formulas in spreadsheets are not a Turing-complete programming language. Structured programming can offer little to spreadsheets.
Instead, I think the concept of data encapsulation (from object-oriented programming) may help us advance the spreadsheet.
Spreadsheet authors tend to organize data into ranges. They may provide names for these ranges or leave them unnamed, but they will cluster their data into areas. Subsections of the grid will be used for specific types of data (for example, contact names in one range, regional sales in another).
For small spreadsheets, the "everything on one grid" concept works. Larger spreadsheets can see data split across pages (or tabs, depending on your spreadsheet manufacturer).
The problem with the spreadsheet grid is that it is, up to a point, infinite. We can add data to it without concern for the organization or structure. This becomes a problem over time; after a number of updates and revisions the effort to keep data organized becomes large.
An advanced spreadsheet would recognize that data is not stored in grids but in ranges, and would provide ranges as the key building block. Current spreadsheets let you define ranges, but the ability to operate on ranges is limited. In my new species of spreadsheet, the range would be the organizational unit, and ranges would not be infinite, empty grids. (They could expand, but only as a result of conscious action.)
Ranges are closer to tables in a database. Just as one can define a table and provide data for that table, one could define a range and provide data for that range. Unlike databases, ranges can be easily extended horizontally (more columns), re-sequenced, formatted, and edited. Unlike grids, ranges can be separated or re-combined to build new applications. Ranges must provide for local addresses (within the range) and external addresses (data within other ranges). Formulas must be able to read values from the current range and also from other grids.
If we do it right, ranges will be able to live in the cloud, being called in when needed and stored when not. Ranges will also be members of one application (or multiple applications), serving data to whatever application needs it.
Any improved spreadsheet will have to retain the advantages of the current tools. The immediacy of spreadsheets is a big advantage, allowing users of all skill levels to become proficient in a short time. Changing from grid-based spreadsheets to range-based spreadsheets must allow for this immediacy. This is a function of the UI, something that must be designed carefully.
I think that this new form of spreadsheet it possible, and offers some advantages. Now all I need is some time to implement it.
Sunday, March 11, 2012
The post-spreadsheet world
Google has done impressive work with their on-line documents and spreadsheets. I have yet to see Microsoft's on-line offerings, so I will not comment on them. But I can make some predictions.
The tablet and smartphone revolution moves us into a new realm of processing. This new model of processing builds apps from small, connected services and shares data. I think that the collision of tablets and spreadsheets will give us new tools.
Spreadsheets, at their core, are scriptable data processors. They store their data in a two-dimensional format (or three-dimensional format, if you consider multiple sheets to be a dimension). The scripts can be simple formulas, or they can be programs (in Microsoft programs they are written in VBA, in Open Office they are in Java). The ability to apply simple scripts (formulas) is what gives spreadsheets their power.
I expect that in the new world of tablets we will develop small, connectable, scriptable data processors. These processors will work with small sets of data, presenting it to users with smaller screens and also letting users change the data. They will also let users create and run (and share) scripts. And most importantly, they will connect to other data processors -- probably through web services. People will not build spreadsheets but their own custom apps, plugging together these data processors.
Add version control, identity management, and access controls (based on identity), and you will be able to build enterprise-class apps.
We may keep spreadsheets, although I expect them to change. Once mission-critical data is in the cloud, we will extend spreadsheets to pull that data and merge it into a two-dimensional grid. Enthusiastic folks may build real-time updates, bi-directional updates, round-tripping, and collaboration for multiple spreadsheet users. The spreadsheet will become a client of the data processors in the cloud.
In this scenario, Alice may be working on some figures on her tablet as she commutes to the office (she rides in a carpool) while Bob reviews those same figures in the office in his spreadsheet. No one has the master spreadsheet, no one has to worry about getting the latest version.