Monday, March 27, 2017

The provenance of variables

Just about every programming language has the concept of a 'variable', a container for a value. Variables are so named because their contents can vary -- as opposed to constants. The statement

a = 10

assigns the variable 'a' the value 10, denoted in the program as a constant.

(Some languages allow the values of constants to be changed, so one can assign a new value to a constant. This leads to unusual results and is often considered a defect. But I digress.)

The nice thing about variables is that they can vary. The problem with variables is that they vary.

More specifically, when examining a program (say, with a debugger), one can see the contents of a variable but not how that value came to be. Was it assigned? Was the variable incremented? Did the value come from a constant, or was it computed? When was it assigned?

Here is an idea: retain the source of values. Modify the notion of a variable so that, instead of being a simple container for a value, it holds the value plus additional information about where that value came from.

For example, a small program:

file f = open("filename")
a = f.read()
b = f.read()
c = (a - b) / 100

Let's assume that the file contains the text "20 4", which is the number 20 followed by a space and then the number 4, all in text format.

In today's programming languages, the variables a, b, and c contain values, and nothing else. The variable 'a' contains 20, the variable 'b' contains 4, and the variable 'c' contains 0.16. Yet they hold no information about how those values were derived.

For a small program such as this example, we can easily look at the code and identify the source of the values. But larger programs are a different story. They are often complex, and the source of a value is not obvious.

With provenance, the variables a, b, and c still contain values, and in addition contain information about those values.

The variable 'a' contains the value 20 and also the value 'filename', since that file was the source of the value. It could hold still more information about the file, such as a creation date, a version number (for filesystems that support version numbers), and the position within the file. It could even hold the line number of the assignment, giving the programmer easy access to the source code.

The variable 'b' contains similar information.

The variable 'c' contains information about the variables 'a' and 'b', along with their states at the time of assignment. Consider the revised program:

file f = open("filename")
a = f.read()
b = f.read()
c = (a - b) / 100
... more code
a = 0
b = 1
... more code
d = c * 20

In this program, the variable 'd' is assigned the value 3.2 (assuming that the same file is read), and at that assignment the variable 'c' holds information about 'a' and 'b' with their initial values of 20 and 4, not their current values of 0 and 1. Thus, a developer can examine the assignment to 'd' and understand how the value of 'c' was derived.
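
Here is a minimal sketch of how such values might behave, written in Python. The class name 'Traced', its fields, and the 'history' method are inventions for this example, not an existing library. Each value carries a note about its origin, and each arithmetic operation snapshots its operands as they are at that moment, so later reassignments of 'a' and 'b' do not disturb what 'c' remembers.

import inspect

class Traced:
    """A value plus a record of how it was produced (a sketch, not a library)."""
    def __init__(self, value, source, operands=()):
        self.value = value            # the ordinary value
        self.source = source          # e.g. "first field of file 'filename'"
        self.operands = operands      # snapshots of the operands at this moment
        self.line = inspect.stack()[1].lineno   # line on which this Traced object was constructed

    def __sub__(self, other):
        return Traced(self.value - other.value, "subtraction", (self, other))

    def __truediv__(self, divisor):
        return Traced(self.value / divisor, "division by " + str(divisor), (self,))

    def __mul__(self, factor):
        return Traced(self.value * factor, "multiplication by " + str(factor), (self,))

    def history(self, depth=0):
        # Print this value, its source, and (indented) the operands it was built from.
        print("  " * depth + f"{self.value}  <- {self.source} (line {self.line})")
        for operand in self.operands:
            operand.history(depth + 1)

# The example above, with the file's contents ("20 4") supplied inline
# so that the sketch is self-contained.
a = Traced(20, "first field of file 'filename'")
b = Traced(4, "second field of file 'filename'")
c = (a - b) / 100

a = Traced(0, "constant 0")    # later reassignments, as in the revised program
b = Traced(1, "constant 1")
d = c * 20

d.history()   # 3.2, derived from c, which still records a = 20 and b = 4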

In addition to developers, provenance may be useful for financial auditors and examiners. Anyone who cares about the origins of a specific value will find provenance helpful.

Astute readers will already be thinking of the memory requirements for such a scheme. Retaining provenance requires memory -- a lot of memory. A simple variable holding an integer requires four bytes (on many modern systems). With provenance, a 'simple' integer would require the four bytes for the value plus as many bytes as needed to hold its history. Instead of four bytes, it may require 40, or 400.

Clearly, provenance is not free. It costs memory. It also costs time. Yet the benefits, I think, are clear. So, how to implement it? Some ideas:

- Provenance is needed only when debugging, not during production. Enable it as part of the normal debug information and remove it for 'release' mode. (A small sketch of this idea, together with the next one, follows the list.)
- Provenance can be applied selectively, to a few variables and not to others.
- Provenance can be implemented selectively. Perhaps one needs only a few pieces of information, such as the line number of the assignment. Less information requires less memory.
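
Here is one way the first two ideas might look in Python; the flag name, the 'Traced' class, and the 'traced' helper are all invented for this sketch, not an existing mechanism. When the debug flag is off, the helper simply returns the bare value, so a release build pays nothing.

import os

# Hypothetical debug switch: set TRACE_PROVENANCE=1 in the environment for debug runs.
TRACE_PROVENANCE = os.environ.get("TRACE_PROVENANCE") == "1"

class Traced:
    # Same shape as the earlier sketch: a value plus a note about its origin.
    def __init__(self, value, source):
        self.value = value
        self.source = source

def traced(value, source):
    # In 'release' mode this is effectively a no-op: the bare value is returned,
    # so there is no extra memory and almost no extra time.
    return Traced(value, source) if TRACE_PROVENANCE else value

a = traced(20, "first field of file 'filename'")   # selectively traced
n = 7                                               # ordinary variable, never traced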

Our computing capacity continues to grow. Processor capabilities, memory size, and storage size are all increasing faster than program size. That is, our computers are getting bigger, and they are getting bigger faster than our programs are. All of that 'extra' space should do something for us, right?

Tuesday, March 14, 2017

To fragment or not fragment, that is the question

First there were punch cards, and they were good. They were a nice, neat representation of data. One record on one card -- what could be easier?

Except that record sizes were limited to 80 bytes. And if you dropped a stack, the cards got out of sequence.

Then there were magtapes, and they were good too. Better than cards, because record sizes could be larger than 80 bytes. Also, if you dropped a tape, the data stayed in sequence. But much like cards, data on a magtape was simply a series of records.

At first, there was one "file" on a tape: you started at the beginning, you read the records until the "end-of-file" mark, and you stopped. Later, we figured out that a single tape could hold multiple files, one after the other.

Except that files were always contiguous data. They could not be expanded on a single tape, since the expanded file would write over a portion of the next file. (Also, reading and writing to the same tape was not possible on many systems.)

So we invented magnetic disks and magnetic drums, and they were good too. Magtapes permitted only sequential access, which meant reading the entire file and processing it in order. Disks and drums allowed direct access, which meant you could jump to a position in the file, read or write a record, and then jump somewhere else in the file. We eventually moved away from drums and stayed with disks, for a number of reasons.

Early disks allocated space much like tapes: a disk could contain several files but data for each file was contiguous. Programmers and system operators had to manage disk space, allocating space for files in advance. Like files on magtapes, files on disks were contiguous and could not be expanded, as the expansion would write over the next file.

And then we invented filesystems. (On DEC systems, they were called "directory structures".) Filesystems managed disk space, which meant that programmers and operators didn't have to.

Filesystems store files not as one long stretch of disk space but as collections of blocks, each block holding a number of bytes. Blocks added to a file could come from any area of the disk, not necessarily adjacent (or even close) to the original set of blocks. By adding or removing blocks, files could grow or shrink as necessary. The dynamic allocation of disk space was great!

Except that files were not contiguous.
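
A toy illustration in Python of how that happens; the block size, the free-block list, and the 'append' helper are invented for the example and do not correspond to any real filesystem. Two files written alternately end up interleaved on the 'disk', so a file that grows picks up blocks that are not adjacent to its original ones.

BLOCK_SIZE = 4                  # bytes per block (a toy value)
disk = [None] * 16              # a 'disk' of 16 blocks
free_blocks = list(range(16))   # free-block list maintained by the 'filesystem'

def append(file_blocks, data):
    # Grab free blocks as needed; they need not be adjacent to the file's existing blocks.
    for i in range(0, len(data), BLOCK_SIZE):
        block = free_blocks.pop(0)
        disk[block] = data[i:i + BLOCK_SIZE]
        file_blocks.append(block)

file_a, file_b = [], []
append(file_a, b"AAAAAAAA")     # file A occupies blocks 0 and 1
append(file_b, b"BBBB")         # file B occupies block 2
append(file_a, b"AAAA")         # file A grows into block 3

print(file_a)                   # [0, 1, 3] -- the file grew, but its blocks are no longer contiguous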

When processing a file sequentially, it is faster to access a contiguous file than a non-contiguous file. Each block of data follows its predecessor, so the disk's read/write heads move little. For a non-contiguous file, with blocks of data scattered about the disk, the read/write heads must move from track to track to read each set of blocks. The action of moving the read/write heads takes time, and is therefore considered expensive.

Veteran PC users may remember utility programs which had the specific purpose of defragmenting a disk. They were popular in the 1990s.

Now, Windows defragments disks as an internal task. No third-party software is needed. No action by the user is needed.

To review: We started with punch cards, which were contiguous. Then we moved to magtapes, and files were still contiguous. Then we switched to disks, at first with contiguous files and then with non-contiguous files.

Then we created utility programs to make the non-contiguous files contiguous again.

Now we have SSDs (Solid-State Disks), which are really large chunks of memory with extra logic to hold values when the power is off. But they are still memory, and the cost of non-contiguous data is low. There are no read/write heads to move across a platter (indeed, there is no platter).

So the effort that Windows expends to defragment files on an SSD is not buying us better performance. It may be costing us: the "defrag" process consumes CPU time and writes to the SSD, and SSDs can endure only a limited number of write operations in their lifespan.

So now, perhaps, we're going back to non-contiguous.

Tennis, anyone?