Tuesday, March 14, 2017

To fragment or not fragment, that is the question

First there were punch cards, and they were good. They were a nice, neat representation of data. One record on one card -- what could be easier?

Except that record sizes were limited to 80 bytes. And if you dropped a stack, the cards got out of sequence.

Then there were magtapes, and they were good too. Better than cards, because record sizes could be larger than 80 bytes. Also, if you dropped a tape, the data stayed in sequence. But magtapes were quite similar to cards in one respect: data on a magtape was simply a series of records.

At first, there was one "file" on a tape: you started at the beginning, you read the records until the "end-of-file" mark, and you stopped. Later, we figured out that a single tape could hold multiple files, one after the other.

Except that files were always contiguous data. They could not be expanded on a single tape, since the expanded file would write over a portion of the next file. (Also, reading and writing to the same tape was not possible on many systems.)

So we invented magnetic disks and magnetic drums, and they were good too. Magtapes permitted sequential access, which meant reading the entire file and processing it. Disks and drums allowed for direct access, which meant you could jump to a position in the file, read or write a record, and then jump somewhere else in the file. We eventually moved away from drums and stayed with disks, for a number of reasons.

Early disks allocated space much like tapes: a disk could contain several files but data for each file was contiguous. Programmers and system operators had to manage disk space, allocating space for files in advance. Like files on magtapes, files on disks were contiguous and could not be expanded, as the expansion would write over the next file.

And then we invented filesystems. (On DEC systems, they were called "directory structures".) Filesystems managed disk space, which meant that programmers and operators didn't have to.

Filesystems store files not as one long sequence of disk space but as collections of blocks, each block holding a number of bytes. Blocks added to a file could come from any area of the disk, not necessarily adjacent (or even close) to the original set of blocks. By adding or removing blocks, files could grow or shrink as necessary. The dynamic allocation of disk space was great!
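
To make that concrete, here is a minimal sketch (in Python, with invented names) of the idea: a file is an ordered list of block numbers, and a new block comes from wherever the free list happens to point. Real filesystems use inodes, extents, and allocation bitmaps, but the principle is the same.

    # Sketch only: block numbers and names are invented for illustration.
    free_blocks = [7, 3, 12, 5, 9]    # blocks available anywhere on the disk

    class File:
        def __init__(self, name):
            self.name = name
            self.blocks = []          # ordered list of block numbers

        def grow(self):
            # Take whatever block is free -- not necessarily adjacent
            # to the blocks the file already owns.
            self.blocks.append(free_blocks.pop(0))

    f = File("report.dat")
    f.grow(); f.grow(); f.grow()
    print(f.blocks)                   # [7, 3, 12] -- non-contiguous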

Except that files were not contiguous.

When processing a file sequentially, it is faster to access a contiguous file than a non-contiguous file. Each block of data follows its predecessor, so the disk's read/write heads move little. For a non-contiguous file, with blocks of data scattered about the disk, the read/write heads must move from track to track to read each set of blocks. The action of moving the read/write heads takes time, and is therefore considered expensive.
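
Some back-of-envelope arithmetic shows the size of the penalty. The numbers below are illustrative, not measured: assume an average seek of 10 milliseconds and a file of 1,000 blocks.

    avg_seek_ms = 10                  # assumed seek plus rotational delay
    blocks      = 1000

    contiguous  = 1 * avg_seek_ms     # one seek, then stream the data
    fragmented  = blocks * avg_seek_ms  # worst case: one seek per block

    print(contiguous, "ms vs", fragmented, "ms")   # 10 ms vs 10000 ms

A thousandfold difference in the worst case. Real workloads fall somewhere in between, but the direction is clear.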

Veteran PC users may remember utility programs which had the specific purpose of defragmenting a disk. They were popular in the 1990s.

Now, Windows defragments disks as an internal task. No third-party software is needed. No action by the user is needed.

To review: We started with punch cards, which were contiguous. Then we moved to magtapes, and files were still contiguous. Then we switched to disks, at first with contiguous files and then with non-contiguous files.

Then we created utility programs to make the non-contiguous files contiguous again.

Now we have SSDs (solid-state drives), which are really large chunks of memory with extra logic to hold values when the power is off. But they are still memory, and the cost of non-contiguous data is low. There are no read/write heads to move across a platter (indeed, there is no platter).

So the effort expended by Windows to defragment files (on an SSD) is not buying us better performance. It may be costing us, as the "defrag" process does consume CPU and does write to the SSD, and SSDs have a limited number of write operations in their lifespan.

So now, perhaps, we're going back to non-contiguous.

Tennis, anyone?

Thursday, February 23, 2017

The (possibly horrifying) killer app for AI

The original (and so far only) "killer app" was the spreadsheet. The specific spreadsheet was VisiCalc (or Lotus 1-2-3, depending on who you ask) and it was the compelling reason to get a personal computer.

We may see a killer app for AI, and from a completely unexpected direction: performance reviews.

Employee performance reviews, in large companies, often work as follows: each employee is rated on a number of items, frequently on a scale of 1 to 5 and sometimes as "meets expectations" or "needs improvement". Items range from meeting budgets and delivery dates to soft skills such as communication and leadership.

HR works to ensure that performance reviews are administered fairly, which means as consistently as possible, which often means "one size fits all". Everyone in the organization, from the entry-level developer to the vice president of accounting, has the same performance review form and topics. This leads to developers being rated on "meeting budgets" and vice presidents of accounting being rated on "meeting delivery dates".

Just about everyone fears and dislikes the process. Employees dread the annual (or semiannual) review. Managers take no joy in it either.

This is where AI may be attractive.

Instead of a human-driven process, a company may look for an AI-driven process. The human-administered process is rife with potential for inconsistencies (including favoritism) and opens the company to lawsuits. Instead of expending effort to enforce consistent criteria, HR may choose to implement AI for performance reviews. (Managers may have little say in the decision, and many may be secretly relieved at such a change.)

This is a possibly horrifying concept. The mere idea of a computer (which is what AI is, at bottom) rating and ranking employees may be unwelcome among the ranks. The fear of "computer overlords" from the 1960s is still with us, and I suspect few companies would want to be the first to implement such a system.

I recognize that such a system cannot work in a vacuum. It would need input, starting with a list of job responsibilities, assigned tasks and deadlines, and status reports. Early versions will most likely get many things wrong. Over time, I expect they will improve.

Should we move to AI for performance reviews, I have some observations.

First, AI performance review systems may move outside of companies. Just as payroll processing is often outsourced, performance review systems might be outsourced too. The driver is risk avoidance: a company that builds its own performance review AI may build in subtle discrimination against women or minorities. An external supplier would have to warrant that its system conforms to anti-discrimination laws -- a benefit to the client company.

Second, automating performance reviews could mean more frequent reviews, and more frequent feedback to employees. The choice of annual as the frequency for performance reviews is driven, I suspect, by two factors. First, reviews are needed to justify changes in compensation. Second, they are expensive to administer. The former mandates at least one review per year; the latter discourages anything more frequent.

But automating performance reviews should reduce effort and cost. Or at least reduce the marginal cost for reviews beyond the annual review.

Another result of more frequent performance reviews? More frequent information to management about the state of their workforce.

In sum, AI offers a way to reduce cost and risk in performance reviews. It also offers more frequent feedback to employees and more frequent information to management. I see advantages to the use of AI for this despised task.

Now all we need to do is bell the cat.

Sunday, February 12, 2017

Databases, containers, and Clarke's first law

A blog post by a (self-admitted) beginner engineer rants about databases inside of containers. The author lays out the case against using databases inside containers, pointing out potential problems from security to configuration time to the problems of holding state within a container. The argument is intense and passionate, although a bit difficult for me to follow. (That, I believe, is due to my limited knowledge of databases and my even more limited knowledge of containers.)

I believe he raises questions which should be answered before one uses databases in containers. So in one sense, I think he is right.

In a larger sense, I believe he is wrong.

For that opinion, I refer to Clarke's first law, which states: "When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong."

I suspect that it applies to sysadmins and IT engineers just as much as it does to scientists. I also suspect that age has rather little effect. Our case is one of a not-elderly not-scientist claiming that databases inside of containers are impossible, or at least a Bad Idea and Will Lead Only To Suffering.

My view is that containers are useful, and databases are useful, and many in the IT field will want to use databases inside of containers. Not just run programs that access databases on some other (non-containerized) server, but host the database within a container.

Not only will people want to use databases in containers, there will be enough pressure and enough interested people that they will make it happen. If our current database technology does not work well with containers, then engineers will modify containers and databases to make them work. The result will be, quite possibly, different from what we have today. Tomorrow's database may look and act differently from today's databases. (Just as today's phones look and act differently from phones of a decade ago.)

Utility is one of the driving features of technology. Containers have it, so they will be around for a while. Databases have it (they've had it for decades) and they will be around for a while. One or both may change to work with the other.

We'll still call them databases, though. The term is useful, too.

Monday, February 6, 2017

Software development and economics

One of the delights of working in the IT field is that it interacts with so many other fields. On one side are the "user" areas: user interface design, user experience, and don't forget accessibility and Section 508 compliance. On the other side are hardware, networking, latency, CPU design, caching and cache invalidation, power consumption, power dissipation, and photolithography.

And then there is economics. Not the economics of buying a new server, or the economics of cloud computing, but "real" economics, the kind used to analyze nations.

Keynesian economics, in short, says that during an economic downturn the government should spend money even if it means accumulating debt. By spending, the government keeps the economy going and speeds the recovery. Once the economy has recovered, the government reduces spending and pays down the debt.

Thus, Keynesian economics posits two stages: one in which the government accumulates debt and one in which the government reduces debt. A "normal" economy will shift from recession to boom (and back), and the government should shift from debt accumulation to debt payment (and back).

It strikes me that this two-cycle approach to fiscal policy is much like Agile development.

The normal view of Agile development is the introduction of small changes, prioritized and reviewed by stakeholders, and tested with automated means. Yet a different view of Agile shows that it is much like Keynesian economics.

If the code corresponds to the economy, and the development team corresponds to the government, then we can build an analogy. The code shifts from an acceptable state to an unacceptable state, due to a new requirement that is not yet met. In response, the development team implements the new requirement, but does so in a way that incurs technical debt. (The code is messy and needs to be refactored.)

But since the requirement has been implemented, the code is now in an acceptable state. (That is, the recession is over and the economy has recovered.) At this point, the development team must pay down the debt, by improving the code.
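
A tiny sketch of the two cycles, with invented names and in Python for brevity: first the hurried version that takes on debt, then the repayment.

    # Cycle one: spend. Ship the new requirement quickly, incurring debt.
    def price_with_discount(price, customer):
        if customer == "ACME":        # special case pasted in under deadline
            return price * 0.9
        if customer == "Globex":      # and another one
            return price * 0.85
        return price

    # Cycle two: repay. Refactor once the requirement is met and
    # the code is back in an acceptable state.
    DISCOUNTS = {"ACME": 0.9, "Globex": 0.85}

    def price_with_discount(price, customer):
        return price * DISCOUNTS.get(customer, 1.0)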

The two-cycle operation of "code and refactor" matches the economic version of "spend and repay".

The economists have it easy, however. Economic downturns occur, but economic recoveries provide a buffer time between them. Development teams must face stakeholders who, once they have a working system, too often demand additional changes immediately. There is no natural "boom time" to allow the developers to refactor the code. Only strong management can enforce a delay to allow for refactoring.

Saturday, January 28, 2017

A multitude of virtual machines

In IT, the term "virtual machine" has multiple meanings. We use the term to identify pretend servers with pretend disk space and pretend devices, all hosted on a real (or "physical") server. Cloud computing and even plain old (non-cloud) data centers have multiple instances of virtual machines.

We also use the term to identify the pretend processors used by various programming languages. Java has its JVM, C# has the .NET CLR, and other languages have their own imaginary processors. It is an old concept, made popular by Java in the mid-1990s but going back to the 1980s with the UCSD p-System and even into the 1960s.

The two types of virtual machines are complementary. The former duplicates the hardware (usually for a PC) and provides virtual instances of everything in a computer: disk, graphics card, network card, USB and serial ports, and even a floppy disk (if you want). The one thing it doesn't virtualize is the processor; the hypervisor (the program controlling the virtual machines) relies on the physical processor.

The latter is a fictitious processor (with a fictitious instruction set) that is emulated by software on the physical processor. It has no associated hardware, and the term "virtual processor" might have been a better choice. (I have no hope of changing the name now, but I will use the term for this essay.)
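
To show how little magic is involved, here is a toy virtual processor in Python: a made-up instruction set, interpreted entirely in software on the physical processor. The JVM and CLR are (vastly) more elaborate versions of the same idea.

    # A toy stack-based virtual processor. The instruction set is
    # invented for illustration; it exists in no real product.
    def run(program):
        stack = []
        for op, *args in program:
            if op == "push":
                stack.append(args[0])
            elif op == "add":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "print":
                print(stack.pop())

    run([("push", 2), ("push", 3), ("add",), ("print",)])   # prints 5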

It is the virtual processor that interests me. Or rather, it is the number of virtual processors that exist today.

We are blessed (or cursed) with a large number of virtual processors. Oracle's Java uses one called "JVM". Microsoft uses one called "CLR" (for "common language runtime"). Perl uses a virtual processor (two, actually; one for Perl 5 and a different one for Perl 6). Python uses a virtual processor. Ruby, Erlang, PHP, and Javascript all use virtual processors.

We are awash in virtual processors. It seems that each language has its own, but that's not true. The languages Groovy, Scala, Clojure, Kotlin, JRuby, and Jython all run on JVM. Microsoft's CLR runs C#, F#, VB.NET, IronPython, and IronRuby. Even BEAM, the virtual processor for Erlang, supports "joxa", "lfe", "efene", "elixir", "eml", and others.

I will point out that not every language uses a virtual processor. C, C++, Go, and Swift all produce executable code. Their code runs on the real processor. While more efficient, an executable is bound to the processor instruction set, and you must recompile to run on a different processor.

But back to virtual processors. We have a large number of virtual processors. And I have to think: "We've been here before".

The PC world long ago settled on the Intel x86 architecture. Before it did, we had a number of processors, from Intel (the 8080, 8085, 8086, and 8088), Zilog (the Z-80), Motorola (the 6800, 6808, and 6809), and MOS (the 6502).

The mainframe world saw many processors, before the rise of the IBM System/360 processor. Its derivatives are now the standard for mainframes.

Will we converge on a single architecture for virtual processors? I see no reason for such convergence in the commercial languages. Oracle and Microsoft have nothing to gain by adopting the other's technology. Indeed, one using the other would make them beholden to the competition for improvements and corrections.

The open source community is different, and may see convergence. An independent project, providing support for open source languages, may be possible. It may also make sense, allowing the language maintainers to focus on their language-specific features and remove the burden of maintaining the virtual processor. An important factor in such a common virtual processor is the interaction between the language and the virtual processor.

Open source has separated and consolidated other items. Sometimes we settle on a single solution, sometimes several. The kernel settled on Linux. The windowing system settled on X, with desktop environments such as KDE built on top. The file system and the compiler back end have followed similar paths.

Why not the virtual processor?

Sunday, January 22, 2017

The spreadsheet is a dinosaur

Spreadsheets are dinosaurs. Or more specifically, our current notion of a spreadsheet is a dinosaur, a relic from a previous age.

It's not that spreadsheets have not changed. They have changed over the years, mostly by accumulation. Features have been added, but the core concepts have remained the same.

The original spreadsheet was VisiCalc, written for the Apple II in the late 1970s. And while spreadsheets have expanded their capacity and added charts and fonts and database connections, the original concept -- a grid of values and formulas -- has not changed. If we had a time machine, we could pluck a random VisiCalc user out of 1979, whisk him to 2017, put him in front of a computer running the latest version of Excel, and he would know what to do. (Aside, perhaps, from the mouse or the touchscreen.)

Spreadsheets are quite the contrast to programming languages and IDEs, which have evolved in that same period. Programming languages have acquired discipline. IDEs have improved editing, syntax highlighting, and debugging. The development process has shifted from "waterfall" to "agile" methods.

Could we improve spreadsheets as we have improved programming languages?

Let's begin by recognizing that improvements are subjective, for both spreadsheets and programming languages. Pascal's adherence to structured programming concepts was lauded as progress by some and decried as oppressive by others. Users of spreadsheets are probably just as opinionated as programmers, so let's avoid the term "improvement" and instead focus on "rigor": Can we improve the rigor of spreadsheets, and assume that improved rigor is accepted as a good thing?

Here are some possible ways to add rigor to spreadsheets:

No forward references Current spreadsheets allow for formulas to reference any cell in the sheet. A formula may use values that are calculated "later" in the sheet, below or to the right. Spreadsheets are relatively clever at determining the proper sequence of calculation, so this is not necessarily a problem. It can be, if a sequence of calculations is self-referencing or "cyclic". Spreadsheets also have logic to identify cyclic calculations, but the work of fixing them is left to the human.

Removing forward references prevents cyclic calculations. By removing forward references, we limit the cells which can be used by a formula. Instead of using any cell, a formula may use only cells above and to the left. (Thus the top left cell may contain a value but not a formula.) With such limits in place, any formula can use only those items that have already been defined, and none of those items can use the current formula.

Not everyone may want to consider the top left corner the "origin". We could allow each sheet to have an "origin corner" (top left, top right, bottom left, or bottom right) and require formulas to use cells in the direction of the origin.
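
Here is a minimal sketch of such a check, in Python, assuming a row-major reading of "above and to the left" (one of several possible orderings). Because every reference must point to a cell strictly earlier in a fixed order, no chain of references can ever loop back on itself.

    # Sketch only: cells are (row, column) pairs; the rule is one
    # interpretation of "above and to the left".
    def allowed(formula_cell, referenced_cell):
        fr, fc = formula_cell      # where the formula lives
        rr, rc = referenced_cell   # the cell it wants to read
        return rr < fr or (rr == fr and rc < fc)

    print(allowed((5, 3), (2, 7)))   # True: referenced cell is above
    print(allowed((5, 3), (5, 1)))   # True: same row, to the left
    print(allowed((5, 3), (8, 1)))   # False: a forward reference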

Smaller sheets Current spreadsheets allow for large numbers of rows and columns. Large spreadsheets were nice before they could be linked together. Once spreadsheets could be linked, the need for very large sheets evaporated. (Although we humans still too often think that bigger is better.) Smaller sheets force one to organize data. I once worked with a spreadsheet that allowed 52 columns and 128 rows per sheet. At first it was difficult, but with time I learned to work within the restrictions, and my sheets had better structure. Also, it was easier to find and resolve errors.

No absolute coordinates Absolute coordinates, as opposed to relative coordinates, are a hack left over from the original spreadsheets. They are useful when replicating a formula across multiple cells and you want to override the default behavior of adjusting cell references.

Instead of absolute coordinates, I find it better to use a named range. (Even for a single cell.) The effect on calculations is the same, and the name of the range provides better information to the reviewer of the spreadsheet: "=B2*TaxRate" says more than "=B2*$A$1".

No coordinates in formulas Extending the last idea, force the use of named ranges for all calculations. (Perhaps this is the programmer in me, familiar with variable names.) Don't use cell references ("A4" or "C15") but require a range name for every source to the formula.

Better auditing The auditing capabilities of Excel are a start, but I find them frustrating and difficult to use. Microsoft chose a visual method for auditing; I would prefer a textual extraction of all formulas for analysis.

Import and export controls on sheets This is an expansion of the "no forward references" idea. It is easy to retrieve values from other sheets, perhaps too easy. One can set up cyclic dependencies across sheets, with sheets mutually dependent on each other's calculations. Specifying the values that may be retrieved from a sheet (similar to an "export" declaration in some languages) limits the values exposed and forces the author to think about each export.

Of course, it would be easy to simply export everything. This avoids thinking and making decisions. To discourage this behavior, we would need a cost mechanism, some penalty for each exposed value. The more values you expose, the more you have to pay. (Rather than a dollar penalty, it may be a quality rating on the spreadsheet.)
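
A minimal sketch of what export declarations might look like, with invented sheet and range names: each sheet lists the named ranges it exposes, and any cross-sheet reference is checked against that list.

    # Sketch only: sheet names and range names are invented.
    exports = {
        "Revenue":  {"total", "q4_total"},
        "Expenses": {"total"},
    }

    def can_reference(source_sheet, range_name):
        return range_name in exports.get(source_sheet, set())

    print(can_reference("Revenue", "total"))      # True: exported
    print(can_reference("Expenses", "salaries"))  # False: not exported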

None of these changes come for free. All of these changes have the potential to break existing spreadsheets. Yet I think we will see some movement towards them. We rely on spreadsheets for critical calculations, and we need confidence that the computations are correct. Improved rigor builds that confidence.

We may not see a demand for rigor immediately. It may take a significant failure, or a number of failures, before managers and executives demand more from spreadsheet users. When they do, spreadsheet users will demand more from spreadsheets.

Monday, January 16, 2017

Discipline in programming

Programming has changed over the years. We've created new languages and added features to existing languages. Old languages that many consider obsolete are still in use, and still changing. (COBOL and C++ are two examples.)

Looking at individual changes, it is difficult to see a general pattern. But stepping back and getting a broader view, we can see that the major changes have increased discipline and rigor.

The first major change was the use of high-level languages in place of assembly language. Using high-level languages provided some degree of portability across different hardware (one could, theoretically, run the same FORTRAN program on IBM, Honeywell, and Burroughs mainframes). It meant a distant relationship with the hardware and a reliance on the compiler writers.

The next change was structured programming. It changed our notions of flow control, using "while", "if/then/else", and "for" structures, and it discouraged the use of "goto".

Then we adopted relational databases, separate from the application program. This required using an API (later standardized as SQL) rather than accessing data directly, and it required thought and planning for the database.

Relational databases forced us to organize data stored on disk. Object-oriented programming forced us to organize data in memory. We needed object models and, for very large projects, separate teams to manage the models.

Each of these changes added discipline to programming. The shift to compilers required reliable compilers and reliable vendors to support them. Structured programming applied rigor to the sequence of computation. Relational databases applied rigor to the organization of data stored outside of memory, that is, on disk. Object-oriented programming applied rigor to the organization of data stored in memory.

I should note that each of these changes was opposed. Each had naysayers, usually basing their arguments on performance. And to be fair, the initial implementation of each change did have lower performance than the old way. Yet each change had a group of advocates (I call them "the Pascal crowd", after the early devotees of that language) who pushed for the change. Eventually, the new methods were improved and accepted.

The overall trend is towards rigor and discipline. In other words, the Pascal crowd has consistently won the debates.

Which is why, when looking ahead, I think future changes will keep moving in the direction of rigor and discipline. There may be minor deviations from this path, with new languages introducing undisciplined concepts, but I suspect that they will languish. The successful languages will require more thought, more planning, and prevent more "dangerous" operations.

Functional programming is promising. It applies rigor to the state of our program. Functional programming languages use immutable objects, which, once created, cannot be changed. As the state of the program is the sum of the states of all its variables, functional programming demands that more thought be given to the state of our system. That fits the overall trend.
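
A small illustration, in Python (which merely permits this style, where functional languages enforce it): an immutable record is never updated in place; every "change" produces a new value.

    from dataclasses import dataclass, replace

    @dataclass(frozen=True)    # frozen: assignment after creation is an error
    class Account:
        owner: str
        balance: int

    a = Account("Alice", 100)
    b = replace(a, balance=150)   # a new object; 'a' is unchanged
    # a.balance = 150             # would raise FrozenInstanceError
    print(a.balance, b.balance)   # 100 150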

So I expect that functional languages, like structured languages and object-oriented languages, will be gradually adopted and their style will be accepted as normal. And I expect more changes, all in the direction of improved rigor and discipline.