Tuesday, March 27, 2018

A mainframe technique for improving web app performance

I've been working on a modern web application, making changes to improve its performance. For one improvement, I used an old technique from the days of mainframe processing.

I call it the "master file update" algorithm, although others may use different names.

It was commonly used when mainframe computers read data from tapes (or even punch cards). The program's job was to update a master file (say, account numbers and balances) with a file of transactions (each with an account number and a transaction amount). The master file could hold bank accounts, or insurance policies, or something similar.

Why did mainframes use this technique? In short, they had to. The master file was stored on a magnetic tape, and transactions were on another tape (or perhaps punch cards). Both files were sequential-access files, and the algorithm read each file's records in sequence, writing a new master file along the way. Sequential access is the only way to read a file on magtape.

How did I use this technique? Well, I wasn't reading magnetic tapes, but I was working with two sets of data, one a "master" set of values and the other a set of "transactions". The original (slow) algorithm stored both sets of data in dictionary structures and required access by key values. Each access required a lookup into the dictionary. While the access was handled by the data structure's class, it still required work to find the correct value for each update.
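To make that concrete, here is a minimal sketch of the dictionary-based approach in Python. The names and data shapes are my own, invented for illustration; the real application was more involved.

    # Hypothetical sketch of the original approach: the master data lives in a
    # dictionary keyed by account number, and every transaction performs a
    # key lookup to find the record to update.
    def apply_transactions_by_key(master, transactions):
        """master: dict mapping account_number -> balance
        transactions: iterable of (account_number, amount) pairs"""
        for account_number, amount in transactions:
            master[account_number] += amount  # one lookup per transaction
        return master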

The revised algorithm followed the mainframe technique: store the values in lists, not dictionaries, and make a single pass through the two sets of data. Starting with the first master value and the first transaction value, walk forward through the master values until the keys match. When the keys match, update the value and advance to the next transaction value. Repeat until you reach the end of both lists.
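In Python, a sketch of that single pass might look like the code below. It assumes both lists are already sorted by account number and that every transaction has a matching master record; those are assumptions of this illustration, and real code would want to check (or handle) them.

    # Hypothetical sketch of the "master file update" pass: both lists are
    # sorted by key, and we walk the master list forward, never backward.
    def apply_transactions_merge(master, transactions):
        """master: list of [account_number, balance] pairs, sorted by account number
        transactions: list of (account_number, amount) pairs, sorted the same way"""
        m = 0
        for account_number, amount in transactions:
            # advance through the master records until the keys match
            while master[m][0] != account_number:
                m += 1
            # update in place; do not advance, because the next transaction
            # may apply to the same account
            master[m][1] += amount
        return master

The classic tape version wrote a brand-new master file as it went; updating the list in place is the in-memory equivalent. The point is the single forward pass: no per-transaction lookup into a keyed structure, just a walk through two sorted sequences.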

Dictionaries are good for "random access" to data. When you have to read (or update) a handful of values, and your updates are in no particular order, a dictionary structure is useful. There is a cost, though: the dictionary has to find the item you want to read or update. Our situation was such that the cost of those lookups was affecting our performance.

Lists are simpler structures than dictionaries. They don't have to search for the data; you move to an item in the list and read its value. The result is that read and write operations are faster. The change for a single operation was small; multiplied by the number of operations, it added up to a modest, yet noticeable, improvement.

Fortunately for me, the data was easily converted to lists, and it was in the right sequence. The revised code was faster, which was the goal. And the code was still readable, so there was no downside to the change.

The lesson for the day? Algorithms are usually not dependent on technology. The "master file update" algorithm is not a "mainframe technique", suitable only for mainframe applications. It works quite well in web apps. Don't assume that you must use only "web app" algorithms for web apps, and only "cloud app" algorithms for cloud apps. Such assumptions can blind you to good solutions. Keep your options open, and use the best algorithm for the job.

Tuesday, March 13, 2018

Programming languages and character sets

Programming languages are similar, but not identical. Even "common" things such as expressions can be represented differently in different languages.

FORTRAN used the sequence ".LT." for "less than", normally (today) indicated by the sign <, and ".GT." for "greater than", normally the sign >. Why? Because in the early days of computing, programs were written on punch cards, and punch cards used a small set of characters (uppercase letters, digits, and a few punctuation symbols). The signs for "greater than" and "less than" were not part of that character set, so the language designers had to make do with what was available.

BASIC used parentheses to denote both function arguments and variable subscripts. Nowadays, most languages use square brackets for subscripts. Why did BASIC use parentheses? Because most BASIC programs were written on Teletype machines, large mechanical printing terminals with a limited character set. And -- you guessed it -- the square bracket characters were not part of that set.

When C was invented, we were moving from Teletypes to paperless video terminals. These new terminals supported the entire ASCII character set, including lowercase letters and all of the punctuation available on today's US keyboards. Thus, C used just about every symbol available, lowercase letters and punctuation included.

Today we use modern equipment to write programs. Just about all of our equipment supports UNICODE. The programming languages we create today use... the ASCII character set.

Oh, programming languages allow string literals and identifiers with non-ASCII characters, but none of our languages require the use of a non-ASCII character. No language makes you declare a lambda function with the character λ, for example.

Why? I would think that programmers would like to use the characters in the larger UNICODE set. The larger character set allows for:
  • Greek letters for variable names
  • Multiplication (×) and division (÷) symbols
  • Distinct characters to denote templates and generics

C++ chose to denote templates with the less-than and greater-than symbols. The decision was somewhat forced, as C++ lives in the ASCII world. Java and C# have followed that convention, although it's not clear that they had to. Yet the decision has its costs; tokenizing source code is much harder when symbols hold multiple meanings. (In early C++, for example, the closing brackets of a nested template such as vector<vector<int>> had to be written with a space between them, or the compiler would read ">>" as the right-shift operator.) Java and C# could have used the double-angle brackets (« and ») to denote generics.

I'm not recommending that we use the entire UNICODE set. Some glyphs (such as 'a') have more than one code point assigned (the Latin 'a' and the Cyrillic 'a', for example), and having multiple code points that appear the same is, in my view, asking for trouble. Identifiers and names which appear (to the human eye) to be the same would be considered different by the compiler.

But I am curious as to why we have settled on ASCII as the character set for languages.

Maybe it's not the character set. Maybe it's the equipment. Maybe programmers (and more specifically, programming language designers) use US keyboards. When looking for characters to represent some idea, our eyes fall upon our keyboards, which present the ASCII set of characters. Maybe it is just easier to use ASCII characters -- and then allow UNICODE later.

If that's true (that our keyboards guide our language design), then I don't expect languages to expand beyond ASCII until keyboards do. And I don't expect keyboards to expand beyond ASCII just for programmers. I expect programmers to keep using the same keyboards that the general computing population uses. In the US, that means ASCII keyboards. In other countries, we will continue to see ASCII with accented characters, Cyrillic layouts, and special keyboards for East Asian languages. I see no reason for a UNICODE-based keyboard.

If our language shapes our thoughts, then our keyboard shapes our languages.

Tuesday, March 6, 2018

My Technology is Old

I will admit it... I'm a sucker for old hardware.

For most of my work, I have a ten-year-old generic tower PC, a non-touch (and non-glare) 22-inch display, and a genuine IBM Model M keyboard.

The keyboard (a Model M13, to be precise) is the old-style "clicky" keyboard with a built-in TrackPoint nub that emulates a mouse. It is, by far, the most comfortable keyboard I have used. It's also durable -- at least thirty years old and still going, even after lots of pounding. I love the shape of the keys, the long key travel (almost 4 mm), and the loud clicky sound on each keypress. (Officemates are not so fond of that last feature.)

For other work, I use a relatively recent HP laptop. It also has a non-glare screen. The keyboard is better than most laptop keyboards these days, with some travel and a fairly standard layout.

I prefer non-glare displays to the high-gloss touch displays. The high-gloss displays are quite good as mirrors, and reflect everything, especially lamps and windows. The reflections are distracting; non-glare displays prevent such disturbances.

I use an old HP 5200C flatbed scanner. Windows no longer recognizes it as a device. Fortunately, Linux does recognize it and lets me scan documents without problems.

A third workstation is an Apple PowerBook G4. The PowerBook is the predecessor to the MacBook. It has a PowerPC processor, perhaps 1 GB of RAM (I haven't checked in a while), and a 40 GB disk. As a laptop, it is quite heavy, weighing more than 5 pounds. Some of the weight is in the battery, but a lot is in the aluminum case, the display, and the circuits and components. The battery still works, and provides several hours of power. It holds up better than my old MacBook, which has a battery that lasts for less than two hours. The PowerBook also has a nicer keyboard, with individually shaped keys as opposed to the MacBook's flat keycaps.

Why do I use such old hardware? The answer is easy: the old hardware works, and in some ways is better than new hardware.

I prefer the sculpted keys of the IBM Model M keyboard and the PowerBook G4 keyboard. Modern systems have flat, non-sculpted keys. They look nice, but I buy keyboards for my fingers, not my eyes.

I prefer the non-glare screens. Modern systems provide touchscreens. I don't need to touch my displays; my work is with older, non-touch interfaces. A touchscreen is unnecessary, and it brings the distracting high-glare finish with it. I buy displays for my eyes, not my fingers.

Which is not to say that my old hardware is without problems. The PowerBook is so old that modern Linux distros can run only in text mode. This is not a problem, as I have several projects which live in the text world. (But at some point soon, Linux distros will drop support for the PowerPC architecture, and then I will be stuck.)

Could I replace all of this old hardware with shiny new hardware? Of course. Would the new hardware run more reliably? Probably (although the old hardware is fairly reliable). But those are minor points. The main question is: Would the new hardware help me be more productive?

After careful consideration, I have to admit that, for me and my work, new hardware would *not* improve my productivity. It would not make me type faster, or write better software, or think more clearly.

So, for me, new hardware can wait. The old stuff is doing the job.