Tuesday, May 8, 2018

Refactor when you need it

The development cycle for Agile and TDD is simple:
  • Define a new requirement
  • Write a test for that requirement
  • Run the test (and see that it fails)
  • Change the code to make the test pass
  • Run the test (and see that it passes)
  • Refactor the code to make it clean
  • Run the test again (and see that it still passes)
Notice that refactor step near the end? That is what keeps your code clean. It lets you write a messy solution quickly, because you know you will clean it up before moving on.

A working solution gives you a good understanding of the requirement and its effect on the code. With that understanding, you can then improve the code, making it clear for other programmers. The test keeps your revised solutions correct -- if a cleanup change breaks a test, you have to fix the code.
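As a minimal sketch of one trip around that loop (in Python, with a pytest-style test; the Order class and the sales-tax requirement are invented here purely for illustration):

    # The test for the new requirement, written first. It fails at this point,
    # because Order knows nothing about sales tax.
    def test_total_includes_sales_tax():
        order = Order(subtotal_cents=10000, tax_rate=0.06)
        assert order.total_cents() == 10600

    # The quick change that makes the test pass might compute the tax inline in
    # total_cents(). The refactor step then extracts it into a named method,
    # and the tests are run again to confirm they still pass.
    class Order:
        def __init__(self, subtotal_cents, tax_rate):
            self.subtotal_cents = subtotal_cents
            self.tax_rate = tax_rate

        def sales_tax_cents(self):
            return round(self.subtotal_cents * self.tax_rate)

        def total_cents(self):
            return self.subtotal_cents + self.sales_tax_cents()

The messy version and the tidied version satisfy the same test; the refactor step changes only how the code says it.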

But refactoring is not limited to after a change. You can refactor before a change.

Why would you do that? Why would you refactor before making any changes? After all, if your code is clean, it doesn't need to be refactored. It is already understandable and maintainable. So why refactor in advance?

It turns out that code is not always perfectly clean. Sometimes we stop refactoring early. Sometimes we think our refactoring is complete when it is not. Sometimes we have duplicate code, or poorly named functions, or overweight classes. And sometimes we are enlightened by a new requirement.

A new requirement can force us to look at the code from a different angle. We can see new patterns, or see opportunities for improvement that we failed to see earlier.

When that happens, we see new ways of organizing the code. Often, the new organization allows for an easy change to meet the requirement. We might refactor classes to hold data in a different arrangement (perhaps a dictionary instead of a list) or break large-ish blocks into smaller blocks.
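For example (a hypothetical sketch, not taken from any particular codebase): suppose a class keeps customers in a list, and a new requirement calls for frequent lookups by customer id. Seen from the angle of the new requirement, a dictionary is the better arrangement, and that refactoring can be done first, with no change in behavior:

    # Before the refactoring: customers held as a list of (id, name) pairs,
    # so any lookup has to scan the whole list.
    #
    #     class CustomerRegistry:
    #         def __init__(self):
    #             self._customers = []
    #         def add(self, customer_id, name):
    #             self._customers.append((customer_id, name))

    # After the refactoring: same behavior, different arrangement, and the
    # existing tests still pass.
    class CustomerRegistry:
        def __init__(self):
            self._customers = {}                 # now keyed by customer_id

        def add(self, customer_id, name):
            self._customers[customer_id] = name

        # With the data rearranged, the new requirement becomes a one-line method.
        def name_for(self, customer_id):
            return self._customers[customer_id]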

In this situation, it is better to refactor the code before adding the new requirement. Instead of adding the new feature and refactoring, perform the operations in reverse sequence: refactor and then add the requirement. (Of course, you still test and you can still refactor at the end.) The full sequence is:
  • Define a new requirement
  • Write a test for that requirement
  • Run the test (and see that it fails)
  • Examine the code and identify improvements
  • Refactor the code (without adding the new requirement)
  • Run tests to verify that the code still works (skip the new test)
  • Change the code to make the test pass
  • Run the test (and see that it passes)
  • Refactor the code to make it clean
  • Run the test again (and see that it still passes)
The new steps are the three in the middle: examining the code, refactoring it without adding the new requirement, and verifying that the existing tests still pass.

Agile has taught us to change our processes when the changes are beneficial. Changing the Agile process itself is part of that. You can refactor before making changes. You should refactor before making changes, when the refactoring will help you.

Saturday, April 21, 2018

Why the ACM is stuck in academe

The Association for Computing Machinery (ACM) is a professional organization. Its main web page claims it is "Advancing Computing as a Science & Profession". In 2000, it recognized that it was focused exclusively on the academic world and that it had to expand. It has struggled with that expansion for the past two decades.

I recently found an example of its failure.

The flagship publication, "Communications of the ACM", is available on paper or on-line. (So far, so good.) It is available to all comers, with only some articles locked behind a paywall. (Also good.)

But the presentation is bland, almost stifling.

The Communications web site follows a standard "C-clamp" layout, with content in the center and links and administrative items wrapped around it on the top, left, and bottom. An issue's table of contents has titles (links) and descriptions for the individual articles of the magazine. This is a reasonable arrangement.

Individual articles are presented with header and footer, but without the left-side links. They are not using the C-clamp layout. (Also good.)

The fonts and colors are appealing, and they conform to accessibility standards.

But the problem that shows how the ACM fails to "get it" is with the comments. Articles still accept comments (which is good), but very few people comment. So few that many articles have no comments. How does the ACM present an article with no comments? How does it convey this to the reader? With a single, mechanical phrase under the article text:

No entries found

That's it. Simply the text "no entries found". There isn't even a header describing the section as a comments section. (There is a horizontal rule between the article and this phrase, so the reader has some inkling that "no entries found" is separate from the article. But nothing indicates that the phrase refers to comments.)

Immediately under the title at the top of the page there is a link to comments (labelled "Comments") which is a simple intrapage link to the empty, unlabelled comments section.

I find the phrase "no entries found" somewhat embarrassing. In the year 2018, we have the technology to provide text such as "no comments found" or "no comments" or perhaps "be the first to comment on this article". Yet the ACM, the self-proclaimed organization that "delivers resources that advance computing as a science and a profession", cannot bring itself to use any of those phrases. Instead, it allows the underlying CMS driving its web site to bleed through to the user.

A darker thought is that the ACM cares little for comments. It knows that it has to have them, to satisfy some need for "user engagement", but it doesn't really want them. That philosophy is consistent with the academic mindset of "publish and cite", in which citations to earlier publications are valued, but comments from random readers are not.

Yet the rest of the world (that is, people outside of academe) cares little for citations and references. It cares about opinions and information (and about ad revenue). Comments are an ongoing problem for web sites; few are informative and many are insulting, and many web sites have abandoned comments altogether.

The ACM hasn't disabled its comments, but it hasn't encouraged them either. It sits in the middle.

This is why the ACM struggles with its outreach to the non-academic world.

Thursday, April 19, 2018

Why no language to replace SQL?

The history of programming is littered with programming languages. Some endure for ages (COBOL, C, Java) and some live briefly (Visual J++). We often develop new languages to replace existing ones (Perl, Python).

Yet one language has endured and has seen no replacements: SQL.

SQL, invented in the 1970s and popularized in the 1980s, has lived a good life with no apparent challengers.

It is an anomaly. Every language I can think of has a "challenger" language. FORTRAN was challenged by BASIC. BASIC was challenged by Pascal. C++ was challenged by Java; Java was challenged by C#. Unix shell programming was challenged by AWK, which in turn was challenged by Perl, which in turn has been challenged by Python.

Yet there have been no (serious) challengers to SQL. Why not?

I can think of several reasons:
  • Everyone loves SQL and no one wants to change it.
  • Programmers think of SQL as a protocol (specialized for databases) and not a programming language. Therefore, they don't invent a new language to replace it.
  • Programmers want to work on other things.
  • The task is bigger than a programming language. Replacing SQL means designing the language, creating an interpreter (or compiler?), command-line tools (these are programmers, after all), bindings to other languages (Python, Ruby, and Perl at minimum), and data access routines -- along with all the features of SQL, including triggers, access controls, transactions, and audit logs.
  • SQL gets a lot of things right, and works.
I'm betting on the last. SQL, for all of its warts, is effective, efficient, and correct.

But perhaps there is a challenger to SQL: NoSQL.

In one sense, NoSQL is a replacement for SQL. But it is a replacement of more than the language -- it is a replacement of the notion of data structure. NoSQL "databases" store documents and photographs and other things, but they are rarely used to process transactions. NoSQL databases don't replace SQL databases, they complement them. (Some companies move existing data from SQL databases to NoSQL databases, but this is data that fits poorly in the relational structure. They move some of their data but not all of their data out of the SQL database. These companies are fixing a problem, not replacing the SQL language.)

NoSQL is a complement of SQL, not a replacement (and therefore not a true challenger). SQL handles part of our data storage and NoSQL handles a different part.

It seems that SQL will be with us for some time. It is tied to the notion of relational organization, which is a useful mechanism for storing and processing homogeneous data.

Wednesday, April 11, 2018

Big enough is big enough

Visual Studio has a macro capability, but you might never have used it. You might not even know that it exists.

You see, you cannot use it as Visual Studio comes "out of the box". The feature is disabled. You have to take action before you can use it.

First, there is a setting inside of Visual Studio to enable macros.

Second, there is a setting inside of Windows to allow Visual Studio macros. Only system administrators can enable it.

Yes, you read that right. There are two settings to enable macros in Visual Studio, and both must be enabled to run macros.

Why? I'm not sure, but my guess is that the Visual Studio setting was there all along, allowing macros if users wanted them. The second setting (inside Windows) was added later, as a security feature.

The second setting was needed because the macro language inside of Visual Studio is powerful. It can call Windows API functions, instantiate COM objects, and talk to .NET classes. All of this in addition to the "insert some text" and "move the insertion point" operations we expect of a text editor macro.

Visual Studio's macro language is the equivalent of an industrial-strength cleaning solvent: So powerful that it can be used only with great care. And one is always at risk of a malevolent macro, sent from a co-worker or stranger.

But macros don't have to be this way.

The Notepad++ program, a text editor for Windows (not an IDE), also has macro capabilities. Its macro capability is much simpler than that of Visual Studio: it records keystrokes and plays them back. It can do anything you, the user, can do in the program, and no more.

Which means, of course, that Notepad++'s macros are safe. They can do only the "normal" operations of a text editor, so it's not possible to create a truly malevolent macro -- or to send or receive one. (I guess the most malicious macro would be a "select all, delete, save-file" macro. It would be a nuisance but little else.)

The lesson? Macros that are "powerful enough" are, well, powerful enough. Macros that are "powerful enough to do anything" are, um, powerful enough to do anything, including things that are dangerous.

Notepad++ has macros that are powerful enough to do meaningful work. Visual Studio has macros that can do all sorts of things, much more than Notepad++, and apparently so powerful that they must be locked away from the "normal" user.

So Notepad++, with its relatively small macro capabilities, is usable, and Visual Studio, with its impressive and all-powerful macro capabilities (okay, that's a bit strong, but you get the idea), is *not* usable. Visual Studio's macros are too powerful for the average user, so you can't use them.

Something to think about when designing your next product.

Wednesday, April 4, 2018

Apple to drop Intel chips (or not)

The romance between Apple and Intel has come to an end.

In 2005, Apple announced that it was switching to Intel processors for its desktop and laptop computers. Previously it had used PowerPC chips, and the laptops were called "PowerBooks". The first Intel-based laptops were called "MacBooks".

Now, reports say that Apple plans to design its own processors for those computers. I'm certain that the folks over at Intel are less than happy.

Looking forward, I think a number of people will be unhappy with this change, from open source advocates to developers to even Apple itself.

Open source advocates may find that the new Apple-processor MacBooks are unable to run operating systems other than Apple's, which means that Linux will be locked out of the (new) Apple hardware. While only a minuscule number of people actually replace macOS with Linux (disclosure: I'm one of them), those who do may be rather vocal about the change.

Apple MacBooks are popular with developers. (Exactly why this is the case, I am not sure. I dislike the MacBook's keyboard and display, and prefer other equipment for my work. But maybe I have preferences different from most developers.)

Getting back to developers: look inside any start-up or small company, and MacBooks dominate the office space. I'm sure that part of this popularity comes from Apple's use of a BSD Unix foundation for macOS, which lets MacBook users run much of the open source software they rely on.

When Apple switches from Intel to its own (probably proprietary) processors, will that software still be available?

The third group affected by this change will be Apple itself. It may find that the development of processors is harder than it expects, with delays and trade-offs necessary for performance, power efficiency, security, and interfaces to other system components. Right now, Apple outsources those headaches to Intel. Apple may not like the decisions that Intel makes (after all, Intel serves other customers and must accommodate their needs as well as Apple's), and it may feel that control over the design will reduce those headaches.

In-sourcing the design of processors may reduce headaches... or it may simply move them. If Apple has been dissatisfied with Intel's delivery schedule for new chips, the new arrangement may simply mean that Apple management will be dissatisfied with its internal division's delivery schedule for new chips. Owning the design process may give Apple more control over the process, but not total control over it.

The move from standard, well-known processors to proprietary and possibly not well-understood processors moves Apple away from the general market and into its own space. Apple desktops and laptops may become proprietary and secret, with Apple processors and Apple systems-on-a-chip and Apple operating systems and Apple drivers and Apple software... and only Apple able to upgrade, repair, or modify them.

That's a bit of a long shot, and I don't know that it will happen. Apple management may find the idea appealing, hoping for increased revenue. But it is a move towards isolationism, away from the "free trade" market that has made PCs popular and powerful. It is also a move back to the market as it was before the IBM PC, when small computers were not commodities but were very different from each other. I'm not sure that it will help Apple in the long run.


Tuesday, March 27, 2018

A mainframe technique for improving web app performance

I've been working on a modern web application, making changes to improve its performance. For one improvement, I used an old technique from the days of mainframe processing.

I call it the "master file update" algorithm, although others may use different names.

It was commonly used when mainframe computers read data from tapes (or even punch cards). The program's job was to update a master file (say, account numbers and balances) with a file of transactions (each with an account number and a transaction amount). The master file could contain bank accounts, or insurance policies, or something similar.

Why did mainframes use this technique? In short, they had to. The master file was stored on a magnetic tape, and transactions were on another tape (or perhaps punch cards). Both files were sequential-access files, and the algorithm read each file's records in sequence, writing a new master file along the way. Sequential access is the only way to read a file on magnetic tape.

How did I use this technique? Well, I wasn't reading magnetic tapes, but I was working with two sets of data, one a "master" set of values and the other a set of "transactions". The original (slow) algorithm stored both sets of data in dictionary structures and required access by key values. Each access required a lookup into the dictionary. While the access was handled by the data structure's class, it still required work to find the correct value for each update.

The revised algorithm followed the mainframe technique: store the values in lists, not dictionaries, and make a single pass through the two sets of data. Starting with the first master value and the first transaction value, walk forward through the master values until the keys match. When the keys match, update the value and advance to the next transaction value. Repeat until you reach the end of both lists.
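Here is a minimal sketch of that single pass in Python (the record layout and names are invented for illustration; both lists must already be sorted by key):

    # Single-pass "master file update": both lists are in key order.
    # masters:      list of [key, balance] records, updated in place
    # transactions: list of (key, amount) pairs to apply
    def apply_transactions(masters, transactions):
        m = 0
        for key, amount in transactions:
            # walk forward through the master records until the keys match
            while m < len(masters) and masters[m][0] < key:
                m += 1
            if m < len(masters) and masters[m][0] == key:
                masters[m][1] += amount          # update the matching record
            # a transaction with no matching master record is simply skipped

    masters = [["A100", 50.0], ["A200", 75.0], ["A300", 20.0]]
    transactions = [("A100", 25.0), ("A300", -5.0)]
    apply_transactions(masters, transactions)
    # masters is now [["A100", 75.0], ["A200", 75.0], ["A300", 15.0]]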

Dictionaries are good for "random access" to data. When you have to read (or update) a handful of values, and your updates are in no particular order, a dictionary structure is useful. There is a cost, though: the dictionary has to find the item you want to read or update. Our situation was such that the cost of those lookups was affecting our performance.

Lists are simpler structures than dictionaries. They don't have to search for the data; you move to the next item in the list and read its value. The result is that read and write operations are faster. The change for a single operation was small; multiplied by the number of operations, it added up to a modest yet noticeable improvement.

Fortunately for me, the data was easily converted to lists, and it was in the right sequence. The revised code was faster, which was the goal. And the code was still readable, so there was no downside to the change.

The lesson for the day? Algorithms are usually not dependent on technology. The "master file update" algorithm is not a "mainframe technique", suitable only for mainframe applications. It works quite well in web apps. Don't assume that you must use only "web app" algorithms for web apps, and only "cloud app" algorithms for cloud apps. Such assumptions can blind you to good solutions. Keep your options open, and use the best algorithm for the job.

Tuesday, March 13, 2018

Programming languages and character sets

Programming languages are similar, but not identical. Even "common" things such as expressions can be represented differently in different languages.

FORTRAN used the sequence ".LT." for "less than", normally (today) indicated by the sign <, and ".GT." for "greater than", normally the sign >. Why? Because in the early days of computing, programs were written on punch cards, and punch cards used a small set of characters (uppercase letters, digits, and a few punctuation marks). The signs for "greater than" and "less than" were not part of that character set, so the language designers had to make do with what was available.

BASIC used parentheses to denote both function arguments and variable subscripts. Nowadays, most languages use square brackets for subscripts. Why did BASIC use parentheses? Because most BASIC programs were written on Teletype machines, large mechanical printing terminals with a limited set of characters. And -- you guessed it -- the square bracket characters were not part of that set.

When C was invented, we were moving from Teletypes to paperless terminals. These new terminals supported the entire ASCII character set, including lowercase letters and all of the punctuation available on today's US keyboards. Thus, C used all of the symbols available, including lowercase letters and just about every punctuation symbol.

Today we use modern equipment to write programs. Just about all of our equipment supports UNICODE. The programming languages we create today use... the ASCII character set.

Oh, programming languages allow string literals and identifiers with non-ASCII characters, but none of our languages require the use of a non-ASCII character. No languages make you declare a lambda function with the character λ, for example.

Why? I would think that programmers would like to use the characters in the larger UNICODE set. The larger character set allows for:
  • Greek letters for variable names
  • Multiplication (×) and division (÷) symbols
  • Distinct characters to denote templates and generics
C++ chose to denote templates with the less-than and greater-than symbols. The decision was somewhat forced, as C++ lives in the ASCII world. Java and C# have followed that convention, although it's not clear that they had to. Yet the decision has its costs; tokenizing source code is much harder when symbols hold multiple meanings. Java and C# could have used the double-angle brackets (« and ») to denote generics.

I'm not recommending that we use the entire UNICODE set. Several glyphs (such as 'a') have multiple code points assigned (such as the Latin 'a' and the Cyrillic 'a'), and having multiple code points that appear the same is, in my view, asking for trouble. Identifiers and names which appear (to the human eye) to be the same would be considered different by the compiler.

But I am curious as to why we have settled on ASCII as the character set for languages.

Maybe it's not the character set. Maybe it is the equipment. Maybe programmers (and more specifically, programming language designers) use US keyboards. When looking for characters to represent some idea, our eyes fall upon our keyboards, which present the ASCII set of characters. Maybe it is just easier to use ASCII characters -- and then allow UNICODE later.

If that's true (that our keyboard guides our language design), then I don't expect languages to expand beyond ASCII until keyboards do. And I don't expect keyboards to expand beyond ASCII just for programmers. I expect programmers to keep using the same keyboards that the general computing population uses. In the US, that means ASCII keyboards. In other countries, we will continue to see ASCII with accented characters, Cyrillic, and specialized keyboards for East Asian languages. I see no reason for a UNICODE-based keyboard.

If our language shapes our thoughts, then our keyboard shapes our languages.