Tuesday, March 13, 2018

Programming languages and character sets

Programming languages are similar, but not identical. Even "common" things such as expressions can be represented differently in different languages.

FORTRAN used the sequence ".LT." for "less than", normally (today) indicated by the sign <, and ".GT." for "greater than", normally the sign >. Why? Because in the early days of computing, programs were written on punch cards, and punch cards used a small set of characters (uppercase letters, digits, and a few punctuation marks). The signs for "greater than" and "less than" were not part of that character set, so the language designers had to make do with what was available.

BASIC used parentheses to denote both function arguments and variable subscripts. Nowadays, most languages use square brackets for subscripts. Why did BASIC use parentheses? Because most BASIC programs were written on Teletype machines, large mechanical printing terminals with a limited set of characters. And -- you guessed it -- the square bracket characters were not part of that set.

When C was invented, we were moving from Teletypes to paperless terminals. These new terminals supported the entire ASCII character set, including lowercase letters and all of the punctuation available on today's US keyboards. Thus, C used all of the symbols available, including lowercase letters and just about every punctuation symbol.

Today we use modern equipment to write programs. Just about all of our equipment supports UNICODE. The programming languages we create today use... the ASCII character set.

Oh, programming languages allow string literals and identifiers with non-ASCII characters, but none of our languages require the use of a non-ASCII character. No language makes you declare a lambda function with the character λ, for example.
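
Python 3, to pick one example, happily accepts Greek letters in identifiers -- but nothing requires them, and Unicode operators are still off limits. A small sketch of my own:

# Python 3 accepts Unicode letters in identifiers...
π = 3.14159
radius = 2.0
area = π * radius ** 2       # perfectly legal; area is 12.56636
print(area)
# ...but it has no Unicode operators; this line is a SyntaxError if uncommented:
# area = π × radius ** 2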

Why? I would think that programmers would like to use the characters in the larger UNICODE set. The larger character set allows for:
  • Greek letters for variable names
  • Multiplication (×) and division (÷) symbols
  • Distinct characters to denote templates and generics
C++ chose to denote templates with the less-than and greater-than symbols. The decision was somewhat forced, as C++ lives in the ASCII world. Java and C# have followed that convention, although it's not clear that they had to. Yet the decision has its costs; tokenizing source code is much harder when symbols hold multiple meanings. Java and C# could have used the double-angle brackets (« and ») to denote generics.

I'm not recommending that we use the entire UNICODE set. Several glyphs (such as 'a') have different code points assigned (such as the Latin 'a' and the Cyrillic 'a'), and having multiple code points that appear the same is, in my view, asking for trouble. Identifiers and names which appear (to the human eye) to be the same would be considered different by the compiler.

But I am curious as to why we have settled on ASCII as the character set for languages.

Maybe it's not the character set. Maybe it is the equipment. Maybe programmers (and more specifically, programming language designers) use US keyboards. When looking for characters to represent some idea, our eyes fall upon our keyboards, which present the ASCII set of characters. Maybe it is just easier to use ASCII characters -- and then allow UNICODE later.

If that's true (that our keyboard guides our language design) then I don't expect languages to expand beyond ASCII until keyboards do. And I don't expect keyboards to expand beyond ASCII just for programmers. I expect programmers to keep using the same keyboards that the general computing population uses. In the US, that means ASCII keyboards. In other countries, we will continue to see ASCII with accented characters, Cyrillic layouts, and special keyboards for East Asian languages. I see no reason for a UNICODE-based keyboard.

If our language shapes our thoughts, then our keyboard shapes our languages.

Tuesday, March 6, 2018

My Technology is Old

I will admit it... I'm a sucker for old hardware.

For most of my work, I have a ten-year-old generic tower PC, a non-touch (and non-glare) 22-inch display, and a genuine IBM Model M keyboard.

The keyboard (a Model M13, to be precise) is the old-style "clicky" keyboard with a built-in TrackPoint nub that emulates a mouse. It is, by far, the most comfortable keyboard I have used. It's also durable -- at least thirty years old and still going, even after lots of pounding. I love the shape of the keys, the long key travel (almost 4 mm), and the loud clicky sound on each keypress. (Officemates are not so fond of the last.)

For other work, I use a relatively recent HP laptop. It also has a non-glare screen. The keyboard is better than most laptop keyboards these days, with some travel and a fairly standard layout.

I prefer non-glare displays to the high-gloss touch displays. The high-gloss displays are quite good as mirrors, and reflect everything, especially lamps and windows. The reflections are distracting; non-glare displays prevent such disturbances.

I use an old HP 5200C flatbed scanner. Windows no longer recognizes it as a device. Fortunately, Linux does recognize it and lets me scan documents without problems.

A third workstation is an Apple PowerBook G4. The PowerBook is the predecessor to the MacBook. It has a PowerPC processor, perhaps 1 GB of RAM (I haven't checked in a while), and a 40 GB disk. As a laptop, it is quite heavy, weighing more than 5 pounds. Some of the weight is in the battery, but a lot is in the case (aluminum), the display, and the circuits and components. The battery still works, and provides several hours of power. It holds up better than my old MacBook, which has a battery that lasts for less than two hours. The PowerBook also has a nicer keyboard, with individually shaped keys as opposed to the MacBook's flat keycaps.

Why do I use such old hardware? The answer is easy: the old hardware works, and in some ways is better than new hardware.

I prefer the sculpted keys of the IBM Model M keyboard and the PowerBook G4 keyboard. Modern systems have flat, non-sculpted keys. They look nice, but I buy keyboards for my fingers, not my eyes.

I prefer the non-glare screens. Modern systems provide touchscreens. I don't need to touch my displays; my work is with older, non-touch interfaces. A touchscreen is unnecessary, and it brings the distracting high-glare finish with it. I buy displays for my eyes, not my fingers.

Which is not to say that my old hardware is without problems. The PowerBook is so old that modern Linux distros can run only in text mode. This is not a problem, as I have several projects which live in the text world. (But at some point soon, Linux distros will drop support for the PowerPC architecture, and then I will be stuck.)

Could I replace all of this old hardware with shiny new hardware? Of course. Would the new hardware run more reliably? Probably (although the old hardware is fairly reliable). But those are minor points. The main question is: Would the new hardware help me be more productive?

After careful consideration, I have to admit that, for me and my work, new hardware would *not* improve my productivity. It would not make me type faster, or write better software, or think more clearly.

So, for me, new hardware can wait. The old stuff is doing the job.

Wednesday, February 28, 2018

Backwards compatibility but not for Python

Programming languages change over time. (So do the natural, human-spoken languages. But let's stick to the programming languages.)

Most languages are designed, and changes to them carefully constructed, to avoid breaking older programs. This is a tradition from the earliest days of programming. New versions of FORTRAN and COBOL were introduced with new features, yet the newer compilers accepted the older programs. (Probably because the customers who had bought those expensive computers would have been very mad to learn that an "upgrade" had broken their existing programs.)

Since then, almost every language has followed this tradition. BASIC, Pascal, dBase (and Clipper and XBase), Java, Perl, ... they all strove for (and still strive for) backwards compatibility.

The record is not perfect. A few exceptions do come to mind:
  • In the 1990s, multiple releases of Visual Basic broke compatibility with older versions as Microsoft decided to improve the syntax.
  • Early versions of Perl changed syntax. Those changes were Larry Wall's decisions to improve the syntax.
  • The C language changed the syntax of the addition-assignment and related operators (from =+ to +=), which resolved an ambiguity in the syntax.
  • C++ broke compatibility with a scoping change in "for" statements. That was its only such change, to my knowledge.
These exceptions are few. The vast history of programming languages shows compatibility from old to new versions.

But there is one language that is an exception.

That language is Python.

Python has seen a number of changes over time. I should say "Pythons", as there are two paths for Python development: Python 2 and Python 3. Each path has multiple versions (Python 2.4, 2.5, 2.6, and Python 3.4, 3.5, 3.6, etc.).

The Python 3 path was started as the "next generation" of Python interpreters, and it came with the explicit statement that it would not be compatible with the Python 2 path.

Not only are the two paths different (and incompatible), but versions within each path (or at least the Python 3 path) are sometimes incompatible. That is, some things in Python 3.6 are different in Python 3.7.

I should point out that the changes between versions (Python 3.6 and 3.7, or even Python 2 and 3) are small. Most of the language remains the same across versions. If you know Python 2, you will find Python 3 familiar. (The familiarity may cause frustration as you stumble across one of the compatibility-breaking changes, though.)
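
To make the point concrete, here is a small sketch of my own showing two of the better-known breaks between Python 2 and Python 3 (hardly an exhaustive list):

# Accepted by Python 2, rejected or changed by Python 3:
print "hello"       # a statement in Python 2; a SyntaxError in Python 3
answer = 7 / 2      # Python 2: integer division, answer is 3

# The Python 3 spellings:
print("hello")      # print is now a function
answer = 7 / 2      # Python 3: true division, answer is 3.5 (7 // 2 gives 3)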

Should we care? What does it mean for Python? What does it mean for programming in general?

One could argue that changes to a programming language are necessary. The underlying technology changes, and programming languages must "keep up". Thus, changes will happen, either in many small changes or one big change. The latter often is a shift away from one programming language to another. (One could cite the transition from FORTRAN to BASIC as computing changed from batch to interactive, for example.)

But that argument doesn't hold up against other evidence. COBOL, for example, has been popular for transaction processing and remains so. C and C++ have been popular for operating systems, device drivers, and other low-level applications, and remain so. Their backwards-compatible growth has not appreciably diminished their roles in development.

Other languages have gained popularity and remain popular too. Java and C# have strong followings. They, too, have not been hurt by backwards-compatibility.

Python is an opportunity to observe the behavior of the market. We have been working on the assumption that backwards-compatibility is desired by the user base. This assumption may be a false one, and the Python approach may be a good way to observe the true desires of the market. If successful (and Python is successful, so far), then we may see other languages adopt the "break a few things" philosophy for changes.

Of course, there may be some demand for languages that keep compatibility across versions. It may be a subset of the market, something that isn't visible with only one language breaking compatibility, but only visible when more languages change their approach. If that is the case, we may see some languages advertising their backwards-compatibility as a feature.

Who knows? It may be that the market demand for backwards-compatibility comes from Python users. As Python gains popularity (and it is gaining popularity) and more and more individuals and organizations build Python projects, they may find Python's approach unappealing.

Let's see what happens!

Thursday, February 22, 2018

Variables are... variable

The nice (and sometimes frustrating) thing about different programming languages is that they handle things, well, differently.

Consider the simple concept of a "variable". It is a thing in a program that holds a value. One might think that programming languages agree on something so simple -- yet they don't.

There are four actions associated with variables: declaration, initialization, assignment, and reference (as in 'use', not a constrained form of pointer).

A declaration tells the compiler or interpreter that a variable exists and often specifies a type. Some languages require a declaration before a variable can be assigned a value or used in a calculation; others do not.

Initialization provides a value during declaration. This is a special form of assignment.

Assignment assigns a value, and is not part of declaration. It occurs after the declaration, and may occur multiple times. (Some languages do not allow for assignment after initialization.)

A reference to a variable is the use of its value, either to compute some other value or to provide the value to a function or subroutine.

It turns out that different languages have different ideas about these operations. Most languages follow these definitions; the differences are in the presence or absence of these actions.

C, C++, and COBOL (to pick a few languages) all require declarations, allow for initialization, and allow for assignment and referencing.

In C and C++ we can write:

int i = 17;
i = 12;
printf("%d\n", i);

This code declares and initializes the variable i as an int with value 17, then assigns the value 12, then calls the printf() function to write the value to the console. COBOL has similar abilities, although the syntax is different.

Perl, Python, and Ruby (to pick different languages) do not have declarations or initialization, but they do allow for assignment and reference.

In Ruby we can write:

i = 12
puts i

Which assigns the value 12 to i and then writes it to the console. Notice that there is no declaration and no type specified for the variable.

Astute readers will point out that Python and Ruby don't have "variables", they have "names". A name is a reference to an underlying object, and multiple names can point to the same object. Java and C# use a similar mechanism for non-trivial objects. The difference is not important for this post.
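
For readers who are curious anyway, here is a quick Python sketch of two names bound to a single object:

a = [1, 2, 3]
b = a             # 'b' is a second name for the same list object, not a copy
b.append(4)
print(a)          # [1, 2, 3, 4] -- the change is visible through both names
print(a is b)     # True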

BASIC (not Visual Basic or VB.NET, but old-school BASIC) is a bit different. Like Perl, Python, and Ruby it does not have declarations. Unlike those languages, it lets you write a statement that prints the value of an undeclared (and therefore uninitialized and unassigned) variable, something like:

10 PRINT A

This is a concept that would cause a C compiler to emit errors and refuse to produce an executable. In the scripting languages, this would cause a run-time error. BASIC handles this with grace, providing a default value of 0 for numeric variables and "" for text (string) variables. (The AWK language also assigns a reasonable value to uninitialized variables.)
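
Python, for one, supplies no default value; a minimal sketch:

try:
    print(i)                 # 'i' has not been assigned yet
except NameError as error:
    print(error)             # name 'i' is not defined
i = 12
print(i)                     # 12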

FORTRAN has an interesting mix of capabilities. It allows for declarations but does not require them. Variables have a specific type, either integer or real. When a variable is listed in a declaration, it has the specified type; when a variable is not declared it has a type based on the first letter of its name!

As in BASIC, variables in FORTRAN can be referenced without being initialized. Unlike BASIC, FORTRAN does not provide default values. Instead it blissfully uses whatever values are in memory at the location assigned to the variable. (COBOL, C, and C++ have this behavior too.)

What's interesting is the trend over time. Let's look at a summary of languages and their capabilities, and the year in which they were created:

Languages which require declaration but don't force initialization

COBOL (1950s)
Pascal (1970s)
C (1970s)
C++ (1980s)
Java (1995)
C# (2000s)
Objective-C (1990s)

Languages which require declaration and require initialization (or initialize for you)

Eiffel (1980s)
Go (2010)
Swift (2014)
Rust (2015)

Languages which don't allow declarations and require assignment before reference

Perl (1987)
Python (1989)
Ruby (1990s)

Languages which don't require (or don't allow) declaration and allow reference before assignment

FORTRAN (1950s)
BASIC (1960s)
AWK (1970s)
PowerShell (2000s)

This list of languages is hardly comprehensive, and it ignores the functional programming languages completely. Yet it shows something interesting: there is no trend for variables. That is, languages in the 1950s required declarations (COBOL) or didn't (FORTRAN), and later languages require declaration (Go) or don't (Ruby). Early languages allow for initialization, as do later languages. Early languages allow for use-without-assignment, as do later languages.

Perhaps a more comprehensive list may show trends over time. Perhaps splitting out the different versions of languages will show convergence of variables. Or perhaps not.

It is possible that we (that is, programmers and language designers) don't really know how we want variables to behave in our languages. With more than half a century of experience we're still developing languages with different capabilities.

Or maybe we have, in some way, decided. It's possible that we have decided that we need languages with different capabilities for variables (and therefore different languages). If that is the case, then we will never see a single language become dominant.

That, I think, is a good outcome.

Tuesday, February 6, 2018

The IRS made me a better programmer

We US taxpayers have opinions of the IRS, the government agency tasked with the collection of taxes. Those opinions tend to be strong and tend to fall on the "not favorable" side. Yet the IRS did me a great favor and helped me become a better programmer.

The assistance I received was not through employment at the IRS, nor did they send me a memo entitled "How to be a better programmer". They did give me some information, not related to programming, yet it turned out to be the most helpful advice on programming in my career.

That advice was the simple philosophy: One operation at a time.

The IRS uses this philosophy when designing the forms for tax returns. There are a lot of forms, and some cover rather complex notions and operations, and all must be understandable by the average taxpayer. I've looked at these forms (and used a number of them over the years) and while I may dislike our tax laws, I must admit that the forms are as easy and understandable as tax law permits. (Tax law can be complex with intricate concepts, and we can consider this complexity to be "essential" -- it will be present in any tax form no matter how well you design it.)

Back to programming. How does the philosophy of "one operation at a time" change the way I write programs?

A lot, as it turns out.

The philosophy of "one operation at a time" is directly applicable to programming. Well, my programming, at least. I had, over the years, developed a style of combining operations onto a single line.

Here is a simplified example of my code, using the "multiple operations" style:

harry = y.elements.select { |element| element.name == 'harry' }.last

It is concise, putting several activities on a single line. This style makes for shorter programs, but not necessarily more understandable programs. Shorter programs are better when the shortness is measured in operations, not raw lines. Packing a bunch of operations -- especially unrelated operations -- onto a single line is not simplifying a program. If anything, it is making it more complex, as we tend to assume that operations on the same line are somehow connected.

I changed my style. I shifted from multi-operation lines to single operation lines, and I was immediately pleased with the result.

Here's the example from above, but with the philosophy of one operation per line:

elements = y.elements
harry = nil
elements.each do |element|
  harry = element if element.name == 'harry'
end

I have found two immediate benefits from this new style.

The first benefit is a better experience when debugging. When stepping through the code with the debugger, I can examine intermediate values. Debuggers are line-oriented, and execute the single-line version all in one go. (While there are ways to force the debugger to execute each function separately, there are no variables to hold the intermediate results.)

The second benefit is that it is easier to identify duplicate code. By splitting operations onto multiple lines, I find it easier to identify duplicate sequences. Sometimes the code is not an exact duplicate, but the structure is the same. Sometimes portions of the code are the same. I can refactor the duplicated code into functions, which simplifies the code (fewer lines) and consolidates common logic in a single place (one point of truth).

Looking back, I can see that my code is somewhat longer, in terms of lines. (Refactoring common logic reduces it somewhat, but not enough to offset the expansion of multiline operations.)

Yet the longer code is easier to read, easier to explain to others, and easier to fix. And since the programs I am writing are much smaller than the computer's capabilities, there is little expense in slightly longer programs. I suspect that compilers (for languages that use them) are optimizing a lot of my "one at a time" operations and condensing them, perhaps better than I can. The executables produced are about the same size as before. Interpreters, too, seem to have little problem with multiple simple statements, and run the "one operation" version of programs just as fast as the "multiple operations" version. (This is my perception; I have not conducted formal time trials of the two versions.)

Simpler code, easier to debug, and easier to explain to others. What's not to like?

Wednesday, January 31, 2018

Optimizing in the wrong direction

Back in the late 2000s, I toyed with the idea of a new version control system. It wasn't git, or even git-like. In fact, it was the opposite.

At the time, version control was centralized. There was a single instance of the repository and you (the developer) had a single "snapshot" of the files. Usually, your snapshot was the "tip", the most recent version of each file.

My system, like other version control systems of the time, was a centralized system, with versions of each file stored as 'diff' packages. That was the traditional approach for version control, as storing a 'diff' took less space than storing the entire version of the file.

Git changed the approach for version control. Instead of a single central repository, git is a distributed version control system. It replicates the entire repository in every instance and uses a sophisticated protocol to synchronize changes across instances. When you clone a repo in git, you get the entire repository.

Git can do what it does because disk space is now plentiful and cheap. Earlier version control systems worked on the assumption that disk space was expensive and limited. (Which, when SCCS was created in the 1970s, was true.)

Git is also directory-oriented, not file-oriented. Git looks at the entire directory tree, which allows it to optimize operations that move files or duplicate files in different directories. File-oriented version control systems, looking only at the contents of a single file at a time, cannot make those optimizations. That difference, while important, is not relevant to this post.

I called my system "Amnesia". My "brilliant" idea was to, over time, remove diffs from the repository and thereby use even less disk space. Deletion was automatic, and I let the user specify a set of rules for deletion, so important versions could be saved indefinitely.

My improvement was based on the assumption that disk space was expensive. Looking back, I should have known better. Disk space was not expensive, and it was not getting more expensive -- it was getting cheaper.

Anyone looking at this system today would be, at best, amused. Even I can only grin at my error.

I was optimizing, but for the wrong result. The "Amnesia" approach reduced disk space, at the cost of time (it takes longer to compute diffs than it does to store the entire file), information (the removal of versions also removes information about who made the change), and development cost (for the auto-delete functions).

The lesson? Improve, but think about your assumptions. When you optimize something, do it in the right direction.

Wednesday, January 24, 2018

Cloud computing is repeating history

A note to readers: This post is a bit of a rant, driven by emotion. My 'code stat' project, hosted on Microsoft Azure's web app PaaS platform, has failed and I have yet to find a resolution.

Something has changed in Azure, and I can no longer deploy a new version to the production servers. My code works; I can test it locally. Something in the deployment sequence fails. This is a test project, using the free level of Azure, which means no monthly costs but also means no support -- other than the community help pages.

There are a few glorious advances in IT, advances which stand out above the others. They include the PC revolution (which saw individuals purchasing and using computers), the GUI (which saw people untrained in computer science using computers), and the smartphone (which saw lots more people using computers for lots more sophisticated tasks).

The PC revolution was a big change. Prior to personal computers (whether they were IBM PCs, Apple IIs, or Commodore 64s), computers were large, expensive, and complicated; they were especially difficult to administer. Mainframes and even minicomputers were large and expensive; an individual could afford one only if they were enormously wealthy and had lots of time to read manuals and try different configurations to make the thing work.

The consumer PCs changed all of that. They were expensive, but within the range of the middle class. They required little or no administration effort. (The Commodore 64 was especially easy: plug it in, attach to a television, and turn it on.)

Apple made the consumer PC easier to use with the Macintosh. The graphical user interface (lifted from Xerox PARC's Alto, and later copied by Microsoft Windows) made many operations and concepts consistent. Configuration was buried, and sometimes options were reduced to "the way Apple wants you to do it".

It strikes me that cloud computing is in a "mainframe phase". It is large and complex, and while an individual can create an account (even a free account), the complexity and time necessary to learn and use the platform are significant.

My issue with Microsoft Azure is precisely that. Something has changed, and it behaves differently than it did in the past. (It's not my code; the change is in the deployment of my app.) I don't think that I have changed something in Azure's configuration -- although I could have.

The problem is that once you go beyond the 'three easy steps to deploy a web app', Azure is a vast and intimidating beast with lots of settings, each with new terminology. I could poke at various settings, but will that fix the problem or make things worse?

From my view, cloud computing is a large, complex system that requires lots of knowledge and expertise. In other words, it is much like a mainframe. (Except, of course, you don't need a large room dedicated to the equipment.)

The "starter plans" (often free) are not the equivalent of a PC. They are merely the same, enterprise-level plans with certain features turned off.

A PC is different from a mainframe reduced to tabletop size. Both have CPUs and memory and peripheral devices and operating systems, but they are two different creatures. PCs have fewer options, fewer settings, fewer things you (the user) can get wrong.

Cloud computing is still at the "mainframe level" of options and settings. It's big and complicated, and it requires a lot of expertise to keep it running.

If we repeat history, we can expect companies to offer smaller, simpler versions of cloud computing. The advantage will be an easier learning curve and less required expertise; the disadvantage will be lower functionality. (Just as minicomputers were easier and less capable than mainframes and PCs were easier and less capable than minicomputers.)

I'll go out on a limb and predict that the companies who offer simpler cloud platforms will not be the current big providers (Amazon.com, Microsoft, Google). Mainframes were challenged by minicomputers from new vendors, not the existing leaders. PCs were initially constructed by hobbyists from kits. Soon after, companies such as Radio Shack, Commodore, and the newcomer Apple offered fully assembled, ready-to-run computers. IBM offered the PC after the success of these upstarts.

The driver for simpler cloud platforms will be cost -- direct and indirect, mostly indirect. The "cloud computing is a mainframe" analogy is not perfect, as the billed costs for cloud platforms can be low. The expense is not in the hardware, but in the time to make the thing work. Current cloud platforms require expertise, and that expertise is not cheap. Companies are willing to pay for that expertise... for now.

I expect that we will see competition to the big cloud platforms, and the marketing will focus on ease of use and low Total Cost of Ownership (TCO). The newcomers will offer simpler clouds, sacrificing performance for reduced administration cost.

My project is currently stuck. Deployments fail, so I cannot update my app. Support is not really available, so I must rely on the limited web pages and perhaps trial and error. I may have to create a new app in Azure and copy my existing code to it. I'm not happy with the experience.

I'm also looking for a simpler cloud platform.