Friday, June 21, 2019

The complexity of programming languages

A recent project saw me examining and tokenizing code for different programming languages. The languages ranged from old languages (COBOL and FORTRAN, among others) to modern languages (Python and Go, among others). It was an interesting project, and I learned quite a bit about many different languages. (By 'tokenize', I mean to identify the type of each item in a program: Variables, identifiers, function names, operators, etc. I was not parsing the code, or building an abstract syntax tree, or compiling the code into op-codes. Tokenizing is the first step of compiling, but a far cry from actually compiling the code.)
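To make "tokenize" concrete, here is a minimal sketch in Python. It is not the code from my project; the token categories and the names TOKEN_SPEC and tokenize are made up for illustration, and a real language needs many more rules.

```python
import re

# Hypothetical token categories for a tiny toy language.
# Order matters: earlier patterns win when two could match.
TOKEN_SPEC = [
    ("number",     r"\d+(?:\.\d+)?"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator",   r"[-+*/=<>]"),
    ("group",      r"[()]"),
    ("whitespace", r"\s+"),
    ("invalid",    r"."),   # anything the rules above did not recognize
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(source):
    """Return a list of (token_type, text) pairs, skipping whitespace."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup != "whitespace":
            tokens.append((match.lastgroup, match.group()))
    return tokens

for token_type, text in tokenize("total = count + 42"):
    print(token_type, text)
```

Notice that the result is only a flat list of labeled pieces. Nothing here knows that "total" is being assigned a value; that is the parser's job.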

One surprising result: newer languages are easier to tokenize than older languages. Python is easier to tokenize than COBOL, and Go is easier to tokenize than FORTRAN.

This is counterintuitive. One would think that older languages would be primitive (and therefore easy to tokenize) and modern languages sophisticated (and therefore difficult to tokenize). Yet my experience shows the opposite.

Why would this be? I can think of two -- no, three -- reasons.

First, the old languages (COBOL, FORTRAN, and PL/I) were designed in the age of punch cards, and punch cards impose limits on source code. COBOL, FORTRAN, and PL/I have few things in common, but one thing that they do have in common is a fixed line layout and the 'identification' field in columns 73 through 80.

When your program is stored on punch cards, a risk is that someone will drop the deck of cards and the cards will become out of order. Such a thing cannot happen with programs stored in disk files, but with punch cards such an event is a real risk. To recover from that event, the right-most columns were reserved for identification: a code, unique to each line, that would let a card sorter machine (there were such things) put the cards back into their proper order.

The need for an identification column is tied to the punch card medium, yet it became part of each language standard. COBOL, FORTRAN, and PL/I standards all refer to columns 73 through 80 as reserved for identification, and they could not be used for "real" source code. Programs transferred from punch cards to disk files (when disks became available to programmers) kept the rule for the identification field -- probably to make conversion easy. Later versions of languages did drop the rule, but the damage had been done. The identification field was part of the language specification.

Because the identification field was part of the language specification, I had to tokenize the identification numbers. Mostly they were not a problem -- just another "thing" to tokenize -- but sometimes they fell in the middle of a string literal or a comment that continued onto the next line, which are awkward situations.
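One way to handle the identification field is to peel it off each line before tokenizing the rest. The sketch below assumes an 80-column card whose last eight columns (73 through 80) hold the identification; the function name split_card_line and the sample line are made up for illustration, and this is not necessarily how my own tokenizer does it.

```python
def split_card_line(line, source_end=72):
    """Split a fixed-format line into (source_text, identification).

    Assumes columns 1-72 hold source text and columns 73-80 hold the
    identification field. Shorter lines simply have no identification.
    """
    line = line.rstrip("\n")
    source_text = line[:source_end]
    identification = line[source_end:source_end + 8].strip()
    return source_text, identification

# A hypothetical FORTRAN-style card image, padded out to column 72.
card = "      X = 1".ljust(72) + "PROG0010"
source_text, identification = split_card_line(card)
print(repr(source_text.rstrip()))
print(repr(identification))
```

Splitting by column first means the tokenizer for the source text never sees the identification, even when a string literal or comment runs up to column 72 and continues on the next card.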

Anyway, the tokenization of old languages has its challenges.

New languages don't suffer from such problems. Their source code was never stored on punch cards, and they never had identification fields. (Either within string literals or not.)

The tokenization of modern languages is easier in another way, too. Each language has a set of token types, but older languages have a larger and more varied set. Most languages have identifiers, numeric literals, and operators; COBOL also has picture values and level indicators, and PL/I has attributes and conditions (among other token types).

Which brings me to the second reason for modern languages to have simpler tokenizing requirements: The languages are designed to be easy to tokenize.

It seems to me that, intentionally or not, the designers of modern languages have made design choices that reduce the work for tokenizers. They have built languages that are easy to tokenize, and therefore have simple logic for tokenizers. (All compilers and interpreters have tokenizers; it is a step in converting the source to executable bytes.)

So maybe the simplicity of language tokenization is the result of the "laziness" of language designers.

But I have a third reason, one that I believe is the true reason for the simplicity of modern language tokenizers.

Modern languages are easy to tokenize because they are easy to read (by humans).

A language that is easy to read (for a human) is also easy to tokenize. Language designers have been consciously designing languages to be easy to read. (Python is the leading example, but all designers claim their language is "easy to read".)

Languages that are easy to read are easy to tokenize. It's that simple. We've been designing languages for humans, and as a side effect we have made them easy for computers.

I, for one, welcome the change. Not only does it make my job easier (tokenizing all of those languages) but it makes every developer's job easier (reading code from other developers and writing new code).

So I say three cheers for simple* programming languages!

* Simple does not imply weak. A simple programming language may be easy to understand, yet it may also be powerful. The combination of the two is the real benefit here.