Tuesday, March 13, 2018

Programming languages and character sets

Programming languages are similar, but not identical. Even "common" things such as expressions can be represented differently in different languages.

FORTRAN used the sequence ".LT." for "less than", normally (today) indicated by the sign <, and ".GT." for "greater than", normally the sign >. Why? Because in the early days of computing, programs were written on punch cards, and punch cards used a small character set (uppercase letters, digits, and a few punctuation marks). The signs for "greater than" and "less than" were not part of that character set, so the language designers had to make do with what was available.

BASIC used parentheses to denote both function arguments and array subscripts. Nowadays, most languages use square brackets for subscripts. Why did BASIC use parentheses? Because most BASIC programs were written on Teletype machines, large mechanical printing terminals with a limited set of characters. And -- you guessed it -- the square bracket characters were not part of that set.

When C was invented, we were moving from Teletypes to video terminals. These new terminals supported the entire ASCII character set, including lowercase letters and all of the punctuation available on today's US keyboards. Thus, C used nearly all of them: lowercase letters and just about every punctuation symbol.

Today we use modern equipment to write programs. Just about all of our equipment supports UNICODE. The programming languages we create today use... the ASCII character set.

Oh, programming languages allow string literals and identifiers with non-ASCII characters, but none of our languages require the use of a non-ASCII character. No languages make you declare a lambda function with the character λ, for example.

Why? I would think that programmers would like to use the characters in the larger UNICODE set. The larger character set allows for:
  • Greek letters for variable names (see the sketch after this list)
  • Multiplication (×) and division (÷) symbols
  • Distinct characters to denote templates and generics
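
Some of this is already possible today. Here is a rough sketch in C++ -- assuming a compiler that accepts UTF-8 extended identifiers (recent GCC and Clang do) and a source file saved as UTF-8:

    #include <cstdio>

    int main() {
        // Greek-letter identifiers compile on toolchains that accept
        // UTF-8 extended identifiers:
        double π = 3.14159265358979;
        double θ = π / 6.0;

        // The multiplication sign, however, is not a C++ operator, so the
        // commented-out line would not compile; we still spell it '*':
        // double slice = θ × 2.0;
        double slice = θ * 2.0;

        std::printf("%f\n", slice);
        return 0;
    }
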
C++ chose to denote templates with the less-than and greater-than symbols. The decision was somewhat forced, as C++ lives in the ASCII world. Java and C# have followed that convention, although it's not clear that they had to. Yet the decision has its costs: tokenizing source code is much harder when symbols hold multiple meanings. Java and C# could have used the double-angle brackets (« and ») to denote generics.
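
The classic illustration is the nested template, where the lexer cannot tell two closing angle brackets from the right-shift operator. A minimal sketch, assuming a C++11-or-later compiler:

    #include <vector>

    // Before C++11, the lexer greedily read ">>" as the right-shift operator,
    // so nested template arguments needed a space between the closing brackets:
    std::vector<std::vector<int> > old_style;   // the required spelling pre-C++11

    // C++11 added a special parsing rule so that ">>" can also close two
    // template argument lists -- a rule that exists only because '<' and '>'
    // double as comparison (and shift) operators:
    std::vector<std::vector<int>> new_style;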

I'm not recommending that we use the entire UNICODE set. Several glyphs look identical but have different code points assigned (such as the Latin 'a' and the Cyrillic 'a'), and having multiple code points that appear the same is, in my view, asking for trouble. Identifiers and names which appear (to the human eye) to be the same would be considered different by the compiler.
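
A small sketch of the trouble, again assuming a compiler that accepts UTF-8 extended identifiers (recent GCC and Clang do):

    #include <cstdio>

    int main() {
        int a = 1;   // Latin small letter a (U+0061)
        int а = 2;   // Cyrillic small letter a (U+0430) -- visually identical,
                     // but a completely different identifier to the compiler

        std::printf("%d\n", a + а);   // prints 3; no "redefinition" error
        return 0;
    }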

But I am curious as to why we have settled on ASCII as the character set for languages.

Maybe it's not the character set. Maybe it is the equipment. Maybe programmers (and more specifically, programming language designers) use US keyboards. When looking for characters to represent some idea, our eyes fall upon our keyboards, which present the ASCII set of characters. Maybe it is just easier to use ASCII characters -- and then allow UNICODE later.

If that's true (that our keyboard guides our language design) then I don't expect languages to expand beyond ASCII until keyboards do. And I don't expect keyboards to expand beyond ASCII just for programmers. I expect programmers to keep using the same keyboards that the general computing population uses. In the US, that means ASCII keyboards. In other countries, we will continue to see ASCII with accented characters, Cyrillic, and special keyboards for East Asian languages. I see no reason for a UNICODE-based keyboard.

If our language shapes our thoughts, then our keyboard shapes our languages.
