Wednesday, February 28, 2018

Backwards compatibility but not for Python

Programming languages change over time. (So do the natural, human-spoken languages. But let's stick to the programming languages.)

Most languages are designed so that changes are carefully constructed to avoid breaking older programs. This is a tradition from the earliest days of programming. New versions of FORTRAN and COBOL were introduced with new features, yet the newer compilers accepted the older programs. (Probably because the customers of those expensive computers would have been very mad to learn that an "upgrade" had broken their existing programs.)

Since then, almost every language has followed this tradition. BASIC, Pascal, dBase (and Clipper and XBase), Java, Perl, ... they all strove for (and still strive for) backwards compatibility.

The record is not perfect. A few exceptions do come to mind:
  • In the 1990s, multiple releases of Visual Basic broke compatibility with older versions as Microsoft decided to improve the syntax.
  • Early versions of Perl changed syntax. Those changes were Larry Wall deciding on improvements to the syntax.
  • The C language changed syntax for the addition-assignment and related operators (from =+ to +=), which resolved an ambiguity in the syntax.
  • C++ broke compatibility with a scoping change in "for" statements. That was its only such change, to my knowledge.
These exceptions are few. The vast history of programming languages shows compatibility from old to new versions.

But there is one language that is an exception.

That language is Python.

Python has seen a number of changes over time. I should say "Pythons", as there are two paths for Python development: Python 2 and Python 3. Each path has multiple versions (Python 2.4, 2.5, 2.6, and Python 3.4, 3.5, 3.6, etc.).

The Python 3 path was started as the "next generation" of Python interpreters, with the explicit statement that it would not be compatible with the Python 2 path.

Not only are the two paths different (and incompatible), but versions within each path (or at least the Python 3 path) are sometimes incompatible with one another. That is, some things that work in Python 3.6 behave differently in Python 3.7.

I should point out that the changes between versions (Python 3.6 and 3.7, or even Python 2 and 3) are small. Most of the language remains the same across versions. If you know Python 2, you will find Python 3 familiar. (The familiarity may cause frustration as you stumble across one of the compatibility-breaking changes, though.)
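
As a rough illustration (and not an exhaustive list), two of the better-known breaks between Python 2 and Python 3 are the print statement becoming a function and integer division changing behavior:

print "hello"        # valid in Python 2, a SyntaxError in Python 3
print 7 / 2          # prints 3 in Python 2 (integer division truncates)

print("hello")       # the Python 3 form (Python 2 accepts it as well)
print(7 / 2)         # prints 3.5 in Python 3; 7 // 2 gives the old truncating result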

Should we care? What does it mean for Python? What does it mean for programming in general?

One could argue that changes to a programming language are necessary. The underlying technology changes, and programming languages must "keep up". Thus, changes will happen, either in many small changes or one big change. The latter often is a shift away from one programming language to another. (One could cite the transition from FORTRAN to BASIC as computing changed from batch to interactive, for example.)

But that argument doesn't hold up against other evidence. COBOL, for example, has been popular for transaction processing and remains so. C and C++ have been popular for operating systems, device drivers, and other low-level applications, and remain so. Their backwards-compatible growth has not appreciably diminished their roles in development.

Other languages have gained popularity and remain popular too. Java and C# have strong followings. They, too, have not been hurt by backwards-compatibility.

Python is an opportunity to observe the behavior of the market. We have been working on the assumption that backwards-compatibility is desired by the user base. This assumption may be a false one, and the Python approach may be a good way to start observing the true desires of the market. If successful (and Python is successful, so far), then we may see other languages adopt the "break a few things" philosophy for changes.

Of course, there may be some demand for languages that keep compatibility across versions. It may be a subset of the market, something that isn't visible with only one language breaking compatibility, but only visible when more languages change their approach. If that is the case, we may see some languages advertising their backwards-compatibility as a feature.

Who knows? The market demand for backwards-compatibility may come from Python users themselves. As Python gains popularity (and it is gaining popularity) and more individuals and organizations build Python projects, they may find Python's approach unappealing.

Let's see what happens!

Thursday, February 22, 2018

Variables are... variable

The nice (and sometimes frustrating) thing about different programming languages is that they handle things, well, differently.

Consider the simple concept of a "variable". It is a thing in a program that holds a value. One might think that programming languages agree on something so simple -- yet they don't.

There are four actions associated with variables: declaration, initialization, assignment, and reference (as in 'use', not a constrained form of pointer).

A declaration tells the compiler or interpreter that a variable exists, and often specifies a type. Some languages require a declaration before a variable can be assigned a value or used in a calculation; others do not.

Initialization provides a value during declaration. This is a special form of assignment.

Assignment assigns a value and is separate from declaration. It occurs after the declaration, and may occur multiple times. (Some languages do not allow assignment after initialization.)

A reference to a variable is a use of its value, either to compute some other value or to provide the value to a function or subroutine.

It turns out that different languages have different ideas about these operations. Most languages follow these definitions; the differences are in the presence or absence of these actions.

C, C++, and COBOL (to pick a few languages) all require declarations, allow for initialization, and allow for assignment and referencing.

In C and C++ we can write:

int i = 17;
i = 12;
printf("%d\n", i);

This code declares and initializes the variable i as an int with value 17, then assigns the value 12, then calls the printf() function to write the value to the console. COBOL has similar abilities, although the syntax is different.

Perl, Python, and Ruby (to pick different languages) do not have declarations or initialization, but do allow for assignment and reference.

In Ruby we can write:

i = 12
puts i

This assigns the value 12 to i and then writes it to the console. Notice that there is no declaration and no type specified for the variable.

Astute readers will point out that Python and Ruby don't have "variables", they have "names". A name is a reference to an underlying object, and multiple names can point to the same object. Java and C# use a similar mechanism for non-trivial objects. The difference is not important for this post.
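
(For the curious, here is a small Python illustration; the names a and b are mine. Two names can refer to the same underlying object.)

a = [1, 2]
b = a              # b is a second name for the same list object
b.append(3)
print(a)           # prints [1, 2, 3]; a and b refer to one object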

BASIC (not Visual Basic or VB.NET, but old-school BASIC) is a bit different. Like Perl, Python, and Ruby it does not have declarations. Unlike those languages, it lets you write a statement that prints the value of an undeclared (and therefore uninitialized and unassigned) variable:

130 PRINT A

This is a concept that would cause a C compiler to emit errors and refuse to produce an executable. In the scripting languages, this would cause a run-time error. BASIC handles this with grace, providing a default value of 0 for numeric variables and "" for text (string) variables. (The AWK language also assigns a reasonable value to uninitialized variables.)
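
For comparison, a rough Python equivalent of that BASIC line (the name a is hypothetical and never assigned) fails at run time rather than supplying a default:

print(a)           # NameError: name 'a' is not defined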

FORTRAN has an interesting mix of capabilities. It allows for declarations but does not require them. Variables have a specific type, either integer or real. When a variable is listed in a declaration, it has the specified type; when a variable is not declared it has a type based on the first letter of its name!

Like BASIC, variables in FORTRAN can be referenced without being initialized. Unlike BASIC, it does not provide default values. Instead it blissfully uses whatever values are in memory at the location assigned for the variable. (COBOL, C, and C++ have this behavior too.)

What's interesting is the trend over time. Let's look at a summary of languages and their capabilities, and the year in which they were created:

Languages which require declaration but don't force initialization

COBOL (1950s)
Pascal (1970s)
C (1970s)
C++ (1980s)
Java (1995)
C# (2000s)
Objective-C (1980s)

Languages which require declaration and require initialization (or initialize for you)

Eiffel (1980s)
Go (2010)
Swift (2014)
Rust (2015)

Languages which don't allow declarations and require assignment before reference

Perl (1987)
Python (1989)
Ruby (1990s)

Languages which don't require (or don't allow) declaration and allow reference before assignment

FORTRAN (1950s)
BASIC (1960s)
AWK (1970s)
PowerShell (2000s)

This list of languages is hardly comprehensive, and it ignores the functional programming languages completely. Yet it shows something interesting: there is no trend for variables. That is, languages in the 1950s required declarations (COBOL) or didn't (FORTRAN), and later languages require declaration (Go) or don't (Ruby). Early languages allow for initialization, as do later languages. Early languages allow for use-without-assignment, as do later languages.

Perhaps a more comprehensive list may show trends over time. Perhaps splitting out the different versions of languages will show convergence of variables. Or perhaps not.

It is possible that we (that is, programmers and language designers) don't really know how we want variables to behave in our languages. With more than half a century of experience we're still developing languages with different capabilities.

Or maybe we have, in some way, decided. It's possible that we have decided that we need languages with different capabilities for variables (and therefore different languages). If that is the case, then we will never see a single language become dominant.

That, I think, is a good outcome.


Tuesday, February 6, 2018

The IRS made me a better programmer

We US taxpayers have opinions of the IRS, the government agency tasked with the collection of taxes. Those opinions tend to be strong and tend to fall on the "not favorable" side. Yet the IRS did me a great favor and helped me become a better programmer.

The assistance I received was not through employment at the IRS, nor did they send me a memo entitled "How to be a better programmer". They did give me some information, not related to programming, yet it turned out to be the most helpful advice on programming in my career.

That advice was the simple philosophy: One operation at a time.

The IRS uses this philosophy when designing the forms for tax returns. There are a lot of forms, and some cover rather complex notions and operations, and all must be understandable by the average taxpayer. I've looked at these forms (and used a number of them over the years) and while I may dislike our tax laws, I must admit that the forms are as easy and understandable as tax law permits. (Tax law can be complex with intricate concepts, and we can consider this complexity to be "essential" -- it will be present in any tax form no matter how well you design it.)

Back to programming. How does the philosophy of "one operation at a time" change the way I write programs?

A lot, as it turns out.

The philosophy of "one operation at a time" is directly applicable to programming. Well, my programming, at least. I had, over the years, developed a style of combining operations onto a single line.

Here is a simplified example of my code, using the "multiple operations" style:

Foo harry = y.elements().iterate().select('harry')

It is concise, putting several activities on a single line. This style makes for shorter programs, but not necessarily more understandable programs. Shorter programs are better when the shortness is measured in operations, not raw lines. Packing a bunch of operations -- especially unrelated operations -- onto a single line is not simplifying a program. If anything, it is making it more complex, as we tend to assume that operations on the same line are somehow connected.

I changed my style. I shifted from multi-operation lines to single operation lines, and I was immediately pleased with the result.

Here's the example from above, but with the philosophy of one operation per line:

elements = y.elements()
harry = nil
elements.each do |element|
  harry = element if element.name == 'harry'
end

I have found two immediate benefits from this new style.

The first benefit is a better experience when debugging. When stepping through the code with the debugger, I can examine intermediate values. Debuggers are line-oriented, and execute the single-line version all in one go. (While there are ways to force the debugger to execute each function separately, there are no variables to hold the intermediate results.)

The second benefit is that it is easier to identify duplicate code. By splitting operations onto multiple lines, I find it easier to identify duplicate sequences. Sometimes the code is not an exact duplicate, but the structure is the same. Sometimes portions of the code are the same. I can refactor the duplicated code into functions, which simplifies the code (fewer lines) and consolidates common logic in a single place (one point of truth).
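
Here is a sketch of that refactoring, in Python rather than Ruby, reusing the hypothetical y object from the example above: two similar search loops collapse into one helper.

def find_by_name(elements, name):
    # One operation per line: walk the collection, remember the last match.
    found = None
    for element in elements:
        if element.name == name:
            found = element
    return found

harry = find_by_name(y.elements(), 'harry')
sally = find_by_name(y.elements(), 'sally')

The search loop now lives in one place, and each caller is back to a single, readable line.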

Looking back, I can see that my code is somewhat longer, in terms of lines. (Refactoring common logic reduces it somewhat, but not enough to offset the expansion of multiline operations.)

Yet the longer code is easier to read, easier to explain to others, and easier to fix. And since the programs I am writing are much smaller than the computer's capabilities, there is little cost to slightly longer programs. I suspect that compilers (for languages that use them) are optimizing a lot of my "one at a time" operations and condensing them, perhaps better than I can. The executables produced are about the same size as before. Interpreters, too, seem to have little problem with multiple simple statements, and run the "one operation" version of programs just as fast as the "multiple operations" version. (This is my perception; I have not conducted formal time trials of the two versions.)

Simpler code, easier to debug, and easier to explain to others. What's not to like?