Tuesday, April 15, 2014

Integer Arithmetic Considered Harmful

The general trend of programming has been one of increasing safety in our tools. From machine language to assembly language, from FORTRAN to BASIC, from C to C++ to Java, our languages have relieved programmers from tedious chores and prevented them from doing the wrong thing.

Assembly language was better than machine language. It allowed the use of mnemonics for opcodes and symbols for memory locations. Programmers could more easily remember "ADD B" than "0x80" and "JMP ENDLOOP" instead of "0xC3 0x3F2A" -- especially when revisions moved ENDLOOP's location from 0x3F2A to 0x3F35.

Structured programming languages (Algol, Pascal, C, and others) improved on the GOTO statement in FORTRAN and BASIC. Loops built of while and for were much easier to read (and maintain) than loops built with GOTO.

Through it all, the one element that has remained unchanged is integer arithmetic. Integer arithmetic has been available since before COBOL and FORTRAN, and has remained unchanged up to Java, C#, and even Python (well, Python 2). It is the shark of the computing world, something that evolved to an efficient form and has not changed.

Integer arithmetic provides fast operations for addition, subtraction, multiplication, and division. They operate over a domain specific to the underlying hardware; most modern PCs and servers use 64-bit integers while recent hardware used 32-bit integers. (Really old hardware like the original IBM PC used 16-bit integers.)

Integer arithmetic on computers is not quite the same as the mathematician's integer arithmetic. Mathematicians have an infinite supply of integers which allows them to perform lots of calculations that are not available to computers. Mathematicians can work with integers of arbitrarily large values, where computers are limited to a hardware-specific domain. For 16-bit processors, that domain is 0 to 65535; for 32-bit processors it is 0 to 4,294,967,295.

The domain limits are important. They constrain your calculations; any value larger than the maximum "wraps around" and starts at zero. For a 16-bit processor, the sum of 65530 and 10 is not 65540 (a number outside the range) but 5.

Similarly, subtraction wraps at zero "back" to the high end. On a 32-bit processor, the value 5 subtracted from 3 is not -2 (again, outside the range) but 4,294,967,293.

Some languages have signed integers which allow for negative values. This is a mere sleight-of-hand, shifting the domain. For 16-bit integers, signed integers operate over the range -32768 to 32767; for 32-bit processors the range is from −2,147,483,648 to 2,147,483,647. The starting and ending ranges change but the number of numbers, the number of different values, remains the same. The "wrap" effect is still present; it simply happens at different points. An unsigned integer can compute "3 minus 5" and provide the result -2, but on a 16-bit processor the sum of 32760 and 10 is -32766.

This wrapping often happens silently. While processors have built-in hardware to indicate an overflow (usually a status flag than can be checked with an additional operation) our compilers generally ignore such overflows. They generate code that silently wraps values past the minimum and maximum values.

Compilers also allow for the conversion of signed integers to unsigned integers, and also for the reverse. For many values this conversion is harmless, but for others it can cause problems. Converting a signed integer 5 to an unsigned integer provides the desired result of 5. Converting signed integer -5 to an unsigned integer provides the result -65534 on a 16-bit processor and −2,147,483,643 on a 32-bit processor. Are these the desired values? The answer is not always clear.

Integer arithmetic is dangerous because values wrap silently at the end of the range of values.

Compounding the problem is our use of integers for indexes into arrays. Access operations for arrays occurs in one of three typical cases: We are iterating over the entire array, iterating over a subset, or we are accessing one specific element in the array. In each case, we manipulate an index value (an integer) to "point" to the desired element or elements. Correct execution depends on the index value; an incorrect index value will yield incorrect results.

Which brings me back to the general trend in programming languages, the trend towards safety. Integer arithmetic is inherently dangerous. Values can overflow (or underflow) with no warning. Programmers have been taught to "program defensively" and "guard against errors" when writing integer operations, but not all of them do. The result is that many of our programs have defects, some more harmful than others, some more subtle than others.

We have long avoided changes to integer arithmetic in our languages. The reasons are several. The current (unsafe) operations are fast, and there is no call for slower operations. The current (unsafe) compiler outputs are correct, and adding checks may introduce other changes. Changing the compilers now would break lots of existing code. These are all good reasons.

Despite those reasons, I see changes to integer arithmetic.

Run-time checks for integer operations Pascal and Ada have such checks; I expect other languages to adopt them. Integer operations that result in overflow or underflow will throw exceptions (for languages that have exceptions).
Checked conversions Conversions from signed to unsigned (or unsigned to signed) will be checked at run-time. Errors will throw an exception.
Hard-to-code conversions Conversions from signed to unsigned (or unsigned to signed) will be harder to code, probably required a specific cast.
Implementations of integers than allow for arbitrarily large values Python 3 handles this now. Python 3 uses "normal" integers for values that fit within an integer range, and a custom integer for values outside of that range. This design eliminates the problems of "wrapping".
New types to implement a limited-range integer for indexing Arrays will allow indexing with only a specific type of index, not a plain "int". This type might be constructed at run-time, since the size of the array may be known only at run-time.

Not all of these changes need be implemented in any one language. The run-time checks for integer operations and the use of arbitrarily large values may "overlap" in that they both solve the problem of exceeding a range.

These changes will meet with some resistance. Changes for safety often do; many decry the performance of the safe operations. Despite the nay-sayers, changes for safety often do work their way into languages. I expect these changes to become part of the standard for our popular languages.

No comments: