Tuesday, April 15, 2014

Integer Arithmetic Considered Harmful

The general trend of programming has been one of increasing safety in our tools. From machine language to assembly language, from FORTRAN to BASIC, from C to C++ to Java, our languages have relieved programmers from tedious chores and prevented them from doing the wrong thing.

Assembly language was better than machine language. It allowed the use of mnemonics for opcodes and symbols for memory locations. Programmers could more easily remember "ADD B" than "0x80" and "JMP ENDLOOP" instead of "0xC3 0x3F2A" -- especially when revisions moved ENDLOOP's location from 0x3F2A to 0x3F35.

Structured programming languages (Algol, Pascal, C, and others) improved on the GOTO statement in FORTRAN and BASIC. Loops built of while and for were much easier to read (and maintain) than loops built with GOTO.

Through it all, the one element that has remained unchanged is integer arithmetic. Integer arithmetic has been available since before COBOL and FORTRAN, and has remained unchanged up to Java, C#, and even Python (well, Python 2). It is the shark of the computing world, something that evolved to an efficient form and has not changed.

Integer arithmetic provides fast operations for addition, subtraction, multiplication, and division. These operations work over a domain specific to the underlying hardware; most modern PCs and servers use 64-bit integers, while the previous generation of hardware used 32-bit integers. (Really old hardware, like the original IBM PC, used 16-bit integers.)

Integer arithmetic on computers is not quite the same as the mathematician's integer arithmetic. Mathematicians have an infinite supply of integers which allows them to perform lots of calculations that are not available to computers. Mathematicians can work with integers of arbitrarily large values, where computers are limited to a hardware-specific domain. For 16-bit processors, that domain is 0 to 65535; for 32-bit processors it is 0 to 4,294,967,295.

The domain limits are important. They constrain your calculations; any value larger than the maximum "wraps around" and starts again at zero. For a 16-bit processor, the sum of 65530 and 10 is not 65540 (a number outside the range) but 4.

Similarly, subtraction wraps at zero "back" to the high end. On a 32-bit processor, the value 5 subtracted from 3 is not -2 (again, outside the range) but 4,294,967,294.

Some languages have signed integers which allow for negative values. This is a mere sleight-of-hand, shifting the domain. For 16-bit integers, signed integers operate over the range -32768 to 32767; for 32-bit processors the range is from −2,147,483,648 to 2,147,483,647. The starting and ending points change, but the number of distinct values remains the same. The "wrap" effect is still present; it simply happens at different points. A signed integer can compute "3 minus 5" and provide the result -2, but on a 16-bit processor the sum of 32760 and 10 is -32766.
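A short example makes the wrapping concrete. (The sketch below is mine, written in C#, whose unchecked arithmetic wraps exactly as described; the values match the examples above.)

using System;

class WrapDemo
{
    static void Main()
    {
        // 16-bit unsigned addition wraps modulo 65536.
        ushort a = 65530;
        ushort b = 10;
        ushort sum = unchecked((ushort)(a + b));     // 65540 wraps to 4
        Console.WriteLine(sum);                      // prints 4

        // 32-bit unsigned subtraction wraps "backwards" past zero.
        uint x = 3;
        uint y = 5;
        uint diff = unchecked(x - y);                // wraps to 4294967294
        Console.WriteLine(diff);                     // prints 4294967294

        // 16-bit signed addition wraps at the top of the signed range.
        short s = 32760;
        short t = 10;
        short wrapped = unchecked((short)(s + t));   // 32770 wraps to -32766
        Console.WriteLine(wrapped);                  // prints -32766
    }
}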

This wrapping often happens silently. While processors have built-in hardware to indicate an overflow (usually a status flag that can be checked with an additional operation), our compilers generally ignore such overflows. They generate code that silently wraps values past the minimum and maximum values.

Compilers also allow for the conversion of signed integers to unsigned integers, and also for the reverse. For many values this conversion is harmless, but for others it can cause problems. Converting the signed integer 5 to an unsigned integer provides the desired result of 5. Converting the signed integer -5 to an unsigned integer provides the result 65,531 on a 16-bit processor and 4,294,967,291 on a 32-bit processor. Are these the desired values? The answer is not always clear.
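A minimal sketch of these conversions, again in C#: the default (unchecked) conversion simply reinterprets the value, while a checked conversion reports the problem instead.

using System;

class ConversionDemo
{
    static void Main()
    {
        short s16 = -5;
        int   s32 = -5;

        // Unchecked conversions reinterpret the value modulo 2^N.
        ushort u16 = unchecked((ushort)s16);         // 65536 - 5 = 65531
        uint   u32 = unchecked((uint)s32);           // 4294967296 - 5 = 4294967291
        Console.WriteLine($"{u16} {u32}");           // prints "65531 4294967291"

        // A checked conversion refuses to produce a surprising value.
        try
        {
            Console.WriteLine(checked((ushort)s16));
        }
        catch (OverflowException)
        {
            Console.WriteLine("conversion out of range");
        }
    }
}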

Integer arithmetic is dangerous because values wrap silently at the end of the range of values.

Compounding the problem is our use of integers as indexes into arrays. Array access occurs in one of three typical cases: we iterate over the entire array, we iterate over a subset, or we access one specific element. In each case, we manipulate an index value (an integer) to "point" to the desired element or elements. Correct execution depends on the index value; an incorrect index value will yield incorrect results.

Which brings me back to the general trend in programming languages, the trend towards safety. Integer arithmetic is inherently dangerous. Values can overflow (or underflow) with no warning. Programmers have been taught to "program defensively" and "guard against errors" when writing integer operations, but not all of them do. The result is that many of our programs have defects, some more harmful than others, some more subtle than others.

We have long avoided changes to integer arithmetic in our languages. The reasons are several. The current (unsafe) operations are fast, and there is no call for slower operations. The current (unsafe) compiler outputs are correct, and adding checks may introduce other changes. Changing the compilers now would break lots of existing code. These are all good reasons.

Despite those reasons, I see changes coming to integer arithmetic:

  • Run-time checks for integer operations: Pascal and Ada have such checks; I expect other languages to adopt them. Integer operations that overflow or underflow will throw exceptions (for languages that have exceptions). A sketch follows this list.
  • Checked conversions: Conversions from signed to unsigned (or unsigned to signed) will be checked at run-time. Errors will throw an exception.
  • Hard-to-code conversions: Conversions from signed to unsigned (or unsigned to signed) will be harder to code, probably requiring a specific cast.
  • Implementations of integers that allow for arbitrarily large values: Python 3 handles this now. Python 3 uses "normal" integers for values that fit within an integer range, and a custom integer for values outside of that range. This design eliminates the problem of "wrapping".
  • New types to implement a limited-range integer for indexing: Arrays will allow indexing only with a specific index type, not a plain "int". This type might be constructed at run-time, since the size of the array may be known only at run-time.
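Some of this exists already. C#, for instance, offers opt-in run-time checks today: arithmetic written inside a checked context throws an OverflowException rather than wrapping. A minimal sketch:

using System;

class CheckedDemo
{
    static void Main()
    {
        int big = int.MaxValue - 10;

        // Default (unchecked) arithmetic wraps silently.
        Console.WriteLine(unchecked(big + 100));     // a large negative number

        // Checked arithmetic raises an exception instead of wrapping.
        try
        {
            Console.WriteLine(checked(big + 100));
        }
        catch (OverflowException)
        {
            Console.WriteLine("overflow detected at run-time");
        }
    }
}

(The C# compiler also has a /checked option that applies the same checking to an entire program.)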

Not all of these changes need be implemented in any one language. The run-time checks for integer operations and the use of arbitrarily large values may "overlap" in that they both solve the problem of exceeding a range.

These changes will meet with some resistance. Changes for safety often do; many decry the performance cost of the safe operations. Despite the naysayers, changes for safety often do work their way into languages. I expect these changes to become part of the standard for our popular languages.

Wednesday, April 9, 2014

Files exist on PCs but not in the cloud

We're quite familiar with the concept of data files and file systems. And why shouldn't we be? We've been using files to store data since the first PCs and PC-DOS, and even earlier than that. (CP/M and other predecessors to PC-DOS used files and file systems.)

But computer systems did not always store data in files, or provide a familiar API into the file system.

Prior to the PC revolution, there were non-PC systems used for various purposes. Mainframes, minicomputers, and a rather interesting device: the dedicated word processor. These word processing systems were small (ish) computers that were built for one purpose: the composition and editing (and eventually printing) of documents. They were larger than today's typical tower desktop PC system, and even larger than the original IBM PC -- but not much larger, and they fit comfortably in many offices.

While they were computers, they were single-purpose computers. One did not choose to run an application from a list; the only application was the word processing software. (Quite similar to today's Chromebooks which run only the Chrome browser.) They were made by different vendors (DEC, Wang, and even IBM) and each was its own separate island of computing power.

These dedicated word processors did not have a "Start" menu, they did not have Windows Explorer, they did not have a command-line interface. All of the "common file operations" that we associate with Windows Explorer were handled inside the word processing system, usually in a "utility" menu. One did not work with files but with documents; one created a new document, edited a document, or printed a document.

Documents had names -- nice, long names which allowed for spaces. Since this was a computer, and since it did have a file system, the word processing system mapped the document names to arbitrarily assigned file names. The user never saw these file names; the word processing system insulated the user from such minutiae.

These dedicated word processors stored your information and retrieved it for you, and you had no opportunity to work with the data outside of the system. Unlike personal computers (with data stored in the file system accessible to other programs), word processing systems shielded the details from the user. Users never had to worry about file formats, or parsing a file into another format. The word processor was the only program to access the data, and the user ran only the word processor. Your data was a captive of the system.

Those dedicated word processing systems with their captive data are things of the past. PC-DOS, MS-DOS, and Windows all use file systems -- even Linux uses file systems -- and these file systems allow us to always access our data with any program we choose. We can always examine the files which contain our data (provided we know the location of the files). What does this ancient history of computing have to do with today's technology?

I think that this shielding technique could make a comeback. It could be revived in the world of cloud computing, specifically in the Software-as-a-Service (SaaS) offerings.

Anyone who has used Google's "Drive" feature (or Google Apps) knows that they can create and edit documents and spreadsheets "in the cloud". Using a browser, one can create an account and easily work with one or more documents. (Or spreadsheets. Let's call all of these items "documents", instead of the verbose "documents, spreadsheets, presentations, and other things".)

While one can create and edit these documents, one cannot "see" them as one can "see" files on a local hard drive. One can manipulate them only within the Google applications. The PC files, in contrast, can be manipulated by Windows Explorer (or the Mac Finder, or a Linux file manager). A document created in Microsoft Word and stored on my local drive can be opened with Microsoft Word or with LibreOffice or with various other programs. That's not possible with the documents in the Google cloud. (I'm picking on Google here, but this concept holds for any of the Software-as-a-Service offerings. Microsoft's Office 365 works the same way.)

Documents stored in the cloud are documents, but not files as we know them. We cannot view or edit the file, except with the cloud provider's app. We cannot rename them, except with the cloud provider's app. We cannot e-mail them to friends, except with the cloud provider's app. We cannot see the bytes used to store the document, nor do we know the format of that stored form of the document.

In fact, we do not know that the document is stored in a file (as we know the term "file") at all!

The Software-as-a-Service systems do allow for documents to be uploaded from our PC and downloaded to our PC. On our PC, those documents are stored in files, and we can treat them like any other file. We can rename the file. We can open the file in alternate applications. We can write our own applications to "pick apart" the file. (I don't recommend that you devote much time to that endeavor.)

But those operations occur on the PC, not in the cloud. Those operations are possible because the cloud system allowed us to export our document to our PC. That export operation is not guaranteed. I could create a cloud-based word processing service that did not allow you to export documents, one that kept your documents in the cloud and did not permit you to move them to another location. (Such a service may be unwelcome today, but it is a possible offering.)

The ability to move documents from cloud-based systems is possible only when the cloud-based system permits you to export documents.

Even when SaaS systems allow you to export documents today, there is no guarantee that such capabilities will always be available. The system vendor could change their system and disable the exporting of documents.

Should that happen (and I'm not saying that it will happen, only that it could happen) then the cloud-based SaaS systems will operate very much like the dedicated word processing systems of the 1970s. They will hold your documents captive and not allow you to export them to other systems. This applies to any SaaS system: word processing, spreadsheets, presentations, emails, calendars, ... you name it. It applies to any SaaS vendor: Google, Apple, IBM, Microsoft, ... you name the vendor.

I'm going to think about that for a while.

Tuesday, April 8, 2014

Java and C# really derive from Pascal

In the history of programming, the "C versus Pascal" debate was a heated and sometimes unpleasant discussion on language design. It was fought over the capabilities of programming languages.

The Pascal side advocated restrictive code, what we today call "type safe" code. Pascal was designed as a teaching language, a replacement for BASIC that contained the ideas of structured programming.

The C side advocated liberal code, what we today call "unsafe code". C was designed not to teach but to get the job done, specifically systems programming jobs that required access to the hardware.

The terms "type safe" and "unsafe code" are telling, and they give away the eventual resolution. C won over Pascal in the beginning, at kept its lead for many years, but Pascal (or rather the ideas in Pascal) have been gaining ground. Even the C and C++ standards have been moving towards the restrictive design of Pascal.

Notable ideas in Pascal included:

  • Structured programming (blocks, 'while' and 'repeat' loops, 'switch/case' flows, limited goto)
  • Array data type
  • Array index checking at run-time
  • Pointer data type
  • Strong typing, including pointers
  • Overflow checking on arithmetic operations
  • Controlled conversions from one type to another
  • A constant qualifier for variables
  • Standard features across implementations

Notable ideas in K&R C:

  • Structured programming (blocks, 'while' and 'do/while' loops, 'switch/case' flows, limited goto)
  • Array data type (sort of -- really a syntactic trick involving pointers)
  • No checking of array index (at compile-time or run-time)
  • Pointer data type
  • Strong typing, but not for pointers
  • No overflow checking
  • Free conversions from one type to another
  • No 'const' qualifier
  • Many features were implementation-dependent

For programmers coming from BASIC (or FORTRAN) the structured programming concepts, common in C and Pascal, were appealing. Yet the other aspects of the C and Pascal programming languages were polar opposites.

It's hard to define a clear victor in the C/Pascal war. Pascal got a boost with the UCSD p-System and a large boost with the Turbo Pascal IDE. C was big in the Unix world and also big for programming Windows. Today, Pascal is viewed as a legacy language while C and its derivatives C++, Java, and C# enjoy popularity.

But if C won in name, Pascal won in spirit. The early, liberal K&R C has been "improved" with later standards that limit the ability to implicitly convert data types. K&R C was also enhanced with the 'const' keyword for variables. C++ introduced classes which allow programmers to build their own data types. So do Java and C#, and they eliminate pointers, check array indexes, and standardize operations across platforms. Java and C# are closer to the spirit of Pascal than C.
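A small C# example (my own illustration, not code from the original debate) shows several of these Pascal-flavored features at work: run-time index checking, conversions that must be spelled out, and a constant qualifier.

using System;

class PascalSpirit
{
    static void Main()
    {
        int[] values = { 10, 20, 30 };

        // Array indexes are checked at run-time, as Pascal's were.
        try
        {
            Console.WriteLine(values[5]);
        }
        catch (IndexOutOfRangeException)
        {
            Console.WriteLine("index out of range");
        }

        // Narrowing conversions must be written explicitly.
        double ratio = 0.75;
        // int n = ratio;            // rejected by the compiler
        int n = (int)ratio;          // the cast states the intent
        Console.WriteLine(n);        // prints 0

        // A constant qualifier, absent from K&R C.
        const int limit = 100;
        Console.WriteLine(limit);
    }
}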

Yes, there are differences. Java and C# use braces to define blocks, where Pascal used 'BEGIN' and 'END'. Pascal declares variables with the name-and-then-type sequence, while C, Java, and C# use the type-and-then-name sequence. But if you look at the features, especially those Pascal features criticized as reducing performance, you see them in Java and C#.

We had many debates about the C and Pascal programming languages. In the end, it was not the "elegance" of a language or the capabilities of the IDE that solved the argument. Advances in technology neutralized many of our objections. Faster processors and improvements in compilers eliminated the need for speed tricks and allowed for the "performance killing" features in Pascal. And without realizing it, we adopted them, slowly, quietly, and with new names. We didn't adopt the name Pascal, we didn't adopt the syntax of Pascal, but we did adopt the features of Pascal.

Sunday, April 6, 2014

How to untangle code: Make loops smaller

Loops all too often do many things. Some programmers pack as much as possible into a single loop. The cause is possibly a desire to optimize, to reduce the work performed by the machine. The thought is that one loop is more efficient than several loops, because 1) there is only one loop to "set up" and "take down" and 2) the computer can perform tasks on multiple items as it walks along an array of data structures. This is not necessarily so.

Perhaps the code is:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
    member_b[i] = something + some_other_value(i);
    member_c[i] = member_a[i] + member_b[i];
}

The "optimization" of packing loops full of functionality is not necessarily a true optimization. The notion that loops are expensive is an old one, dating back to the mainframe era (when it was true). Since that time, we have designed faster processors, create more capable instruction sets, and improved compilers.

Packing loops full of instructions has a cost: the code is more complex. Being more complex, it is harder to read. (My example is simple, of course. Real code is more complicated. But I think the idea holds.)

I change this (admittedly simple) single loop to a set of smaller loops:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_b[i] = something + some_other_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_c[i] = member_a[i] + member_b[i];
}

The revised code looks longer (and to some, terribly inefficient) but look again. Each loop is simple and can be easily understood. Each loop performs one task, and one task only.

Moreover, languages that support vector operations (and there are a few such languages) can see their code simplified further:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_b[i] = something + some_other_value(i);
}
member_c = member_a + member_b;

Using smaller loops isolates the steps in the loop. The smaller loops can be optimized independently.

If the functions 'some_value()' and 'some_other_value()' can be changed to return vectors of values, the code can be simplified further:

member_a = some_values();
member_b = something + some_other_values();
member_c = member_a + member_b;

Doesn't that look simpler than the earlier versions of the code?

Languages without vector operations can approach the brevity of vector operations. Assuming an object-oriented language (without operator overloading), one could write:

member_a = some_values();
member_b = Numbers.Add(something, some_other_values());
member_c = Numbers.Add(member_a, member_b);

Assuming you had the functions:

double[] Numbers.Add(double value, double[] values);
double[] Numbers.Add(double[] values1, double[] values2);

and these functions are not that hard to write.
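As a sketch, here is one possible implementation in C#, matching the signatures above (the Numbers class, and the member_a/member_b/member_c names in the demo, are the hypothetical names from the example):

using System;

static class Numbers
{
    // Add a scalar to every element of an array, returning a new array.
    public static double[] Add(double value, double[] values)
    {
        var result = new double[values.Length];
        for (int i = 0; i < values.Length; i++)
            result[i] = value + values[i];
        return result;
    }

    // Add two arrays element by element; the lengths must match.
    public static double[] Add(double[] values1, double[] values2)
    {
        if (values1.Length != values2.Length)
            throw new ArgumentException("arrays must have the same length");

        var result = new double[values1.Length];
        for (int i = 0; i < values1.Length; i++)
            result[i] = values1[i] + values2[i];
        return result;
    }
}

class LoopDemo
{
    static void Main()
    {
        double[] member_a = { 1.0, 2.0, 3.0 };
        double[] member_b = Numbers.Add(10.0, member_a);      // { 11, 12, 13 }
        double[] member_c = Numbers.Add(member_a, member_b);  // { 12, 14, 16 }
        Console.WriteLine(string.Join(", ", member_c));
    }
}

(In C# specifically, one could also lean on LINQ: member_a.Zip(member_b, (a, b) => a + b).ToArray() produces the same element-wise sum.)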

Code can be complex, sometimes because we think we are "helping" the computer. It's often better to help ourselves, to write programs that are simple and easy to understand.

Friday, April 4, 2014

What was true is no longer true

The IT world has truths, basic precepts that govern other decisions, but these precepts can change over time.

Early computers were expensive. Really expensive. So expensive that we took steps to ensure that computers were used efficiently, scheduling jobs to maximize the use of the hardware. A computer that was not being used was considered a waste of resources -- you had purchased (or leased) a computer larger than you needed.

Today computers are inexpensive and we care little for the CPU utilization rate. We leave computers idle with abandon. (Letting computers run and "do nothing" would horrify the computer operators of those early days.) We leave computers running as we go for lunch (or dinner), and we leave them running overnight with no active jobs.

The early days of computing saw expensive hardware and cheap programmers. Hardware and "computing time" were so precious that every moment counted. We could not afford to let the computer check our programs for errors; we asked programmers to "desk check" their programs prior to compiling and running.

Today programmers are expensive and we gleefully let our compilers check for errors. Modern IDEs check for errors as we type. It is better to let the computers check our syntax several times each second, rather than force the programmers to think about syntax.

The introduction of compilers (FORTRAN and COBOL, and other early languages) sparked a debate about efficiency. Compilers convert high-level statements into machine code, but a programmer skilled in assembly language could create programs that were more efficient than the code from early compilers.

Today, the reverse is true. The code generators in compilers have been improved and routinely emit efficient code. The CPU architecture has become more complex, so complex that only the most skilled programmer can out-perform a compiler (and only by spending more time). Processors with multiple cores and multiple pipelines demand extreme attention to data alignment and memory access, and compilers are better than people at generating such machine instructions.

Things that we know are true may be true today and false tomorrow. Our challenge is to recognize when the cost equation changes. Too often we learn a "truth" and keep it with us, even when reality changes. Then our hard-won, comforting knowledge leads us astray.

What fundamental truths (aside from those above) have become invalid? Quite a few. Here is a sample:

"C++ is slower and bulkier than pure C code" - possibly still true, but C++ compilers have improved to the point that it is hard to see the difference.

"Java is slower than C++" - Java compilers and run-time libraries have improved. C++ may be faster to run, but not by much. Java is faster to write, and the overall benefit may be with Java.

"Personal computers are toys" - initially yes, they were little more than toys and incapable of performing in businesses. Yet today almost every business has personal computers in some capacity.

"Microsoft is the standard" - many open source projects have demonstrated their competence (or even superiority) to Microsoft tools. Many companies have adopted technologies other than Microsoft's.

"The Windows PC is the standard" - phones and tablets have challenged the dominance of the Windows PC.

"Databases are always relational" - Not true in the early days of computers; it became true in the 1980s with the rise of SQL. Now alternatives offer competent data storage.

Looking forward, we can see other maxims proven false:

"Source code must be text"

"Computers will always become cheaper"

"Computers will always become faster"

"Virtualized computers are cheaper"

"Linux is immune to malware"

"Windows is always vulnerable to malware"

That which we believe to be true today may be true today. It may be false. Yet even if it is true, it may become false tomorrow. Best to be on our guard. The danger is not in what we know, but in what we believe.

Monday, March 31, 2014

Microsoft Azure may be the new Windows

For the past two decades Microsoft has used the Windows platform to build its empire. Microsoft delivered a capable combination of operating system and applications. Microsoft's applications ran on Windows (and only Windows) and used proprietary formats. The combination gave Microsoft a near-stranglehold on the market.

The world is changing. Perhaps it is time for Microsoft to move on to something new. Here's why:

  • File formats: The formats for Microsoft applications are now open and documented (due to court decisions). Proprietary formats worked to Microsoft's advantage; now non-Microsoft applications can read and write files which can be exchanged with Microsoft applications.
  • Other operating systems for personal computers: Mac OS and Linux are capable operating systems. Both have a significant number of applications. One can run a home or a small office with Windows, Mac OS, or Linux.
  • Competing applications: The Microsoft Office suite is no longer the only game in town. Competing applications handle word processing, spreadsheets, presentations, e-mail, and even project management.
  • The Web: Applications are moving from PC desktops to the browser, a trend that may have been started by Microsoft itself with its web version of Outlook.
  • Phones and tablets: Mobile devices offer a new vision of computing, one that entails less administration.

I think that Microsoft has looked at these changes and decided that Windows is not the way forward. I think that Windows, while still an important part of Microsoft's offerings, is no longer the center of its world.

Microsoft's re-branding of "Windows Azure" as "Microsoft Azure" is telling. The cloud computing platform supports more than Windows, and more than just Microsoft's Windows-centric languages.

Windows is an old operating system. It carries a lot of baggage, code to ensure compatibility with previous versions. While Linux and Mac OS are based on the older Unix, Windows has seen more changes as Microsoft added features and fixed defects. It may be that previous design decisions, the accumulated baggage of two decades, are limiting the ability of Windows to rise to new challenges.

My guess is that Microsoft may de-emphasize Windows and focus on subscriptions such as Office 365 and the web version of Visual Studio. Such a change would correspond to a move from the PC platform to a cloud platform. Instead of Windows, Microsoft will sell its Azure platform.

The knowledgeable reader will point out that Azure is built on Windows, so Windows is still part of the system. This is true -- for now. I expect Microsoft to replace Azure's "core" of Windows with an operating system better suited to servers and cloud processing, just as it replaced the early Windows "core" of MS-DOS. Windows was, in its early incarnations, a DOS application. Microsoft expanded it into a full operating system, one that surpassed MS-DOS.

I think Microsoft can do the same with Azure. Initially a system built on Windows, Azure can grow beyond it: an operating system better suited to cloud computing and more capable than Windows.

Windows made sense when people installed software on their personal computers. Today, people buy apps and the installation is automatic. The world is ready for a successor to Windows, and I think Azure can be that successor.

Sunday, March 30, 2014

How to untangle code: Start at the bottom

Messy code is cheap to make and expensive to maintain. Clean code is not so cheap to create but much less expensive to maintain. If you can start with clean code and keep the code clean, you're in a good position. If you have messy code, you can reduce your maintenance costs by improving your code.

But where to begin? The question is difficult to answer, especially on a large code base. Some ideas are:
  • Re-write the entire code
  • Re-write logical sections of code (vertical slices)
  • Re-write layers of code (horizontal slices)
  • Make small improvements everywhere
All of these ideas have merit -- and risk. For very small code sets, a complete re-write is possible. For a system larger than "small", though, a re-write entails a lot of risk.

Slicing the system (either vertically or horizontally) has the appeal of independent teams. The idea is to assign a number of teams to the project, with each team working on an independent section of code. Since the code sets are independent, the teams can work independently. This is an appealing idea but not always practical. It is rare that a system is composed of truly independent sub-systems. More often, the system is composed of several mutually-dependent sub-systems, and adjustments to any one sub-system will ripple throughout the code.

One can make small improvements everywhere, but this has its limits. The improvements tend to be narrow in scope and systems often need high-level revisions.

Experience has taught me that improvements must start at the "bottom" of the code and work upwards. Improvements at the bottom layer can be made with minimal changes to higher layers. Note that there are some changes to higher layers -- in most systems there are some effects that ripple "upwards". Once the bottom layer is "clean", one can move upwards to improve the next-higher level.

How to identify the bottom layer? In object-oriented code, the process is easy: classes that can stand alone are the bottom layer. Object-oriented code consists of different classes, and some (usually most) classes depend on other classes. (A "car system" depends on various subsystems: "drive train", "suspension", "electrical", etc., and those subsystems in turn depend on smaller components.)

No matter how complex the hierarchy, there is a bottom layer. Some classes are simple enough that they do not include other classes. (At least not other classes that you maintain. They may contain framework-provided classes such as strings and lists and database connections.)

These bottom classes are where I start. I make improvements to these classes, often making them immutable (so they can hold state but they cannot change state). I change their public methods to use consistent names. I simplify their code. When these "bottom" classes are complex (when they hold many member variables) I split them into multiple classes.
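As a sketch of the "make it immutable" step, here is a hypothetical bottom-layer class in C# (the class and its members are invented for illustration):

using System;

// All state is set in the constructor and there are no setters, so an
// instance cannot change after it is created.
sealed class DateRange
{
    public DateTime Start { get; }
    public DateTime End { get; }

    public DateRange(DateTime start, DateTime end)
    {
        if (end < start)
            throw new ArgumentException("end must not precede start");
        Start = start;
        End = end;
    }

    // "Changes" return a new object instead of modifying this one.
    public DateRange ExtendTo(DateTime newEnd) => new DateRange(Start, newEnd);
}

class RangeDemo
{
    static void Main()
    {
        var range = new DateRange(new DateTime(2014, 1, 1), new DateTime(2014, 3, 31));
        var longer = range.ExtendTo(new DateTime(2014, 4, 30));   // a new object; 'range' is untouched
        Console.WriteLine($"{range.End:d} -> {longer.End:d}");
    }
}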

The result is a set of simpler, cleaner code that is reliable and readable.

Most of these changes affect the other parts of the system. I make changes gradually, introducing one or two and then re-building the system and fixing broken code. I create unit tests for the revised classes. I share changes with other members of the team and ask for their input.

I don't stop with just these "bottom" classes. Once they are cleaned, I move up to the next level of code: the classes that depend only on the framework and the newly-cleaned classes. With a solid base of clean code below, one can improve the next layer of classes. The improvements are the same: make classes immutable, use consistent names for functions and variables, and split complex classes into smaller classes.

Using this technique, one works from the bottom of the code to the top, cleaning all of the code and ensuring that the entire system is maintainable.

This method is not without drawbacks. Sometimes there are cyclic dependencies between classes and there is no clear "bottom" class. (Good judgement and re-factoring can usually resolve that issue.) The largest challenge is not technical but political -- large code bases with large development teams often have developers with egos, developers who think that they own part of the code. They are often reluctant to give up control of "their" code. This is a management issue, and much has been written on "egoless programming".

Despite the difficulties, this method works. It is the only method that I have found to work. The other approaches too often run into the problem of doing too much at once. The "bottom up" method allows for small, gradual changes. It reduces risk, but cannot eliminate it. It lets the team work at a measured pace, and lets the team measure their progress (how many classes cleaned).