Tuesday, April 29, 2014

Cloud vendor lock-in is different from PC vendor lock-in

The IBM PC hardware was open, yet the business model of the PC software market was a fragmented one, with each vendor providing a mostly closed system.

The emerging model of cloud-based computing is a fragmented market, with each vendor providing a mostly closed system, but the nature of the closed-ness is quite different from the PC market.

In the PC market, the strategy was to allow data on any platform but use proprietary formats to tie data to an application. Microsoft Office applications used complex formats for their files, formats that made it difficult to use the files with any other application.

For cloud-based systems the fragmentation will be around data storage, not applications or the format of data. Vendors will strive to keep data in their storage system (Google Docs, Microsoft OneDrive) and push apps onto multiple platforms (Windows, iOS, Android).

Google Docs, Microsoft Office 365, and other cloud-based services use private storage systems. Create a document in Google Docs and it is stored in Google Drive. Create a spreadsheet in Office 365 and it is stored in Microsoft's OneDrive. These are private storage systems -- only the Google Docs apps can access Google Drive, and only Office 365 can access OneDrive.

We have limited visibility into these private storage systems. We cannot "see" our data, other than through the API and UI offered by the vendor. We cannot directly access our data. This allows the vendor to store the data in any convenient format: as a file, in a relational database, or in some other form.

Accessibility is what allows you to change from one office suite to another and still read your old documents. The new office suite must be able to read the old format, of course, but such operations are possible. Microsoft used this trick to convert users from WordPerfect and Lotus 1-2-3 to Microsoft Word and Excel. OpenOffice uses this trick to read .DOC and .XLS files.

Cloud-based offerings don't allow such tricks. One cannot use Office 365 to read a document stored in Google Drive. (Not because the format is different, but because Office 365 cannot reach into Google Drive. Google Docs cannot reach into OneDrive, either.)

Cloud-based systems do allow you to download documents to your PC. When you do, they are stored in files (that's what PCs use). You can then upload the document to a different cloud-based system. But keep in mind: this download/upload trick works only while the cloud-based systems allow you to download a document to your PC. The owners of the cloud-based system can change or remove that capability at any time.

Switching from one cloud-based system to another may be difficult, and perhaps impossible. If a cloud vendor offers no way to get data "out", then the data, once entered into the system, remains there.

Vendors want to lock customers into their systems. The strategy for PC software was to use storage formats that tied data to an application. The strategy for cloud-based systems is not the format but the storage location. Look for Microsoft, Google, and others to offer convenient ways to transfer your data from your PC into their cloud.

And also look for convenient ways to get the data out of that cloud.

Tuesday, April 22, 2014

We no longer think about operating systems

Windows XP remains popular, despite its age, its limitations, and its lack of support from Microsoft.

The desire to keep Windows XP shows that users want stable, reliable operating systems that they can install and then ignore. Well, perhaps not ignore, but at least not think about.

Things were not always this way. Early in the age of Windows, corporations, individuals, hobbyists, and programmers all looked forward to new versions of Microsoft's operating system. Windows 3.1 was desired for its networking capabilities; Windows 95 for its user interface (in contrast to Windows 8); and Windows NT for its security. Windows 2000 brought all of those features together, and was eagerly adopted.

I think that the lesson of Windows 8 (and Windows Vista, and Windows 7) is this: We no longer care about the operating system.

In the old days, operating systems were important -- much more than today. Certain applications would run on only certain operating systems; pick the wrong operating system and you could not run your application. Not running your application meant that you could not get your work done, or deliver for your client.

Today, most applications run on most operating systems. Yes, most Microsoft products run only on Windows, but other products run on Windows, MacOS, and Linux. Moreover, web apps run in browsers, and most web apps run in the popular browsers (Firefox, Chrome, IE, and Safari) and care nothing about the operating system.

Applications are not tied as closely to operating systems as they once were.

The mobile world has made operating systems commodities, with equivalent apps available on iOS and Android. In the mobile world, very few people care about the operating system.

With less dependence on the operating system, we tend to think of other things. We still think of performance -- although modern processors are fast enough for most tasks and cloud computing can provide computing power for large tasks.

Today we tend to think of portability (an app for my phone) and network connectivity (coverage by mobile service provider).

The operating system, for most people, is a means to an end, not the end itself. We think of it as we think of electricity, or of sidewalks: there and ready for us to use, with nothing distinguishing about it. It is becoming part of "the infrastructure", that part of our world that we use without thinking about it.

To be sure, there are some folks who do care about operating systems. The system designers, to start. And I'm sure that Microsoft's product teams care about the features in Windows (as Apple's product designers care about features in MacOS and iOS). Hobbyists and tinkerers enjoy exploring new versions of operating systems. Support teams for large organizations, security analysts, and the "black hat" hackers who look for vulnerabilities -- they all care about operating systems.

But walk down the street and ask individuals at random, and most will answer that they don't care. Some may not even know which operating systems are used by their devices!

We've moved on to other things.

Tuesday, April 15, 2014

Integer Arithmetic Considered Harmful

The general trend of programming has been one of increasing safety in our tools. From machine language to assembly language, from FORTRAN to BASIC, from C to C++ to Java, our languages have relieved programmers from tedious chores and prevented them from doing the wrong thing.

Assembly language was better than machine language. It allowed the use of mnemonics for opcodes and symbols for memory locations. Programmers could more easily remember "ADD B" than "0x80", and "JMP ENDLOOP" than "0xC3 0x3F2A" -- especially when revisions moved ENDLOOP's location from 0x3F2A to 0x3F35.

Structured programming languages (Algol, Pascal, C, and others) improved on the GOTO statement in FORTRAN and BASIC. Loops built of while and for were much easier to read (and maintain) than loops built with GOTO.

Through it all, the one element that has remained unchanged is integer arithmetic. Integer arithmetic has been available since before COBOL and FORTRAN, and has remained unchanged up to Java, C#, and even Python (well, Python 2). It is the shark of the computing world, something that evolved to an efficient form and has not changed.

Integer arithmetic provides fast operations for addition, subtraction, multiplication, and division. They operate over a domain specific to the underlying hardware; most modern PCs and servers use 64-bit integers, while slightly older hardware used 32-bit integers. (Really old hardware like the original IBM PC used 16-bit integers.)

Integer arithmetic on computers is not quite the same as the mathematician's integer arithmetic. Mathematicians have an infinite supply of integers, which allows them to perform lots of calculations that are not available to computers. Mathematicians can work with integers of arbitrarily large values, whereas computers are limited to a hardware-specific domain. For 16-bit processors, that domain is 0 to 65535; for 32-bit processors it is 0 to 4,294,967,295.

The domain limits are important. They constrain your calculations; any value larger than the maximum "wraps around" and starts at zero. For a 16-bit processor, the sum of 65530 and 10 is not 65540 (a number outside the range) but 4.

Similarly, subtraction wraps at zero "back" to the high end. On a 32-bit processor, the value 5 subtracted from 3 is not -2 (again, outside the range) but 4,294,967,294.

Some languages have signed integers which allow for negative values. This is a mere sleight-of-hand, shifting the domain. For 16-bit integers, signed integers operate over the range -32768 to 32767; for 32-bit processors the range is from −2,147,483,648 to 2,147,483,647. The starting and ending points change, but the number of different values remains the same. The "wrap" effect is still present; it simply happens at different points. A signed integer can compute "3 minus 5" and provide the result -2, but on a 16-bit processor the sum of 32760 and 10 is -32766.
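
To make the wrapping concrete, here is a small sketch in C++ using fixed-width integer types (a hypothetical example of my own, not from any particular system); on common platforms it reproduces the three results above.

#include <cstdint>
#include <iostream>

int main()
{
    // Unsigned 16-bit addition wraps at 65536: 65530 + 10 yields 4.
    std::uint16_t a = 65530;
    std::uint16_t sum = static_cast<std::uint16_t>(a + 10);

    // Unsigned 32-bit subtraction wraps "backward": 3 - 5 yields 4,294,967,294.
    std::uint32_t c = 3;
    std::uint32_t diff = c - 5;

    // Signed 16-bit addition wraps at the high end: 32760 + 10 yields -32766.
    std::int16_t e = 32760;
    std::int16_t ssum = static_cast<std::int16_t>(e + 10);

    std::cout << sum << " " << diff << " " << ssum << "\n";
    return 0;
}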

This wrapping often happens silently. While processors have built-in hardware to indicate an overflow (usually a status flag that can be checked with an additional operation), our compilers generally ignore such overflows. They generate code that silently wraps values past the minimum and maximum values.

Compilers also allow for the conversion of signed integers to unsigned integers, and for the reverse. For many values this conversion is harmless, but for others it can cause problems. Converting the signed integer 5 to an unsigned integer provides the desired result of 5. Converting the signed integer -5 to an unsigned integer provides the result 65,531 on a 16-bit processor and 4,294,967,291 on a 32-bit processor. Are these the desired values? The answer is not always clear.
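
A similarly small (and again hypothetical) C++ sketch shows the silent reinterpretation when a negative value is converted to an unsigned type:

#include <cstdint>
#include <iostream>

int main()
{
    std::int32_t negative = -5;

    // The conversions compile without complaint; the value simply wraps
    // modulo 2^16 or 2^32.
    std::uint16_t as_u16 = static_cast<std::uint16_t>(negative);   // 65531
    std::uint32_t as_u32 = static_cast<std::uint32_t>(negative);   // 4294967291

    std::cout << as_u16 << " " << as_u32 << "\n";
    return 0;
}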

Integer arithmetic is dangerous because values wrap silently at the ends of the range.

Compounding the problem is our use of integers for indexes into arrays. Access operations for arrays occur in one of three typical cases: we are iterating over the entire array, iterating over a subset, or accessing one specific element in the array. In each case, we manipulate an index value (an integer) to "point" to the desired element or elements. Correct execution depends on the index value; an incorrect index value will yield incorrect results.
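
As a hypothetical illustration of the risk, a wrapped subtraction can silently produce an enormous index; without a guard, the array access lands far outside the array.

#include <cstdint>
#include <iostream>

int main()
{
    double samples[8] = { 0.0 };

    std::uint32_t current = 3;
    std::uint32_t offset = 5;
    std::uint32_t index = current - offset;   // wraps to 4,294,967,294

    // Without this guard, samples[index] would read far outside the array.
    if (index < 8)
        std::cout << samples[index] << "\n";
    else
        std::cout << "index out of range: " << index << "\n";

    return 0;
}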

Which brings me back to the general trend in programming languages, the trend towards safety. Integer arithmetic is inherently dangerous. Values can overflow (or underflow) with no warning. Programmers have been taught to "program defensively" and "guard against errors" when writing integer operations, but not all of them do. The result is that many of our programs have defects, some more harmful than others, some more subtle than others.

We have long avoided changes to integer arithmetic in our languages. The reasons are several. The current (unsafe) operations are fast, and there is no call for slower operations. The current (unsafe) compiler outputs are correct, and adding checks may introduce other changes. Changing the compilers now would break lots of existing code. These are all good reasons.

Despite those reasons, I see changes coming to integer arithmetic:

  • Run-time checks for integer operations: Pascal and Ada have such checks; I expect other languages to adopt them. Integer operations that result in overflow or underflow will throw exceptions (for languages that have exceptions). (A sketch of such a check appears after this list.)
  • Checked conversions: Conversions from signed to unsigned (or unsigned to signed) will be checked at run-time. Errors will throw an exception.
  • Hard-to-code conversions: Conversions from signed to unsigned (or unsigned to signed) will be harder to code, probably requiring a specific cast.
  • Implementations of integers that allow for arbitrarily large values: Python 3 handles this now. Python 3 uses "normal" integers for values that fit within an integer range, and a custom integer for values outside of that range. This design eliminates the problems of "wrapping".
  • New types to implement a limited-range integer for indexing: Arrays will allow indexing with only a specific type of index, not a plain "int". This type might be constructed at run-time, since the size of the array may be known only at run-time.
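
As a sketch of the first item, a run-time checked addition might look like the following in C++; the 'checked_add' helper is hypothetical, not part of any standard library.

#include <cstdint>
#include <iostream>
#include <limits>
#include <stdexcept>

// Detect overflow before it happens and throw, instead of silently wrapping.
std::int32_t checked_add(std::int32_t a, std::int32_t b)
{
    if (b > 0 && a > std::numeric_limits<std::int32_t>::max() - b)
        throw std::overflow_error("integer overflow");
    if (b < 0 && a < std::numeric_limits<std::int32_t>::min() - b)
        throw std::overflow_error("integer underflow");
    return a + b;
}

int main()
{
    std::cout << checked_add(3, 5) << "\n";          // prints 8
    try
    {
        checked_add(2147483647, 1);                  // throws
    }
    catch (const std::overflow_error& e)
    {
        std::cout << "caught: " << e.what() << "\n";
    }
    return 0;
}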

Not all of these changes need be implemented in any one language. The run-time checks for integer operations and the use of arbitrarily large values may "overlap" in that they both solve the problem of exceeding a range.

These changes will meet with some resistance. Changes for safety often do; many decry the performance of the safe operations. Despite the nay-sayers, changes for safety often do work their way into languages. I expect these changes to become part of the standard for our popular languages.

Wednesday, April 9, 2014

Files exist on PCs but not in the cloud

We're quite familiar with the concept of data files and file systems. And why shouldn't we be? We've been using files to store data since the first PCs and PC-DOS, and even earlier than that. (CP/M and other predecessors to PC-DOS used files and file systems.)

But computer systems did not always store data in files, or provide a familiar API into the file system.

Prior to the PC revolution, there were non-PC systems used for various purposes. Mainframes, minicomputers, and a rather interesting device: the dedicated word processor. These word processing systems were small (ish) computers that were built for one purpose: the composition and editing (and eventually printing) of documents. They were larger than today's typical tower desktop PC system, and even larger than the original IBM PC -- but not much larger, and they fit comfortably in many offices.

While they were computers, they were single-purpose computers. One did not choose to run an application from a list; the only application was the word processing software. (Quite similar to today's Chromebooks which run only the Chrome browser.) They were made by different vendors (DEC, Wang, and even IBM) and each was its own separate island of computing power.

These dedicated word processors did not have a "Start" menu, they did not have Windows Explorer, they did not have a command-line interface. All of the "common file operations" that we associate with Windows Explorer were handled inside the word processing system, usually in a "utility" menu. One did not work with files but with documents; one created a new document, edited a document, or printed a document.

Documents had names -- nice, long names which allowed for spaces. Since this was a computer, and since it did have a file system, the word processing system mapped the document names to arbitrarily assigned file names. The user never saw these file names; the word processing system insulated the user from such minutiae.

These dedicated word processors stored your information and retrieved it for you, and you had no opportunity to work with the data outside of the system. Unlike personal computers (with data stored in the file system accessible to other programs), word processing systems shielded the details from the user. Users never had to worry about file formats, or parsing a file into another format. The word processor was the only program to access the data, and the user ran only the word processor. Your data was a captive of the system.

Those dedicated word processing systems with their captive data are things of the past. PC-DOS, MS-DOS, and Windows all use file systems -- even Linux uses file systems -- and these file systems allow us to always access our data with any program we choose. We can always examine the files which contain our data (provided we know the location of the files). What does this ancient history of computing have to do with today's technology?

I think that this shielding technique could make a comeback. It could be revived in the world of cloud computing, specifically in the Software-as-a-Service (SaaS) offerings.

Anyone who has used Google's "Drive" feature (or Google Apps) knows that they can create and edit documents and spreadsheets "in the cloud". Using a browser, one can create an account and easily work with one or more documents. (Or spreadsheets. Let's call all of these items "documents", instead of the verbose "documents, spreadsheets, presentations, and other things".)

While one can create and edit these documents, one cannot "see" them as one can "see" files on your local hard drive. One can manipulate them only within the Google applications. The PC files, in contrast, can be manipulated by Windows Explorer (or the Mac Finder, or the Linux Filer programs). A document created in Microsoft Word and stored on my local drive can be opened with Microsoft Word or with LibreOffice or with various other programs. That's not possible with the documents in the Google cloud. (I'm picking on Google here, but this concept holds for any of the Software-as-a-Service offerings. Microsoft's Office 365 works the same way.)

Documents stored in the cloud are documents, but not files as we know them. We cannot view or edit the file, except with the cloud provider's app. We cannot rename them, except with the cloud provider's app. We cannot e-mail them to friends, except with the cloud provider's app. We cannot see the bytes used to store the document, nor do we know the format of that stored form of the document.

In fact, we do not know that the document is stored in a file (as we know the term "file") at all!

The Software-as-a-Service systems do allow for documents to be uploaded from our PC and downloaded to our PC. On our PC, those documents are stored in files, and we can treat them like any other file. We can rename the file. We can open the file in alternate applications. We can write our own applications to "pick apart" the file. (I don't recommend that you devote much time to that endeavor.)

But those operations occur on the PC, not in the cloud. Those operations are possible because the cloud system allowed us to export our document to our PC. That export operation is not guaranteed. I could create a cloud-based word processing service that did not allow you to export documents, one that kept your documents in the cloud and did not permit you to move them to another location. (Such a service may be unwelcome today, but it is a possible offering.)

The ability to move documents from cloud-based systems is possible only when the cloud-based system permits you to export documents.

Even when SaaS systems allow you to export documents today, there is no guarantee that such capabilities will always be available. The system vendor could change their system and disable the exporting of documents.

Should that happen (and I'm not saying that it will happen, only that it could happen) then the cloud-based SaaS systems will operate very much like the dedicated word processing systems of the 1970s. They will hold your documents captive and not allow you to export them to other systems. This applies to any SaaS system: word processing, spreadsheets, presentations, emails, calendars, ... you name it. It applies to any SaaS vendor: Google, Apple, IBM, Microsoft, ... you name the vendor.

I'm going to think about that for a while.

Tuesday, April 8, 2014

Java and C# really derive from Pascal

In the history of programming, the "C versus Pascal" debate was a heated and sometimes unpleasant discussion on language design. It was fought over the capabilities of programming languages.

The Pascal side advocated restrictive code, what we today call "type safe" code. Pascal was designed as a teaching language, a replacement for BASIC that contained the ideas of structured programming.

The C side advocated liberal code, what we today call "unsafe code". C was designed not to teach but to get the job done, specifically systems programming jobs that required access to the hardware.

The terms "type safe" and "unsafe code" are telling, and they give away the eventual resolution. C won over Pascal in the beginning, and kept its lead for many years, but Pascal (or rather, the ideas in Pascal) has been gaining ground. Even the C and C++ standards have been moving towards the restrictive design of Pascal.

Notable ideas in Pascal included:

  • Structured programming (blocks, 'while' and 'repeat' loops, 'switch/case' flows, limited goto)
  • Array data type
  • Array index checking at run-time
  • Pointer data type
  • Strong typing, including pointers
  • Overflow checking on arithmetic operations
  • Controlled conversions from one type to another
  • A constant qualifier for variables
  • Standard features across implementations

Notable ideas in K&R C:

  • Structured programming (blocks, 'while' and 'do/while' loops, 'switch/case' flows, limited goto)
  • Array data type (sort of -- really a syntactic trick involving pointers)
  • No checking of array index (at compile-time or run-time)
  • Pointer data type
  • Strong typing, but not for pointers
  • No overflow checking
  • Free conversions from one type to another
  • No 'const' qualifier
  • Many features were implementation-dependent

For programmers coming from BASIC (or FORTRAN) the structured programming concepts, common in C and Pascal, were appealing. Yet the other aspects of the C and Pascal programming languages were polar opposites.

It's hard to define a clear victor in the C/Pascal war. Pascal got a boost with the UCSD p-System and a large boost with the Turbo Pascal IDE. C was big in the Unix world and also big for programming Windows. Today, Pascal is viewed as a legacy language while C and its derivatives C++, Java, and C# enjoy popularity.

But if C won in name, Pascal won in spirit. The early, liberal K&R C has been "improved" with later standards that limit the ability to implicitly convert data types. K&R C was also enhanced with the 'const' keyword for variables. C++ introduced classes which allow programmers to build their own data types. So do Java and C#, and they eliminate pointers, check array indexes, and standardize operations across platforms. Java and C# are closer to the spirit of Pascal than C.

Yes, there are differences. Java and C# use braces to define blocks, where Pascal used 'BEGIN' and 'END'. Pascal declares variables with the name-and-then-type sequence, while C, Java, and C# use the type-and-then-name sequence. But if you look at the features, especially those Pascal features criticized as reducing performance, you see them in Java and C#.

We had many debates about the C and Pascal programming languages. In the end, it was not the "elegance" of a language or the capabilities of the IDE that solved the argument. Advances in technology neutralized many of our objections. Faster processors and improvements in compilers eliminated the need for speed tricks and allowed for the "performance killing" features in Pascal. And without realizing it, we adopted them, slowly, quietly, and with new names. We didn't adopt the name Pascal, we didn't adopt the syntax of Pascal, but we did adopt the features of Pascal.

Sunday, April 6, 2014

How to untangle code: Make loops smaller

Loops all too often do many things. Some programmers pack as much as possible into a single loop. The cause is possibly a desire to optimize, to reduce the work performed by the machine. The thought is that one loop is more efficient than several loops, because 1) there is only one loop to "set up" and "take down" and 2) the computer can perform tasks on multiple items as it "goes along" an array of data structures. This is not necessarily so.

Perhaps the code is:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
    member_b[i] = something + some_other_value(i);
    member_c[i] = member_a[i] + member_b[i];
}

The "optimization" of packing loops full of functionality is not necessarily a true optimization. The notion that loops are expensive is an old one, dating back to the mainframe era (when it was true). Since that time, we have designed faster processors, create more capable instruction sets, and improved compilers.

Packing loops full of instructions has a cost: the code is more complex. Being more complex, it is harder to read. (My example is simple, of course. Real code is more complicated. But I think the idea holds.)

I change this (admittedly simple) single loop to a set of smaller loops:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_b[i] = something + some_other_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_c[i] = member_a[i] + member_b[i];
}

The revised code looks longer (and to some, terribly inefficient) but look again. Each loop is simple and can be easily understood. Each loop performs one task, and one task only.

Moreover, languages that support vector operations (and there are a few such languages) can see their code simplified further:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_b[i] = something + some_other_value(i);
}
member_c = member_a + member_b;

Using smaller loops isolates the steps in the loop. The smaller loops can be optimized independently.

If the functions 'some_value()' and 'some_other_value()' can be changed to return vectors of values, the code can be simplified further:

member_a = some_values();
member_b = something + some_other_values();
member_c = member_a + member_b;

Doesn't that look simpler than the earlier versions of the code?

Languages without vector operations can approach the brevity of vector operations. Assuming an object-oriented language (without operator overloading), one could write:

member_a = some_values();
member_b = Numbers.Add(something, some_other_values());
member_c = Numbers.Add(member_a, member_b);

Assuming you had the functions:

double[] Numbers.Add(double value, double[] values);
double[] Numbers.Add(double[] values1, double[] values2);

and these functions are not that hard to write.
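
Here is one possible sketch of those helpers, written in C++ with std::vector<double> standing in for the arrays; the names and signatures are my own, chosen only to match the example above.

#include <algorithm>
#include <cstddef>
#include <vector>

// Add a scalar to every element: the "something + some_other_values()" case.
std::vector<double> add(double value, const std::vector<double>& values)
{
    std::vector<double> result;
    result.reserve(values.size());
    for (double v : values)
        result.push_back(value + v);
    return result;
}

// Add two arrays element by element: the "member_a + member_b" case.
std::vector<double> add(const std::vector<double>& values1,
                        const std::vector<double>& values2)
{
    const std::size_t count = std::min(values1.size(), values2.size());
    std::vector<double> result;
    result.reserve(count);
    for (std::size_t i = 0; i < count; i++)
        result.push_back(values1[i] + values2[i]);
    return result;
}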

Code can be complex, sometimes because we think we are "helping" the computer. It's often better to help ourselves, to write programs that are simple and easy to understand.

Friday, April 4, 2014

What was true is no longer true

The IT world has truths, basic precepts that govern other decisions, but these precepts can change over time.

Early computers were expensive. Really expensive. So expensive that we took steps to ensure that computers were used efficiently, scheduling jobs to maximize the use of the hardware. A computer that was not being used was considered a waste of resources -- you had purchased (or leased) a computer larger than you needed.

Today computers are inexpensive and we care little for the CPU utilization rate. We leave computers idle with abandon. (Letting computers run and "do nothing" would horrify the computer operators of those early days.) We leave computers running as we go for lunch (or dinner), and we leave them running overnight with no active jobs.

The early days of computing saw expensive hardware and cheap programmers. Hardware and "computing time" were so precious that every moment counted. We could not afford to let the computer check our programs for errors; we asked programmers to "desk check" their programs for errors prior to compiling and running.

Today programmers are expensive and we gleefully let our compilers check for errors. Modern IDEs check for errors as we type. It is better to let the computers check our syntax several times each second, rather than force the programmers to think about syntax.

The introduction of compilers (FORTRAN and COBOL, and other early languages) sparked a debate about efficiency. Compilers convert high-level statements into machine code, but a programmer skilled in assembly language could create programs that were more efficient than the code from early compilers.

Today, the reverse is true. The code-generators in compilers have been improved and routinely emit efficient code. The CPU architecture has become more complex, so complex that only the most skilled programmer can out-perform a compiler (and only by spending more time). Processors with multiple cores and multiple pipelines demand extreme attention to data alignment and memory access, and compilers are better than people when generating machine instructions.

Things that we know are true may be true today and false tomorrow. Our challenge is to recognize when the cost equation changes. Too often we learn a "truth" and keep it with us, even when reality changes. Then our hard-won, comforting knowledge leads us astray.

What fundamental truths (aside from those above) have become invalid? Quite a few. Here is a sample:

"C++ is slower and bulkier than pure C code" - possibly still true, but C++ compilers have improved to the point that it is hard to see the difference.

"Java is slower than C++" - Java compilers and run-time libraries have improved. C++ may be faster to run, but not by much. Java is faster to write, and the overall benefit may be with Java.

"Personal computers are toys" - initially yes, they were little more than toys and incapable of performing in businesses. Yet today almost every business has personal computers in some capacity.

"Microsoft is the standard" - many open source projects have demonstrated their competence (or even superiority) to Microsoft tools. Many companies have adopted technologies other than Microsoft's.

"The Windows PC is the standard" - phones and tablets have challenged the dominance of the Windows PC.

"Databases are always relational" - Not true in the early days of computers; it became true in the 1980s with the rise of SQL. Now alternatives offer competent data storage.

Looking forward, we can see other maxims proven false:

"Source code must be text"

"Computers will always become cheaper"

"Computers will always become faster"

"Virtualized computers are cheaper"

"Linux is immune to malware"

"Windows is always vulnerable to malware"

That which we believe to be true may indeed be true today. Or it may be false. Yet even if it is true, it may become false tomorrow. Best to be on our guard. The danger is not in what we know, but in what we believe.