Wednesday, October 7, 2020

Platforms are not neutral

We like to think that our platforms (processor, operating system, and operating environment) are neutral, at least when it comes to programming languages. After all, why should a processor care if the code that it executes (machine code) was generated by a compiler for C#, or Java, or Fortran, or Cobol? Ditto for the operating system. Does Windows care if the code was from a C++ program? Does Linux care if the code came from Go?

And if processors and operating systems don't care about code, why should a platform such as .NET or the JVM care about code?

One could argue that the JVM was designed for Java, and that it has the data types and operations that are needed for Java programs and not other languages. That argument is correct: the JVM was built for Java. Yet people have built compilers that convert other languages to JVM bytecode. The list of such languages includes Clojure, Lisp, Kotlin, Groovy, JRuby, and Jython. All of those languages run on the JVM. All of those languages use the same data types as Java.

The argument for .NET is somewhat different. The .NET platform was designed for multiple languages. When it was announced, Microsoft supplied not only the C# compiler but also compilers for C++, VB, and Visual J#. Other companies have added compilers for many other languages.

But those experiences do not mean that the platforms are unbiased.

The .NET platform, and specifically the Common Language Runtime (CLR), was about interoperability. The goal was to allow programs written in different languages to work together: for example, to call a function in Visual Basic from a function in C++.

To achieve this interoperability, the CLR requires languages to use a common set of data types. These common types include 32-bit integers, 64-bit floating-point numbers, and strings. Prior to .NET, the different language compilers from Microsoft all had different ideas about numeric types and string types. C and C++ used NUL-terminated strings, but Visual Basic used length-prefixed (counter-in-front) strings. (Thus, a string in Visual Basic could contain NUL characters, but a string in C or C++ could not.) There were differences with floating-point numeric values also.
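The difference between the two string conventions can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and the 4-byte little-endian count is an assumption chosen for the example, not a claim about the exact layout any Microsoft compiler used.

```python
import struct

def encode_c_string(data: bytes) -> bytes:
    """NUL-terminated style: the payload followed by a single zero byte."""
    return data + b"\x00"

def decode_c_string(buf: bytes) -> bytes:
    """Read up to (not including) the first zero byte, as C's strlen would."""
    return buf[:buf.index(b"\x00")]

def encode_counted_string(data: bytes) -> bytes:
    """Counter-in-front style: a 4-byte count (assumed size), then the payload."""
    return struct.pack("<I", len(data)) + data

def decode_counted_string(buf: bytes) -> bytes:
    """Read the count first, then exactly that many payload bytes."""
    (length,) = struct.unpack_from("<I", buf, 0)
    return buf[4:4 + length]

# A payload with a NUL character in the middle.
text = b"AB\x00CD"

# The C-style reader stops at the embedded NUL and loses the tail.
c_round_trip = decode_c_string(encode_c_string(text))

# The counted reader trusts the length and returns the whole payload.
counted_round_trip = decode_counted_string(encode_counted_string(text))
```

The same five bytes survive one convention and are cut short by the other, which is why two languages with different string conventions cannot simply hand strings to each other.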

Notice that these types are aligned with the C data types. The CLR, and the agreement on data types, works for languages that use data types that match C's data types. The .NET version of Visual Basic (VB.NET) had to change its data types in order to comply with the rules of the CLR. Thus, VB.NET was quite a bit different from the previous Visual Basic.

The CLR works for languages that use C-style data types. The CLR supports custom data types, which is nice, and necessary for languages that do not use C-style data types, but then one loses interoperability, and interoperability was the major benefit of .NET.

The .NET platform favors C-style data types. (Namely fixed-size binary integers and floating-point numbers, plus a single string type.)

The JVM also favors C-style data types.

Many languages use C-style data types.

What languages don't use C-style data types?

Cobol, for one. Cobol was developed prior to C, and it has its own ideas about data. It allows numeric values with PICTURE clauses, which can define limits and also formatting. Some examples:

   05  AMOUNT1          PIC 999.99.
   05  AMOUNT2          PIC 99999.99.
   05  AMOUNT3          PIC 999999.99.
   05  AMOUNT4          PIC 99.99.

(The '05' at the front of each line is not a part of the variable, but indicates how the variable is part of a larger structure.)

These four different values are numeric, but they do not align well with any of the CLR types. Thus, they cannot be exchanged with other programs in .NET.
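To make the mismatch concrete, here is a rough Python sketch of what a PICTURE such as 999.99 carries that a plain integer or floating-point type does not: a digit limit on each side of the decimal point and a fixed display format. The class name PictureField and its methods are invented for this illustration. It handles only pictures made of 9s and one decimal point, and it raises an error on overflow where a real Cobol program would follow its own truncation and SIZE ERROR rules.

```python
from decimal import Decimal, ROUND_DOWN

class PictureField:
    """Hypothetical model of an unsigned PIC field such as '999.99'."""

    def __init__(self, picture: str):
        integer_part, _, fraction_part = picture.partition(".")
        self.integer_digits = integer_part.count("9")
        self.fraction_digits = fraction_part.count("9")
        # Smallest representable step, e.g. 0.01 for two decimal places.
        self.quantum = Decimal(10) ** -self.fraction_digits
        # First value that no longer fits, e.g. 1000 for three integer digits.
        self.limit = Decimal(10) ** self.integer_digits

    def store(self, value: str) -> Decimal:
        """Drop excess decimal places and check the digit limit."""
        number = Decimal(value).quantize(self.quantum, rounding=ROUND_DOWN)
        if number < 0 or number >= self.limit:
            # Simplification: a Cobol program would not raise an exception here.
            raise OverflowError(f"{value} does not fit the picture")
        return number

    def display(self, number: Decimal) -> str:
        """Render with leading zeros, filling every digit position."""
        width = self.integer_digits + self.fraction_digits
        if self.fraction_digits:
            width += 1  # room for the decimal point
        return str(number).zfill(width)

amount1 = PictureField("999.99")
amount4 = PictureField("99.99")

shown1 = amount1.display(amount1.store("12.5"))
shown4 = amount4.display(amount4.store("12.5"))
```

The same value, 12.5, is a different thing in AMOUNT1 and AMOUNT4: each field has its own range and its own printed form. A 32-bit integer or a 64-bit float knows nothing about either.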

There are compilers for Cobol that emit .NET modules. I don't know how they work, but I suspect that they either use custom types (which are not easily exchanged with modules from other languages) or they convert the Cobol-style data to a C-style value (which would incur a performance penalty).

Pascal has a similar problem with data types. Strings in Pascal are length-count strings, not NUL-terminated strings. Pascal has "sets" which can contain a set of values. The notion of a set translates poorly to other languages. C, C++, Java, and C# can use enums to do some of the work, but sets in Pascal are not quite enums.
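As a sketch of the gap, here is how a Pascal-style set over an enumeration is commonly emulated in C-family code: an enum supplies the member names, and a bit mask supplies the set. The names below (Day, make_set, contains) are made up for this example, and the bit-mask representation is an assumption, not a description of any particular Pascal compiler.

```python
from enum import IntEnum

class Day(IntEnum):
    MON = 0
    TUE = 1
    WED = 2
    THU = 3
    FRI = 4
    SAT = 5
    SUN = 6

def make_set(*members: Day) -> int:
    """Build a set as an integer with one bit per member."""
    bits = 0
    for member in members:
        bits |= 1 << member
    return bits

def contains(bits: int, member: Day) -> bool:
    """Membership test, the counterpart of Pascal's 'in' operator."""
    return bool(bits & (1 << member))

weekdays = make_set(Day.MON, Day.TUE, Day.WED, Day.THU, Day.FRI)
weekend = make_set(Day.SAT, Day.SUN)

all_days = weekdays | weekend   # Pascal would write: weekdays + weekend
overlap = weekdays & weekend    # Pascal would write: weekdays * weekend
```

The enum gives only the names. The set operations have to be bolted on by hand, and nothing stops a caller from passing an integer that was never meant to be a set of days.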

Pascal also has definite ideas about memory management and pointers, and those ideas do not quite align with memory management in C (or .NET). With care, one can make it work, but Pascal is not a native .NET language any more than Cobol.

Fortran is another language that predates the .NET platform and doesn't work well on it. Fortran is a simpler language than Cobol or Pascal, and concentrates on numeric values. The numeric types can convert to the CLR numeric types, so compiling and exchanging data is possible.

Fortran's strength was speed. It was (and still is) one of the fastest languages for numeric processing. Its speed is due to its static memory layout, something that I have not seen in compilers for Fortran to .NET modules. Thus, Fortran on .NET loses its advantage. Fortran on .NET is not fast, and I fear it never will be.

Processors, too, are biased. Intel processors handle binary numeric values for integers and floating-point values, but offer little support for BCD values. IBM S/360 processors (and their descendants) can handle BCD data directly. (BCD data is useful for financial transactions because it avoids many issues with floating-point representations.)
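A small Python sketch shows both halves of that remark: the binary floating-point issue, and the idea behind BCD of storing one decimal digit per 4-bit nibble. The function names are invented for this example, and the packing here leaves out the sign nibble that the IBM packed-decimal format carries.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly.
float_sum_is_exact = (0.1 + 0.2 == 0.3)

# Decimal arithmetic keeps the digits as written.
decimal_sum_is_exact = (Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))

def to_packed_bcd(digits: str) -> bytes:
    """Pack decimal digits two per byte, one digit per 4-bit nibble."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of decimal digits")
    if len(digits) % 2:
        digits = "0" + digits  # pad to a whole number of bytes
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )

def from_packed_bcd(packed: bytes) -> str:
    """Unpack each byte back into its two decimal digits."""
    return "".join(f"{byte >> 4}{byte & 0x0F}" for byte in packed)

packed = to_packed_bcd("1234")
unpacked = from_packed_bcd(packed)
```

In hexadecimal, the packed bytes read 12 34, the same digits a person would write. A processor with decimal instructions can add such values digit by digit, with no conversion to binary and back.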

Our platforms are biased. We often don't see that bias, most likely because we use only a single platform. (A single processor type, a single operating system, a single programming language.) The JVM and .NET platforms are biased towards C-style data types.

There are different approaches to data and to computation, and we're limiting ourselves by limiting our expertise. I suspect that in the future, developers will rediscover the utility of data types that are not C-style types, especially the PICTURE-specified numeric types of Cobol. As C and its descendants are ill-equipped to handle such data, we will see new languages with the new (old) data types.