Wednesday, February 27, 2019

R shows that open source permits poor quality

We like to think that open source projects are better than closed source projects. Not just cheaper, but better -- higher quality, more reliable, and easier to use. But while high quality and reliability and usability may be the result of some open source projects, they are not guaranteed.

Consider the R toolchain, which includes the R interpreter, the Rmd markdown language, the RStudio IDE, and commonly used models built in R. All of these are open source, and all have significant problems.

The R interpreter is a large and complicated program. It is implemented in multiple programming languages: C, C++, Fortran -- and Java seems to be part of the build as well. To build R you need compilers for all of these languages, and you also need lots of libraries. The source code is not trivial; it takes quite a bit of time to compile the R source and get a working executable.

The time for building R concerns me less than the mix of languages and the number of libraries. R sits on top of a large stack of technologies, and a problem in any piece can percolate up and become a problem in R. If one is lucky, R will fail to run; if not, R will run and use whatever data happens to be available after the failure.

The R language itself has problems. It uses one-letter names for common functions ('t' to transpose a matrix, 'c' to combine values into a vector), which means that these letters are awkward choices for "normal" variables. (R does tolerate them: when a name is called as a function, R skips over non-function objects while looking it up, so a variable named 't' does not block a call to 't'. But even then, a program would be confusing to read.)

R also suffers from too many data containers. One can have a list, which is different from a vector, which is different from a matrix, which is different from a data frame. The built-in libraries all expect data of some type, but woe to him who passes one structure when a different one is expected. (Some functions do the right thing, and others complain about a type mismatch.)

Problems are not confined to the R language. The Rmd markdown language is another problem area. Rmd is based on Markdown, which has problems of its own. Rmd inherits these problems and adds more. A document in Rmd can contain plain text, Markdown for text effects such as bold and italics, blocks of R code, and blocks of TeX. Rmd is processed into regular Markdown, which is then processed into the output form of your choice (PDF, HTML, MS-Word, and a boatload of other formats).

Markdown allows you to specify line breaks by typing two space characters at the end of a line. (Invisible markup at the end of a line! I thought 'make' had poor design with TAB characters at the front of lines.) Markdown also allows you to force a line break with a backslash at the end of a line, which is at least visible -- but Rmd removes this capability and requires the invisible characters at the end of a line.
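One way to cope with invisible markup is to make it visible before you commit a document. The following is a minimal sketch in Python, not part of any R or Markdown tool; the function names and the marker text are my own inventions. It treats any non-blank line that ends in two or more spaces as a forced line break. It does not parse Markdown, so it will also flag trailing spaces inside code blocks or at the end of a paragraph, where they have no effect.

```python
def find_hard_breaks(text):
    """Return the 1-based numbers of non-blank lines ending in two or more spaces."""
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line.strip() and line.endswith("  "):
            hits.append(number)
    return hits


def mark_hard_breaks(text, marker=" <BR>"):
    """Replace trailing spaces on flagged lines with a visible marker (hypothetical)."""
    marked = []
    for line in text.split("\n"):
        if line.strip() and line.endswith("  "):
            marked.append(line.rstrip(" ") + marker)
        else:
            marked.append(line)
    return "\n".join(marked)


if __name__ == "__main__":
    sample = "first line  \nsecond line\nthird line  \n"
    print(find_hard_breaks(sample))
    print(mark_hard_breaks(sample))
```

Running a check like this over an Rmd file at least tells you where the forced breaks are hiding.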

The RStudio IDE is perhaps the best of the different components of R, yet it too has problems. It adds packages as needed, but when it does it displays status messages in red, a color usually associated with errors or warnings. It allows one to create documents in R or Rmd format, asking for a name. But for Rmd documents, the name you enter is not the name of the file; it is inserted into the file as part of a template. (Rmd documents contain metadata in the document, and a title is inserted into the metadata.) When creating an Rmd document in RStudio, you have to start the process, enter a name to satisfy the metadata, see the file displayed, and then save the file -- RStudio then asks you (again) for a name -- but this time it is for the file, not the metadata.

The commonly used models (small or not-so-small programs written in R or a mix of languages) are probably the worst area of the R ecosystem. The models can perform all sorts of calculations, and the quality of the models ranges from good to bad. Some models, such as those for linear programming, use variables and formulas to specify the problem you want solved. But the variables of the model are not variables in R; do not confuse the two separate things with the same name. There are two namespaces (one for R and one for the model) and each namespace holds variables. The programmer must mentally keep variables sorted. Referring to a variable in the wrong namespace yields the expected "variable not found" error.

Some models have good error messages, others do not. One popular model for linear programming, upon finding a variable name that has not been specified for the model's namespace, simply reports "A variable was specified that is not part of the model." (That's the entire message. It does not report the offending name, nor even a program line number. You have to hunt through your code to find the problem name.)
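To show what a better message could look like, here is a minimal sketch in Python. It is not the code of any actual R package; the class, the method names, and the message wording are all hypothetical. The model keeps its own namespace, separate from the host language's variables, and when a formula refers to an unknown name the error reports the offending name, where it was used, and the names the model does know.

```python
class UnknownVariableError(LookupError):
    """Raised when a formula refers to a name the model does not know."""


class ModelNamespace:
    """A hypothetical model-side namespace, separate from the host language's variables."""

    def __init__(self):
        self._variables = {}

    def add_variable(self, name, lower_bound=0.0):
        # Register a decision variable under the model's own namespace.
        self._variables[name] = lower_bound

    def check(self, names, where="<unknown>"):
        # Collect every unknown name so the user sees all of them at once.
        missing = [name for name in names if name not in self._variables]
        if missing:
            known = ", ".join(sorted(self._variables))
            raise UnknownVariableError(
                f"{where}: not part of the model: {', '.join(missing)} "
                f"(known variables: {known})"
            )


if __name__ == "__main__":
    model = ModelNamespace()
    model.add_variable("x")
    model.add_variable("y")
    try:
        model.check(["x", "z"], where="constraint 2")
    except UnknownVariableError as error:
        print(error)
```

The extra bookkeeping is a few lines of code, and it saves the user from hunting through a program for the problem name.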

Given the complexity of R, the mix of languages in Rmd, the foibles of RStudio, and the mediocre quality of commonly used extensions to R, I can say that R is a problem. The question is, how did this situation arise? All of the components are open source. How did open source "allow" such poor quality?

It's a good question, and I don't have a definite answer. But I do have an idea.

We've been conditioned to think of open source as a way to develop quality projects by the commonly known successful open source projects: Linux, Perl, Python, Ruby, and LibreOffice. These projects are well-respected and popular with many users. They have endured; each is over ten years old. (There are other well-respected open source projects, too.)

When we think of these projects, we see success and quality. Success and quality are the result of hard work, dedication, and a bit of luck. These projects had all of those elements. Open source, by itself, is not enough to force a result of high quality.

These successful projects have been run by developers, and more importantly, for developers. That is certainly true of the early, formative years of Linux, and true for any open source programming language. I suspect that the people working on LibreOffice are primarily developers.

I believe that this second point (that the projects are run for developers) does not hold for the R ecosystem. I suspect that the people working on the R language, the Rmd markdown format, and especially the people building the commonly used models are first and foremost data scientists and analysts, and developers second. They are building R for their needs as data scientists.

(I omit RStudio from this list. It appears that RStudio is built as a commercial endeavor, which means that its developers are paid to be developers. That makes status messages in red even more embarrassing.)

I will note that the successful open source projects have had an individual as a strong leader for the project. (Linux has Linus Torvalds, Perl has Larry Wall, etc.) I don't see a strong individual leading the R ecosystem or any component. These projects are -- apparently -- run by committee, or built by separate teams. It is a very Jeffersonian approach to software development, one which may have an effect on the quality of the result. (Again, this is all an idea. I have not confirmed this.)

Where does this leave us?

First, I am reluctant to trust any important work to R. There are too many "moving pieces" for my taste -- too many technologies, too many impediments to good code, too many things that can go wrong. The risks outweigh the benefits.

Second, in the long term, we may move away from R. The popularity of R rests not on the R language but on the ability to create (or use) linear programming models. Someone will create a platform for analysis, with the ability to define and run linear programming models. It will "just work", to borrow a phrase from Apple.

Moving away from R and the current toolchain is not guaranteed. The current tools may have achieved a "critical mass" of acceptance in the data science community, and the cost of moving to a different toolchain may be viewed as unacceptable. In that case, the data science community can look forward to decades of struggles with the tools.

The real lesson is that open source does not guarantee quality software. R and the models are open source, but the quality is... mediocre. High quality requires more than just open source. Open source may be, as the mathematicians say, "necessary but not sufficient". We should consider this when managing any of our projects. Starting a project as open source will not guarantee success, nor will converting an existing project to open source. A successful project needs more: capable management, good people, and a vision and definition of success.

Tuesday, February 12, 2019

Praise for Microsoft

I am not Microsoft's biggest fan. I disliked their products and strategies in the 1990s, when they had a virtual monopoly on desktop operating systems, office software, and development tools. Yet I must give them credit for two recent products: OneDrive and Visual Studio Code.

OneDrive

OneDrive synchronizes files across multiple devices. I can store a file in OneDrive on computer A and later retrieve it on computer B. OneDrive stores data on Microsoft's servers and associates it with my account. If I log in to a Windows computer with my ID and password, I can see all of my files on OneDrive. The files are not copied to the local computer; they are simply available for me to view, change, or delete.

OneDrive also provides storage for online services such as Office Online. This lets me use any computer, even a public one in a library. (I think. I have yet to try this. But it makes sense for Microsoft to do things this way.)

Visual Studio Code

The other product that deserves credit is Visual Studio Code.

Microsoft advertises Visual Studio Code as an editor, yet it is much more. It edits, color-highlights, checks syntax, refactors, debugs (at least with Python), and integrates with git. It has an impressive array of features in a small package. What is significant is that the features are just the right set -- at least for me, and I suspect a large number of developers. It is not weighed down with all of the features of Microsoft's classic Visual Studio package. Visual Studio Code omits the templates and the auto-generation. It replaces the package manager with a series of lightweight plug-ins. It seems to ignore Team Foundation Server (and services), although I could be mistaken about that. (Perhaps there is an enterprise version of VS Code that connects to TFS.)

Beyond the feature set, Visual Studio Code... works. It's a competent product, one that feels good to use. It has just enough to get the job done, and it gets the job done well. I feel comfortable using it. (And that's a rare thing with me and Microsoft products.)

Visual Studio Code is a departure from the traditional Microsoft approach to software. The old Microsoft built software for Windows -- and Windows only. (A few exceptions were made for Mac OS.) Visual Studio Code breaks from that tradition: it is available for Windows, Mac OS, and Linux. This is indeed a ground-breaking project.

OneDrive and Visual Studio Code make for a pleasant experience when developing code. Microsoft deserves credit for bold choices and good tools. If you have not tried them, I recommend that you do.

What have you got to lose?