Wednesday, February 27, 2019

R shows that open source permits poor quality

We like to think that open source projects are better than closed source projects. Not just cheaper, but better -- higher quality, more reliable, and easier to use. But while high quality and reliability and usability may be the result of some open source projects, they are not guaranteed.

Consider the R tool chain, which includes the R interpreter, the Rmd markdown language, the Rstudio IDE, and commonly used models built in R. All of these are open source, and all have significant problems.

The R interpreter is a large and complicated program. It is implemented in multiple programming languages: C, C++, Fortran -- and Java seems to be part of the build as well. To build R you need compilers for all of these languages, and you also need lots of libraries. The source code is not trivial; it takes quite a bit of time to compile the R source and get a working executable.

The time for building R concerns me less than the mix of languages and the number of libraries. R sits on top of a large stack of technologies, and a problem in any piece can percolate up and become a problem in R. If one is lucky, R will fail to run; if not, R will run and use whatever data happens to be available after the failure.

The R language itself has problems. It uses one-letter names for common functions ('t' to transpose a matrix, 'c' to combine values into a vector), which means that these letters collide with names one would want for "normal" variables. (R does tolerate the collision: when a name is used in a call, it looks for a function and skips ordinary variables. But even then, a program would be confusing to read.)
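A quick way to check the naming question is to try it. The sketch below uses only base R and assumes a fresh session; the values assigned to 'c' and 't' are arbitrary and chosen only for illustration.

```r
# Assign ordinary values to the names 'c' and 't' in the current environment.
c <- 5
t <- 10

# In call position R searches for a *function* with that name,
# so the built-in functions are still found.
v <- c(1, 2, 3)                 # base::c, not the number 5
m <- t(matrix(1:6, nrow = 2))   # base::t, turns a 2x3 matrix into a 3x2 one

print(c + t)    # the two variables added together
print(v)
print(dim(m))

# Remove the variables so the shadowing does not leak into later code.
rm(c, t)
```

The code runs, but a reader has to work out from context which 'c' and which 't' is meant on each line, which is the readability problem described above.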

R also suffers from too many data containers. One can have a list, which is different from a vector, which is different from a matrix, which is different from a data frame. The built-in libraries all expect data of some type, but woe to him who supplies one structure when a different one is expected. (Some functions do the right thing, and others complain about a type mismatch.)
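A small base R sketch shows the difference between the containers. The variable names (vec, lst, mat, df) are made up for this example, and sum() and mean() stand in for "the built-in libraries" only as convenient illustrations.

```r
# The same three numbers in four different containers.
vec <- c(1, 2, 3)                     # atomic vector
lst <- list(1, 2, 3)                  # list
mat <- matrix(c(1, 2, 3), nrow = 1)   # matrix
df  <- data.frame(x = c(1, 2, 3))     # data frame

# Report what R considers each one to be.
for (obj in list(vec, lst, mat, df)) {
  cat(class(obj)[1], "/", typeof(obj), "\n")
}

# sum() accepts the vector and the matrix, but stops with an error on the list.
sum_vec <- sum(vec)
sum_mat <- sum(mat)
sum_lst <- tryCatch(sum(lst), error = function(e) conditionMessage(e))

# mean() does not stop on the list; it warns and returns NA instead.
mean_lst <- suppressWarnings(mean(lst))

print(sum_vec)
print(sum_mat)
print(sum_lst)    # the text of the error raised by sum()
print(mean_lst)
```

Two functions given the same list fail in two different ways, one with an error and one with a warning and a missing value, so the caller has to know which container each function expects.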

Problems are not confined to the R language. The Rmd markdown language is another problem area. Rmd is based on Markdown, which has problems of its own. Rmd inherits these problems and adds more. A document in Rmd can contain plain text, markdown for text effects such as bold and italics, blocks of R code, and blocks of TeX. Rmd is processed into regular Markdown, which is then processed into the output form of your choice (PDF, HTML, MS-Word, and a boatload of other formats).

Markdown allows you to specify line breaks by typing two space characters at the end of a line. (Invisible markup at the end of a line! I thought 'make' had poor design with TAB characters at the front of lines.) Markdown also allows you to force a line break with a backslash at the end of a line, which is at least visible -- but Rmd removes this capability and requires the invisible characters at the end of a line.

The Rstudio IDE is perhaps the best of the different components of R, yet it too has problems. It adds packages as needed, but when it does it displays status messages in red, a color usually associated with errors or warnings. It allows one to create documents in R or Rmd format, asking for a name. But for Rmd documents, the name you enter is not the name of the file; it is inserted into the file as part of a template. (Rmd documents contain metadata in the document, and a title is inserted into the metadata.) When creating an Rmd document in Rstudio, you have to start the process, enter a name to satisfy the metadata, see the file displayed, and then save the file -- Rstudio then asks you (again) for a name -- but this time it is for the file, not the metadata.

The commonly used models (small or not-so-small programs written in R or a mix of languages) are probably the worst area of the R ecosystem. The models can perform all sorts of calculations, and the quality of the models ranges from good to bad. Some models, such as those for linear programming, use variables and formulas to specify the problem you want solved. But the variables of the model are not variables in R; do not confuse the two separate things with the same name. There are two namespaces (one for R and one for the model) and each namespace holds variables. The programmer must mentally keep the variables sorted. Referring to a variable in the wrong namespace yields the expected "variable not found" error.

Some models have good error messages, others do not. One popular model for linear programming, upon finding a variable name that has not been specified for the model's namespace, simply reports "A variable was specified that is not part of the model." (That's the entire message. It does not report the offending name, nor even a program line number. You have to hunt through your code to find the problem name.)
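To make the two-namespace idea concrete, and to show what a more helpful message could look like, here is a minimal sketch in base R. It does not use or imitate any real modelling package: new_model, add_variable, and check_variables are hypothetical helpers invented for this example, and the model's namespace is simply a character vector of names.

```r
# Hypothetical toy "model": its namespace is a character vector of names,
# kept apart from R's own variables.
new_model <- function() list(variables = character(0))

add_variable <- function(model, name) {
  model$variables <- c(model$variables, name)
  model
}

# Check the names used in a constraint against the model's namespace and
# report every offending name, not just the fact that one exists.
check_variables <- function(model, used) {
  unknown <- setdiff(used, model$variables)
  if (length(unknown) > 0) {
    stop("not part of the model: ", paste(unknown, collapse = ", "),
         " (known: ", paste(model$variables, collapse = ", "), ")",
         call. = FALSE)
  }
  invisible(TRUE)
}

model <- add_variable(add_variable(new_model(), "x"), "y")

z <- 1   # exists as an R variable, but not in the model's namespace

msg <- tryCatch(
  check_variables(model, c("x", "y", "z")),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Naming the unknown variable and listing the known ones costs a few lines of code and saves the hunt through the program.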

Given the complexity of R, the mix of languages in Rmd, the foibles of Rstudio, and the mediocre quality of commonly used extensions to R, I can say that R is a problem. The question is, how did this situation arise? All of the components are open source. How did open source "allow" such poor quality?

It's a good question, and I don't have a definite answer. But I do have an idea.

We've been conditioned to think of open source as a way to develop quality projects by the commonly known successful open source projects: Linux, Perl, Python, Ruby, and LibreOffice. These projects are well-respected and popular with many users. They have endured; each is over ten years old. (There are other well-respected open source projects, too.)

When we think of these projects, we see success and quality. Success and quality are the result of hard work, dedication, and a bit of luck. These projects had all of those elements. Open source, by itself, is not enough to force a result of high quality.

These successful projects have been run by developers, and more importantly, for developers. That is certainly true of the early, formative years of Linux, and true for any open source programming language. I suspect that the people working on LibreOffice are primarily developers.

I believe that this second point (run for developers) does not hold for the R ecosystem. I suspect that the people working on the R language, the Rmd markdown format, and especially the people building the commonly used models are first and foremost data scientists and analysts, and developers second. They are building R for their needs as data scientists.

(I omit Rstudio from this list. It appears that Rstudio is built as a commercial endeavor, which means that its developers are paid to be developers. That makes the status messages in red even more embarrassing.)

I will note that the successful open source projects have had an individual as a strong leader for the project. (Linux has Linus Torvalds, Perl has Larry Wall, etc.) I don't see a strong individual leading the R ecosystem or any component. These projects are -- apparently -- run by committee, or built by separate teams. It is a very Jeffersonian approach to software development, one which may have an effect on the quality of the result. (Again, this is all an idea. I have not confirmed this.)

Where does this leave us?

First, I am reluctant to trust any important work to R. There are too many "moving pieces" for my taste -- too many technologies, too many impediments to good code, too many things that can go wrong. The risks outweigh the benefits.

Second, in the long term, we may move away from R. The popularity of R stems not from the R language but from the ability to create (or use) linear programming models. Someone will create a platform for analysis, with the ability to define and run linear programming models. It will "just work", to borrow a phrase from Apple.

Moving away from R and the current toolchain is not guaranteed. The current tools may have achieved a "critical mass" of acceptance in the data science community, and the cost of moving to a different toolchain may be viewed as unacceptable. In that case, the data science community can look forward to decades of struggles with the tools.

The real lesson is that open source does not guarantee quality software. R and the models are open source, but the quality is... mediocre. High quality requires more than just open source. Open source may be, as the mathematicians say, "necessary but not sufficient". We should consider this when managing any of our projects. Starting a project as open source will not guarantee success, nor will converting an existing project to open source. A successful project needs more: capable management, good people, and a vision and definition of success.
