Friday, December 30, 2011

The wonder of Git

I say "git" in the title of this post, but this is really about distributed version control systems (DVCS).

Git is easy to install and set up. It's easy to learn, and easy to use. (One can make the same claim of other programs, such as Mercurial.)

It's not the simple installation or operation that I find interesting about git. What I find interesting is the organization of the repositories.

Git (and possibly Mercurial and other DVCS packages) allows for a hierarchical collection of repositories. With a hierarchical arrangement, a project starts with a single repository, and then as people join the project they clone the original repository to form their own. They are the committers for their repositories, and the project owner remains the committer for the top-most repository. (This description is a gross over-simplification; there can be multiple committers and more interactions between project members. But bear with me.)

The traditional, "heavyweight" version control systems (PVCS, Visual SourceSafe, TFS) use a single repository. Projects that use these products tend to allow everyone on the project to check in changes -- there are no committers, no one specifically assigned to review changes and approve them. One can set policies to limit check-in privileges, although the mechanisms are clunky. One can set a policy to manually review all code changes, but the VCS provides no support for this policy -- it is enforced from the outside.

The hierarchical arrangement of multiple repositories aligns "commit" privileges with position in the organization. If you own a repository, you are responsible for changes; you are the committer. (Again, this is a simplification.)

Once you approve your changes, you can "send them up" to the next higher level of the repository hierarchy. Git supports this operation, bundling your changes and sending them automatically.

Git supports the synchronization of your repository with the rest of the organization, so you get changes made by others. You may have to resolve conflicts, but they would exist only in areas of the code in which you work.
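The arrangement described above can be sketched as a toy model. This is only an illustration of the hierarchy, not how Git stores or transfers history (Git tracks a graph of hashed commits, and this sketch ignores conflicts entirely). The class name `Repository`, the method names `send_up` and `sync`, and the people and changes are all hypothetical; in Git the corresponding operations are roughly `git clone`, `git commit`, `git push` (or asking the owner to pull), and `git pull`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Repository:
    """Toy stand-in for a repository: an owner, a parent, and a list of changes."""
    owner: str
    parent: Optional["Repository"] = None
    changes: List[str] = field(default_factory=list)

    def clone(self, new_owner: str) -> "Repository":
        # A clone starts with a copy of the history and remembers where it came from.
        return Repository(owner=new_owner, parent=self, changes=list(self.changes))

    def commit(self, change: str) -> None:
        # The change is recorded locally; nobody else sees it yet.
        self.changes.append(change)

    def send_up(self) -> List[str]:
        # Send every change the parent lacks to the next level of the hierarchy.
        if self.parent is None:
            return []
        missing = [c for c in self.changes if c not in self.parent.changes]
        self.parent.changes.extend(missing)
        return missing

    def sync(self) -> List[str]:
        # Bring in changes that other people have sent to the parent.
        if self.parent is None:
            return []
        missing = [c for c in self.parent.changes if c not in self.changes]
        self.changes.extend(missing)
        return missing


# The project starts with a single repository.
project = Repository(owner="project owner")
project.commit("initial import")

# People join the project by cloning it.
alice = project.clone("alice")
bob = project.clone("bob")

# Each person is the committer for their own repository.
alice.commit("fix parser")
bob.commit("add logging")

# Approved changes move up; synchronizing brings others' work down.
alice.send_up()
bob.send_up()
alice.sync()

print(project.changes)
print(alice.changes)
print(bob.changes)
```

In this sketch Bob has not synchronized, so his repository still lacks Alice's change until he chooses to bring it in.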

The capabilities of distributed version control systems support your organization. They align responsibility with position, pairing more authority with more responsibility. (If you want to manage a large part of the code, you must be prepared to review changes for that code.) In contrast, the older version control systems provide nothing in the way of support, and sometimes require effort to manage the project as you would like.

This is a subtle difference, one that is not discussed. I suspect that there will be a quiet revolution, as projects move from the old tools to the new.

Saturday, December 17, 2011

The character of programming languages

Many languages use C-style blocks denoted with braces (the characters '{' and '}').

The BCPL programming language was the first language to use braces as part of its syntax. Earlier languages (notably COBOL, FORTRAN, ALGOL, and LISP) did not use the brace characters.

Earlier languages did not use brace characters because the characters did not exist, at least not as defined characters. There was little in the way of standards for character sets, with each vendor (and sometimes each system) using its own character set. For a language to run on multiple computers, one had to limit the characters used in the language to those available on all planned platforms. Thus, FORTRAN uses uppercase letters and parentheses but not square brackets.

With the introduction of the ASCII and EBCDIC character sets, things changed. A standard character set (well, two standards) let one assume the existence of all of the defined characters.

First published in 1963, the character sets predate the effort to build BCPL in 1966. Thus, when BCPL was designed, the brace characters were present and ready to be used. They also have the virtue of not being used for anything before.

Our character sets define, to some extent, the syntax of our languages.

Monday, December 12, 2011

In open source, can there be more than one?

In the commercial market, multiple products are considered a healthy sign. The prevailing logic states that competition is a good thing, giving us the best possible product.

Vendors must position their products with a balance of different features and compatible operations. A word processor must provide some unique set of functions, but must provide some core set of functions to be considered a word processor. It must provide some compatibility to be considered useful to new customers (perhaps to convert their existing files). A bold, new approach to letter-writing, an approach that varies from the conventions of current products, will have a difficult time gaining acceptance. A word processor that performs the basic tasks of existing word processors, that is ten percent better at most things, and that offers a few new ideas, has a better chance of success. The commercial market allows for different, similar products.

The commercial market also has the risk of failure. Building a system on a product (say, a compiler or a version control system) builds in the risk of that product. Companies fail, and products are discontinued (even when the vendor succeeds). The user must choose carefully from the available products.

In the open source ecosystem, the dynamics are different. Multiple products (or projects) are not viewed as a necessity. Consider the popular open source solutions for different tasks: Linux, LibreOffice, GIMP, gcc, Sendmail, and NFS. There are competing offerings for these functions, but the "market" has settled on these projects. The chances of a project replacing the Linux kernel, or the GIMP package, are low. (Although not zero, as LibreOffice recently replaced OpenOffice.)

Open source is not monolithic, nor is it limited to single solutions. There are competing ideas for scripting languages (Perl, Python, Ruby) and editors (vi and Emacs). There are competing ideas for databases (MySQL and PostgreSQL, not to mention CouchDB).

I think that it is harder for an open source project to remain independent from the lead project than it is for a commercial product to remain independent from the market leader.

In open source, your ideas (and source code) are available. A small project that is mostly compatible with a large project can be absorbed into the large project. To remain independent, a project must remain different in some core aspect. The languages Perl, Python, and Ruby are all different. The editors vi and Emacs are different. Because of their differences, they can continue to exist as independent projects.

For most software functions, I believe that there is a "Highlander effect": there can be only one. There will be one wildly popular kernel, one wildly popular office suite, one wildly popular C++ compiler.

When there are "competing" open source projects, they will either eventually merge or eventually distance themselves (as with the case of vi and Emacs).

A popular open source project can "absorb" other, similar open source projects.

This effect will give a degree of stability to the ecosystem. One can build systems on top of the popular solutions. A system built with Linux, GNU utilities, gcc, and Python will endure for many years.

Sunday, December 11, 2011

Tradeoffs

It used to be that we had to write small, fast programs. Processors were slow, storage media (punch cards, tape drives, disc drives) were even slower, and memory was limited. In such a world, programmers were rewarded for tight code, and DP managers were rewarded for maintaining systems at utilization rates of ninety to ninety-five percent of machine capacity. The reason was that a higher rate meant that you needed more equipment, and a lower rate meant that you had purchased (or more likely, leased) too much equipment.

In that world, programmers had to make tradeoffs when creating systems. Readable code might not be fast, and fast code might not be readable (and often both were true). Fast code won out over readable (slower) code. Small code that squeezed the most out of the hardware won out over readable (less efficient) code. The tradeoffs were reasonable.

The world has changed. Computers have become more powerful. Networks are faster and more reliable. Databases are faster, and we have multiple choices of database designs -- not everything is a flat file or a set of related tables. Equipment is cheap, almost a commodity.

This change means that the focus of costs now shifts. Equipment is not the big cost item. CPU time is not the big cost item. Telecommunications is not the big cost item.

The big problem of application development, the big expense that concerns managers, the thing that will get attention, will be maintenance: the time and cost to modify or enhance an existing system.

The biggest factor in maintenance costs, in my mind, is the readability of the code. Readable code is easy to change (possibly). Opaque code is impossible to change (certainly).

Some folks look to documentation, such as design or architecture documents. I put little value in documentation; I have always found the code to be the final and most accurate description of the system. Documents suffer from aging: they were correct at some point, but the system has since been modified. Documents suffer from imprecision: they specify some but not all of the details. Documents suffer from inaccuracy: they specify what the author thought the system was doing, not what the system actually does.

Sometimes documentation can be useful. The business requirements of a system can be useful. But I find "System architecture" and "Design overview" documents useless.

If the code is to be the documentation for itself, then it must be readable.

Readability is a slippery concept. Different programmers have different ideas about "readability". What is readable to me may not be readable to you. Over my career, my ideas of readability have changed, as I learned new programming techniques (structured programming, object-oriented programming, functional programming), and even as I learned more about a language (my current ideas of "readable" C++ code are very different from my early ideas of "readable" C++ code).

I won't define readability. I will let each project decide on a meaningful definition of readability. I will list a few ideas that will let teams improve the readability of their code (however they define it).

Version control for source code: A shop that is not using version control is not serious about software development. There are several reliable, well-documented, well-supported, and popular systems for version control. Version control lets multiple team members work together and coordinate their changes.

Automated builds: An automated build lets you build the system reliably, consistently, and at low effort. You want the product for the customer to be built with a reliable and consistent method.

Any developer can build the system: Developers need to build the system to run their tests. They need a reliable, consistent, low-effort method to do that. And it has to work with their development environment, allowing them to change code and debug the system.

Automated testing: Like version control, automated testing is necessary for a modern shop. You want to test the product before you send it to your customers, and you want the testing to be consistent and reliable. (You also want it easy to run.)

Any developer can test the system: Developers need to know that their changes affect only the behaviors that they intend, and no other parts of the system. They need to use the tests to ensure that their changes have no unintended side-effects. Low-effort automated tests let them run the tests often.

Acceptance of refactoring: To improve code, complicated classes and modules must be changed into sets of smaller, simpler classes and modules. Refactoring changes the code without changing the external behavior of the code. If I start with a system that passes its tests (automated tests, right?) and I refactor it, it should pass the same tests. When I can rearrange code, without changing the behavior, I can make the code more readable.

Incentives for developers to use all of the above: Any project that discourages developers from using automated builds or automated tests, either explicitly or implicitly, will see little or no improvements in readability.
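The items on automated testing and refactoring can be illustrated with a small sketch. Everything in it is hypothetical and chosen only for illustration: the invoice calculation, the discount threshold, the function names, and the test values. Amounts are whole cents, so the arithmetic is exact. The point is that one set of automated checks runs against the code before and after the rearrangement, and both versions pass.

```python
def invoice_total_before(items, tax_percent):
    # Original version: subtotal, discount, and tax are all tangled together.
    t = 0
    for i in items:
        t = t + i[0] * i[1]
    if t > 10000:
        t = t - t // 10
    return t + t * tax_percent // 100


# Refactored version: the same arithmetic, split into small named steps.
BULK_THRESHOLD_CENTS = 10_000


def subtotal(items):
    # Each item is a (price in cents, quantity) pair.
    return sum(price_cents * quantity for price_cents, quantity in items)


def apply_bulk_discount(amount_cents):
    # Orders above the threshold get ten percent off.
    if amount_cents > BULK_THRESHOLD_CENTS:
        return amount_cents - amount_cents // 10
    return amount_cents


def add_tax(amount_cents, tax_percent):
    return amount_cents + amount_cents * tax_percent // 100


def invoice_total_after(items, tax_percent):
    return add_tax(apply_bulk_discount(subtotal(items)), tax_percent)


def check(invoice_total):
    # The same automated checks apply to any implementation.
    assert invoice_total([(2500, 2)], 8) == 5400    # below the discount threshold
    assert invoice_total([(6000, 2)], 8) == 11664   # discount applied, then tax
    assert invoice_total([], 8) == 0                # empty order


check(invoice_total_before)
check(invoice_total_after)
```

Because the checks are cheap to run, a developer can rerun them after every small rearrangement and learn at once when a change alters behavior.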

But the biggest technique for readable code is that the organization -- its developers and managers -- must want readable code. If the organization is more concerned with "delivering a quality product" or "meeting the quarterly numbers", then they will trade off readability for those goals.