Friday, December 30, 2011

The wonder of Git

I say "git" in the title of this post, but this is really about distributed version control systems (DVCS).

Git is easy to install and set up. It's easy to learn, and easy to use. (One can make the same claim of other programs, such as Mercurial.)

It's not the simple installation or operation that I find interesting about git. What I find interesting is the organization of the repositories.

Git (and possibly Mercurial and other DVCS packages) allows for a hierarchical collection of repositories. With a hierarchical arrangement, a project starts with a single repository, and then as people join the project they clone the original repository to form their own. They are the committers for their repositories, and the project owner remains the committer for the top-most repository. (This description is a gross over-simplification; there can be multiple committers and more interactions between project members. But bear with me.)

The traditional, "heavyweight" version control systems (PVCS, Visual SourceSafe, TFS) use a single repository. Projects that use these products tend to allow everyone on the project to check in changes -- there are no committers, no one specifically assigned to review changes and approve them. One can set policies to limit check-in privileges, although the mechanisms are clunky. One can set a policy to manually review all code changes, but the VCS provides no support for this policy -- it is enforced from the outside.

The hierarchical arrangement of multiple repositories aligns "commit" privileges with position in the organization. If you own a repository, you are responsible for changes; you are the committer. (Again, this is a simplification.)

Once you approve your changes, you can "send them up" to the next higher level of the repository hierarchy. Git supports this operation, bundling your changes and sending them automatically.

Git supports the synchronization of your repository with the rest of the organization, so you get changes made by others. You may have to resolve conflicts, but they would exist only in areas of the code in which you work.
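The workflow described above can be sketched with git's own commands. The repository names, paths, and identity below are hypothetical, invented for this sketch; "upstream" stands in for the project's top-most repository.

```shell
# A hypothetical sketch of the hierarchical DVCS workflow.
set -e
tmp=$(mktemp -d)
cd "$tmp"

git init --bare upstream.git            # the project's original repository
git clone upstream.git contributor      # a team member clones it to form their own
cd contributor
git config user.email "dev@example.com" # local identity, for this sketch only
git config user.name "Dev"

echo "first change" > notes.txt
git add notes.txt
git commit -m "Add notes"               # the contributor commits to their own repository

git push origin HEAD                    # "send up" the approved changes
git fetch origin                        # synchronize with changes made by others
```

The same push/fetch pair works at every level of the hierarchy, which is what lets commit privileges follow the repository boundaries.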

The capabilities of distributed version control systems support your organization. They align responsibility with position, pairing greater authority with greater responsibility. (If you want to manage a large part of the code, you must be prepared to review changes for that code.) In contrast, the older version control systems provide nothing in the way of support, and sometimes require extra effort to manage the project as you would like.

This is a subtle difference, one that is not discussed. I suspect that there will be a quiet revolution, as projects move from the old tools to the new.

Saturday, December 17, 2011

The character of programming languages

Many languages use C-style blocks denoted with braces (the characters '{' and '}').

The BCPL programming language was the first language to use braces as part of its syntax. Earlier languages (notably COBOL, FORTRAN, Algol, and LISP) did not use the brace characters.

Earlier languages did not use brace characters because the characters did not exist, at least not as defined characters. There was little in the way of standards for character sets, with each vendor (and sometimes each system) using its own character set. For a language to run on multiple computers, one had to limit the characters used in the language to those available on all planned platforms. Thus, FORTRAN uses uppercase letters and parentheses but not square brackets.

With the introduction of the ASCII and EBCDIC character sets, things changed. A standard character set (well, two standards) let one assume the existence of all of the defined characters.

First published in 1963, these character sets predate the 1966 effort to build BCPL. Thus, when BCPL was designed, the brace characters were present and ready to be used. They also had the virtue of not having been used for anything before.

Our character sets define, to some extent, the syntax of our languages.

Monday, December 12, 2011

In open source, can there be more than one?

In the commercial market, multiple products are considered a healthy sign. The prevailing logic holds that competition is a good thing, giving us the best possible product.

Vendors must position their products with a balance of different features and compatible operations. A word processor must provide some unique set of functions, but must provide some core set of functions to be considered a word processor. It must provide some compatibility to be considered useful to new customers (perhaps to convert their existing files). A bold, new approach to letter-writing, an approach that varies from the conventions of current products, will have a difficult time gaining acceptance. A word processor that performs the basic tasks of existing word processors, that is ten percent better at most things, and that offers a few new ideas, has a better chance of success. The commercial market allows for different, similar products.

The commercial market also carries the risk of failure. Building a system on a product (say, a compiler or a version control system) builds in the risk of that product. Companies fail, and products are discontinued (even when the vendor succeeds). The user must choose carefully from the available products.

In the open source ecosystem, the dynamics are different. Multiple products (or projects) are not viewed as a necessity. Consider the popular open source solutions for different tasks: Linux, LibreOffice, GIMP, gcc, SendMail, and NFS. There are competing offerings for these functions, but the "market" has settled on these projects. The chances of a project replacing the Linux kernel, or the GIMP package, are low. (Although not zero, as LibreOffice recently replaced OpenOffice.)

Open source is not monolithic, nor is it limited to single solutions. There are competing ideas for scripting languages (Perl, Python, Ruby) and editors (vi and Emacs). There are competing ideas for databases (MySQL and Postgres, not to mention CouchDB).

I think that it is harder for an open source project to remain independent from the lead project than it is for a commercial product to remain independent from the market leader.

In open source, your ideas (and source code) are available. A small project that is mostly compatible with a large project can be absorbed into the large project. To remain independent, a project must remain different in some core aspect. The languages Perl, Python, and Ruby are all different. The editors vi and Emacs are different. Because of their differences, they can continue to exist as independent projects.

For most software functions, I believe that there is a "Highlander effect": there can be only one. There will be one wildly popular kernel, one wildly popular office suite, one wildly popular C++ compiler.

When there are "competing" open source projects, they will either eventually merge or eventually distance themselves (as with the case of vi and Emacs).

A popular open source project can "absorb" other, similar open source projects.

This effect will give a degree of stability to the ecosystem. One can build systems on top of the popular solutions. A system built with Linux, GNU utilities, gcc, and Python will endure for many years.

Sunday, December 11, 2011

Tradeoffs

It used to be that we had to write small, fast programs. Processors were slow, storage media (punch cards, tape drives, disc drives) were even slower, and memory was limited. In such a world, programmers were rewarded for tight code, and DP managers were rewarded for maintaining systems at utilization rates of ninety to ninety-five percent of machine capacity. The reason was that a higher rate meant that you needed more equipment, and a lower rate meant that you had purchased (or more likely, leased) too much equipment.

In that world, programmers had to make tradeoffs when creating systems. Readable code might not be fast, and fast code might not be readable (and often both were true). Fast code won out over readable (slower) code. Small code that squeezed the most out of the hardware won out over readable (less efficient) code. The tradeoffs were reasonable.

The world has changed. Computers have become more powerful. Networks are faster and more reliable. Databases are faster, and we have multiple choices of database designs -- not everything is a flat file or a set of related tables. Equipment is cheap, almost commodities.

This change means that the focus of costs now shifts. Equipment is not the big cost item. CPU time is not the big cost item. Telecommunications is not the big cost item.

The big problem of application development, the big expense that concerns managers, the thing that will get attention, will be maintenance: the time and cost to modify or enhance an existing system.

The biggest factor in maintenance costs, in my mind, is the readability of the code. Readable code is easy to change (possibly). Opaque code is impossible to change (certainly).

Some folks look to documentation, such as design or architecture documents. I put little value in documentation; I have always found the code to be the final and most accurate description of the system. Documents suffer from aging: they were correct at some point, but the system has since been modified. Documents suffer from imprecision: they specify some but not all of the details. Documents suffer from inaccuracy: they specify what the author thought the system was doing, not what the system actually does.

Sometimes documentation can be useful. The business requirements of a system can be useful. But I find "System architecture" and "Design overview" documents useless.

If the code is to be the documentation for itself, then it must be readable.

Readability is a slippery concept. Different programmers have different ideas about "readability". What is readable to me may not be readable to you. Over my career, my ideas of readability have changed, as I learned new programming techniques (structured programming, object-oriented programming, functional programming), and even as I learned more about a language (my current ideas of "readable" C++ code are very different from my early ideas of "readable" C++ code).

I won't define readability. I will let each project decide on a meaningful definition of readability. I will list a few ideas that will let teams improve the readability of their code (however they define it).

Version control for source code: A shop that is not using version control is not serious about software development. There are several reliable, well-documented, well-supported, and popular systems for version control. Version control lets multiple team members work together and coordinate their changes.

Automated builds: An automated build lets you build the system reliably, consistently, and with little effort. You want the product for the customer to be built with a reliable and consistent method.

Any developer can build the system: Developers need to build the system to run their tests. They need a reliable, consistent, low-effort method to do that. And it has to work with their development environment, allowing them to change code and debug the system.

Automated testing: Like version control, automated testing is necessary for a modern shop. You want to test the product before you send it to your customers, and you want the testing to be consistent and reliable. (You also want it to be easy to run.)

Any developer can test the system: Developers need to know that their changes affect only the behaviors that they intend, and no other parts of the system. They need to use the tests to ensure that their changes have no unintended side-effects. Low-effort automated tests let them run the tests often.

Acceptance of refactoring: To improve code, complicated classes and modules must be changed into sets of smaller, simpler classes and modules. Refactoring changes the code without changing its external behavior. If I start with a system that passes its tests (automated tests, right?) and I refactor it, it should pass the same tests. When I can rearrange code without changing the behavior, I can make the code more readable.

Incentives for developers to use all of the above: Any project that discourages developers from using automated builds or automated tests, either explicitly or implicitly, will see little or no improvement in readability.
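As a toy illustration of the refactoring idea above (the function and values here are invented, not taken from any real project), an automated test pins down the behavior while the code is rearranged for readability:

```python
# A hypothetical example: a dense function, a more readable rewrite,
# and an automated test that verifies the behavior is unchanged.

def invoice_total(items):
    # Original, harder-to-read version: one dense loop over tuples of
    # (quantity, unit_price, tax_rate).
    t = 0
    for i in items:
        t = t + i[0] * i[1] * (1 + i[2])
    return t

def invoice_total_refactored(items):
    # Refactored version: the same arithmetic, named and separated.
    def line_total(quantity, unit_price, tax_rate):
        return quantity * unit_price * (1 + tax_rate)
    return sum(line_total(*item) for item in items)

# The test: both versions must agree, so the refactoring changed the
# code without changing its external behavior.
items = [(2, 10.0, 0.05), (1, 3.0, 0.0)]
assert invoice_total(items) == invoice_total_refactored(items)
```

With a test like this in place, a developer can rearrange the code freely and know immediately whether the behavior drifted.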

But the biggest technique for readable code is that the organization -- its developers and managers -- must want readable code. If the organization is more concerned with "delivering a quality product" or "meeting the quarterly numbers", then they will trade off readability for those goals.


Wednesday, November 30, 2011

Is "cheap and easy" a good thing?

In the IT industry, we are all about developing (and adopting) new techniques. The techniques often start as manual processes, often slow, expensive, and unreliable. We automate these processes, and eventually, the processes become cheap and easy. One would think that this path is a good thing.

But there is a dark spot.

Consider two aspects of software development: backups and version control.

More often than I like, I encounter projects that do not use a version control system. And many times, I encounter shops that have no process for creating backup copies of data.

In the early days of PCs, backups were expensive and consumed time and resources. The history of version control systems is similar. The earliest (primitive) systems were followed by (expensive) commercial solutions (that also consumed time and resources).

But the early objections to backups and version control no longer hold. There are solutions that are freely available, easy to use, easy to administer, and mostly automatic. Disk space and network connections are plentiful.

These solutions do require some effort and some administration. Nothing is completely free, or completely automatic. But the costs are significantly less than they were.

The resistance to version control is, then, only in the mindset of the project manager (or chief programmer, or architect, or whoever is running the show). If a project is not using version control, it's because the project manager thinks that not using version control will be faster (or cheaper, or better) than using it. If a shop is not making backup copies of important data, it's because the manager thinks that not making backups is cheaper than making them.

It is not enough for a solution to be cheap and easy. A solution has to be recognized as cheap and easy, and recognized as the right thing to do. The problem facing "infrastructure" items like backups and version control is that as they become cheap and easy, they also fade into the background. Solutions that "run themselves" require little in the way of attention from managers, who rightfully focus their efforts on running the business.

When solutions become cheap and easy (and reliable), they fall off of managers' radar. I suspect that few magazine articles talk about backup systems. (The ones that do probably discuss compliance with regulations for specific environments.) Today's articles on version control talk about the benefits of the new technologies (distributed version control systems), not the necessity of version control.

So here is the fading pain effect: We start with a need. We develop solutions, and make those tasks easier and more reliable, and we reduce the pain. As the pain is reduced, the visibility of the tasks drops. As the visibility drops, the importance assigned by managers drops. As the importance drops, fewer resources are assigned to the task. Resources are allocated to other, bigger pains. ("The squeaky wheel gets the grease.")

Beyond that, there seems to be a "window of awareness" for technical infrastructure solutions. When we invent techniques (version control, for example), there is a certain level of discussion and awareness of the techniques. As we improve the tools, the discussions become fewer, and at some point they live only in obscure corners of web forums. Shops that have adopted the techniques continue to use them, but shops that did not adopt the techniques have little impetus to adopt them, since they (the solutions) are no longer discussed.

So if you're a shop and you're "muddling through" with a manual solution (or no solution), you eventually stop getting the message that there are automated solutions. At this point, it is likely that you will never adopt the technology.

And this is why I think that "cheap and easy" may not always be a good thing.

Saturday, November 19, 2011

Programming and the fullness of time

Sometimes when writing code, the right thing to do is to wait for the technology to improve.

On a previous project, we faced the challenge of making the programs run faster after the system had been constructed. Once the user saw the performance of the system, they wanted a faster version. It was a reasonable request: the performance was sluggish, though not horrible. The system was usable; it just wasn't "snappy".

So we set about devising a solution. We knew the code, and we knew that making the system faster would not be easy. Improved performance would require changing much of the code. Complicating the issue was the tool that we had used: a code-generation package that created a lot of the code for us. Once we started modifying the generated code, we could no longer use the generator. Or we could track all changes and apply them to later generated versions of the system. Neither path was appealing.

We debated various approaches, and the project management bureaucracy was such that we *could* debate various approaches without showing progress in code changes. That is, we could stall or "run the clock out".

It turned out that doing nothing was the right thing to do. By making no changes to the code, but simply waiting for PCs to become faster, the problem was solved.

So now we come to Google's Chromebook, the portable PC with only a browser.

One of the criticisms against the Chromebook is the lack of off-line capabilities for Google Docs. This is a fair criticism; the Chromebook is useful only when connected to the internet, and internet connections are not available everywhere.

Yet an off-line mode for Google Docs may be the wrong solution to the problem. The cost of developing such a solution is not trivial. Google might invest several months (with multiple people) developing and testing the off-line mode.

But what if networking becomes ubiquitous? Or at least more available? If that were to happen, then the need for off-line processing is reduced (if not eliminated). The solution to "how do I process documents when I am not connected" is solved not by creating a new solution, but by waiting for the surrounding technology to improve.

Google has an interesting decision ahead of them. They can build the off-line capabilities into their Docs applications. (I suspect it would require a fair amount of JavaScript and hacks for storing large sets of data.) Or they can do nothing and hope that network coverage improves. (By "do nothing", I mean work on other projects.)

These decisions are easy to review in hindsight; they are cloudy on the front end. If I were Google, I would be looking at the effort for off-line processing, the possible side benefits from that solution, and the rate of improvement in network coverage. Right now, I see no clear "winning" choice, no obvious solution that is significantly better than the others. Which doesn't mean that Google should simply wait for network coverage to get better -- but it also means that Google shouldn't count that idea out.

Sunday, November 13, 2011

Programming languages exist to make programming easy

We create programming languages to make programming easy.

After the invention of the electronic computer, we invented FORTRAN and COBOL. Both languages made the act of programming easy. (Easier than assembly language, the only other game in town.) FORTRAN made it easy to perform numeric computations, and despite the horror of its input/output methods, it also made it easier to read and write numerical values. COBOL also made it easy to perform computations and input/output operations; it was slanted towards structured data (records containing fields) and readability (longer variable names, and verbose language keywords).

After the invention of time-sharing (and a shortage of skilled programmers), we invented BASIC, a teaching language that linguistically sat between FORTRAN and COBOL.

After the invention of minicomputers (and the ability for schools and research groups to purchase them), we invented the C language, which combined structured programming concepts from Algol and Pascal with the low-level access of assembly language. The combination allowed researchers to connect computers to laboratory equipment and write efficient programs for processing data.

After the invention of graphics terminals and PCs, we invented Microsoft Windows and the Visual Basic language to program applications in Windows. The earlier languages of C and C++ made programming in Windows possible, but Visual Basic was the language that made it easy.

After PCs became powerful enough, we invented Java, which leveraged that power to run interpreted byte-code programs, but also (and more significantly) to handle threaded applications. Support for threading was built into the Java language.

With the invention of networking, we created HTML, web browsers, and JavaScript.

I have left out equipment (microcomputers with CP/M, the Xerox Alto, the Apple Macintosh) and languages (LISP, RPG, C#, and others). I'm looking at the large trend using a few data points. If your favorite computer or language is missing, please forgive my arbitrary selections.

We create languages to make tasks easier for ourselves. As we develop new hardware, larger data sets, and new ways of connecting data and devices, we need new languages to handle the capabilities of these new inventions.

Looking forward, what can we see? What new hardware will stimulate the creation of new languages?

Cloud computing is big, and will lead to creative solutions. We're already seeing new languages that have increased rigor in the form of functional programming. We moved from non-structured programming to structured programming to object-oriented programming; in the future I expect us to move to functional programming. Functional programming is a good fit for cloud computing, with its immutable objects and no-side-effect functions.
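As a small, invented sketch of that style (the names and values here are illustrative, not from any particular system): immutable data and side-effect-free functions mean a result depends only on its inputs, which is what makes the work safe to distribute.

```python
# A toy sketch of the functional style: immutable data (tuples) and
# pure functions that return new values instead of mutating state.
from functools import reduce

def add_item(cart: tuple, item: str) -> tuple:
    """Return a *new* cart rather than mutating the old one."""
    return cart + (item,)

def total(prices: dict, cart: tuple) -> float:
    """Pure function: its output depends only on its inputs."""
    return reduce(lambda acc, item: acc + prices[item], cart, 0.0)

prices = {"pen": 1.50, "pad": 3.00}
empty = ()
cart = add_item(add_item(empty, "pen"), "pad")

assert empty == ()                 # the original cart is unchanged
assert total(prices, cart) == 4.50 # same inputs, same answer, every time
```

Because nothing here is mutated, these calls could run on any machine, in any order, and produce the same results; that property is the fit with cloud computing the paragraph above describes.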

Mobile programming is popular, but I don't expect a language for mobile apps. Instead, I expect new languages for mobile devices. The Java, C#, and Objective-C languages (from Google, Microsoft, and Apple, respectively) will mutate into languages better suited to small, mobile devices that must run applications in a secure manner. I expect that security, not performance, will be the driver for change.

Big data is on the rise. We'll see new languages to handle the collection, synchronization, querying, and analysis of large data sets. The language 'Processing' is a start in that direction, letting us render data in a visual form. The invention of NoSQL databases is also a start; look for a 'NoSQL standard' language (or possibly several).

The new languages will allow us to handle new challenges. But that doesn't mean that the old languages will go away. Those languages were designed to handle specific challenges, and they handle them well. So well that new languages have not displaced them. (Billing systems are still in COBOL, scientists still use Fortran, and lots of Microsoft Windows applications are still running in Visual Basic.) New languages are optimized for different criteria and cannot always handle the older tasks; I would not want to write a billing system in C, for example.

As the 'space' of our challenges expands, we invent languages to fill that space. Let's invent some languages and meet some new challenges!