Showing posts with label version control.

Monday, May 29, 2017

Microsoft's GVFS for git makes git a different thing

Microsoft is rather proud of their GVFS filesystem for git, but I think they don't quite understand what they have done.

GVFS, in short, changes git into a different thing. The plain git is a distributed version control system. When combined with GVFS, git becomes... well, let's back up a bit.

A traditional, non-distributed version control system consists of a central repository which holds files, typically source code. Users "check out" files, make changes, and "check in" the revised files. While users have copies of the files on their computers, the central repository is the only place that holds all of the files and all of the revisions to the files. It is the one place with all information, and is a single point of failure.

A distributed version control system, in contrast, stores a complete set of files and revisions on each user's computer. Each user has a complete repository. A new user clones a repository from an existing team member and has a complete set of files and revisions, ready to go. The repositories are related through parent-child links; the new user in our example has a repository that is a child of the cloned repository. Each repository is a clone, except for the very first instance, which could be considered the 'root' repository. These copies provide redundancy and guard against the single point of failure inherent in a traditional version control system.

Now let's look at GVFS and how it changes git.

GVFS replaces the local copy of a repository with a set of virtual files. The files in a repository are stored in a central location and downloaded only when needed. When checked in, the files are uploaded to the central location, not the local repository (which doesn't exist). From the developer's perspective, the changes made by GVFS are transparent. Git behaves just as it did before. (Although with GVFS, large repositories perform better than with regular git.)

Microsoft's GVFS changes the storage of repositories. It does not eliminate the multiple copies of the repository; each user retains their own copy. It does move those copies to the central server. (Or servers. The blog entry does not specify.)

I suppose you could achieve (almost) the same effect with regular git by changing the location of the .git directory. Instead of a local drive, you could use a directory on an off-premise server. If everyone did this, if everyone stored their git repository on the same server (say, a corporate server), you would have something similar to git with GVFS. (It is not exactly the same, as GVFS does some other things to improve performance.)
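As a rough sketch of that idea, plain git can already store the repository database away from the working tree with the `--separate-git-dir` option. Here the "server" directory is only a stand-in for a mounted corporate share; all paths and names are invented:

```shell
# Emulate the "central storage" idea with plain git: put the repository
# database in a separate location (imagine a mounted corporate server)
# and keep only a working tree locally.
set -e
work=$(mktemp -d) && cd "$work"
mkdir -p server                          # stand-in for an off-premise share

git init --separate-git-dir="$work/server/myproject.git" myproject

# The local .git is no longer a directory; it is a pointer file
# naming the central location:
cat myproject/.git                       # gitdir: .../server/myproject.git
```

This moves only the storage; unlike GVFS, nothing is downloaded on demand, so it is an approximation of the configuration, not of the performance.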

Moving the git repositories off of individual, distributed computers and onto a single, central server changes the idea of a distributed version control system. The new configuration is something in between the traditional version control system and a distributed version control system.

Microsoft had good reason to make this change. The performance of standard git was not acceptable for a very large team. I don't fault them for it. And I think it can be a good change.

Yet it does make git a different creature. I think Microsoft and the rest of the industry should recognize that.

Sunday, May 22, 2016

Small check-ins saved me

With all of the new technology, from cloud computing to tablets to big data, we can forget the old techniques that help us.

This week, I was helped by one of those simple techniques. The technique that helped was frequent, small check-ins to version control systems. I was using Microsoft's TFS, but this technique works with any system: TFS, Subversion, git, CVS, ... even SourceSafe!

Small, frequent changes are easier to review and easier to revert than large changes. Any version control system accepts small changes; the decision to make large or small changes is up to the developer.

After a number of changes, the team with whom I work discovered a defect, one that had escaped our tests. We knew that it was caused by a recent change -- we tested releases and found that it occurred only in the most recent release. That information limited the introduction of the defect to the most recent forty check-ins.

Forty check-ins may seem like a long list, but we quickly identified the specific check-in with a binary search: get the source from the middle revision; if the error occurs, search the earlier half; if not, search the later half; repeat, each time starting at the middle of the remaining range.
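We did this by hand on TFS, but the halving procedure is exactly what git automates with `git bisect`. A self-contained sketch, with a toy repository and invented names: eight check-ins, a "defect" slipping in at the fifth, and `git bisect run` doing the searching:

```shell
# Build a toy repository: eight check-ins, with a "defect" appearing
# at check-in 5. Then let git bisect perform the binary search for us.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev

for i in 1 2 3 4 5 6 7 8; do
    if [ "$i" -lt 5 ]; then
        echo "revision $i: ok" > app.txt
    else
        echo "revision $i: defect" > app.txt
    fi
    git add app.txt
    git commit -qm "check-in $i"
done

# Mark the newest revision bad and the oldest good; "git bisect run"
# then halves the range repeatedly, using our test command (exit 0
# means good) to decide which half to search next.
git bisect start HEAD HEAD~7
git bisect run grep -q 'ok$' app.txt
first=$(git show -s --format=%s refs/bisect/bad)
git bisect reset
echo "first bad: $first"                 # first bad: check-in 5
```

Eight revisions take three test runs instead of eight; forty take about six.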

The real benefit occurred when we found the specific check-in. Since all check-ins were small, this check-in was too. (It was a change of five different lines.) It was easy to review the five individual lines and find the error.

Once we found the error, it was easy to make the correction to the latest version of the code, run our tests (which now included an additional test for the specific problem we had found), verify that the fix was correct, and continue our development.

A large check-in would have required much more examination, and more time.

Small check-ins cost little and provide easy verification. Why not use them?

Thursday, September 5, 2013

Measure code complexity

We measure many things on development projects, from the cost to the time to user satisfaction. Yet we do not measure the complexity of our code.

One might find this surprising. After all, complexity of code is closely tied to quality (or so I like to believe) and also an indication of future effort (simple code is easier to change than complicated code).

The problem is not in the measurement of complexity. We have numerous techniques and tools, spanning the range from "lines of code" to function points. There are commercial tools and open source tools that measure complexity.
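As a taste of how cheap measurement can be, here is a deliberately crude sketch: count branch keywords in a source file as a stand-in for cyclomatic complexity. (Real tools are far more rigorous; the sample file and its numbers are invented.)

```shell
# A deliberately crude complexity measure: one point for the
# straight-line path, plus one per branch keyword. Real tools are far
# more careful; this only shows that measuring costs almost nothing.
set -e
cd "$(mktemp -d)"
cat > sample.c <<'EOF'
int f(int x) {
    if (x > 0) return x;
    for (int i = 0; i < x; i++) x--;
    while (x > 10) x /= 2;
    return x;
}
EOF
branches=$(grep -oE '\b(if|for|while|case)\b' sample.c | wc -l)
score=$((branches + 1))
echo "$score"                            # 4: three branches plus one
```

Ten lines of shell is not a metrics program, but it makes the point: the obstacle is not tooling.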

No, the problem is not in techniques or tools.

It is a matter of will. We don't measure complexity because, in short, we don't want to.

I can think of a few reasons that discourage the measurement of source code complexity.

- The measurement of complexity is a negative one. That is, more complexity is worse. A result of 170 is better than a result of 270, and this inverted scale is awkward. We are trained to like positive measurements, like baseball scores. (Perhaps the golf enthusiasts would see more interest if they changed their scoring system.)

- There is no direct way to connect complexity to cost. While we understand that a complicated code base is harder to maintain than a simple one, we have no way of converting that extra complexity into dollars. If we reduce our complexity from 270 to 170 (or 37 percent), do we reduce the cost of development by the same percentage? Why or why not? (I suspect that there is a lot to be learned in this area. Perhaps several Masters theses can be derived from it.)

- Not knowing the complexity shifts risk from managers to developers. In organizations with antagonistic relations between managers and developers, a willful ignorance of code complexity pushes risk onto developers. Estimates, if made by managers, will ignore complexity. Estimates made by developers may be optimistic (or pessimistic) but may be adjusted by managers. In either case, schedule delays will be the fault of the developer, not the manager.

- Developers (in shops with poor management relations) may avoid the use of any metrics, fearing that they will be used for performance evaluations.

Looking forward, I can see a time when we do measure code complexity.

- A company considering the acquisition of software (including the source code), may want an unbiased opinion of the code. They may not completely trust the seller (who is biased towards the sale) and they may not trust their own people (who may be biased against 'outside' software).

- A project team may want to identify complex areas of their code, to identify high-risk areas.

- A development team may wish to estimate the effort for maintaining code, and may include the complexity as a factor in that effort.

The tools are available.

I believe that we will, eventually, consider complexity analysis a regular part of software development. Perhaps it will start small, like the adoption of version control and automated testing. Both of those techniques were at one time considered new and unproven. Today, they are considered 'best practices'.

Tuesday, April 16, 2013

File Save No More

The new world of mobile/cloud is breaking many conventions of computer applications.

Take, for example, the long-established command to save a file. In Windows, this has been the menu option File / Save, or the keyboard shortcut CTRL-S.

Android apps do not have this sequence. In fact, they have no sequence to save data. Instead, they save your data as you enter it, or when you dismiss a dialog.

Not only Android apps (and, I suspect, iOS apps) but Google's web apps exhibit this behavior too. Use Google Drive to create a document or a spreadsheet: there is no save command; your changes are stored as you type.

Breaking the "save file" concept allows for big changes. It lets us get rid of an operation. It lets us get rid of menus.

It also lets us get rid of the concept of a file. We don't need files in the cloud; we need data. This data can be stored in files (transparently to us), or in a database (also transparently to us), or in a NoSQL database (also transparently to us).

We don't care where the data is stored, or which container (filesystem or database) is used.

We do care about getting the data back.

I suspect that we will soon care about previous versions of our data.

Windows has add-ins for retrieving older versions of data. I have used a few, and they tend to be "hacks": things bolted on to Windows and clumsy to use. They don't save every version; instead, they keep snapshots at scheduled times.

Look for real version management in the cloud. Google, with its gigabytes of storage for each e-mail user, will be able to keep older versions of files. (Perhaps they are already doing it.)

The "File / Save" command will be replaced with the "File Versions" list, letting us retrieve an old version of the file. The list will show each and every revision of the file, not just the versions captured at scheduled times.


Once a major player offers this feature, other players will have to follow.

Friday, December 30, 2011

The wonder of Git

I say "git" in the title of this post, but this is really about distributed version control systems (DVCS).

Git is easy to install and set up. It's easy to learn, and easy to use. (One can make the same claim of other programs, such as Mercurial.)

It's not the simple installation or operation that I find interesting about git. What I find interesting is the organization of the repositories.

Git (and possibly Mercurial and other DVCS packages) allows for a hierarchical collection of repositories. With a hierarchical arrangement, a project starts with a single repository, and then as people join the project they clone the original repository to form their own. They are the committers for their repositories, and the project owner remains the committer for the top-most repository. (This description is a gross over-simplification; there can be multiple committers and more interactions between project members. But bear with me.)

The traditional, "heavyweight" version control systems (PVCS, Visual SourceSafe, TFS) use a single repository. Projects that use these products tend to allow everyone on the project to check in changes -- there are no committers, no one specifically assigned to review changes and approve them. One can set policies to limit check-in privileges, although the mechanisms are clunky. One can set a policy to manually review all code changes, but the VCS provides no support for this policy -- it is enforced from the outside.

The hierarchical arrangement of multiple repositories aligns "commit" privileges with position in the organization. If you own a repository, you are responsible for changes; you are the committer. (Again, this is a simplification.)

Once you approve your changes, you can "send them up" to the next higher level of the repository hierarchy. Git supports this operation, bundling your changes and sending them automatically.
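That flow can be sketched with plain git and two local repositories standing in for the hierarchy. The "upstream" repository plays the project owner's top-most repository; all names and paths are invented:

```shell
# Sketch of the hierarchy: a project owner's repository, and a team
# member's clone of it. The member commits locally, then sends the
# change up to the parent.
set -e
base=$(mktemp -d) && cd "$base"

git init -q --bare upstream.git          # the owner's ("root") repository
git clone -q upstream.git member         # a team member clones it
cd member
git config user.email member@example.com
git config user.name member

echo "a small change" > notes.txt
git add notes.txt
git commit -qm "member: small change"    # recorded in the member's clone
git push -q origin HEAD                  # then bundled and sent up
```

The member is the committer for their own repository; the push is the "send up" step, and the owner decides what to accept at the level above.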

Git supports the synchronization of your repository with the rest of the organization, so you get changes made by others. You may have to resolve conflicts, but they would exist only in areas of the code in which you work.

The capabilities of distributed version control systems support your organization. They align responsibility with position, pairing greater authority with greater responsibility. (If you want to manage a large part of the code, you must be prepared to review changes for that code.) In contrast, the older version control systems provide nothing in the way of support, and sometimes require effort to manage the project as you would like.

This is a subtle difference, and one that is little discussed. I suspect that there will be a quiet revolution, as projects move from the old tools to the new.

Wednesday, November 30, 2011

Is "cheap and easy" a good thing?

In the IT industry, we are all about developing (and adopting) new techniques. The techniques often start as manual processes, often slow, expensive, and unreliable. We automate these processes, and eventually, the processes become cheap and easy. One would think that this path is a good thing.

But there is a dark spot.

Consider two aspects of software development: backups and version control.

More often than I like, I encounter projects that do not use a version control system. And many times, I encounter shops that have no process for creating backup copies of data.

In the early days of PCs, backups were expensive and consumed time and resources. The history of version control systems is similar. The earliest (primitive) systems were followed by (expensive) commercial solutions (that also consumed time and resources).

But the early objections to backups and version control no longer hold. There are solutions that are freely available, easy to use, easy to administer, and mostly automatic. Disk space and network connections are plentiful.

These solutions do require some effort and some administration. Nothing is completely free, or completely automatic. But the costs are significantly less than they were.

The resistance to version control is, then, only in the mindset of the project manager (or chief programmer, or architect, or whoever is running the show). If a project is not using version control, it's because the project manager thinks that not using version control will be faster (or cheaper, or better) than using it. If a shop is not making backup copies of important data, it's because the manager thinks that not making backups is cheaper than making backups.

It is not enough for a solution to be cheap and easy. A solution has to be recognized as cheap and easy, and recognized as the right thing to do. The problem facing "infrastructure" items like backups and version control is that as they become cheap and easy, they also fade into the background. Solutions that "run themselves" require little in the way of attention from managers, who rightfully focus their efforts on running the business.

When solutions become cheap and easy (and reliable), they fall off of managers' radar. I suspect that few magazine articles talk about backup systems. (The ones that do probably discuss compliance with regulations for specific environments.) Today's articles on version control talk about the benefits of the new technologies (distributed version control systems), not the necessity of version control.

So here is the fading pain effect: We start with a need. We develop solutions, and make those tasks easier and more reliable, and we reduce the pain. As the pain is reduced, the visibility of the tasks drops. As the visibility drops, the importance assigned by managers drops. As the importance drops, fewer resources are assigned to the task. Resources are allocated to other, bigger pains. ("The squeaky wheel gets the grease.")

Beyond that, there seems to be a "window of awareness" for technical infrastructure solutions. When we invent techniques (version control, for example), there is a certain level of discussion and awareness of the techniques. As we improve the tools, the discussions become fewer, and at some point they live only in obscure corners of web forums. Shops that have adopted the techniques continue to use them, but shops that did not adopt the techniques have little impetus to adopt them, since they (the solutions) are no longer discussed.

So if you're a shop and you're "muddling through" with a manual solution (or no solution), you eventually stop getting the message that there are automated solutions. At this point, it is likely that you will never adopt the technology.

And this is why I think that "cheap and easy" may not always be a good thing.