Wednesday, January 31, 2018

Optimizing in the wrong direction

Back in the late 2000s, I toyed with the idea of a new version control system. It wasn't git, or even git-like. In fact, it was the opposite.

At the time, version control was centralized. There was a single instance of the repository and you (the developer) had a single "snapshot" of the files. Usually, your snapshot was the "tip", the most recent version of each file.

My system, like other version control systems of the time, was a centralized system, with the versions of each file stored as 'diff' packages. That was the traditional approach for version control, as a stored 'diff' is smaller than a full copy of that version of the file.
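To make the space argument concrete, here is a minimal sketch that compares storing two full versions of a file against storing the first version plus a diff. It uses Python's standard difflib module. The file contents, the helper names (make_delta, size), and the unified-diff format are assumptions chosen for illustration; they are not the storage format of Amnesia or of any particular version control system.

```python
import difflib

def make_delta(old_lines, new_lines):
    """Return the unified diff between two versions as a list of lines."""
    return list(difflib.unified_diff(old_lines, new_lines, lineterm=""))

def size(lines):
    """Approximate storage cost: characters plus one newline per line."""
    return sum(len(line) + 1 for line in lines)

# A hypothetical 200-line file with a single line changed in version 2.
base = [f"line {i} of the original file" for i in range(200)]
edited = list(base)
edited[100] = "line 100, changed in version 2"

delta = make_delta(base, edited)

full_cost = size(base) + size(edited)   # keep both versions whole
delta_cost = size(base) + size(delta)   # keep version 1 plus a diff

print("two full copies:", full_cost)
print("one copy + diff:", delta_cost)
```

For a small change to a large file the diff is a handful of lines, so the second number is close to the cost of a single copy. The saving shrinks as the change touches more of the file.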

Git changed the approach for version control. Instead of a single central repository, git is a distributed version control system. It replicates the entire repository in every instance and uses a sophisticated protocol to synchronize changes across instances. When you clone a repo in git, you get the entire repository.

Git can do what it does because disk space is now plentiful and cheap. Earlier version control systems worked on the assumption that disk space was expensive and limited. (Which, when SCCS was created in the 1970s, was true.)

Git is also directory-oriented, not file-oriented. Git looks at the entire directory tree, which allows it to optimize operations that move files or duplicate files in different directories. File-oriented version control systems, looking only at the contents of a single file at a time, cannot make those optimizations. That difference, while important, is not relevant to this post.

I called my system "Amnesia". My "brilliant" idea was to remove diffs from the repository over time and thereby use even less disk space. Deletion was automatic, and I let the user specify a set of rules for deletion, so important versions could be saved indefinitely.
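A rule-driven deletion pass of this kind could be sketched as follows. Everything here is hypothetical: the Version record, the two rules (keep tagged versions, keep versions newer than a cutoff), and the prune function are illustrative names, not the actual design of Amnesia. The sketch only selects which versions survive; it does not show how the remaining diffs would be re-chained after a deletion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Version:
    number: int
    committed: datetime
    tagged: bool = False  # hypothetical marker for "important" versions

def keep_tagged(version, now):
    """Rule: a tagged version is kept indefinitely."""
    return version.tagged

def keep_recent(days):
    """Build a rule that keeps versions committed within the last `days` days."""
    def rule(version, now):
        return now - version.committed <= timedelta(days=days)
    return rule

def prune(versions, rules, now):
    """Keep a version if at least one rule wants it; drop the rest."""
    return [v for v in versions if any(rule(v, now) for rule in rules)]

now = datetime(2018, 1, 31)
history = [
    Version(1, datetime(2016, 1, 10), tagged=True),
    Version(2, datetime(2016, 6, 1)),
    Version(3, datetime(2017, 12, 20)),
    Version(4, datetime(2018, 1, 30)),
]

kept = prune(history, [keep_tagged, keep_recent(90)], now)
print([v.number for v in kept])  # prints [1, 3, 4]
```

Version 2 is neither tagged nor recent, so it is the one that gets forgotten, along with whatever record it carried of who changed what.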

My improvement was based on the assumption that disk space was expensive. Looking back, I should have known better. Disk space was not expensive, and it was not becoming expensive; it was getting cheaper.

Anyone looking at this system today would be, at best, amused. Even I can only grin at my error.

I was optimizing, but for the wrong result. The "Amnesia" approach reduced disk space, at the cost of time (it takes longer to compute diffs than it does to store the entire file), information (the removal of versions also removes information about who made the change), and development cost (for the auto-delete functions).

The lesson? Improve, but think about your assumptions. When you optimize something, do it in the right direction.
