Version control (pt. 2): Generations and intended use

In the previous post I gave a little introduction to version control, explaining a few basics that help in understanding the topic. This post assumes you know those things that version control systems have in common. So now it’s time to discuss what sets them apart – not individually, yet, but in terms of characteristics that some of them share with each other.

Generations

Each version control system (VCS) has its unique advantages and disadvantages. However, there are some traits that are common to several of them. And since the programs sharing those traits were typically released fairly close to each other, it makes sense to speak of generations of version control systems.

So far there are three of them, with the first obviously being the oldest and the third generation the newest. What may surprise you is that the earlier generations, despite being much more limited, have not completely disappeared. How come? Well, just keep that question in mind while we take a look at those generations. Perhaps you will see where tools of the older generations still make sense!

The previous article discussed manual “version control” and its limitations. In short: it can work for you if your project is rather small, you’re working on it alone and you have the discipline to always make a proper backup copy before bigger changes. In a world of more sophisticated software it’s quite unlikely that many projects meet all of those requirements. So it makes a lot of sense to develop programs that assist you in doing proper version control on your projects.

The first generation

The first generation did exactly that: it preserved any changes you made to a single file. Working on a per-file basis is in fact one characteristic trait of the first generation: version control is entirely separate for each and every file you choose to record changes for. For each file it manages, the VCS creates a “history file” which contains all the differences from one version to the next, plus a comment for each revision.
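
To make the idea of a history file a little more concrete, here is a minimal Python sketch of the principle – not the actual storage format of tools like SCCS or RCS, and all names are made up for illustration. Every revision is stored as a diff against the previous version, together with its comment:

    import difflib

    def record_revision(history, old_text, new_text, comment):
        """Store the delta between two versions of a file plus a comment."""
        delta = difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile="previous", tofile="current",
        )
        history.append({"comment": comment, "delta": "".join(delta)})

    history = []                      # stands in for the "history file"
    v1 = "int main(void) { return 0; }\n"
    v2 = "int main(void) { return 1; }\n"
    record_revision(history, "", v1, "initial version")
    record_revision(history, v1, v2, "change the exit code")
    print(history[1]["comment"])
    print(history[1]["delta"])

Storing only the deltas keeps the history file small while still allowing any old version to be reconstructed by applying the diffs in order.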

Originally, it was common to use first-generation VCS on multiuser systems. If multiple users can work on the same project at the same time (via different logins), conflicts can easily arise: if two people make changes to the same file, the one who saves last “wins”, overwriting all changes that somebody else may have made in the meantime. To avoid that, locking was invented. A file can be locked while it is being edited. If somebody locks a file and then moves on to something else, the file remains locked; an administrator can, however, break the lock if that happens.
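
Conceptually, a lock is nothing more than a marker saying “somebody is editing this file right now”. The following Python sketch illustrates the idea with a simple lock file; the names and the mechanism are invented for illustration and are not how first-generation tools actually implement locking:

    import os

    def acquire_lock(path, user):
        """Try to lock a file by creating <path>.lock; fail if it already exists."""
        try:
            fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, user.encode())
            os.close(fd)
            return True
        except FileExistsError:
            return False              # somebody else is editing the file

    def break_lock(path):
        """What an administrator would do if a stale lock was left behind."""
        os.remove(path + ".lock")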

Today, VCS of the first generation are more or less obsolete, but there are niches where they have survived and are still being used. One example is the management of configuration files on *nix systems without centralized configuration management. Those configuration files are only relevant to the system they live on, so the missing networking capabilities (which the second generation introduced) are not a disadvantage. And most of the time such configuration files are separate entities unrelated to other files, so even the limitation of only managing a single file is not a problem here.

The second generation

There are two important aspects that set tools of the second generation apart from those of the first: they offer network capabilities and they can manage multiple files in one project! The latter ability solves a whole bunch of problems which made first-generation tools hard to work with on anything but very small projects. Managing each file separately does not sound too bad at first. But think about it for a minute.

Let’s imagine we work on a simple project. Nothing too fancy: a few source code files and one header file. Currently the program is broken and we decide to go back to a working version. Good thing that we have version control, right? Right… sort of. The file main.c is currently at revision 96, foo.c at revision 44, bar.c at 24 and baz.h at revision 7. See the problem? After we found out that revision 89 broke the program and reverted to 88, how do we find out which revisions of the other files belong to revision 88 of main.c? Yes, we have timestamps, so we can work out which revisions all of our files had back when the program still worked and the main file was at revision 88. Maybe it’s not even that bad with only four files. But what if we have 20? 100? It’s cumbersome and really a waste of time. Keep situations like this in mind and you’ll definitely come to appreciate the ability to manage multiple files together in one project, where a single revision number increases no matter which file was changed or how many files were modified!

Now that larger projects are possible because the whole project is managed together in one repository, it makes sense to use the network as well. This networking capability is achieved by providing one centralized repository from which all project members (or even everybody interested in the project) can check out a local working copy of the latest revision (or any older one if needed). Changes are made locally and are then committed, i.e. checked back into the centralized repository. Since tools of the second generation only allow a check-in if nobody else checked in in the meantime (if somebody did, you need to update your working copy first and merge your changes with the remote ones wherever the same file was modified by both), locking is not necessary anymore!
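
To illustrate the typical second-generation workflow, here is a rough sketch using Subversion as an example. The repository URL is made up, and the sketch assumes svn is installed and you have commit access:

    import subprocess

    repo = "https://svn.example.org/project/trunk"   # made-up repository URL

    # get a local working copy of the latest revision
    subprocess.run(["svn", "checkout", repo, "project"], check=True)

    # ... edit files in project/ ...

    # bring the working copy up to date first; if somebody committed in the
    # meantime, their changes are merged into ours (or a conflict is reported)
    subprocess.run(["svn", "update"], cwd="project", check=True)

    # check our changes back into the central repository
    subprocess.run(["svn", "commit", "-m", "fix the bug in foo.c"],
                   cwd="project", check=True)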

Today, tools of the second generation still play an important role. Their appeal is declining, however, due to a few shortcomings which the third generation tries to address.

The third generation

The big innovation common to all tools of the third generation is that they work in a decentralized manner. Users usually don’t check out files from a central (probably remote) repository. Instead they clone the full repository and then check out the files from their local clone. Since the local repository is exactly the same as the original one, there’s no longer one central repository – at least from a technical point of view. And while cloning requires transferring a lot more data over the wire (especially for large projects), there are some huge benefits to it.

If you have a local clone, you can work on the project even when you’re not online. You can browse the complete history and check out earlier revisions if you need to – all without having to access a central repository. And whenever you are online, you can sync your local clone with the original repo (pull down changes) or, if you have write access, the original one with your local repository (push up changes).
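
The corresponding third-generation workflow could look like this, sketched with Git (again, the URL is made up and the sketch assumes git is installed and you may push to the original repository):

    import subprocess

    origin = "https://git.example.org/project.git"   # made-up repository URL

    # clone the full repository, complete history included
    subprocess.run(["git", "clone", origin, "project"], check=True)

    # ... edit files and commit locally, even while offline ...
    subprocess.run(["git", "commit", "-a", "-m", "add a feature"],
                   cwd="project", check=True)

    # once online again: sync with the original repository
    subprocess.run(["git", "pull"], cwd="project", check=True)   # fetch and merge others' changes
    subprocess.run(["git", "push"], cwd="project", check=True)   # publish our own changes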

One of the biggest advantages of decentralized tools is that they make forking much easier. While forking a project used to be frowned upon, today you’ll often see projects inviting you to fork and play with their code (e.g. the well-known “fork me on GitHub”). Experience has shown that quite a few people fork a project, add a feature they need and then give their code back to the project (by creating a pull request, which invites the maintainers of the original project to pull in the changes if they want).

Which one to use?

If you like software history, there’s nothing wrong with trying out tools of the older generations. But if you’re just starting out with version control and want to learn something now, it makes sense to choose a tool of the third generation. Which one would I recommend? I can only give the usual answer to such a question: it depends. Each one has its strengths and weaknesses. In the next blog posts we’ll take a closer look at some of the open source VCS of all generations. That might help you choose the right one for your purpose.

Version control (pt. 1): An introduction

This post is the first part of a series on version control. It provides an introduction by explaining what version control actually is, why you should probably use it and how it works in general.

Important terms

Version control (also: revision control) is a means of preserving various versions of a file or of multiple files. This can be done in a lot of different ways, but over time some best practices have emerged that are followed more or less in all modern version control systems.

There are lots of cases where version control makes sense. One of the most common ones is software development where using version control is virtually mandatory. That is why there’s even a separate term describing this form of version control: source (code) control or source code management.

We can sort the various version control systems into two groups: local and network-based systems. The latter can be further differentiated into centralized and distributed ones.

In short: Why you may need it

Depending on the tool you choose, there are various situations where you can benefit from version control:

  • Version control enables you to quickly return to an older, known-good state in case something broke
  • Version control gives you the possibility to precisely document any change you make and thus provides both traceability and a quick overview of changes
  • Version control solves a lot of the problems that occur if multiple people work on the same files at the same time
  • Version control makes it simple to clearly find out the author of any change

Or as a former colleague of mine put it (in a very vivid way):
Why to use version control!

Manual version control

The simplest form of version control is working with a backup copy. You make a copy of e.g. a configuration file before changing the live file. Afterwards you test your changes; if they seem to work, you either delete the backup or keep it for reference. If the changes had undesired effects, the original is overwritten with the backup (which is then usually deleted again). This is actually a (rather primitive but sometimes sufficient) form of version control: thanks to the backup copy you have two versions of the file at hand!

Backup copy – we’ve all done it
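
A minimal Python sketch of that backup-copy routine might look like this (the path of the configuration file is made up):

    import shutil

    conf = "/etc/example.conf"            # made-up configuration file
    shutil.copy2(conf, conf + ".bak")     # keep the known-good version

    # ... edit and test /etc/example.conf ...

    # if the change turned out to be a bad idea, restore the backup:
    shutil.copy2(conf + ".bak", conf)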

Another variant is to make a copy after a fixed amount of time (or at random). Often people prepend the date to the file name. If all you want to accomplish is that e.g. the data as it was on the first of each month is preserved for one year, that’s also a sufficient method (along with rotating the backup copies so that you don’t keep more of them around than you need).
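
Here is a small sketch of such date-stamped copies with a simple rotation, keeping at most twelve copies of a made-up data file:

    import shutil
    from datetime import date
    from pathlib import Path

    data = Path("data.csv")                                   # made-up data file
    backup = data.with_name(f"{date.today():%Y-%m-%d}_{data.name}")
    shutil.copy2(data, backup)                                # e.g. 2024-06-01_data.csv

    # rotation: keep at most twelve date-stamped copies
    copies = sorted(data.parent.glob(f"*_{data.name}"))
    for old in copies[:-12]:
        old.unlink()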

Those means of manual version control are however pretty limited. And worse: There’s plenty of room for making mistakes!

Manual version control

How a version control system (VCS) works

Let’s think about a simplified versioning process by pretending to do things by hand. For each recorded change of a file you’d make a copy of the file and keep all the old versions of it. A VCS does not forget a single version of a file it monitors – that’s what it is meant to do, after all. Each new version of the file gets a comment that briefly sums up the changes that were made.

Keeping dozens or even hundreds of copies around just because one file has that many revisions would really clutter your disk. Also, it would not make sense to store nearly the same file twice if only one line was changed! That’s why local VCS, which also version on a per-file basis, keep a “history file” around for each versioned file. That file records only the changes between the various versions, together with the comments.

Network-based VCS are able to organize a whole project (multiple files together instead of each one separately). They, too, record the changes between revisions rather than full copies of the files, along with the comments. All of that data is collected in a so-called repository.

If a new team member wants to start working on the project, he or she first needs to get the files of that project. With a centralized VCS this is done by checking out the most current revisions of all project files from the remote repository. By doing so, a local working copy of the project files is created for the user to work on. When using a distributed VCS, the remote repository is cloned instead (thus receiving the full repository with all revisions and not just the most current version of each file). The working copy is then checked out from the local repository clone.

At the beginning of a new project there is no repository yet. In this case an empty repository is created and checked out, and the new working directory is populated with files. In a next step, those files are placed under version control (which means that the VCS is told to watch them and record changes). Then all changes (in this case all of the files, since they are all new) are put into the repository by doing a commit.
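
Sketched with Git as one possible example, starting a new project could look roughly like this (the directory name and commit message are made up; a distributed tool like Git creates the repository right inside the working directory, so no separate checkout is needed):

    import os
    import subprocess

    os.makedirs("project", exist_ok=True)                       # the new working directory
    subprocess.run(["git", "init"], cwd="project", check=True)  # create an empty repository in it

    # ... create the first project files in project/ ...

    subprocess.run(["git", "add", "."], cwd="project", check=True)   # place them under version control
    subprocess.run(["git", "commit", "-m", "initial import"],
                   cwd="project", check=True)                        # record them with a commit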

After every change made to the project you commit again, recording the changed state inside the repository. Other project members can now get a current copy from the repository. This way it’s easy to work together on the same project without the risk of (unknowingly) getting in the way of somebody else.