When working on a software project, a few things become really painful if we just manage all the files of our project manually. The first is simply keeping backups so we don't lose our work, but beyond that we also want the ability to track and revert any changes we make. As we go through different release cycles of our software, we make changes, we fix bugs and so forth, and it would be really nice to be able to go back and look at old code. Or if we somehow introduce a new bug into our code, we want the ability to revert to some earlier version that didn't have that bug.
Manually managing all the files also makes basic synchronization painful, that is, synchronizing with other people. When you have many people working together on the same project, things get ugly really fast if everyone has to manually send all their changes to everyone else. And one thing that it makes nearly impossible is keeping track of who is responsible for which changes, tracking the so-called ownership of the code, aka the blame for the code. Who wrote this line of code? When did they write it? Why did they write it? That's all stuff you very often want to know.
And finally, if we're manually managing all the files in our project, that makes it difficult to do what's called branching. That is, to take your work and split it off into different lines of development. So in this branch, we're working on this feature, and in this other branch, we're just fixing bugs. And when the feature is done and ready, we'll at some point merge the two branches together, effectively integrating the new feature and all those bug fixes into one version of the code. So making backups, tracking and reverting changes, synchronizing, tracking ownership, branching: doing all these things manually is extremely ugly and error prone.
So instead, programmers use software tools called version control systems, which are so called, of course, because mainly what they're about is tracking the changes to the files and directories in our projects. In the terminology of version control, we have what is called the working directory or the working copy or the working set. I'll usually say working directory, which is simply the directory being tracked by the version control system. It's the directory which contains all the files of our project, and it's called the working directory because it's where we do our work. We go in there, we create new files, we delete files, and we edit files. And then having done a few hours' worth or a couple of days' worth of work, good practice is to take a snapshot of our working directory to record a version or revision of our project. And these revisions, these snapshots, get stored in what's called the repository. Once a version is checked in, as we say, to our repository, we can come back later and check out that version, that revision, and what that means is to copy the revision from the repository back into our working directory.
So be clear that the version control system doesn't take a continuous history of our changes. It doesn't watch us as we work and record every single little thing we do. We have to manually check in a new snapshot. We have to make new revisions in the repository. But if we do this on a regular basis, then that's generally good enough. We can go back and work with our older revisions.
Now there are many different version control systems out there, but they generally fall into two broad categories. Most of the older version control systems are centralized, meaning that in the typical workflow, everyone working on the project has their own working directory, but they all synchronize their work through a central repository on a server.
In contrast, most newer version control systems are distributed, meaning that everyone has their own working directory, but they also each have their own repository. And the snapshots, the revisions, created in these local repositories get pushed and pulled, meaning sent and requested, from one repository directly to another. So Andrew, working in his working directory, checks in his changes to his local repository, and then he can push that revision into, say, Lisa's repository. Or he can retrieve revisions, he can pull revisions, from Lisa's repository. Now, even in a distributed system, if you have many people working together, you'll generally want some central repository through which everyone can sync. Otherwise revisions would have to be sent individually out to everyone else, which is inefficient and cumbersome. Notice in the diagram here that the central repository generally doesn't have a working directory of its own, because no one would use it. It would just be a waste of space.
Now, as for the actual version control systems in use, here's a brief history. Concurrent Versions System, aka CVS, was created back in about 1990, and it was the dominant open source version control system for about 12 or 13 years, until it was supplanted by Subversion, also known as SVN, which was released back in 2000 and quite quickly took over the market from CVS. Meanwhile, up until 2002, the Linux kernel project itself didn't really use version control at all. They had a whole mess of patches which they would maintain, and that situation got progressively messier until, in 2002, the Linux kernel adopted a distributed version control system called BitKeeper.
The problem there was that BitKeeper was a proprietary system, and though the Linux kernel was granted a free license to use BitKeeper, this license was revoked a few years later, and so Linux needed a new version control system. So Linus Torvalds himself actually sat down and created a new system which he called Git, which was first released in 2005 and has been used for the Linux kernel and many other projects since. In that same year, 2005, another distributed version control system was released called Mercurial, which is abbreviated as hg because Hg is the chemical symbol for mercury.
Now, here in 2012, CVS is pretty much dead, having been totally eclipsed by Subversion, and Subversion, I would say, is actually now on the decline because it's being eclipsed by distributed version control, namely Git and Mercurial. While distributed systems are conceptually a bit trickier to understand than centralized systems, in practice it's probably best to start today by learning a distributed system. And if we have to choose between Git and Mercurial, well, the two at their core are conceptually very, very similar, but Mercurial in total does have an edge in terms of simplicity. So Mercurial is the system we'll be learning how to use.
Before we get into Mercurial, however, there's some conceptual ground we have to cover concerning what are called diffs and patches. Diff, short for difference, is a standard Unix utility which analyzes two files and produces from them what's called a diff or a patch file, which represents the minimal set of changes it takes to get from one file to the other. So you have two similar files, X and Y, which are close but not exactly the same. They have some lines in common but then other lines that differ. If you take the diff of X and Y, what you get is a file that represents all the changes you have to make to X to produce Y.
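To make the idea of a diff of X and Y a little more concrete, here is a minimal sketch in Python. It uses the standard library's difflib module as a stand-in for the Unix diff utility, and the two tiny "files" (the lists x_lines and y_lines) are made up purely for illustration. Note that difflib's matching strategy is not guaranteed to give the smallest possible set of changes, so treat this as an approximation of the idea rather than the exact tool described above.

```python
import difflib

# Two hypothetical "files", each represented as a list of lines.
x_lines = ["red", "green", "blue"]
y_lines = ["red", "yellow", "blue"]

# unified_diff yields the lines of a patch describing how to turn X into Y.
# lineterm="" stops difflib from appending newlines to the header lines.
patch = list(
    difflib.unified_diff(x_lines, y_lines, fromfile="X", tofile="Y", lineterm="")
)

for line in patch:
    print(line)

# Lines starting with "-" are dropped from X, lines starting with "+" are
# added to produce Y, and lines starting with a space are common to both.
```

The unchanged lines "red" and "blue" appear in the patch only as context; the actual changes are the removal of "green" and the addition of "yellow".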
To understand this, let's look at how the algorithm which produces these diffs, these patches, actually works. The most commonly used diff algorithm works by finding the so-called LCS, the longest common subsequence. In case you're not clear on the math terminology, a sequence is a series of elements of data in a particular order. A subsequence is just a selection of elements from that original sequence, kept in the same order as in the original sequence.
For simplicity, we'll demonstrate with sequences that are made up just of letters. Here we have some original sequence that reads A, B, A, G, H, B, G, G. Notice that it has repeating elements. It's possible to have multiple A's, multiple G's, multiple B's, etc. And while we could swap the positions of any two equivalent elements, say swap the A here for the other A, that doesn't matter because they're both the same value. Otherwise, though, you can't move things around and still have the same sequence. It would be a different sequence if we did.
And then we have two example subsequences of this original sequence. We have A, B, G, G, G, and we have B, A, H, B. A subsequence is basically formed by taking the original sequence, deciding which elements we want to keep and which we don't, and removing all the stuff we want to drop. What we're left with, in order, is our subsequence. And understand that the full original sequence is considered a subsequence of itself, and the null sequence, the sequence of no elements, is also technically considered a subsequence of any other sequence.
So it works out that the number of ways to select a subsequence from any sequence is 2 to the nth, where n is the number of elements in the original sequence, because every additional element in the original sequence doubles the number of possibilities: each element is either kept or dropped. Here the original sequence has 8 elements, and 2 to the 8th is 256, so there are 256 different selections for this one sequence. When elements repeat, as they do here, some of those selections spell out the same subsequence, so the number of distinct subsequences can be lower.
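The definitions above can be checked with a short Python sketch. The helper names is_subsequence and all_selections are made up for this illustration; the sequence and the two example subsequences are the ones from the discussion above.

```python
from itertools import combinations


def is_subsequence(candidate, original):
    """Return True if candidate can be formed by dropping elements of original."""
    remaining = iter(original)
    # "element in remaining" consumes the iterator up to the first match,
    # so each match has to come after the previous one (order is preserved).
    return all(element in remaining for element in candidate)


def all_selections(original):
    """Yield every keep/drop selection of original, one per set of positions."""
    n = len(original)
    for size in range(n + 1):
        for positions in combinations(range(n), size):
            yield "".join(original[i] for i in positions)


original = "ABAGHBGG"

print(is_subsequence("ABGGG", original))  # True
print(is_subsequence("BAHB", original))   # True
print(is_subsequence("GA", original))     # False: no A comes after any G

selections = list(all_selections(original))
print(len(selections))                    # 256, that is 2 to the 8th

# Because of the repeated letters, several selections spell the same
# subsequence, so the set of distinct subsequences is smaller than 256.
distinct = set(selections)
print(len(distinct) < len(selections))    # True
```

Both the empty sequence and the full original sequence show up among the selections, matching the point that each counts as a subsequence.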
So now you should understand what a sequence and a subsequence are. As for the LCS, the longest common subsequence: here we have two sequences, sequence 1 and sequence 2. And between them we have a number of common subsequences, subsequences which are found in both. For example, GH is a subsequence found in both sequence 1 and sequence 2. The longest common subsequence, then, is just the common subsequence which has the greatest number of elements. In this case, though, we have a tie. We have the subsequence BGH and the subsequence EGH, both of which have 3 elements. For the purposes of the diff algorithm, it generally doesn't matter which we use, so we can pick either. Let's just go ahead and take the first one. Let's pick BGH as our longest common subsequence.
You may notice, though, that in sequence 2 the subsequence BGH actually occurs twice. We can select the first G in sequence 2 or we can use the second G. It doesn't matter, so again we have a choice. We can go with BGH where the G is the first G, or we can go with BGH where the G is the second G. We'll just go ahead and use the first one.
If we then line up the two sequences such that the elements of the longest common subsequence line up, this is what we get. Just visually now, you can see the minimal set of changes you'd have to make to get from sequence 1 to sequence 2. First you'd have to drop AE from the front. Then you'd have to drop the A and replace it with CD. Then you'd have to add EGJ between the G and the H of the common subsequence, and at the end you'd remove BGG. That is the diff, the minimal set of changes.
Now, to express the diff from sequence 1 to sequence 2 more formally, we would need to record where in the original sequence these changes need to be made. So, quickly inventing our own little notation scheme: here 1 minus AE means at position 1 in the original sequence, drop A and E. 4 minus A means at position 4, drop the element A.
And then 4 plus CD means at position 4 in the original sequence, add in C and D, and so forth. So we have to denote which elements are involved, whether we're dropping them or adding them, and at what position we're dropping or adding them. A subtlety is that when it comes to denoting where to drop an element, we just use the index of where that element actually is in the original sequence, but when it comes to adding elements, the number denotes the position in the original sequence in front of which these elements are being added. So when we write 6 plus EGJ, that inserts those three elements immediately in front of the sixth element of the original sequence, so right in front of the H.
So again, given this diff, this list of changes to make, and given only sequence 1, we could then reconstruct sequence 2. That's the whole point of the diff.
And be very clear that diffs are not commutative. The diff from sequence 1 to sequence 2 is not the same as the diff from sequence 2 to sequence 1. While it has the same number of lines and those lines have the same elements in them, all the pluses and minuses get swapped and the positions change, because these numbers now denote positions in sequence 2, not in sequence 1. So here when we write 1 plus AE, that means right in front of element 1, put an A and an E. And notice that the last line specifies a position of 9. Well, there isn't any element 9 because there are only 8 elements here, but we're going by the convention that elements are added in front of the specified position. So 9 plus here means to tack on to the end.
So this is how you produce diffs. Given two sequences of data, first find the longest common subsequence and then record all the differences between all the other elements.
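Here is a rough Python sketch of applying a diff written in this little notation. A few assumptions to flag: the two sequences are not spelled out in the text, so sequence 1 is taken to be AEBAGHBGG and sequence 2 to be BCDGEGJH, which is what the changes described above imply. The function name apply_diff and the tuple format (position, sign, elements) are invented for this illustration, and the middle entries of the reverse diff are one plausible way of writing it, not something stated in the lecture.

```python
def apply_diff(original, diff):
    """Apply (position, sign, elements) changes to original.

    Positions are 1-based and always refer to the original sequence.
    "-" drops the elements starting at that position; "+" inserts the
    elements in front of that position (len(original) + 1 means append).
    """
    dropped = set()
    inserted = {}
    for position, sign, elements in diff:
        if sign == "-":
            for offset, element in enumerate(elements):
                index = position + offset
                # Sanity check: the diff must match the sequence it is applied to.
                if index > len(original) or original[index - 1] != element:
                    raise ValueError(f"expected {element!r} at position {index}")
                dropped.add(index)
        elif sign == "+":
            inserted.setdefault(position, []).extend(elements)
        else:
            raise ValueError(f"unknown sign {sign!r}")

    result = []
    # Walk one position past the end so trailing insertions are included.
    for position in range(1, len(original) + 2):
        result.extend(inserted.get(position, []))
        if position <= len(original) and position not in dropped:
            result.append(original[position - 1])
    return "".join(result)


sequence_1 = "AEBAGHBGG"  # assumed reconstruction of sequence 1
sequence_2 = "BCDGEGJH"   # assumed reconstruction of sequence 2

diff_1_to_2 = [(1, "-", "AE"), (4, "-", "A"), (4, "+", "CD"),
               (6, "+", "EGJ"), (7, "-", "BGG")]
diff_2_to_1 = [(1, "+", "AE"), (2, "-", "CD"), (2, "+", "A"),
               (5, "-", "EGJ"), (9, "+", "BGG")]

print(apply_diff(sequence_1, diff_1_to_2))  # BCDGEGJH
print(apply_diff(sequence_2, diff_2_to_1))  # AEBAGHBGG
```

Both diffs have five entries with the same elements, but the signs are swapped and the positions differ, which is the sense in which the two directions are not interchangeable.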
Of course, what we glossed over was how to actually find the longest common subsequence given two sequences, but that's beyond our scope, and it's really not important here, because the essential point is that the LCS identifies the parts that don't change, which lets us identify the minimal set of changes.
When it comes to producing diffs on real data, we have to decide what the elements are that make up our sequence of data. When it comes to text files, it almost always makes the most sense to treat the lines as the elements which make up the sequence of data.
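For the curious, here is one common way to find a longest common subsequence, a dynamic programming table, sketched in Python and used to diff lists of lines. This is only an illustration under stated assumptions: it is not necessarily the method real diff tools use, the names lcs_table and line_diff are made up, and the output reuses the invented notation from earlier (1-based position in the old sequence, plus or minus, one element per change).

```python
def lcs_table(a, b):
    """table[i][j] is the length of the LCS of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def line_diff(old_lines, new_lines):
    """Return (position, sign, line) changes that turn old_lines into new_lines."""
    table = lcs_table(old_lines, new_lines)
    changes = []
    i = j = 0
    while i < len(old_lines) and j < len(new_lines):
        if old_lines[i] == new_lines[j]:
            i += 1  # part of the common subsequence: nothing to record
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            changes.append((i + 1, "-", old_lines[i]))  # drop this old line
            i += 1
        else:
            changes.append((i + 1, "+", new_lines[j]))  # add in front of old line i + 1
            j += 1
    for k in range(i, len(old_lines)):  # leftover old lines get dropped
        changes.append((k + 1, "-", old_lines[k]))
    for k in range(j, len(new_lines)):  # leftover new lines go at the end
        changes.append((len(old_lines) + 1, "+", new_lines[k]))
    return changes


old = ["apple", "banana", "cherry"]
new = ["apple", "blueberry", "cherry", "date"]
print(line_diff(old, new))
# [(2, '-', 'banana'), (3, '+', 'blueberry'), (4, '+', 'date')]
```

Because strings are sequences too, the same lcs_table function works on letters as well as lines; the elements just have to be comparable for equality.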