So, given these two text files, text1 and text2, and finding the diff that gets us text2 from text1, the longest common subsequence has two lines: first, roses are red, and then, and so are you. So we end up with a diff with one deletion and one insertion. First delete the line violets are green, and then insert the two lines violets are blue and sugar is sweet at the same position. Now, a somewhat unintuitive consequence of the way diffs work is that when you edit a file and then diff that file with an earlier version, the lines which you've changed get expressed in the diff as both a dropped line and an added line. So here, when we have text1 where it says sugar is sweat, and text2 where it says sugar is sweet, the diff from text1 to text2 reads: first, drop the line sugar is sweat, and then add in the line sugar is sweet. Even though just a single character was changed, it was expressed very verbosely in the diff as dropping a whole line and adding in a whole line, which probably strikes you as pretty wasteful, but that's how it's done. Now, the format for diffs which I've shown so far is just a syntax which I made up. The actual format used by the diff utility looks like this. The line 2c2,3 denotes that the following changes apply to line 2 in the original file and the range of lines from 2 to 3 in the modified file. And the c here stands for change, meaning we're both removing lines and adding lines. The removed lines are then expressed starting with a left angle bracket. And then, after a divider line with three hyphens, the added lines are expressed starting with a right angle bracket. Strictly speaking, there's some superfluous information here. The c denoting that this is a change rather than just an addition or deletion of lines is not necessary, because the fact that lines are being added and removed is expressed by the angle brackets.
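As a rough illustration of how a line diff can be derived from the longest common subsequence, here is a minimal Python sketch. The function names `lcs_table` and `line_diff` are made up for this example, the file contents are reconstructed from the lines mentioned above, and the output uses the informal drop/add syntax rather than the real diff format. The actual diff utility uses a more refined algorithm, so treat this only as a sketch of the idea.

```python
def lcs_table(a, b):
    """table[i][j] holds the LCS length of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def line_diff(a, b):
    """Return (op, line) pairs: ' ' for kept, '-' for dropped, '+' for added."""
    table = lcs_table(a, b)
    ops = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Line is part of the common subsequence: keep it.
            ops.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            # Dropping a[i] loses nothing from the LCS.
            ops.append(("-", a[i]))
            i += 1
        else:
            ops.append(("+", b[j]))
            j += 1
    # Whatever is left on either side is a pure deletion or insertion.
    ops.extend(("-", line) for line in a[i:])
    ops.extend(("+", line) for line in b[j:])
    return ops


# File contents reconstructed from the example in the text.
text1 = ["roses are red", "violets are green", "and so are you"]
text2 = ["roses are red", "violets are blue", "sugar is sweet", "and so are you"]

for op, line in line_diff(text1, text2):
    print(op, line)
```

Run on the two example files, this keeps roses are red and and so are you, drops violets are green, and adds violets are blue and sugar is sweet. A one-character edit such as sweat to sweet still comes out as one dropped line plus one added line.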
The divider line with the three hyphens isn't strictly necessary either. It's just there for visual clarity, really. And then also the range of lines in the modified file, 2,3, that's not necessary either. It just makes the diff more readable. For another example, here the diff between text1 and text2 denotes two separate changes: first an addition of the lines Lance and John, and then a deletion of the line Alice. Where it says 0a1,2, the a denotes that this is an addition, 1,2 denotes the position of the lines in the modified file, text2, and 0 denotes the position at which to insert these lines. In this format, the added lines get inserted after the specified line. So to insert something at the front, you specify line 0, even though, of course, there isn't a line 0. And the deleted line Alice here is preceded with 2d3: d meaning delete, and 2 specifying the line in the original file which we're deleting. The 3 here specifies the position, in the modified file, of the line before the deleted line. So the line before Alice here is Ted, and in the modified file, Ted is at line 3. This format for diffs is actually just the default used by the diff utility. There are a couple of others which are actually more commonly used. One of them is called the context format, so called because when it specifies lines to add and remove, it provides extra context lines around those additions and deletions. This extra context helps in two ways. First, it can help in cases where you're applying a patch to a file which is not exactly the same as the original file from which the diff was produced. So if you produce a diff but then make a handful of edits on the original file, with this context format especially, you can often get away with still applying the patch, even on top of those changes, because the extra context helps the patch utility make smart decisions when the original file is not in the precise state it was in when the patch was produced.
The other purpose of this extra context is that it helps the patch utility detect when the patch is being applied erroneously to the wrong file. So if you produce a diff from A to B and then try to apply the patch onto some unrelated file C, the context format allows the patch utility to detect such erroneous cases. So looking at this example context diff, it starts out with a line with three asterisks and then the name of the original file and its timestamp. The second line has three hyphens followed by the name of the modified file and its timestamp. And then the actual content of the diff is divided into sections called hunks, each denoted by a series of asterisks. In the hunk here, the line with three asterisks and 1,4 begins a section that removes lines, and the line with three hyphens and 1,5 begins a section that adds lines. The lines which actually get removed begin with a hyphen, or a minus sign if you prefer, and the lines which actually get added begin with a plus sign. All the other lines there, where it says Ted, Yuri, Nadine, are just context. The 1,4 between the asterisks is the range of lines in the original file which that section covers, and the 1,5 is the range of lines in the modified file which the following section covers. So that's the context format. You'll note it has a good bit of redundancy: some context lines are needlessly expressed twice, in different sections. To fix that redundancy there's a third diff format, called the unified format. In the unified format, adjacent sections of additions and deletions get collapsed into one section. So here again at the top we're denoting the name of the original file and its timestamp, and the modified file and its timestamp. Though this time, quite confusingly, the three hyphens denote the original file and three plus signs denote the modified file.
And then each section, each hunk, begins with a line surrounded by double at signs. And inside are the two pairs of numbers, 1,4 and 1,5. Even though they superficially resemble the ranges we saw in the other formats, what they actually mean is that the first number is the starting line and the second number is the number of lines at that position. So -1,4 means four lines starting at line 1 in the original file, and +1,5 means five lines starting at line 1 in the modified file. So this unified format provides the advantage of the context format, in that it has that extra context, yet it's considerably less verbose. It's this unified format which tends to be used most commonly these days. Now, when it comes to producing diffs for binary files, things are a bit more problematic. First off, it's not so clear how to logically group a binary file into a sequence of elements. When we find the longest common subsequence, do we consider each byte to be an element of the sequence, or do we consider some arbitrarily sized chunks to be the elements of the sequence? Any choice we make would work, in that it would produce a diff, but the question is which would be most efficient: which would produce the smallest diffs and require the least amount of processing. Where things get especially ugly is with compressed files, meaning both zip archives and compressed tar archives but also most compressed media formats like MP3s or H.264 video. The problem with compression is that, by its nature, when you make one small change to your data and then have it compressed again, say you edit your audio file, you cut out a little snippet or you add a little snippet, and then you compress it again, well, the resulting file, bit for bit, tends to change quite radically from the original. Basically, small changes ripple out to the rest of the file. That's just the nature of compression.
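To get a feel for how a small change ripples through compressed data, here is a small Python experiment. It uses `zlib` (DEFLATE) purely as a stand-in for the media codecs mentioned above, and the sample data and the helper `differing_fraction` are made up for this illustration. The exact numbers printed depend on the input and on the zlib build, so none are quoted here.

```python
import zlib


def differing_fraction(x: bytes, y: bytes) -> float:
    """Fraction of byte positions at which x and y differ.

    Positions that exist in only one of the two inputs count as differing.
    """
    longest = max(len(x), len(y))
    if longest == 0:
        return 0.0
    same = sum(1 for p, q in zip(x, y) if p == q)
    return (longest - same) / longest


# Made-up sample data: a few thousand short text records.
original = "\n".join(f"sample {i}: value {i * i % 97}" for i in range(2000)).encode()

# Flip a single bit near the start of the data.
changed = bytearray(original)
changed[10] ^= 0x01
edited = bytes(changed)

raw = differing_fraction(original, edited)
packed = differing_fraction(zlib.compress(original), zlib.compress(edited))

print(f"uncompressed: {raw:.4%} of bytes differ")
print(f"compressed:   {packed:.4%} of bytes differ")
```

The uncompressed inputs differ in exactly one byte. Comparing that with the share of differing bytes in the compressed outputs gives a rough sense of why byte-level diffs of compressed files tend to save little.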
So if we're working, say, on a piece of audio data, and as we work on it we keep changing it and we want to keep a record of those changes, keeping that record with diffs wouldn't actually save anything. You'd be best off just preserving each individual version in whole. So it works out that binary diffs are less generally useful than textual diffs. Recording the version history of binary files with diffs generally just wastes processing power without really saving much space, if any. The two cases where binary diffs do end up being useful are, first, diffing between executable files and, second, remotely syncing a bunch of files. One popular binary diff tool is called bsdiff, bs standing for binary software, because bsdiff is optimized for the case of diffing files of machine code, like executable files. The bsdiff algorithm doesn't work precisely like the LCS algorithm we saw in diff, but it starts from the same basic principle of trying to identify areas of commonality between the two files. It just does that in a more sophisticated way, and then, having found the areas of commonality and the areas of near commonality, it produces a patch file from that. For remote syncing of files, a popular tool in Unix is called rsync, standing for remote sync. The idea of rsync is that you have two directories or two files on remote systems and you want them to match up. If you make changes in one copy and then want to send them to the other copy, rather than sending the files in whole, rsync will use a technique called a rolling hash to find just those portions of just those files which have changed, and it will send only those changes, potentially saving you a whole lot of time and bandwidth. Of course, rsync doesn't solve the problem that if you change, say, a compressed file like an MP3 file just a little bit, those small changes could end up radically changing the content of the file bit for bit.
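The rolling hash idea can be sketched in a few lines of Python. This is not rsync's actual checksum or protocol: the polynomial hash, the constants `BASE` and `MOD`, and the function names are all assumptions chosen for illustration, and the direct byte comparison stands in for the stronger second check a real tool would use. The point is only that the hash of each window is derived from the previous one in constant time, so every offset of the new file can be checked against the blocks of the old file cheaply.

```python
BASE = 257
MOD = (1 << 61) - 1


def window_hash(data: bytes) -> int:
    """Polynomial hash of a whole block, computed from scratch."""
    h = 0
    for byte in data:
        h = (h * BASE + byte) % MOD
    return h


def rolling_hashes(data: bytes, size: int):
    """Yield (offset, hash) for every window of `size` bytes in `data`."""
    if size <= 0 or len(data) < size:
        return
    top = pow(BASE, size - 1, MOD)  # weight of the byte that leaves the window
    h = window_hash(data[:size])
    yield 0, h
    for start in range(1, len(data) - size + 1):
        leaving = data[start - 1]
        entering = data[start + size - 1]
        # Remove the old first byte, shift, and append the new last byte.
        h = ((h - leaving * top) * BASE + entering) % MOD
        yield start, h


def unchanged_blocks(old: bytes, new: bytes, size: int):
    """Offsets in `new` where a fixed-size block of `old` reappears."""
    known = {}
    for offset in range(0, len(old) - size + 1, size):
        block = old[offset:offset + size]
        known.setdefault(window_hash(block), []).append(block)
    matches = []
    for offset, h in rolling_hashes(new, size):
        window = new[offset:offset + size]
        # Different blocks can share a hash, so confirm with a real comparison.
        if any(window == block for block in known.get(h, [])):
            matches.append(offset)
    return matches


old = b"aaaabbbbcccc"
new = b"xx" + old  # two bytes inserted at the front
print(unchanged_blocks(old, new, 4))
```

In this toy case all three blocks of the old data are found again in the new data, shifted by two bytes, so only the two inserted bytes would need to be sent.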
So for a compressed file that has changed even slightly, rsync may end up sending about as much data as the entirety of the file. In any case, just understand that versioning binary files is somewhat problematic, so it's something we'll come back to at the end. Happily though, most software projects consist mostly of just source code files, and most version control systems are designed primarily with text files in mind. When we take two versions of the same file and resolve them into one, we call that a merge. Ultimately, there's no definitive way to programmatically perform a merge, because reconciling the differences between the two versions we're merging requires human judgment. Say I'm merging two source files together: is the resulting code correct? Well, only the programmer can judge that. In practice though, we do have merging tools which in most cases will merge together two source files the way that a programmer would have done it. Just understand that inherently it's a flawed process. In any case, how do these merge tools work? Well, there are different merge algorithms, the simplest of which is what's just called a two-way merge. It's called a two-way merge because it's working with just the two versions of the file which we are merging together. So say we have these two versions, one reading Ted, Alice, Yuri, Nadine and the second reading John, Lance, Ted, Yuri, Nadine. Like in the diff, the merge first finds the longest common subsequence, but rather than output a diff, it outputs a new version, the merge of the two original versions. And in this merger, the algorithm blindly assumes that you just want to take as much as you can from both original versions. That's why the merger here has both Alice from one version and John and Lance from the other. And be very clear that a merge, unlike a diff, is a commutative operation. So the merger of file A and file B is the same as the merger of file B and file A, which was not the case with diffs.
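Here is a minimal Python sketch of a two-way merge along these lines. It leans on `difflib.SequenceMatcher` from the standard library to line up the two versions; that class finds matching blocks with its own heuristic rather than a strict longest common subsequence, so this is an approximation of the idea, not how any particular merge tool works. The function name `two_way_merge` and the conflict marker strings are assumptions for this example. Spans where both versions have different lines at the same spot cannot be decided automatically, so the sketch marks them instead.

```python
from difflib import SequenceMatcher


def two_way_merge(a, b):
    """Merge two lists of lines, keeping everything found in either one.

    Returns (merged_lines, conflict_count).
    """
    merged = []
    conflicts = 0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(a[a1:a2])      # common to both versions
        elif tag == "delete":
            merged.extend(a[a1:a2])      # present only in a: keep it
        elif tag == "insert":
            merged.extend(b[b1:b2])      # present only in b: keep it
        else:
            # "replace": both versions have different lines at this spot.
            conflicts += 1
            merged.append("<<<<<<< a")
            merged.extend(a[a1:a2])
            merged.append("=======")
            merged.extend(b[b1:b2])
            merged.append(">>>>>>> b")
    return merged, conflicts


version1 = ["Ted", "Alice", "Yuri", "Nadine"]
version2 = ["John", "Lance", "Ted", "Yuri", "Nadine"]
print(two_way_merge(version1, version2))
```

For the two versions in the example, the merged result is John, Lance, Ted, Alice, Yuri, Nadine with no conflicts, and swapping the two arguments gives the same result.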
What can easily happen in a merge is that you have portions of the two original versions which conflict, that is, both versions have lines at the same spot, but those lines differ. Here for example, in one version between Ted and Yuri we have the line Alice, but in the other, between Ted and Yuri, we have the line Bruce. So the question is, well, what should end up in the merge? And the answer is that the merge algorithm doesn't know. It can't resolve this itself, so it actually has to prompt the user of the merge tool. It has to say, hey, there's a conflict here, so you're going to have to resolve this yourself, meaning you're going to have to decide what should go there. I'm just a dumb computer program. I don't know what the code or data in these two files means, so I can't make this decision. So what the merge tool will typically do in these cases is warn you that there were conflicts you have to resolve, and in the output file it will put a little marker saying there's a conflict here that you have to resolve. So you go through the file, find those markers, and replace them with what should actually go there. A merging algorithm that does a much better job of avoiding conflicts is the three-way merge algorithm, which is so called because it involves not just the two versions you're merging together but also the common ancestor of those two versions. By looking at the common ancestor as well as the two versions we're actually merging, the algorithm can automatically resolve those conflicts in which one version has changed since the common ancestor but the other hasn't. The presumption there is that you want the one that has changed, because basically we want all the changes since the common ancestor. So say, for example, you and I are editing the same file of source code.
Well we both started from a common ancestor that was our starting point where we both were and then in my version I made my changes and in your version you made yours and then when we merged them together if there are lines where I've made changes since the common ancestor but you haven't presumably we want my changes in the merger and this of course works the other way if you've made changes in certain lines but I haven't touched those lines then we want your changes in the merger. If however there are lines where I've made changes and you've made changes then there's a conflict and then the person doing the merge has to manually make that decision they have to decide what should really go in that spot. Maybe they take the lines from my version maybe they take the lines from your version maybe they end up putting something else there entirely. But again the general advantage of three way merges over two way merges is that the three way merge can automatically resolve many conflicts. Keep in mind though that it's really painful if you ever have to merge two significantly different versions of a large code base because you'll likely end up with a good number of conflicts and you won't necessarily be in a position to know how to resolve all those conflicts. So the general prescription about merging is merge early merge often. Don't let different versions of your code base diverge too long otherwise this could be a huge pain in the ass when you want to merge them back together. Last thing to say about mergers is you must be very very clear that even in cases where the merger of two versions of your code produced no conflicts or whatever conflicts there were got resolved automatically. Even in such cases of a conflictless merge it's quite possible that the result of the merger could be flawed. 
It could be introducing new bugs because changes I've made to the code and changes you've made to the code may not conflict at the level of text but in terms of logic what the code actually does that could be introducing a new bug because I didn't know what changes you were making and you didn't know what changes I was making and when we reconciled the two there could be a flaw. So once you've done a merger and there either were no conflicts to begin with or whatever conflicts there were you resolved well that's generally when your real work begins because then you have to test your code and make sure you're not introducing new giant bugs. No program of course can do that for you any more than you could have a program which actually writes your code for you.
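The three-way merge described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm of any particular tool: it uses `difflib.SequenceMatcher` to line each version up against the common ancestor, the names `matched_lines` and `three_way_merge` and the marker strings are made up for the example, and it is deliberately conservative about what it resolves automatically.

```python
from difflib import SequenceMatcher


def matched_lines(base, other):
    """Map each base line index to the index of its match in `other`."""
    mapping = {}
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


def three_way_merge(base, mine, yours):
    """Merge `mine` and `yours` given their common ancestor `base`.

    Returns (merged_lines, conflict_count).
    """
    in_mine = matched_lines(base, mine)
    in_yours = matched_lines(base, yours)
    # Sync points: ancestor lines that survive unchanged in both versions,
    # plus a sentinel marking the end of all three files.
    sync = [(i, in_mine[i], in_yours[i])
            for i in range(len(base)) if i in in_mine and i in in_yours]
    sync.append((len(base), len(mine), len(yours)))

    merged, conflicts = [], 0
    b = m = y = 0  # start of the current chunk in base, mine, yours
    for bi, mi, yi in sync:
        base_chunk, my_chunk, your_chunk = base[b:bi], mine[m:mi], yours[y:yi]
        if my_chunk == your_chunk:
            merged.extend(my_chunk)      # same on both sides
        elif my_chunk == base_chunk:
            merged.extend(your_chunk)    # only you changed this chunk
        elif your_chunk == base_chunk:
            merged.extend(my_chunk)      # only I changed this chunk
        else:
            conflicts += 1               # both changed it: a human must decide
            merged.append("<<<<<<< mine")
            merged.extend(my_chunk)
            merged.append("=======")
            merged.extend(your_chunk)
            merged.append(">>>>>>> yours")
        if bi < len(base):
            merged.append(base[bi])      # the shared line itself
        b, m, y = bi + 1, mi + 1, yi + 1
    return merged, conflicts


ancestor = ["Ted", "Alice", "Yuri", "Nadine"]
mine = ["Ted", "Bruce", "Yuri", "Nadine"]            # I replaced Alice
yours = ["John", "Ted", "Alice", "Yuri", "Nadine"]   # you added John
print(three_way_merge(ancestor, mine, yours))
```

With these hypothetical inputs, my replacement of Alice and your added line John both end up in the result with no conflicts. If both of us had replaced Alice with different names, the sketch would emit a marked conflict instead. Either way, a clean merge at the level of text says nothing about whether the merged code is logically correct, which is why testing after the merge still matters.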