Hey, this is David Thomas, Solutions Architect here at GitLab. I'm going to spend the next 20 or so minutes talking to you about a fun subject: Git internals. I also like to call it Fear Not the SHA.
It's pretty common, when we show people how to use Git, to talk about it from the usability standpoint. You have a working area where you're working on your files. Some of those files Git knows about, some it doesn't. You stage your work. You have a local repo where everything is stored, so you have the entire repository, the entire database, locally. And you also have your remote one, perhaps up on your GitLab server. You have the database up there too, and you have to keep them in sync.
So we commonly show a picture like this and walk through the commands. You add your new files, all the files you're working on. You commit those into your local repository. At some point, you push them up to the remote server. You'll want to get your team's changes down from the remote server, so you fetch them or pull them down, merge them in, etc.
What's nice about this is that after some trial and maybe a little bit of error, folks can usually grasp the concepts of Git. And it all goes well as long as it's going well, right? It's the moment something breaks that the frustration sets in, from not really knowing how Git works beyond a handful of commands.
I recently did a training session for a company here in Silicon Valley. As an audience they were pretty well ahead of the curve on the general usage of Git, so I decided at the last minute to spend some time really showing them how Git works. Now, let me take a quick step back here.
One thing that's common when you're working with a tool that helps you work with a Git repository, GitLab being one of them, is that you see these gobbledygook strings of letters and numbers all over the place. So here's 8E561CAD, for example. If you click on that, it happens to be a commit, and if you look down here, that commit has a link to a parent, plus some more of these character strings. If you go look at the graph of a branching structure and mouse over one of the commits, you'll see that big long string there too.
At the end of the day, this combination of characters and numbers is spattered all across the tool, and of course it doesn't mean anything in human-readable form, so we just kind of ignore it. If anything, it probably just adds some confusion. But what I found, and what I want to share with you here, is that once you fundamentally understand where those strings come from, once you're able to speak in SHA, as I say, and fear not the SHA, working with Git becomes a lot easier and a lot easier to understand.
All right, a little bit of background on Git internals. When you crack the hood of Git, it has essentially three object types that I want to talk about: a commit, a tree, and a blob. The commit object, as you would expect, records something like "dthomas committed five files with a particular comment," and Git keeps track of that body of metadata. In that commit there's a pointer to a tree. The tree is the next level down, where Git starts referencing the actual files and subdirectories within your project. Ultimately, those trees point to blobs, and the blob is the actual file content.
So if you commit text files or image files, small, medium, or large, Git is going to treat those as blobs. If you look at the picture there in the lower right, it's essentially the theoretical model of how these objects refer to each other. The commit is centered right in the middle of that picture, intentionally, because it is sort of the center of the universe, as you'll see. When you commit five files, the commit references a tree, which can reference more trees for subdirectories, and the trees reference various blobs. So if I go to a particular commit, I can figure out what files and what directory structure should be on disk for that commit.
Now, on top of that, there are tags and branches. You might have read that branching in Git is cheap, essentially free. It doesn't do full copies like some of the older SCM systems. And that's true, as you'll see. Tags and branches are simple pointers to a particular commit, just a human-readable name, if you will. We'll leave it at that. There's also a special one called HEAD, which keeps track of where you are on the current branch. Every time you do a commit, it moves forward, so it represents the current state: yet another convenience marker. But essentially they're all just pointers.
So ultimately what it boils down to is that Git is a database of references. Essentially it's one giant graph. To be specific, it's a directed acyclic graph, which means that if you look at a particular commit and follow its history, you'll never get back to that same commit. You can figure out the entire lineage, from the first file you committed through 10 years of project history, just by following pointers. And it does that through the commits.
Whenever you do a commit, it has a reference not only to the files in the snapshot of work you just did, but also to its immediate parent. And that commit points to its parent, on up the tree.
Something that's interesting to note: you'll hear the word master. That's what I like to refer to as the canonical mainline branch. When you first start your project, if you don't give your first branch a name, master is just the default. Nothing more than that. I mentioned HEAD already: it's a special pointer that points to the latest commit on your current line, and it moves automatically. Whenever you do a clone or check out source code, Git starts from the current commit, and from that commit it knows what information to bring down. Of course, you can also clone, pull, or check out from a particular branch. The branch points to one commit, whereas maybe master points to a different one, and that's how Git knows what content to put on disk.
All right, let's dive a little into the details. Fear not the SHA, right? Every object, meaning the commits, the trees, and the blobs, is stored on disk. Internally, Git runs the object through the SHA-1 hash algorithm, which produces a 40-character string based on the contents, and stores the object on disk under that 40-character name. The algorithm is very complicated, and you can look it up on Wikipedia, but for practical purposes it guarantees uniqueness. If you have two files that differ by a space or a single character, their SHAs are going to be different. If I create foo.txt and someone else creates, as you'll see, bar.txt, and the contents are identical, they're going to have the same SHA, and Git only stores the content once.
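To make the hashing idea concrete, here is a minimal Python sketch of how a blob's 40-character name can be computed. It assumes the standard blob format (a short header of the form `blob <size>` plus a NUL byte, followed by the raw bytes); the function name `git_blob_id` and the sample strings are made up for illustration and are not part of the demo.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 name Git would give a blob with these contents."""
    # Git hashes a small header ("blob <size>" + NUL) followed by the raw bytes.
    # The file name is not part of the input at all.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    foo = git_blob_id(b"this is foo\n")       # contents of a file named foo.txt
    bar = git_blob_id(b"this is foo\n")       # same bytes in a file named bar.txt
    spaced = git_blob_id(b"this is foo \n")   # one extra space

    print(len(foo))        # 40 hex characters
    print(foo == bar)      # True: identical contents share one name
    print(foo == spaced)   # False: a single space changes the name
```

Because the name depends only on the bytes, two files with identical contents resolve to one stored object no matter what they are called.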
That deduplication is mind-blowing: Git doesn't actually store two copies of the file. It stores other objects that hold the file names and point to the same object, but the actual file content is only stored once. Think about that one.
Now, those 40-character strings are quite unwieldy to work with, so Git is smart enough to let you give it a short prefix. I've heard of as few as four, five, or six characters; it depends on how big your database is. I've heard that the Linux kernel now needs a minimum of 12 characters to reference an object uniquely. But essentially you can find a small number that works in your project. Eight is pretty common to show.
One thing you'll see as I go through an example is that, on disk, Git tries to limit how many files go into any one subdirectory of the internal database. So it creates a subdirectory named with the first two characters of the SHA, as you'll see in the example there. It is 40 characters, but split as two and then 38: the first two are the subdirectory and the remaining 38 are the file name.
The contents of these files are compressed, so if you just cat the file on disk, you'll see binary output. If you actually want to see the contents, Git comes with a command called cat-file. You can give it a few flags: -p for pretty-print, -t for type, or -s for size. -p is the one I commonly use. You give it part of the SHA, and it displays the contents in human-readable form if it's a text file.
I'm just going to forward through here. If you look at the picture on the right, what happens over time is that, in this case, you've got three commits. You come in in the morning and do your first commit, on the left. In this case, you committed three files in two subdirectories.
So Git knows that little snapshot: here's my commit, here's my directory, here are the file and the directory in it. The subdirectory is another tree, and that tree has a reference to a file in there. Then you go to lunch, come back in the middle of the day, and do your second commit, having made some more changes. Right before going home at five o'clock, you do your third commit, and maybe you changed just one file. But if I go to that third and final commit, I can follow the arrows that are part of that commit to get the state of the files, the correct versions of the files at that point in time. That's essentially the snapshot.
So let's go through a quick little exercise. This is one of my favorite parts of doing this little session. I've got a few terminals here. The upper-left terminal is where I'm going to run some commands. I'm going to create a directory called fooproject. The terminal on the far right is where I'm going to show the internals of the Git database. So I go into that directory, fooproject, and run a command to watch its contents: watch, with -d to show differences and -n 1 to refresh every second, running find dot. Right now it's empty.
Now I go into my fooproject and create a new Git database: git init. That's my local repository. You can see on the right that it created an initial layout of files, with some template sample files in there, which I'm going to get rid of so I have more room to show stuff on the right. So I'll rm the .sample files under .git, just for the sake of the demo.
Okay, let's get to work and see what happens when we add a file. I'm going to create foo.txt containing "this is foo", and let's add it: git add foo.txt.
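What the add step leaves on disk can be approximated in a few lines of Python. This is only a sketch of the loose-object layout described earlier (a two-character directory, a 38-character file name, zlib-compressed contents), not Git's actual implementation. The helper names `write_loose_object` and `read_loose_object` are hypothetical, and unlike cat-file the reader here needs the full 40-character SHA rather than a short prefix.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path
from typing import Tuple


def write_loose_object(objects_dir: Path, obj_type: str, body: bytes) -> str:
    """Store one object in loose-object layout and return its SHA-1 name."""
    store = f"{obj_type} {len(body)}".encode("ascii") + b"\0" + body
    sha = hashlib.sha1(store).hexdigest()
    # First two hex characters name the subdirectory, the other 38 the file.
    path = objects_dir / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return sha


def read_loose_object(objects_dir: Path, sha: str) -> Tuple[str, int, bytes]:
    """Return (type, size, body), roughly what cat-file -t, -s and -p report."""
    raw = zlib.decompress((objects_dir / sha[:2] / sha[2:]).read_bytes())
    header, _, body = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    return obj_type, int(size), body


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        objects = Path(tmp) / "objects"   # stand-in for .git/objects
        sha = write_loose_object(objects, "blob", b"this is foo\n")
        print(sha[:2], sha[2:])           # directory name, then file name
        print(read_loose_object(objects, sha))
```

Note that nothing in the stored object mentions foo.txt; the file name lives elsewhere, which is the point the demo comes back to.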
What we notice is that, on the right, Git created an object on disk. Let's take a look at that object. If I cat the file under .git/objects/9b (I'm using tab completion to make my life easier), it's just binary stuff, because it's compressed. So I have to use git cat-file with pretty output on my SHA, 9b3a97, and it says "this is foo".
What's interesting is that the file on disk, named by its SHA, has just the raw contents. Nowhere in there does it mention the actual file name. We'll come back to that in a little bit. The name is actually in the file called index. If you cat .git/index, this third entry up here, it's also binary, but you can see it has a reference to foo.txt. So it's a little internal temporary database, the staging area, for work that you're getting ready to commit but haven't committed yet.
In any case, let's go ahead and do a commit: foo commit number one. Now you can see I've got three objects on disk, so two more have been created. Let's figure out what those are. git cat-file -p, and let's just choose one. Let's look at the last one, f8a85. To figure out what the type is, do a -t: it's a tree. That makes sense. If you look closely, this tree says: I know there's a file called foo.txt, it points to an object with this SHA-1 hash, that SHA-1 hash represents a blob, and here are the file permissions on disk for it. So what is this 9b again? git cat-file -p of 9b3a97: "this is foo". So Git created a tree object that maps the file name to a particular object on disk.
Let's go see what the third one was. We've done 9b3a and we've done f8a, so let's look at the middle one. git cat-file, and let's look at the type first: 8d6154... 8d610d... totally messed it up. This is why they're not practical to type by hand, like when you're coding... d54.
So its object type is commit. Okay, let's take a look at that one. We run the same command but with -p, and if you look closely, it shows the author, the committer, and my commit message. But what's important is that it has a tree reference. This commit is referencing a tree, f8a85, which we looked at up top, and that tree was pointing to a text file with a particular reference to the blob, the actual contents of the file. So now you can see that we have the commit pointing to the tree and the tree pointing to the blob. Simply by adding and committing one file, you get three objects on disk.
All right, let's do another one. Let's create bar.txt: this is bar. And let's edit foo again: this is second update on foo. git add foo and bar. Keep an eye on the right, because right now we have three objects. When I add them, we go from three to five, and that makes sense: Git just puts the new contents on disk. Now let's do a commit: adding foo and bar. On the right we had five objects. When we do the commit, the database grows to one, two, three, four, five, six, seven. So it added two more: one is the commit object and one is a tree object.
Let's take a quick look with git cat-file. If you look at the commit output (and again, these are things you'd typically ignore when using Git, because you don't care to understand them or they're just gobbledygook), there is a reference to my commit. The response to the commit has this SHA here, so let's start there. Instead of randomly going to the right to try to find objects, let's do git cat-file -t of b535256. Git came back and said: you're a commit.
Well, of course: I just did a commit, so it came back as a commit object. Sweet. Let's take a look at what that actually looks like. That commit object points to a tree of changes. It also points to a parent, which is the first commit we did. And you can see my comment, adding foo and bar. We already looked at the parent and its changes earlier, so we're one commit ahead now. Let's look at the tree: git cat-file -p ae19bb3, grabbing it from right here. This commit points to two files, foo and bar, and it records the actual contents on disk for each. As proof, git cat-file -p on foo's entry, 407285, says "this is the second update on foo".
Okay, here's where we come to an aha moment. Let's do one more, a third commit, but add just one file. We'll call it baz: "this is baz, aha". git add baz, git commit, adding baz, aha. Now of course I could go and search the objects down there, but I'm just going to use the information Git gave back to me. There's the hash, so I copy it this time. git cat-file -p, and let's take a look at the pretty contents of this commit. As you'd expect, this being the third commit, it points to a tree and it also points to its parent, which is commit number two, and of course commit number two points to commit number one.
But the interesting thing, the aha moment, comes when we look at this tree: git cat-file -p 5516, voilà. This is finally where you get to understand how Git works and tracks changes. We did a commit of one file, yet internally the commit's tree records the entire state of every file, each version identified by SHA, at that point in time. So if I want to go cherry pick a particular commit and have Git put it on disk, it knows everything from that one commit's data structure.
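The snapshot behaviour can be reproduced without Git. The sketch below serializes flat trees and commits in the object formats Git uses (tree entries of mode, name, NUL, and the 20-byte binary SHA; commit text with tree, parent, author and committer lines). The file contents, the `Demo User` identity, and the zero timestamp are placeholders rather than the values from the demo, and subdirectories are left out to keep it short.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 name of an object: hash of '<type> <size>' + NUL + body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_entries(files: Dict[str, bytes]) -> List[Tuple[str, str, str]]:
    """One (mode, name, blob id) row per file, like cat-file -p on a tree."""
    return [("100644", name, object_id("blob", files[name]))
            for name in sorted(files)]


def tree_body(files: Dict[str, bytes]) -> bytes:
    """Serialize a flat directory as a tree object body."""
    return b"".join(
        f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(blob)
        for mode, name, blob in tree_entries(files))


def commit_body(tree: str, parent: Optional[str], message: str) -> bytes:
    """Serialize a commit; the identity and timestamp are placeholders."""
    who = "Demo User <demo@example.com> 0 +0000"
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines += [f"author {who}", f"committer {who}", "", message]
    return ("\n".join(lines) + "\n").encode("utf-8")


if __name__ == "__main__":
    second = {"foo.txt": b"second update on foo\n", "bar.txt": b"this is bar\n"}
    third = dict(second)
    third["baz.txt"] = b"this is baz, aha\n"   # the only file added

    commit2 = object_id("commit", commit_body(
        object_id("tree", tree_body(second)), None, "adding foo and bar"))
    commit3 = object_id("commit", commit_body(
        object_id("tree", tree_body(third)), commit2, "adding baz, aha"))

    # The third tree still lists foo.txt and bar.txt with unchanged blob ids.
    for row in tree_entries(third):
        print(*row)
    print("commit", commit3[:8], "has parent", commit2[:8])
```

Only one file was added for the third commit, yet its tree has a row for every file, and the rows for the untouched files carry the same blob ids as before.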
Git doesn't have to follow anything else. It's like, boom, I've got it all right here. When you added this one file, Git recorded what the state of the union of the entire project looks like. So when it goes to pull content out and put it on disk, say when you're switching branches (and we'll talk about branches last, to wrap up), it knows which versions of the blobs, the file contents, to put on disk. That's glorious.
All right, let's move to a branch real quick. Let's create one: git checkout -b foo-branch, and let's see what happens. You'll notice I'm kind of running out of space here, but at the bottom, Git created a reference called foo-branch. Let's take a look at what that means. This one's not compressed, so you can just cat it: the foo-branch file under .git/refs. So what is a branch? We said they're lightweight, we said they're pointers, and sure enough, it is a pointer. To what? Well, here's the SHA. What can we do with SHAs? We can use the cat-file command, just as we've been doing. What's the type of dc7678? Commit. So the branch just points to a particular commit. Which commit? Let me look at the contents: it's the third commit, adding baz, aha. So the branch starts from that point.
Let's go in and make a change to baz: "touched this", a little MC Hammer reference there. I don't know if that's how he did his initials, I kind of made that up, but it's pretty funny. git add baz. And remember, as a quick spot check from a branch perspective, we're on foo-branch right now, not master. Not that it matters too much for the demo. So let's do a git commit, adding MCH baz changes. Boom, it just did a commit, and on the right it's just adding more object files.
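The bookkeeping behind a branch is small enough to sketch. The Python below models only loose ref files (a text file under refs/heads holding one SHA, plus a HEAD file naming the current branch); real repositories may also keep refs in other storage such as a packed-refs file, which is ignored here. The helper names and the fake 40-character ids are hypothetical.

```python
import tempfile
from pathlib import Path


def init_refs(git_dir: Path, commit: str) -> None:
    """Set up a master branch at `commit` with HEAD pointing to it."""
    heads = git_dir / "refs" / "heads"
    heads.mkdir(parents=True, exist_ok=True)
    (heads / "master").write_text(commit + "\n")
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")


def read_head(git_dir: Path) -> str:
    """Return the ref HEAD names, for example 'refs/heads/master'."""
    text = (git_dir / "HEAD").read_text().strip()
    if not text.startswith("ref: "):
        raise ValueError("detached HEAD is not handled in this sketch")
    return text[len("ref: "):]


def resolve_head(git_dir: Path) -> str:
    """Return the commit id the current branch points to."""
    return (git_dir / read_head(git_dir)).read_text().strip()


def checkout_new_branch(git_dir: Path, name: str) -> None:
    """Ref bookkeeping of 'checkout -b': copy one SHA, then repoint HEAD."""
    start = resolve_head(git_dir)
    (git_dir / "refs" / "heads" / name).write_text(start + "\n")
    (git_dir / "HEAD").write_text(f"ref: refs/heads/{name}\n")


def advance_branch(git_dir: Path, new_commit: str) -> None:
    """After a commit, the current branch file is rewritten with the new id."""
    (git_dir / read_head(git_dir)).write_text(new_commit + "\n")


if __name__ == "__main__":
    third_commit = "c" * 40    # fake id standing in for the third commit
    fourth_commit = "d" * 40   # fake id standing in for the commit on the branch
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp)    # stand-in for .git
        init_refs(git_dir, third_commit)
        checkout_new_branch(git_dir, "foo-branch")
        advance_branch(git_dir, fourth_commit)
        for ref in sorted((git_dir / "refs" / "heads").iterdir()):
            print(ref.name, ref.read_text().strip()[:8])
```

Creating the branch copies a single 40-character line, and committing on it rewrites that one line while master stays where it was, which is why branching is cheap.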
Again, the Git database is just this flat list of objects, but through references it has internal knowledge of all the pointers needed to get the right versions. The commit came back with a SHA, which is cool, so let's git cat-file -p 1811173. You can see that this commit is the "adding MCH baz changes" one. I'm going to leave it at that.
What's really exciting about this hands-on experiment is that you can do it yourself, locally. Git has the tools to inspect the data that's in there. It was really eye-opening to see that Git just stores these really simple files, compressed, on disk, and that's the entire database. So when you think about working off a mainline, or off a branch of a branch of a branch, and you create these complicated branching structures, to Git it's just flat files of content, and it knows how to point at and reference things. And the aha moment is that when you get to a particular commit, that commit has knowledge of the entire state of the system at that point in time.
So when you go back to your GitLab instance and start seeing all these short SHAs, hopefully this exercise has demystified those references. They do have a lot of meaning. When you use the tools, you now know how Git uses them internally: they're unique handles to a point in time for a file in its history.
The last thing I'll leave you with is that I made none of this up. There are a lot of really good resources for Git out there. There's the classic Git reference manual. You can read about a whole bunch of commands, but it really helps to know how Git works under the hood.
So hopefully now, when you read about some of these commands and what it means to do a merge or a rebase, it'll make a lot more sense. There are some online books out there that are really good, and they're free. I really like this Git Internals PDF. I don't know why the color is yellow there, but it's essentially a white paper that gives really deep details. That's how I got my knowledge of this and put together the little example. If you really want to know what's going on behind the scenes, that PDF is fantastic. And then there are some other guides, some of them more on the basic usage side.
So I hope you enjoyed the session. My name is David Thomas, Solutions Architect here at GitLab, and we'll see you out there.
But wait, there's more. After putting together the prior short video, I came up with a couple of really good examples to hammer home some of the concepts. In the first one, I want to create 100 files with the exact same content and see how Git treats that. On the left, I've got my project. I'm going to do a for i in seq loop over a sequence of 1 to 100 and echo the same word, foo, into a text file each time. So you'll see I've got 100 text files, named 1 to 100. When I git add all these files, you'll notice on the right that it created only a single object. When I git commit, adding 100 files, you'll see the commit reports 100 files created. If I scroll up and look at the commit output, here's the actual SHA-1 of the commit. So let's do a cat-file -p on that commit SHA. There it is. And let's take a look at the tree: git cat-file -p 4484. You'll notice that the tree has the name of each file in the fourth column, and in the middle you can see the SHA of the object itself.
Now, we know it's one object: you can see on the right that I've only got three objects in the actual database. But the tree object references that one blob 100 times. So that's the first example. If you have files in your Git repository with the exact same contents, Git only stores one copy.
Let's do another quick example. I'm going to create a branch: git checkout -b, and let's just call it foo-branch. Okay, now we have a new branch, and you can see that it created a reference there. Let's make a change to one of the files. Let's just change file 100, then git add 100. Notice on the right that we had three objects so far, plus a reference, and now we have four objects. We'll commit that: added bar to 100. Of course, we get a couple more objects, because we need a tree object and a commit object. Here's the commit itself. Let's take a quick look at it with git cat-file -p. Here's the commit, its parent, and the tree. Let's look at the tree, d0af36. Notice that even though we changed one file, this tree object references all 100 files. But if you look really closely, all of the SHA-1s in the middle here are the same except for the one at the very end, for file 100. Let's take a quick look at it: 3bd1f0. Bam, it has foo and bar.
So if I were to check out the branch called foo-branch, Git would look at this tree and give me 100 files on disk. For 99 of those files, it would give me the exact same stored content, each under its unique name. And because this branch is a little bit different, it would give me this version of file 100.txt.
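Both examples can be checked with a short Python sketch that only computes blob names, assuming the standard blob header format. The `snapshot` helper is hypothetical and stands in for the name-to-SHA rows of a tree listing; the file names and contents mirror the demo as described, with foo in every file and bar appended to file 100 on the branch.

```python
import hashlib
from typing import Dict


def blob_id(content: bytes) -> str:
    """SHA-1 name of a blob: hash of 'blob <size>' + NUL + contents."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def snapshot(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each file name to its blob id, like the rows of a tree listing."""
    return {name: blob_id(data) for name, data in files.items()}


if __name__ == "__main__":
    # Example 1: 100 files, all containing the same word.
    master = {f"{i}.txt": b"foo\n" for i in range(1, 101)}
    master_tree = snapshot(master)
    print(len(master_tree), "names,", len(set(master_tree.values())), "blob")

    # Example 2: on the branch, only 100.txt changes.
    branch = dict(master)
    branch["100.txt"] = b"foo\nbar\n"
    branch_tree = snapshot(branch)
    print(len(branch_tree), "names,", len(set(branch_tree.values())), "blobs")

    changed = [name for name in master if master_tree[name] != branch_tree[name]]
    print("differs:", changed)
```

The tree carries 100 rows either way; what changes is how many distinct blobs those rows point at, which is all that needs to be stored.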
So I hope those two examples were helpful in giving you some insights into how git works internally. See you around.