How big should it be? Okay, so today I will be talking about Git internals: basically, the data structures that you see within Git. Git is a very powerful tool. It allows you to do a lot of funny things like rewriting history. In order to do all these advanced things, you first need to understand how Git stores everything. So, here's an example repo, and if you look at the directory listing, there's a .git directory where everything is. Everything is an object.
Okay, let's look at the refs. You see here in .git/refs/heads, this is the list of all your branches. Each one is a plain text file that just gives you a commit hash. If you take this and do git show, you can see your commit.
Yeah, hang on. Look at that, with this book. What? What happened? It's a book. Alright, nope. No, I think it's VJ, right? No. Unblocking? Unblocking? Okay. I think I forgot to save this configuration just now, so it reset.
Okay, so, like I mentioned, everything is an object. First of all, let's look at the commit tree. It's quite a simple history: just three commits, most of them empty. In Git's history, the branch always refers to the latest commit on that branch. It's a sort of symbolic link that points to it. Each commit has a reference to other commits, and following those references gives you your entire history chain.
So, if you look at this commit, this one over here, it shows the latest commit. And you can do git cat-file -p. Over here, this is what's inside the object for the main... I mean, the HEAD commit. And you see a few lines. One is tree; a tree in Git is a directory. The parent is the parent commit, and you see, the first eight characters of this are the same as these eight characters. The author and committer are here. And then you have your log message.
So, all these things put together in some kind of binary format and then hashed using SHA-1 give you your commit ID. All these objects in Git are named by the hash of their contents, which means that when you change anything, the commit hash changes.
If we examine this tree, you see that there are three blobs, and these are actually the files. So, we have commits, trees and blobs. Trees can contain trees, which are other directories, and trees can contain blobs, which are files. And each of these is the contents of your file.
You notice that file1 and file2 have the same hash. The reason is that both file1 and file2 are empty. This hash is the magic hash for an empty file. And if I were to, say, copy this file... This is the contents of file3. If I copy file3 into file4, you see that we now have the same hash for file3 and file4. This is because the hash of a blob is determined only by its contents. Any file that has the same contents will get the same hash.
Does it mean that only one copy is stored? Yes, only one copy of the file is stored. If you have 100 copies of the same file in the same Git repository, only one copy is stored, and all of them are referred to by the same hash.
Would you have collisions at some point? Sorry? Would you have collisions at some point? Yeah, eventually one day there will be collisions across all the different Git repositories out there, but so far... There's a conspiracy theory that... people say that it's a rainbow table generating device. Yeah, do you think GitHub uses this to... Possibly, yeah. This thing lends itself very well to... I mean, to deduplication, right? To deduplication for forks? Oh yes, yeah, forks especially. And so we have blobs and... You see there are...
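
The point above, that an object's ID depends only on its contents, can be sketched in a few lines of Python. This is a minimal illustration, assuming a SHA-1 repository, where Git hashes a short header (`<type> <size>` followed by a NUL byte) and then the raw contents. The function name `git_object_hash` and the sample file contents are made up for the example and are not from the talk.

```python
import hashlib


def git_object_hash(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 object ID for an object of the given type and body."""
    # Git prepends "<type> <size>\0" to the body before hashing it.
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# An empty file always hashes to the same well-known ID.
empty = git_object_hash("blob", b"")

# Two files with identical (hypothetical) contents get identical IDs,
# regardless of their names, so only one blob needs to be stored.
file3 = git_object_hash("blob", b"some contents\n")
file4 = git_object_hash("blob", b"some contents\n")

print(empty)           # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(file3 == file4)  # True
```

The file name never enters the hash, which is why a copy of a file costs no extra blob; the name lives in the tree that points at the blob.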
Sorry, talking about collisions: on the git-scm website, they have this thing that the chance of a collision is lower than the chance of all the developers in your team being attacked by wolves on the same night in separate rooms. Wow. Okay? Wolves, yes. Alright. Probably not a very big concern. But the thing is, they still have plans for the eventual collision. So Git has some leeway for upgrading everything to SHA-256, which, I don't know, I think makes it a lot harder to collide. But that also means that you get a really huge hash for everything. Right? So, yeah.
Alright. Inside this tree, you see that there are... one, two, three, four. Four columns. This is the hash of your file. This is the type. And this is the... what do you call it? This is the permission. So 644 means that it's... it's Unix permissions. Actually, Git doesn't really store the actual permissions. It only stores the presence of the execute bit. So if you have an execute bit, then it becomes 755 instead.
So it's compatible with Windows? No. Windows and HFS on Mac have this issue where file names are case-insensitive. So if you have one file called capital blah and one file called lowercase blah in the same directory, they are different files on Linux, and Git will track them as different files. But on Mac and on Windows, they are the same file. When you try to check out your directory, it starts overwriting each one with the contents of the other. And it gets very fun. So if you want to troll your Windows-using friends, this is what you do. Oh, Mac-using friends? Yes, and Mac-using friends. Yeah. I'm looking at you.
So, um... Right, so I've covered trees and blobs and... Oh, yeah. So, like I mentioned, everything is hashed. So if I have an identical tree that looks the same as this, I think it should have the same hash. So if I say mkdir blah.
Okay, git cat-file. So... let me expand this a bit. Just now we saw that our commit had a tree of this 0.6 something. This, right? And now it has changed. So, this is our root tree, in which we see that we have a directory called blah. And because blah has all the same files as the previous root tree, it now has the same hash as that tree. So, basically what I've shown you is that Git is really just a content tracker. It's a gigantic hash table.
Where does it store the path to the new files? Okay. So, you see, in the HEAD commit, we have this new tree, right? Over here. This directory shows you that you have these four blobs as before and you have one new item, the tree. This is your subdirectory. You see here? It says tree. And... same as before. So, that's how it stores the...
So, I wanted to cover some other things as well. Over here, I have a history that shows a merge commit. If I take this and examine it, you see that now I have multiple parents. And the order in which these parents appear is important because you can refer to them by number. git show HEAD^... whatever this thing is... 1. And it shows this, the first parent. See? The hash is here. If I do 2, it shows this, which is the second parent of HEAD.
git log. Yeah. So, over here, this is basically your revision log, starting with the latest commit, which has two parents. One is here, one is here. And then these both have the same parent, which is this, and so on.
Oh, yeah. Besides my window. Smaller? To the right. Yes. Okay. Yeah, so... Oopsies.
Yeah, one last thing. I'll just show you the first commit. The difference between the very first commit in your history and any other commit is that it has no parent. See? It's just a tree. It has the same old metadata as before and no parent.
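
The parent links described above (ordered parents on a merge commit, no parent on the very first commit) form a graph that can be walked from any commit back to the root. The following is a minimal sketch of that idea in Python. The commit names in `HISTORY` and the helpers `nth_parent` and `walk` are hypothetical, and the visiting order is a plain breadth-first walk, which is an assumption made for illustration and not necessarily the order git log prints.

```python
from collections import deque

# Hypothetical history: commit ID -> list of parent IDs, in order.
HISTORY = {
    "merge": ["left", "right"],  # merge commit: two ordered parents
    "left": ["base"],
    "right": ["base"],
    "base": ["root"],
    "root": [],                  # first commit: no parent
}


def nth_parent(history, commit, n):
    """Return the n-th parent (1-based), in the spirit of HEAD^1 and HEAD^2."""
    return history[commit][n - 1]


def walk(history, start):
    """Visit every commit reachable from start exactly once, breadth-first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in history[commit]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


print(nth_parent(HISTORY, "merge", 1))  # left
print(nth_parent(HISTORY, "merge", 2))  # right
print(walk(HISTORY, "merge"))  # ['merge', 'left', 'right', 'base', 'root']
```

The walk stops on its own at the commit with an empty parent list, which matches the observation that the first commit is the only one without a parent.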
So, when you do a git log, what happens is that Git looks at your current branch, which is merged; it looks in .git/refs/heads/merged, which is this commit over here; then it looks at this commit and shows you that it has these two parents; and then it just keeps traversing down, git cat-file -p this, parent of this, parent of this, all the way down until there are no more parents, and that's the very first commit in your history. I think that wraps up everything. Any questions?
Yes. That's why you have to be very careful when you're rewriting history. The thing is, when you have a history, everybody believes that the history follows this path, right? But if, let's say, you do a rebase and rewrite history so that it follows a different path all the way down, and then someone tries to merge your tree, everything goes wrong. That's why there's one rule that people usually follow for rebase or any history rewriting, and that's: do not rewrite history that has been published. Anything that's still within your own system and hasn't been published, you can rewrite. Anything that you have shown the world stays as it is. Exactly, it's a linked list.
So, this also brings up another thing. Because the commit hash is computed from the contents of the entire commit, this entire bit here, all of these have to remain constant for this hash to stay as it is. Which means if I sign this commit with a tag... yeah, let me show you: git tag -s, version v1.0. So now I have a tag called v1.0, and a tag is an object of its own. You see here there's a lot more data. There is basically... actually I don't know what object this object is. So, a tag can point to a commit. Actually, it can point to any object. You can tag a file. You can tag a directory, which is a tree. And you can tag a submodule or whatever. Anything that has a hash can be tagged. So, in this case, we have tagged the current commit, which is this.
And you see that the tagger has my name. There's the tag name here. And this is the tag message. This part is what's interesting. This is what you call a PGP signature. PGP is this public-key cryptography thing where you have a key pair. Everyone has a key pair, one public and one private. I keep my private key private and I give out my public key. With my private key I can sign messages, and with my public key you can verify that the message was signed in my name. And if you tamper with it, the signature is broken.
So, by signing only the top commit, right? This signature guarantees that this tag is correct. I mean, it means that I have certified that this tag is correct. And this tag refers to this commit, which refers to everything. So, by certifying only the topmost commit, I've certified the entire history, saying that everything is valid, because SHA-1 is still what you call a cryptographically secure hash. It is not currently possible to reverse it within any reasonable time frame. So, that's the security introduction.
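
Why certifying only the topmost commit covers the whole history can be illustrated with a toy hash chain. This is a simplified sketch, not Git's actual commit format: real commits also carry author and committer lines, and the names `toy_commit_id` and `tip_of` as well as the placeholder tree ID are assumptions made for the example. What it shows is that each commit's ID includes its parent's ID, so changing any ancestor changes the ID at the tip.

```python
import hashlib
from typing import Optional


def toy_commit_id(tree: str, parent: Optional[str], message: str) -> str:
    """Hash a simplified commit: tree line, optional parent line, message."""
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = f"commit {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tip_of(messages, tree="hypothetical-tree-id"):
    """Build a linear chain of toy commits and return the ID of the last one."""
    parent = None
    for message in messages:
        parent = toy_commit_id(tree, parent, message)
    return parent


original = tip_of(["first", "second", "third"])
tampered = tip_of(["first (edited)", "second", "third"])

# Editing the oldest commit changes every ID after it, including the tip,
# so a signature over the original tip no longer matches.
print(original != tampered)  # True
```

The same mechanism explains the rule about published history: a rebase produces new IDs all the way up, so anyone holding the old IDs sees a different chain.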