Okay, a few questions while people are still coming in. Who uses Git at least once a week? Well, the great majority. Second question: who wants to learn about Git internals? You're just checking that you didn't come here by mistake, so you know what you were looking for. Okay. So I find Git internals quite interesting, and I would like to share them and use Python to help me explore them. Why should we even look into Git internals? Well, why not? That's a good enough answer for me, but I need to convince you that it's a worthy subject of investigation. I think one of the main reasons is a better understanding of the tool. If you know how the internals work, it can lead to more efficient everyday use and an easier understanding of what you read on the internet about Git. Then you can learn the ideas, because a bunch of quite interesting, innovative and creative ideas were put into Git's design and implementation. We can broaden our horizons, and we can use and adapt those ideas in our own projects. And of course, healthy curiosity. I like to watch those "how stuff works" videos on YouTube, where they explain how everyday objects work and operate. So that's kind of this talk: how stuff works, but in the software world. Please raise your hand if you think that Git is hard, or at some point in your life said, damn, Git is hard. Well, again, the majority. I think the main reason Git seems hard is that Git is largely misunderstood. This is from one of the early Git README files. When Linus Torvalds wrote the very first version, he described the tool as "the stupid content tracker". The keyword here is stupid. It means stupid not as the opposite of smart, but stupid as in simple. So it was designed and envisioned to be simple. Let's look at a very brief history of the first version of Git. Linus Torvalds himself began development of Git on the third of April. The year doesn't matter.
It's about the days. He had a working version in three days, and one day later he started using it to host the Git source code itself. Then within a month he managed to achieve his performance goals. Isn't that incredible? It doesn't sound like a hard tool. It's something that basically one person, maybe with some extra help but mostly him, managed to do in one month. So what can actually be hard about it? It should be simple. It was designed to be simple, but then something went wrong. The normal way of using tools, whether it's software, machines or devices, is that people first learn how to use the thing, master how to operate it, and then maybe, maybe learn the internals. You don't really need to know how a car engine or a transmission works internally to be able to drive a car, and most drivers don't know that much about the internals. But Git was envisioned and designed the other way around. It's important to notice that Git was created by uber-geeks. Not everyday geeks, but Linux kernel developers: it was designed by Linux kernel developers for Linux kernel developers, and their idea was that if you create simple building blocks and people learn them, the usage will become self-evident. They basically thought that if you first learn the internals, then you will automatically understand how to use the tool. My guess is that's where things went wrong. Something that works for Linux kernel developers doesn't necessarily work for the wider programming community. That's why I think it's quite useful and important to look into the Git internals. So what is Git? This is a quote from the Pro Git book, which I recommend. It's the free official book on Git. Git is fundamentally a content-addressable file system with a version control system interface written on top of it.
So it implies that there are layers, by design. You might be wondering what a content-addressable file system is. I will skip this for a second and come back to it later. Let's decouple Git into layers. On the very basic layer Git is a key-value store, which is content-addressable. On top of this key-value store a file system is built. On top of that the version control part is built, and on top of the version control part the collaboration tooling is built. Speaking about commands: the version control level is about commands like git checkout, git commit, git branch, this stuff. The collaboration level is git push, pull, fetch, and working with remote repositories. But in this talk I will speak about the first two levels, which have their own commands and their own structure, but nobody really uses them directly every day. So, the key-value store, the lowest level of Git. Usually, if you have a key-value store, whether it's a Python dictionary or some database, the fundamental principle is that you can provide a key and store any arbitrary value under that key. This key could be anything. You generate it. It doesn't need to be part of the content of the value. That's the normal way. I said that Git is a content-addressable file system, or key-value store. What does that mean? It basically means that the key to the value is the value itself. But that doesn't make a lot of sense. Content cannot be its own key. It doesn't sound useful and it's not really going to work. So the Git designers came up with the idea that you can use a hash of the value as its key. It means that the key depends on the value, and that's what the content-addressable part means: you can refer to the value by its hash. And in Git terminology the key is the object's hash and the value is an object.
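The idea of using the hash of a value as its key can be sketched in a few lines of Python. This is a toy model only, not Git's implementation: the class name `ContentAddressableStore` and its methods are made up for illustration, and it hashes the raw bytes with SHA-1 without any of the framing that Git adds.

```python
import hashlib


class ContentAddressableStore:
    """Toy key-value store in which the key is derived from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, value: bytes) -> str:
        # The caller does not choose the key: it is the hash of the content.
        key = hashlib.sha1(value).hexdigest()
        self._objects[key] = value
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentAddressableStore()
first = store.put(b"hello")
second = store.put(b"hello")   # same content, so the same key
third = store.put(b"hello!")   # different content, so a different key
print(first == second, first == third, len(store))
```

Storing the same bytes twice yields the same key and a single stored value, while any change to the bytes yields a different key.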
Object here has nothing to do with object-oriented programming. An object is just anything that Git stores in its key-value store. There are two important implications here. First of all, content is immutable. If you have a file, it has a certain hash, and when you change the file content, the hash changes. So the key changes if the content changes. It means that if other objects are linking to that original file, once you change the content, the key changes and all the links get broken. That's a very important implication. So all the objects in Git are immutable in that sense. But there is good news: there is no content duplication. If you try to store two identical things in the Git key-value store, there will not be two copies of the same thing, because they have the same key, and there is no need to store the second one because it is essentially the same object. Those are two very important implications and they affect a lot of things in Git. Okay, a small interesting fact. What a repository in Git is, when it comes to your file system, is just this .git folder in your project folder. All the files that you have beside the .git folder, basically your source code, are not themselves part of the Git repository. They're called the working tree, and it is just a way for Git to interact with the normal file system. And an important fact is that you can set up a Git repository without having a working tree at all. It's called a bare repository, and that's what is used on Git servers. Okay, now let's look at what's inside this .git folder. I'm going to run a bunch of commands and then use Python to explore what actually happens inside the Git repository itself. So let's first create our repository. Okay, as you can see, it says it initialized an empty Git repository in .git.
That's what I meant when I said that it doesn't really include the files. So let's just go there. The storage for this key-value store in Git is located in .git/objects. Let's see if there is anything in it. There's a bunch of stuff. Those are directories. There are no files yet because it's an empty repository. Okay. As I said, every level in Git has its own set of commands. There is a command that will compute the hash of an object. An object, as I said, can be any string. So let's create this one: Python is cool. And then I'm going to pipe it into git hash-object, which hashes the object from stdin. Okay, it just prints the hash, according to how Git calculates the hash. Well, in the best traditions of Git, if you add a flag to this command it will do something completely different. If you add -w, it will not just calculate the hash, it will actually store the object in the key-value store. I think this is one of the problems with Git: a lot of commands have flags that completely change what the command does. I think that was one of the biggest design mistakes. But let's try it. It also prints the hash, but now it stores our object. Let's see what's there. I will filter by type, files only, because directories don't really matter for now. Okay. Now we see there is one more file. And its path happens to be a broken-down hash of the object: the first two characters become the directory name, and the rest of the hash becomes the file name. So let's try to write this file again. As I said, the content is immutable and deduplicated. Let's see if more objects appear. Well, nope. There is no need to store this file. It's already there because it has the same hash. So let's do this: I'm going to store one more object there. 17 is cool, right? And now let's see if there are more objects. Right. There is one more object because it's a different thing. Okay.
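What git hash-object computes for a blob can be reproduced in Python as a rough sketch. Git hashes a small header, the word blob, the content length in bytes and a null byte, followed by the content itself, using SHA-1. The function names `git_blob_hash` and `object_path` are made up for this illustration, and the path layout is the loose-object layout described above.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash content the way Git hashes a blob: header + content, SHA-1."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def object_path(object_hash: str) -> str:
    """First two hex characters name the directory, the rest names the file."""
    return f".git/objects/{object_hash[:2]}/{object_hash[2:]}"


empty_hash = git_blob_hash(b"")
print(empty_hash)               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(object_path(empty_hash))  # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The empty blob is used here because its hash is the same in every repository, which makes it easy to compare against the output of git hash-object on an empty input.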
Now let's explore what those objects actually are. I created this Jupyter notebook that will help us explore the content of the repository with Python. Here I just set up some helper functions that are going to help me show some stuff and type less code. And let's use glob to iterate over all the object files. Boom, there are two objects in our Git repository. Let's create a function that will list all our objects, because we will use it later. And let's see if it works. Yes. It strips the common path prefix of the files and shows basically just the hashes. Okay. Let's pick the first path. I wrote this function that reads the file and just outputs its contents in one line. Oh, okay. That looks like some binary gibberish. The thing is that Git compresses the contents of the object with zlib, so we need to decompress it. Okay. Now we see the content of our object. First comes the word blob, then some number, then a zero byte, a null byte, and then the actual content of my object. So what is this blob? Blob is the object type. Git stores the object type inside its key-value store, together with the content. The number is the content length, basically how many bytes there are. Okay. So now we know that an object has a type. Let's create a named tuple where we're going to store our parsed objects, which will have the path, the type and the content itself. Now let's create two small functions that will help us read objects and do what we have done so far. The read object function reads the file, decompresses it, and then, using a regular expression, takes the type of the object and the content. We don't need the length. The length is just for informational purposes. Then let's write a function that will iterate over all our object paths and parse the objects. And let's store the result in a variable. All right. I forgot to run this part. Okay.
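The notebook code is not shown in the transcript, so the following is only a minimal sketch of the steps just described: decompress with zlib, then split the header from the content with a regular expression. The names `GitObject`, `parse_object`, `read_object` and `list_object_paths` are assumptions made for this example, and only loose objects are covered, not packed ones.

```python
import re
import zlib
from collections import namedtuple
from pathlib import Path

GitObject = namedtuple("GitObject", ["path", "type", "content"])

# "<type> <length>\0<content>"; DOTALL lets "." match newline bytes too.
OBJECT_RE = re.compile(rb"(\w+) (\d+)\x00(.*)", re.DOTALL)


def parse_object(path, raw: bytes) -> GitObject:
    """Decompress a raw loose object and split it into type and content."""
    data = zlib.decompress(raw)
    match = OBJECT_RE.match(data)
    if match is None:
        raise ValueError(f"not a Git object: {path}")
    obj_type, _length, content = match.groups()
    return GitObject(str(path), obj_type.decode(), content)


def read_object(path) -> GitObject:
    """Read one loose object file from disk and parse it."""
    return parse_object(path, Path(path).read_bytes())


def list_object_paths(repo="."):
    """Loose objects live in two-character directories under .git/objects."""
    return sorted(str(p) for p in Path(repo).glob(".git/objects/??/*"))


# In-memory example: build a blob the way Git stores it, then parse it back.
raw = zlib.compress(b"blob 6\x00hello\n")
print(parse_object("<memory>", raw))
```

Inside a real repository, `[read_object(p) for p in list_object_paths()]` would give the list of parsed objects that the talk displays as a table.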
Now, I created this small function to more easily display tables of named tuples and some other content. Okay, we can see that there are two objects in our repository. Both of them have the type blob. And we can see the content here. One is, yeah, EuroPython and the other is EuroPython 17. Okay. Now we have some stuff in our repository. Let's go back and do something closer to what we do every day. Let's create a file and store just the current date in it. Let's see what's inside. Well, nothing interesting. Just the date. Quite predictable. Now let's add this file and commit. And the commit is done. Now let's see what changed in our repository. Let's get the list of objects again and see what's inside. Okay. There are three more objects. We can see the types. Now, beside the blob, we see two more types: a tree and a commit. The commit we're going to explore later, and the tree is also some sort of binary gibberish. And then we can see the blob, the file that we just added as part of our commit. As you can notice, the blobs of the files don't have names. They're pretty much nameless. You store the content itself. A blob doesn't store the name of the file. It's important and you will see why later. Okay. So let's extract that commit object and see what's inside it. Now we can see that it's actually pretty readable text, just separated by newlines, so let's split it by newlines to see what's there. We can see that a commit object is pretty much a bunch of textual metadata. It has headers: a tree header, and author and committer, which happen to be me. And then the actual text, the message of the commit. What's interesting here is this tree, because we can see that tree points to a hash. Let's create a small function that will convert the commit headers into a dictionary. We're going to use multidict.
Multidict is just a third-party dictionary that can hold several values for the same key, because those headers are not unique. Some of them can appear twice. So let's use this helper function to parse the headers. Okay. Now we have a dictionary of headers. Right now the most important thing here is the tree. It looks like a pointer, in the sense that it's a hash. Let's see what kind of object it points to and what its contents are. Let's extract it from the headers. Let's have this function that will convert the hash into the file path. Nothing interesting. Okay. Now let's load this tree object and print its contents. We can see that it has some numbers, then the file name, and then just some binary stuff. Basically, a tree is a list of objects: it contains the metadata for blobs and for other trees. So let's create a small function that's going to retrieve an object from our collection by hash, so we don't have to type this again, because we'll use it later. And now let's parse this tree object. Again, we'll use regular expressions. It's actually quite cool that you can use regular expressions for binary data as well. It's quite an easy way to parse binary files. This tree object consists of entries, which I defined like this: first comes a bunch of digits, which happen to represent the Unix permission bits. We don't really care about them. Then comes the file path. It can contain letters and dots and slashes. And then there is the hash of the object that it refers to, in binary format. That's why there are 20 bytes: in hexadecimal format it's the 40 characters of the SHA. Okay. Now let's try this parse tree function and see what it's going to output. Okay, now that's something we can read. We see that the tree consists of only one file and its pointer. So basically a tree is something that contains metadata for other objects.
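The tree parsing described above can be sketched as follows. This is not the speaker's notebook code: the names `TreeEntry` and `parse_tree` are made up, and the regular expression accepts any name without a null byte rather than only letters, dots and slashes. The input is the decompressed content of a tree object with the header already removed.

```python
import re
from collections import namedtuple

TreeEntry = namedtuple("TreeEntry", ["mode", "name", "hash"])

# One entry: "<mode digits> <name>\0" followed by the 20 raw bytes of the hash.
TREE_ENTRY_RE = re.compile(rb"(\d+) ([^\x00]+)\x00(.{20})", re.DOTALL)


def parse_tree(content: bytes):
    """Parse the body of a tree object into a list of TreeEntry tuples."""
    entries = []
    pos = 0
    while pos < len(content):
        match = TREE_ENTRY_RE.match(content, pos)
        if match is None:
            raise ValueError(f"malformed tree entry at offset {pos}")
        mode, name, raw_hash = match.groups()
        # 20 binary bytes become the familiar 40 hexadecimal characters.
        entries.append(
            TreeEntry(mode.decode(), name.decode(errors="replace"), raw_hash.hex())
        )
        pos = match.end()
    return entries


# Hand-built example body: one file entry and one directory entry.
# The second hash is made-up filler bytes, not a real object.
example = (
    b"100644 file.txt\x00"
    + bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
    + b"40000 newdir\x00"
    + bytes(range(20))
)
for entry in parse_tree(example):
    print(entry)
```

Matching entry by entry from a known offset, instead of searching the whole buffer, avoids accidentally matching digits and spaces that happen to occur inside a binary hash.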
And basically the tree connects the blob with the name of the file that it corresponds to. You can see it's d2 something, right? And where was it? Here, d2. So it points to this object. Okay, that's pretty much how Git implements a file system, with this tree object. Now let's do more stuff with our repository. Let's create a directory called newdir. And then let's do the same thing: write the date to a file, add this file and commit. Now let's see what kind of objects appear in our repo after this. We use the same function and show it as a table. Okay, there is more stuff. It's already a little bit hard to keep track of it visually. So now let's try to learn what branches in Git actually are. The general term for branches is a ref, or reference, and they live in this path: .git/refs/heads. We have only one branch and it's master. So what is a branch in Git? It's just a file that contains the hash of the current commit. We can go and see what we have in our ref. Boom, it's just the hash of the latest commit. It's an important distinction: branches in Git are not objects. They are merely named pointers to a commit. And because they are not objects, they are mutable, unlike the objects themselves. Okay, now let's extract this hash pointer and strip all the extra bytes and stuff from it. Okay, that's it. Let's try to find the commit object that master points to. And let's look inside its contents. Let's look at its headers. Okay, that's where master points to. So the commit object has the tree that we learned about previously, but now it has a new field, parent. The first commit obviously didn't have a parent because it was the first commit. So the parent points to the previous commit in this line. Okay, let's convert these three lines of code into a function because we might use it later.
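Reading a branch as a plain file can be sketched like this. The helper names `read_ref` and `list_branches` are assumptions for illustration. The sketch only handles refs stored as individual files directly under .git/refs/heads; Git can also keep refs in a packed-refs file, and branch names containing slashes live in subdirectories, neither of which is covered here.

```python
from pathlib import Path


def read_ref(repo, name="master") -> str:
    """Return the commit hash stored in .git/refs/heads/<name>."""
    ref_file = Path(repo) / ".git" / "refs" / "heads" / name
    # The file holds the 40-character hash followed by a newline.
    return ref_file.read_text().strip()


def list_branches(repo) -> dict:
    """Map each branch file directly under refs/heads to the hash it contains."""
    heads = Path(repo) / ".git" / "refs" / "heads"
    return {p.name: p.read_text().strip() for p in heads.iterdir() if p.is_file()}


# Usage inside a repository that has at least one commit on master:
#     print(read_ref("."))
#     print(list_branches("."))
```

Two branches that point to the same commit show up in `list_branches` as two names with the same hash, which is all a freshly created branch is.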
Okay, now let's look at the tree of the latest commit. Now we see that beside file.txt it also contains a pointer to the new directory, and it's a hash. So basically it's a representation of the file system, just inside Git. And now let's compare it with the parent's. Okay, now let's do this: let's add one more. Let's make one more dir. Let's say newerdir. And let's do the same trick: write one more file with the date, git add, commit. Okay, now let's see what we added there. Let's check again. Wow, the list of objects keeps growing and growing. Let's read master again. And then, okay, let's see master's tree. Now it has three entries. Wow, that's interesting. And now let's compare it with its parent's tree. The previous one had two entries because, well, I just added this new one. What we can see here is that file.txt was not changed in the latest commit. And since it wasn't changed, it's the same content. Git doesn't store one more version of this file because it is the same version. The only difference is that the tree has the new entry. So that's how deduplication works. Git never stores the same thing twice, because the same content has the same hash. That is the meaning of content-addressable: the content itself serves as the address of an object, and if the content hasn't changed, the address doesn't change either. Okay, our list keeps growing. Let's store the current number of objects. And now let's do something more with our repository. Let's create a branch and see what happens. What kind of objects will appear? I want to work on the branch feature. I create the branch. Now let's switch to this branch.
Okay, now let's iterate over the objects again and see if the count of objects changed. Well, surprisingly, it didn't change. As I said, branches are not objects. When we create a branch, we just create one more file that points to the latest commit. It's the same commit, so there is no need to create anything new there. So refs/heads/feature is just one more file. Okay, let's iterate over the refs. As I said, there are two of them. Let's read the files and compare. Boom, feature and master are the same: they point to the same commit. As I said, branches are not objects. They are basically different names for the same thing. Okay, now let's actually do something in our branch. Well, the familiar technique: write something to a file, add, commit. We're sort of working on a feature here. Okay, now let's see if the branches changed. We can see that one of the branches now points to a different commit, because that's what we did: a new commit. When you add a commit, Git moves the pointer of the branch to this newly created commit. You can see the old one didn't change, of course. Okay, now let's go back to master. And then let's simulate that something was done in master in the meantime. Of course, I didn't create the file first. Okay, let's do that again. Now we're going to merge our feature branch into master. Okay, now we have a merge commit, and Git offers me to write a message. Well, Git generates an automatic message for us. Let's just use it. Okay, now let's investigate the current master commit. What's special about it? I'm extracting the master pointer again, and let's display the headers of the current master commit. We see a bunch of familiar headers, but there is something new. There are two parents. And that's actually what makes a merge commit special.
It has two parents: the parent commit from the branch we were on, and the commit from the other branch. The ability to have several parents in a commit is what makes Git history not a linked list but a tree, and more specifically a directed acyclic graph. The merge commit doesn't contain any more information. It doesn't even say in the headers what branch it was merged from. It just lists the first parent commit and the second parent commit, and the contents are the file changes. Okay, that's enough coding for today. Let's do a small recap of what we learned. A file is a blob. It doesn't even have a name or any metadata. It is just a piece of content. A tree is a list of blobs and trees. It's a recursive structure: it can point to files and to other trees. Trees are the containers of the metadata. They implement the file system level. A commit is a tuple of essentially three things: a snapshot of the file system, which is implemented by linking to the tree at that time; the parent commit hash, or parent commit hashes; and some metadata like the author, the message and that kind of stuff. So it captures the file system changes. And a branch is just a file. It's not an object. It's just a file that contains the last commit's hash. Another important aspect is that Git doesn't store deltas, Git stores the full blob. When you change a file, it stores the whole file again. That's why checkouts are very fast: Git doesn't need to resolve deltas and all of that stuff, it just gets the file by its hash immediately. That's why checkouts, branches and similar operations are that fast. So what is a merge? A merge is a special kind of commit with two parents. That's it. That's what makes a merge commit special, and that's what makes the history not a linked list but a graph. Okay, now, I said that immutability has very important practical implications.
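The recap of a commit as a tree, zero or more parents, and some metadata can be sketched as a small parser. The talk's notebook uses the third-party multidict package; this sketch swaps in a plain dictionary of lists to stay within the standard library. The names `parse_commit` and `is_merge` and all hashes in the sample are made up, and multi-line headers such as signatures are not handled.

```python
def parse_commit(content: bytes):
    """Split a decompressed commit body into (headers, message)."""
    # Headers and message are separated by the first empty line.
    header_block, _, message = content.partition(b"\n\n")
    headers = {}
    for line in header_block.decode().split("\n"):
        key, _, value = line.partition(" ")
        # A key such as "parent" may repeat, so every key maps to a list.
        headers.setdefault(key, []).append(value)
    return headers, message.decode()


def is_merge(headers) -> bool:
    """A merge commit is simply a commit with more than one parent."""
    return len(headers.get("parent", [])) > 1


# Made-up sample body; real hashes are 40 hexadecimal characters.
sample = (
    b"tree aaaa\n"
    b"parent bbbb\n"
    b"parent cccc\n"
    b"author Someone <someone@example.com> 1500000000 +0200\n"
    b"committer Someone <someone@example.com> 1500000000 +0200\n"
    b"\n"
    b"Merge branch 'feature'\n"
)
headers, message = parse_commit(sample)
print(headers["tree"], headers["parent"], is_merge(headers))
```

Nothing in the headers names the branch that was merged. The only structural difference from an ordinary commit is the second parent line.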
Changing the content changes the hash, and changing the hash means that the links get broken. And this has a very practical aspect: history in Git is immutable as well, not just the objects themselves. The history is immutable because history is based on content linking. Let's say we have this very, very simple history. We have three commits, labeled with two characters each, which are just very short versions of the hashes: 12, A4 and 6B. And the master branch pointer points to 6B. Let's say I want to change a file in A4, or the message, or basically anything in A4. If we change something in A4, it essentially becomes a new object with its own hash, F7. It will point to the same parent commit, 12. But since the content changed, it's a different object. And since Git is content-addressable, 6B doesn't point to F7, because it knows nothing about it. At this stage F7 is an unreachable commit. So if we want to change the history in Git, we need to rewrite it. What we're going to do is create a commit that has the same content as 6B but points to F7 as its parent. It's the same content but with a different parent. And then we can reset the master reference to our newly created object. And we can do the same thing with the previous commit. So basically, if we want to change history we need to rewrite it, and then kill the previous one. The references, the branches, are mutable, so we can switch them afterwards. After that we can run git gc, and it will remove unreachable objects like A4 or A5. Now, I said that Git stores full blobs: if we change an object, even if it's a large object, it will just store two versions of the same object. But that's only partly true. On the key-value level, the logical level, there are indeed no deltas, but that alone would be quite inefficient. Deltas do exist, but not on the logical level: they are completely transparent to the key-value store.
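The history rewrite described above, where changing one commit forces every later commit to be recreated, can be illustrated with a toy hash chain. This is a simplified model rather than real Git: the function `commit_hash` is made up, the tree values are placeholder strings, and author and committer lines are left out.

```python
import hashlib


def commit_hash(tree: str, parents: list, message: str) -> str:
    """Hash a simplified commit body; each parent hash is part of the content."""
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents]
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()


# Original history: first <- second <- third, with master on third.
first = commit_hash("tree-1", [], "first")
second = commit_hash("tree-2", [first], "second")
third = commit_hash("tree-3", [second], "third")
master = third

# Rewording the second commit produces a new object with a new hash.
second_new = commit_hash("tree-2", [first], "second, reworded")

# The old third commit still names the old second commit as its parent,
# so a new third commit has to be created on top of the new one.
third_new = commit_hash("tree-3", [second_new], "third")

# Only the branch pointer is mutable: move master to the rewritten tip.
master = third_new
print(second != second_new, third != third_new)
```

After the pointer moves, the old `second` and `third` are no longer reachable from `master`, which is the situation in which garbage collection can remove them.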
And yes, if you run git gc, which stands for garbage collection, it will pick up similar objects, and it does so heuristically. It doesn't compute deltas based on the relationship between commits, on the history; it just picks up similar-looking blobs and computes deltas between them. And it actually works quite fast. Otherwise any large repository would be unbelievably large. Okay. Now we have covered the basics of Git internals. I hope you learned something, and I hope it will give you a foundation to learn more. That's just the beginning, I hope, of you learning the Git internals, because there are many more interesting things: how objects are sent over the network, how deltas actually work, and all that kind of stuff. I highly recommend the free and publicly accessible book Pro Git. It's the official Git book. It has a chapter about Git internals, which I used as the main source of inspiration, and I used their icons. Their license permits non-commercial use of their icons. That's it. Thank you very much for your attention. Thank you so much. So do we have any questions? Was everything clear? Okay. Thank you very much for this nice journey through Git. Can you comment on how staging is represented in this world of objects? I knew it, I knew it. That's why I prepared a mini-presentation on this as well. Okay. Yeah. That's the second thing that's sort of mind-boggling for people who start with Git. What the hell is this stage thing? And the worst part is that it's called by three names that seemingly have nothing in common. The staging area is also known as the index, also known as the cache. It is an intermediary between your file system and the Git repository, which is used by the Git tooling. Here is what I honestly think is one of the worst parts of Git. Look, there are two commands. The first one will check out the branch with the name branch, and the second command will purge all your local changes.
So the problem is that Git seemingly does two completely unrelated things with these two commands. But they're actually not that unrelated. I think all of you have seen this diagram. You have the working tree; if you want to commit something, you first do git add to put it into the staging area, then you do git commit and it gets into the repository. But the diagram doesn't tell the whole story. The staging area is not something independent, something in between. When you check out a branch, your staging area contains a cache of all the files in this branch, their current versions. So when you do git add, it compares the contents of the file with the contents of the file in the staging area, the cache. That's why it's called the cache. And if it sees that there are differences, it moves the changed version of the file into the staging area. And when you do git commit, it goes over all the files that are marked as changed in the staging area. So you have a huge list of files, basically a tree of your changes. It picks all the things that were changed and creates a new tree from them, and this tree goes into your commit. So when we do checkout dot, which purges everything, it basically means: check out all the files from the staging area into our working tree. And if we do checkout with a branch name, we say: check out the files from the branch. And when we do that, we also store them in the staging area cache. Okay, I hope that answers your question. Thank you so much. Thanks for a great question. For a great extra presentation. That was great. Anyone else want to ask? Okay. Hey, thanks for the great talk. Should we think that stashing actually works the same way? Staging? Yeah, on the side, right? Yeah, it has really nothing to do with the repository itself. With git stash it just says: okay, let's save those changes into a stash file, and that's it. It's actually not part of the version control system.
It's basically just a copy-paste area or something, just more advanced. It has really nothing to do with it at all. Thank you. So you mentioned that branches aren't objects, but if I remember correctly, tags are objects. Can you comment on that? Right, right, right. Yeah, indeed, I skipped this part because I think it's not that crucial for a start. There are actually two types of tags. A lightweight tag is pretty much the same as a branch: it's just a file with the name of the tag that contains the hash of the commit it points to. That's the lightweight tag. And then there is the, well, non-lightweight, I don't remember the exact name, heavyweight tag. Yeah, thank you, annotated tag, right. An annotated tag is one more object type, which points to the commit that it tags, plus you can add a message or other metadata. That kind of tag is an object and as a result is, of course, immutable. So there are two types of tags that have nothing in common except that they both are called tags. Thank you, we have more questions. Thank you, and thank you for discussing rewriting the history in Git. Is that something you would actually recommend doing? Let's say you accidentally commit some large binary file and it's always with you. Yeah, thanks for the question. It's something that I would not recommend doing at all. If you're the only user and you reset your branches, it's okay. But if you push this, then when other people pull, their new master has nothing to do with the old master. And then you will need to do git push with hard or force, I forgot the flag. I'd avoid it at all costs. By the way, one more remark. Because the history is immutable, Git shares a lot of similarities with blockchain, the technology that's used in cryptocurrencies.
A block is linked to its parent block, and that's why you can't tamper with old transactions: anyone can easily verify that the hash doesn't match. That's one of the most intriguing ideas in Git, I think. Thank you. Any more questions? We have time for one or two more. There's one in the front. Thank you. I understand how the master pointer and branch pointers work, but how does it work when you write something like HEAD~2 or HEAD? Is it a recursive lookup in the tree or something like that? HEAD is like a special reserved reference, and we can look at it. It's sort of like a branch, but not really a branch. We can see it's just a file, not in the branches folder but in the root of the .git folder. Basically HEAD is a pointer to the current branch. And that's where the so-called detached HEAD comes from. If I want to check out a specific commit, I say checkout and the hash of the commit, and then HEAD, instead of pointing to some branch, will point directly to this commit. So yeah, HEAD is sort of a reserved branch name. It operates in pretty much a similar fashion, but it's just not part of the branches. Thank you. Yeah. Thank you very much for the presentation, very nice. You mentioned something about merge and how it's represented internally. How does this compare with git rebase? Sorry, what's compared to what? Can you speak a little louder? Merge with rebase. Oh, that's something I was not going to touch, because I could dedicate one more whole talk to the difference between merge and rebase. Okay, I'll try to do it briefly. A merge is just a commit with two parents. Git tries to merge automatically, and if it cannot, it will say: okay, sorry, but there are conflicts, resolve them yourself. So how does rebase work? Rebase basically replays the changes from the other branch on top of the current branch. It looks at what changes, what deltas, are in this branch.
Then it tries to apply the same changes on the current branch. In the end you will have commits containing the same information, but all the rebased commits have one parent. The result will be the same in terms of content, if there are no conflicts, of course. But in terms of structure, all the rebased commits have a single parent. But yeah, that's a really messy and complicated subject, and I personally don't even like rebasing. Yes, okay, thank you so much.