And we're back. You're sitting in Brian. I hope you've had a nice talk. And if you don't know about it, we'll be having sprints tomorrow and on Sunday. So feel free to sign up, sign up your projects, everything. We have Matteo here with us. Hey, hello. I think your sound is a bit low. No, it may be good. Okay, like that. Yeah, that's good. Cool. Awesome. And what are you going to teach us? You're going to teach us automation, right? Yeah. How you can use Python to automate your daily life and your daily tasks. And we will use the git command as an example. We will create a 90-line script to remake the git commit command. I need this in my life, Matteo. Let's do it.

Okay. Hello, everyone. Welcome to this talk. Let me start by introducing myself: my name is Matteo Bertucci. I am a first-year student in a French engineering preparatory class, Polytech Marseille. I've been enjoying Python for more than two years now and use it daily to automate aspects of my life and have fun on larger projects. In my free time, I'm also part of the staff of the largest online Python community, Python Discord. We aim to foster a welcoming atmosphere for newer and older Python enthusiasts. I highly recommend you join us at discord.gg/python after the event if you want to continue the discussion or just chill out. This is my first time presenting a talk and I'm very glad to be here.

This talk will go in two parts. In the first one, we will explore how the git database, the core file storage of a git repository, is structured. In the second part, we'll write a 90-line script to replicate how the git commit command works. We'll go over some simple string manipulation and byte handling techniques, and how to quickly make a Python utility script. To wrap up, we will discuss in more detail how we got to the script and why.
Every command I use in this part is meant to be reproducible at home. I highly recommend you try these commands, and even deviate from the main path, try your own stuff and have fun.

So what is the git database? Well, in each git repository, you have a hidden .git folder that contains all the data git needs to remember. It contains remotes, hooks, logs and, what we are interested in today, objects. As you probably already know, git gives each commit a 40-character hexadecimal identifier. This identifier is actually the hash of the commit. But what is a hash, you may ask? Well, a hash is produced by a hash function. You pass a block of data of any length to a hash function and, for the same data, it will always return the same hash, of a fixed length. Git uses the SHA-1 algorithm, which stands for Secure Hash Algorithm. The thing is, commits are just the visible part of the iceberg, as we will see later on. What you need to remember for now is that you can give any block of data to the git database and it will give you back a unique 40-character identifier that you can use later to access the data again.

Let's try with an example. We will start by initializing an empty repository using git init. Using the find command, we can see that the folder .git/objects, the git database itself, is empty. Then, using git hash-object, we can store arbitrary data in the database. Here, using the echo command, we provide the string "example blob" followed by a newline. Do note that we use the -w flag to actually save the object to the disk. Here, we got a hash starting with bf. Using the same find command as previously, we can see that a new object exists in the database, starting with the same hash, bf. Then, finally, using the command git cat-file with the -p flag to get human-readable output, we can check that we've indeed stored the string "example blob". Something I would like to point out too is the structure of the database.
As you may have noticed, the blob is actually stored in a subfolder called bf, which is the first two characters of the hash. Cool.

Now that we know about the database itself, we can look at what it actually contains. Let's create a test repository for that. We'll start by running a few commands to have a reproducible environment: since commits include metadata, such as the name of the author, if it isn't the same as mine, you'll get different hashes. We can start by creating a new repository, creating a src folder, and three bogus files: a README, a license, and a script. We'll add those files to git, and we can immediately see that three new objects have been added to the database, starting with 00, b0, and c9. Using git hash-object once again, we can double-check that each object indeed maps to the content of a file in our repository. Something that you may have noticed too: the name of the file is not included anywhere in those objects. The objects inside the git database are all anonymous. Another object will have to give meaning to them.

So let's now make an example commit. Let's call it "example commit", because it is very original. And look at the content of the database once again. As you can see, we now have three new objects: 27, 76, and 84. Let's investigate what those are. We'll start by looking at the commit itself, which, as we can see in the terminal, starts with 76. As you can see, the structure is really quite simple. We have three fields: the reference to the tree, the author, and the committer identity. They are followed by a blank line and the commit message. Besides that, the author and the committer will always be the same unless you use something like the --author flag on the git commit command. The most observant of you will probably have noticed that the tree ID is also one of the objects in the database. Let's see what it contains. It's also quite simple, with only three lines.
Each line starts with the mode of the file, which is an integer saying what this file actually is, followed by the type of the object, so either blob or tree, and its hash and name. The last entry points to another tree, which contains the script file. Git uses nested tree objects to represent larger directory structures. As a side note, this representation isn't actually how the tree object is stored on the disk, as we will see in a few minutes.

So let's now write some code. We can start by opening an object file and... yeah. As it turns out, objects aren't stored in plain text. To save space, they are compressed, or deflated if you prefer the right term, using zlib. We can use the built-in zlib library to inflate them. You can also notice that you have a header at the beginning of the file. It contains the object type, blob here, the size, a null byte, and then the actual content. The object hash can be calculated very simply by using the hashlib.sha1 function on the whole decompressed content of the file, header included.

I would also just like to stop here for a second and talk about byte handling in Python. When you create a string of characters, the computer must store it somehow in memory. For that, we created tables mapping each character to a sequence of bytes in memory, like ASCII. Python strings are Unicode, with UTF-8 as the default encoding, which allows more special characters from other languages to be used. The problem with strings is that they don't like random bytes, such as the ones we saw previously with the compressed blob. For that purpose, two other types exist: bytes and bytearray. The former is immutable, just like strings, whereas the latter is mutable; you will usually use a bytearray when you need to modify part of the array at runtime. We won't need it today. You have two main ways of creating a bytes object. You can either prefix a string literal with the letter b, which sadly means you can't combine it with an f-string.
Or the other method is to convert back and forth with a string object using the encode and decode methods. Both take an encoding parameter, which is usually UTF-8.

With all that knowledge in mind, we can create our function. We'll start with some imports and by creating a path constant, which is our database folder. If you don't know what pathlib is, it is basically a fancier version of os.path that uses classes and is much easier to work with. This function will take in the type of the object as a string and its actual content as bytes, and return the hash as a hexadecimal string. Next, we can construct the block of data we are actually going to store, following the git model. We start by putting the type, a space, the content size, a null byte, represented by a backslash and a zero, which is just shorthand for \x00, and the actual content of the object. We use the legacy %-style formatting here, because we can't use an f-string with a bytes literal, as I've said previously. What happens here is that each %s, for string, and %d, for digit, is replaced by its element in the tuple, like so. We can now use the hashlib function to compute the hash and zlib to compress the block. We will use the name hash_ to avoid colliding with the built-in hash function, like we did with the type parameter. As you may remember, the object is stored in a subdirectory, which is the first two characters of the hash. We can represent that by using the objects path, a forward slash, and the first two characters of hash_. We make sure this folder exists, write the compressed data down, and return the hash.

Now, we can write a quick small function to take any path as a parameter and store the content of the file in the database using the type blob. This is quite simple: we just open the file in read-binary mode, as we want to handle stuff like images, which aren't text, and call write object. Easy, right?
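The object-writing steps described above can be sketched roughly as follows. This is a minimal sketch, not the speaker's actual script: the names OBJECTS_PATH, write_object, write_blob and read_object are assumptions, and the objects_path parameter is added here only so the functions can be pointed at a scratch directory instead of a real repository.

```python
import hashlib
import zlib
from pathlib import Path

# Assumed location of the git database, relative to the repository root.
OBJECTS_PATH = Path(".git/objects")


def write_object(type_: str, content: bytes, objects_path: Path = OBJECTS_PATH) -> str:
    """Store content as a git object and return its hexadecimal SHA-1 hash."""
    # Header (type, space, size, null byte) followed by the raw content.
    data = b"%s %d\0%s" % (type_.encode(), len(content), content)
    hash_ = hashlib.sha1(data).hexdigest()

    # Objects live in a subfolder named after the first two hash characters.
    folder = objects_path / hash_[:2]
    folder.mkdir(parents=True, exist_ok=True)
    (folder / hash_[2:]).write_bytes(zlib.compress(data))
    return hash_


def write_blob(path: Path, objects_path: Path = OBJECTS_PATH) -> str:
    """Store the file at path as a blob and return its hash."""
    # Binary mode, so that non-text files such as images work too.
    with open(path, "rb") as file:
        return write_object("blob", file.read(), objects_path)


def read_object(hash_: str, objects_path: Path = OBJECTS_PATH) -> bytes:
    """Return the inflated bytes (header included) of a stored object."""
    return zlib.decompress((objects_path / hash_[:2] / hash_[2:]).read_bytes())
```

As a sanity check, storing an empty blob with this sketch gives e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which is the identifier git itself assigns to empty files.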
Now it's time for the less easy part: writing a tree, or a folder if you prefer. Each line in the tree object will be the mode of the file, a space, its name, a null byte, and the hash stored in raw binary. We'll also create a constant that will hold all of our ignored folders. We don't want to be actually committing the git database, the package folder, or even our own commit implementation. Then we can list every file or subfolder in the target folder and sort them, as is required by git for some reason.

I will do a quick stop here and talk about recursion. Let's take this example. The goal is to make a function that takes two arguments, the first one being the level of nesting of the returned list and the second one being the length of each list. We'll start with a simpler version. Here we just want to generate a list of duckies of a certain length. This is quite simple: we just make a for loop, add the string that many times to the list, and return it. Let's do a second version. We now want a nested list. The second argument will be a Boolean saying if the list should be nested or not. If the second argument is set to False, we can just keep the same logic as before. But if it is True, we could use two nested loops to represent this nested structure, right? But that would just be repeating the same code once again. So what if we instead make the function call itself, with the second argument set to False, to return the non-nested list, and add it to a larger list? This way you will have the nested structure you want. Now we can move on to our final step, generating an arbitrary amount of nested lists. Well, we use almost the exact same logic. Instead of using a Boolean, we use an integer saying how many levels are still left, and decrease it by one each time we call the function. If we are on the last level, one, we can just return our final list.
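The final version of the duckies function described above could look roughly like this. It is a sketch: the names nested_duckies and count_duckies are assumptions, and the helper that counts the duckies is added only to make the result easy to check.

```python
def nested_duckies(levels: int, length: int) -> list:
    """Return a list nested `levels` deep in which every list has `length` items."""
    result = []
    for _ in range(length):
        if levels == 1:
            # Last level: the items are plain duckies.
            result.append("duck")
        else:
            # Otherwise each item is itself a list with one level less.
            result.append(nested_duckies(levels - 1, length))
    return result


def count_duckies(nested: list) -> int:
    """Count the duckies at the bottom of a nested list (also recursive)."""
    total = 0
    for item in nested:
        total += count_duckies(item) if isinstance(item, list) else 1
    return total
```

Calling nested_duckies(2, 2) gives [["duck", "duck"], ["duck", "duck"]], and count_duckies(nested_duckies(3, 3)) gives 27, since three lists of three lists of three duckies are produced.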
If that isn't clear, here is a visualization of what happens when you call the function with the arguments 3, 3. The function will first call itself three times with the arguments 2 and 3. Each of those calls will then call itself again with the arguments 1, 3. And those last calls will each yield a list of three duckies, creating this nested structure of 27 duckies. Now the question you may be asking is: why the heck did I talk about this crazy technique? Well, if you look at it this way, our list is quite similar to folders, don't you think? We can have an arbitrary amount of nested folders, and we want to handle that properly.

So let's get back to our code. Here we have a list of every child of the starting folder. We can iterate over each of them. If it is part of the ignored paths, we just move on to the next iteration using the continue keyword. If it is a directory, we call the write tree function once again, making a recursive call. This means that if there is another directory inside another one, inside another one, we will still be able to handle them. If it is a file, we simply call write blob. As you may have noticed, I store each time the hash of the new object and a mode value. Once they are saved to the database, we can create a raw bytes object from the 40-character hash, which yields a 20-byte array, generate the line as we saw previously, and add it to the list. Once all of our files are processed, we can stick all the lines together, write the object, and return the hash.

This is our last function that interacts with the git database. It is the one that stores a git commit, which is really quite simple. We write down the tree of the current folder. We create the commit according to the template we saw earlier, and a few constants we just defined, encode it into a bytes object, write it, and return the hash. The f-string I'm using here is just to make my life easier when templating.
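The tree and commit steps just described can be sketched as follows. This is an illustrative sketch under several assumptions, not the speaker's script: the names IGNORED, write_object, write_tree and write_commit, the ignore list, and the author and timestamp parameters are all hypothetical (the talk uses constants instead). It also simplifies real git behaviour: every file gets mode 100644, empty folders are stored rather than skipped, and a plain sort is used where git has its own ordering rules for directories.

```python
import hashlib
import zlib
from pathlib import Path

# Hypothetical ignore list; the exact entries used in the talk are not known.
IGNORED = {".git", "__pycache__"}


def write_object(type_: str, content: bytes, objects_path: Path) -> str:
    """Store a git object (same layout as described earlier) and return its hash."""
    data = b"%s %d\0%s" % (type_.encode(), len(content), content)
    hash_ = hashlib.sha1(data).hexdigest()
    folder = objects_path / hash_[:2]
    folder.mkdir(parents=True, exist_ok=True)
    (folder / hash_[2:]).write_bytes(zlib.compress(data))
    return hash_


def write_tree(folder: Path, objects_path: Path) -> str:
    """Recursively store a folder as a tree object and return its hash."""
    lines = []
    # Simplification: plain sort by path, not git's exact ordering rules.
    for path in sorted(folder.iterdir()):
        if path.name in IGNORED:
            continue
        if path.is_dir():
            # Recursive call: nested folders become nested trees.
            hash_, mode = write_tree(path, objects_path), b"40000"
        else:
            hash_ = write_object("blob", path.read_bytes(), objects_path)
            mode = b"100644"  # assumption: regular, non-executable file
        # mode, space, name, null byte, then the hash as 20 raw bytes.
        lines.append(b"%s %s\0%s" % (mode, path.name.encode(), bytes.fromhex(hash_)))
    return write_object("tree", b"".join(lines), objects_path)


def write_commit(folder: Path, message: str, author: str, timestamp: str,
                 objects_path: Path) -> str:
    """Store a parentless commit pointing at the tree of folder."""
    tree = write_tree(folder, objects_path)
    identity = f"{author} {timestamp}"  # e.g. "Name <email> 1600000000 +0000"
    commit = f"tree {tree}\nauthor {identity}\ncommitter {identity}\n\n{message}\n"
    return write_object("commit", commit.encode(), objects_path)
```

With this sketch, an empty folder produces the tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904, which is the identifier git uses for the empty tree.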
Building the commit as a string and encoding it afterwards is totally safe, since all the characters inside the commit object will always be valid UTF-8 characters, or at least we can assume that they will be. This assertion isn't true for other objects, where we could have run into a decoding error.

We're close to the end of the script, so all that's left to do is to create our script interface. We want to be able to call our script on the command line by providing a commit message as an argument. Each of those arguments, including my_commit.py, will be placed in a list in sys.argv. One noteworthy thing is how the shell handles parameters: it will normally split them at spaces unless you put quotes around them, like in this simple example. We don't want to have to put quotes around the message, because we simply don't have any reason to do so: we don't have any special flags, and we don't have several parameters. Handling that is super easy: we just have to join the arguments with spaces.

A good idea to follow when creating a script is to have your main code in an if __name__ == "__main__" branch, with all of the underscores. I call that the main guard. You may have already seen it. To understand what it really does, we need to understand what the __name__ dunder represents. We can write two little scripts, with start being run and imported being imported. As you can see, inside imported, the value of __name__ will be "imported" itself. But in start, it will be the dunder string __main__. What this dunder actually holds is always the name of the module, except for the entry point, in which it will be the string __main__. The goal of this weird-looking branch is this: imagine that I want, for any reason, to make a new script that relies on the little functions we just created. We will want to import our script, and this branch will evaluate to False and prevent our script from actually triggering a new commit, because that simply isn't what we want. We just want to access the utility functions. So let's get back to our code.
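The argument joining and the main guard described above can be sketched like this. The function name message_from_args is an assumption made for illustration, and the guarded branch only prints the message here instead of creating a commit.

```python
import sys


def message_from_args(argv: list) -> str:
    """Join every argument after the script name into a single message."""
    # argv[0] is the script name (for example "my_commit.py"); the rest are
    # the words the shell split at spaces.
    return " ".join(argv[1:])


if __name__ == "__main__":
    # Runs only when this file is the entry point, not when it is imported.
    print(message_from_args(sys.argv))
```

Saved as my_commit.py and run as python my_commit.py Example commit, this prints "Example commit" without any quotes being needed, while importing the file from another script only makes message_from_args available.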
We will add an import sys at the beginning of our file, and add a main guard. We'll start by checking that we actually have a second argument. A good rule of thumb to follow when making a command-line tool is that errors should be handled gracefully. Imagine you want to use this tool you built two months ago, and you get a cryptic IndexError, and only by diving into the source code are you able to understand what actually happened. That's why error handling is so important. If there is an error, we follow the Unix standard by writing the script name followed by a colon and the error message. After that, we exit with a non-zero code, signalling to the caller of the script that it failed. Then we can simply create the message from the arguments, make the commit, and print a message. We're finally done with our script.

Let's go for our final test. Here is our original commit, with the hash starting with 76. We will delete our .git folder, initialize the database, and run our script. As you can see, we have the same commit hash, meaning that the database is the same: our job has been successful.

In reality, creating this script took me less than an hour, from research to the actual finished product. I would just like to take a minute to discuss how we structured the script and why we chose Python to do so. In the screenshot, you can see our entire code zoomed out. Each little part of our script is divided into more or less tiny functions. This way, if I ever write another script that needs to access the git internals, I can just grab some functions from this script and save time. This is the main reason we added the main guard: to be able to reuse our utility functions. Similarly, it will be easier to navigate if you ever go back to your code and have to change it. Modularity is more often the key than it isn't. Code documentation is also important, both through docstrings and actual comments.
I would also recommend you follow a code style such as PEP 8 to have a common style across all of your files. Type hinting is also a good way of communicating what a function expects and returns. All of this combined will allow you to still understand your code weeks after you wrote it, which in my opinion is quite important for small scripts or even larger projects.

The last thing I would like to stop on is why choose Python, out of all the possible languages. Well, simply because we aren't at a Ruby conference, it is as simple as that. I'm just kidding. For real, Python is an excellent choice for quick scripting in my opinion, because it is fast to write. You don't have to stop and wonder what kind of computer science madness you are going to use to make the compiler accept your types or anything. Duck typing is quite powerful here. Additionally, having an interactive shell is quite perfect to explore your data, like we did at the beginning with the git blob. Plus, it is a language we are all more or less familiar with here. All of this makes Python a very powerful language in my very own opinionated opinion. That's it for me. Thank you for listening to me today at EuroPython 2021. You can scan this QR code to access the final code, or I will post the link in the Matrix chat in a few minutes.

Thank you so much, Matteo, for the talk. I have one question. Why did you build this in the first place? What were you trying to automate? Well, this is more of an example. The script doesn't do anything useful. Let's say it is just a way to have fun with the git database, because I feel like looking at the internals of an existing tool is also very interesting. I often do that with Git, Docker and stuff like that. And it is also quite short to do. We could fit it in 90 lines, which is quite good for a talk in my opinion. The best way to learn about something is to write a client for it. Awesome. Thanks again for coming.
We will be going to the break now since there are no more questions. Thanks again, Matteo. Thanks for your time.