OK, it's the top of the hour. Can everybody hear me? OK, thank you. Hello, everyone, and welcome to another deep dive from the Create team. I'm Francisco Javier Lopez, Fran for short, and I'm a senior backend engineer in the Create team. Today we are going to talk about Git LFS. There are several topics related to Git LFS we could talk about, so we are focusing on what Git LFS is, how you can use it, the specification behind it, and how we have built our own LFS server here at GitLab. This is the table of contents, essentially what I have just said.

First of all, I think you all know this, but Git doesn't track binary files like audio, video, or image files the same way it does text files. Any change to a binary file requires pushing a new copy of the whole file to the repository. Yes, Git can store binary files, but with a big binary file, if we make some modifications or updates to it, the repository can grow really quickly, and the history will also grow bigger and bigger. That means that usual operations like clone, fetch, or pull will get slower. OK, let me change the slide.

So what is Git LFS? Git LFS is an open source project. It was developed by several companies — I think GitHub, Atlassian (Bitbucket), and some more. Git LFS is a Git extension, so it's not part of core Git, and when you install Git you don't get this extension by default. It's also not just a set of tools: it defines a specification which we can use to create our own Git LFS client. Git LFS replaces binary objects with text pointers. These text pointers go into the Git repository, while the binary files go to the LFS server.

Let me show you something really, really quick. This is a repo with only one file and two commits. The first commit is where I added the image, and the second commit is where I updated the image. If we take a look at the size of this repository, the objects directory is almost 4 gigabytes, and the file size of the image is almost 2 gigabytes. That means the two changes pushed two new copies of the binary file to the repository. Now let's take a look at a repository with Git LFS. I have the same image and did the same thing, and as you can see, the repository size is only kilobytes, because we only store text pointers there. The binary file is in a different storage.

This is what a Git LFS pointer looks like. There are only three mandatory params. The first one is the version, a URL that identifies the spec used to generate this text pointer. The second param is the OID, which is a hash identifier of the file; at the moment only SHA-256 is supported, and obviously two identical files always get the same OID. Finally, we also have the file size in bytes. In a repository with LFS enabled, we can ask Git LFS to show us what the pointer for a file is going to look like — this is not a pointer already stored in the repository, it is what the pointer of this file will look like after Git LFS does its magic. You can see an example below.

But how does Git LFS work? First of all, we need a new LFS entry in our Git repository config file. This entry stores the URL of the LFS server. So now we have two different concepts: we have the usual remote for the Git repository, and we also have this new LFS entry for the LFS server.
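For illustration, this is roughly what the pointer and the config entry look like; the file name, hash, size, and URL here are made up, and the command output is trimmed:

```
# Show the pointer a file would get, without committing anything
$ git lfs pointer --file=photo.png
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 1923893843

# The LFS server URL stored in the repository config (example value)
$ git config lfs.url
https://gitlab.com/my-group/my-project.git/info/lfs
```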
By default, when you install the Git LFS extension, the LFS URL is going to be the Git repository URL. As you can see, here is the entry pointing to my LFS server. OK, then, once we have installed the Git LFS extension, we have to tell it which files to track. It is really easy; it uses a pattern-matching style. So if we want to track only PNG files, we say `git lfs track` with that pattern. It only works for new files, so if we already have existing PNG files in the repository, we have to run `git lfs migrate` first. Once we execute `git lfs track`, a new file called .gitattributes will be created in the repository, and that file has to be pushed to the repository as well. In that file we store the filters that Git LFS will use to decide which files to replace with text pointers. It doesn't check whether the file is indeed a binary file — you can track text files if you want; it's just pattern matching. Git LFS also provides file locking capabilities. That means that if you set a lock on a file and push that lock to the repository, nobody else will be able to update that file in the repository until you remove that lock. You can see an example of these commands below.

OK, but how does Git LFS do all of this? Through Git hooks. It's really simple. If we go to this repository, we now have some default hooks. Let's check, for example, the pre-push hook. As you can see, there is a warning first, and then we call Git LFS with the command. So under the hood, each of these hooks calls the git lfs command.

How does this work conceptually? First, we perform the git push. That git push gets to the LFS hooks, and those LFS hooks detect whether you have any files to track. If they find any of those files, they generate a pointer and replace the file with that text pointer. So the pointers go to the repository, and the real binary files are sent to the LFS server. In a git pull, the process is quite similar: we perform the git pull, the repository provides the pull data and the LFS pointers, this information goes to the LFS hooks, the LFS hooks access the LFS server with all the LFS pointers they have found, the LFS server returns the binary files, and with this information we return the pull data and the binary files to the user.

About authentication: I have to say that Git LFS, although I think it's almost three or four years old — I don't remember — still uses HTTP Basic authentication. That means that for security reasons we should use HTTPS, or use SSH. OK, but where do these credentials come from? They can come from the Git remote or the LFS URL — I mean, you can edit the config file of the repository and put your credentials there. They can also come from the usual Git credential keychain. And if the remote is over SSH, the process is a little bit different. When we perform the pull, for example, Git LFS will connect to the repository through SSH and run the command git-lfs-authenticate. This command is handled by GitLab Shell in our case. GitLab Shell will then connect to the internal API, to the lfs_authenticate endpoint, and that endpoint will return a token. That token has an expiration time inside it. With that token we build a response whose header section contains an authorization header. Then, when Git LFS sees an authorization header, it will use that header in every one of the following requests.
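As a minimal illustration of the tracking and locking commands described above — the patterns and file names are just examples, and the hook contents are trimmed:

```
# Tell Git LFS which files to track (pattern matching only)
$ git lfs track "*.png"
Tracking "*.png"

# The filter ends up in .gitattributes, which must be committed and pushed
$ cat .gitattributes
*.png filter=lfs diff=lfs merge=lfs -text

# Existing PNG files have to be migrated first
$ git lfs migrate import --include="*.png"

# File locking
$ git lfs lock images/photo.png
$ git lfs unlock images/photo.png

# The hooks installed by `git lfs install` just delegate to git-lfs,
# e.g. .git/hooks/pre-push (simplified):
#   command -v git-lfs >/dev/null 2>&1 || { echo >&2 "git-lfs was not found on your path"; exit 2; }
#   git lfs pre-push "$@"
```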
In our case, the authentication is handled by the GitHttpClientController. Basically, what it does is check whether the request has an authorization header. It decodes the content of that authorization header, giving us the login and the password, or the login and a token. With that information, it calls the method find_for_git_client in the Gitlab::Auth class, which iterates over each of the authentication methods we already have in order to get a successful response for those credentials, and then it returns a Result object.

OK. Git LFS defines two main APIs. One is the batch API, used to handle objects — binary files. The other one is the file locking API, used only for locks. One interesting thing about the batch endpoint is that it is used only to request the ability to get information about LFS files; this concept of requesting the ability is very important. If the user is not authorized, of course, no information about the LFS files is sent. If everything is OK, the response contains information about where we can find the LFS file. The actual access to the LFS file happens against a different endpoint, not through this one. This endpoint is used for both uploading and downloading, and any request or response involved in this Git LFS process must have these headers — this MIME type is totally mandatory.

OK, this is what a request to the batch API looks like. We have an operation field, which can only be upload or download. We have the transfers field, which is a list of the client's transfer adapters — only basic is used at the moment. We also have the ref, which is a reference in the repository, and the main param, objects, which is an array of LFS objects to interact with. So in this example, what we want to do is upload this object; this is how we tell the batch endpoint, "hey, I want to upload this object, how do I do it?" Then, in the response, we have the transfer again, which is the same transfer adapter we used in the request, and the main field, objects, which is a list of objects with information: the OID, the size, authenticated, which indicates whether the request is authenticated or not, and finally actions. In actions we see which operation can be performed: href, which is the URL where you can access the LFS file; header, which will usually be used to store the authorization headers; and finally expires_in or expires_at, which indicate when the transfer will expire. So in this case, what this response says is: if you want to download this LFS file with OID 1111, you can access it at the href, https://gitlab.com/..., OK? You can see a request and response example below.

OK, how do LFS downloads look in GitLab? First, the client — the smiley face, the Git LFS client — performs a POST request to this endpoint. That request reaches the LfsApiController. That controller checks the authorization, and if everything is OK, it returns a response saying: if you want to download, here is the href where you can access the LFS file. Then the client performs a GET request to the URL found in the response, and that goes to the LfsStorageController. Again, that controller authenticates the user, and if the user is authenticated, it sends a response with the X-Sendfile header. The content of that header is the path on disk of the LFS file. This is important because we are not sending any binary data from Rails to the Git LFS client.
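For reference, a batch API exchange along the lines described above looks roughly like this; the OIDs, sizes, and URLs are made up, and the mandatory MIME type is application/vnd.git-lfs+json:

```
POST /my-group/my-project.git/info/lfs/objects/batch
Accept: application/vnd.git-lfs+json
Content-Type: application/vnd.git-lfs+json

{
  "operation": "download",
  "transfers": ["basic"],
  "ref": { "name": "refs/heads/master" },
  "objects": [
    { "oid": "1111111111111111111111111111111111111111111111111111111111111111", "size": 123456 }
  ]
}

HTTP/1.1 200 OK
Content-Type: application/vnd.git-lfs+json

{
  "transfer": "basic",
  "objects": [
    {
      "oid": "1111111111111111111111111111111111111111111111111111111111111111",
      "size": 123456,
      "authenticated": true,
      "actions": {
        "download": {
          "href": "https://gitlab.com/my-group/my-project.git/gitlab-lfs/objects/1111...",
          "header": { "Authorization": "Basic ..." },
          "expires_in": 3600
        }
      }
    }
  ]
}
```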
We only send a response with the X-Sendfile header set. That response is intercepted by Workhorse. Workhorse sees that the X-Sendfile header is set, takes the content of that header to access the file, and finally streams it to the Git LFS client.

OK, and how do uploads work? The same way: we perform the POST request to the LfsApiController, we authenticate the request, and it returns a response with an href URL. Then the Git client performs a PUT request to that href. That request is intercepted by Workhorse. Workhorse then calls Rails to get authorization to perform that upload. If everything is OK and Rails says it's OK to upload that file, Workhorse uploads the file and saves it to disk. And when the upload has finished, Workhorse sends a request to Rails again saying, "hey, I have finished the upload." That request doesn't have any binary data inside — the file has already been saved to disk — it only has the OID, the size, and basic information. So in the finalize method of the LfsStorageController, we check whether we already have this LFS object in the database. If we do, we link it to the current project; if not, we create that record in the database.

And now the file locking API. OK, this is really simple. This API has four endpoints. The first one, with the POST method, is to create a lock. This is important because it is a pretty new feature, and only single-branch locking is supported — that means that if I lock a file in master, other users won't be able to update that file in that branch. We also have an endpoint to list all the locks, and another endpoint, the unlock endpoint, to remove locks. This one is interesting because usually what you expect is that you should remove only your own locks. In the request to this endpoint we can set the param force to true, which means the user wants to remove a lock owned by another user. We allow that only if the user who makes the request is a project maintainer — and it makes sense, right?

OK, another endpoint, and I think the final one, is really interesting: the verify endpoint. This endpoint is used to check whether any file in your git push matches any of the locks set in the repository. Usually, when you use Git LFS, a warning is raised saying, "hey, if you want to enable this functionality, please execute this command" — so you have to enable it explicitly. The response returned by this endpoint is basically split into two fields, ours and theirs. The field ours stores the locks created by the user who makes the request, and theirs will hold the locks owned by other users. So when you perform a git push, if any of the files matches any of the locks of the current user — the ones in ours — the git push will succeed, and after the push Git LFS will display the locks involved in that push, to tell you, "hey, these locks are present in this push; maybe you want to remove them, or at least be aware that they exist." The second case is when any of the files matches a lock owned by another user: then the git push is halted, which is the expected behavior in this case. OK, so let's take a look at the response. As you can see, it's a very simple response: ours is an array of the current locks owned by the user who performs the request, and theirs holds the locks owned by other users — there is an example of this exchange below. And now let's look at the code.
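A sketch of the locks verify exchange described above; the paths, IDs, names, and timestamps are invented for illustration:

```
POST /my-group/my-project.git/info/lfs/locks/verify
Accept: application/vnd.git-lfs+json
Content-Type: application/vnd.git-lfs+json

{ "ref": { "name": "refs/heads/master" } }

HTTP/1.1 200 OK
Content-Type: application/vnd.git-lfs+json

{
  "ours": [
    { "id": "1", "path": "images/photo.png", "locked_at": "2018-10-01T10:00:00Z",
      "owner": { "name": "fran" } }
  ],
  "theirs": [
    { "id": "2", "path": "video/intro.mp4", "locked_at": "2018-10-02T11:00:00Z",
      "owner": { "name": "someone-else" } }
  ]
}
```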
I want to start with the routes, OK? In the git_http routes, you have these three groups of routes. The first one is for the batch API, the second one for the locking API, and the third one for the object storage. So, as you can see, in the first one, the LFS API, we have the batch route — these other ones are deprecated. In the locking one, we have the locks, the unlock, and the verify routes. And here, in the LFS object storage, we have the route to download LFS files, and we also have the upload authorize and upload finalize routes used in git pushes — those are called by Workhorse.

So let's take a look at the LfsApiController. Before anything else, what do we have to do? We have to authenticate the user. That functionality lives in the parent class, GitHttpClientController. Here we have the authenticate_user hook, and this is where we handle the authentication of LFS requests, specifically here. As we saw in the slides, we call Gitlab::Auth and the method find_for_git_client, and we pass the login and the password that we want to use for that authorization. The result is a Result object, which will allow this operation or not.

OK, let's suppose the user is authenticated and everything is OK. We are then going to check whether the request is a download request or an upload request. How do we do it? Just by checking the operation param. Remember how our request looks for the batch API: we have the operation, the transfers, the ref, and the objects — this is what we are checking here. So the process is really simple. If what the user wants is to download objects, we iterate over each of the objects in the request and then call the download action method. This basically creates a hash that we will use in the final response, with the project URL and then this route. You have to notice that this route is the same route we have configured here in our routes for the LFS object storage. The same goes for uploads: we iterate over each of the objects and call the upload action method, and in that method we generate the proper response for that action, again with the same LFS object storage URL. Notice that the URL we return for a download and the URL for an upload are different. In uploads, we also have to provide the size. Why? Because we use the size to check whether the file provided in the upload has the same size as the file we said we were going to upload. A simplified sketch of this controller logic is shown below.

OK, what else do we have? Let's take a look at the LfsStorageController. Remember, that controller is this one — it's the one that serves downloads from, and allows uploads to, the LFS object storage. So this is the one the Git clients talk to when sending files and getting files — mainly when you download. When you upload, you don't really talk to this endpoint; only Workhorse talks to this endpoint. But yes, you use the same URL. As you can see — let me go really quickly — Workhorse uses the same URL, only with the authorize suffix added or, in this case, without it. But only Workhorse talks to the LfsStorageController for uploads. First of all, what do we have to do when somebody reaches this controller? Authentication, the same way the LfsApiController did it. This class is also a child class of GitHttpClientController, so it's the same method again.
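Here is a simplified, standalone sketch of how the batch endpoint builds its response, as described above. This is not the actual GitLab controller code; the module name, helper names, and the `gitlab-lfs/objects` route prefix are illustrative assumptions:

```ruby
require 'json'

# Simplified sketch of the batch API response, NOT GitLab's real implementation.
module LfsBatch
  OBJECT_STORAGE_PATH = 'gitlab-lfs/objects'.freeze # assumed route prefix

  # request is the parsed batch API body: operation, transfers, ref, objects
  def self.response_for(project_url, request)
    objects = request.fetch('objects', []).map do |object|
      actions =
        if request['operation'] == 'download'
          # Download URLs only need the OID
          { 'download' => { 'href' => "#{project_url}/#{OBJECT_STORAGE_PATH}/#{object['oid']}" } }
        else
          # Upload URLs also carry the size, so it can be checked when the upload finishes
          { 'upload' => { 'href' => "#{project_url}/#{OBJECT_STORAGE_PATH}/#{object['oid']}/#{object['size']}" } }
        end

      object.merge('authenticated' => true, 'actions' => actions)
    end

    { 'transfer' => 'basic', 'objects' => objects }
  end
end

request = {
  'operation' => 'download',
  'transfers' => ['basic'],
  'objects' => [{ 'oid' => '1111...', 'size' => 123_456 }]
}

puts JSON.pretty_generate(
  LfsBatch.response_for('https://gitlab.com/my-group/my-project.git', request)
)
```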
The LfsStorageController calls the authenticate_user hook, which in turn calls the Gitlab::Auth class. So, if the user is authenticated, we check whether the LFS file exists, and if it exists, we call send_upload — basically send file upload; this is a concern. It will reach this file — this line, sorry. In this file, we tell Rails to call the send_file method with this path. This method — well, not this method exactly, let's say this method does that — sets the X-Sendfile header in the response. So here we are not sending any data to the Git client; we are sending the path of the binary file to Workhorse. Then Workhorse, with that path, is the one that opens the file and streams it to the final user.

OK, now we have two methods: upload_authorize and upload_finalize. In upload_authorize, Workhorse is going to say, "hey, somebody wants to upload this file, what do we do? Can I do it? Can you authorize it?" And based on the response, we allow it or not. Once the upload has finished, upload_finalize is called, again by a request sent by Workhorse. We go to store_file, and what it does is first check whether the object exists; if it does, we call link_to_project. What link_to_project does is check first whether the LFS file already exists in the database: if it exists, it links the LFS file to the project, and if not, it creates the LFS record in the database. This is because what we have is a many-to-many relationship: there is only one LFS object record, but it can be linked to different projects, in order not to repeat the same record in the database. So the main problem here is that if we delete a project, we only delete the relationship between the project and the LFS object. If all the linked projects are deleted, the LFS binary file would remain there forever. That's why we have a service that runs periodically to clean up orphan LFS objects. There is a small sketch of this linking logic below.

And finally, let's go to the LfsLocksApiController. This is a really simple controller. We have the basic operations — create and index — and then we also have verify and unlock. Create is used to create the lock, unlock to remove the lock, index to list them, and verify to check whether any locks exist for the push. Here you can see that the result of this finder is split into the ours and theirs parts of the response, and that response is what we saw in the slides.

I mean, the Git LFS specification is, for now, really simple, basically because there isn't much functionality yet. For example, we can't lock a file across branches, and we can only hash using SHA-256. But as you can see, it's really easy to create your own Git LFS client and your own Git LFS server, in this case. OK, and I think that's all I wanted to talk about, because I could go into detail on LFS objects and LFS locks, but they are pretty simple. So I think I prefer, if you have any questions, to talk about them and answer them. So let me stop. OK, let me check the chat real quickly.

Yeah, LFS, OK — we still write to disk for temporary storage, even when we are configured to use object storage; is this your understanding as well? Yes, I think so. I mean, I'm not an expert here, because I don't know what Workhorse does under the hood, but that's what I think. I can't solve your doubt about HA configurations, sorry, Gert — I have no idea. [inaudible] OK, OK, now it's in the document.
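Referring back to the upload finalize step, here is the small standalone sketch of the "one LFS object, many projects" linking logic mentioned above. It is not GitLab's actual model code; the class and data structures are purely illustrative:

```ruby
# Illustrative in-memory model of linking LFS objects to projects.
LfsObject = Struct.new(:oid, :size)

class LfsRegistry
  def initialize
    @objects = {}                             # oid => LfsObject
    @links   = Hash.new { |h, k| h[k] = [] }  # project => [oid, ...]
  end

  # upload finalize: find or create the object record, then link it to the project
  def link_to_project(project, oid, size)
    @objects[oid] ||= LfsObject.new(oid, size)
    @links[project] << oid unless @links[project].include?(oid)
  end

  # Deleting a project only removes the link, never the object itself
  def delete_project(project)
    @links.delete(project)
  end

  # A periodic cleanup job is needed to drop objects no project points at anymore
  def orphan_oids
    linked = @links.values.flatten.uniq
    @objects.keys - linked
  end
end

registry = LfsRegistry.new
registry.link_to_project('project-a', 'aaa111', 100)
registry.link_to_project('project-b', 'aaa111', 100) # same object, second project
registry.delete_project('project-a')
p registry.orphan_oids # => [] (project-b still links it)
registry.delete_project('project-b')
p registry.orphan_oids # => ["aaa111"] (now an orphan; the cleanup service would remove it)
```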
Yes, there will be a recording on YouTube, I think. I think that's an interesting question. I mean, if the token expires — every request has the authorization header, so in that case I think it will have to re-authenticate. Yes. How many files are stored in LFS? OK, you are asking and answering yourself, Christian — you can talk if you want — because it will be, I don't know, a very large number. Yeah, it's a simplification, yes. I mean, it's not really how it works internally, but it's a conceptual simplification of how it works. I mean, it would be interesting — I don't know if you would be interested in knowing how the Git LFS client specification works, because we don't have to deal with that yet at GitLab. I mean, if you are interested, I'm happy to help you with that. OK, any other question I can reply to? So if not, I think we're done. Thank you very much for attending. I hope you enjoyed it. See you all soon, OK? Bye, everybody. Thanks, Fran.