Is this on? OK. My name is Lars Schneider. I'm the technical lead for Git at Autodesk, and I'm also a Git and Git LFS contributor. Today I want to talk about Git LFS, or how to handle large Git repositories. But first of all, let's discuss what a large Git repository is. Git repositories can grow large in different dimensions. One of them is the number of files that they contain in their head revision. Why is this a problem? Well, usually Git looks at every file in your repository in order to detect if you have changed that file. If you have, let's say, more than 100,000 files in your repository, then this process starts to get a little bit slower, so Git is not as snappy as you know it. There are several solutions to that problem. One that is already built into Git is a concept called sparse checkout. With sparse checkout, you tell Git to only look at certain directories, so it ignores the others. Another approach to that problem was just recently introduced by Microsoft. It's called GVFS, the Git Virtual File System. GVFS basically turns this problem around. Instead of Git asking the file system what has been changed, the file system more or less talks to Git and helps Git to figure out more quickly what files have been changed. This way, Microsoft was able to handle Git repositories with millions of files in a very responsive manner. Another way Git repositories can grow large is by the number of commits, because many Git commands look at the history of your repository. For instance, the very popular git blame command that you can use to figure out who changed a certain line in your source code. But usually you don't run into this problem, because in my experience you need at least 100,000 commits before you even notice a speed bump. And even when I talk about speed bumps in Git, that really means a command maybe takes half a second longer.
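The sparse checkout idea mentioned above can be tried out in a throwaway repository. The following is a minimal sketch, assuming a POSIX shell and a stock Git installation; the directory names app and docs are made up for illustration. It uses the long-standing core.sparseCheckout setting; newer Git versions wrap the same mechanism in the git sparse-checkout command.

```shell
# Sketch: limit the working tree to one directory with sparse checkout.
# The repository layout (app/, docs/) is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p app docs
echo 'int main(void) { return 0; }' > app/main.c
echo 'manual' > docs/manual.txt
git add .
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'initial import'

# Tell Git to keep only app/ in the working tree.
git config core.sparseCheckout true
mkdir -p .git/info
echo 'app/' > .git/info/sparse-checkout

# Re-read HEAD so the working tree is updated to match the sparse patterns.
git read-tree -mu HEAD

# docs/ is still in the history, but no longer checked out.
ls
```

After this, Git only has to examine the files under app/ when it looks for local changes.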
But the real problem that I want to talk about today with large Git repositories is large files. So why are large files a problem for Git? In order to make this a little bit more tangible, I want to use this 100 megabyte video file as an example. Let's take this file and see what happens when we put it into a Git repository. So here we have an empty Git repository. We just called git init, and the size is just 0 megabytes, right? So we add a bunch of source code files here that in total are roughly 1 megabyte, and we add our large video file with 100 megabytes. We commit that change, and after the first commit, our repository has grown to a size of 101 megabytes. Then we change our mind and we change the color of the video file. We add that file and make another commit, and this adds 100 more megabytes to our repository. Why is that the case? Well, a Git repository always contains every version of a file in the history. That's why the repository grew by 100 megabytes: we now have the second version of the video file in our repository. And of course, this is also true if we change our video file once more. Then we are already at 301 megabytes. You can imagine that at companies that have hundreds of developers working on projects with thousands of large files, Git repositories can grow even faster to an even greater size. And this is a problem. Why? Because when you clone a Git repository, you need to transfer all of the history. If you have many engineers, then all these engineers need to transfer the history to their local machines. So they need space on their local machines, and they need to transfer all the data through your company network. So basically, if you have large Git repositories and many engineers, you will need a lot of bandwidth. That's one of the biggest challenges that I've seen in my company that impedes the adoption of Git. And Git LFS is a way to solve that problem. So how does Git LFS solve the problem?
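The growth described above can be reproduced at a smaller scale. This is a rough sketch, assuming a POSIX shell with /dev/urandom available; a 1 MiB file of random bytes named video.bin stands in for the 100 megabyte video, since random data compresses about as badly as encoded video does.

```shell
# Sketch: watch a repository grow as a poorly compressible file changes.
# video.bin (1 MiB of random bytes) is a scaled-down stand-in for the video.
repo=$(mktemp -d)
cd "$repo"
git init -q .

for version in 1 2 3; do
    # Each iteration simulates a new edit of the video.
    head -c 1048576 /dev/urandom > video.bin
    git add video.bin
    git -c user.name=Demo -c user.email=demo@example.com commit -q -m "video version $version"
    # Size of the repository data in KiB after this commit.
    du -sk .git
done

size_kb=$(du -sk .git | cut -f1)
echo "repository size after three versions: ${size_kb} KiB"
```

Each commit adds roughly the full size of the file, because Git keeps every version and cannot shrink data that is already compressed.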
Well, let's look at the very same situation, but with Git LFS enabled. So again, we add some source code files and we add our large video file. Git LFS will detect this video file and upload it to an LFS server. The file itself is not added to the Git repository. Instead, Git LFS adds a pointer file that is very small, usually just three lines, and that identifies the large file on the LFS server. So in total, after the first commit, our repository has a size of only one megabyte of source code. This might sound complicated, but Git LFS handles all these things for you. As a user, you interact with the Git repository as you would with any normal Git repository. So if we change our mind again and change the color of the video file, and then add the video file to the repository, Git LFS will detect that, upload the file to the LFS server, and adjust the pointer file in our repository. So after the second commit, our repository still has a size of just one megabyte. And of course, this is also true for the third commit. So as you can see, if one of our engineers clones this repository, then only one megabyte of data needs to be transferred. But of course, if you clone this repository, you have these pointer files on your machine. So on checkout, Git LFS detects that there are certain pointer files that you want to have right now and resolves them. That means it downloads the actual content from the LFS server and places it on your machine in the right location. So if we clone the repository and check out the master branch, then we need to transfer 101 megabytes in this example. And that also means we don't need to download the two previous versions. So in just this simple example, we already saved 200 of 301 megabytes, or roughly two thirds of the bandwidth.
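To make the pointer file less abstract, the sketch below builds one by hand. It follows the pointer format from the Git LFS specification (a version line, the SHA-256 of the content, and the size in bytes). It assumes GNU coreutils for sha256sum (on macOS, shasum -a 256 is the usual substitute), and the file names video.bin and video.pointer are made up. In normal use Git LFS writes this file for you; nothing here talks to an LFS server.

```shell
# Sketch: construct the kind of three-line pointer that Git LFS commits
# in place of a large file. File names are illustrative.
work=$(mktemp -d)
cd "$work"

# 1 MiB of random bytes as a stand-in for the video.
head -c 1048576 /dev/urandom > video.bin

# The pointer identifies the content by its SHA-256 hash and its size.
oid=$(sha256sum video.bin | cut -d ' ' -f 1)
size=$(wc -c < video.bin | tr -d ' ')

printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' \
    "$oid" "$size" > video.pointer

cat video.pointer
pointer_lines=$(wc -l < video.pointer | tr -d ' ')
pointer_bytes=$(wc -c < video.pointer | tr -d ' ')
echo "pointer is ${pointer_bytes} bytes for a ${size} byte file"
```

The pointer stays at a little over a hundred bytes no matter how large the tracked file is, which is why the repository in the example stays at one megabyte.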
And you can imagine that if you have really large repositories in big corporations, then the bandwidth savings are even greater. So this is the schematic view of how Git LFS works, but how would you use it as a user? Let's look at this. Here we have a normal Git repository. We have one source code file, code.cpp, and one readme file. And now we generate our large video, which we place in our directory tree here. The first thing that we need to do, and we only need to do this once, is tell Git LFS what files should be handled with LFS. We do that by using the git lfs track command. Usually people use it in such a way that they define a certain extension to track with LFS. And this is an important detail: files are tracked with LFS based on their file name, so the file size is not really relevant. This is a consequence of the way Git LFS is integrated into Git. When we call this git lfs track command, it generates a .gitattributes file. This file helps Git to understand what files are actually tracked with LFS. After we have tracked the files, we can just use the ordinary git add command to add our files to the staging area for the next commit. One thing that is important here: we not only add our video file, we also need to add the .gitattributes file to preserve the knowledge of what files are tracked by LFS. And of course, then we make our commit as we would normally do, and we push our changes to the server as we would normally do as well. So that's it. That's how you add files to Git LFS. We at Autodesk have been using Git LFS for more than a year now, very successfully, for pretty much all our development. Just to give you a little bit of perspective: who is Autodesk? Autodesk is best known for AutoCAD. AutoCAD is 2D and 3D computer-aided design software.
We have been in business for more than 35 years, and we have more than 4,000 engineers working on hundreds of products that consist of terabytes of code and asset data. We use Git LFS mostly for integration test data. In our case, these are mostly 3D models. We also use it for auxiliary data, that is documentation, images, videos and that sort of thing. Some of our teams even use it for build artifacts, like compiled libraries and things like that. We don't recommend that, because we have a special solution for these kinds of build artifacts. It's called Artifactory, and it is the central place where all compiled binaries are stored within our company. So we don't recommend that our engineers use Git LFS for build artifacts. As I said, we have been using Git LFS for more than a year, and in the remainder of this talk, I want to share what we've learned so far. I want to share that from the perspective of two different personas: first, from the perspective of the developer, and second, from the perspective of the administrator managing the repository. So let's start with the developer. By developer, I also mean designer, tester, pretty much anyone who is interacting with a Git repository. When I introduce Git LFS to a team, pretty much the first question that I get is: what is too large for Git? What files actually need to go into Git LFS? To answer that, I want to explain a little bit further what files should go to LFS. In general, files that do not compress well should go to LFS. Why is that the case? Well, Git is made for text, for source code. So if you have a large text file, let's say a 10 megabyte XML file, that might not be a problem for Git, because Git compresses all content and a large XML file can usually be compressed very well. So if you add it to your repository, your repository probably won't increase in size that much.
However, if you add a 10 megabyte video file to your repository, then Git can't compress that video file any further, because video files are usually pretty highly compressed already. That means if you add a 10 megabyte video file to your repository, your repository will grow by 10 megabytes right away. You might say, OK, 10 megabytes, that's not that big of a thing, so that's not a problem. That's true. But if you change this 10 megabyte file every day, then you add 10 megabytes every day to your repository. After a couple of weeks, you will probably have a problem already, because your Git repository has grown to an unmanageable size. So the bottom line here is: files that do not compress well and change frequently should go to LFS. And in order to simplify that for engineers, I also tell them not to worry about files that are smaller than 500 kilobytes, because they are usually fine. All right, so now you know what files you should put into LFS. Now let's talk about how you track them. I showed you earlier that people usually use this *.extension pattern to track files in Git LFS. That usually works pretty well, but there are cases where it causes problems. Let's say you have a couple of big screenshots that you want to add to your repository, maybe for the help pages of your application. You see these big PNG files and you think, OK, I'll track them with LFS like that. And after you've done that, you realize your repository got a little bit slower. So what happened? Well, it turns out you added your very large screenshots to LFS, but you also added the 10,000 small icons that your application has to LFS. And this can be a problem, because as I showed you earlier, LFS adds a kind of indirection to your Git process. You have this pointer file in between, right? And this indirection requires more computation, so it slows things down a little bit.
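The rule of thumb above, that files which do not compress well belong in LFS, can be checked before adding a file. The sketch below is only an approximation: it uses gzip, which applies the same DEFLATE algorithm as the zlib compression Git uses for its objects, and it works on two synthetic 1 MiB files (data.xml and video.bin are made-up names) rather than real project data.

```shell
# Sketch: compare how well a text-like file and a video-like file compress.
work=$(mktemp -d)
cd "$work"

# Highly repetitive markup as a stand-in for a large XML file.
yes '<item id="1"><name>bolt</name></item>' | head -c 1048576 > data.xml

# Random bytes as a stand-in for an already compressed video.
head -c 1048576 /dev/urandom > video.bin

xml_gz=$(gzip -c data.xml | wc -c | tr -d ' ')
video_gz=$(gzip -c video.bin | wc -c | tr -d ' ')

echo "data.xml:  1048576 -> ${xml_gz} bytes compressed"
echo "video.bin: 1048576 -> ${video_gz} bytes compressed"
```

A file whose compressed size is close to its original size, like video.bin here, adds its full weight to the repository with every new version and is a candidate for LFS.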
You would notice that slowdown if you have well over 10,000 or 15,000 files in LFS. This recently got much, much faster. With Git 2.11, I contributed a new way for Git to talk to LFS, and that made the communication around 80 times faster. So usually you shouldn't run into these problems anymore today, but keep an eye on it, because if you really have a large number of icon files in that situation, then that could be a problem that you need to watch out for. But what could you do if you run into this kind of situation? Well, one idea is to use smart tracking patterns. You don't have to track the extension of a file. You could add a custom marker like .lfs. to your file names and then track this .lfs. pattern like this. You could also just create a directory and track all the content within that directory. And of course, you could also track the files by their name directly. I generally don't recommend that, because if you rename the file or move it to another location, then you no longer track the file with Git LFS and the content actually bleeds into your Git repository. That's something you don't want. OK, one more thing to know when you track Git LFS files is that the tracking is usually case-sensitive. That's because in Git, usually everything is case-sensitive, because it comes from Linux, right? The thing is, on Windows and macOS you wouldn't notice that, because Windows and macOS by default have file systems that are case-insensitive. So if you track the lower case png files on Windows and macOS, it also tracks the upper case PNG files, but on Linux it doesn't, OK? This can be a problem for cross-platform teams that work on cross-platform software. So what can you do about that? You could use glob patterns to tell Git LFS to track the extensions in all kinds of variations like this.
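The tracking patterns described above end up as lines in .gitattributes, so they can be tested with plain Git. The sketch below writes the lines that git lfs track would append for three patterns and asks git check-attr which paths they match. The paths and the assets directory are invented for illustration, the files do not need to exist, and core.ignorecase is set to false explicitly so the result matches a case-sensitive Linux checkout.

```shell
# Sketch: check which paths a set of LFS tracking patterns would match.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.ignorecase false   # behave like a case-sensitive file system

# Lines of the form that `git lfs track "<pattern>"` adds to .gitattributes:
# a case-insensitive extension glob, a ".lfs." marker, and a whole directory.
cat > .gitattributes <<'EOF'
*.[pP][nN][gG] filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
assets/** filter=lfs diff=lfs merge=lfs -text
EOF

# "filter: lfs" means the path would be handled by Git LFS.
git check-attr filter -- \
    help/screenshot.PNG \
    icons/save.gif \
    model.lfs.obj \
    assets/scene/car.fbx
```

Running git check-attr like this before committing .gitattributes is a cheap way to confirm that a pattern catches the large files and nothing else.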
And when you use that glob pattern, then you track the upper case PNG files correctly on Windows, Mac, and Linux. If you are unsure what files are tracked, then you can use the git lfs ls-files command to check what is being tracked. All right, now a few gotchas that our developers have run into. The first one is that when you have a lot of files in Git LFS, your git clone operation gets slower. The reason is that right now, git clone can download LFS files only in a sequential manner. So if you have 10,000 files in LFS, git clone goes and downloads each and every file one by one, and that takes a long time. That's why I teamed up with the GitHub folks and we came up with this git lfs clone command, which speeds up these clones dramatically by processing the files in parallel. Unfortunately, this wrapper command is still required. Right now I'm working on a patch for Git core to make this wrapper command obsolete, so hopefully this gotcha and this tip will go away soon. The next gotcha is that you as a developer should set up your Git credential helper. When you use Git LFS, you actually make at least two calls to the server: first to the Git server, and then during checkout to the Git LFS server. If you haven't set up the credential helper, then you need to enter your password multiple times, which can be annoying. So set up the credential helper in Git to get around that problem. Third, as I said earlier, Git LFS should be used for files that do not compress well, and these are usually binary files. If you use Git LFS for non-binary files, for text files, you run into the problem that Git LFS does not perform line ending conversions on files. So what does that mean? Let's say you add a text file to Git LFS that looks on macOS like this.
And if you check out this file on Windows, then the line endings will be broken, because Windows uses a different line ending format than macOS and Linux. All right, the next gotcha: if one of the engineers on your team does not set up Git LFS correctly, then you might end up with these kinds of messages. They basically mean that, according to the .gitattributes file, Git knows that certain files are managed by LFS, but Git LFS can't find the pointer file. Instead it finds the actual content of the file. So the content is in the Git repository instead of on the LFS server. In order to fix that, just add the file again with a properly installed Git LFS, and then that problem should go away. OK, now let's look at our learnings from the perspective of the administrator of a repository. Our first learning is: set up Git LFS properly on all dev machines. That's super important, because if you don't have LFS installed on a machine, then your engineers will only see these pointer files, which of course don't make sense, and then the engineers will be confused. And also, as I just showed two slides ago, if LFS is not installed, then the engineers might add large files to the Git repository instead of to Git LFS. In order to make this setup easier for a large company like ours, we came up with a tool that we call Enterprise Config for Git. It is a tool on top of the Git config mechanism that checks your Git installation and makes sure that Git LFS is installed and that it's installed in the right version. If it's not installed, it installs it and configures it properly. Next topic: versioning. Git and Git LFS are both very active projects. They are constantly in development, and every new version brings speed improvements and bug fixes, so you should really distribute the most recent versions to your developers to make sure that they have the best experience.
The next step for the administrator is to configure a file size limit on your Git server. Sometimes, especially in the beginning, the engineers are not yet familiar with Git LFS and might accidentally add large files to their Git repository. You can reject those pushes on the server, so that the engineer is made aware of the mistake and gets the chance to fix the problem and add the file properly to Git LFS. So in summary, what are the takeaways here? Well, first of all, Git LFS solves a real problem. It works at scale and it makes large Git repositories work. Use smart tracking patterns to track exactly the right files with Git LFS. Speed up your clones with the git lfs clone command, at least right now. I hope this will go away soon with my upcoming patch. And last but not least, reject large files on your Git server. So that's it. Do we have any questions? What if you've already done a bad thing and you have a repository that has lots of large files in it? Is there an automated way to recover from your mistake and migrate to Git LFS? Yes, there's a Java application called Git LFS Migrate that basically fixes that. You tell this application what kind of files you want to have in LFS, and then it processes your entire repository, puts them into LFS, and puts the tiny pointer files into your repository. There's one catch though. When you do this kind of thing, you rewrite the history of your repository, so every commit hash changes. So don't do this kind of thing lightly. You really need to communicate that to all your engineers, and it needs to be a planned operation, I would say. Do we have more time for questions? We are at your 25 minutes. As he's setting up that you could unplug, we can get Christian set up. There's the Twitter handle. If you have questions, just ask me there or just approach me here. Thank you.
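How the file size limit from the administrator section is configured depends on the server product, and the talk does not name one. On a plain Git server, one possible approach is a pre-receive hook. The sketch below shows only the core of such a check: a helper that lists blobs above a limit in a range of commits. The function name large_blobs and the limit (taken from the 500 kilobyte rule of thumb earlier in the talk) are illustrative choices, not part of Git or Git LFS, and file names containing spaces are cut off at the first space in this simple version.

```shell
# Sketch: find blobs above a size limit in a range of commits.
limit_bytes=512000   # hypothetical limit, roughly 500 KB

large_blobs() {
    # $1 is a revision range, e.g. "$old..$new" inside a pre-receive hook.
    git rev-list --objects "$1" |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v limit="$limit_bytes" '$1 == "blob" && $2 > limit { print $3, $2 }'
}

# Demonstration in a throwaway repository.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo 'hello' > readme.txt
head -c 1048576 /dev/urandom > video.bin
git add .
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'add files'

# Prints the path and size of every blob over the limit.
large_blobs HEAD

# In a real pre-receive hook, stdin carries "<old> <new> <ref>" lines.
# A rough outline (new branches, where <old> is all zeros, need extra care):
#   while read -r old new ref; do
#       if [ -n "$(large_blobs "$old..$new")" ]; then
#           echo "Large file rejected, please add it with Git LFS." >&2
#           exit 1
#       fi
#   done
```

Rejecting the push with a message that points to Git LFS gives the engineer the chance to fix the commit before the large file ever reaches the shared history.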