And what do we use it for? Our goal is building the first artificial intelligence that understands code, and the first step towards that is analyzing all the open source code online. We are currently analyzing every repository on GitHub that is not a fork, and we are now creating a new pipeline to analyze more than 60 million repositories: all the GitHub forks, but also Git repositories hosted everywhere else. This is a task that requires intensive use of Git.

So our first problem was choosing a Git implementation. The usual choices are the Git command line interface; libgit2, which is a library with bindings for many languages; and JGit, which is also a very nice library, in Java, by Google. But the language of choice at source{d} is Go, so none of these fits our stack nicely as they are.

When we started, our choice was git2go, the Go bindings for libgit2, which are really complete. But it requires cgo, which is not that nice for us, and it's not possible to extend it in the way we need from Go. So that was discarded. There are also some nice command line interface wrappers. They are really fast because, well, it's just plain Git underneath, and the official Git implementation is awesome, it's really fast. But they are very inflexible: we cannot extend the underlying Git behavior. And there were a lot of Go implementations, but most of them were abandoned at a very early stage, because I guess a lot of people start a Git implementation and then figure out it's actually a much bigger job than they thought.

So our choice was doing it from scratch. That's how go-git was born. We wrote a full implementation in Go with an idiomatic API for both the high level and the low level; in Git terminology those are known as porcelain and plumbing commands. And we have a focus on extensibility; we'll talk more about that later. The aim is to be feature-compatible with libgit2 and JGit. We are still not there, but we are working towards it. We have already put two years of development into it, so we are on track.

So why do we focus on extensibility? One of the first things we needed to extend is storage backends. Storage backends in go-git define how Git objects, references, configuration and so on are stored. We provide two implementations with go-git itself. One is an in-memory storage backend that you can use for testing, or for processing repositories really fast if they fit in memory. The other is a filesystem implementation that uses a virtual filesystem abstraction; that abstraction is open source too, you can check it out, it's called go-billy. But you are not limited to those. In fact, we have implementations that use databases such as Cassandra as a backend, and we have storage backend wrappers that add things like caching on top of other backends. So there are a lot of possibilities there.

Another interesting extension point is worktrees as virtual filesystems. You are not limited to checking out a commit onto disk: you can check out into memory, into a database, or anywhere else you can implement the virtual filesystem interface for. And we can also plug transports. We implement all the standard Git transports, SSH, HTTPS and so on, but you can plug in your own too; we actually use that for performance reasons in some cases.

So let's go through our use case. The first building block for analyzing all the repositories online is having a local copy of every repository in our cluster, so we can process them fast and on demand.
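To give you a feel for the API before we go on, this is roughly what cloning a repository into the in-memory backend looks like. It is a minimal sketch against the go-git v4 API, and the URL is just an example:

```go
package main

import (
	"fmt"

	"gopkg.in/src-d/go-billy.v4/memfs"
	git "gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/storage/memory"
)

func main() {
	// Both the storage backend (objects, references, config) and the
	// worktree are pluggable; here neither of them touches disk.
	repo, err := git.Clone(memory.NewStorage(), memfs.New(), &git.CloneOptions{
		URL: "https://github.com/src-d/go-git",
	})
	if err != nil {
		panic(err)
	}

	head, err := repo.Head()
	if err != nil {
		panic(err)
	}
	fmt.Println("HEAD is at", head.Hash())
}
```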
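And since transports are pluggable too, replacing the client used for a protocol takes only a couple of lines. Again a sketch, assuming go-git v4's transport registry:

```go
package main

import (
	"net/http"
	"time"

	"gopkg.in/src-d/go-git.v4/plumbing/transport/client"
	githttp "gopkg.in/src-d/go-git.v4/plumbing/transport/http"
)

func main() {
	// Install an HTTPS transport backed by a tuned HTTP client; every
	// later clone or fetch over https:// will go through it.
	customClient := &http.Client{Timeout: 30 * time.Second}
	client.InstallProtocol("https", githttp.NewClient(customClient))
}
```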
So we are fetching repositories from GitHub, GitLab, Bitbucket, and self-hosted repositories that we find all over the web. We are trying to be very exhaustive here. But this leads to the first problem, which is fork redundancy. The vast majority of repositories that you find online are forks. In fact, on GitHub there are something like 70 million repositories, and around 60 million of those are forks. If we store the master branch of every GitHub repository that is not a fork, we get around 72 terabytes. If we add the forks, that becomes 240 terabytes. And that's only the master branch, so this is a lower bound.

But most of that data is just redundant. Most forks are just a copy of the original repository plus a couple of commits. So if blue represents forks in this diagram, we don't really want to store all of those copies that are 90% redundant data; they just take too much space in our cluster for nothing.

So we solve it by treating forks just as a namespaced set of references of the original repository. For us, a fork is the same as the original repository; it just has some branches pointing to a different part of the history. So we store all the forks of the same repository together, in the same local repository in our storage. This deduplicates all the objects: if a commit or a file is present in multiple forks of the same repository, that commit or that file is not stored twice. We prefix each reference or branch with the ID of the repository it was fetched from, so we can still distinguish between them.

Let's see an example. Let's say that this is the history of the go-git repository. The real one is actually longer, but here we have three commits. And we have a couple of forks, which share most of their history with the original repository and just add a commit each, probably for making a pull request. If we store them together, it looks like this: a single repository, three branches, and we prefix each branch with the original repository name so we can still distinguish them.

But there's still a problem: how do we tell that the repository we fetched is a fork of another one? We could rely on the name, but there are forks that have different names than their original repositories, and there are repositories that have the same name and are completely independent. For example, there are five repositories called go-git, and they are completely unrelated implementations. We could also rely on the GitHub API, but that works only on GitHub, of course, and even on GitHub it only works if you created the fork using the GitHub web interface. So that's not going to fly for us.

What we do is take advantage of the fact that a fork starts with the same initial commits as the original repository. It might have a longer history, it might have more branches, but if you go to the first commit that was ever made to the repository, the hash of that commit is the same for all forks. So in our local storage we don't key repositories by name; we create one repository for each hash of an initial commit that we have seen. Then, when we fetch a repository, we push each branch to the local repository that corresponds to the hash of its initial commit. This works nicely across all Git providers, because it relies only on Git itself and not on any heuristics or external APIs.

So this is how our local storage actually looks. We don't store forks; we just store their incremental parts, and we merge together everything we detect as forks of the same repository.
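To make that concrete, here is a rough sketch of both ideas with go-git: finding the initial commit of a branch, and fetching a fork's branches into the shared local repository under a prefixed namespace. This is illustrative code against the go-git v4 API, not our actual pipeline; the "first parent only" rule is a detail that comes up again in the Q&A.

```go
package forks

import (
	git "gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/config"
	"gopkg.in/src-d/go-git.v4/plumbing/object"
)

// initialCommit walks from a branch tip to the commit that started the
// repository, following only the first parent at every merge.
func initialCommit(c *object.Commit) (*object.Commit, error) {
	for c.NumParents() > 0 {
		parent, err := c.Parent(0) // first parent only
		if err != nil {
			return nil, err
		}
		c = parent
	}
	return c, nil
}

// fetchFork fetches every branch of a fork into the shared local
// repository that all forks with the same initial commit map to,
// prefixing each reference with the origin it was fetched from, e.g.
// refs/heads/github.com/someuser/go-git/master.
func fetchFork(shared *git.Repository, forkURL, prefix string) error {
	remote, err := shared.CreateRemote(&config.RemoteConfig{
		Name: prefix,
		URLs: []string{forkURL},
	})
	if err != nil {
		return err
	}
	spec := config.RefSpec("+refs/heads/*:refs/heads/" + prefix + "/*")
	err = remote.Fetch(&git.FetchOptions{RefSpecs: []config.RefSpec{spec}})
	if err == git.NoErrAlreadyUpToDate {
		return nil // nothing new in this fork
	}
	return err
}
```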
So in the end we store just one repository containing the information of all the forks. We reduce the required storage to the minimum, but still have access to all the information.

The next problem in building this mirror and using it for analysis is the read/write pattern: we need storage that is oriented to batch repository analysis. When we open a repository, we open it to analyze its whole history; in our use case it's unlikely that we want to open a repository just to add one commit. So we use distributed file systems for this; we have used HDFS and Google Cloud Storage. The good thing about these file systems is that they have really high throughput. The counterpart is that they have high latency.

And a Git repository is composed of many files. If you look into the .git directory of a repository, you will see pack files, index files, reference files, a config file. When we want to analyze a repository, we need all of that, and the latency of accessing all of these files adds up. At some point, for some repositories, we spend more time just waiting for input operations to start than on the actual data transfer. And this gets worse every time we update the repositories, because every time we pull, we get more files. So this is a performance problem that just gets worse over time.

How do we solve it? We created an archive format that is efficient for this case. It's called siva. We archive each repository as a single file, and we have a custom storage backend, plugged into go-git, that operates on archived repositories transparently, without extracting them, and performs updates to them in the way that is most efficient for both siva and HDFS or Google Cloud Storage. We baked all the tricky details into this storage backend, and now go-git works transparently with this kind of repository; I'll show a small sketch of how a backend plugs in in a moment. You can read more about how the format is designed in our blog, and it's also open source, so you can use it for other stuff.

So what I told you up to now is one of our main use cases. I would also like to highlight a couple of projects that use go-git and that might be interesting for a Go audience. One is gitql, which is a SQL interface to Git: you can run it on any Git repository and write SQL queries against the repository. Very soon it will be implemented as a Go database driver, so you will be able to connect your ORM to your Git repository. It's a pretty crazy thing to do, but you can.

We also have gostable. It's a self-hosted service, like the gopkg.in service. Who of you knows gopkg.in? Right, fewer people than I thought in the audience. With gopkg.in you can build URLs to a Git repository that point to specific branches, and use those in your Go imports, so your imports are stable instead of always depending on the master branch. gopkg.in is the most popular service for doing that, but with gostable you can run your own, on your own domain, with public and private repositories.

So, to recap: go-git makes working with Git from Go very simple and idiomatic. We have been using it for more than one year in production, with millions of repositories. So even if the feature set is not complete yet, there's already a very solid base there.
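Here is that sketch of why the siva backend can stay transparent. In go-git, storage is just an interface, so anything that implements it can be handed to the same entry points. A minimal example against the go-git v4 API, with the built-in in-memory storer standing in for a custom one:

```go
package main

import (
	git "gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/storage"
	"gopkg.in/src-d/go-git.v4/storage/memory"
)

func main() {
	// Any storage.Storer works here: the in-memory backend, the
	// filesystem one, or a custom backend that reads siva archives
	// straight off HDFS. The rest of go-git never sees the difference.
	var s storage.Storer = memory.NewStorage()

	// Init with a nil worktree creates a bare repository on top of
	// whatever backend was passed in; Open and Clone work the same way.
	if _, err := git.Init(s, nil); err != nil {
		panic(err)
	}
}
```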
On top of that, we have implemented a lot of advanced use cases in Go that would otherwise be quite hard to implement. For example, recently a new kind of Git storage was presented by Microsoft, and they do something that is similar to our storage backends, but built on the command line interface, so they had to implement a kernel driver with a new virtual file system, in a pretty complex way. In our case, the equivalent was just about 300 lines of Go.

So here you have the project URLs: the stable version import and the development version import. You can see what gostable does there; we have this fancy URL for our imports. Well, that's all. Thank you. I think we have quite a lot of time for questions. Yeah, of course, lightning talk.

Hi. How big is the project currently, maybe in number of lines of code or something? And how much would it grow in the future to get to the same capabilities as libgit2 or JGit?

Yeah, so the size, I would have to check. Do any of my colleagues know the size in lines of code? This is quite tricky because my screen is blacked out, so I have to type like this. But I don't know, maybe 30,000 lines of code? I'm not sure. So, up to now we have implemented everything that we need for repository analysis, which is not necessarily what most users want. We can clone, fetch, pull and push, and we have the client and the server side for those. We can traverse repositories, we can generate diffs, all that kind of stuff. So it's pretty complete on that side. One of the major missing parts is creating commits: if you actually want to create new commits programmatically, that's not there yet, but it should be in the next stable release. As for when we will be fully on par with JGit or libgit2, that's probably years away. So yeah, not anytime soon. But for the features that we do implement, we offer a lot of possibilities for extending the behavior that you don't get with those libraries. Does that answer the question?

Yeah. I'll give you another mic. Okay, thanks. Thanks for the talk. You said that you are identifying repositories by their first commit. How do you deal with boilerplate repositories that are used as a base for many projects?

Yeah, so there are a couple of cases. One easy one is boilerplate that you copy in as initialization, such as a .gitignore file. In that case the hash of the first commit is different anyway, because the commit includes your author name, the timestamp, and the commit message, so even if you start a repository with the same content as a different one, the hashes differ. The other case is when you actually clone the boilerplate, like WordPress or some static site generator. Currently we just store all of those as forks, which is effectively what they are as far as Git is concerned, and we don't differentiate among them: we see all these completely independent web pages done by different people as forks of the same project. We analyze all of them, but we interpret them as forks. So we haven't gotten into solving this yet, but we might try some things in the future, such as trying to detect a point in time where the code bases diverged a lot. I guess it would definitely be possible to do that; the only thing is that it's very computationally expensive. In our case, and only because of our use case, we are focusing on code, so we don't care that much about, for example, the usual case of statically generated pages.
Actually, they are bad for us, because we are trying to understand how people code, and they contain a lot of auto-generated code. We don't want to interpret auto-generated code as if it was written by you, because you didn't actually write it. But yeah, I guess we'll have to do something about that in the future.

So I have a question regarding your fork detection algorithm. Do you support detecting grafted Git graphs, when you graft two repositories into one?

Yeah, that's really tricky. I showed a simple example where there was only one initial commit, but if you take the Git repository itself, it has seven root commits. In our application we make a distinction: root commits are all the commits that have no parent, and then we made an arbitrary definition of the initial commit, which for our purposes is the root commit that you reach by going through the history following only the first parent. So when there is a merge, we only follow the first parent, and we always reach the same initial commit. Git has seven root commits, but only one of them is the initial commit: the actual first commit of Git. The others came from tools and things that were merged in over time, and we key all of that on the initial commit. In the case where you merge two different repositories, you can still have forks with a different initial commit even though they share part of the history; those are stored independently in our storage, which currently duplicates that information in some cases. We are now trying to figure out whether we can use Git alternates to share information even in those cases. That's still a work in progress.

Thank you. So my question is regarding the extensibility of this. You mentioned that it's easy to write new storage backends, but Go only supports static linking, right? So do you have to fork the project and make some kind of compile-time plugin or something like that?

I mean, go-git is a library; it's not meant for writing a command line client. You could write a command line client with go-git, which in fact we are kind of doing, but just as an example. If you write your own storage backend, you will be writing an application, and in that application you import go-git, you import your storage backend, and you pass it to go-git. So how does static linking affect that? Yeah, that's the use case. It would be much more tricky if you wanted to build the Git command line tool in a way where you can plug in a storage backend as a plugin at runtime; we don't try to do that at the moment. That would be a good contribution.

Cool. Any other questions? All right. So thank you.