When I submitted this talk to RailsConf, it was in the track called "We Are Living in a Distributed World." But I was surprised to find that mine is the only talk in that track. There seem to be no other talks about scaling Rails applications and distributed systems. I think the reason might be that, as Rails developers, we follow best practices that make distributing and scaling our apps seem not that hard, not that problematic. But this one, this GitLab thing, is a bad boy, I would say. It really has some problems, and I'm mainly going to talk about how we fixed those problems. So thank you very much for coming to my talk. My name is Minqi Pan. I come from China, I work for Alibaba Group, and those are my GitHub account and my Twitter handle; you're welcome to follow me.

So what is GitLab? Well, let me say it secretly: it is just a GitHub clone, an open-source clone of GitHub, but nobody likes to say that. A better way to think of it is as a Git box that you can deploy on your own machine; it is installed on premises. Just a quick survey: how many of you use GitLab in your organization? Two of you, thanks. If you see GitLab as a black box, it exposes two ports: one is HTTP, the other is SSH. HTTP is used for two purposes: you can clone a repository via HTTP, and you can push content to a repository via HTTP. More importantly, as a Rails application, it provides rich user interaction through its web pages. The SSH port, on the other hand, only allows the Git operations, clone and push. On the back end, from a very simplistic point of view, it stores its content in Git, and that is what makes this thing a monster to scale; that part is very problematic. If you look closer, it also uses some other stores on the back end. One is MySQL; they also support PostgreSQL, because they use ActiveRecord, which abstracts the actual implementation of the DB, so it's swappable.
Another is Redis, used as a queue for delayed tasks and also as a cache. And the last is the file system: they use the file system to store the Git repositories. So that's the black box. If we open it up to see what's inside, it's basically structured like this; it's all open source, so you can download the source code and see for yourself. At the front there are two parts, Nginx and an OpenSSH server. The reason those components are inside GitLab is that GitLab ships an omnibus package you can install, and it depends on those two other packages. Nginx is for HTTP, and the OpenSSH server, as we mentioned, is for the SSH port that is opened. When requests come in, HTTP requests go to the second layer: Unicorn handles the ordinary Rails requests, but requests for Git, like clone and push, go to gitlab-workhorse, another service, written in Go to make it fast. If a request comes in over SSH, it goes to the third part of the second level, namely gitlab-shell. The third level is called by the second-level components: Rails is mainly responsible for operations on the web pages; Gitlab::Git is a wrapper around Rugged, and Rugged is a wrapper around libgit2 on the fourth level; and Sidekiq handles background tasks. On the lowest level sit Git and libgit2; GitLab uses both implementations of Git. libgit2, if you don't know it, is a rewrite of Git that is portable and embeddable and works as a library. They named it git2 because they see it as the second generation of Git, with "lib" as a prefix because it's a library.

This structure works really well for small teams, but the company I work for has 30,000 employees. This is from last year's fiscal-year report; they just published a new one days ago, the day before yesterday, and the stock price went up. It looks good; it's a public company.
So let's scale it. How do we do this? Well, we first considered the problem on the front end. When a request comes in, it's either HTTP or SSH. As Rails developers, we are most familiar with HTTP, and on the server GitLab runs as Unicorn instances, which is also something we're very familiar with. We just put Nginx in front of them, set an upstream in the configuration pointing to the Unicorn servers in the back, and we're done. But how to deal with SSH is a problem. So I started a project called SSH-to-HTTP; it's open source on my GitHub account. It basically eliminates all those SSH requests, because the way Git interacts with the server over HTTP and over SSH is very similar, and an SSH request can easily be delegated to a Git request over HTTP. And as we'll see on later slides, SSH is actually such a pain in the ass; there are more complications to it. I guess that is the reason why GitHub nowadays has made HTTP the default: when you go to a public repo on GitHub, the clone URL, as far as I remember, defaults to the HTTPS URL instead of the SSH one. There are complications in the architecture that make SSH access a little slower than HTTP.

But at Alibaba we did not actually use my approach; my approach was this slide, but we actually used this slide instead. What we did was not use Nginx as the front end. We use something called LVS, a feature of the Linux kernel; the specific part of it we're using is called IPVS, which expands to IP Virtual Server, and LVS stands for Linux Virtual Server. It is a layer-4 switching service: unlike Nginx, which operates on layer 7 of the TCP/IP stack, it does load balancing on the transport layer. So it supports all communication as long as it is TCP/IP, and the difference between HTTP and SSH is thereby eliminated.
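The Nginx-in-front-of-Unicorn setup described above might look roughly like this minimal sketch; the addresses, ports, and server name are placeholders, not GitLab's actual shipped configuration:

```nginx
# Sketch of the setup described above: Nginx load-balancing plain HTTP
# requests across several Unicorn workers (addresses are examples).
upstream gitlab_unicorn {
    server 10.0.0.11:8080;   # unicorn instance 1
    server 10.0.0.12:8080;   # unicorn instance 2
}

server {
    listen 80;
    server_name gitlab.example.com;

    location / {
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://gitlab_unicorn;
    }
}
```

This works out of the box for HTTP precisely because Nginx understands layer 7; SSH is what it cannot help with, which is the problem the next slides deal with.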
But that comes at a cost as well, because when you go down to layer 4, you lose the ability to do health checking based on the status codes of requests. On layer 7 you can see what the status codes of your HTTP requests are and mark a server as healthy or unhealthy; on layer 4 you cannot see those. You can only see packets; you can only see the data. And URL rewriting, you lose that ability as well, because that's a layer-7 feature too.

And like I said, SSH comes with complications, because the SSH protocol involves security mechanisms that check keys. If you have more than one machine on the back end, their host keys are not the same by default. So when you deploy the application, you first have to copy the host keys across the whole cluster to make them identical. Otherwise, when a client connects to more than one server, it will complain, saying "the SSH host key is different, this is a security vulnerability, you'd better check it out," and it will refuse to connect. Secondly, if you remember, you can add SSH keys from the client via the web pages, like on GitHub, and the same thing happens in GitLab. So when you add your SSH key to the server, it has to dispatch, or copy, that key across the entire cluster to make every machine accept it. Specifically, it appends a line to .ssh/authorized_keys, and it has to do that on every machine. You cannot do that via Sidekiq, because with Sidekiq only one machine in the cluster fetches the job and the others ignore it. You have to do it in a way that broadcasts the key across the whole cluster, and we did that via Redis Pub/Sub.

And then there's the back end. The real trouble begins with the fact that GitLab stores its repositories on the file system. I want to pause for a moment to remind you of the twelve-factor app.
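The broadcast step above can be sketched in Ruby. This is a hypothetical sketch, not GitLab's actual code: the authorized_keys line format is modeled on gitlab-shell's, and the channel name and paths are assumptions. The `redis` argument is any client from the redis gem; the subscriber would run as a small daemon on every node.

```ruby
# Build the authorized_keys line that every node must append. The format is
# modeled on gitlab-shell's (command= forces the connection through
# gitlab-shell); the path is an assumed install location.
def authorized_keys_line(key_id, public_key)
  opts = %w[no-port-forwarding no-X11-forwarding no-agent-forwarding no-pty]
  %(command="/opt/gitlab-shell/bin/gitlab-shell #{key_id}",#{opts.join(",")} #{public_key})
end

CHANNEL = "ssh-key-updates" # hypothetical channel name

# Publisher side: called once when a user uploads a key via the web UI.
def broadcast_key(redis, key_id, public_key)
  redis.publish(CHANNEL, "#{key_id}\t#{public_key}")
end

# Subscriber side: runs on EVERY node in the cluster, so the key lands in
# authorized_keys everywhere. (A Sidekiq job would be consumed by only one
# worker, which is exactly why Pub/Sub is used instead.)
def listen_and_append(redis, authorized_keys_path)
  redis.subscribe(CHANNEL) do |on|
    on.message do |_channel, payload|
      key_id, public_key = payload.split("\t", 2)
      File.open(authorized_keys_path, "a") do |f|
        f.puts authorized_keys_line(key_id, public_key)
      end
    end
  end
end
```

Because every subscriber receives every published message, one key upload fans out to the whole cluster, unlike a queue where exactly one consumer takes each job.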
The reason GitLab is such a bad boy, unlike other Rails applications, is that it violates the fourth rule of the twelve-factor app, a set of principles advocated by Heroku. The fourth rule says backing services should be treated as attached resources: a Twitter service, an Amazon service, a MySQL service should all be configured as a URL that can be easily attached and detached. But GitLab stores content on the file system. That is the source of all evil. The content it stores is, firstly, the Git repositories, and secondly, user-generated attachments and avatars. Now we are going to move them to the cloud to make it scale.

Actually, standing at this point, you have a lot of choices. The choice I am going to elaborate on might not be the best, so I want to analyze the options we had; when you run into a Rails application with a similar problem, you can evaluate these options as well. The first option is a feature of GitLab Enterprise Edition called GitLab Geo. That doesn't really solve the problem. The way GitLab Geo does things is to make full replicas of your GitLab instance across servers. It assumes that each machine in your cluster has enough file-system storage to hold the content of all your Git repositories, and it makes 100% copies across them. It's officially supported, but it didn't solve our problem at Alibaba, because the overall size of all our repositories is big; we don't want to store them on one single machine, and there isn't enough disk space to hold them. From a distributed-systems point of view, GitLab Geo is a one-master, n-slave, full-replication design. The CAP theorem says consistency, availability, and partition tolerance cannot all be achieved at the same time; you can only achieve two of them. GitLab Geo achieves A and P of those three, there is no disaster-recovery support, and there is absolutely no sharding, because it's fully replicated.
The other option we could use seems, at first, a perfect way to solve the problem. First of all, we eliminated SSH with that gem of mine, SSH-to-HTTP, so we can forget about the SSH problem and focus solely on HTTP. Then there is something we could seemingly take advantage of: every repository stored on GitLab can be routed by its namespace/repo-name, and that part appears in almost every URL of every request. When you view a repository's commit history on a page, the route contains that part, and when you clone or push, the URLs contain it too. So why not use it as a routing key, put some routing logic into Nginx, and make a sharded GitLab? By doing that, every request arriving at Nginx gets sharded. For example, if we have a cluster of size three, we can invent some hash algorithm that distributes the namespace/repo-name across the cluster, onto any one of those three machines.

Seemingly perfect, but can you spot the problems with this? One problem is that Sidekiq does not have sharding. Actually it does, but you have to dig into it to see how. Each of those three GitLab shards spawns Sidekiq tasks, which need to be consumed by the corresponding Sidekiq shard as well, so when you start the Sidekiq shards you have to start them with special queue names. That's one complication, and there are others. Changes have to be made at the application level too, because not every page on GitLab falls into a single shard. In the admin page, for instance, you can see a list of all the repos with their sizes. If that request goes down to only one shard, you will not get all of that information, because some repos reside in other shards. So major changes would have to be introduced at the application level as well.
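The routing key idea above can be sketched as a small Ruby function; the hash choice (CRC32) and the `.git`-suffix normalization are my assumptions for illustration, not the algorithm actually used:

```ruby
require "zlib"

SHARD_COUNT = 3  # example cluster size from the talk

# Hypothetical routing function: hash the "namespace/repo" part of the
# URL (present in web, clone, and push requests alike) onto one shard.
def shard_for(namespace, repo, shard_count = SHARD_COUNT)
  key = "#{namespace}/#{repo}".sub(/\.git\z/, "") # treat repo and repo.git alike
  Zlib.crc32(key) % shard_count
end

# Every request for the same repository deterministically lands on the
# same backend, e.g.:
#   shard_for("rails", "rails")      # some value in 0..2
#   shard_for("rails", "rails.git")  # the same value as above
```

In practice the equivalent logic would live in the Nginx layer so the request is proxied to the right shard before it ever reaches Rails.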
And you also need superuser authentication, because the delegated SSH requests are not designed to access all repos; the user-authentication layer in front of them is another piece of application-level logic that has to be changed. So this is actually not ideal; every way of solving this comes with a cost.

Let's then think about how to deal with the file-system storage itself. We have a lot of options. First, we could make GitLab a twelve-factor app by making the file system an attachable resource. There are vendors who provide such solutions, like hardware network-attached storage, usually called NAS, and there is software NAS as well; Google, for example, has GFS. Second, we could use remote procedure calls and shard only at the FS level instead of at the level of the entire GitLab application. And third, we might consider killing the file system: use something like Amazon S3 to replace the FS as the backend for Git storage.

We evaluated all those options. It turned out that NAS is not for Alibaba. Hardware NAS: well, Alibaba does not buy those things, because of its de-IOE policy. And software NAS: Alibaba does not have that yet; Google has GFS, but Alibaba has no equivalent. But I have to remind you that those two options might be good for your organization. If you want to scale GitLab, they are really good ways to solve the problem, because they introduce very little change at the application level; all the change is confined to the lower-level service that gets attached to GitLab. But I did not try them, and they surely come with costs as well, because software NAS tends to be very complicated. As far as I know, there is a good solution called CephFS, which became stable only about a month ago. And if something goes wrong at that layer, you need some very talented operations or DevOps engineers to solve the problem.
Also, by attaching a soft NAS you lose performance, because each I/O to the FS is now networked, and latency is added to every I/O. You are replacing something at a very low level, so the added cost is large. Those are two options you could dig into if you have the chance. And RPC? Well, that's a good solution. I looked at how GitHub solved their problem, and it seems they do RPCs: they dispatch access as RPC calls into Git shards instead of GitLab shards. It's sharding on a different level. It surely looks like a good solution.

What we did at Alibaba is the fourth option: we killed the FS and used the cloud. Which cloud? It's called Alibaba OSS. It's not that well known, but you can think of it as the same thing as Amazon S3: object storage in the cloud. How did we do it? The rest of this talk gets a little technical. It turns out GitLab has three ways to access Git repositories, namely libgit2, Git, and Grit. Grit is a very old gem, written in Ruby. We found it could be eliminated, which makes the whole problem easier, because it's only used in the wiki part of GitLab, through a gem called Gollum. Gollum was designed to have its Git-access part pluggable, so we unplugged Grit and plugged in Rugged, which uses libgit2. That leaves only Git and libgit2. We compared those two projects. Git is pretty old; it was started by Linus Torvalds, and it did not consider the problem of plugging and unplugging backends, so its backend is hard to replace: all of the code is written to access content from the file system. But libgit2 is very modern. I don't know how its creators thought about the problem, but they designed the backend to be replaceable. You can write your own backends. So the basic idea is: we write our own backends.
We write backends that store the content on cloud storage instead. Also, with Grit eliminated, we still have to implement Git on top of libgit2, because Git cannot easily replace its backend storage, but libgit2 can. So, a cloud-based backend: what does that backend look like? Well, that involves some details about Git. Git has two parts for storing its content: one is called the ODB, and the other is the refdb. The ODB holds the chunks of data you put inside the repository, and the refdb holds the branches and tags. For the ODB there are also two kinds of storage. The first is loose storage. Git is fundamentally a content-addressable file system, the address being the SHA-1 value of the object you are trying to fetch, and loose storage stores each object under its SHA-1 value. Let me open an example Git repository: if you go into the .git directory and run tree, you can see files like these; those are the loose-stored files. And there are also packed files; those over there are the packed files. That's what I mean.

So we wrote a cloud-based backend to store both types of files. For the loose files the basic idea is pretty straightforward: when you read, you make an HTTP request to fetch the object from the cloud. Oh, I forgot to explain the refdb. It's very similar to the loose files; you can see it under the refs directory. All of your branches are inside it, like refs/heads/master, and master contains a SHA-1 value. So it's basically a key-value store, and that translates to HTTP requests pretty directly: each refdb read becomes an HTTP GET, each refdb write becomes an HTTP PUT, each loose ODB write becomes an HTTP PUT, and each loose ODB read becomes an HTTP GET. So that's the simple part. The complicated part is the packed content.
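The loose-object and ref mappings above amount to a key layout plus a table of HTTP verbs. Here is a minimal sketch; the key scheme (mirroring Git's own on-disk layout) is my assumption, not the exact scheme used in production:

```ruby
# Hypothetical key layout for the cloud backend: mirror Git's on-disk
# layout so each loose object and each ref becomes one object-store key.
def loose_object_key(repo_path, sha1)
  "#{repo_path}/objects/#{sha1[0, 2]}/#{sha1[2..]}" # e.g. objects/ab/cdef...
end

def ref_key(repo_path, ref_name)
  "#{repo_path}/#{ref_name}"                        # e.g. refs/heads/master
end

# Backend callbacks map onto HTTP verbs against the object store
# (OSS here, but S3 would be identical in shape):
#   refdb read   -> GET  ref_key(...)
#   refdb write  -> PUT  ref_key(...)
#   ODB read     -> GET  loose_object_key(...)
#   ODB write    -> PUT  loose_object_key(...)
```

Because both stores are key-value at heart, each backend callback is a single HTTP round trip; no server-side logic is needed beyond plain object storage.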
Because if you store only loose content, it will be as slow as SVN. The very reason Git is so fast is that it has a very good design for packs. Pack files are used both as the way to transfer content between server and client and as the way to store the content of your repository on disk: it is both a transfer file format and a storage file format. Writing those packs is easy for us; we just translate them into HTTP PUT requests. But reading them is complicated. You see, every pack comes with an index file, and that index file tells you, if you are looking for some object in the pack, where to start. So each read is translated into a number of ranged HTTP requests: first we read the .idx file to find the range to read in the pack, and then we read only that small portion of the pack file from the object store, using the Range header.

As an example, if Git needs to read an object with a given SHA-1, it binary-searches the index file and gets an offset into the pack file. At that offset in the pack file, it sees whether the stored content is a delta or not. If it is a delta, it has to keep looking for the base of that delta, and the chain continues and continues until it finds the root; by applying all the deltas to the base, you get the object you are reading. And here's a real-world example where the chain is quite long: you have to keep jumping around inside the pack file to actually get the thing you want to read, because each time you read, it's only a delta. That is a real problem for us, because if the I/O pattern inside the pack file is not good enough, you end up making a lot of ranged HTTP requests, which makes the whole thing awfully slow. But the good news is that the inventors of Git built some very good heuristic algorithms into pack-file generation, so those I/O patterns are not that bad.
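The index-lookup-plus-ranged-read logic above can be sketched as follows. This assumes the pack's .idx file has already been parsed into an array of [sha1, offset] pairs sorted by SHA-1; the over-fetch size is an illustrative tuning knob, not the production figure:

```ruby
OVERFETCH = 64 * 1024 # extra bytes per request; a tuning knob, not the
                      # exact figure used in production

# Binary-search the (SHA-sorted) index entries for an object's pack offset.
def pack_offset(index_entries, sha1)
  entry = index_entries.bsearch { |(s, _)| sha1 <=> s }
  entry && entry[1]
end

# Compute the byte range for the HTTP Range header: from this object's
# offset to the next object's offset, plus an over-fetch margin so the
# delta chain's bases are likely covered by the same request.
def byte_range_for(index_entries, sha1, pack_size)
  start = pack_offset(index_entries, sha1)
  return nil unless start
  offsets = index_entries.map { |(_, off)| off }.sort
  next_off = offsets.find { |o| o > start } || pack_size
  finish = [next_off + OVERFETCH, pack_size].min
  start..(finish - 1)   # e.g. "Range: bytes=#{start}-#{finish - 1}"
end
```

Without the over-fetch margin, every delta in the chain would cost one more ranged request; with it, one larger request often covers the whole chain, which is the optimization described next.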
So when we make a range request, we can actually make the range bigger than we need, fetching more content per request, enough to follow the chain all the way to the root of the object. Thanks to this characteristic, we eliminate many HTTP requests and keep the whole solution from being too slow. That's one part of it.

The other part, as I said, is that you have to make Git talk to libgit2, because Git does not have a replaceable backend. It turned out this is pretty easy. The authors of Git are pretty smart folks, and they wrote Git in a very particular way: all the commands call each other. In git fetch and git clone, for example, the first thing called on the server side is git upload-pack, and git upload-pack then calls another command, git pack-objects. The commands that deal with the transmission protocol we do not touch; that part is complicated. We only touch the parts that do I/O against the disk. So we only need to replace git pack-objects, and in the git push scenario we only need to re-implement git unpack-objects. Implementing those on top of libgit2 is very easy; it's no big task. There are also two scenarios in git push: small pushes get unpacked right away and written to loose storage, while big pushes are not unpacked, because unpacking takes time; Git just builds an index for the pack and writes the pack file directly. For that case we needed to re-implement git index-pack, which also turned out to be a pretty easy task.

All right. After all those changes, let's see what the performance looks like. It's definitely going to be slower, because we're still replacing fast file-system I/O with slow HTTP I/O. The test fixture we use is a repository called gitlab-ce; it has more than 200,000 objects.
When packed, it weighs more than 100 megabytes. Git push shows about the same performance: on the file system we write the pack directly to the FS, and on the cloud we write it directly via HTTP; not too many new operations are created, so only a small amount of time is added to each. For git push there are, like I said, two scenarios: when you push large content, only the pack is stored, and that's the large-content scenario; when you push a little content, it gets unpacked and stored loosely, and that's the delta case. Also not too much time added there. Git clone, however, is actually 100% slower, because when you clone, the ranged operations happen, and that's what makes it slow. And git fetch gets far slower still, because it is a delta fetch; this usually happens when you do git pull after your coworkers have updated the repo, and it also has to go through the whole ranged-request process I mentioned. So it's really slower. The good news is it's not that slow: users have to wait longer, but not so long that they can't wait.

On the web pages, things also got much slower. All of the Rails operations were affected, because we are operating at a deeper level: Rails calls Rugged, Rugged calls libgit2, libgit2 is slow, so Rails is slow. On this page, for instance, we're listing files, and the show action now takes five seconds to run. Let me note that all of these benchmarks are without cache, so the real-world scenario is better, because we have caching. And this is another Rails operation: before the change it took 50 milliseconds, and after, about five seconds. That's why we had to add a lot of caching, on multiple layers, including those Rails layers. I'm not going to elaborate on all the caches we added, but one aspect of it is interesting.
libgit2 was designed so that it can have more than one ODB backend, and you can even set priorities on them. So we basically made a hamburger structure of backends: on top of the cloud backend we added a cache backend. The servers we deploy to still have a file system, and we use it as an on-disk cache. If we read some content once, we store it on the file system, so the next request that hits it can just read from the file system instead of making a remote HTTP call. The good news is that Git's ODB never changes: you can only put data into it; you can never modify data. So we are free of the cache-expiry problem. The refdb reads could be cached as well, but that is much more complicated and might not be worth the effort; I might remove it in the future, because you have to expire that cache. Refdbs get updated all the time: when you push a new commit to master, say, refs/heads/master gets updated, and you have to expire the cache. I'm not going to go into the details of when that cache gets invalidated.

Lastly, I want to say something about future work. Right now this idea seems to work more or less acceptably, and if you guys like it, I will try to do an AWS S3 version, because it currently works against OSS, which is not so widely used, and there is some need for this. The reason there may be some need is that GitLab cannot be deployed to Heroku at this moment; if we make this backend work with AWS S3, GitLab users would have a chance to deploy it to Heroku. Also, GitLab still makes many direct calls to the git binary; for the commit-history page of a repository, for example, it actually spawns another git process to fetch the result. So we could eliminate some of those direct calls to Git. And if we develop the backend for AWS S3, we could add a setting for the user to choose which backend to use.
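The read-through, never-expiring ODB cache described above can be sketched like this; class and method names are illustrative, and `remote` stands in for whatever object performs the HTTP reads against the object store:

```ruby
require "fileutils"

# Sketch of the on-disk cache layer: Git's ODB is append-only (objects are
# keyed by their SHA-1 and never modified), so a cache hit can be served
# forever with no expiry logic at all.
class CachedODB
  def initialize(remote, cache_dir)
    @remote = remote        # anything responding to #read(sha1), e.g. HTTP
    @cache_dir = cache_dir
    FileUtils.mkdir_p(cache_dir)
  end

  def read(sha1)
    path = File.join(@cache_dir, sha1)
    return File.binread(path) if File.exist?(path) # on-disk cache hit
    data = @remote.read(sha1)                      # remote HTTP fetch
    File.binwrite(path, data)                      # populate the cache
    data
  end
end
```

Refs, by contrast, are mutable (refs/heads/master moves with every push), which is exactly why a refdb cache needs invalidation and may not be worth keeping.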
It could be either the file system or AWS S3. That would be perfect. For Gollum, we could do some work to make it use Rugged by default. And in libgit2 itself, we found it slower than Git in many scenarios, so we could improve its performance in the future. I will be actively working on these things on my GitHub account, so if you're interested, you can look at my account and see how it goes. Thank you very much.