Thank you. So I guess we can just jump right into it. This talk is about scaling version control systems, so if you're in here, you're probably in the right place. A little bit about me: my name is Ben Kero. I'm the primary sysadmin at Mozilla dealing with version control systems. I interact a lot with the release engineering team, and this is a really old picture of me.

So what's this talk about? It's pretty simple: it's about version control systems — about scaling them, and about some of the stories and examples from trying to do that. It primarily deals with Mercurial, since that's the main system we use at Mozilla. We use a lot of others, and I'm going to talk about more than Mercurial this session, but that's basically it. And as you could guess, there's probably some Heisenbug-related activity here. So there are headaches and Heisenbugs, and there's getting kicked off of GitHub for excessive usage, which I didn't know could happen, but apparently it can.

A little background. We'll start with some statistics. Like I said, we primarily use Mercurial, although Firefox OS uses GitHub a lot for development. We have a couple thousand repositories, many of which are unique, and likewise we have a lot of commits, though only a fraction of them are unique — about 1 million unique out of 32 million total, I guess. That leads to some more interesting problems, which I'll get to in a bit. We do about two terabytes of transfer a day from these systems, which works out to about a thousand clones — and it's actually a lot more than that if you consider all the smaller repositories we're serving. That's a thousand fresh clones a day, versus several thousand updates and refreshes and things like that.

We're our own biggest customer. The release engineering department is responsible for what would now be called continuous integration on all of these systems, and we test on over 12 platforms: several different versions of Windows, recent and legacy versions of OS X, different architectures on Linux, etc. We also use Git quite extensively, especially for mirroring a lot of our Firefox OS stuff, because GitHub doesn't necessarily have the best uptime. The other systems we host, but we don't use them extensively and haven't really had to deal with scaling them. Interestingly, we still have the same RCS repository backing the mail aliases file from when we were Netscape — that file just hasn't changed in decades. Bugzilla still uses Bazaar, though they're migrating over to Git, and our update system still uses CVS, although we're trying to kill a lot of these with fire.

So next up: this is basically as complicated as I'm going to detail for Mercurial. There are two SSH servers, and there are 10 hosts that mirror HTTP traffic. Unpictured is a load balancer that balances all the incoming traffic and strips off the SSL going to these hosts. Also unpictured is the NFS server that acts as the backend data store for the SSH servers. Recently we moved the web mirrors to local disk, and that's created a pretty big performance improvement and greater availability, since we're not just running all of these on identical hosts. So, you're saying, get to the story. The first story is about knowing what you're hosting, and the perils of what can happen if you don't.
So I was just sitting there, as you usually do, at your operations center, snooping on emails — I don't actually snoop on emails or delete user files — minding my own business, when this bug comes in. The bug says: I used to host this repo on GitHub, but they disabled it on Saturday and I need to host it somewhere. It's sort of big, but I made a bundle out of it, which is 1.7 gigs. Can you just host this for me?

And I was thinking: why would GitHub kick something off that's 1.7 gigs? I've never personally tried to host something that big there, but maybe they're just trying to crack down on resource usage. This should have been my first warning. So I go look at the page — I hadn't seen this before, but it's still up there — and it basically says: you used too many resources, and we aren't going to host your repository anymore. The funny thing is, I talked to the developer afterwards, and he can't actually delete this, so it's a cone of shame he has to wear on his GitHub account for the rest of its existence.

So I'm thinking maybe I'll just defer it for a while, and then he responds to the bug, and I think he CCs the person who said to get it onto Mozilla.org. So fine, I guess I have to do it; I'm being called out. Maybe I'll put it off until next week. But no — I should mention that this repo was being used to maintain our Git mirror, so this was pretty important. Now, this was before (and quite possibly the reason why) I understood the difference between importance and urgency. So, conflating the two, I did it right away. I looked at the host and said: wow, this really isn't loaded at all. There's a load average of 0.5, there are a couple of processes, and there are 24 cores here with 60 gigs of RAM. There should be no problem hosting yet another repository that just one user is going to use. So I said: all right, here's your repository, here are some details, you can upload your code, let me know if you have any issues. I dusted my hands off and went away thinking this was done — I'd never hear from him again, another happy customer — until this happened.

And this happened in the very, very early morning of April 22nd, which I think was a Saturday, but this guy is kind of crazy, and he's European and working on a European time schedule, so it happened while everybody else was sleeping, and we didn't really know what was going on. Nagios started reporting critical levels of swap used, and there were availability problems with the host; it was basically bringing the whole thing down. This would happen, and then maybe 45 minutes later it would just finish whatever it was doing and go back to being a normal host. So I SSH in and du the repository, and I see that it is not the 1.7 gigs promised, but 208 gigs used. I do a little more looking — I do a git log and see this kind of thing — and what that tells me is that this is an automated repository: he's mirroring a lot of the Mercurial commits into it. This was our first attempt at a multi-master setup between Mercurial and Git. So he's taking a lot of these huge Mercurial changesets and moving them over to Git, and the problem is that these pushes had objects that were about 122 megs apiece, and it was doing multiple at a time. That's fine as long as they're not too frequent, but they are kind of frequent — they're happening every 30 seconds, or sometimes every minute. And so you can kind of see what's going on here.
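As an aside, if you ever need to check what a Git repository is actually holding, a couple of stock Git commands will tell you. A minimal sketch — the repository path here is hypothetical:

```sh
# How big is it really? du reports the on-disk size, not what was promised.
du -sh /repo/gecko-mirror.git

# Counts and sizes of loose vs. packed objects; a huge pack size plus a
# steady stream of large loose objects is the pattern that ate this host.
cd /repo/gecko-mirror.git && git count-objects -v
```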
So this is basically what it was doing to the server. I don't have an htop screenshot to show you, but it was basically like this — the meme can be in the notes. The funny thing is, the host wasn't very responsive — you couldn't actually get an interactive shell — but you could SSH in and issue a single command, and that still had priority to run, which made for some really weird debugging processes. So I disabled the repo, set it read-only, and started hunting for options I could use to fix this.

Now, coming from a past as a Gentoo user, I'm used to reading nice big man pages looking for tuning options. Still, this one was particularly dense: the git-config man page by itself is 2,601 lines, and I didn't really know what I was looking for. I knew it had to do something with garbage collection or pack files, but I didn't really know what. Additionally, this was happening on a live system, so if I mistuned it, I could send the box even deeper into trouble. And this was when it was all running on one host at Mozilla, so we really needed this host to stay up.

So I started hunting, and I found some things. I found pack.windowMemory, which is kind of useful — it's pretty cool. But the problem is there's no limit by default: it defaults to zero, meaning it can eat as much RAM as it can possibly use for the pack operation. If we're going to have pushes every 30 seconds, this is something we should set so we don't OOM the box like we were doing before. Additionally, there was gc.auto, which controls when garbage collection runs. The interesting thing is that it defaults to 6,700. Setting it to zero disables it, which is good for performance, but terrible for disk usage — and eventually bad for performance too, once you have to read all those individual objects. One of the interesting things is that Git decides whether to run garbage collection when you issue various seemingly unrelated commands — you can trigger a garbage collection just doing a git status on a repository, which was unexpected, at least to me.

So we set gc.auto down from 6,700 to 1,000. Then we rsynced the repository to a spare host so it wouldn't affect availability, told it to do a manual repack and garbage collection, and waited. And we waited some more. 18 hours later, it was done, and it was much better: 28 gigs. So the pack operations had an effect. It's important to note that if your repository is just one big pack file, that also has performance implications — fortunately they're not on the server, so it's not our problem anymore, though it can increase clone time for people trying to grab it. Fortunately, that didn't seem to bite us after this. If we look at the load graph: we had this problem, and ever since then, we haven't run into any hardcore availability problems with it. So, phew. There were no more load spikes. I mean, there were still issues, because Git — and particularly gitolite, which we're using for this — aren't perfect, but at least we didn't run into anything like this anymore.
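For reference, the tuning amounted to something like this. A minimal sketch — the windowMemory value here is my illustration, not the exact number we used:

```sh
# Cap the RAM a single delta-compression window may use (default 0 = unlimited).
git config pack.windowMemory "256m"

# Run auto-gc after ~1,000 loose objects instead of the default 6,700.
git config gc.auto 1000

# The one-time manual cleanup: repack everything into one pack, then collect garbage.
git repack -a -d
git gc
```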
So that's the first story. What else do we got? For my next story, I have to go back a little into Mozilla's history — you'll see what I mean in a bit — into the history of managing code and CI at Mozilla.

The year is 2003. GitHub won't be invented for another four years. The very idea of continuous integration was only invented in 1998, and I think there was only one piece of software out there you could use for it, called CruiseControl or something like that. But developers still existed, and we had a lot of them, and they needed some way to check their code in, have operations performed on it, and run testing. With a small project, you can have just a couple of developers and it works fine. But when you're trying to manage 100 developers, the code coming in from them, and making sure they collaborate with each other's changes, it's a particularly difficult problem to solve, especially with the software available in 2003.

So we had a CI system like this. We learned about the new hotness. We did not have a light like this, which is kind of cool — this is GitHub's build light, and it's a pretty common project to do if you're at a software org that uses continuous integration. If you grab my slides and click the link, it goes to the blog post documenting their build. The way it works is a typical CI workflow: developers write code, they generate patches, those all get bundled into Mercurial changesets, those get bundled into changegroups, and then you push them off to a server, which builds it for you. Weirdly enough, there's a Wikipedia page for these build light indicators. I don't particularly think they need one, but apparently that's Wikipedia. I found it interesting that it was there.

So anyway, this is why we called it Try. This was a system for developers to host their code and push it out to a server without having to introduce it into the main repository and potentially break everything in the world. It gets pushed, and then it runs on something like this — when a developer is not playing X-Plane on it. There should probably be some Macs in there too, because we do a lot of testing on Macs as well, which is another particular source of difficulty for us. So our developer checks some code in, and it either comes back negative or positive, and it's all good, man: they go to a site later, look at their build status, and everything is fun.

The problem comes from the server side, and its particulars come down to one of Mercurial's implementation details: changesets being immutable. For the first eight years or so of the Mercurial project, they decided that immutable history was a fantastic idea. It meant nobody could ever screw around with a repository and leave you, as an end user, dealing with the consequences. That has a lot of advantages. Unfortunately, scalability is not one of them. So in Mercurial 2.1, they introduced a feature called phases: if something is not in the public phase — if it's in the draft or secret phase — you can delete it without worrying about people downstream of you being grumpy that you performed some interactive, history-rewriting rebase and basically mangled their entire repository.
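To make that concrete, here's roughly how phases look from the command line — a sketch using stock Mercurial commands, nothing Mozilla-specific:

```sh
# Show the phase of the current changeset.
hg log -r . --template '{phase}\n'

# List the local-only (draft) changesets, i.e. the ones still safe to rewrite.
hg log -r 'draft()' --template '{node|short} {desc|firstline}\n'

# Promote a changeset to public: "this has been shared, hands off."
hg phase --public -r .
```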
Unfortunately, at Mozilla, we had a lot of difficulty upgrading to that, simply because our continuous integration environment uses so many different platforms: trying to qualify a new version of Mercurial on Windows, and Windows on ARM, and OS X, and Linux, and dealing with case insensitivity, was really difficult. So we were stuck on a version that didn't include this feature for a very long time.

When a developer creates new changes and pushes them to the try repository, they create what's called a Mercurial head, which for this talk is the equivalent of a Git branch. There are some technical differences — they don't have to have the same root node — but for this, they don't matter. As you might expect, the people who originally created this system never really gave thought to how to clean these up. That's IT's problem. So you end up with something like this: a try repository, after a little while, with the equivalent of 29,000 branches on it. I don't know about you, but I'm not sure this is an expected use case for version control systems, and I don't know if it's an acceptable use case for the upstream project either. At first, when we told them about it, they didn't think it was. But I think they took it as a personal challenge to improve this and support it, so now it's kind of becoming accepted.

The way we arrived at that number is basically this: we have 100 developers, which was true in the mid-2000s. Say 30 of them push four times a week — those people are pretty hardcore, pushing quite often to see the results of their changes — and 70 push twice a week. Summing that up gives 260 pushes a week, which is about a thousand a month, which is about 12,000 heads a year.

Now, the problem comes at 10,000 heads, and it looks like this: at around 10,000 heads, some operations just start to fail. It's interesting, because most of these heads see no further operations; they're pushed to once and then left idle, checked out once and never touched again unless someone's doing code archaeology on them. So they cause some problems but not others. Back when the web hosts and SSH hosts were the same machines, there were availability problems for developers pushing, and they couldn't load the pages to pull the code either. When we separated those out, only the push problems remained; the mirrors using local disks on the separate hosts don't have any problem serving it up anymore. So we've at least defeated that part of the problem.

The remaining problem was that pushing would take so long, and would upset so many developers — they weren't very confident in the system — that we had to create some documentation for them, which looked like this. This was (and still is) an entry in the wiki page, which I've blurred out a little because I'm going to explain it later. It says: if you experience excessive wait times, exceeding 45 minutes, please file a bug with IT. I don't know about you, but 45 minutes is slightly excessive for a push operation. But this happened so infrequently — only about once a year — that it was never seen fit to put forth the engineering effort to solve it.
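(As an aside, if you want to see how bad your own try-like repository is, counting heads is a one-liner with stock Mercurial — a sketch:)

```sh
# Topological heads are what hurt here; print one character per head and count them.
hg heads --topo --template 'x' | wc -c
```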
Whenever it happened, IT would just go and fix it manually. But because it happened once a year, poking IT made IT a grumpy bear, because nobody around really remembered how to do it and had to go dig up the documentation. It made release engineering a grumpy bear too, because they had to go cancel all the build jobs that were trying to check out code that wasn't there anymore. And it made developers grumpy, because they had to keep their changes around and push them again, just because we deleted them the last time. And that's basically how it worked for several years at Mozilla.

But the problem is that the code grew over time. This is a graph of the source lines of code in the mozilla-central project, which holds the Firefox browser and Thunderbird email client. From about 2012, when this was true, to about now, the code doubled in size, from about 6 million lines to about 12 and a half million.

So what do the symptoms look like? Developers have problems when they're pushing to a repository with 10,000 heads: pushes take 45 minutes to return, and sometimes they never return. In more detail, when you SSH into the server and look around, there's one hg serve process running with one core pegged, just burning CPU. There's no strace output, and no ltrace output, so it's not making any syscalls or library calls. Killing it yields no nice traceback you could use to figure out where it was, and if it's killed, the exact same thing happens on most subsequent runs — sometimes it didn't, and we're not quite sure why. So this was kind of a Heisenbug; it defied scrutiny. We wanted to get to the bottom of it and solve this problem, so we needed to try harder.

What we did was compile some packages. These are, painfully, RHEL 6 hosts, so we had to build the Python debuginfo package, which conveniently was not provided to us. We needed GDB installed, and I wrote a nice little GDB script we could use to dump the stack trace. The script attached to a process, ran bt, which gave us a native backtrace, ran the py-bt command to give us a Python backtrace, and then detached from the process and quit. This mostly worked, and this is what the output looked like. You can ignore the first two lines — those are just system frames that say it's running inside the Python interpreter, doing some compares. But if you look down below, you can see the ancestor line here: it's performing some iteration in there, using the iter generator, and it's happening inside the branchmap code, in the update cache function.

So we eventually figured out that what it's trying to do is update the branch cache, and with 10,000 heads, that's not very scalable at all. Now I can go unblur that little section of the wiki. It says that if a fellow developer has cancelled their push — as in, Ctrl-C'd it — then they've just saddled you with the cost of rebuilding the cache. What was happening was that when a developer Ctrl-C'd, it invalidated the entire cache for the repository, and the next run had to regenerate it from scratch each time. With really huge repositories like this, that can take upwards of 45 minutes.
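The GDB helper was essentially the following — a reconstruction of the idea rather than the exact script, and it assumes the Python debuginfo package is installed so that the py-bt command is available:

```sh
# Attach to the stuck hg serve process, dump a native backtrace (bt) and a
# Python-level backtrace (py-bt), then detach so the process keeps running.
gdb -p "$PID" -batch \
    -ex 'bt' \
    -ex 'py-bt' \
    -ex 'detach' \
    -ex 'quit'
```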
So then we asked ourselves: why is it updating the cache — or rather, why is it invalidating the entire thing? It turned out to be a Mercurial bug. We filed this bug with Mercurial back in May, and we've been working with them to fix it. It's in confirmed status, and we're still trying to whittle it down to a good minimal example case.

So what do we do in the meantime, while we can't fix this, or while we're in the process of fixing it? We filed our upstream bug, and that's good. We switched to a better storage format: the generaldelta format, introduced in Mercurial 1.9, provides about a 10x improvement in speed and compression for some of the operations it's doing, like storing metadata. We had to find ways to change the caching behavior — the way that works now is that we have some extensions running on top that let us manually control the caching, and that's worked very well for us. And then we wanted to plan a newer, more scalable system, since this one was good, but it was the solution for Mozilla in the 2000s, not the mid-2010s. What we've essentially done is add some duct tape to our wheel.

So you hit that bug for 10 years? Yeah, it was a very long-standing bug, but it would only crop up about once a year; we'd hit it on the head, and nobody thought about it much, because it didn't seem worth the engineering effort to fix. Now that we have a lot more than 100 developers, we're hitting this bug every couple of months, so it's actually worth our time to go fix it and create a more permanent solution, which is what we're doing now.

So instead of duct-taping our lunar rover and coaxing more life out of it, we can go and create something more analogous to what's used today. This is the new hotness. We're going to use it to replace the old system. It needs more web scale, preferably with some MongoDB. (I'm not actually going to use MongoDB for this.) It's going to use something closer to a pull request model. Mozilla avoided that for a very long time, but we're basically building a homemade version with Review Board and a couple of other tools, because we wanted to integrate with Bugzilla, be compliant with multiple version control systems, and have multi-master between Git and Mercurial for the repositories. We also want it because we can do easier multi-homing for disaster recovery; NFS shares aren't particularly known for working well across data centers, which is a limitation we want to avoid.

This is going to leverage Mercurial bundles, which are analogous to Git bundles. Instead of a developer taking their changes, pushing them to a remote repository, and having them live in that repository, we've created a custom server that takes their bundle, pretends to be a normal server — "all right, cool, got it" — and then shoves it into an object store. We can reference it later and trigger build jobs off of it, and it can live there forever, because this is a purpose-built object store for storing things like this.
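If you haven't used them, bundles are a stock Mercurial feature, and the flow looks roughly like this — a sketch; the repository names are just for illustration:

```sh
# Developer side: pack everything not yet public into a single file.
hg -R my-checkout bundle --base 'public()' my-changes.hg

# Server side: the bundle is an opaque blob, so it can be dropped into any
# object store and replayed later into a scratch repo to trigger builds.
hg -R scratch-repo unbundle my-changes.hg
```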
Ideally, this should require minimal tooling changes from other groups. There's a lot of legacy platform code, because we do a bunch of testing on things like Windows 7, Windows 7 Pro, Windows RT on ARM, and some smaller platforms — I think we're still building Maemo stuff — so we don't want to have to change a lot of that code to make this work. That's really important to us, and it's part of the reason progress is so slow going.

So, in review: know what you're hosting, and ask questions until you're certain you know exactly what you're getting yourself into. Don't host everything together if you can help it, because oftentimes a developer can hand you something, and if you host it on the same server as production code and their thing has a problem that affects the host, then you have problems affecting your production hosts as well, and nobody likes that. Don't put all your eggs in one basket. Don't assume the approach you're using is going to work forever: the approach we had with the try server worked for the better part of a decade, but its problems became more aggravating over time, and eventually we had to replace it — now is about that time, and honestly it's run a little long. And lastly, you don't live in a vacuum. The Mercurial community has been really helpful to us, and they're really communicative on IRC and on the mailing lists. You can just join the channel and say, "hey, I have a problem," give them a pastebin or describe it, and they'll actually work with you on it. Often that goes as far as walking through stack traces with you, telling you where to add breakpoints, or having you profile it, interactively. And in the Mercurial community in particular, there are other users trying to scale this out — Facebook and Google are also doing this, and they have a lot more engineering resources than we do — so they're working on the code itself and creating a lot of speedups through faster operations and things like that.

For some further reading, if you're interested in these sorts of things, check out the Planet Release Engineering blog at Mozilla. It's a blog planet covering a bunch of different topics; there are maybe 25 people working there and maybe 20 contributing to the blog, with posts every couple of days — this one, the latest, was only five or six days ago. A lot of stuff happens there. The other person on my team, Greg Szorc, writes about this fervently. His big thing is developer productivity, so he tries to streamline the whole process. He had a blog post pretty recently about the pain points developers go through every time they want to contribute to a project, which is something a lot of us don't think about — things like: you need an account here, and another account there, and you need to check out code here, and what's SSH, and so on. As developers we don't think about it, but if we're trying to attract new people to contribute to these projects, it's something we do have to think about. Lastly, I have a blog up at bke.ro. I write about some things — a lot of it's personal, though. So if you're interested, check these out. This is me, the slides are up here, and feel free to get hold of me if you have any questions. Thank you.

Thank you. We actually have quite a bit of time for questions. Feel free to ask anything that comes to mind — if anything does come to mind. This might not be something you've had much involvement with, but:
Have you actually found that some of the tools coming out of the Mozilla services group help with all of this? I know that, in particular, Heka has been a huge improvement in performance and general overhead headroom for the servers and processes I've been working on. Have any of those things been particularly useful that we may or may not have found, because they're small and new?

Yeah, definitely. We use a couple of tools from them. Heka is one of them; we use it to aggregate a lot of logs, and instead of the typical ELK — Elasticsearch, Logstash, Kibana — cluster, we're using Heka in there instead, and that's been really useful for us. Additionally, their Circus tool, which is kind of like supervisord but on steroids — a little more intelligent, and it plays nicely with Heka — is something we use for controlling these WSGI processes. Does that answer your question? Oh, cool.

Hi — one of the things you didn't talk about too much was DR and backup. How do you handle those?

So the question is: how do we handle disaster recovery and backup for this? The way we do it is we have incremental backups hourly through the NFS share — it's this big scary NetApp thing — and those cover the canonical copies of the repositories, so we can wave at that and say it's good enough. Additionally, we have seven copies on all of our web mirrors that we can use if we need to, and we also have backups that run on those hosts; I think we're using Amanda to copy all that off.

What kind of storage engine are you using for hosting all these repositories? And have you done any analysis on file systems, or on what other system actually works better?

So the question is what kind of storage systems we're using, and whether we think something would be better than what we're using now. Right now, the web mirrors are just local file systems, using XFS without too many tuning options. The masters are using a NetApp-backed NFS mount. We have experimented with — and might go in the direction of — one of Facebook's projects, called hgsql, which keeps the repository data inside an SQL database. That's good, but unfortunately for us it doesn't really solve a lot of the performance problems we've been having recently. Nonetheless, it's still a pretty cool project.

Yeah, me again — I just wanted to ask about, what's it called, global replication. Obviously Mozilla has offices all around the world, and you mentioned that with the new architecture you hope to do data center replication so you can hopefully make it easier for those devs to work. How are you going to solve that problem?

So, back when all of the hosts ran all of the things — they all had an NFS mount with read-write access to the data, and everything ran on those four hosts — this really didn't scale to different data centers. One of the reasons we switched to local disks on the mirrors was so we could throw a mirror anywhere. The system it uses right now is just a little clustering system I built for myself: it's literally SSH in a for loop. It maintains some persistent SSH connections, SSHes in with an identity, and triggers each mirror to pull, and the mirrors use local disks for the storage.
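In spirit, that replication system is not much more than this — a toy sketch with hypothetical host names and paths, not the actual tool:

```sh
#!/bin/bash
# Push-triggered fan-out: tell every mirror to pull the repo that just changed.
MIRRORS="hgweb1 hgweb2 hgweb3 hgweb4 hgweb5"
REPO="$1"   # e.g. mozilla-central

for host in $MIRRORS; do
    # A dedicated identity whose server-side forced command only allows pulling.
    ssh -i ~/.ssh/mirror_id "$host" "hg -R /repos/$REPO pull -q" &
done
wait    # don't return until every mirror has caught up
```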
So the theory was that if we wanted to have a mirror in, say, the Paris office, we would be able to hand them a small machine, among other things; it would serve this, we could just add it to the list of mirrors, it could mirror the data over there, and then, using some load balancers or some tricky DNS views, people there would be able to pull from their local server.

Whoa. Okay — if there are no further questions, and everything's falling apart as your talk suggests, then I get to thank you. Thank you.