So, welcome everybody to the infrastructure functional group update. I'm giving it on behalf of Ernst. This update covers the production, database, security, and Gitaly teams. First off, welcome to Victor Lopez on the production team; he started on the 12th of June. So, welcome, Victor.

Before we begin, I'll go through the overarching goal for the infrastructure team, and that is to make GitLab.com ready for mission-critical workloads. This isn't going to happen overnight, so over the next few quarters we're going to work towards it by bringing the availability of GitLab.com up to 99.9%, bringing 99% of all web requests down to less than a second, at least the user-facing ones, and completing the top 10 risk assessment actions that we've compiled.

I'll move on to accomplishments, starting with the production team. The first accomplishment I'll highlight is that we now have a Canary environment; there's a link to the issue there. The point of the Canary environment is really to help smooth out the deployment process: when we deploy a new release of GitLab.com, it'll go into the Canary first, and the release engineers and anyone else can go there, test it, and make sure it's working correctly. So I encourage you to do that next time we have a release.

The next accomplishment of the production team is that they've increased the size of the Git storage fleet to 191 terabytes by adding four new machines, highlighted in the bottom right-hand corner: Git 9 through 12. You can see they're pretty much empty at the moment, and there's about 100 terabytes of free space on the Git fleet right now. The production team has been doing a whole bunch of other things as well; they've been super busy, so I'll only highlight a few. They've moved the front-end fleets to an ARM environment, which will cut down costs and the need for network peering. They've delivered training on Terraform, a new technology we're going to use for maintaining our fleet. They've made all the dashboards on monitor.gitlab.com public. They've stabilized the ELK infrastructure, which we on the Gitaly team are finding very useful. And they've rebuilt the Redis cluster, among lots of other things. So good work there.

I'll move on to database. I think one of the biggest contributions the database team, well, Yorick, has made since the last functional group update is that project authorizations are now much faster. Yorick basically rewrote all of that code. He dropped support for MySQL in the process, but it is much faster and way better. A whole bunch of other things happened as well: background database migrations, a Postgres update, and some new rules, like no more serialized data in the database, and single table inheritance and polymorphic associations are no longer allowed; we're supposed to put those in separate tables instead. I don't know Rails very well, so I don't know much about that.
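Going back to the project authorizations work for a moment: the update doesn't spell out the mechanics, but the usual technique for this kind of speedup, and one consistent with dropping MySQL (which didn't have recursive CTEs before version 8.0), is to replace a one-query-per-nesting-level walk of the group hierarchy with a single recursive CTE. Here's a minimal sketch under that assumption; the table and column names are illustrative, not GitLab's actual schema.

```ruby
require "pg"

# Hypothetical sketch: find every group a user can reach, including
# nested subgroups, in one round trip using a recursive CTE.
# Table and column names are illustrative, not GitLab's real schema.
def authorized_namespace_ids(conn, user_id)
  sql = <<~SQL
    WITH RECURSIVE reachable(id) AS (
      -- anchor: groups the user is a direct member of
      SELECT g.id
      FROM namespaces g
      JOIN members m ON m.namespace_id = g.id
      WHERE m.user_id = $1
      UNION
      -- step: descend into subgroups of anything already reachable
      SELECT child.id
      FROM namespaces child
      JOIN reachable r ON child.parent_id = r.id
    )
    SELECT id FROM reachable
  SQL
  conn.exec_params(sql, [user_id]).column_values(0).map(&:to_i)
end
```

The point is that the database does the hierarchy walk itself, so the nesting depth no longer multiplies the number of queries.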
From a security point of view, Brian has published a breach notification policy, which is linked there. We now have VPN servers for production use, so when we want to access the production servers we can go through a VPN and don't have to use a public IP address anymore. Backup images are encrypted, and we're regularly testing that the backup process is working correctly. There's also been better protection for DNS domains, to stop people from stealing, say, the GitLab.com domain, and better filtering for sensitive data in log files, so Workhorse will no longer write access tokens to the log files.

Moving on to Gitaly: one of the things we've been doing over the last few weeks is ramping up, moving Gitaly requests over to the file servers. The graph over here shows how we've been ramping that up. It started at zero, on the two-week scale shown there, and every day we add a few more machines to that fleet and watch to make sure everything's working properly; so far it's been going really well. What we've been finding is that the requests that go to the file servers seem to get processed much quicker than the ones that go through the NFS servers, sometimes by several orders of magnitude, so that's looking really good.

We've also managed to migrate about 10 endpoints to Gitaly. You can click through that chart to the real charts and see how each of those endpoints is doing, comparing the file-server version and the worker version; on most of them there's a really good performance gain. It's also worth pointing out that at the last functional group update we had, I think, one endpoint running in production. Now we've got 10, and we're moving ahead with a whole lot more in 9.4.

Another thing we've done on the Gitaly team is start using structured logging, and this has been really interesting because it lets us slice the logging data in new, interesting, and sometimes surprising ways. The graph on the left shows how much time, at least in Gitaly, we're spending across a whole bunch of different repos; no surprise, the repo we spend the most time on is GitLab CE. The graph on the right shows how many requests are being sent for each repo, and what we found there was a mirror of GitLab CE that doesn't seem to be used very much but is getting way more requests than any other repo in the system. We were able to see that with the structured logging, so we're quite happy with the results; it's helping us debug other problems and find out what's going on, and we can also figure out which users are using GitLab the most and putting the most pressure on our services.

Another thing we added was feature flags, and I believe some other teams have started using them. What it means is that, just in Slack, you can tell Marvin to toggle a feature flag: you can tell it to be on, tell it to be off, or say 50% on, and 50% of users will get that feature flag and 50% won't. We've been finding it really helpful for turning things on in GitLab gradually.
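To make the 50% mode concrete, here's a minimal sketch of how percentage rollouts are typically implemented; this is a generic pattern, assumed for illustration, not Marvin's or GitLab's actual feature flag code. The trick is to hash a stable identifier so each user consistently lands on the same side of the flag.

```ruby
require "zlib"

# Minimal sketch of percentage-based feature flags (illustrative only).
class FeatureFlags
  def initialize
    @percentages = {} # flag name => 0..100
  end

  # "on" is 100, "off" is 0, "50% on" is 50.
  def set(flag, percentage)
    @percentages[flag] = percentage
  end

  # Hash the user id so a given user always gets the same answer,
  # instead of the flag flapping on every request.
  def enabled?(flag, user_id)
    pct = @percentages.fetch(flag, 0)
    Zlib.crc32("#{flag}:#{user_id}") % 100 < pct
  end
end

flags = FeatureFlags.new
flags.set(:new_gitaly_endpoint, 50)
flags.enabled?(:new_gitaly_endpoint, 1234) # => true for roughly half of users
```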
We'll move on to concerns. The first concern is for the production team. We're making progress towards our SLA of 99.9% availability; in June we were at 99.88%. But the problem is that at the moment deploys are causing downtime, and the application really needs resiliency built in. Obviously, Gitaly will help with that, but having zero-downtime deploys obviously makes a huge difference as well, and there are other things we can do in the application. The next one is quite a worrying graph: it shows the latencies that Pingdom is reporting for GitLab.com. If you look from October through to now, those times should be on a downward trend, but they're on an upward trend, which is kind of worrying; we're not seeing the improvements we hoped for there. And the last concern from the production team is that a lot of outages are the result of unscheduled deployment issues, and that's something we really need to focus on and make sure gets tackled. It's worth pointing out that these concerns are really concerns for the entire company, or at least for all engineers at GitLab; they're not something the production team or the infrastructure team can tackle on their own. People need to make sure that their application code is resilient, that they don't have migrations that need downtime, et cetera. It really is everyone's responsibility to work together to improve these numbers.

I'll move on to database concerns. This one is Yorick's: it's a graph of the read and write times of the primary database server versus the secondaries, and you can see the top lines are the primary while the secondaries are way down below. There's an issue open on that, trying to figure out how we can optimize the disks on that volume on the primary server to make them a bit faster.

A concern from the security team, from Brian, is that there's not enough activity on security issues. I made a little chart of how many security issues are being tackled per milestone, and the trend is not good. So if you can pick up any security-labeled issues, please run with them. Brian had some other concerns around application audit logs, around staging needing better sanitizing, and around audit trails for cloud servers: at the moment, if someone logs into AWS, we don't really know who was doing what and when, and the same goes for Azure.

The biggest concern we have from the Gitaly team is that at the current pace it'll take us about a year to complete the migration from NFS to Gitaly, and from a company point of view we obviously need to get it done way sooner than that. So we're considering at the moment how we can speed that up.

I'll move on to plans for the future. The first plan is that we're going to be rolling out Elasticsearch in production, so we'll have all the Elasticsearch goodness for finding things on GitLab.com. It was blocked on another issue, but Pablo updated that just before this presentation and it's no longer blocked, so presumably it will be done by the next functional group update. The next thing Pablo's team is working on is containerizing the front-end fleet so that we can auto-scale it with the Helm charts, basically moving towards a Kubernetes environment. We'll also be getting a Postgres failover mechanism, which will come included in the Omnibus package.

Moving on to the plans of the database team: Yorick will be optimizing the 10 worst-performing Rails controllers, which I've listed down below. There are issues for each of those controllers to sort out their database access and try to speed them up.
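Since most of those controller fixes come down to database access patterns, here's a small illustrative example of the classic N+1 problem, a very common culprit in slow Rails controllers, and its standard ActiveRecord fix. It assumes a Rails app with hypothetical `Project` and `Namespace` models rather than GitLab's actual code.

```ruby
# Illustrative only: assumes a Rails app where Project belongs_to :namespace.

# N+1: one query for the projects, then one extra query per project
# to load its namespace -- 1 + N queries in total.
Project.limit(20).each do |project|
  puts project.namespace.name
end

# Fix: eager-load the association so ActiveRecord fetches all the
# namespaces in a single additional query -- 2 queries in total.
Project.includes(:namespace).limit(20).each do |project|
  puts project.namespace.name
end
```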
Another plan is to migrate push events out of the events table. Yorick's hoping this will save about 100 gigabytes of disk space in the database, which is pretty incredible.

On to security plans: one of the plans Brian is working towards is offering VPN services for all team members, so we can work remotely from coffee shops and co-working spaces more safely. There's an issue for that as well. Some other security-related plans are endpoint security management, properly documented incident response plans, package signatures, and a disaster recovery plan. And something has been left in the presentation.

From the Gitaly point of view, what we're really working on is migration. We're taking individual pieces of code, pulling them out of Rugged in the GitLab CE code base, and moving them into Gitaly. It's quite a laborious process, and we're carrying them through one at a time. We've got a board with pretty much all of our migrations on it, and the point is basically to move things from left to right across that board as fast as possible. The other plan for Gitaly that we're working on at the moment is how we can speed up our rate of migration by at least 10x. That's something we're looking at quite seriously right now, what we can do to speed things up so that we can get this in as soon as possible, and we have a few plans on that.
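To give a feel for what one of those migrations looks like in code, here's a minimal sketch of the general pattern, assuming a feature-flag-gated fork between the old Rugged path and the new Gitaly call; the class and helper names (`feature_enabled?`, `gitaly_client`) are hypothetical, not the actual GitLab or Gitaly client API.

```ruby
require "rugged" # the old code path still needs the Rugged bindings

# Hypothetical sketch of one Rugged-to-Gitaly migration, gated by a
# feature flag so it can be ramped up gradually and rolled back fast.
class Repository
  def commit_count(ref)
    if feature_enabled?(:gitaly_commit_count)
      # New path: ask the Gitaly server to do the walk over gRPC.
      gitaly_client.commit_count(ref)
    else
      # Old path: open the repository on NFS and walk it with Rugged.
      walker = Rugged::Walker.new(rugged_repo)
      walker.push(rugged_repo.rev_parse_oid(ref))
      walker.count
    end
  end
end
```

The flag makes it possible to ramp a single migration up gradually, much like the Marvin-controlled percentage flags mentioned earlier, and to fall back to Rugged instantly if something misbehaves.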
Finally, we are hiring. So if you know anyone who would make a good director of security, security specialist, database specialist, or production engineer, please follow those links.

And I'll go back to my browser to see what questions I've got. Cool. Sean is pointing out that single table inheritance means representing several types in the same table. Right, okay. Josh asks: does it log who clones a repo? At the moment, Josh, Gitaly doesn't have any sort of information about users, so we have the repo but not the actual individual user who's doing the operation. One thing I think would work pretty well is if we pushed that structured logging forward into the GitLab application, obviously just a little bit at a time, and did structured logging in the Rails application as well, because then we'd be able to do that. Toon says that's nice to hear, but I'm not quite sure what that was about. "Doesn't everyone already have a VPN service?" asks Kim. Probably. Cool, okay. Thanks. Thanks, Sid. Sid asks: what can back-end people do to help with performance? I think the number one thing really is N+1 problems, N+1 queries, and those aren't just SQL queries. Going forward, there will also be N+1 Gitaly queries: if someone takes a piece of code and calls Gitaly 20 times in a controller, you can have exactly the same problems. So it's really a matter of profiling your code and understanding what it does in various situations; I'd say that's the most important thing. And knowing that you're using the most optimal implementation of your code is really important. I'll add to that: the top 10 controllers, the list that Yorick linked, I think those will give the biggest bang for the buck. We are profiling those, so please take a look at the profiles, understand why a controller is slow, and take one of those issues if you have a chance. Yorick can do a few of them, but it really helps if everybody divides and conquers. Yeah, absolutely. Cool.

Ben says he's working on trying to convince a good candidate for director of security. Good. Mike asks: with us accepting merge requests on Gitaly migrations, can we stimulate the community for a big push? Yeah, basically anyone who wants to can help: we've marked a whole bunch of Gitaly migrations as accepting merge requests, and Jakob has actually done a whole bunch of analysis on them. A lot of them are actually Rails-only, and we've marked those as Ruby-only, so for people who don't want to take the plunge into Golang yet, you can contribute to Gitaly in a pure-Ruby world as well, which is great. And Sean says we should ask the community to do work that we think is important. Yeah, so we are marking a whole bunch of work on our migrations board as accepting merge requests, and really it's a matter of starting to promote that to the community. "Metrics, all the things." Can we also ask for help on performance issues? Yeah, maybe those as well. Yeah, okay, cool. Any other questions from anyone? Cool. Then I think that is the update. I hope everyone has a wonderful day.