Right, it's three o'clock in London, so let's start the functional group update. Hello everyone, and welcome to the functional group update for infrastructure. My name is Andrew, and I'll be walking you through what the infrastructure group is doing at the moment, or has been doing for the last five weeks.

Cool. I used this slide in my last FGU run-through, but I think it's worth highlighting again, because it's the number one priority for the infrastructure group: making GitLab.com ready for mission-critical tasks. We're expecting to be in much better shape for this once the GCP migration is finished later this year. If you remember my last FGU, there are two stages to the migration: the first is migrating from Azure to GCP, and the second is migrating from Omnibus to Kubernetes. We expect to be in shape for the mission-critical goal once we've moved to Kubernetes. So with that in mind, let's continue.

The first thing I wanted to highlight is the work Yorick has been doing building out a ChatOps feature for GitLab. If you don't know what ChatOps is, I encourage you to read through the issue I've linked there. Briefly, it allows us to run commands in Slack that are executed against our infrastructure. The first example, on the left, is running an EXPLAIN plan against the production Postgres instance from Slack, and the example on the right is fetching a graph from ChatOps. The reason this is so important is that once it's complete, we'll be able to automate a lot of the toil tasks the production team currently have to do by hand, and developers will be able to run EXPLAIN plans against production and come up with better queries before things go into production. Yorick has been making a fantastic effort on this, so congratulations to Yorick.
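To give a feel for the shape of that, here's a minimal sketch in Go of a Slack slash-command handler that runs EXPLAIN against Postgres. To be clear, this is purely illustrative: the endpoint, connection string, and handler names are all invented for this sketch, and the actual implementation is the one in the linked issue.

```go
// Illustrative sketch only, not GitLab's actual ChatOps implementation.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"

	_ "github.com/lib/pq" // Postgres driver
)

var db *sql.DB

// handleExplain receives Slack's form-encoded slash-command payload,
// wraps the submitted query in EXPLAIN, and returns the plan as text.
// A real handler would also verify Slack's request token and restrict
// what is allowed to run, which is why a read-only replica matters.
func handleExplain(w http.ResponseWriter, r *http.Request) {
	query := r.FormValue("text") // e.g. "SELECT * FROM projects WHERE id = 1"

	rows, err := db.Query("EXPLAIN " + query)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	defer rows.Close()

	// EXPLAIN returns one text column per plan line; stream it back.
	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, line)
	}
}

func main() {
	var err error
	db, err = sql.Open("postgres", "postgres://readonly@db.example.com/gitlabhq?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/commands/explain", handleExplain)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```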
Cool, then let's move on. The next thing I want to talk about is the GCP migration. This is a burndown chart showing our progress towards having a fully ready production environment in the Google Cloud Platform. As you can see, we're running a little behind schedule at the moment. This is down to staff shortages: we're running with a much smaller team than we'd hoped, there were some vacations last week, and there were quite a few production problems we had to deal with. But all in all, we're hoping to come in pretty close to the middle of the month.

The next thing I want to talk about is the Geo transfer. The Azure-to-GCP migration has three main parts: number one, setting up a production environment; number two, getting all the data from Azure to GCP; and number three, taking a lot of the data off disk and moving it into object storage, which I'll talk about in a second. The Geo transfer has gone really, really well. These numbers look high: the count of repositories that have failed is 48,000, which sounds like a lot, but in the context of millions of repositories it's actually very low. Some fixes have gone into the latest release, 10.6, which should bring it down further. Until about a week ago the number was around 100,000; the fixes brought it down to about 50,000, and we expect it to keep dropping quickly. So generally the Geo transfer from Azure to GCP has gone really well, and we're really happy with how it's gone.

The second part of the plan I mentioned is the object storage migration, where we're taking a lot of the data that isn't Git repositories, LFS objects, artifacts, uploads, traces, and moving it into object storage. Strictly speaking, this isn't needed for the migration from Azure to GCP, but it's preparation for the next step, migrating from Omnibus to Kubernetes, because Kubernetes won't let us keep relying on local disk storage. The fact that it also makes the Geo transfer easier helped, which is why we chose to do it now. This has also gone really well, though there were a few problems we had to make application changes for; those have gone in, and I've included some links there. Most of the blockers currently holding up progress have been addressed in 10.6, so we're hoping that in the next few days we can restart those migrations and finish them off pretty soon.

Cool. Another point I wanted to highlight: as part of the GCP migration, we're hoping to move as many of the logs as possible to structured logging. If you don't know what structured logging is, it means emitting log events as structured key-value data, typically JSON, rather than free-form text; in the background there you can see a typical structured log. The reason it's so helpful is that we can diagnose problems a lot quicker: we can find problems, generate graphs from logs, and generally track down issues. It's particularly useful for finding abuse; if someone is hitting a repository really hard, it's quite easy to find them using structured logging. Gitaly has always had structured logging, Greg has been working on bringing structured logging to PostgreSQL, Workhorse is complete, Jakob is working on GitLab Shell, and we have partial structured logging in GitLab Rails. One of the last remaining parts is Sidekiq, and that just needs to be scheduled and put into a release.
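To show concretely what a structured log event looks like, here's a minimal sketch using the logrus library in Go. The field names are invented for illustration and aren't our actual log schema.

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// Emit events as JSON rather than free-form text, so they can be
	// indexed, graphed, and queried downstream.
	log.SetFormatter(&log.JSONFormatter{})

	// Each event carries discrete fields. Finding every request against
	// one repository, e.g. when hunting abuse, becomes a simple field
	// match instead of a regex over raw text. These fields are
	// hypothetical, not our real schema.
	log.WithFields(log.Fields{
		"method":      "POST",
		"route":       "/api/v4/projects/:id",
		"project":     "gitlab-org/gitlab-ce",
		"duration_ms": 142,
		"remote_ip":   "203.0.113.7",
	}).Info("request completed")
}
```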
The next item is something the infrastructure group can't take credit for, it was actually done by the Prometheus group, but it's really, really important to us: the Rails application is now being scraped by Prometheus. That means we have application metrics from Rails coming into Prometheus, so we can start building graphs and putting alerts on them. I encourage all development teams to go take a look at that data, see what's there, and figure out how you can use it to improve things.

The graph here is just one example of the kind of data we now have that we didn't a few weeks ago: which controllers are calling Gitaly and creating the most load. By looking at it, we found some unexpected data we'd never had insight into before. The number two largest consumer of one particular Gitaly endpoint turned out to be the network controller; it was a very small, very easy-to-fix bug, and we got it patched, so it should basically drop off and not even appear on this graph anymore. Having that Prometheus data is really, really exciting.

Here's another little piece of work the infrastructure team is looking at that you might be interested in. Basically, we're going to run Canary via GitLab.com. Instead of having to go to canary.gitlab.com to test the Canary, you'll be able to opt in via a bookmarklet, and then whenever you go to GitLab.com you'll actually hit the Canary instance. What we'd like is for a lot of the team, or really a lot of the company, to sign up for this and use Canary as their primary instance of GitLab: when you click a link it opens as GitLab, but the traffic is actually going to our Canary instance. We're hoping the result is that problems heading into a release get caught in Canary much earlier. At the moment we run everything through a Canary environment, but it doesn't generally get tested, and the first time we find out about a problem is when it hits production. So this is a small piece of work to make Canary more generally available to everyone. We did this with Gitter for a long time and found it works really well. Comment on the issue if you have any thoughts on it.

The next thing I want to talk about is Gitaly. We're very, very close to Gitaly version 1.0. The Gitaly 1.0 milestone is the point at which GitLab.com can run without any NFS access. That's not removing all Git access from the GitLab Rails application, but it certainly covers everything we use on GitLab.com. As part of that, we're going to do some testing with a feature flag: when it's enabled, anything that tries to use local file system access will raise an exception. We can gradually turn it on across the fleet, and if we see any problems we can turn it off again, though we're hoping we won't have to. Hopefully 10.7 will be the point at which we hit Gitaly 1.0.
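As a rough illustration of that guard pattern, here's a minimal sketch in Go. The flag name and helpers are invented, and the real mechanism lives in the Rails application and raises a Ruby exception; this just shows the shape of the idea.

```go
package main

import (
	"fmt"
	"os"
)

// featureEnabled stands in for a lookup against a feature-flag store;
// here it's stubbed out with an environment variable.
func featureEnabled(name string) bool {
	return os.Getenv("DISALLOW_LOCAL_GIT") == "1"
}

// openRepository guards direct filesystem access to Git repositories.
// With the flag on, any code path still touching local disk fails
// loudly, so stragglers surface during testing; flipping the flag off
// restores the old behaviour immediately if production is affected.
func openRepository(path string) (*os.File, error) {
	if featureEnabled("disallow_local_git_access") {
		panic(fmt.Sprintf("local Git access attempted for %q; go through Gitaly instead", path))
	}
	return os.Open(path)
}

func main() {
	if _, err := openRepository("/var/opt/gitlab/git-data/repo.git"); err != nil {
		fmt.Println("open failed:", err)
	}
}
```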
The final thing I wanted to highlight is the work Andreas has been doing. Andreas joined around the time of my last FGU, about five weeks ago, and he's been working alongside Yorick and Greg on improving SQL timings, looking for slow queries and fixing them up. He's been doing a fantastic job, and this is just one example. So if Andreas approaches you with questions about some SQL, I encourage you to help him out; it'll probably pay dividends.

And finally, we're doing a lot of hiring in the infrastructure group at the moment, so if you know anyone for any of these roles, please, please get in touch with myself or Nadia. And that is the update, so let's move across to questions. I'll stop sharing my screen.

Cool. The first question is from John: is ChatOps part of Libre or Premium? As far as I know it's a premium feature. I don't think it's an Ultimate feature... anyone? Oh, there we go. Okay, it is an Ultimate feature; it's in the README of the repo. Okay, thanks.

Then Reb asks: at this point, what data does Geo not transfer? When we started, Geo didn't transfer traces, but traces are now artifacts, so it does handle those. The other thing Geo doesn't transfer is Pages data, so we're going to be doing a very big rsync from Azure to GCP to get the Pages data across. As far as I'm aware, and I stand to be corrected on this, there's nothing on the roadmap to add Pages to the Geo product. I don't know if anyone can answer that. There's nothing on the roadmap right now, no. The Docker registry is also not replicated at the moment; we tend to ask customers to use S3 so they can handle replication some other way, but it's not built into Geo yet. Thanks, Stan. Cool. And Stan just had a comment on the graphs. Right, and John was answering Reb's question.

How will we know when GitLab.com is ready for mission-critical tasks? That's a very good question, and I think we need a proper answer to it; ask everyone and you'll get a different answer from each. My personal view is that we have to have auto-scaling. Right now, if we get a very large surge in traffic, the site just gets slower, so one thing I'd expect of readiness for mission-critical tasks is that we can auto-scale very rapidly to deal with traffic as it arrives. Another part, I think, is more limiting of users doing abusive things. At the moment there are certain operations users can perform very easily that can take down a small part of our infrastructure, and I think we need more limiting on those operations, so we spot them and prevent users from doing them. I'm sure most of the time it's by accident, but it still happens. Those are the two main things I'd say.

Right: is Prometheus for Unicorn available on GitLab.com? Yes, it is. It's in a different data source in Grafana, called "Prometheus app" or something like that. There's also a separate URL, prometheus-app-01, which I'll paste into the chat, where you can go and dig around and explore. I believe the Prometheus team found there were too many series to put it in the main Prometheus instance, so there's a second Prometheus instance that holds the Unicorn data. One problem with it at the moment is the naming: we're in a transitional period, moving from InfluxDB to Prometheus, and a lot of the series names are not easy to work with. They're very long, with what are effectively Prometheus labels baked into the name itself. We need to fix that, but having the data at all is still super valuable.

Cool. Thanks, John, for adding another link on auto-scaling. And Paul has a link about DDoS attacks; I'll have to read that. Oh, right, is this what GitHub were attacked with? Yeah. I know GitHub have a lot of the limiting infrastructure I talked about in place as well; they've given talks about how they limit users who try to do 10,000 clones of a repo simultaneously, that kind of thing. We've got the basics of that, but we need to do more.
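To make that limiting idea concrete, here's a minimal sketch in Go of per-user token-bucket limiting using golang.org/x/time/rate. The policy numbers and function names are invented for illustration; this isn't GitLab's actual abuse protection.

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// One token-bucket limiter per user: short bursts are fine, but
// sustained floods of expensive operations (clones, archive downloads)
// get rejected before they can degrade the fleet for everyone else.
var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

func limiterFor(user string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	if l, ok := limiters[user]; ok {
		return l
	}
	// Hypothetical policy: 1 clone/second sustained, bursts of up to 10.
	l := rate.NewLimiter(rate.Limit(1), 10)
	limiters[user] = l
	return l
}

func handleClone(user, repo string) error {
	if !limiterFor(user).Allow() {
		return fmt.Errorf("rate limit exceeded: %s cloning %s", user, repo)
	}
	// ... perform the clone ...
	return nil
}

func main() {
	// The first 10 attempts pass on burst capacity; the rest are rejected.
	for i := 0; i < 15; i++ {
		if err := handleClone("heavy-user", "gitlab-org/gitlab-ce"); err != nil {
			fmt.Println(err)
		}
	}
}
```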
Cool. Any other questions? Nope. Okay. Well, have a lovely day everyone, and I'll see you in the team call in 13 minutes.