Hi, good afternoon, everybody. So yeah, as William said, we're going to be going over best practices for Kubernetes runners. These are lessons that a three-person team learned while managing GitLab runners for around 1,800 users at F5 Networks. We'll start with a bit of an introduction: a little about myself, F5 Networks, and our install. Then we'll go over Mike and the DoS he performed. It could have happened to anybody, but Mike is our lucky volunteer. We'll go over limits, why you need them, and why setting limits is difficult in a Kubernetes environment. Finally, we'll go over labeling workloads, which, while it's not the most important thing you can do with your pipelines, is a very useful tool when you're trying to debug or get more metrics and information out of them.

So, just to get started, a bit about me. I'm a software engineer at F5 Networks. I've been using GitLab since around 8.x, and even though I've done an rm -rf / on our GitLab instance, I still have sudo access, so yay me.

Like I said, we have about 1,800 users worldwide on our install, and roughly 7,500 projects when I checked last month. That's excluding forks, so I'm not inflating the numbers with people forking a project and then contributing merge requests back. Somewhere around 350,000 to 400,000 CI jobs are kicked off every month, most of them going through the runners my team manages. And for all of this, up until recently, we had only three engineers managing our whole CI pipeline, GitLab itself, and a few other tools on the side, so any requests, questions, or issues all came to our three-person team. Recently we've been expanding the team, so it's a lot easier. We've also brought on more people in India to give us closer to 24/7 coverage for watching tickets, but that's a newer introduction.

On the screen here you can see a basic architecture diagram. In our case, we had to break up our GitLab instance. Originally it was a single server, a single VM that was taking up a whole hypervisor in our virtualization lab. Obviously that was a little too big, so we broke it up into seven individual servers. We have two HA web servers, you could call them, and on top of that we have Postgres, Sidekiq, Redis, and Gitaly all broken out into their own servers. Additionally, we have a Postgres replica, a read-only copy of Postgres, which we use for reporting and for monitoring, looking for anomalies. This is all fronted by an F5 BIG-IP, which, if you're familiar with F5, you've probably heard of. It doesn't really pertain to this talk, but that's what we use for load balancing.

So, GitLab runners. We manage two main types. We have our Kubernetes runner, which handles most of the jobs; I'd say about 90% of CI jobs at F5 go through one of our Kubernetes runners. On top of that, we also use Docker Machine, both on-prem and in the cloud. Being a networking company, sometimes you need to configure custom networks in your jobs for integration testing, and sometimes even unit testing. Docker Machine gives you a virtual machine dedicated to your job, so you can set up your custom networking topology within Docker, run your test, and tear it down afterwards without affecting other users.
You can't really do that in Kubernetes, so we've chosen to give users a Docker Machine type of instance as well.

So let's talk about Mike. Mike, or at least that's what we're going to call him, was given the task of setting up some automation to configure various settings on different projects. He decided he would use a GitLab CI job to do it, which makes sense: you're configuring GitLab, you want it in source control so you can iterate on it, everybody can review it, and it's an easy way to manage it. So he used GitLab CI for this, but unfortunately, when he was setting it up, he didn't quite read the fine print in the library he was using.

So let's take a look at his source code. This is Node.js; for those of you who aren't familiar with it, don't worry, it's a pretty simple example. This is the actual code he ran, with the one exception that I've swapped out the group name. All it does is look up the project ID for your project and, assuming it finds it, create a scheduled pipeline on it that runs once per hour at the 30-minute mark. Seems straightforward. The problem comes when this code runs inside the very project it's scheduling, which is what happened in this example: the scheduled pipeline runs this same script again.

You might see where I'm going with this, but if you don't, you get something like this. You schedule a job, then the next hour that schedule runs and schedules another one, so now you've got two schedules, then four, then eight. After 10 iterations you've got around 1,024 jobs scheduled, and after 15, around 32,000. If this were allowed to run for 24 hours, you'd end up with about 16.7 million jobs being scheduled in that 24th hour, and with our normal CI volume of 350,000 to 400,000 jobs per month, that will easily clog up your system, taking down your Kubernetes nodes and most likely your GitLab nodes as well. It's a pretty easy mistake to make. His assumption was that this create function would not create a scheduled pipeline if a matching one already existed. The way to fix it is to list the existing schedules, see if the one you want is already there, and only create it if it isn't; I'll show a rough sketch of that fix in a minute. But something as simple and benign as this could easily take down your system.

So what do you do when this happens? Well, first you have to find out who did it, or more specifically, what project did it. You don't necessarily need to know the user yet, but you need to know what to stop. Once you've determined which project it is, you disable CI on that project, and the reason you disable CI is so it can't spawn any more jobs. Once CI is disabled, kill all the pending jobs. For that, we use the GitLab Rails console: we SSH directly into one of the nodes, run gitlab-rails console to get an interactive shell, and run the script on the screen, which I've linked as a snippet. This will cancel all pending jobs. You could also do this via the API, but because we have access to our machines, we chose to do it this way. It also has the advantage that if the API is not responding, we can still connect directly and bypass that issue.
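The linked snippet isn't reproduced here, so this is a minimal sketch of the kind of cleanup being described, assuming it's run from gitlab-rails console; GitLab's internal model and method names have shifted between versions, so treat it as illustrative rather than the exact script:

```ruby
# Illustrative only: cancel every CI pipeline still sitting in the pending state.
# Run from `gitlab-rails console`; model/method names vary by GitLab version.
Ci::Pipeline.where(status: 'pending').find_each do |pipeline|
  pipeline.cancel
end

# Or scope it to just the offending project:
# project = Project.find_by_full_path('some-group/offending-project')  # placeholder path
# Ci::Pipeline.where(project: project, status: 'pending').find_each(&:cancel)
```

And circling back to Mike's script, here's a hedged sketch of the check-before-create fix described above. It uses Node 18+'s built-in fetch against GitLab's pipeline-schedules REST API instead of whatever client library he was actually using, and the URL, branch, token, and schedule description are all placeholders:

```javascript
// Sketch of the fix: only create the hourly schedule if it doesn't already exist.
const GITLAB_URL = 'https://gitlab.example.com/api/v4';           // placeholder
const headers = {
  'PRIVATE-TOKEN': process.env.GITLAB_TOKEN,
  'Content-Type': 'application/json',
};
const projectId = encodeURIComponent('some-group/some-project');  // placeholder

async function ensureSchedule() {
  // List the project's existing pipeline schedules.
  const res = await fetch(`${GITLAB_URL}/projects/${projectId}/pipeline_schedules`, { headers });
  const existing = await res.json();
  if (existing.some((s) => s.description === 'hourly-config-sync')) return; // already there, do nothing

  // Otherwise create it: once per hour at the 30-minute mark.
  await fetch(`${GITLAB_URL}/projects/${projectId}/pipeline_schedules`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ description: 'hourly-config-sync', ref: 'main', cron: '30 * * * *' }),
  });
}

ensureSchedule().catch(console.error);
```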
Back to the cleanup: you can also, potentially, kill the jobs that are already running, and that one is more of a judgment call. If a job is going to take 30 seconds, it's probably not worth killing the ones currently in flight. If a job is going to take 30 minutes, you're probably going to want to kill those jobs and free up the resources for the users who are waiting. And to do that, you would just change status pending to status running in that script.

So what did we learn? Well, we need to monitor our queue size in GitLab. So we leveraged that read-only database I mentioned earlier and set up a cron job that queries it every two minutes and verifies that we're not exceeding some threshold, specifically on how many jobs are in the pending state. This is an actual notification that went out last month, I think it was, when our job queue was backing up. You'll notice a few things in it. First off, there's a dashboard you can visit, which I've hidden here. You'll also notice that we include the playbook in the actual alert. The reason we do that is that if it's 3 a.m., you don't want to have to think. You want it right in front of you. You don't want to have to remember which playbook to run or where to find it; you just want to know exactly what to do. So we've included the playbook in our alerts. So far this has saved us from two additional DoS incidents, which happened over the summer courtesy of an overeager intern who put pipeline trigger calls in an infinite loop. All things considered, it has saved us two outages, which makes it well worth it. At first we did have a few false positives while we were still tuning it, but it's pretty well tuned now, so we don't really get those anymore.

So let's talk about limits. Raise your hand if you have an unlimited cloud budget. That's what I thought. Just for people watching this later: nobody raised their hand. Nobody has an unlimited cloud budget. And if you're running on-prem, it's even more constrained, because it takes time to get new hardware in and make it available to people. You need to buy it, rack it, stack it, configure it, and install the software onto it. So to meet the demands of all your users, you need to put limits in place; otherwise, one or two users can use up all of your resources. Users may not like limits, and I know I don't, but they help keep the environment healthy for everybody. And while limits are enforced, they're really a conversation starter. Just because you set a limit does not mean it will never change. Projects grow, projects evolve, and projects will need their limits raised in the future; that's the way things work. So limits are a way to start a conversation when users come to you and say, hey, these limits don't work.

Kubernetes supports requests and limits natively, and the GitLab Runner Kubernetes executor lets you inject them, which is perfect. When it comes to requests and limits: requests are used by the Kubernetes scheduler to decide which node to run a workload on, and limits are used to decide when to kill a job for exceeding them. You can have different requests and limits. If you don't specify a request but do specify a limit, the limit value is what the scheduler uses as the request.

So here's an example. This is what the link on the previous slide points to: the config.toml we have for our GitLab Runner, roughly what we have running in production. This is our Kubernetes runner, and we have a concurrency of 200, so at most 200 jobs will spawn at once.
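The full file isn't on the slide, so here's a hedged reconstruction of the relevant pieces, assuming GitLab Runner's Kubernetes executor; the section and key names are the executor's standard ones, the values are the rough numbers from the talk, and the URL, token, and everything else is omitted:

```toml
# Sketch of a Kubernetes-executor config.toml with the limits described in the talk.
concurrent = 200            # at most 200 jobs in flight across this runner

[[runners]]
  name = "k8s-ci"           # url, token, and other registration details omitted
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "gitlab-ci"          # hypothetical namespace
    # Build container: requests drive scheduling, limits are the kill threshold.
    # If a request is left out, Kubernetes falls back to the limit value.
    cpu_limit    = "2"
    memory_limit = "6Gi"
    # Helper and service containers get much smaller caps.
    helper_cpu_limit     = "1"
    helper_memory_limit  = "1Gi"
    service_cpu_limit    = "1"
    service_memory_limit = "1Gi"
```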
And you'll see that we're limiting the build container to two CPUs and six gigabytes of memory, and on the helper and service containers the CPU and memory limits are one CPU and one gig of memory each. It gives you the flexibility to break that out, because generally you don't need as much CPU or memory for a service you're spinning up in your CI job. For example, if you spin up Postgres for some sort of integration test, you don't need as big a container as you do for actually building your code. So this is powerful in that it lets you break that up and actually enforce it.

So, adding limits after the fact. Adding limits after the fact is in some ways easier than adding them before you've set up a Kubernetes runner, although I would not necessarily recommend it. This is what we did. We started using this Kubernetes runner when there were roughly 50 of us and maybe 100 projects in GitLab. It was very small, we grew naturally, and we didn't know better. We were also using it back when the Kubernetes runner was experimental, and I'm not sure it even supported limits at the time. But the way you do it is: you enable the Kubernetes metrics server, and then you start monitoring your resources.

So you need to determine what users are actually doing, and we have a few tools for that. This first one I wouldn't necessarily recommend for this purpose, but it is a useful tool. It's called K9s, or "canines," depending on who you ask, and it's essentially a top-like interface for Kubernetes. It lets you look at what is running right now and see, for example, how many CPU cores and how much memory each pod is using. You can also see the percentage of CPU and memory relative to the limit. You'll notice it says zero for those percentages here; that's because I haven't set any limits in this particular demo. If you don't have limits set, it will always show zero in those columns, so they may not be useful to you directly, but they can be a useful hint: hey, this is showing 0%, I probably don't have limits set.

More likely you'll use something like Prometheus and Grafana. You're probably familiar with these, and GitLab has a native integration with Prometheus and Grafana. In this case, we're pulling the data from the Kubernetes metrics server. I've included the actual values on the right side of the legend just to make this easier to read. Also note that it's only about 40 minutes' worth of data, so it's not enough to make real decisions from, but it's good enough for example purposes. With Prometheus and Grafana, you'll set up some sort of dashboard to look at, okay, how many CPU cores are my jobs using? In this case, we can see all the jobs are using around four CPU cores, and that's because of the way I specified them. So you can reasonably assume that four CPU cores will be enough for all of your CI jobs and use that as a limit. You might want to go a little higher so resource constraints don't get jobs killed, but we can argue that four CPUs is good enough here. Then we come to memory. Similarly, I put the actual maximum values in the legend, and we can see each job is using somewhere between 500 megs and one gig of memory. Memory is generally cheaper than cores, so you can probably go with one and a half to two gigs of memory for each of these jobs and be safe. Arguably, you'd want more in a production environment, but this is just for demonstration purposes.
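For reference, the kind of dashboard queries behind those two graphs might look like the following, assuming Prometheus is scraping the usual cAdvisor/kubelet container metrics; metric and label names can differ by cluster version, and the gitlab-ci namespace is a placeholder:

```promql
# Per-pod CPU usage in cores (the first graph)
sum(rate(container_cpu_usage_seconds_total{namespace="gitlab-ci"}[5m])) by (pod)

# Per-pod memory working set (the second graph)
max(container_memory_working_set_bytes{namespace="gitlab-ci"}) by (pod)
```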
So, setting limits from the start. This is a difficult discussion, because part of DevOps is making data-driven decisions, and if you don't have data, how are you going to make decisions? Ideally you're migrating from something, say a Docker runner or a shell runner or something similar, where you have metrics available and can extrapolate what to assign to your Kubernetes runner. If you're brand new to GitLab and to GitLab CI entirely, then it's pretty much a shot in the dark. You can make some rough decisions based on how many builds you expect to be running at any given time and what your resource constraints are: how much memory and CPU do you have, and is anything else running on your Kubernetes cluster? If your cluster is also running some hosted application and you're adding your GitLab CI jobs to it, you don't have the full cluster available, so don't base your limits on the full cluster; base them on whatever is left over after those applications take their share. Ideally, for separation and for reasons like Mike's DoS, you'll have a separate Kubernetes cluster just for your CI jobs, but that isn't always the case. So expect to be wrong when you're making decisions without data. That makes sense, and you probably already knew it, but I do want to call it out: you're probably going to make mistakes when you decide on limits without any data available to you.

So what happens when a user hits their limits? Well, their job gets killed. That's not quite 100% of the time; Kubernetes does some magic under the covers where, if a job hits a limit just for a moment and comes back down, it won't necessarily be killed, but generally speaking, it gets killed. This can go one of two ways. The user goes, oh wait, I ran GCC with -j16 instead of -j8, I'll just go fix that and everything will work. That doesn't happen often; I think all the administrators and operations people in this room know that. What's more likely is that they ask for more resources, which I've definitely been on the receiving end of many times. The first thing you need to do is determine what their actual need is. If they just say, I need more CPU or I need more RAM, you need them to come back to you with concrete numbers. Without concrete numbers, A, you can't give them what they want, and B, they might come back and say, oh, you didn't give me more. Yes, I did; you just needed more than what I guessed. So get concrete numbers. And if you do not have the capacity to actually support them, don't. I know it's hard to say no, and many businesses don't like it when you say no, but if you can't support them, don't jeopardize your other users by trying to give them more resources than you physically have. Work with them and try to find common ground. Maybe order more hardware and say, hey, we can give you this once the hardware comes in and is installed. Try to find a way to work together because, generally speaking, we're all reasonable human beings.

So, labeling workloads. Labeling is not the most important thing. Labels, if you're familiar with Kubernetes, are key-value pairs for identifying workloads. In GitLab's case, in the runner's config.toml, they can be expanded from environment variables. For example, as you can see on screen, you could set a label called job to $CI_JOB_ID, and it will be expanded so that job is whatever the GitLab CI job ID is. Likewise, you could do source or project. These are some examples of useful labels.
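Roughly, that looks like the following in the runner's config.toml; pod_labels is the Kubernetes executor's mechanism for this, and the specific keys and variables here are just illustrative choices:

```toml
# Illustrative pod labels: keys are arbitrary, values are expanded per job.
[runners.kubernetes.pod_labels]
  job     = "$CI_JOB_ID"
  project = "$CI_PROJECT_NAME"      # careful: CI_PROJECT_PATH contains slashes, which labels don't allow
  source  = "$CI_PIPELINE_SOURCE"   # e.g. push, schedule, merge_request_event
```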
These labels can be used by admins when they're manually running kubectl commands against Kubernetes. They can also be used in reporting tools like Prometheus and Grafana for filtering, so you could say, I only want to look at jobs whose source is a merge request. They aren't the most important thing, but if you run into an issue, they can be very useful to have for debugging.

There are some limitations with labels in Kubernetes. These aren't limitations of GitLab but of Kubernetes itself. Not all characters are supported; for example, if you put a question mark in there, it's going to be a problem. There's also a maximum length. You're probably not going to hit it, but labels can only be up to 63 characters long. And if you do use an unsupported value, whether it's a bad character or an overly long label, your job will not start, and there won't be a very good indication to the user of why. There's a GitLab issue on this, and there's an open merge request I need to follow up on to get it fixed, so that GitLab will sanitize the values you provide instead of producing failed jobs that give users no good feedback about why they failed. I learned about these limitations by testing this in our dev environment, seeing it work perfectly, introducing it into our production environment, and finding out I hadn't done enough testing. A slash ended up in my label because I was putting the whole path, well, the group name, of the job into the pod labels, and I didn't test against subgroups. I didn't think about the fact that subgroups have a slash in their path, and suddenly anybody with a job in a subgroup just didn't start. I've put a link to the official documentation on Kubernetes label limitations at that bit.ly link if you'd like to check it out.

And I'm definitely running ahead of time, so I'll go over the summary and then answer any questions. Basically, mistakes happen and you need to be prepared for them. Mike made a really simple mistake, a logic error that could have happened to anybody. You need to be prepared for that through active monitoring and alerting. Ideally you'd have automated remediation, but manual remediation is probably what you'll end up with. Setting limits will annoy some users, but it does help keep the ecosystem healthy. You are going to have jobs that want to exceed those limits, and if you can't support that, if you don't have the resources, then you need to work with those users to either limit how much they're using or come up with a plan to get them where they need to be. But limits are only a starting point; you do need to revisit them from time to time. As projects grow, they need more resources to build. A project on day one is not the same as that project on day 100. That's obviously hyperbole in terms of complexity, but you get the idea: Hello World is not the same as GitLab, and they take rather different amounts of time to compile and test. When you're setting limits without any data, it's okay to be wrong.
You just need to be nimble and adjust as needed. Your users will come to you and say, hey, my jobs are failing. Okay, we go look into it and see, oh, they don't have enough CPU; they need, say, one more core. Okay, I can afford that, I'll give them one more core. Labeling workloads is powerful. It's not the most important thing in your environment, but when you're trying to debug something it can be extremely useful for exposing data, so it's better to have it and not need it than to need it and not have it, in my opinion. It's something I don't use every day, and I'll be honest about that, but when I have needed it for digging through logs it's been extremely useful, for me at least. So thank you. If anybody has any questions, I'm happy to answer them; I have about five minutes left. Go ahead and raise your hand and I'll bring the mic to you.

Do you have teams at F5 that manage their own runners? Yes, we do. We have around 200 individual team runners. For example, we don't have Mac hardware, and people build things for iOS and Mac; we don't have the physical hardware or the expertise for that. So some teams run Mac minis to build their code and register those as their own runners. Ideally everything would run through ours, but we understand that sometimes it's just not possible.

We have one over here. Sean, thanks for the talk. You're welcome. I was curious if you had any thoughts on preventative measures, rather than a reactive approach. Yeah, preventative measures would be great. We haven't come up with a way to do that efficiently yet, so we're open to ideas. For example, the Mike case is really hard to prevent, because you have to understand what the code is doing and the context around it. Ideally it would have been prevented via merge request, with his colleagues looking at it and saying, oh hey, this is an obvious loop. In this case it was a test project, he was the only one working on it, and it didn't go through the normal merge request pipeline, which would most likely have flagged it as something to fix. So ideally your users prevent stuff like that, but if someone is working in their own personal project, that merge request review isn't necessarily going to happen. I honestly don't have a good idea for automatically preventing that particular issue.

We do have some preventative measures in our GitLab runners, though. When you're building Docker images in Kubernetes, you generally need one of two things: either a privileged runner, which we don't allow, or some sort of VM that exposes the Docker socket for you to use, which is how we build them. You could also use another tool, but that's a talk for another day. We found that users would occasionally create Docker networks on those VMs as part of a test, the addresses of those VMs were getting injected automatically, and that was causing other jobs to fail. So what we did, in our case, was put a BIG-IP in front of it, though in theory any load balancer could do this, with special rules in place saying, for example, that the /networks endpoint of the Docker API is not allowed at all.
So we leveraged tooling to prevent users from causing issues for others in that particular case, but in a case like Mike's, I don't think there's an easy way to prevent it automatically, other than the manual review process their team would normally do. You're welcome. And hey, thank you, Sean. Let's give Sean another round of applause. Yep.