Okay, that's time, so let's get started. Hi, everyone, thank you for joining me. My name is Eddie Zaneski. Thank you for sticking around on the last day of talks. I know this is right after lunch, so hopefully we'll have some fun in here. Today we're going to be talking about the cluster killer bug, also known as learning API Priority and Fairness the hard way.

Quick bit about me. You can find me on the internet at eddiezane. I live in Denver, Colorado, and I like to climb mountains. I'm a co-chair and tech lead for Kubernetes SIG CLI, so I maintain kubectl, kustomize, and all the CLI things under that umbrella, and I'm a staff DevRel and open source engineer, and I need a new job, so hire me. Thank you.

So we're going to flash back to February 1st, 2023, at Cloud Native SecurityCon. Was anyone here at Cloud Native SecurityCon? No? That's okay. It was out in Seattle, the first one; the CNCF decided to split it off into its own event, and it was a good time. I was out with a bunch of coworkers doing some karaoke late at night, and I had a media interview with The New Stack the following day. We were going to go through some of the new stuff we were working on, some of the new controller stuff we were building, so I snuck out of karaoke to do some last-minute prep in my hotel room that night. And, like all good things, nothing was working. I had no idea what was going on. I thought it was the network. I thought it was my computer. My requests were timing out, none of my demos were working, and everything just seemed broken. So I did what anyone would normally do: I deleted the cluster and made a new one, everything worked, I was pretty happy with that, and I went to bed.

I woke up in the morning, tested it out again, and it was broken again. So here I am, probably two or three hours before I'm supposed to do an interview, and my demo isn't working. Thankfully we pushed the interview back to the afternoon; The New Stack is awesome, if you ever work with them. I did a bunch of triage, just trying to figure out what was wrong really quickly, and learned a lot about what to do in the moment when you have to get on stage and do a demo. Thankfully, we managed to get everything working. I still didn't quite know what was wrong, but I was able to do the demo after basically no sleep and a very stressful day.

I have a little clip of the beginning of this that I'll play for you; see if you can tell how frazzled I was during the interview. "There used to be this advertisement that was popular in the Northwest for Rainier beer, where they go 'Raaainier.' Does that ring a bell? No? Okay. But I want some now, so advertising works." Alex from The New Stack is great, if you've never read him before. Like I said, thankfully we were able to pull it off, but it was a long, stressful two days trying to figure out what was going on.

So I have some early takeaways from all of this, and then we'll get into the bug and everything that was going on. The first takeaway: I should not have deleted that cluster in the beginning. I should have put it on the back burner and left it around to introspect later, so I could compare and contrast. Don't be too quick to pull the trigger and blow it away. I know that's what you all want to do ("does it work if you turn it off and on again?"), but try not to. Start ruling out what you can as soon as possible.
So: going through the logs, going through the metrics, going through everything, I was trying to rule out what could actually be wrong with what I was seeing. Think out loud, whether that's with people around you or in Slack. Type out your thoughts, screenshots, pasted snippets of metrics and logs, and do it as you go. And glance at the logs and metrics. I say glance because, especially when you don't quite know what you're looking for and your logs aren't obvious, you can spend a lot of time reading log lines you've never seen before and convince yourself that every single one of them is the problem. So take a good glance, but don't get too caught up in the logs and metrics when you're first working through this. Your goal is to reproduce the issue you're running into; you don't really have to understand what's going on yet. Again, I'm coming at this from a frame of "I have about three hours before I need to go demo this thing," so I can't dig into everything. You want to figure out what causes the bug to happen so you can work around it and avoid it.

So what was I demoing? It was an admission controller that we built, built around the Sigstore Policy Controller. Is anyone familiar with Sigstore? It's a Linux Foundation project, part of the OpenSSF, for signing and verifying software artifacts. You can write supply chain security policies that say things like: I only want to run container images signed by these identities, or images that don't contain these known vulnerabilities or dependencies. So it was built around the Sigstore Policy Controller, which is open source. It opened a lot of long-lived connections to the API server; it did what's called a watch, if you're familiar, which is basically a very long-lived request that streams responses back from the Kubernetes API server. It was built with Knative Serving. Has anyone built a controller before, or used Knative? Building a controller from scratch is fun: you learn all the knobs, and then you reach for Knative and never think about the primitives again, because Knative does everything for you. So there's a bit of magic baked in there.

The very unique thing about this controller is that it could run what we called agentless. We were a security company, and when you have a security product, you generally need to run a sidecar or an agent on your customers' servers or clusters. Something we heard from a lot of customers was that they didn't want to run those sidecars or agents; they really didn't want to install yet another agent on their clusters. So we designed a way to run our controller agentless.

Stepping back a bit, what does that controller actually look like? We have the Kubernetes API server, which is the heart of all things, and we have the controller, which in our case was a Go binary, but it could be built with anything. There are a few components that make up a controller. You have your client that talks to the API server. You have your work queue, which listens for the events coming in and puts them on a queue for the controller to process and reconcile. And you have the informers and lister caches.
The informers are what open those long-lived connections to the API server, listening for all the pod events, node events, and ConfigMaps you're creating, and then caching them. When you do a list operation against the Kubernetes API server, it has to go to etcd and query all the values out, and a list is probably one of the most expensive requests you can make etcd serve: you get back a ton of information. etcd is a key-value store, so looking up a single key is a nice O(1) operation, but a list has to return a huge set. That's why the informer does caching. And then everything runs in a reconciler loop: take something off the queue, do whatever work needs to be done, maybe monitor a new thing or enforce a new policy, and pop it off.

The way we made this agentless was that we built a proxy that sat in front of everything, and we registered an off-cluster mutating admission webhook (there's a rough sketch of what that kind of registration looks like below). If you're not familiar with mutating admission webhooks, or admission controllers in general: you register one with your cluster, the API server sends admission requests to it, and you respond with whether or not something should be admitted. So if you say "I never want a pod named Dave to run on my cluster," you can write an admission controller for that: you get a request when a new pod comes in, you check the name, and if it's Dave you reject it. We ran this controller in one of our managed tenant clusters, that's the "some other cluster" bit on the diagram, and by having that proxy, with the API server reaching out and sending all those admission requests to us through a tunnel, we were able to run our agent off your cluster. It's very novel, and it worked pretty well.

So what we saw when this bug was happening was that the cluster appeared just super dead. All the requests were timing out. We weren't able to pull any of the metrics. We weren't able to pull any of the logs. None of the pods were responding to requests. Everything just appeared dead. Client requests were timing out: if you made a kubectl request, or any other client request, it would just time out, hit the one-minute context deadline, and pop you out with a 504. The GKE dashboard was dead, and this was the really interesting one. Have you seen the GKE dashboard before? It can show you your nodes and your storage and your pods. When we loaded it up (I have a slide for it in a bit), it was just dead: no data available, throwing some errors. After some debugging and triaging, we realized this happened after four to six hours of our controller being installed on your cluster; four to six hours later is when we'd start seeing the spikes and the traffic dropping. We named this the cluster killer bug because, as far as we could tell, it would brick a Kubernetes cluster. It only visibly affected Kubernetes 1.25, which was the unique piece. It only happened on GKE; we did a bit of AWS testing too, and the whole agentless thing did some funny IAM impersonation to make the off-cluster traffic and proxying work, but we could only reproduce this on GKE. And it only affected agentless installs, where the agent ran off the cluster as opposed to on it.
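For context, here's a minimal sketch of what an off-cluster admission webhook registration can look like. The names, URL, and CA bundle are made up for illustration, and this is not our actual manifest; the point is that clientConfig.url points at an endpoint outside the cluster instead of at an in-cluster Service:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-agentless-webhook      # hypothetical name
webhooks:
  - name: policy.example.com           # hypothetical webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore              # fail open so the webhook can't block the cluster
    clientConfig:
      # Off-cluster: a URL outside the cluster instead of a service reference
      url: https://webhook.example.com/mutate
      caBundle: <base64-encoded CA>    # placeholder
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
```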
And the only solution to the bug seemed to be deleting the cluster, recreating it, and re-enrolling it. When you're a services company and a security company, that is not an acceptable solution for the customers whose clusters you bricked. We can't be bricking customer clusters. So this became a pretty top priority to figure out. Thankfully, nobody was running agentless on GKE at the time, so we had a little bit of headroom; our agentless users were on AWS, so it was fine for the moment.

This is what it looked like when the GKE dashboard died. You can see it there: we can't make any requests. Kind of strange. Looking through the logs, again, when you're looking at logs you're not familiar with and haven't dug into before, every single log line you see could be what's wrong. This one was something going on with the lister and watcher cache of the Sigstore Policy Controller: a timeout, then an allowed request, then reinitializing. So we spent a bunch of time digging into whether it was the policy controller and what was going on with the Knative listers. Then you wind up with log lines like this one, which is just a full-on stack trace. That's great to have when you're debugging, but not something you want to see returned in a response; we try not to leak state like that. I'm not even going to try to explain what's in there. It's just a full stack trace from the API server.

Looking at the metrics, the only notable thing was the API server memory. It would spike and drop, spike and drop. This is over a period of, what is that, 27 days in April for one cluster. It looks relatively normal: some spiky bits, some lows. But it gave us a good place to start: why is the API server OOMing itself? You naturally assume those dips are the API server being OOM-killed or restarted.

Our first theory: we dug into the changelog between 1.25 and 1.26 because, again, we couldn't reproduce this on 1.26. We found a change that had come through, along the lines of "admission controllers can cause unnecessary significant load on the API server." We spent a ton of time trying to understand that change, really couldn't see how it would impact what we were doing, and ruled it out.

Some lessons from all of that. Have a baseline of what your logs and metrics look like. Obviously, if you're running a demo cluster you're not going to have this, but like I said, keep those clusters around. When I went in to debug those logs and saw random stack traces and random reconciler errors, I had not worked on this product before, so I had no idea whether they were normal or not. It's good to have a baseline to compare against, and being able to compare historically with your metrics helps a ton. Slow down; try not to rush. Like I said, we were racing the clock here. Thankfully no customers would encounter this at the time, but we were a small startup growing quickly, and who knew who would want to use it. Take your time, slow down, and think it through. And get creative and experiment. This is ultimately what led us to figure out what was going on. Throw shit at the wall and see what sticks. Try things that sound crazy. This next bit is going to be kind of small on the slide, but the trying-crazy piece is how we actually got a lead.
I don't think I can zoom in much here, but what you can't see on the top line is that it's an HTTP GET to /api/foobar. This was me just randomly throwing things from kubectl at the API server with a raw query, and we saw this log line show up in the API server. It was different from all the other timeouts and log lines we were seeing: it was a 404 Not Found. Compared to every other response we were getting while the cluster was seemingly bricked, the fact that we could get a response like this was a lead. We knew the API server was listening, and it was responding, because it was able to tell us the resource was not found.

Then we got even crazier. We wound up SSHing into the nodes, set up some certs, and were able to talk to the API server directly, with no other cluster traffic in the way. The nodes were perfectly healthy. That was one of the other strange things: in all the metrics, everything looked healthy aside from that bit of memory spiking. Request times, processing, CPU utilization, nothing seemed out of the ordinary. We were just in a weird magic black-box state. So when we could talk to the API server directly and get instant responses, that gave us a good lead.

From there we were able to dump all of the metrics. If you haven't done this with kubectl before, you can do kubectl get --raw to make raw API requests, and it includes your auth headers. So we could make requests to /metrics and /logs and get the actual metrics and logs we couldn't get from the GKE dashboard and the log exporter (there's an example of those raw requests below). Once we could pull those, we found a few more breadcrumbs. The one that really stuck out was this "too many requests" line. From what we were seeing, we weren't getting responses from the API server, or they were timing out, and then to see "too many requests": well, what was that? What is telling us there are too many requests? Were there too many retries? The API server didn't look flooded, so what was going on?

To take a step back, I want to talk briefly about API server tuning. This is something most people will probably never have to touch, or have access to. The two flags you would set on the API server are --max-requests-inflight and --max-mutating-requests-inflight, and this is how the API server caps the number of inbound requests it handles at a time. These are hidden from you if you're using a managed Kubernetes service; you usually don't even get access to the logs. Thankfully, on GKE you can see these flags printed at the top of your API server logs: if you do a kubectl get --raw against the API server's /logs endpoint, you can actually pull them. So here are all the flags my GKE cluster's API server starts up with. I don't know if many other providers do this, but I wish they did; it saves a ton of time. And if we look in here, we can see that --max-requests-inflight is set to 60 and --max-mutating-requests-inflight is set to zero, where zero means unlimited. So mutating requests should technically be unlimited, and the max requests in flight is 60. That's pretty decent, especially given the round-trip response time.
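As a rough illustration, the kind of raw requests we were making looked like this. The /logs handler has to be enabled on the API server (it is on GKE), the exact log file name is provider-specific, and the grep is just pulling out the APF metric family we'll care about later:

```sh
# Raw requests go straight to the API server and reuse your kubeconfig auth
kubectl get --raw /metrics | grep apiserver_flowcontrol | head

# List the log files the API server exposes, then fetch one
kubectl get --raw /logs
kubectl get --raw /logs/kube-apiserver.log | head
```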
Anyway, these are the two dials you could tune if you had access to the server flags, but you can't. So, out of that, we built a feature into Kubernetes called API Priority and Fairness. Is anyone familiar with this? A few people? Cool. Quick summary from the docs: the API Priority and Fairness feature is an alternative to those two flags I mentioned, and it improves on those max-inflight limitations. APF classifies and isolates requests in a more fine-grained way. It also introduces a limited amount of queuing, so that no requests are rejected during brief bursts. Without APF enabled, overall concurrency in the API server is limited by those two flags. With APF enabled, the concurrency limits defined by those flags are summed, and the sum is divided up among a configurable set of priority levels. Each incoming request is assigned to a single priority level, and each priority level only dispatches as many concurrent requests as its particular limit allows. So when we came across this while reading about API server tuning and rate limiting in the cluster, it obviously sounded like what we were seeing, right?

Quick background on APF. It was a beta feature released in Kubernetes 1.20. It's still in beta, currently at v1beta3; I don't know when it's planned for GA. It has been enabled by default since 1.20, so your clusters will have it unless they've opted out, and I don't think any of the cloud providers opt out, so you most likely have this feature enabled. It's how you control traffic in an overloaded situation: it's intended to let you prioritize different types of traffic so that when the API server is backed up, or you get a massive burst of requests to your cluster, the important requests still get through. Think of your metrics and health endpoints: you always want those to be available, even when the API server is bogged down trying to list a thousand ConfigMaps or something. And what I've really taken away from APF is that it doesn't matter until it does. It's one of those features that probably 90% of people who use Kubernetes will never have to think about or touch, until you run into it and it becomes an issue. And, spoiler, if you couldn't tell from the title, this is what we were running into.

There are two resources in the flow control API for API Priority and Fairness. The FlowSchema defines the who and the what of your cluster traffic, very similar to how RBAC does it. You have your subjects, which are users, service accounts, or groups, and then resources and non-resources: your group-version-resources, like pods and deployments, and your non-resource paths, like /metrics. It has a matching precedence built in, which is how certain policies can be elevated for certain requests, and there are some built-in ones we'll look at in a second. The FlowSchema then maps to a PriorityLevelConfiguration. Here's an example of what that looks like (there's a sketch of one just below): it matches everything from the service account named demo, every type of request, non-resource URLs and resource URLs, and it maps all of those requests, with a matching precedence of 1000, to the priority level configuration named demo. You can see that mapping there.
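A minimal sketch of that kind of FlowSchema, with the names and namespace assumed for illustration (the real one from the slide may differ):

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: demo
spec:
  matchingPrecedence: 1000
  # Route matching traffic to the priority level configuration named "demo"
  priorityLevelConfiguration:
    name: demo
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: demo
            namespace: default        # assumed namespace
      resourceRules:
        - apiGroups: ["*"]
          resources: ["*"]
          verbs: ["*"]
          namespaces: ["*"]
          clusterScope: true
      nonResourceRules:
        - nonResourceURLs: ["*"]
          verbs: ["*"]
```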
The PriorityLevelConfiguration is the other half, and it has two types you can configure at the top level: limited or exempt. So you can exempt traffic from limiting entirely, saying none of this ever gets rate limited, it always has priority. And for limited levels, when the concurrency limit is hit, you choose between queuing those requests or rejecting them. Those are the two limit responses, and these are the knobs you turn for your tuning. Here's what one of those policies looks like: you can see the limited type, the queuing configuration, and all the knobs we're about to talk about (there's a sketch of one after this).

The queuing configuration has four main knobs to tune. This is the v1beta2 version; v1beta3 made a few changes, which we'll talk about in a bit, but this one is a bit easier to understand. Assured concurrency shares: these are roughly your seats for concurrent connections, the number of concurrent requests allowed to be executing. I say "seats" loosely here, and we'll come back to that in a second. Hand size: the way this works is with a fair queuing algorithm, and there are lots of white papers and Wikipedia pages that I tried to understand. Essentially, hand size means that out of all of your queues, each flow gets dealt a hand: if your hand size is four, it picks four queues for that flow and then chooses among those. This is meant to keep particular flows from starving out particular queues. Queue length limit is how big each queue is, so how many requests can sit in a given queue, and queues is the number of queues you have.

When you increase queues, you reduce the rate of collisions between different flows, at the cost of increased memory usage. A value of one effectively disables the fair queuing logic but still allows requests to be queued. Increasing the queue length limit allows larger bursts of traffic to be absorbed, so you hopefully don't drop any requests, at the cost of higher latency while those requests sit in the queue and more memory to hold them there. And changing the hand size adjusts the probability of collisions, since each flow gets a different number of queues to choose from.

There are several built-in priority levels, but two are worth talking about. The workload-low priority level is for requests from any other service account, which typically includes all requests from controllers running in pods. The global-default priority level is basically all other traffic from your clients: if you run a kubectl command from your laptop, it hits global-default. That one is for authenticated users; there's also a catch-all bucket, which is mostly for unauthenticated requests. So these are the two you really work with, and most of the traffic that runs inside your cluster, your controllers, goes to the workload-low queue.

There are a few special cases to mention. I said "assured concurrency shares," but a list request is a very expensive request for the API server and etcd because, again, it has to walk over all those keys. The way lists work, APF estimates how many keys are going to be returned and uses that to figure out how many seats a single list request will take. So seats generally don't map one-to-one to requests.
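To make the knobs concrete, here's a minimal sketch of a limited, queuing priority level in the v1beta2 shape described above; the name and values are made up for illustration, not defaults:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: PriorityLevelConfiguration
metadata:
  name: demo
spec:
  type: Limited                      # the other option is Exempt
  limited:
    assuredConcurrencyShares: 30     # "seats" for this level
    limitResponse:
      type: Queue                    # the other option is Reject
      queuing:
        queues: 64                   # number of queues
        handSize: 6                  # queues dealt to each flow
        queueLengthLimit: 50         # requests allowed to wait per queue
```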
And then there are the watches. Watches are those long-lived requests to the API server that listen for events coming back. A watch generally takes up a few seats when it first starts, and then it's considered to not be occupying seats while it sits and waits for events to come through. There are also special groups, like system:masters, whose traffic always gets top priority.

This next one is a policy from the docs. It isn't included by default, but it's one they recommend everyone have, and maybe it should be installed by default, but we could discuss that. It exempts all of your health checks, readiness checks, and liveness checks so they're always allowed from system:unauthenticated users. This covers the scraping of your metrics and your health checking. It's a policy you probably want in your cluster if you don't have it already; it's not there by default, and it exempts all of that traffic. It does open up the possibility of getting DDoSed or hammered on those endpoints, but again, if something goes wrong, you do want those paths prioritized. (The example in the Kubernetes docs is called "health-for-strangers.")

Here's a quick look at what the default levels look like. You can see the catch-all and global-default ones I mentioned, the assured concurrency shares, all the queues and so on, so you get a sense of the scale of what we're talking about and what these look like by default.

The way the concurrency limit is calculated in a cluster: you sum up the concurrency shares, which in our example comes to 245, and you take the two flags from before, with the default values I'm using here of 400 and 200. Then you pick a priority level, say workload-low, which has 100 concurrency shares, and do this equation: add the two flag values, divide by the total ACS for the cluster, and multiply by the ACS for the priority level. That's how many concurrent requests that particular priority level is allowed in your cluster: (400 + 200) / 245 × 100, which is about 244. And this is much more visible and configurable for you as a user.

So, for example, what happens when you change these values? If we add another priority level with an ACS of 55, that brings the total ACS to 300. Do the math again: workload-low drops from about 244 to 200 concurrent requests, and the new level gets (400 + 200) / 300 × 55 = 110. So workload-low lost about 44 concurrent requests. Adding more priority levels lets you prioritize traffic into different levels, but it pulls every existing level's share of the overall capacity down, which is something to consider. In 1.26 this changes again and gets a lot better, as we'll see in a bit.

So I have a demo to show you. I wrote a quick program, just a blaster Go program: it pulls in client-go, disables all the client-side rate limiting, fires up 100 workers, and makes a ton of infinite list requests (a rough sketch of it is below). We're going to fire this at my cluster. First I want to show you the APF policy we have configured right now: we have 30 shares, the queue sizes, and so on, and it's just targeting my demo user. This is the one you saw before. So that's what's set up. If we fire this off and open up our watch here... and the internet works, cool. What you're looking for in this output is this number going up right here, and I think the internet is fighting me, but you can see it climbing.
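For reference, here's a minimal sketch of what that kind of blaster program can look like, assuming client-go; the kubeconfig handling and the resource being listed are illustrative, not the exact program from the demo:

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := flag.String("kubeconfig", filepath.Join(homedir.HomeDir(), ".kube", "config"), "path to kubeconfig")
	workers := flag.Int("workers", 100, "number of concurrent workers")
	flag.Parse()

	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	// Disable client-side rate limiting so all throttling is left to the API server (APF).
	config.QPS = -1
	config.Burst = -1

	client := kubernetes.NewForConfigOrDie(config)

	for i := 0; i < *workers; i++ {
		go func(id int) {
			for {
				// Expensive list requests across all namespaces, forever.
				_, err := client.CoreV1().ConfigMaps("").List(context.TODO(), metav1.ListOptions{})
				if err != nil {
					fmt.Printf("worker %d: %v\n", id, err)
				}
			}
		}(i)
	}
	select {} // block forever while the workers hammer the API server
}
```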
Back to the watch output: this is the total number of requests being made to my cluster for this particular flow, and you can see that number climbing. Then we have the number of requests currently executing up here, there are seven, and the number of requests in queue, which is 64 right now. This is just refreshing as we go. So currently we're actually handling the load: I'm not getting any errors reported, my queuing is working out, and this is finely tuned for this application.

But now, if we start to tweak those numbers and say, let's just make everything one... you can see we immediately drop. That's one of the cool things: this is all applied and reconfigurable on the fly. We love Kubernetes. But you can see now that my requests are being rejected, and the rejected counter here with the reason queue-full is starting to climb rapidly. If we pop back to our sample app, we can see we're getting 429s: the server has received too many requests and asks us to try again later. So this is APF working as intended: my policy is rate limiting, and I'm getting responses back on my client. Cool. We could keep playing with that and tuning it, but you get the idea.

So what was the original bug? Off-cluster traffic was getting bucketed into global-default, right? That's exactly what global-default is for. Our controller's traffic should have been in the workload-low bucket, as we saw. What was happening was that all of our requests were backing up, all of our retries were getting stuck, and we filled up the global-default bucket. So when we made requests from kubectl, and when we went to look at our external monitoring, everything seemed dead, because those requests were just sitting in the queue and getting kicked out.

The fix, then, is to create APF resources. Pretty straightforward. Again, not something you need to worry or know about until you do. So when would you actually want to create these resources? Definitely when you're being clever and running things off-cluster. But a good everyday example is a controller that constantly makes list requests to your API server: you might want to bucket or queue those so they don't starve everything else, and maybe you only want one of them executing at any given time to list your thousands of ConfigMaps or whatever you have. So there is a reason to know about this: to tune for things you're seeing in your metrics. When you're looking through your Grafana and your logs and you notice things backing up or failing, that's when you'd apply a policy to even that traffic out a bit. And again, you can do this by service account, user, or namespace.

A couple of other questions. Why did we not see this on 1.26? That's because 1.26 introduced a new feature for APF called borrowing. Borrowing added a few new fields, borrowingLimitPercent and lendablePercent, which allow priority levels to borrow and lend concurrency from other levels whose buckets aren't full or whose queues are empty. You specify a percentage on these values to control how much can be borrowed or lent. So again, on 1.26 our bucket was getting full, but because it could borrow from all the other buckets, we never actually saw this behavior, which is rad. It also renames assured concurrency shares (ACS) to nominal concurrency shares (NCS). A sketch of those fields is below.
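For reference, a minimal sketch of what those fields look like on a v1beta3 priority level; the name and percentages here are made up for illustration:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: demo
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 30   # renamed from assuredConcurrencyShares
    lendablePercent: 50            # portion of this level's seats other levels may borrow
    borrowingLimitPercent: 200     # cap on how much this level may borrow, relative to its nominal limit
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        handSize: 6
        queueLengthLimit: 50
```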
There's a somewhat complicated formula explained in the docs, but the TLDR is that bigger numbers mean a larger nominal concurrency limit, at the expense of every other level's limit being pushed down.

Last question: why did the GKE dashboard die? This was a fun one. I don't know if you can see it, but the GKE dashboard was landing its requests in the global-default bucket. I think this is fixed now, but that's why, when we loaded the GKE dashboard, it looked dead and led us to believe the cluster was dead: its requests were getting bucketed with all of the other off-cluster traffic. Again, I'm pretty sure they've fixed this now, which is good. There was also a fun side quest where our API token didn't have the email scope, so we were getting back Google GAIA IDs, which are internal Google account IDs. Not a problem exactly, but when we were trying to write a policy, we ran into numbers instead of the service account emails we were looking for. So if you ever run into that, always include the cloud scope and the email scope in your Google IAM credentials.

A couple of minutes left, so some more lessons learned. Being clever means you need to do more research. This was something we didn't know about before we ran into it, so yes, you have to spend time to figure out and understand things. We need new people to test these features and give feedback. In preparing this talk I was going to do the demo on a 1.28 cluster, but I actually found two new bugs with APF that will get filed and fixed, which is cool. So if you haven't played with these knobs, or any knobs in general, go out there. We as maintainers don't really have a direct line to the community other than the people who show up and report bugs. So when you read those blog posts or release notes and they say, hey, we need people to test these features, we really do, so please help us out. Read the docs and release notes to understand what's changing between versions. And get involved upstream; there are lots of us working on fun, cool stuff like this, so if you want to know where to get started, feel free to hit me up.

This isn't actually solved, by the way. There's still an open bug we need to figure out: we were not getting back 429s like we should have, we were getting back 504 timeouts. So I'm pretty sure the switch to borrowing between buckets in 1.26 is simply hiding a bug I ran into on 1.25. Hopefully no one else runs into it, but we are going to dig into it at some point. There's still a bug out there.

A couple of shout-outs to the people who helped me solve and fix this: Jordan, Mike, and Billy, they're all awesome. And I am out of time, right on the dot. I'll hang around outside to answer any questions. Please scan the QR code and give feedback. And yeah, thanks for coming and listening to me ramble. This was fun.