So, hello, everyone. Welcome to my talk, where I'll be covering some recent developments in Kubernetes and within GKE for overload protection and prevention of various denial-of-service scenarios that we've been working on.

First, a little bit about myself. I work on GKE, and I joined Google about two years ago. Prior to working at Google, I had been working on Kubernetes for about five to six years at this point. Over the years I've worked on many areas of the code base, but more recently I've been working on some areas of the kube-apiserver, specifically around API Priority and Fairness, and on operationalizing that subsystem within GKE. That will be the focus for most of the talk.

But before we go there, it's worth mentioning the different types of denial of service that a Kubernetes cluster can experience, and I'm sure there have been tons of talks about how you can break your cluster. Many times when we think about denial-of-service attacks, we think about bots flooding your network with millions of packets per second, or security vulnerabilities being exploited at scale. And don't get me wrong, these types of attacks do happen, but luckily for myself and the other engineers who work on GKE, we leverage a lot of Google's common infrastructure and services. So out of the box we get pretty good protection against those types of attacks at a really large scale; it turns out Google infrastructure is fairly good at this kind of thing. Many of these denial-of-service attacks don't end up being my problem: they're handled at a more common infrastructure layer, and there are really smart teams of engineers who work on these things for our networks and servers at Google. There are tons of recent blog posts on these topics, the most recent one around HTTP/2 Rapid Reset. That was a pretty cool blog post, which I recommend you check out if you haven't.

So from GKE's perspective, the type of denial of service we mainly need to be concerned with is denial of service at the application layer, the application in this case being Kubernetes, or specifically the kube-apiserver. If we think about the possible types of denial of service the kube-apiserver can experience, we can very broadly break them down into two categories. The first is a bad actor getting access to the kube-apiserver and exploiting known bugs or performance characteristics of the system to try to take down the API server. You can argue this is very similar to a security exploit: if such a bug were known, we would probably treat it like a CVE and go through the standard process to quickly roll out fixes. So I'm not going to touch on that topic too much. The second is a user unknowingly causing a denial-of-service scenario in their own cluster, and for this talk I'll mostly be talking about those kinds of scenarios. If you think about it, that's actually not that surprising: especially if you have really large GKE clusters shared across many engineering teams, it's not uncommon to see these types of scenarios happen. So let's go through one example that's not uncommon for GKE.
So, let's say you have a 5,000-node cluster in your organization, shared across many engineering teams. One team, maybe the monitoring team, wants to deploy a new DaemonSet. For starters, this DaemonSet is going to run on every single node in your 5,000-node cluster, so you're basically running 5,000 replicas of this thing. Let's say the purpose of this DaemonSet is monitoring: the first thing each replica does is find every pod running locally on the same node, then it uses each pod's IP and status to scrape a metrics endpoint, and it pushes the results to a centralized place for your dashboards and so on.

Now let's say this DaemonSet doesn't follow the best practices for Kubernetes scalability. It doesn't use the list-watch pattern to fetch objects, it doesn't use informers; instead it has a loop that periodically lists the objects it cares about. Even worse, it doesn't specify a resourceVersion, so every time the API server sees this request it has to fetch all the objects from etcd to return the latest version. And as the final nail in the coffin, it uses a field selector to filter for pods on the same node, which still requires the API server to fetch all pods from etcd, because etcd doesn't support filtering objects by specific fields. You can imagine how this type of scenario can really quickly snowball into a full outage, especially when 5,000 of these clients get deployed at once. I make this sound like a hypothetical scenario, but it actually happens quite frequently; even for well-known third-party packages or add-ons provided by vendors, it's not uncommon to see these types of clients. So the goal for Kubernetes is to make it resilient enough that in these types of scenarios it can automatically protect itself and preserve enough capacity to keep the core system components running.
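To make the contrast concrete, here's a rough sketch of what a well-behaved version of that DaemonSet client could look like, using an informer scoped to the local node instead of a periodic, unscoped list. This is my own illustrative sketch, not code from any particular add-on; the NODE_NAME environment variable is assumed to be injected via the downward API.

```go
package main

import (
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Only list/watch pods scheduled to this node. NODE_NAME is assumed to be
	// injected into the DaemonSet pod via the downward API.
	nodeName := os.Getenv("NODE_NAME")
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute, // periodic resync of the local cache, not a re-list loop
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = "spec.nodeName=" + nodeName
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* start scraping this pod's metrics endpoint */ },
		UpdateFunc: func(_, newObj interface{}) { /* refresh the scrape target */ },
		DeleteFunc: func(obj interface{}) { /* stop scraping */ },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep serving from the local cache and incoming watch events
}
```

The key differences from the problematic client above: the list happens once at startup and everything afterwards arrives as incremental watch events, and the field selector keeps the stream down to the pods this replica actually cares about, instead of a fresh quorum read of every pod in the cluster on every poll.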
So if we take a step back and look at the history of Kubernetes, the very first mechanism we introduced was basically two flags on the API server, --max-requests-inflight and --max-mutating-requests-inflight. These two flags provide limits on the maximum number of read and write requests that can be processed at any given time. In GKE, the first thing we had to do was operationalize these flags across our fleet. The tricky thing is that you can't just pick one value and have it work for the whole fleet; we have to tune the value so that it's appropriate for the capacity a cluster has. If you pick a value that's too low, you risk preemptively throttling your clients excessively and you end up with unused capacity in your cluster. If you set the value too high, you risk diluting the protection and completely overloading your control plane. So what we started with initially was a very simple implementation.

These are made-up numbers and the exact numbers aren't really that important, but we basically used a linear function: for every CPU core we give the control plane, we add 10 max in-flight requests. This approach actually got us pretty far and worked fairly well, but it started to show some limitations. The obvious big one is that there's no concept of prioritization, so you can still get complete overload situations caused by just a handful of clients, or in extreme cases even one client can use up all of your max in-flight requests, brick your cluster, and take it down. This was obviously a problem; we want the overall availability of the system to be a bit more robust. And this is where APF, API Priority and Fairness, comes into play.

Really quickly, before I go into the details, I do want to call out some of the folks who have been driving this effort for multiple releases. In 1.29, which is the upcoming release, we're actually planning to promote the API Priority and Fairness feature to GA, so that will be a huge milestone for the team; these folks have been working on it for almost 12 releases now. My main contribution in this area is mostly figuring out how to operationalize it in GKE and throwing some bugs over the wall to them, so they did the hard work, really.

In the very initial implementation of APF, we basically had a system to define priority levels, and within each priority level the API server manages queues to determine how much available capacity is left and whether it should accept or drop new requests. So now we have some guarantees in high-load scenarios: when the API server gets overloaded, we can be confident that some capacity is reserved for the scheduler and the other system components. The way we achieved this was by introducing a new API group, flowcontrol.apiserver.k8s.io, with two APIs: PriorityLevelConfiguration and FlowSchema. A really quick recap for those who aren't familiar: a PriorityLevelConfiguration defines limits on the number of outstanding requests and limits on the number of queued requests, and a FlowSchema is how you classify requests into those priority levels.
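For reference, here's a minimal sketch of what a PriorityLevelConfiguration object looks like, written against the v1beta3 version of the API (the version current around this release); the name and numbers are made up for illustration, not a GKE default.

```go
package main

import (
	"fmt"

	flowcontrolv1beta3 "k8s.io/api/flowcontrol/v1beta3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical priority level for one team's controllers.
var examplePriorityLevel = flowcontrolv1beta3.PriorityLevelConfiguration{
	ObjectMeta: metav1.ObjectMeta{Name: "example-team"},
	Spec: flowcontrolv1beta3.PriorityLevelConfigurationSpec{
		Type: flowcontrolv1beta3.PriorityLevelEnablementLimited,
		Limited: &flowcontrolv1beta3.LimitedPriorityLevelConfiguration{
			// Relative share of the server's total in-flight limit that this
			// level is entitled to.
			NominalConcurrencyShares: 20,
			LimitResponse: flowcontrolv1beta3.LimitResponse{
				// Queue excess requests rather than rejecting them outright.
				Type: flowcontrolv1beta3.LimitResponseTypeQueue,
				Queuing: &flowcontrolv1beta3.QueuingConfiguration{
					Queues:           64, // number of shuffle-sharded queues
					HandSize:         6,
					QueueLengthLimit: 50, // queued requests allowed per queue
				},
			},
		},
	},
}

func main() {
	fmt.Println(examplePriorityLevel.Name)
}
```

Each level's nominalConcurrencyShares determines its slice of the total max in-flight requests, and FlowSchema objects (there's an example later when we get to classification) reference a priority level by name.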
With the introduction of APF, the first immediate problem we needed to address was that each request was treated as taking one seat in the APF queues; that was the case when the feature graduated to beta in 1.20. The problem is that in reality, the amount of work the API server has to do to process these requests varies significantly. So it was important to have some way to associate a more accurate cost with each request, and apply that to the number of seats it takes up in the queues. Over many releases we introduced heuristics to estimate the amount of work involved in a request. This was actually not that hard to implement, because the basic parameters are all available in the request metadata; the hard part is figuring out the right numbers.

For a single-object get, the cost is still one seat, which makes sense: in the worst case we fetch the single object from etcd, serialize it, and return it to the client. A list request, on the other hand, could potentially list all objects of a resource in the cluster, so its cost has to be some function of the number of objects we might list. In addition, we double the cost if we list from storage instead of from the watch cache, and we can determine that from the resourceVersion of the request. So the formula for estimating a list request is n divided by 100, times two if it will be served from storage (that is, the resourceVersion is not "0"), where n is the number of objects of that resource in storage.

Watch is similar but slightly different: if the watch request specifies sendInitialEvents, we treat it like a list from the cache, because that's effectively what it's doing; otherwise we treat it as costing one seat. Mutating requests, which include create, update, and delete, are actually the most unintuitive ones, because the cost of processing a single mutating request is one, but we have to factor in the watch events that will be propagated because of that one write. The more watches there are on that object, the more clients you'll send watch events to when you mutate it, so you have to figure out how to put a cost on that fan-out. The formula we used for the watch fan-out was w divided by 25, where w is the number of watchers for that object, on top of the one seat for the write itself.

So how does the API server actually know the values of n and w? Nothing really fancy: we run a small controller inside the API server that periodically checks the number of objects per resource in etcd, and keeps track of the number of watchers watching specific types of resources and namespaces.
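To make the heuristics concrete, here's a rough sketch, in Go, of the cost model as I've described it. The constants mirror the numbers from the talk; the real estimator inside kube-apiserver has more special cases (selectors, pagination, rounding) and differs in its details.

```go
package main

import "fmt"

// Illustrative constants, matching the numbers quoted in the talk.
const (
	minimumSeats              = 1
	maximumSeats              = 10    // per-request seat cap
	objectsPerSeat            = 100.0 // list: roughly one seat per 100 objects
	watchersPerSeat           = 25.0  // mutating: one extra seat per ~25 watchers
	listFromStorageMultiplier = 2.0   // lists served from etcd cost double
)

func clamp(seats float64) int {
	if seats < minimumSeats {
		return minimumSeats
	}
	if seats > maximumSeats {
		return maximumSeats
	}
	return int(seats) // fractions truncated here for simplicity
}

// seatsForGet: a single-object GET always costs one seat.
func seatsForGet() int { return minimumSeats }

// seatsForList: cost grows with the number of objects of that resource (n),
// doubled when the list is served from etcd instead of the watch cache.
func seatsForList(n int, fromStorage bool) int {
	seats := float64(n) / objectsPerSeat
	if fromStorage {
		seats *= listFromStorageMultiplier
	}
	return clamp(seats)
}

// seatsForWatch: a watch with sendInitialEvents behaves like a list from the
// cache; otherwise it occupies a single seat.
func seatsForWatch(n int, sendInitialEvents bool) int {
	if sendInitialEvents {
		return seatsForList(n, false)
	}
	return minimumSeats
}

// seatsForMutating: one seat for the write itself, plus an estimate for the
// watch events fanned out to the w watchers of that resource.
func seatsForMutating(w int) int {
	return clamp(1 + float64(w)/watchersPerSeat)
}

func main() {
	fmt.Println(seatsForGet())            // 1
	fmt.Println(seatsForList(600, false)) // 6  (600 objects, served from cache)
	fmt.Println(seatsForList(600, true))  // 10 (12, capped at maximumSeats)
	fmt.Println(seatsForMutating(500))    // 10 (1 + 20, capped at maximumSeats)
}
```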
Now, in practice these heuristics aren't perfect, but they're a pretty significant improvement over giving every request the same cost, and they do have some known limitations. Starting with list requests: one limitation is that we don't factor in namespaces. The object tracker tracks the total object count on a per-resource basis, so if you have 1,000 ConfigMaps, with 900 in one namespace and the other 100 spread across other namespaces, and you list ConfigMaps in a namespace with five objects, we're still going to treat the count as 1,000. Obviously that can be problematic, but in practice it's usually not a big deal: listing five ConfigMaps is processed so quickly that even with a high associated cost it usually ends up being negligible, though there are some cases where it does cause problems. Another limitation is that we don't factor in the size of the objects, even though size determines how much work is involved in serialization; that's something we currently don't account for at all. For mutating requests, the biggest limitation of the watch tracker is that it only tracks watchers that are local to the same API server; it doesn't communicate with the other API servers about the total number of watchers, so it's not a completely accurate measurement of the watch events that will actually be propagated for a mutating request. The other limitation is that not all watch events involve equal work, and we don't account for that either. Some watch events might trigger significant work in your system, whereas others might just result in a log line; we treat all watch events the same.

So let's go through some examples. Say we have a request to list ConfigMaps in the default namespace, the cluster has 600 ConfigMaps, and the request specifies resourceVersion "0", which means it's served from the cache. Earlier I mentioned we limit a single request to a maximum of 10 seats, so the cost is going to be the lower of 10 or n over 100; in this case, six. If we take the same example and remove the resourceVersion "0", it becomes a list from storage, so n over 100 times two is 12, but because of the upper limit the cost is going to be 10.

One problem we actually ran into in GKE: at some point we upgraded to a version where list requests no longer cost one seat but up to 10, and we could hit scenarios where the cost of a request was more than the actual available capacity. Earlier we talked about how priority levels get some share of the total max requests in flight based on their nominal concurrency shares. But if you configure a really low max in-flight limit, you can end up with priority levels that have fewer than ten seats, and then a single request that costs 10 seats can effectively starve out the whole priority level and lock out the other clients, at least until that request completes. So more recently we tuned the formula a little further so that we take the smaller of the estimated cost or 15% of the total available seats, while still applying the upper limit of 10, because you can have clusters with tens of thousands of objects and we don't want any single request to cost more than 10 seats. This makes it much less likely for a handful of requests to completely starve out a priority level. So yeah, this tuning helped quite a bit.
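As a small sketch of that adjustment (my reading of the talk is that the 15% is taken relative to the seats available to the priority level handling the request; the numbers are illustrative):

```go
package main

import "fmt"

// cappedSeats applies the adjusted cap described above: a single request is
// never charged more than ~15% of the seats available to its priority level,
// and never more than ten seats, so one expensive list can no longer occupy
// an entire small priority level on its own.
func cappedSeats(estimatedSeats, priorityLevelSeats int) int {
	maxSeats := int(0.15 * float64(priorityLevelSeats))
	if maxSeats > 10 {
		maxSeats = 10
	}
	if maxSeats < 1 {
		maxSeats = 1 // always leave room for at least one seat
	}
	if estimatedSeats > maxSeats {
		return maxSeats
	}
	return estimatedSeats
}

func main() {
	fmt.Println(cappedSeats(10, 20))  // 3: a 10-seat list in a 20-seat priority level
	fmt.Println(cappedSeats(10, 500)) // 10: the absolute cap still applies
}
```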
But if we go back to the way we configure the max-requests-inflight flag (and again, these are made-up numbers to illustrate the point), we started to notice a pattern where control planes with a really small max-requests-inflight were still pretty prone to premature rate limiting: they would start rate limiting clients thinking the cluster was overloaded when it actually had a lot of capacity left. And really large control planes, the ones managing your 5,000 to 15,000-node clusters, had too much max-requests-inflight, so they weren't actually providing sufficient overload protection, and we could have scenarios where the cluster gets overloaded by clients that aren't actually that important. So we started to think about different ways to calculate the most appropriate values for max-requests-inflight, and we landed on a weighted function: the first few cores given to the control plane add more max in-flight requests, and as you add more CPU capacity it tapers off, so that at really large sizes, adding more capacity doesn't add more max in-flight requests at all. The effect is that we provide more capacity up front when control planes are smaller, so they're less likely to lock out and rate limit clients prematurely, and we restrict max-requests-inflight for really large control planes, because those are typically your really large GKE clusters and we always need capacity available to scale the nodes; we don't want non-system clients competing with the scheduler, the controller manager, and the other system clients that are trying to keep that 5,000-node cluster running. This chart basically illustrates what max-requests-inflight looks like as a cluster scales to more nodes.

Okay, so to recap what we've accomplished so far: we added a mechanism to limit in-flight requests using the max-requests-inflight flags, we introduced an API to define priorities and a way to classify traffic into those priorities, and we introduced work estimation and tuned it a bunch so that the system is more robust and more accurately measures the amount of work produced by incoming requests. So we're in a pretty good place at this point. But there's a really big underlying assumption we've been making so far that isn't always true, which is that clients and their requests always end up in the correct priority level. And this is where flow classification becomes relevant. Earlier we briefly covered the FlowSchema API, which is basically a set of rules that maps some parameters of a request to a priority level. If you don't have the correct flow schemas, or an optimal configuration of flow schemas, your system's availability and performance are going to suffer. That's because we've designed these priority levels around specific clients and the types of requests we expect from those clients; we've essentially sharded capacity across the priority levels based on specific types of clients. If requests aren't mapping correctly to those priority levels, the shares of the priority levels are effectively wrong, and we're going to run into a lot of problems.
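As a practical aside (this wasn't on the slides): you can see exactly where a given client's requests are landing, because the API server reports the matched FlowSchema and priority level as UIDs in the X-Kubernetes-PF-FlowSchema-UID and X-Kubernetes-PF-PriorityLevel-UID response headers. A rough client-go sketch that logs them might look like this; the kubeconfig wiring and the example request are illustrative.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// roundTripperFunc lets a plain function act as an http.RoundTripper.
type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}

	// Log which FlowSchema / priority level the API server matched each
	// request against, using the X-Kubernetes-PF-* response headers.
	cfg.Wrap(func(rt http.RoundTripper) http.RoundTripper {
		return roundTripperFunc(func(req *http.Request) (*http.Response, error) {
			resp, err := rt.RoundTrip(req)
			if resp != nil {
				log.Printf("%s %s flowschema=%s prioritylevel=%s",
					req.Method, req.URL.Path,
					resp.Header.Get("X-Kubernetes-PF-FlowSchema-UID"),
					resp.Header.Get("X-Kubernetes-PF-PriorityLevel-UID"))
			}
			return resp, err
		})
	})

	client := kubernetes.NewForConfigOrDie(cfg)
	if _, err := client.CoreV1().Pods("default").List(context.Background(), metav1.ListOptions{}); err != nil {
		log.Fatal(err)
	}
}
```

The headers carry UIDs rather than names, so you map them back to objects with something like kubectl get flowschemas -o custom-columns=NAME:.metadata.name,UID:.metadata.uid.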
Kubernetes does come with sane default flow schemas, which are designed to cover most cases. Looking at one example: all clients authenticated with a Kubernetes service account go to the workload-low priority level, and in the donut chart from earlier, workload-low actually has the most capacity. That's because generally, if you're running a pod with a Kubernetes service account, it's going to be a controller, and in large clusters you're going to have lots of controllers running in that configuration, which is why that level gets the most capacity. But if you're running with a service account in the kube-system namespace, we treat it more like a system controller, so that traffic goes to workload-high. You can roughly think of it as: workload-low is all controllers, and workload-high is all system controllers. There are a bunch of other rules along the same lines that I'm not going to cover, though the last one is probably worth calling out: the global-default priority level, which is given a pretty small slice of the cake and is reserved for interactive clients like kubectl. That will be important in a later example.

In GKE, we have some goals around flow classification. Firstly, we want to improve the overall accuracy of flow classification, which means configuring our flow schemas in such a way that misclassifying flows is very unlikely. Secondly, we want our default configuration of flow schemas and priority levels to work for every GKE user. We do allow customizing flow schemas and priority levels, with some guardrails, but we want that to be a last resort; it should be a fairly rare occurrence for any GKE user to have to tinker with them. In practice, though, this has become really hard to accomplish because of all the different ways our customers set up their clusters and deploy their applications, and all the different ways you can authenticate and interact with a cluster. So misclassification becomes, to some extent, inevitable.

Let's walk through some examples of misclassification that can happen on GKE. A very common one is running some local tool on your laptop that generates a lot of traffic. These clients will almost always land in the global-default priority level we just talked about, because when you run a controller or a tool on your laptop, it most likely authenticates the same way kubectl does: it points the KUBECONFIG environment variable at your kubeconfig. The difference is that it generates a lot more traffic than you would with kubectl. With kubectl you run single commands, you list pods or delete pods or whatever; but if you run a dashboard, for example, using the same kubeconfig, it's going to try to list the whole world just to show you a nice UI for your cluster. So those are very different traffic patterns. Another example: you run a controller on a GCE VM that's not part of the cluster. That's also going to use global-default, because it's probably authenticating with a Google service account and not a Kubernetes service account, whereas generally we want any non-system controller to use the workload-low priority level like we discussed earlier. And some third-party add-ons will run controllers in the kube-system namespace, which puts their traffic in the workload-high priority level. That can be bad, because workload-high is shared with the scheduler and the controller manager, and you don't want to accidentally put a high-load controller in the same priority level as those system components; that can really mess up your cluster.
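As an illustration of the kind of fix this allows (a hypothetical sketch with made-up names, not a GKE default): if one of those kube-system add-ons is really just another controller, a custom FlowSchema can route its service account back to the built-in workload-low priority level, as long as it sorts ahead of the default rule that matches kube-system service accounts.

```go
package main

import (
	"fmt"

	flowcontrolv1beta3 "k8s.io/api/flowcontrol/v1beta3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical: route a noisy third-party add-on running in kube-system back
// to the built-in workload-low priority level, instead of letting the default
// kube-system rule place it in workload-high.
var addonFlowSchema = flowcontrolv1beta3.FlowSchema{
	ObjectMeta: metav1.ObjectMeta{Name: "example-addon-to-workload-low"},
	Spec: flowcontrolv1beta3.FlowSchemaSpec{
		PriorityLevelConfiguration: flowcontrolv1beta3.PriorityLevelConfigurationReference{
			Name: "workload-low", // built-in priority level for ordinary controllers
		},
		// Must sort before the built-in rule matching kube-system service
		// accounts (lower precedence values are evaluated first).
		MatchingPrecedence: 800,
		DistinguisherMethod: &flowcontrolv1beta3.FlowDistinguisherMethod{
			Type: flowcontrolv1beta3.FlowDistinguisherMethodByUserType,
		},
		Rules: []flowcontrolv1beta3.PolicyRulesWithSubjects{{
			Subjects: []flowcontrolv1beta3.Subject{{
				Kind: flowcontrolv1beta3.SubjectKindServiceAccount,
				ServiceAccount: &flowcontrolv1beta3.ServiceAccountSubject{
					Namespace: "kube-system",
					Name:      "example-addon", // hypothetical add-on service account
				},
			}},
			ResourceRules: []flowcontrolv1beta3.ResourcePolicyRule{{
				Verbs:        []string{"*"},
				APIGroups:    []string{"*"},
				Resources:    []string{"*"},
				ClusterScope: true,
				Namespaces:   []string{"*"},
			}},
		}},
	},
}

func main() {
	fmt.Println(addonFlowSchema.Name)
}
```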
These are just some examples, but you can think of more scenarios, and it quickly becomes impossible to cover all the different cases of misclassification. So instead of trying to chase all these corner cases, we need to make the system a little more resilient to misclassification, and this is where priority borrowing becomes really important. Borrowing was introduced in 1.26, and as the name suggests, it's a way for the API server to lend unused capacity from one priority level to another. This allows the available capacity of any priority level to be more flexible, and it prevents a few clients from starving out entire priority levels. The neatest thing about borrowing is that it lets us make better use of unused capacity in the control plane, but in the event of overload, everything self-corrects back to whatever fixed capacity we originally assigned. So optimistically, a priority level will try to use unused capacity from the other priority levels, and once you reach a tipping point, everything goes back to using the fixed shares you assigned, regardless of the borrowing configuration. For reference, within each priority level there are two fields you can tune: borrowingLimitPercent, which is how much this level can borrow from other priority levels, and lendablePercent, which is how much other priority levels can borrow from this one. We can tune these values so that, for example, there are some priority levels that we absolutely never let other levels borrow from. Luckily, in GKE we haven't had to tune this at all; the defaults have been working great, but it's probably too soon to say that, and we might need to revisit it.

Okay, I think we're running out of time, so really quickly I want to talk about webhooks. Webhooks can denial-of-service your control plane, as many of you probably already know, especially webhooks that use wildcard matches. I'm very conflicted about webhooks, because oftentimes they serve a very valuable purpose, whether it's policy control or sidecar injection or whatever, and a lot of the time we can't just turn them off; obviously you installed the webhook for a reason. There's also a shared responsibility between GKE and our customers here: if you deploy a webhook, you're basically extending the control plane, but we can't manage the actual services backing the webhook. So what we've done is build some systems at Google to provide more proactive insights and recommendations to customers, to let them know about webhooks that we think pose a potential risk to the control plane. If you've used GKE for a while, you might be familiar with this pattern.

We already do this for deprecated APIs, for example: if your cluster is using a deprecated API and you're about to upgrade to a version that removes it, we have UI that tells you, hey, we detected that your cluster is using a deprecated API, and we're going to block your upgrades until you resolve this. We use that same system to look at webhooks and say, hey, this webhook is intercepting leases in the kube-node-lease namespace, and it's in the critical path of node heartbeats, for example. So we now have a detection mechanism to proactively let customers know about these webhooks. If we detect unavailability of a service backing a webhook, we have a UI that pops up. The earlier slide (I redacted a bunch of stuff) shows the Kubernetes cluster list, with a notification column that carries a bunch of warnings; API deprecations are one example, the recent dockershim deprecation is another. We now have a specific warning for webhooks, and if you click it, it shows one of two kinds of warnings. One: if your webhook references a service and that service has been down for some amount of time, we'll tell you, hey, you have a webhook in your cluster and its backing service is not available, and we have some docs to help you troubleshoot those situations. Two: if we detect webhooks that intercept what we deem system-critical resources, we'll provide recommendations to update your webhook to exclude those namespaces.

So the takeaway is that webhooks can take down your cluster, and there are some best practices you can follow. Mainly: ensure the webhook backend has sufficient capacity, and run multiple replicas if you can. Don't intercept system-critical requests; some examples are lease traffic in kube-node-lease or kube-system, because those leases are used for leader election and heartbeats by system components. You can use the namespaceSelector field in the webhook configuration to exclude entire namespaces. And more recently, in 1.28, we introduced match conditions, so you can use CEL expressions to filter out requests and decide when the webhook should be invoked at all. That's actually a feature we built together with some folks working on EKS, because we're both looking at ways to reduce the blast radius of webhooks.
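To make that concrete, here's a sketch of what scoping a webhook away from system-critical traffic might look like; the webhook name, service account of the backing service, and the CEL expression are all hypothetical, and whether to fail open is a judgment call for your particular webhook.

```go
package main

import (
	"fmt"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// Hypothetical webhook configuration: the rule itself is a wildcard, so the
// namespaceSelector and matchConditions keep it out of the system namespaces
// and out of the lease/heartbeat path. clientConfig is omitted for brevity.
var exampleWebhookConfig = admissionregistrationv1.ValidatingWebhookConfiguration{
	ObjectMeta: metav1.ObjectMeta{Name: "example-policy"},
	Webhooks: []admissionregistrationv1.ValidatingWebhook{{
		Name: "policy.example.com",
		Rules: []admissionregistrationv1.RuleWithOperations{{
			Operations: []admissionregistrationv1.OperationType{
				admissionregistrationv1.Create, admissionregistrationv1.Update,
			},
			Rule: admissionregistrationv1.Rule{
				APIGroups:   []string{"*"},
				APIVersions: []string{"*"},
				Resources:   []string{"*"},
			},
		}},
		// Exclude the system namespaces entirely.
		NamespaceSelector: &metav1.LabelSelector{
			MatchExpressions: []metav1.LabelSelectorRequirement{{
				Key:      "kubernetes.io/metadata.name",
				Operator: metav1.LabelSelectorOpNotIn,
				Values:   []string{"kube-system", "kube-node-lease"},
			}},
		},
		// matchConditions (beta in 1.28): all CEL expressions must evaluate to
		// true before the webhook is called; here, skip Lease traffic even in
		// namespaces the selector does match.
		MatchConditions: []admissionregistrationv1.MatchCondition{{
			Name:       "exclude-lease-traffic",
			Expression: `request.resource.group != "coordination.k8s.io"`,
		}},
		// Fail open so an unavailable backend doesn't block all admission;
		// for strict policy webhooks you may need Fail instead.
		FailurePolicy:           ptr.To(admissionregistrationv1.Ignore),
		SideEffects:             ptr.To(admissionregistrationv1.SideEffectClassNone),
		AdmissionReviewVersions: []string{"v1"},
	}},
}

func main() {
	fmt.Println(exampleWebhookConfig.Name)
}
```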
And yeah, that concludes my talk. Happy to take questions if there are any. Oh, and thanks, by the way; sorry, I kind of rushed through the talk because I didn't want to run out of time for questions. Thanks for attending, I appreciate it. I'll also be out by the hallway if anyone wants to chat about any of this stuff.