All right, so let's start. We're in the SIG API Machinery talk today, and we have two speakers: Abu, who's here, and Mike Spreitzer, who prepared a video that Abu will play. It's about API Priority and Fairness. For those of you who have no idea what that means: a few years ago, when we didn't have it, one controller could basically kill the API server by sending too many requests, and we depended on client-side throttling being configured correctly. We're in a different world today, in 2024, and the work behind that is API Priority and Fairness, a feature that ships with the API server. Abu and Mike are the ones who pushed for it and implemented it. So let's welcome Abu.

[Abu] Thank you, Stefan. Hello, everyone. Welcome to the SIG API Machinery deep dive. Mike couldn't be here in person, so we have a prerecorded video from him. I'll start by playing his video.

[Mike, recorded] Mike Spreitzer here, to give you a brief overview of the API Priority and Fairness (APF) feature in Kubernetes. This is a feature in the API server, and in the generic API server library that we use to build similar API servers. This feature regulates the load on the API server in terms of the number of requests that it is actively serving at a given moment. The purpose of the feature is to protect the API server from the clients, and to protect the clients from each other. It is based on attributes of the request that participate in authentication and access control, so it cannot be easily fooled. The feature thus takes into account both the request rate and the time it takes to serve each request, because the product of those two is, on average, the number of concurrent requests. This is one-dimensional regulation. It's an approximate technique, but a good approximation, and there are a few tweaks that make it better, which we'll see later.

This feature replaces the max-in-flight filter, an earlier, simpler filter. That filter classifies each request into one of just two categories, mutating or read-only, and treats all requests of a given category the same. APF is more granular. It's also configurable, it introduces queuing, and it introduces some fairness between clients. Like the max-in-flight filter, APF rejects requests in the standard HTTP way, with status code 429.

Let's look at where the APF feature fits into the API server. Stefan Schimanski gave a good talk a few years ago at KubeCon about the overall structure of the kube-apiserver. Here we're just going to think about the chain of handlers, so-called filters, that do some general-purpose processing of each request on its way into the server. The APF filter starts by doing classification. For each request, this determines the priority level that the request belongs to, and puts the request into one of the queues at that priority level. Each priority level is configured with a certain number of queues, and has a dispatcher whose job is to take requests from the right queue at the right time and send them on for further processing. In this way, APF keeps the server as busy as requested and allowed, but no busier, and does so with some fairness.

Let's look at request classification. This starts with configured objects called flow schemas, which say how requests are to be classified. Each flow schema has a name and a numeric matching precedence, which is used to put the flow schemas in an ordered list. Classifying a request starts with comparing the request to the flow schemas in that order, to find the first one that matches; each flow schema is configured with matching rules that say which requests match it. The result is two things: which priority level the request belongs to, and which flow the request belongs to. In other words, within each priority level, the requests are classified into flows. A flow is identified by a pair of strings. The first string is the name of the flow schema that the request matched. The second string is extracted from the request according to a rule that the flow schema is configured with. A flow schema can make one of three choices: the second string can be the namespace of the object that the request will act on, it can be the name of the user that issued the request, or it can be the empty string.
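To make the classification step concrete, here is a minimal Go sketch of first-match-wins selection over flow schemas ordered by matching precedence, plus construction of the flow identifier. The types and the Matches predicate are simplified stand-ins invented for this sketch, not the real flowcontrol API types.

```go
package apf

import "sort"

// Simplified stand-ins for the real flowcontrol types.
type Request struct {
	UserName  string
	Namespace string
}

type FlowSchema struct {
	Name               string
	MatchingPrecedence int32              // lower values are considered first
	PriorityLevel      string             // the priority level this schema assigns
	Distinguisher      string             // "ByUser", "ByNamespace", or "" for the empty string
	Matches            func(Request) bool // stand-in for the schema's matching rules
}

// classify finds the first matching flow schema in precedence order and
// builds the flow identifier: (schema name, distinguisher string).
func classify(req Request, schemas []FlowSchema) (fs FlowSchema, flowID [2]string, ok bool) {
	sort.SliceStable(schemas, func(i, j int) bool {
		return schemas[i].MatchingPrecedence < schemas[j].MatchingPrecedence
	})
	for _, s := range schemas {
		if !s.Matches(req) {
			continue
		}
		second := "" // third choice: the empty string
		switch s.Distinguisher {
		case "ByUser":
			second = req.UserName
		case "ByNamespace":
			second = req.Namespace
		}
		return s, [2]string{s.Name, second}, true
	}
	return FlowSchema{}, [2]string{}, false
}
```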
The flow identifier is the input to shuffle sharding, which is used to put the request into one of the queues at the priority level. A priority level is configured with a fixed number of queues and a so-called hand size, which is the size of the subset of queues that will be considered for a given request. Shuffle sharding uses the flow identifier as a source of randomness to make a pseudo-random choice of that subset of queues, and the request then gets put into the queue in that subset with the shortest length. So you see there is a distinction here between the number of queues and the number of flows. The number of flows is not so well controlled; it is dynamic, and it can be quite large, impractically large to serve as the number of queues. Shuffle sharding is a technique for mapping a large number of flows onto a small number of queues in such a way that any one or a few big flows do not crowd out the other flows.
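Continuing the sketch above, the shuffle-sharding step might look roughly like this. One simplification to flag: the real implementation deals a hand of distinct queues, while this version tolerates repeats.

```go
package apf

import "hash/fnv"

// chooseQueue hashes the flow identifier to deterministically "deal" a hand
// of handSize queue indices out of len(queueLengths), then picks the
// shortest queue in the hand. A given flow always gets the same hand, but
// different flows get different, mostly non-overlapping hands.
func chooseQueue(flowID [2]string, queueLengths []int, handSize int) int {
	h := fnv.New64a()
	h.Write([]byte(flowID[0]))
	h.Write([]byte{0}) // separator between the two strings
	h.Write([]byte(flowID[1]))
	v := h.Sum64()

	n := uint64(len(queueLengths))
	best := -1
	for i := 0; i < handSize; i++ {
		q := int(v % n) // the i-th "card": successive base-n digits of the hash
		v /= n
		if best == -1 || queueLengths[q] < queueLengths[best] {
			best = q
		}
	}
	return best
}
```

This is why one misbehaving flow can congest at most its own hand of queues, while other flows, dealt different hands, will usually still find a short queue.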
Now let's look at dispatching. Each priority level has a current concurrency limit, which is roughly the maximum number of requests of that priority level that the server will actually work on at a given time. These are based on nominal concurrency limits, which are derived from the server's capacity and configured concurrency shares: each priority level is configured with a number of nominal concurrency shares, and the server's capacity is divided among the priority levels in proportion to those shares. The nominal concurrency limits are a baseline that is then tweaked periodically by borrowing to produce the current concurrency limits. The purpose of borrowing is to allow a priority level that is lightly loaded at the moment to lend some of its concurrency to other priority levels that are heavily loaded at the moment.

Each priority level is configured with a number of queues, and there is a dispatching algorithm inspired by the fair queuing technique from networking; we had to adapt it a bit for our use here, and the details are in the KEP. This algorithm is the thing that chooses which queues to take requests from at which times, and sends those requests on for further processing. A priority level can also be configured to not queue, and instead just reject excess requests, like the max-in-flight filter did. There are also limits on queuing: each priority level is configured with a limit on the length of its queues, and there is also a limit on the time a request can spend waiting in a queue, so a request can be rejected due to either of those limits as well. Finally, there is one priority level that is exempt from regulation.

As mentioned earlier, there are some additional considerations. One is that APF will consider some requests to occupy more than one seat, that is, to be relatively expensive or heavyweight. The leading example is a list request that returns a large number of objects. Compared to other requests that take the same amount of time to execute, such a request is exceptionally expensive to handle, so APF considers it to be worth more than one seat.

The next consideration is watch requests. Whereas the max-in-flight filter simply did not regulate them, APF does. A watch request is a request to do one or two things. The main purpose of a watch is to keep the client apprised of changes to a collection of objects on an ongoing basis over some period of time. Some watch requests additionally start by informing the client of all the pre-existing objects. That first phase, which is or is not there depending on details of the request, is much like an ordinary short-lived request, and APF manages it as such. Once that phase is over, APF considers the request to be done, even though in fact it continues on, notifying the client of ongoing changes. The cost of those notifications APF associates with the requests that actually make the changes: for a request that writes or mutates an object, even though the reply goes back to the requester promptly, APF considers the request to keep executing for a while longer, to account for the cost of sending the watch notifications out to the watching clients.

Another additional consideration is that the exempt priority level has a nominal concurrency limit and participates in borrowing. This can be used to somewhat even out its effects on the other priority levels. Finally, the last thing I want to mention is rejections. One of the standard features of the 429 status code in HTTP is a recommended time for the client to wait before trying again. For a long time, Kubernetes always set that to one second; recently we've made it adaptive, so that clients can do a better job of backing off. That completes the brief overview, and now Abu will take over and give you some more interesting details.

[Abu] All right, let me just quickly skip through the slides Mike already covered. Okay. So let's do a quick recap of how APF handles a request. A new request arrives; we find the matching flow schema and compute the flow of the request. The request is enqueued using shuffle sharding, and then we ask the scheduler to dispatch. The scheduler dispatches the request that should execute next using the fair queuing technique, and at that moment it will dispatch as many requests as possible. So the request basically waits in the queue for a decision. If the decision is accept, the request gets executed. If the queue wait time threshold is exceeded and the scheduler cannot accommodate the request, the request is removed from the queue and rejected. So that's how APF handles a request. Next slide.
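That recap, expressed as a Go sketch reusing the toy classify and chooseQueue pieces from earlier; enqueue and execute here are stand-ins invented for the sketch, not the real filter's API.

```go
package apf

import (
	"errors"
	"time"
)

// ErrRejected corresponds to APF answering 429 Too Many Requests.
var ErrRejected = errors.New("429 Too Many Requests")

// handle sketches the lifecycle: classify, enqueue via shuffle sharding,
// wait for the dispatcher's decision, then execute or reject.
func handle(req Request, schemas []FlowSchema, queueWaitLimit time.Duration) error {
	fs, flowID, ok := classify(req, schemas)
	if !ok {
		return ErrRejected // cannot happen with the bootstrap catch-all in place
	}
	decision := enqueue(fs.PriorityLevel, flowID, req) // resolved by the dispatcher

	select {
	case accepted := <-decision:
		if !accepted {
			return ErrRejected // e.g. the queue was over its length limit
		}
		return execute(req) // occupies the request's seats while it runs
	case <-time.After(queueWaitLimit):
		return ErrRejected // waited too long in the queue
	}
}

// Stand-ins so the sketch is self-contained; the real dispatcher applies
// fair queuing across the queues of the priority level.
func enqueue(level string, flowID [2]string, req Request) <-chan bool {
	ch := make(chan bool, 1)
	ch <- true
	return ch
}

func execute(Request) error { return nil }
```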
So APF is highly configurable via API objects. It ships with a set of FlowSchema and PriorityLevelConfiguration objects, which we refer to as the bootstrap configuration. The left column shows the flow schema objects, in order of their matching precedence; on the right, we have the available priority level configuration objects. A flow schema object is assigned to exactly one priority level; on the other hand, a priority level can be shared by multiple flow schemas.

Okay, next slide. Let's go through some of the bootstrap flow schema objects and why they're there. First, exempt: this flow schema matches all requests belonging to the system:masters group. That means if you're a cluster admin, you fall into this category, and these requests are always exempt from APF regulation. Next we have system-leader-election: it matches the leader election requests from kube-controller-manager and kube-scheduler; this traffic is critical for cluster availability. Then we have endpoint-controller: it matches the requests coming from the endpoint controller that manages Endpoints objects; this is critical for keeping the service network functional. Then we have system-node-high: this flow schema matches the heartbeats from the nodes, that is, from the kubelets; we need this for cluster self-maintenance. Then we have kube-controller-manager and kube-scheduler: these match the leftover traffic from their respective components. We have service-accounts: it matches any leftover in-cluster traffic, so if you have a workload running as a pod, most likely its requests will match this flow schema. global-default matches any leftover traffic, including unauthenticated traffic and traffic from outside the cluster. And catch-all serves as a catch-all for any unmatched traffic; under normal conditions no request should match it, because the flow schemas above it act as a net that catches all requests.

During startup, the kube-apiserver ensures that all bootstrap configuration objects exist on the cluster. It also periodically scans these objects and applies any updates necessary. That means any changes you make to the spec of a bootstrap configuration object will be stomped on by the API server; for the changes to stick, the cluster operator must set the auto-update annotation (apf.kubernetes.io/autoupdate-spec) to false. The goal is to let the kube-apiserver update bootstrap objects installed by previous releases while, at the same time, not overriding changes made by cluster operators.

Cluster operators can also add their own configuration if needed. For example, if you have a buggy workload that is known to run amok and flood the API server, you can define a dedicated flow schema that matches the requests coming from that workload and assign it to a priority level with a small concurrency share.
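As an illustration of that last point, such a pair of objects could look roughly like the following, written with the k8s.io/api/flowcontrol/v1 Go types. The "noisy-batch" service account, the "batch-jobs" namespace, and all the numbers are made up for this example.

```go
package apfconfig

import (
	flowcontrolv1 "k8s.io/api/flowcontrol/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// A small, dedicated priority level for a misbehaving workload.
var noisyPriorityLevel = &flowcontrolv1.PriorityLevelConfiguration{
	ObjectMeta: metav1.ObjectMeta{Name: "noisy-workloads"},
	Spec: flowcontrolv1.PriorityLevelConfigurationSpec{
		Type: flowcontrolv1.PriorityLevelEnablementLimited,
		Limited: &flowcontrolv1.LimitedPriorityLevelConfiguration{
			NominalConcurrencyShares: ptr.To[int32](5),  // a small slice of server concurrency
			LendablePercent:          ptr.To[int32](50), // half of it may be lent out
			BorrowingLimitPercent:    ptr.To[int32](0),  // and it may not borrow seats itself
			LimitResponse: flowcontrolv1.LimitResponse{
				Type: flowcontrolv1.LimitResponseTypeQueue,
				Queuing: &flowcontrolv1.QueuingConfiguration{
					Queues:           8,
					HandSize:         4,
					QueueLengthLimit: 50,
				},
			},
		},
	},
}

// A flow schema steering the workload's traffic to that level before the
// bootstrap service-accounts schema (precedence 9000) can claim it.
var noisyFlowSchema = &flowcontrolv1.FlowSchema{
	ObjectMeta: metav1.ObjectMeta{Name: "noisy-batch"},
	Spec: flowcontrolv1.FlowSchemaSpec{
		PriorityLevelConfiguration: flowcontrolv1.PriorityLevelConfigurationReference{
			Name: "noisy-workloads",
		},
		MatchingPrecedence: 1000,
		DistinguisherMethod: &flowcontrolv1.FlowDistinguisherMethod{
			Type: flowcontrolv1.FlowDistinguisherMethodByUserType,
		},
		Rules: []flowcontrolv1.PolicyRulesWithSubjects{{
			Subjects: []flowcontrolv1.Subject{{
				Kind: flowcontrolv1.SubjectKindServiceAccount,
				ServiceAccount: &flowcontrolv1.ServiceAccountSubject{
					Namespace: "batch-jobs",
					Name:      "noisy-batch",
				},
			}},
			ResourceRules: []flowcontrolv1.ResourcePolicyRule{{
				Verbs:      []string{flowcontrolv1.VerbAll},
				APIGroups:  []string{flowcontrolv1.APIGroupAll},
				Resources:  []string{flowcontrolv1.ResourceAll},
				Namespaces: []string{flowcontrolv1.NamespaceEvery},
			}},
		}},
	},
}
```

With these two objects applied, the noisy workload competes only for its own small pool of seats instead of sharing the service-accounts level with every other in-cluster client.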
All right, so we talked a bit about flow schemas; now let's switch to priority levels. We'll start with the server concurrency limit. It is the maximum number of seats that in-flight requests can occupy at any moment on the server, and APF uses it as a fence for protection: it will try to prevent the load on the API server from going beyond that fence. As Mike mentioned, it's an approximation. The server concurrency limit is calculated by summing two server run options, --max-requests-inflight and --max-mutating-requests-inflight. If not modified by the cluster operator, the default server concurrency limit is 600 on any kube-apiserver instance. It's also worth mentioning that a request on the server can occupy one or more seats; this gives us a granular mechanism for dealing with requests that have variable costs, for example, getting one object versus listing hundreds of objects.

Each priority level has a property called the nominal concurrency limit, which is basically the number of execution seats available to that level. So let's see how we actually distribute the server concurrency limit among the different priority levels. Each priority level API object has a field called nominalConcurrencyShares. It prescribes the fraction of the server concurrency limit that is available to the level, and we use it to compute the level's nominal concurrency limit. A higher value of nominalConcurrencyShares means a larger nominal concurrency limit for that level, at the expense of every other priority level. The pie chart here shows the default distribution of the server concurrency limit on a vanilla cluster.
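Concretely, the shares-to-limits arithmetic is roughly the following (a sketch; the exact rounding in the apiserver may differ):

```go
package apf

import "math"

// nominalLimits divides the server concurrency limit among priority levels
// in proportion to their nominalConcurrencyShares, roughly
// NominalCL(i) = ceil(serverCL * shares(i) / totalShares).
func nominalLimits(serverCL int, shares map[string]int) map[string]int {
	total := 0
	for _, s := range shares {
		total += s
	}
	limits := make(map[string]int, len(shares))
	for level, s := range shares {
		limits[level] = int(math.Ceil(float64(serverCL) * float64(s) / float64(total)))
	}
	return limits
}
```

For example, with the default server concurrency limit of 600 and three levels holding 100, 50, and 50 shares, the levels get 300, 150, and 150 seats respectively.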
Okay, next slide. Each priority level enforces its concurrency limit independently of the other levels, which introduces a utilization challenge. If we look at the figure, priority level B is underutilized; it is running well below its concurrency limit. Priority level A, on the other hand, is saturated: the yellow line marks a moment where the number of executing requests for priority level A has reached its concurrency limit, and there is an excess of requests waiting in the queue. So what happens to those excess requests for A? They will either wait longer in the queue before being served, which introduces additional latency in the response time, or, in the worst case, they will be rejected by APF. How do we solve this problem? If we allow A to borrow some seats from B, we can help resolve the situation.

So APF allows borrowing among priority levels. There are two fields that prescribe how many seats a priority level can lend to or borrow from other levels. lendablePercent is the percentage of a priority level's nominal concurrency limit that other levels can borrow from it. borrowingLimitPercent is a limit on how many seats a priority level can borrow from other levels. The table here shows the borrowing configuration for some of the bootstrap priority levels. For example, leader-election: this priority level doesn't lend to any other level, but it can borrow without limit.

Another interesting topic worth mentioning is priority inversion. This is a case where, in the course of serving a request, some other request gets spawned to the kube-apiserver. For example, in the figure below we have a cluster extended by aggregation. A user sends request A, the kube-apiserver proxies the request to the aggregated API server, and the aggregated API server spawns a new request, B, in order to serve request A. B is subject to APF regulation independently of A, so if B is rejected by APF, A will fail as a consequence. One example is delegated authorization, where the aggregated API server, in the course of serving a request, issues a SubjectAccessReview to the kube-apiserver. There are other examples: the kube-apiserver issuing requests to itself over the loopback client, or an external admission webhook server issuing requests to the kube-apiserver while it is itself serving callouts from the kube-apiserver.

So how do we solve priority inversion? APF doesn't have any mechanism to detect these spawned requests. The way we solve it is by exempting the spawned requests: you define a flow schema that matches them and assign it to an exempt priority level. For example, for delegated authorization, we match the SubjectAccessReview traffic coming from the aggregated API server and assign it to the exempt priority level.

Now let's talk about clients and retries. When APF rejects a request, it sends a 429 status code, which means "too many requests". It also adds a Retry-After header to the response; the value of the header tells the client how long it should wait before the next retry. If your workload is built with client-go, you don't have to worry about retries: client-go automatically retries the request.
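For clients not built on client-go, honoring that header looks roughly like the sketch below. Only the 429 case is handled here; client-go's built-in retry covers 5xx responses with Retry-After as well, and makes hand-rolling this unnecessary.

```go
package apfclient

import (
	"net/http"
	"strconv"
	"time"
)

// getWithRetry retries a GET when the server answers 429, sleeping for the
// server-recommended Retry-After duration between attempts.
func getWithRetry(client *http.Client, url string, maxRetries int) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := client.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		wait := time.Second // the old fixed Kubernetes recommendation
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				wait = time.Duration(secs) * time.Second // now set adaptively by APF
			}
		}
		resp.Body.Close()
		time.Sleep(wait)
	}
}
```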
Okay. Next, I'll talk about a couple of situations where APF can help. In most production environments, the kube-apiserver runs as a pod, and the kubelet probes it periodically so that it knows when to restart it. If the API server is overloaded, it may reject the health probes from the kubelet, further degrading the cluster. We can prevent this situation by exempting the liveness probes to the kube-apiserver.

There is another situation, a watch storm incident. This graph is from an HA cluster with three kube-apiserver instances. It shows the number of watches over time; each line represents a unique kube-apiserver instance and shows the number of watch requests active on that particular instance. A gap in a line is a time window where that API server instance was unavailable. If you look at the highlighted area, the blue instance died, and all the watch requests from the blue instance re-established almost immediately on the green instance. This sort of watch storm can overload the API server, or in some instances crash it. So this is one of the many cases of cluster degradation where APF can come to the protection of the API server.

All right. APF has a set of metrics for observability; I'll quickly go through some of them. This one (apiserver_flowcontrol_dispatched_requests_total) gives the rate of requests dispatched for each priority level, so we can see how many requests we're serving per second. The next one (apiserver_flowcontrol_current_executing_requests) shows the number of requests currently executing for each priority level. This one (apiserver_flowcontrol_current_inqueue_requests) shows the number of requests currently waiting in a queue before being served. And this histogram (apiserver_flowcontrol_request_wait_duration_seconds) shows, per priority level, how much time a request waits in its queue before being executed. I think that's it. With this, I'll stop for questions.

[Audience question, partly inaudible]

[Abu] Oh, the default, that's a client setting. The default value is 10, but I think an author can actually modify that setting. Are you referring to the... oh, I see, yes. You can actually change these. Sorry, let me go back to the slide. This one, I think? Yes. So the question was about all the different configurations: there are default values from API Machinery that are set in the core right now.

[Audience] You said operators can reset them or set them to different values if they want to. Is there any specific use case where you see people wanting different values for the type of cluster they're building?

[Abu] These are designed to be enough, or approximately enough, to keep the API server functioning even during degradation. A cluster admin can actually change those objects, they're just API objects, but it's not recommended. The one use case I can think of is that you're experiencing degradation in the cluster and you find that some priority level doesn't have enough room; maybe you tweak it to make room temporarily. But usually we recommend that operators first identify the requests that are valuable to them and see whether there is an existing flow schema that matches them. Otherwise, you can create your own flow schema object and assign it to the desired priority level.

[Audience] Yeah, that was going to be the follow-up. In the graphs you showed, I think one of the examples was a special type, like an OpenShift one, which was probably a custom configuration as well. But I think you already answered that: you can create custom configuration. Okay, cool. Thank you.

[Audience] Is there any kind of guidance on how much to tweak the number of seats available? There was 600 as the default from max-in-flight, but as a function of cores or RAM you allocate?

[Abu] Yeah, that's a very good question. I think we're still trying to find that answer, because this is all a very rough approximation. We say that APF uses this as a fence for protection, but it's definitely not an accurate fence. With 600, you can use the number of cores to map to this value, and you can run experiments and see how well it performs. It also depends on the workload you have; not all workloads behave similarly on the API server side, because requests have different costs. Yeah, thanks.

[Audience] Requests traditionally had a timeout of 30 seconds. What can I expect here? How long does my request stay in those queues?

[Abu] Okay, so we made a recent change: the queue wait time now depends on the request timeout. How do you specify a timeout on a request? A user can specify it in the request parameters, like timeout=10s. I think we limit the request timeout on the server side to at most 60 seconds for regular requests. And the APF queue wait timeout is one fourth of the request timeout. Previously it was hardcoded to 15 seconds; now it's based on the actual request timeout. If the user doesn't specify a timeout, we default it on the server side to 60 seconds, so the queue wait timeout will be 15 seconds for those requests.
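In other words, roughly (a sketch of the rule as just described, using the values mentioned above):

```go
package apf

import "time"

const (
	maxRequestTimeout     = 60 * time.Second // server-side cap for regular requests
	defaultRequestTimeout = 60 * time.Second // used when the client sends no ?timeout=
)

// queueWaitLimit is one fourth of the effective request timeout, replacing
// the old hardcoded 15 seconds.
func queueWaitLimit(clientTimeout time.Duration) time.Duration {
	t := clientTimeout
	if t <= 0 {
		t = defaultRequestTimeout
	}
	if t > maxRequestTimeout {
		t = maxRequestTimeout
	}
	return t / 4
}
```

So a request sent with timeout=10s can wait at most 2.5 seconds in a queue, and a request with no timeout at all gets the familiar 15 seconds.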
[Audience] Hey, thanks for the talk. I just have a quick series of questions. The bootstrap config, I assume you can't delete it, because you were saying that if you make edits it reconciles unless you override?

[Abu] Yes, if you delete it, the controller will recreate it.

[Audience] Okay, great. Second question: did you then drop the requests-per-second client-side throttling in client-go?

[Abu] We have not yet, but we did an experiment in upstream Kubernetes CI, disabling the client throttling for everything in k/k, and we have some results to share for that. The results looked good, but it's not in production; in production, client-go still enforces the client-side rate limiter.

[Audience] Do you know when you'll drop it?

[Abu] We want to see how it performs in production first. Production environments are usually not up to date with the latest releases, so I think there will be some soak time, and then we'll have to figure it out from there.

[Audience] So you're saying there are no upstream tests for that production environment?

[Abu] We had a test; I think it ran as the 5000-node test.

[Audience] Okay, awesome. You said client-go does retries. When the 429 or the rate limiting happens, does that stall? Does client-go wait and retry the request? Essentially, if you're a controller author, does that mean you'd be stalling reconciliation when that happens?

[Abu] Yes. If client-go sees a 429, or any 5xx, and it also sees a Retry-After header whose value is an integer, it will sleep for that number of seconds before the next retry.

[Audience] Is it possible to configure client-go to pass that through to the controller author, so they could essentially re-queue?

[Abu] Yes, I think an author can set the retry count, but the actual wait is dictated by the server: the server sends a header saying you should wait, say, three seconds, and the client will wait three seconds before retrying.

[Audience] Yeah, the reason I'm asking is that some controllers run, for example, two concurrent reconciliations at a time. If you have two of those hitting a retry because they're hitting some resource endpoint, there could be other work to be done while they're waiting on those retries. That's why I was wondering what your thoughts are on that.

[Abu] If I remember correctly, the client object sends the request on the same goroutine as the application, so I don't think there is any way for you to avoid that latency.

[Audience] It blocks, you're saying. Okay, I think that's all I had, thanks.

[Abu] But if you have an application that is built to be concurrent, that's something the author has to consider, I guess.

[Audience] Yeah, that's a good point. Adding to that, it would maybe make sense to have an error type with a timeout for those: if you set retries to one, it would give you an error instead, saying that the controller should wait or re-queue after a minute or something.

[Abu] That is something we could probably add to client-go. But in client-go, I think if you specify a timeout, the server enforces it on the server side, but the client will also wait at most that many seconds. Let's say you set timeout=30s: if the server takes longer, the client will time out after those 30 seconds. Even if you have a retry count available for retries, the context will basically time out after 30 seconds.
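A sketch of the client-side knobs that came up in this exchange, using real client-go APIs but purely illustrative values: the client-side rate limiter that client-go still enforces, and a request timeout that bounds the total wait, retries included.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs in a pod
	if err != nil {
		panic(err)
	}
	cfg.QPS = 50 // client-side throttling, still enforced by client-go
	cfg.Burst = 100

	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The timeout is sent to the server (?timeout=30s) and enforced there;
	// the client also gives up after the same 30 seconds, retries included.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	timeout := int64(30)
	pods, err := cs.CoreV1().Pods("default").List(ctx, metav1.ListOptions{TimeoutSeconds: &timeout})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(pods.Items), "pods")
}
```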
[Audience] I don't have a question. I just wanted to say thanks for working on something really, really hard and subtle.

[Abu] Thank you so much.

[Audience] Hey, it's me again. I have one more. If I do something similar to the endpoint controller and do endpoint management, should I be reusing that priority level, or do you recommend I create my own?

[Abu] That's a very good question. If you have your own flow schema and you want to assign it to one of those very critical priority levels, you run a risk, because those priority levels are shared to some extent: if your workload runs amok and floods the API server, it could impact those more important requests. I would say, if you're not sure, the best bet is to create a new priority level, assign your flow schema to that level, and start from that configuration.

All right, we're done. Thank you so much for coming.