Hello, everybody, and welcome to our 2021 SIG API Machinery Maintainers' Track. My name is Federico Bongiovanni, and I'm a co-chair of SIG API Machinery in Kubernetes. We named our talk today "Applying What We Have Learned" because, on one hand, we are going to talk about server-side apply, a feature that has just graduated to GA in 1.22, and on the other hand, we are also going to share some learnings, some practical experiences, on priority and fairness. So this is what we have for you today. First, now that server-side apply is GA, Joe is going to talk us through how you should be using this feature when you are dealing with controllers. Second, Abu is going to share his deep knowledge and some practical experiences using and applying the feature of priority and fairness. Finally, I will use two minutes at the end to give you some practical information about the SIG and how you can contribute and get more involved. So with that introduction done, I will leave you with the speakers. I hope you learn and enjoy.

Hi, my name is Joe Betz. I'm an engineer at Google and a contributor to SIG API Machinery. Today, I'm going to talk about some of the new server-side apply functionality that we've added for controllers. This is part of the Kubernetes 1.20 and newer releases. To give a little bit of context, let's talk about how you can access apply both as a human and as a controller. In the original client-side implementation of apply, you could access it through kubectl apply, and you would provide a file that included the subset of fields of the object that you cared about. Whenever you changed any of those values and called apply again, it would tell the server to change just that particular value. There was no support for controllers for this; there was no reasonable way to use it. First of all, all of the apply logic was bundled directly into kubectl.
The only way to even theoretically access it would have been to shell out to kubectl, but kubectl is a stateful thing and controllers sometimes need to be HA. It just really wasn't a good fit. Server-side apply, on the other hand, is really easy to use. The user experience is almost the same as client-side, except you just add the --server-side flag and you get all the benefits. Now all the merging happens on the server and all of the fields are tracked on the server, so ownership is very clear. There's a bunch of conflict detection and resolution that goes on on the server. It's a much better implementation of apply; I would recommend everybody switch to it. And, conveniently, it's accessible to controllers. You don't need kubectl to access it; you can access it through the client-go API. Today I'm going to focus on a bunch of things that we've done to make this really convenient to use from controllers. To help motivate the example, let's look at what was available in the client in 1.20 and earlier, and then we'll show how we improved on this. So in Kubernetes 1.20, this is what the API looks like: you access server-side apply through the patch operation by setting the patch type appropriately, and then you can make your call. So let's try to actually use that and go through an example case. And I'm going to pause here and give a warning. I like the Doctor Strange idea that you should put the warnings before the spells, so I'm going to do that: do not do this. If you do what I show in this example here, the Go structs are going to set fields that you didn't expect, and you'll have all kinds of problems. All right, with that warning out of the way, what we're going to try to do is programmatically create an apply request. The obvious way to do that: when we do a create or an update, the first thing we do is use the Go structs to construct the request that we want to send to the server. So we can do that for apply.
We're going to create a request that just sets the minReplicas value to zero. All the other fields here are just the coordinates of the object that we're changing. Next, we need to call patch with the type set to apply, and that takes the data as a byte array of YAML, so we have to do some kind of conversion; I'll do that here. And that's it: we've constructed and sent an apply request to the server. On the left here is what I intended my client to send to the server, and on the right is what actually got sent to the server. You're going to notice a couple of differences. In particular, there is this maxReplicas equals zero that got sent to the server. This is pretty bad. What this tells the server to do is, whatever service this autoscaler controls, downscale it to zero pods. If this is a production service running hundreds of pods across the cluster, we just downscaled it to zero. Pagers were definitely going off; this could be a major outage. And this is particularly worrying because there was nothing in our request that said anything about this field. There's nothing here about maxReplicas. So why did this get included in our request? Well, the answer lies in the way that the Go structs work. Here is the horizontal pod autoscaler spec with the maxReplicas field. You'll notice that it's an int32. It's not a pointer, and there's no omitempty JSON tag set on it. So the way this gets serialized is that if you don't actually set maxReplicas to zero or to any value, it's going to get zero-valued automatically by Go, and that becomes the value that gets sent over the wire. And that's why you see what happened. This is pretty surprising, and you should take a couple of things away from this. The first takeaway is that the Go structs are absolutely not safe to be used with apply. They're perfectly fine for create and update.
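The zero-valuing pitfall described above can be reproduced with nothing but the standard library. The struct below is a hypothetical, stripped-down stand-in for the real HorizontalPodAutoscaler spec, kept here only so the example is self-contained; the important part is that MaxReplicas, like the real field, is a plain int32 with no omitempty tag:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HPASpec is a simplified stand-in for the real autoscaling/v1 spec.
// MaxReplicas is a non-pointer int32 with no omitempty tag, mirroring
// the real Go struct's shape.
type HPASpec struct {
	MinReplicas *int32 `json:"minReplicas,omitempty"`
	MaxReplicas int32  `json:"maxReplicas"`
}

// encode builds a request body in which we only intend to set minReplicas.
func encode() string {
	min := int32(0)
	out, _ := json.Marshal(HPASpec{MinReplicas: &min})
	return string(out)
}

func main() {
	// maxReplicas:0 is serialized anyway, because Go zero-values
	// non-pointer fields and nothing suppresses it.
	fmt.Println(encode()) // → {"minReplicas":0,"maxReplicas":0}
}
```

Used as a create or update body this is harmless, but as an apply body it asserts ownership of maxReplicas and sets it to zero, which is exactly the outage scenario described above.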
They were designed for those use cases, but apply has an additional constraint: you need to send only the fields that you care about. So all the defaulting and zero-valuing that can happen in the Go structs, which is safe for create and update, is not safe for apply; you cannot use them. There are many other examples throughout the code base, some more subtle than others, but the warning here is the same: there are just too many foot guns in the use of Go structs. You should just not do this. The other takeaway you should hopefully get is that this is a totally fine way to send a YAML file that you've handwritten. You can load that file from disk and send it to server-side apply through the patch operation. That's fine; it's just the Go structs here that are the problem. All right, so what are we going to do instead? We still need some kind of support for controllers, and the Go structs don't work. So in 1.21, what we introduced is a bunch of packages bundled under this applyconfigurations package in client-go. There is a new kind of Go struct that we've generated, which we call an apply configuration, where all the fields are optional and they're all pointers. So if you don't explicitly include a field, it's just not going to be sent to the server at all. This solves the immediate problem that we described. We added some conveniences around these newly generated types. Each one has a constructor, so you can very quickly construct an object; it automatically sets the name, the kind, and the API version for you. We also have these builder functions. Since all the fields are pointers, we provide these utility functions to set the fields so you don't have to provide a pointer value; you just provide the actual value. This is a lot more convenient, especially for primitive types. And I'll show some examples of why that would be difficult otherwise in just a moment.
And then, of course, the last thing you do is call apply. We generated a new Apply function right next to Create, Update, and Delete, so you no longer have to go through patch to do server-side apply. Also, this takes the apply configuration type as the argument, so you don't have to do any serialization, and it's really clear what type you're expected to construct and send in here. All right, I mentioned that we needed these builder utility functions to get around some problems with pointers. Let's break that down a little bit. First, there's a limitation in Go where you cannot take an inline pointer to a literal: ampersand zero is not a legal expression in Go. You can assign a variable to zero and then take a reference to it, but then, in this example, you'd have to put a line above your inline Go struct declaration declaring a variable for every single literal that you're going to include in your struct. That's inconvenient. Other people have run into this with Go; this is not a problem specific to our use case. And so there are libraries that people have written to get around this. Kubernetes has a pointer library: you provide a literal and it returns a pointer to that literal. Unfortunately, this doesn't fully solve the problem. Kubernetes has a lot of enum-like types, and you can't pass them to the upstream pointer helpers because they're their own types. They're defined as string under the hood, but in this case the protocol one has corev1.Protocol as its type. And since Go doesn't have any generics, there's no way to write one of these pointer libraries that works for every type. Now, if you still don't like the builder functions and you would prefer to use the Go structs, that's fine; you can actually use the structs. When we generated these apply configuration types, we exported all of the fields, so you can still use the fields directly if that's what you prefer.
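To make the pointer and builder discussion concrete, here is a minimal sketch, using a hypothetical stripped-down apply configuration type rather than the real generated one, of why every field is a pointer with omitempty, and what the generated With* builders and per-type pointer helpers buy you:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Toy per-type pointer helpers in the spirit of the Kubernetes pointer
// library: without generics, each concrete type needs its own helper.
func Int32(v int32) *int32    { return &v }
func String(v string) *string { return &v }

// HPASpecApplyConfiguration is a hypothetical, simplified apply
// configuration: every field is a pointer and tagged omitempty, so an
// unset field is never serialized at all.
type HPASpecApplyConfiguration struct {
	MinReplicas *int32 `json:"minReplicas,omitempty"`
	MaxReplicas *int32 `json:"maxReplicas,omitempty"`
}

// WithMinReplicas mirrors the generated builder style: it takes a plain
// value, handles the pointer for you, and returns the receiver so calls
// can be chained.
func (c *HPASpecApplyConfiguration) WithMinReplicas(v int32) *HPASpecApplyConfiguration {
	c.MinReplicas = &v
	return c
}

// build constructs an apply body that sets only minReplicas.
func build() string {
	spec := (&HPASpecApplyConfiguration{}).WithMinReplicas(0)
	out, _ := json.Marshal(spec)
	return string(out)
}

func main() {
	// Only the field we explicitly set goes over the wire;
	// maxReplicas is absent instead of zero-valued.
	fmt.Println(build()) // → {"minReplicas":0}
}
```

Note how `&0` would not compile, which is why the builder takes a value, and how the unused field simply disappears from the serialized output instead of leaking a zero.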
But I'll warn you: every field is optional, so there are a lot of pointers to deal with, and you might end up changing your mind if you go that route and actually find the builders pretty helpful. The other thing I'll say is that in the future this could change. Go could add support for generics, or it could fix the support for inline pointer literals, in which case this might become a lot more convenient and the builders might not be as useful. If any of that happens, since we've already exported all the fields, you could just start using that new functionality. So that way we've made our implementation future-proof while at the same time trying to make things as convenient as possible for people today. All right, so that's how you use apply from controllers in the basic case. Let's talk a little bit about migrating controllers to server-side apply. In the basic case, you've got a controller that just has a single reconciliation loop. That's actually really easy to migrate to apply. Usually you just find the one update call you've made and, instead of calling update, you call apply and set all the same fields that you were changing when you called update. You do it unconditionally, right? Even if a field hasn't changed in the current state of the object, you still include it in the apply. One of the properties of apply is that if you don't include a field in your apply request, then the server interprets that as you not caring about that field anymore. And if you're the owner of that field and you don't care about it anymore, the server believes it's safe to delete it. So in code where you're doing an update, if you have something like "if the field has changed, then set it", you wouldn't do that. You would always set the field to whatever value you want. That's kind of the only caveat here. Other than that, it's really straightforward. Now, there are much more complicated use cases.
We've seen a number of controllers that have multiple code paths, where in some cases they update one set of fields and in another code path they update a different set of fields. For that, we've added a special feature to try to make those easier to migrate, and I'm going to walk through it. Typically, the code you have when you're using update is: you read an object, you modify it in place, and then you update the object. If you want to convert those kinds of controllers to apply, this is a simple approach you can take. The first thing you do is read the object, just like you did before. The next thing you do is something called extract. There's an extract function for every single apply configuration type; you provide the object that you've read plus the name of the field manager that the controller identifies itself as. What this function does is reconstruct an apply configuration based on the field manager state from the server. So if you previously applied fields A and B and then you call this extract function, it's going to reconstruct an apply configuration with A and B again. As long as no other field managers have claimed ownership of those, that's going to be your apply configuration. So now you can go ahead and change any fields you want and then call apply again. This works great if you have a controller with multiple code paths where one path changes some fields and the other path changes others, because in both cases they're going to build up the full apply configuration from the server's state before they apply their changes, so no fields get dropped. The next thing you do is set any fields that you care about, the same exact way you would if you were doing an update. Then the last thing you do is call apply. You set force to true because you're unconditionally changing the fields to the values you want them to be. This is typical for controllers.
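The read-extract-modify-apply pattern can be illustrated with a toy model. This is emphatically not the real client-go implementation (the real Extract functions read the object's managedFields entries on the server's copy); it is just a flat-map sketch, with hypothetical field paths, of the semantics: rebuild the partial document a given field manager owns, mutate it, and apply that:

```go
package main

import "fmt"

// extract is a toy model of what the generated Extract functions do
// conceptually: given the full object state (flattened field paths) and
// the set of paths a particular field manager owns, rebuild the partial
// document that manager would apply.
func extract(full map[string]int32, owned map[string]bool) map[string]int32 {
	partial := map[string]int32{}
	for path, value := range full {
		if owned[path] {
			partial[path] = value
		}
	}
	return partial
}

func main() {
	// Full object as stored on the server (hypothetical flattened paths).
	full := map[string]int32{
		"spec.minReplicas": 2,
		"spec.maxReplicas": 10, // owned by some other manager
	}
	// Our hypothetical controller previously applied only minReplicas.
	owned := map[string]bool{"spec.minReplicas": true}

	// Read-extract-modify-apply: extract our owned fields, mutate them,
	// and the result is the (minimal) body we would apply.
	cfg := extract(full, owned)
	cfg["spec.minReplicas"] = 3

	fmt.Println(cfg) // contains only spec.minReplicas
}
```

Because the extracted configuration already contains every field this manager owns, neither code path can accidentally drop a field the other path set, and fields owned by other managers (maxReplicas here) are never touched.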
The one nice thing about calling apply here is that you don't have to worry about optimistic concurrency control. If you're doing an update, there's a possibility that somebody did a write between when you did your read and when you do your write, in which case your write is going to fail and you're going to have to retry: read again, try reapplying the changes you want to make, and then try to write again. You could have to loop and do that as many times as you hit conflicts. With apply, you don't have to do that. All the merging happens on the server; you just send the request once and you're done. You're also sending a minimal request with just the information you need to change, whereas updates carry the entire state of the object. So there are some real advantages. Let's recap. If you're migrating a controller to apply and the controller is really simple and straightforward, just rewrite it to use apply. You don't need to use this extract function. It's easy, it's safe, it's efficient; it's pretty low risk. If the controller has multiple code paths, where some code paths change some fields and some change others, that's a good time to use extract, at least as a first step in your migration, because you avoid the risk of accidentally deleting fields. And that's a really big caveat to be aware of when you're working with controllers: if you forget to include a field when you apply, the server will probably delete that field. If it's a required field, you'll get an error instead. That's a real danger, and it's something this migration path allows you to avoid. And that's it. Thank you so much for listening, and I'm going to hand it back for the rest of our presentation.

Hi, everyone. My name is Abu Kashem. I'm an engineer at Red Hat and a contributor to SIG API Machinery. Today, I want to share with you some practical experiences with priority and fairness.
First, let's do a refresher on priority and fairness and how it is typically configured in a cluster, and then I will share some of the practical experiences. Priority and fairness (APF) is a new self-protection mechanism for the Kube API server. Self-protection implies that we need to prioritize cluster-critical, self-maintenance requests. We also need to prevent a flood of inbound requests from overloading and potentially crashing the API server. Priority and fairness has been available since 1.18, and it has replaced the max-in-flight filter. The max-in-flight filter limits the total number of executing requests at any time. The Kube API server provides two command-line options through which a cluster operator can configure the concurrency limits of the API server: max-requests-inflight, the maximum number of read-only requests in flight allowed at any given time, and max-mutating-requests-inflight, the maximum number of mutating requests in flight allowed at any given time. When either of these two limits is exceeded, the server will reject the request, instructing the client to retry. If we sum up these two max-in-flight limits, we get the overall server concurrency limit. Priority and fairness improves upon the max-in-flight filter by using the rule-based FlowSchema API: we can classify incoming requests into a matching priority level. Each priority level has its own concurrency limit and queue configuration. This is the flow in action: a request arrives at the API server; APF iterates through the FlowSchema objects and selects the one that matches the request. The FlowSchema object has a link to its associated priority level. A shuffle-sharding technique is used to pick one of the queues from the matching priority level. The request is then put into the designated queue, and the scheduler pops requests from the queues and dispatches them using a fair queuing algorithm.
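The shuffle-sharding step mentioned above can be sketched in a few lines. This is only a toy illustration of the technique, not the real dealer in k8s.io/apiserver (which uses a stronger hash and a proper card-dealing algorithm): each flow is hashed to a small "hand" of queue indices, and a request joins the shortest queue in its hand, so one noisy flow cannot occupy every queue another flow might use:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hand deterministically picks handSize distinct queue indices for a flow.
// The odd step constant is coprime with any power-of-two queue count, so
// the probe sequence visits every index and always terminates.
func hand(flow string, queues, handSize int) []int {
	h := fnv.New64a()
	h.Write([]byte(flow))
	base := h.Sum64()
	const step = 2654435761
	picked := []int{}
	used := map[int]bool{}
	for i := uint64(0); len(picked) < handSize; i++ {
		idx := int((base + i*step) % uint64(queues))
		if !used[idx] {
			used[idx] = true
			picked = append(picked, idx)
		}
	}
	return picked
}

// enqueue puts a request on the least-loaded queue in the flow's hand
// and returns the chosen queue index.
func enqueue(lengths []int, flow string, handSize int) int {
	best := -1
	for _, idx := range hand(flow, len(lengths), handSize) {
		if best == -1 || lengths[idx] < lengths[best] {
			best = idx
		}
	}
	lengths[best]++
	return best
}

func main() {
	lengths := make([]int, 8) // e.g. 8 queues in one priority level
	// A noisy flow fills only the queues in its own hand...
	for i := 0; i < 6; i++ {
		enqueue(lengths, "noisy-flow", 2)
	}
	// ...so a quiet flow with a (mostly) disjoint hand is unaffected.
	q := enqueue(lengths, "quiet-flow", 2)
	fmt.Println("quiet-flow queue:", q, "lengths:", lengths)
}
```

The point of the technique is isolation: with high probability, a misbehaving flow collides with only part of any other flow's hand, so fair dequeuing can still serve well-behaved flows.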
APF takes the server concurrency limit and divides it among the priority levels appropriately. Now let's take a quick look at the API resources for APF. Each FlowSchema maps to exactly one priority level; multiple FlowSchema objects can be associated with one priority level. Rules inside a FlowSchema object dictate whether an incoming request matches that FlowSchema. The cluster operator can allocate a concurrency share for a priority level and also specify queuing configuration. A vanilla Kubernetes cluster ships with a set of bootstrap APF configuration objects. This chart shows what percentage of the server concurrency limit is allocated to each priority level. For example, we reserve about 12% of the server concurrency limit for Kubelet traffic via the system priority level. Let's go through some of the experiences that we had. We ran into an incident where the Kube API server was overloaded but somewhat functioning. If you probed the healthz endpoint from inside the API server, it would have returned an okay response, but the Kubelet health check on the API server was failing with 429 Too Many Requests. Consequently, Kubelet killed and restarted the API server instance. This is not good: the cluster is degraded as it is, and Kubelet killing the API server makes it worse. We knew that Kubelet liveness checks are anonymous, and further investigation revealed that all anonymous requests are assigned to the global-default priority level, which has a lower concurrency limit. Due to the degradation, the global-default priority level was already saturated, and consequently the Kubelet health check was being throttled by APF, resulting in 429s. The resolution was simple: Kubelet should not kill an overloaded API server, so we added a new APF rule that exempts all health probes, for example livez, readyz, and healthz. The next one is priority inversion. Many operators extend their clusters with aggregated API servers and admission webhooks.
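A rule like the one described above can be expressed as a FlowSchema that routes health-probe traffic to the exempt priority level. This sketch is modeled on the bootstrap "probes" FlowSchema that upstream Kubernetes later shipped; treat the exact precedence value as illustrative rather than prescriptive:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: probes
spec:
  matchingPrecedence: 2          # match early, before broader schemas
  priorityLevelConfiguration:
    name: exempt                 # never queued or rejected by APF
  rules:
  - nonResourceRules:
    - nonResourceURLs:
      - /healthz
      - /livez
      - /readyz
      verbs:
      - get
    subjects:                    # covers anonymous Kubelet probes too
    - kind: Group
      group:
        name: system:authenticated
    - kind: Group
      group:
        name: system:unauthenticated
```

With this in place, an overloaded global-default priority level can no longer cause health probes to fail with 429 and trigger a restart loop.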
In this slide, I'm going to walk through an example of priority inversion where an aggregated API server is involved. A user sends a request, labeled A in this diagram. The aggregation layer inside the API server forwards the request to the corresponding aggregated API server. In order to serve A, the aggregated API server spawns a new request, labeled B. Now, when B arrives at the API server, APF is not aware that B is in the same request chain as A. If B has a lower priority, it means A has a higher chance of being rejected. Ideally, B should have a higher priority than A, since A is already executing. For now, we work around priority inversion by first identifying these spawned requests originating from the aggregated API server, and we then create an APF rule to mark them as exempt. For example, if you have delegated authorization, the aggregated API server will send a subject access review to the Kube API server in order to determine whether A should be allowed or denied. So, to avoid priority inversion, we create a rule to exempt all subject access review requests originating from the aggregated API server. The next one: we allow the cluster operator to set the server concurrency limit via the max-in-flight settings. What if it is set at a much higher threshold than the actual server capacity? For example, the number of CPU cores allocated to the API server may not be enough to sustain the load at the concurrency limit. Let's look at the graph. The x-axis represents requests in flight, or concurrency; a higher value indicates a higher load on the API server. The y-axis represents CPU usage. If the CPU usage goes beyond the yellow line, we will start experiencing cluster degradation. The blue line depicts how CPU usage relates to increasing load; where the blue line meets the yellow line is the saturation point. Ideally, the server concurrency limit should be below the saturation point.
Since the actual server concurrency limit set by the cluster operator is well above the saturation point, degradation starts well before APF has an opportunity to protect the API server. Ideally, we need to tune the server concurrency limit based on the number of CPU cores available. If we can find a good heuristic for how many concurrent requests should be allocated to one CPU core, that makes it easier to calculate an effective server concurrency limit. The graph you see in this slide is an actual Prometheus query screenshot taken from an HA cluster with three Kube API server instances. It shows the number of registered watches over time, per API server instance. Each API server is represented by a colored line. A gap in a line indicates downtime for that API server. The cluster was degraded for some reason. Let's focus on the highlighted area. The blue instance had about 2,000 watches established. It dies due to some underlying condition, and the watches from the blue instance re-establish on the green instance almost immediately. A watch storm like this can overload or even crash an API server. Watch requests were originally outside the scope of APF, but since 1.22, APF accounts for the initialization part of watch requests. Okay, here on the left, you have an overloaded priority level: requests are being rejected since the concurrency limit has been exceeded. On the right, you have an underutilized priority level: the number of currently executing requests is far below the limit. Even though the server has more capacity to spare, the overloaded priority level cannot borrow from the one that is underutilized. This is because each priority level manages its concurrency pool independently of the others, and there is no concept of borrowing yet. This slide is about the cost of a request. Today, priority and fairness assigns the same cost to all requests. Even though the cost of requests varies, this is not fair.
Starting with 1.23, APF will estimate the cost of list and mutating requests like create, update, patch, and delete. Today, irrespective of its cost, every request occupies one seat of concurrency from the priority level while it is executing, so a list with 1,000 items is treated the same as a simple get. Starting with 1.23, APF will estimate that a certain list request might occupy more than one seat of concurrency from the priority level, since it will draw more power from the server. This estimation is back-of-the-envelope in the initial release, but we plan to improve it in future iterations. The last one is about retry. When APF rejects a request, it sends a 429 status code. It means too many requests, and the client should retry later. APF also adds a Retry-After header in the response as an indication that the client should retry after n seconds. client-go is built to automatically retry a rejected request, but we found out that watch and stream did not have the retry mechanism enabled. We added retry for watch and stream starting with client-go version 0.22. That's it. Thank you for listening in.

Hello again. I hope you have enjoyed Joe and Abu's talks as much as I did. I truly always learn a lot when I listen to my colleagues, and it's one of the things I like the most about this community and this SIG. Thank you both, Joe and Abu, for presenting today. Now, to close, let's get some quick info about the SIG. We meet every two weeks on Wednesdays. The agenda is public, and the topics are usually very interesting. On top of that, we meet twice a week for regular bug and PR triages. There is no need to be an expert to join either of the meetings; everybody's welcome. The triage meetings are a great, concrete way to get familiar with what's going on in the day-to-day of the SIG. The SIG meetings are recorded, and they are available in our YouTube playlist. To get all the invites, you just have to join the mailing list, and then you will receive them automatically.
SIG API Machinery also owns the API Expression Working Group and the Kubebuilder Working Group. These two have separate meeting schedules, and they have dedicated Slack channels too. The leadership of the SIG is David Eads and Stefan Schimanski from Red Hat, and Daniel Smith and myself from Google. Finally, I will leave you a list of useful links so you don't have to go searching in case you need them. I hope you have enjoyed our presentation today. I want to thank everybody for your time and for coming and showing your interest. Thank you so much.