Okay, cool. Apologies for my raspy voice. It's the Midwest air, I guess, coming from Toronto. Welcome to my talk; it's about adopting server-side apply. My name's Dave Protasowski, and I'm gonna jump in. First, an intro: I'm a staff engineer at VMware, potentially soon to be Broadcom, to be determined, ooh, mystery. I work on the Knative open source project. I'm the serving lead, and I'm also on the technical oversight committee there. Some links: I still use the Twitter bluebird, because I like it. I haven't updated my Twitter app in two years, so it still works. For the agenda today, I'm gonna cover what the problem with client-side apply is, give a quick overview of server-side apply, talk about Knative, then how server-side apply fits in and how we can adopt it in Knative, and the status of my work there and the learnings that I have. So client-side apply is essentially what most controllers are doing. It's also what you get with kubectl apply, the default way of applying resources to a server. An example of the problem: let's say we have a config map on the API server, and we have two people trying to update it. The first applier, let's say that's me, updates it, and then another person doing an update can hit a conflict if they have the resource version set and so forth. So what does the other person have to do? They need to redo the apply, but if they don't properly merge what's on the server into their config map, eventually they lose and you have data loss. What that means in practice is you get conflicts; most controllers encounter conflicts. If you see 409s when you do a lot of controller updates, it means you have to refetch and retry the request. And if your refetch-and-retry is naive for whatever reason, you have potential data loss.
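To make the 409 scenario concrete, here's a tiny self-contained sketch of Kubernetes-style optimistic concurrency. The types and names (`store`, `update`, `errConflict`) are invented for this demo; they are not client-go APIs, just a stand-in for an API server guarding an object with a resourceVersion.

```go
package main

import (
	"errors"
	"fmt"
)

// object is a miniature stand-in for an API-server resource guarded by a
// resourceVersion. All names here are invented for illustration.
type object struct {
	resourceVersion int
	data            map[string]string
}

var errConflict = errors.New("409 Conflict: resourceVersion mismatch")

type store struct{ obj object }

// update succeeds only if the caller's copy is based on the latest version,
// mirroring Kubernetes' optimistic concurrency control.
func (s *store) update(basedOnVersion int, key, value string) error {
	if basedOnVersion != s.obj.resourceVersion {
		return errConflict
	}
	s.obj.data[key] = value
	s.obj.resourceVersion++
	return nil
}

func main() {
	s := &store{obj: object{resourceVersion: 1, data: map[string]string{}}}

	// Two appliers fetch the same version, then both try to write.
	v := s.obj.resourceVersion
	fmt.Println(s.update(v, "city", "Chicago"))       // first writer wins
	fmt.Println(s.update(v, "conference", "KubeCon")) // second writer gets a 409

	// The loser must refetch and retry. A careless retry that replaces the
	// whole object (instead of merging) would drop the first writer's
	// "city" field: that's the data-loss risk.
	fmt.Println(s.update(s.obj.resourceVersion, "conference", "KubeCon"))
	fmt.Println(s.obj.data)
}
```

Running it shows the first write succeeding, the stale write failing with the conflict error, and the retry against the fresh version succeeding.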
And a lot of issues that we've encountered: you get infinite churn, where you have controllers fighting each other, one trying to update one thing, the other trying to update another. One sees something's not there, they keep going forever, and then you'll see your observed generation spike up quite a bit. So, server-side apply. I'm gonna steal the definition from the docs; it's actually pretty useful. First, "the fields of a single object": I wanna cover this a little bit in detail. What is an object in Kubernetes? It has a structured schema, and there are endpoints. You can see the structure on the right: you have apiVersion and kind, you have metadata, then you have spec and status, and below that you can see the different endpoints. There are also other endpoints I haven't mentioned, like the scale sub-resource in addition to status, and technically different sub-resources on other types. Multiple appliers: in the example I had before, we have me, and technically that's github me at the bottom, and what are they trying to do? They're each trying to manage the changes they want on the resource. What server-side apply lets you do is, when you use this flag and specify who you are as a field manager, it will merge these in very cleanly, because it knows that, hey, I'm only trying to apply the changes I need, and I'm not trumping and overriding the entire resource. And this is what it looks like when you get the entire resource and show the fields: it's these managed fields that Kubernetes uses to track who has ownership of which fields. So in the first part, you can see young me over here in the blue, I wanted to apply the city, and, let me see if this works.
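For reference, the managedFields from a demo like this look roughly as follows. This is a hand-written sketch with invented manager names, not output captured from a real cluster, but the `f:` structure is what the API server records:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: demo
  managedFields:
  - manager: young-dave        # hypothetical field manager name
    operation: Apply
    apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        f:city: {}             # this manager owns only data.city
  - manager: github-dave       # hypothetical field manager name
    operation: Apply
    apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        f:conference: {}       # this manager owns only data.conference
data:
  city: Chicago
  conference: KubeCon
```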
You can see I own the city property in this config map, and likewise github me down here owns the conference field, so any further updates to those fields won't necessarily cause a conflict. I'm not gonna go super deep into what server-side apply can do; there have already been better talks on it, which you can see on YouTube, and the docs are very good about the semantics of how it works and the details. I'm gonna talk about how it applies to Knative. So what is Knative? Knative is an open source project, a whole bunch of building blocks you can use to build essentially a serverless platform. Founded in 2018, and it's been incubating in the CNCF as of last year. I suggest you visit the website and do the Getting Started tutorials. At a high level, what are all the components? In Knative you have Serving, which does the autoscaling, scale to zero, and runs your workloads; Eventing, which lets you route events from sources to sinks so you can handle them; there's also the client, a CLI that lets you create Knative services and so forth; and then Functions, which lets you turn a function into a container. All these things can be used separately and independently, or together to build stuff. But I'm gonna cover just the Serving bit, since that's where my specialty is, so I'm gonna keep going. I'll run through this really quick. The resource model: at the top, when you do a kn service create, we create a Service object, which is actually just a helper that manages two separate objects, a Route and a Configuration. The Configuration ends up producing and stamping out Revisions. On the Route side, you end up getting a Kubernetes Service and an Ingress, and eventually this boils down to programming some networking layer. And from the Revision side, you end up getting a Deployment and a whole bunch of resources.
And those resources end up creating many more resources. Hey, why is this so complex? Let me show you. It's a pluggable autoscaling system, and that pluggable autoscaler is what captures all the metrics, lets you scale to zero, and so forth. And then the HTTP routes here point to this public service. In terms of actual components that are running: we have the Kubernetes API, we have controllers and webhooks. Eventually, when you create and stamp out the workload as a pod, you have a user container, and we have a sidecar proxy that helps you get metrics and so forth. We wire in the Ingress to serve traffic to that workload. We have an autoscaler that scrapes those metrics and will scale the workload down to zero when there's no traffic. When there's nothing, we need something to receive a request; that's what we call the activator. It sends metrics to the autoscaler saying, hey, we have traffic, and then it'll scale back up to one or more. I'm gonna dig into the controllers a bit. For our controllers, I'll describe the practices we follow, which I consider best practices for Kubernetes controllers. Level-based: you don't react to edges; you should be able to do resyncs of your resources and take action on that. Idempotent: same input should always give the same output. And reconstructive: if I need to recreate the resource from scratch, I should be able to get the same result. This is where it's interesting. The control flow we have in our reconciliation is, honestly, straightforward in theory. You list your resource and get it. If it doesn't exist, you go create it. If it does exist, you compare what's on the server with what you want. If it's different, you do an update; if it's the same, you don't. But guess what, we're not using server-side apply. So what do we have to do?
Oh yeah, and we also have multiple appliers, which means we have some conflicts. The way we get around it, and I put the fire and the poo emoji on the slide because it's a necessary hack in a way, is that because we have multiple appliers, we have to know which fields we're gonna be trumping and conflicting on. So what we actually end up doing is preserving fields from the actual resource onto our desired spec and then doing the comparison. Let me highlight the contention points we have. We have a webhook, and our webhook actually updates the webhook configurations with secrets and client configs and things like that. But unfortunately, there are other things that mutate our webhooks, which is very confusing. There was an initial bug report where Knative didn't work on AKS, because the AKS control plane was actually mutating our mutating webhook configurations, adding an extra selector to exclude some namespaces. There's an issue linked down there; it seems like they're gonna resolve it. Here's another example: we have a Revision that owns a Deployment. What does that mean? It means the controller reconciles the Deployment, and it owns the metadata and the spec. But hey, we have an autoscaler that needs to be able to set the replicas, and we also have some networking controllers trying to set some labels. So the question is: do we still need server-side apply if we have this fiery poo of a workaround? Do we need it? And this is, I think, my favorite issue so far, because the answer is yes. I'll explain why. This issue is from Sasha. Sasha works at IBM, on Code Engine, which is the hosted Knative on IBM's cloud. And what he discovered is that even though we have these checks to do this comparison, it never returns true. So even though we had all these mitigations, and we thought we were being all clever with our flaming-poo workarounds, this check never worked.
And the big reason, as I highlighted, is the defaulting. And Paul helped find that out too, thanks, Paul. So what does that mean? It means that when we create a Deployment, well, guess what, the Kubernetes API is going to default some properties, and we don't really know which ones. Well, you could find out what they are, but then a new version of Kubernetes comes out and you don't know anymore. And technically any webhook can also default properties when you create the resource; it can set something and so forth. So I'll give you an example. When you do a create of a pod and look at the managed fields, you can see, well, I didn't really set the DNS policy, I didn't set enableServiceLinks, I didn't set the restart policy. These are things that come from Kubernetes defaulting. And when you do a regular apply, which is effectively create-or-update, the server assumes your intent is that you own those fields. In contrast, if I do a server-side apply, you can see that, hey, what I applied is actually what I want: I want ownership of just those fields. And this is the comparison: on the right you have the server-side apply, on the left you have the regular update with the defaulting. The next thing I want to cover, now that we've covered the contention points: our serving API was created in 2018, and server-side apply went GA in 2021. What does that mean? Well, if you take a look at our resource graph, you can see all the arrows point in one direction; sorry, it's essentially a DAG. You don't have arrows pointing at each other. And what does that mean? It means one resource usually owns a different resource, and you don't have any shared cooperation. And what's kind of weird, too: why does this HTTP route point all the way to this public Kubernetes Service? Wouldn't it make sense that that might be some shared resource?
And what also ends up happening, because we have this hierarchy, is that some properties that shouldn't be in the autoscaling system are being propagated through our specs. For example, when the Revision creates a pod autoscaler, we say which protocol that Revision is running, and that propagates down because we're creating the Service. Ideally, I think what would make sense is if the Revision could instead create this public Service and specify the container ports and all that, and then this other serverless service would essentially just own the selectors; it can remove the selectors or change them. Then you'd have actual cooperation, and I think that would make a pluggable autoscaler simpler to write if you wanna change it up and do something else, in my mind. So, implementing server-side apply: I'm still prototyping it, so the talk was a little bit of a bait and switch, and I'll tell you why. My rough plan, though, is: I want to replace the webhooks with apply, then move the reconciliation of our Deployment to server-side apply, then the autoscaler, then do performance analysis, and then maybe we can use server-side apply when we create our internal CRDs and so forth. But from the work I've done, these are the learnings that I have. If you haven't played with it yet: in client-go, you have the API types, but when you use server-side apply, you're actually using a distinct set of Go types. They call them apply configurations. The reason for that is that in the existing Go types, you can't tell whether something is empty because you set it to empty. In the apply configurations, everything is a pointer: for a string, unset is nil, and if it's empty, it's a pointer to an empty string. And then there's some tooling to help generate these apply configurations.
Similar to the clientset tooling, there's tooling in the code-generator project called applyconfiguration-gen, and likewise the clientset generator now takes in this apply configuration package. The tooling needs a little bit of work with custom resources. For example, I was trying to add apply configs and update the Gateway API's clientset tooling, and I noticed that in addition to applyconfiguration-gen, you have to run some other tooling: Kubernetes has some special code to process OpenAPI output that feeds into applyconfiguration-gen. So I copied that tooling from Kubernetes into the Gateway API repo so we could do it, but I think it could easily be pulled into code-generator. That's something I've observed. So if you want to see, hey, I want an apply config client and this tooling, I would say look at this PR; it will show you what you need to do at this point in time to get it working with CRDs. The next thing I learned is about unit testing. For those who aren't aware, Knative has its own controller framework; it's sort of a symptom of being built before controller-runtime was around. And in contrast to controller-runtime, we actually do testing with client-go fakes, so we don't spin up an API server or etcd locally. It's really about being able to test things quickly and have a tight feedback loop when we do unit testing. But unfortunately for server-side apply, there's a pull request to land server-side apply support in the fake Kubernetes client in client-go that's been open for a while. I've been talking to Antoine; it'd be great to get that moving along. So I'll probably help review some of that stuff and test it out with Knative at least.
But at least for now, this is what I think is blocking our adoption, because I don't want to do server-side apply without having any unit testing for it. That would be kind of scary. And here's another interesting thing I observed. In this managed field, you can see the operation is Update. That's what happens when your controller does an update, or when you do a kubectl apply, for example. Switching from an update to an apply, even if you own the same fields, requires you to force a conflict. If you go back in time to the slide where I linked the server-side apply docs: when there actually is a conflict between two appliers that want to own the same field, you have to specify that you want to force ownership of it. So if someone else owns it, the server returns a conflict, and if you really want it, you set force to true. I found it a little bit surprising that if you already own the field via the Update operation, you still have to force when you switch to an Apply, but I think that's just how it is. That might be something worth bringing up with the API machinery folks; I don't know what the intent was, it's just something that caught me off guard. Probably the other thing to talk about, too, is the blog post I referenced earlier, which is great. What it recommends as the control flow for controllers is: get the resource, extract your intention from it, because you're able to use the managed fields and the spec to figure out what you own, make the modifications, and apply them again. But I don't think this solves Sasha's problem of the excessive updates.
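To make that recommended get, extract, modify, apply flow concrete, here's roughly what it looks like with client-go's typed clients and generated apply configurations. This is a sketch, not Knative's code: the names "demo" and "my-controller" are made up, and it only runs against a real cluster with the client-go module available.

```go
// Sketch only: requires a live cluster and the k8s.io/client-go module.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsv1ac "k8s.io/client-go/applyconfigurations/apps/v1"
	"k8s.io/client-go/kubernetes"
)

func reconcile(ctx context.Context, cs kubernetes.Interface) error {
	// 1. Get the live resource.
	live, err := cs.AppsV1().Deployments("default").Get(ctx, "demo", metav1.GetOptions{})
	if err != nil {
		return err
	}
	// 2. Extract only the fields this field manager owns.
	desired, err := appsv1ac.ExtractDeployment(live, "my-controller")
	if err != nil {
		return err
	}
	// 3. Modify the intent (here: ensure a label is present).
	desired.WithLabels(map[string]string{"app": "demo"})
	// 4. Apply it back under the same field manager.
	_, err = cs.AppsV1().Deployments("default").Apply(ctx, desired,
		metav1.ApplyOptions{FieldManager: "my-controller", Force: true})
	return err
}
```

Note that, as the talk points out, without some prune-and-compare step this flow still issues an Apply on every reconcile.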
So this is where I'm currently thinking about the best way to get around this problem, and this is the meme on the slide. I think what we need to do is generate the apply configuration, get the existing resource, and extract what we want from it, but we also want a pruned apply config. As an example with the autoscaler: we've actually had people ask in the Knative Serving Slack channel, hey, I want to pause the autoscaler for a little bit, I want to set the replicas to one, just temporarily. And it'd be great if we could hand off ownership, using server-side apply, to a user; they could do some debugging, then hand it back to the autoscaler when they're ready to resume the default autoscaling. I think we need this ability to prune our config based on who owns which fields, and then only do the apply if the pruned config differs from what's on the server now. And overall, this is, I think, a talk that's to be continued. I'd also like to say a huge thanks to the Kubernetes maintainers for landing such an amazing feature. It's not simple to do; it actually seems very complex when you dig into it, with the structured diffing libraries and so forth. There was a maintainer talk; I put it in my schedule, but it actually already happened this morning, so thanks to those who went. And I would also encourage a lot of people to get involved. As I've discovered where server-side apply is, there's a lot of opportunity for contribution and improvement. Antoine, who did a lot of the work, actually had a doc of all the things he had in mind to improve the workflow, like more tooling, et cetera. I've linked that doc here. So I would suggest it to anyone; it's a call to action.
This feature is, I think, 99% there, and there's a little extra 1% that would make it very useful for a broad set of people. And that's all I have. I have my pouty face here, so please don't be harsh on the reviews, or you can be very harsh, maybe I won't read them. And that's it. I don't know if anyone has any questions. I think my takeaway is learning about Knative and how server-side apply can be applied to the problems we have. I hope you see that there are similar problems in the controllers that you're writing, and then we can rally around improving the tooling and so forth in the upstream Kubernetes community. I'll say, I'm assuming you're gonna take questions. I can, yes. And wow, that's the voice of God. Okay. So it seems like there's a big dance to get this stuff right. Do you have any thoughts about how the libraries could be structured to make it easier to not wander off into the weeds? Because right now it seems like there's some choose-your-own-adventure going on. Yeah, I've seen some interesting discussions in the Kubernetes Slack channel, and there's gonna be a talk later today from John Howard, who works on the Istio control plane. I feel like the future, and I think John agrees with this, I think this is his idea, I'm gonna claim it as my own now but it's actually his idea, is that your reconciliation should really just return apply configs. It doesn't actually have to do any updates in the controllers, and then a shared framework takes those apply configs and applies them. And I think this also leads to, yes, there is this concept of needing to force the conflict. So I feel like even the current API could benefit. It's fine to do two calls, but you might wanna control that: hey, I'm okay to own these fields, or I'm okay to relinquish certain fields, but certain fields I'm not.
So really I feel like, as part of that conversation, you need two apply configurations, one that's strict and one that's kind of loose. I don't know if it makes sense to collapse that into a single API call or not, but there's a clear workaround for it. But yeah, I agree that with apply configs you could tweak all the controllers and make it much easier to write controllers, in my mind. Thanks for the question. Hi, I'm David Eads, one of the tech leads on API machinery. I was just gonna ask if you could make time to come to the meet-and-greet on Thursday with your wish list. Yours coincides well with mine. Okay, I would love to talk about it, and I'll let the next guy go. Okay, cool, yeah, nice. Okay, hi. If I understand correctly, if a mutating webhook changes the resource, then in the matching part you accept that change from the mutation; it gets into the managed fields and you accept it as, okay, this was a result of my update, if you are using SSA. But still, if you have lists of resources, like multiple containers in the spec part, and your desired state in the first reconciliation has multiple items in the list, but in a later reconciliation it has, say, one instead of two, then you still see in the server's managed fields that there are two, and you don't know if the second one was added by the mutating webhook or came from your desired state in the first reconciliation. So if it's the mutating webhook that, yeah, I guess to recap, you're saying you can't distinguish between a mutating webhook and something you set yourself if you were to reconcile it later on. I'm trying to think what would happen; I don't know, this is one of those things where I would need to test it, because we're implementing the same thing and it's still a problem. Also, did you find any proper documentation on what these f:'s and v:'s in the managed fields are?
And I know from the developers that one of them is not to be used at all. Yeah, so this is what's interesting. That field, I guess, let me show it for everyone. I'll go quickly. Whoa, this is what happens when I don't do animations. Yeah, so you can see the f: keys and then those v: ones. Essentially, everything under fieldsV1 is considered internal at this point in time. This is something I mentioned to Antoine: in order to do that extract-and-prune, I need the parser for the resource. What does the parser do? It knows which fields live where in the spec and so forth. And I was asking why it's internal, and they just took the cautious approach of not opening it up, because they may want to change it or something like that. So I think if there's no consensus on exposing it, for example, then we might need to generate our own schemas or our own parsers. Technically all the tooling is there and it's all open source, so if you just need to change the package name from internal to not internal, you could do that. But for the schema, I don't know; I think it's all in the structured-merge-diff repo, and I haven't read that repo in detail. I mean, but you still need it for pruning the object, for my check. Right, so if there's no consensus on making that parser public, then we'll probably generate our own parser from the tooling. So yeah. Thank you. Yeah. I'll stick around, because I wanna hear more about that; yeah, I'll put it in my notes. How does this compare to patch? Oh, what's interesting is that even though it's called server-side apply, it actually uses a patch call to the server. I think it's just a special type of patch. So yeah, when you actually look at the implementation of the clients that get generated, there's, I think, strategic merge patch, which will try to merge nicely with an existing resource.
This would just be called, I think, I don't know the exact MIME type, but it's a new MIME type, and the API server knows how to handle this new MIME type to do a proper merge. Okay. And do you have to declare the managers for the fields to be able to use server-side apply? Yes. So that's why, I guess I kind of glossed over it: this field manager. I think by default it probably uses your username when you use kubectl, is my guess, I forget now, but you can override what the field manager is. That's why I joked about who I am here, as a subshell. It's just a string. So for example, for a controller, what we would do for the webhook is just hardcode it as "webhook". And if you look at the manager here, it's just a random string. So in theory, you could have a bad-actor controller for some reason using the same string as you. Actually, that might be very common, because you could deploy something HA with multiple replicas, and if you don't do lease coordination correctly, they're all doing the same updates. But yeah, it's just some random string. When you generate these typed Apply methods on your clientsets, one of the options they require is the field manager. Okay, last question. You may have touched on this, sorry, but controller-runtime has a method called CreateOrUpdate. It takes a mutate function, so when it's updating, it always fetches the latest copy of the object and then runs the mutate. So you're mutating just the fields that you want to mutate, and it applies immediately. The time between fetch and apply is so small that in my experience, if my controller owns those two fields, I'm always updating just those two fields, and the nature of fetch-then-immediately-apply has gotten rid of most of those resource version conflicts. Yeah, exactly. So I would say it's only when you have these distinct appliers working like that that server-side apply really makes sense.
If it's just a single controller doing updates to a single resource or multiple resources, you might not need it. But then again, that AKS example shows you might still need it. And that's multiple appliers changing the same field, right? Technically, well, let me backtrack a little. It's not the same field, because what server-side apply will do is merge ownership nicely. So I would say that server-side apply would still be useful in the example you mentioned. If you have multiple appliers modifying different fields, I think the contention point is not the field but the actual resource. So to recap what Evan's saying, it's like what I highlighted here, where you can have anything: your platform engineering team could include a webhook that does anything, like setting some labels or forcing some field. Because it's so open-ended, the fetch Evan's alluding to, even though you're fetching the resource, technically CreateOrUpdate should be fine, since it does a patch against the existing resource. But if you're doing any sort of comparison, you might recognize, hey, there's a new field I didn't set, and that might trigger the update. Yeah, thanks. Last thing I wanna say: on the internet, the most common advice for resource version conflicts is retry, retry, retry, which is a horrible idea, and that information is all over the place. Yeah. So I hope this reduces the amount of retrying you need. We literally have, in our controller framework, when we update status: oh, we got a conflict, just retry it; we got a conflict, just retry it. We loop like 10 times. Yeah. So I was just looking at the controller-runtime issue on server-side apply, which I think has been open for like two years now or something like that.
And I'm curious if you've thought at all about how this needs to get adopted into something like controller-runtime, which probably a lot of us use for running controllers. Yeah. I candidly have not used controller-runtime. I'm aware of it, I know roughly how it works, I've just been in the Knative controller space, because, one, I helped build it, but also I worry about too much adoption, because it's the same problem as controller-runtime, where I think there are only so many maintainers for all the feature work that needs to happen. So I don't have strict opinions about controller-runtime, but I would suggest asking one of the maintainers. Yeah, I would suspect that's probably hindering adoption pretty heavily, because people are not going to use it if it's not coming in the common library package. Yeah, I'm not advertising that people use the Knative controller framework here. But I think, yeah, if controller-runtime supported this, the adoption would be much higher. So yeah. Oh, time, and we're ending at zero. Perfect. A plus.