Cool, thank you very much for coming, everyone. We're going to get started pretty promptly because we've got a lot of content to get through, and everyone's got quite a schedule today I'm sure, so thank you for coming. This is a talk on API evolution with CRDs: best practices for authoring CRDs, fuzz testing, all sorts, and how to maintain your CRDs. So, I'm James Munley, I'm a field engineer at Apple, which kind of means I do all sorts, and this is my colleague here. Hi, hello everyone. I'm Andrea, I work in SRE at Apple as well, with James; we both work on the Kubernetes platform at Apple, and this talk is really about our journey internally in building and authoring CRDs, and also advising partner teams on how to extend the API using them. And the journey with CRDs really starts with the authoring part: like any API, when you get to design and implement it, you have to think about how to do things, and the first challenge is really about the modelling. In the context of Kubernetes there are a lot of examples, and the community is quite big. At times too many: it can be overwhelming to find the right information and the best practices. Something that to me is also quite funny is that often, when we write code, we're used to looking into the standard library as a reference for best practices. But Kubernetes has been here a very long time, and the core resources are not always the best representation, because there are some trade-offs and the community has evolved its understanding of the API over time. So the core resources may not always be the right place to look at for how to implement things. And there are also other challenges related to the design of the CRDs: do we model everything in one single resource, or do we use more?
And Kubernetes actually gives us this object-ref field pattern, which helps us break up CRDs, but at the same time it really increases the complexity of controllers, because now you need to start handling these references in the code, and it also creates cognitive overhead for the users. But generally, one best practice that we learned is that it's really worth breaking the model down into more resources. A good example, if you think about it, is the Certificate resource: you have one resource that maps one-to-one to a certificate, or, for instance, External Secrets, where resources map to one concept each, and then you can build resources that group these entities together into a concept like, for instance, a store. So yeah, as Andrea says, once you think about what your concepts are, where you have all these different concepts and how they relate to each other, you then get on to probably the thing you actually do first, which is actually writing a schema. And writing a schema can be quite difficult: it takes time to evolve it, and you make mistakes as you go. As you're doing this, the API conventions doc, which I'd hope most of you have probably seen, is something you should definitely go check out. It's really exhaustive, and it's the reference manual for best practices for what a Kubernetes API looks like, and building your API like this makes it feel more natural: people more naturally understand the resources. One of the things that I always say here is to really focus on talking to the community about these things, because it's very, very hard.
As Andrea says, there are a lot of resources out there, and people make mistakes as they go: even v1 APIs still have problems in them, and Pods have been around for years. So I think reaching out to the community is really a big first step. So, writing a complete schema: it's required for v1 CRDs now — we should stop talking about v1beta1 CRDs because they're a thing of the past, we should be looking forward — and it enables really useful things. kubectl explain: who knows kubectl explain here, who's used it before? Yeah. It's a really nice CLI: you just run kubectl explain with your resource name, and it will explain to you all of the different fields that you can specify. Your descriptions are pulled through there, you can display what the valid values are, all of these sorts of things. You can check that you've actually got a full, complete schema by looking at the conditions on your CRD, so kubectl describe on your actual CRD resource will show you those conditions. And once you do this, you've got far better validation: the API server can actually evaluate all of your resources as they're submitted by your end users and make sure that they're valid and that they don't have any typos in there, which is actually a really common problem we see — even capitalisation, or just the wrong word. If users don't know that there's an error there, they end up pushing these things out, and they might make assumptions about how their application or their resource is going to work that just don't hold true. And this matches the behaviour of other core built-in types, which I think is good. Yeah, I think that's a really important piece, because people become familiar with Kubernetes, it's already complex, and we want to really reduce that cognitive overhead. So what can we do with OpenAPI schemas?
First of all, we can perform all this validation and defaulting without expensive network round trips to webhooks — validation webhooks, mutation webhooks — which we'll talk about in a bit. You've got quite a lot of flexibility here. Things like max length on strings: it seems really small, but if you don't specify a max length you can have a completely unbounded amount of data, right up until the etcd limit, which can become really problematic. Same with list items: we've seen people accidentally insert 10,000 items into a huge list, and that can be really problematic for the API server too, especially when you look at managed fields and some of the more advanced server-side apply stuff. Minimums and maximums, again, just help users. Regexes for strings: we can do all sorts there too. Enum values, to make sure that the values your users specify are valid and make sense — that makes your controller code far simpler to write, because you're not having to deal with unknown values. CEL — who's heard of CEL here as well? Common Expression Language. This is out of Google originally, I believe, and it allows you to do some really, really advanced stuff with your CRDs and far more validation. Here you can see we can do comparators: we can say one field has to be greater or less than another, things like that. We can check that certain values are within certain lists, and so on. You've even got set operations in there, and you can make things immutable — `self == oldSelf` makes sure that we can't change anything. Defaulting: if someone submits a resource — think of a Deployment or something — and they don't specify the number of replicas, we can default that to one, because you probably don't want it to be zero. That kind of intent. When a user reads that object back, they can then see that they've got replicas: 1.
If we don't have defaulting, it might be unclear to users, when they actually go to inspect their resources, what's actually going on. imagePullPolicy is another example: if you didn't just happen to know its default, new users are really not going to know what the behaviour is going to be. Doing defaulting in your schema, as opposed to in webhooks and so on, means that these defaults can actually be applied on read operations as well, so for resources that were created previously you can still make sure defaults are applied. Just one other little tip for making APIs that don't confuse people: having fields that default based on the value of another field can lead to really tricky, confusing cases, because someone might submit one thing which then changes the default, and then server-side apply kicks in, another user or another entity starts setting things, and it gets really confusing. So, as a best practice, keep your defaults simple as well. That leads on to a pretty chunky topic, which is versioning and conversion, which I'm going to try and take us all through today. I can't really do one to ten with a whole crowd, but how do you feel about versioning and conversion? Have you looked into it before? Everyone raise your hands if you have? Yeah? Okay. Cool. So versioning is a really, really fundamental principle in Kubernetes. You've seen v1, v1beta1; you've probably suffered the pain of these APIs being deprecated and removed before, and had to deal with all of your users going through it. But at its core, an API version is an endpoint that gives some kind of guarantee: if you go and submit a resource today to this endpoint, it will continue to work in future — we won't change the schema and the shape of that resource. The difference between alpha, beta, and GA: alpha, no rules at all.
We have had to really strongly discourage people from relying and leaning on these alpha APIs. Beta, similarly: we might not change the schema in a breaking way, however that resource can still go away — we can remove it within a few releases — and stating that up front with your users makes it a lot clearer. To be honest, again, be careful with beta APIs, because a migration where you're actually removing something can be super painful; users come to rely on certain functionality and features. We haven't seen a GA API go away yet, as far as I'm aware — we haven't done Kubernetes 2.0 — but CRDs can move a lot quicker: you've got your own project, so possibly we'll see that at some point. Has anyone ever removed a GA API? Cool, nice. That's good, I guess, ish. Conversion enables the API server — in fact, I've got a nice diagram here — to help coordinate different clients that are working at different versions. In etcd (it's hard for me to point up there), all objects are stored in some API version. So in this instance here we've got the storage version set to v1beta1. That means that as a user submits something, they could speak v1, v1alpha1, v1beta1, whatever; the API server will handle converting that into the storage version, v1beta1, and persist it into etcd. And then, if any other client asks for it in a different version, it can handle that translation too. In the internal core types we have the idea of an internal version, which is used as a hub: we write conversions between the external versions that users use and this internal one. Controller-runtime is similar, except it gets rid of the concept of the internal version and instead just uses one of those external versions as its hub, which is a little bit less to think about and deal with.
So, what can we do with CRDs here? Because the API server obviously compiles in a lot of this conversion code for built-in types, and with CRDs we can't compile code into our API servers. So, first of all: nothing at all. We can keep everything v1alpha1, or we can call it v1 from the start and be really careful — I think that's probably a pipe dream, really. Then there's no-op conversion. I know in the Gateway API they've been really trying to push for "let's call this v1alpha1 but treat it like v1" and go for a no-op conversion, so we don't actually do any conversion. We'll see how that goes — they haven't got to v1 yet. I think it's not a bad way to do it, and it really forces you to think about your API up front too. Who's used a conversion webhook, or built one? Who loved it? Yeah. Conversion webhooks introduce a lot of operational risk. They're on the hot path for read operations from your API server, so first of all we need to make sure they're reliable and running all the time. But it really is an extension of your control plane: if you've got a team that manages the control plane, they really need to know about the conversion webhooks that are running, because those can breach SLOs — they can breach everything, they can really bring a cluster down. So it's a great escape hatch if you do need to do something, but in the cert-manager project we had a conversion webhook for a number of releases, and I'm really glad to say we don't have one anymore: we have v1 resources, and I'm really reluctant to ever introduce one again, to be perfectly honest. It's a critical API server dependency, like you say, and also I think a lot of companies and teams don't actually have much experience running these, and they don't fully know what those risks are.
So yeah, key takeaways: try to keep a conversion webhook around for as little time as possible — really push through with your conversions. Keep it to one release if you can; we made the mistake of having a conversion webhook for about seven releases, and yeah, it's a lot of pain. I'd also say consider publishing v1-only variants of your CRDs for those who don't need all of that legacy: they can just start with v1 today if they're picking up your project. And just quickly, before I hand it back over to Andrea: there's been some really interesting work in the KCP project — some of you may have heard of it — about using Common Expression Language, CEL again, to define your conversions in a declarative way on your resources, so the API server can handle the conversion without calling out to webhooks, which avoids a lot of these complex operational risks. It's really interesting work; it's very experimental, as you can see — it's in a pull request, not a merged branch — but it's definitely worth checking out. And over to Andrea to talk about validation and mutation. Yeah, thank you, James. Well, conversion is one aspect of a resource's lifecycle, but we also deal with validating and mutating webhooks. Validating webhooks are generally used to apply additional constraints which cannot be expressed in the OpenAPI specs for CRDs — for instance, even though we'd really advise against it, the state of some fields might only be valid depending on what other properties are set on the resource.
There is no good way to do that with the schema itself, so at times this is something that people need to resort to. But generally, the way we'd really like to pitch validating webhooks is as a way to do policy enforcement on the lifecycle and the properties of the CRD resources that users are supposed to be using. It's really powerful, because in the context of webhooks — and we will see this later — you also have information about the user, the namespace, the service account, and things like that, which you can leverage to implement policies based on which client is making the specific request to create the resource. At the same time, mutating webhooks are a very flexible means of applying additional changes on create and update, and their real advantage compared to conversion webhooks is that they are not executed on read operations. So, in terms of scalability, for most resources there are far fewer creates than reads; even though they carry some of the operational burden that is common to all webhooks, they are generally a better way of dealing with mutations. If we look at webhooks in general, a good example of where they can be used is, for instance, to ensure that on a specific platform all the pods have a priority class set. This is a typical example: it's not enforced in the schema of the Pod object, but you can enforce it specifically for your platform. There are very nice community tools that allow you to implement both validating and mutating webhooks in your cluster; OPA and Kyverno are two examples of such projects.
And the other thing that I believe is really powerful about webhooks is that, combined with the warning API, they allow you to provide early feedback to users on why, for instance, a certain request was forbidden — again, shifting feedback left to the user, rather than relying on specific conditions like Ready or error conditions appearing on the resources after they've been created. Of course, depending on which objects we define these webhooks for, they might come with operational burden, especially regarding scalability. Think about a webhook for Pod objects: it would essentially be consulted every time a pod gets mutated. So the challenge with webhooks is essentially that they sit behind your API server, and depending on the configuration, the API server might be forwarding a lot of requests there, so they affect the end-to-end experience of your users with your API. There are also some challenges, especially with CRDs that are not that mature — when we're still building the schema because we're still learning about the resource — that are related to the lifecycle management of the webhooks themselves. It's really important to upgrade the webhooks before the CRD gets upgraded, and at times this can also be an operational challenge. One thing that we want to mention, which I believe is really interesting, is the KEP that is essentially about, again, using CEL to move admission into the API server. We definitely think this is an interesting feature, because it allows you to directly evaluate these rules and perform these validation and mutation operations in the API server, without requiring any extra component — and that, again, deals with the scalability aspect, the latency, and the operational management of that additional component.
So, back to James for the testing methodologies. Cool. So yeah, we're going to talk a little bit about testing. The one thing I'd say to begin with, with all of this, is to really think about your API server testing versus your controller testing: testing your APIs, and then separately testing your controllers. They're obviously linked, but it's really important, and I think we often don't see enough testing of APIs — controllers are where people go, and then that's that. So I'm going to go through the different strategies for this — unit testing, integration testing, end-to-end testing — with some examples. On the API server side, unit testing: has anyone ever used the round-trip testing stuff? Yep. So basically this will generate completely random values for your objects, convert from one version to another version, and then make sure the object is still the same. It makes sure that it's completely round-trippable, to use the word — it makes sure your conversions are correct, basically. And that's really, really important, because if they're not, then a user reading in one API version might start getting different data. And especially with these conversion webhooks — or any kind of conversion, to be honest — that can lead to a really confusing state for your cluster and for your clients, and it can really start to break things, and it will be subtle, and you won't notice it, and it will be horrible. Schema fuzz testing is a way to ensure that your schema is correct: taking the Go type that you've defined, generating random values, and then applying your schema to that generated instance of your type, to make sure that we're not losing any data again.
So that's effectively applying things like the pruning that the API server does, and validation, to make sure that we have a complete schema that actually accounts for all the different fields in your Go type. Perhaps more of you use webhooks here: webhooks are firmly an API-server-side concern, so unit test them — make sure your validations and mutations have tests. They're possibly a bit simpler to think about, it's a more traditional thing, but make sure we're doing things right. Another one I really like is writing a corpus of actual valid resources, and possibly some invalid ones too. As we evolve things and make changes — projects expand, new people come on board — it can be easy to accidentally make mistakes, and having that corpus as a safety net gives you a bit more confidence that we're not accidentally going to change something and cause existing resources that users have already created to suddenly become invalid. Integration testing: I think a lot of the tests that we write for the API server side can be encompassed in integration tests — basically making sure that we can create a resource in one version and then read it in another. Really simple stuff, but that will catch the majority of the issues that you tend to find. Defaulting: make sure things are set. Validation: make sure it says no if you submitted something wrong. Really trivial stuff, but it's important, because this is the whole point of the API. And e2e testing: I tend to think e2e tests for the API side are more like, if I create this webhook deployment, if I go and create this CRD, does it work?
Have I specified the right ports on my thing? So that's far higher-level, more operational sort of stuff. And then there are things like RBAC too — quite a few things to consider. On the controller side: normally we write Go tests, and if you use client-go and controller-runtime you might be used to interacting with the fake client. We normally test starting from the Reconcile function, which is the function that handles all the object events. One of the caveats we found with the fake client — and we believe this is something very important to test — is that it's very hard to simulate error conditions from the API. What if the API returns you a 429 and tells you to go away? How does your controller behave? What if it returns an error on an update operation? This is actually pretty hard to model with the fake client. We often end up mocking the Kubernetes client in order to handle all of these potential error conditions. But it's a really important aspect, to enforce and test the correctness of the controller's behaviour, because errors might lead to scenarios in which, for instance, controllers get stuck: they cannot progress the reconciliation, or they back off too often and the operation never reaches an end — or worse, they don't back off at all. Yeah, exactly, which is a real problem. And something that's interesting about controller-runtime is testing the Result: is the result what we expect? Do we want to back off in this case? Do we want to return an error, so that controller-runtime handles the backoff for us? Do we want a fixed retry window? What do we want to do? This is something very interesting to model in order to build robust controllers on top of the API.
And one thing that really helps with that, I think, is envtest, which is essentially a kind of evolution of the fake client in the Kubernetes Go libraries. It allows you to spin up a real API server and etcd as part of the execution of your test. It's really powerful, because it allows you to test all of these aspects that exist in the real API server that would otherwise be very hard to test. And if you look at some of the things that are happening in the community — again, CEL, or webhooks, and things like that — it really allows you to glue all of these things together and test the behaviour of the controller with all of these elements, not only with the API resources themselves as they get created, without all these add-ons. And it's really lightweight, which is something pretty incredible. I definitely think this is a must-do for all controllers: being able to test against the API without using fake clients, testing the real behaviour of the API. Also, because the API evolves and new versions of the API server come along, it's important to test the upgrade path, and this is the right way to do that: not only updating the version of the libraries, but also testing the real API server behaviour. The last thing, again about controllers: as James said, e2e testing is mostly a way to validate the behaviour of the controller in a cluster, with all of the other controllers that it might be interacting and coordinating with. If I put my SRE hat on, it's mostly a way for us to make sure that everything still works as we expect — that we don't break the expectations of our users when they create resources and the controller handles them — more than a way to test the controller behaviour itself in the development lifecycle.
It's mostly targeted at the later feedback, which is the operational feedback. And also, as James pointed out, this is where we test our expectations about RBAC permissions and interactions with other components. One thing that is important, especially for our projects: we often provide example instances to users showing how they should be using, or are expected to use, the resources. It's important to keep testing those examples — keeping the documentation accurate — and this is something where, again, e2e testing helps a lot. So, a few closing thoughts — we're about out of time. Definitely rely on the schema as much as possible, and I like to say this again: rely on the schema as much as possible, because it's the simplest thing you can do, without all the complexity and operational burden of additional extensions like webhooks. And regarding webhooks: really be mindful of the operational impact and the cost of running them on the platform. They might easily affect the customer experience; they might be hard to maintain and hard to evolve. So, really, again: rely on the schema as much as possible. There has been very rapid evolution in the last few years in the community around testing methodologies, and most of the code out there still uses the fake client — envtest is now a thing. Really revise your testing and make sure to use envtest in order to cover more of the aspects of the controller's interaction with the API. And as well — and this is something that envtest is not covering — make sure to also test the error conditions and the behaviour of your controllers in case of errors: backoff, 500s, whatever errors the API is returning. envtest is not giving you any real helpers today to test all of these error conditions, so this is where you need to do extra work, maybe mocking the client — but do it, because it's important to ensure the resiliency of the controller.
And your code in production. Thank you.