First, a little bit about myself. My name is Cici Huang, and I'm currently working at Google as a senior software engineer. I just reached three years of contributing to the Kubernetes community. I initially started in SIG Cloud Provider, and I won the community's Contributor Award that year, in 2020. Then I shifted to SIG API Machinery, where I'm currently a contributor with a focus on extensibility features. Additionally, I served as the release lead for Kubernetes 1.25, and I was a release manager for the recent 1.27 release, which we just shipped last week. So today, I'm going to talk about custom resource definitions and the effort we made to make them more self-contained. We've got a bunch to cover today. I'll begin by reviewing the journey of CRDs, then talk a little bit about the Common Expression Language, also known as CEL. Then I'll talk about how we leverage the power of CEL to make CRDs even more self-contained, and about the other areas where we were able to expand that power, such as policy enforcement. I'll leave a couple of minutes for Q&A at the end. Let's get started then. The journey of CRDs. Very early on in Kubernetes, we realized that continuously adding built-in APIs was not going to be a sustainable solution for the wide variety of use cases out there. That's where the custom resource definition comes into the picture. It was initially known as ThirdPartyResource and has been stable since 1.16. It allows users to extend and customize the Kubernetes API with their own resources, and it remains one of the most important extension mechanisms in Kubernetes. Many core Kubernetes functions are now built using custom resources, which makes Kubernetes more modular. In the past, we put a lot of effort into making CRDs as good as built-in types. We added versioning support. We added subresources.
We added OpenAPI schemas, structural schemas, and many other features. But along the way, one topic kept being brought up: validation. So why are validations so critical? I guess the answer lies in the fact that if you don't validate the data that comes into your system, things are going to break later in ways that are hard to reason about, and debugging at that point is going to be much more difficult. So it is important to tell developers what they did wrong at the time the request is submitted and give them the chance to fix it right away. But unfortunately, that's not what was happening with CRDs for quite some time. You just couldn't write all the validations you wanted for your CRD. Since the API server didn't support them, you just had to wait until a controller broke and reported an error somewhere, and then go fix it. I have listed a couple of examples of the validations people really wanted to do with CRDs. For example, they want to enforce a field as immutable. They want to do some cross-field checks. They want to make sure that two fields are mutually exclusive, or they just want a specific format applied to a field. For quite some time, the built-in validation for CRDs was mainly done through structural schemas and OpenAPI v3 validation. For any use case that couldn't be supported by those two, a validating admission webhook was required, which causes a lot of pain. Introducing a production-grade webhook is not only substantial development work but also increases operational complexity dramatically. Since it is basically a separate component added to your system, whenever you introduce a webhook there are a lot of things you need to think about carefully: how to package it, how to release it, how to integrate it with your existing monitoring or alerting systems, how to upgrade it, how to roll it back when needed, what about the latency it adds, how to scale it, and many others.
To make it even worse, webhooks are very easy to misconfigure. A common example would be the failure policy, where you have to choose either fail-open or fail-closed when you set up your webhook. If you go with fail-open, that basically says that if your webhook runs into issues, either becoming unavailable or returning errors, you just let the request through anyway. If your webhook is set up to do a security check, that's clearly a problem. On the other hand, if you go with fail-closed, that basically says that if your webhook has any issues, you reject all the requests that were routed to it. If your webhook is configured to match all pods, for example, you basically lose your control plane availability for pods. I guess over time people have learned to be more cautious with their webhooks, but webhooks still remain the leading cause of control plane outages. So for quite a while, a webhook was the only solution for the functionality we wanted. What could we do here? After some research, we found that the vast majority of the things people want to do with CRD validation are pretty simple. They want to ensure a field is immutable. They want to apply a specific format to a field. Or they want to do some basic cross-field checks. So the question becomes: can we use something simpler than a webhook? Here is the tool we proposed: the Common Expression Language, also known as CEL. Before we even dive into the documentation for CEL, let's first take a look at a couple of examples. I'm not going to read through all three examples, but if you know any C-like programming language, you might find it easy to guess what the code is doing, and your guess is probably right. Here is the description I took from the CEL documentation, which explains it really well: CEL is an open source, portable expression language which implements common semantics for expression evaluation.
It's designed to be simple and efficient. It's a typed language, which comes with a nice syntax checker; you can just run type checks on it. And it's easy to extend and easy to embed. We have successfully integrated it into the Kubernetes data system for both CRDs and native types, and it's worth mentioning that it has pretty solid adoption now. Here are two major limitations you might want to be aware of with CEL. The first is that CEL doesn't come with native support for for and while loops; you have to use comprehension macros for that. CEL also doesn't support if/else conditionals; you have to use the ternary operator instead. A question that might occur to you pretty quickly: is there any utility library in CEL I could take advantage of? The answer, of course, is yes. CEL comes with a standard library together with an extensions library. We also went ahead and built an even further extended library, which is available now in Kubernetes, including things like more list processing, more regular expressions, and first-class support for URLs. Due to time, I'm not able to show many examples here, but please feel free to check out the documentation we just added to Kubernetes in the past couple of weeks, together with the CEL documentation, for more examples and guidance on how to write your own CEL rules. Now let's dive deep into how we can leverage the power of CEL to make CRDs more self-contained. The answer is validation rules. We introduced a new feature called CRD validation rules in 1.23 as an alpha feature, and it has been in beta since 1.25. We are working on promoting it to GA pretty shortly. All the magic is done through one single field, x-kubernetes-validations. Let's take a look at how it can be used. As an extension field we added to CRDs, you can basically write it anywhere under the OpenAPI v3 schema.
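To make the limitations and libraries mentioned above concrete, here are a few illustrative CEL expressions in the style of CRD validation rules; the field names (containers, endpoint, replicas, strategy) and the registry URL are made up for the sketch:

```cel
// No for/while loops: use comprehension macros such as all(), exists(), map()
self.containers.all(c, c.image.startsWith('registry.example.com/'))

// No if/else statements: use the ternary operator instead
self.replicas > 5 ? self.strategy == 'RollingUpdate' : true

// From the Kubernetes extended library: first-class URL support
isURL(self.endpoint) && url(self.endpoint).getScheme() == 'https'
```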
Then you can just start writing your CEL rules under this extension field. When creating or updating a custom resource, it will be validated against the rules you defined. In this example, we want to make sure that the replicas number you set is no greater than maxReplicas. Inside your CEL rules, self is a CEL variable that provides access to the values scoped to the current schema. As an example here, we put x-kubernetes-validations at the spec level, so self is scoped to spec, which gives it access to all the fields underneath, in this example replicas and maxReplicas. We also provide another CEL variable called oldSelf, which refers to the existing data when you're updating your custom resource. If you use it, the CEL expression is called a transition rule, which helps you enforce the immutability of a field, verify that a list is append-only, or do something like a map with mutable values but immutable keys. In this case, the CEL rule we wrote ensures the immutability of the foo field. My colleague Alexander has already written a blog about these immutability use cases; please feel free to take a look if you're interested. Another thing worth mentioning is that you can write multiple CEL rules in multiple places. In this example, the two CEL rules are doing exactly the same check but are scoped differently. The first one is scoped at the spec level, which requires you to first check whether the foo field exists, because the field is optional. But the second one is scoped at the foo field level, so you don't need that existence check, and it will be validated only when the field is present.
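Putting the pieces above together, a schema with both a cross-field rule and a transition rule might look roughly like this sketch; the field names and messages are illustrative:

```yaml
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      x-kubernetes-validations:
        # cross-field check; self is scoped to spec here
        - rule: "self.replicas <= self.maxReplicas"
          message: "replicas must not exceed maxReplicas"
      properties:
        replicas:
          type: integer
        maxReplicas:
          type: integer
        foo:
          type: string
          x-kubernetes-validations:
            # transition rule: oldSelf makes the field immutable after creation
            - rule: "self == oldSelf"
              message: "foo is immutable"
```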
Sometimes you need to place the rule a little higher in the schema tree just to give it access to more fields if you're doing cross-field validation. But we always encourage you to scope your CEL rules as narrowly as possible, which makes them easier to write. As I mentioned earlier, CEL is a typed language, so we support type checking in your CRD. In this case, if you mistype the field replicas, it will be caught immediately when you try to create or update your CRD, and a nice error will be returned. We also support using CEL expressions to specify a human-readable failure message, through the messageExpression field here. Just like with other features added to Kubernetes, when we added CRD validation rules, we wanted to prevent them from being abused. Even though CEL is designed for sandboxed code execution, it's possible to write CEL rules that take too long to run, and we have some safety guards in place for this. We do static analysis and cost estimation ahead of time: if the CEL rule you wrote might take too long to run, we will fail the request when you create or update your CRD, with a nice message. We also have limits set at runtime. On top of that, we don't want to waste time on validations if the resource creation or update request has already been canceled, so we also support context cancellation. With CRD validation rules in place, if we take a look at CRDs, there's still one missing piece before we can call CRDs fully self-contained: version conversion. As I mentioned earlier, we added versioning support to CRDs to offer the possibility of indicating the stability level of your custom resource definition or advancing your API to a new version. That means a custom resource could be served at a different version than the version at which it is stored.
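As a small sketch of the messageExpression field mentioned above (the field names are again illustrative), the failure message can itself be computed with CEL from the object being validated:

```yaml
x-kubernetes-validations:
  - rule: "self.replicas <= self.maxReplicas"
    # messageExpression is a CEL expression that must evaluate to a string
    messageExpression: "'replicas must be no more than ' + string(self.maxReplicas) + ', got ' + string(self.replicas)"
```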
To make that possible, custom resource objects must support conversion between the version they are stored at and the version they are served at. The current way of supporting CRD version conversion is through webhook conversion, which has been stable as a feature since 1.16, but it is pretty complicated to configure, as you can see here. It also requires a webhook to be in place, which brings back all the pain points we mentioned earlier. Besides that, it also dramatically decreases CRD scalability numbers due to the latency it adds. We found that this falls into a similar pattern: the majority of the use cases are as simple as people wanting to rename a field, put a field in a different location, or change the type of a field. The pain is in having to use a webhook to do so. So can we leverage the power of CEL here? The answer is yes. We know that the kcp project already has a working implementation of this, which we're really excited about, and we hope we can bring it to Kubernetes. Here is a draft KEP for supporting CRD conversion with CEL. Similarly to the CRD validation rules we just talked about, it is going to be specified inside your CRD, which will make your CRD even more self-contained, and we hope it can be a sufficient substitute for the conversion webhook. It's worth mentioning that nothing is fixed yet; it's still at a pretty early stage. We have been collecting use cases and prototyping over the past couple of weeks. We really want to have a general way to do object mutation in Kubernetes, rather than making it super tightly coupled to CRD version conversion. So please feel free to join the discussion if you're interested. Now, with declarative validation and conversion with CEL in place for CRDs, if we really think about it, CRDs will become even more self-contained than the native types. So what are we going to do next?
We plan to catch up on native types with declarative validation. There is an open draft KEP targeting 1.28 already, and please feel free to join the discussion. CRDs also embed native types, for example the pod template, so we believe that with declarative validation on native types in place, it will also benefit CRDs. After the bunch of work we did on CRDs, we started to look for other areas where we could expand the power, and policy enforcement entered the picture. We found that the biggest area where webhooks are used is actually policy enforcement. How do people normally do policy enforcement? They either do it through some internal support, or through a policy engine, for example OPA Gatekeeper or Kyverno, or they have to build their own webhooks to do so. It took us a lot of time to understand what people really want to achieve in this area, and we went ahead and talked to many people. We talked to people who need policy. We talked to the maintainers of the major policy engines. We talked to people who suffered from the webhooks they had to build. We think it's really a big space, and people have already done a lot of awesome stuff there. So what we tried to do is really focus on the things we could do inside Kubernetes to make this one policy enforcement point better with minimal usage of webhooks. The outcome is validating admission policy. We introduced this feature as alpha in 1.26, and it gained a bunch of awesome stuff in 1.27. I'm going to spend a couple of minutes talking about it. What we have learned is that there are usually two major roles involved in policy management work: the policy author and the cluster administrator. They are usually not the same person, and often not even in the same organization. What the policy author cares about is the correctness of the policy. In our case, they will be the ones who write the CEL rules.
They want to make sure the rules they write are correct. They also care about the reusability of the policy, because they want their policy to be able to serve multiple organizations, so they want to make it sufficiently configurable to support more than just one organization. On the other hand, the cluster administrator is more concerned about whether the policy matches the goals of their own organization. They are also concerned about the operability of the policy. We all know that rolling out a security policy can be scary, and they want to make sure it happens as safely as possible. So what we've done is introduce a couple of new Kubernetes resources that align with these different responsibilities. The policy author is responsible for writing something called a ValidatingAdmissionPolicy. Here, in matchConstraints, they define which resources this policy applies to, and it works very similarly to how webhooks work today. Then they can start writing a bunch of CEL rules that express what this policy does, under validations. They can refer to CEL variables like object or oldObject, and params, which is used to make the policy configurable. They then use the paramKind field to define which resource kind they're going to use to parameterize the policy. In this case, they're using ConfigMap, but you can also use other resources, or even a custom resource if you need to. Next, the cluster administrator creates something called a ValidatingAdmissionPolicyBinding. What the binding does is really connect the policy to their cluster. Here, policyName says which policy they're binding to their cluster, and paramRef says which object they're using to parameterize the policy. They can also use matchResources to further constrain which resources in their cluster the policy applies to.
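A minimal sketch of the policy author's side as described above; the policy name, the apps/deployments resource rule, and the max-replicas check are illustrative, and the API shown is the v1alpha1 version from 1.26:

```yaml
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: replica-limit.example.com
spec:
  failurePolicy: Fail
  # the kind used to parameterize this policy
  paramKind:
    apiVersion: v1
    kind: ConfigMap
  # which resources the policy applies to, similar to webhook rules
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    # object refers to the incoming object; params to the bound parameter object
    - expression: "object.spec.replicas <= int(params.data.maxReplicas)"
      message: "replicas exceeds the configured maximum"
```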
In the example here, the cluster admin really just wants this policy applied to the test namespaces, with the max replicas number set to three through their param resource. And of course, the cluster admin can go ahead and create a separate binding which binds this policy to their production namespace and uses a different number for max replicas. Lastly, the cluster admin can use the validationActions field to declare how they want validation failures of the policy to be enforced; we'll talk about that later. And if you really think there's no need for parameterization in your case, you can easily simplify things by removing the param-related fields; the cluster administrator then just needs to create a single binding to the cluster, and you're done. Due to time, I might not be able to touch on all the details, but I do want to mention a couple of best practices we would recommend here. The first one is, of course, parameterization. As we talked about earlier, we really recommend introducing param resources as a way to improve configurability. It greatly improves the reusability of your policy and lets you fine-tune its usage to better align with the specific goals of your own organization. The second is the ways you can specify which resources the policy applies to. As we mentioned earlier, the policy author can use matchConstraints in the policy to define which resources they want the policy to apply to, and the cluster administrator can use matchResources in the binding to further constrain the resources it applies to. And in 1.27, we added something called matchConditions, which gives you the power to do even more fine-grained request filtering using CEL expressions.
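The cluster administrator's side might then look roughly like this sketch; the binding name, namespace selector label, and param object name are illustrative:

```yaml
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: replica-limit-test
spec:
  # which policy this binding connects to the cluster
  policyName: replica-limit.example.com
  # how validation failures are enforced (Deny, Warn, and/or Audit)
  validationActions: [Deny]
  # the object that parameterizes the policy,
  # e.g. a ConfigMap whose data has maxReplicas: "3"
  paramRef:
    name: replica-limit-test-params
    namespace: default
  # further constrain the policy to the test namespaces
  matchResources:
    namespaceSelector:
      matchLabels:
        environment: test
```

A second binding for production could point at a different param ConfigMap with a different maxReplicas value, reusing the same policy unchanged.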
Previously, it was sometimes tricky if you wanted to match multiple resources: you had to either specify every resource you wanted the policy to apply to, or use a wildcard, which basically matches everything. With the matchConditions field in place, it becomes easier, as the example here shows. You can say you want this policy to apply to everything except RBAC requests, for example, or whatever you really need. The third one is the way to define the expected behavior when something goes wrong. The failure policy defines how to handle failures of the admission policy, and failures could occur because something is wrong with the CEL itself, something went wrong at runtime, or you have some misconfiguration. We previously mentioned that using fail-closed in webhooks can negatively impact cluster availability, but fail-closed is sometimes very useful, especially when you're enforcing compliance. So I want to specifically mention that the in-process admission control we provide has a fundamental advantage over webhooks: it is far safer to use in fail-closed mode, because it removes the network as a possible failure domain. It's also worth mentioning that if you set your failurePolicy to Fail, which is basically fail-closed, then failures are enforced in the way the validationActions define. On the other hand, cluster admins are given the power to specify how validations are enforced using validationActions in the bindings. If a validation evaluates to false, it is always enforced according to these actions. We previously talked about how tricky a rollout can be, and cluster admins really want to make sure it's done as safely as possible. Sometimes cluster admins might want to roll out a bunch of policies without knowing all the details inside those policies.
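The matchConditions idea above might look like this sketch in the policy spec; the condition name is made up, and request refers to the attributes of the incoming admission request:

```yaml
matchConditions:
  # skip this policy entirely for requests to the RBAC API group
  - name: exclude-rbac-requests
    expression: "request.resource.group != 'rbac.authorization.k8s.io'"
```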
So here we offer states in which the rolled-out policy cannot cause rejections, and the cluster admin can monitor metrics, warnings, or audit events along the way until they think it's ready to actually deny requests. The last one is the authorization check. We added the ability to perform authorization checks for the user making the admission request, through the authorizer variable. For example, we can check whether the principal submitting the request is authorized to access certain paths or resources, or even perform the check as a service account. I might not have enough time to talk about it in all its details, but my colleague Joe Betz has written very nice documentation on the capabilities the authorization check provides, together with a lot of examples. I'm really excited to share that the whole ecosystem has become aware of the effort we've made. I heard our feature mentioned a couple of times in other talks at KubeCon in recent days as well. Major policy engines such as OPA Gatekeeper and Kyverno are working on adoption already, and the maintainers of Kubescape have already tried our feature and wrote a nice blog about it, which was just published last week. We understand that this feature is still in alpha and it might take a long time before it's ready to be used in production. So what if people want to try it early? We'll be offering an out-of-tree implementation of the same functionality for whoever is interested in trying it early, and we also hope this will help possible future migrations as well. So what are the key takeaways? We all know we have a bunch of use cases we're pretty familiar with, like Deployments and Jobs, which can be covered by declarative APIs. And we have another bunch of use cases which can only be done through extension mechanisms: the things we talked about today, like CRD advanced validations, version conversion, and policy enforcement.
We really think CEL gives us the power to expand declarative APIs to cover a lot of use cases which previously could only be covered by extension mechanisms. What's our next plan? I already touched on a couple of the plans previously, but I want to mention that we also plan to have something called mutating admission policy, to support mutation use cases in admission, and my colleague Andrew Sankem has already gone ahead and raised a draft KEP for that. There is also another draft proposal, raised from SIG CLI, to have client-side validation tooling in place, which offers the ability to do some shift-left validation. So if you're interested, please feel free to reach out. And I'm open for questions. Thank you.