All right, I'm going to go ahead and get started. Thank you, everybody, for coming this morning. I'm excited about this talk, and I hope it's interesting to you all. My name is Joe Betz. I'm an engineer at Google. I've been working on Kubernetes for about five years. I spent a couple of years as an etcd maintainer and quite a bit of time as a contributor to SIG API Machinery, where I've been working on extensibility features. Today, we're going to talk about some enhancements that take advantage of something called the Common Expression Language. We're using it to try to address some of the problems people have been facing with webhooks in Kubernetes. I've got quite a bit to cover today. I'm going to start by giving a brief history of webhooks to motivate the problems we're trying to solve. Then I'm going to dive in and give you a more detailed explanation of what the Common Expression Language is and why we think it's a useful enhancement that will help us solve these problems. I'll then dive into two major use cases. The first is CRD validation, which most people would prefer not to use a webhook for. The second is admission control, mostly focusing on policy enforcement use cases. I'll then wrap it up, tie it together, talk about the future, and cover some areas where we could use some help. So let's get started by talking about the history of admission webhooks. These were introduced back in Kubernetes 1.7, roughly five years ago, when I was first starting to work on the project. CRDs at the time were pretty young, and there were a lot of people looking to use Kubernetes for all sorts of things it couldn't do natively. The extension ecosystem was flourishing, but not all the extension points were available at that time. And I went back and looked at some of the documents explaining what people were trying to accomplish when they introduced admission webhooks.
And I think they got the list pretty right, because when you go and look at what webhooks are used for today, it more or less matches this list. They had identified that organizations needed better policy control, that cloud providers needed better extension points to integrate with Kubernetes, and that extension authors needed a better way to perform CRD validation. It's actually a pretty accurate list. At the time, there were a couple of different options. There was a feature called initializers that was in alpha at the time. There were the built-in admission controllers that are compiled into the API server. And there was an early form of admission webhooks that was also in alpha. After some comparison of these options, the decision was made to invest in admission webhooks, mostly because they solved more of the use cases than any of the alternatives. And in some sense, this was a success, because it turned out to be a powerful pressure relief valve. By that, I mean it was an extension mechanism that allowed people to do all sorts of things, both anticipated and unanticipated, and people were successful in implementing a lot of what they needed to build with it. A ton of great stuff has been built, and you can see webhooks in pretty widespread use across Kubernetes. But unfortunately, that's where things started to get into trouble. Reports started coming in: cluster admins and cloud providers were reporting control plane outages from webhooks. They were reporting failed upgrades. They were even reporting failed rollbacks. And it took a while for everybody to start to get their heads around what had gone wrong. But I think there were two major categories of problems that were causing webhooks to cause so much trouble. One is that they're operationally complex. Every time you want to introduce a webhook into a cluster, you're introducing a new binary. It has to run somewhere.
You have to figure out how to deploy that binary, upgrade it, roll it back, monitor it, and have runbooks to deal with any problems with it. It effectively becomes another component in your control plane. So if you have a lot of webhooks, you have a lot of components in your control plane that you now have to manage. The other problem is they're way too easy to misconfigure. I think the most obvious example is that whenever you install a webhook, you have to decide its failure policy: you either choose fail open or fail closed. If you choose fail open, what you're saying is that if my webhook fails, if the binary stops serving requests, either because it's become unavailable or because it starts returning errors, then I'm just going to let requests through anyway, even if that webhook would have rejected them. So if the purpose of your webhook was to enforce some security policy, that's clearly a problem. On the other hand, if you choose fail closed, what you're saying is that if that webhook is having a problem, if it's returning errors or it's unavailable, I'm going to reject all requests that were being routed to it. So if it was matching all pods or all deployments, now I'm going to reject all those requests, and I'm basically losing my control plane availability. Those are your two options, right? That's at the heart of some of the problems we have with webhooks. One way I like to think about this is that the more webhooks you're using in a cluster, the lower the expected cluster availability and the more types of issues you can expect to happen. I think over time, cluster administrators have become aware of this and have grown more and more wary of webhooks, but we're stuck in a situation where there are large classes of functionality that we want, and webhooks are the only way to get them. So the story isn't over. We're going to try to tell the rest of the story. Hopefully we can turn some things around here.
And one of the pieces of good news is that if you actually go and look at what webhooks are doing, the vast majority of them are doing super simple stuff. Maybe they need to make sure that a particular field in a CRD is immutable. Maybe they need to make sure a label works the way they want, so they always check that a label's value conforms to a specific format. Or maybe they have a policy in their company that all pods or containers must have particular values set in a particular way. Most of these share the common property of being expressible in just a line or two of code. So this is a good chance to apply the 80-20 rule. We can say that 80% of these use cases, and probably more than 80%, honestly, could be easily handled by something simpler than running your own full binary with all the power of a complete programming language. There's probably a small percentage that still requires a call out to some other system, right? Having a fallback where we continue to use webhooks is probably not a bad thing, but I think if we can solve this 80% use case, we've already made a huge amount of progress on this problem. So I'm gonna focus on that today. And the tool we're gonna look at is called the Common Expression Language, or CEL. Let me give you just a couple of examples of what that looks like. So here's some code; there are three examples here. The one thing I hope everybody notices is that this is a pretty unsurprising syntax, right? If you've worked in any C-style programming language, you can probably guess what this code does, and you're probably right. I'm not gonna read through all those examples, but I am gonna give you a little bit of access to some of the documentation from CEL. I think the authors have done a pretty good job of explaining what the purpose of this language is.
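A few representative one-line CEL expressions for the use cases just described (these are illustrative sketches, not the exact examples from the slides; all field names are hypothetical):

```cel
// keep a CRD field immutable: the new value must equal the pre-update value
self.storageClass == oldSelf.storageClass

// check that a label's value conforms to a specific format
object.metadata.labels['tier'].matches('^(frontend|backend)$')

// company policy: every container in a pod must run as non-root
object.spec.containers.all(c, c.securityContext.runAsNonRoot == true)
```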
Now, it is only an expression language, not a scripting language, so you're only gonna write a single expression as a program. That limitation causes some problems, but in our case, I think it's more of a benefit than anything. It's easy to run CEL quickly. It comes with a really nice syntax checker. You can run a type-checking pass on it. It's easy to extend, and it's really easy to embed and integrate with other languages. So we've been able to successfully integrate it with the Kubernetes type system very well, both for CRDs and native types, and it works really well. Here are two of the major limitations you should know about CEL right away. The first is that because you can only write a single expression, you don't get any native for or while looping. Instead you're gonna be using a kind of comprehension form; the comprehensions are listed here. Also, you don't have an explicit if-else, so instead you're gonna be using the ternary operator. I just wanted to show a couple of examples of that, because this is one of the less obvious things about CEL. So in this first example, we're verifying that two sets are disjoint: we use the all comprehension to iterate over one set, and then we just check that none of those elements are in the second set. The second example is even more complicated: we take a list of objects, where every object has a priority field, and we make sure that no two of those objects have the same priority value. This is possible to do; it's kind of on the edge of what you can do with CEL, and it's not super obvious how to write this one. I wanted to show a more difficult example to give people an idea of where the limits of CEL might be. Now, one of the things that will probably occur to you pretty quickly is: if I can only write a single expression, that does limit me in many ways, so are there utility libraries I can take advantage of to do more things?
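The two comprehension examples he walks through look roughly like this (list and field names are illustrative):

```cel
// two sets are disjoint: no element of setA appears in setB
self.setA.all(e, !(e in self.setB))

// no two objects share the same priority value: each item's priority
// matches exactly one element of the list, namely itself
self.rules.all(r, self.rules.exists_one(o, o.priority == r.priority))
```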
And the answer, of course, is yes. CEL comes with a pretty good standard library. There are extension libraries for things like strings and regex matching that we're including in Kubernetes. We also went and looked at four or five major programming languages, went through all of the most fundamental utilities available in those languages, and built out an even more extended library that we're gonna make available with Kubernetes. That includes more things for regex and more things for list processing. We also identified that even though you could handle something like URL parsing with regex, that's pretty awkward to ask people to do, so we're gonna provide first-class support for that. Here are some examples of things you can do with CEL using the function libraries. You do get a kind of method-type syntax, or, if you're coming from Go, a receiver-type syntax, where you can use dot to call functions. We have added a couple of special functions, like the ability to check if a list is sorted, because we think that's really useful in validation. And for the URL processing, we were pretty careful; we looked at a lot of examples of how to do that right. If you wanna learn more about CEL, I would encourage you to check out the spec. It's being used in a variety of policy systems. It's been used in some systems that extend Kubernetes. It's being used in other cloud provider systems. It's got pretty solid adoption, and we've been working a lot with the authors. We're pretty confident in it. So next I'm gonna dive into two major use cases. The first use case I'm gonna look at is CRD validation, which all has to do with a single field. So this is that field: the x-kubernetes-validations field, and here it is in use. You can use this anywhere in a CRD starting in Kubernetes 1.25, where we put it in beta. We hope to bring it to GA sometime soon. And what you can do is just start writing CEL rules under this extension in the OpenAPI schema.
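The on-screen example is along these lines: a validation rule placed on the spec object in the CRD's OpenAPI schema (a sketch; the exact slide content may differ):

```yaml
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      x-kubernetes-validations:
        - rule: "self.replicas <= self.maxReplicas"
          message: "replicas must not exceed maxReplicas"
      properties:
        replicas:
          type: integer
        maxReplicas:
          type: integer
```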
So in this example, the self variable refers to the location in the schema where we put the CRD validation rule. This rule is written right under the spec, so self refers to spec, and the spec contains both a replicas and a maxReplicas integer field, so you can access both of those with the dot operator. If you then update your CRD after writing this, any custom resources that you try to write will be validated according to this rule. You can add a message next to the rule if you want to give it a human-readable error. If you mistype the name of a field and try to update the CRD, you're going to immediately get an error when you try to write the CRD. This prevents you from actually updating a cluster with a malformed rule. So we're doing type checking, and it's type checking against the fields below. If you just fix the field name, then you can write the CRD and you're good again. Now, you can put multiple validation rules in a CRD. Here's an example of placing the same validation rule at two different locations: one up at the spec level, and one down on the field foo. They're the same rule, and they do exactly the same thing, but the one on the foo field is more convenient. Because the field's optional, if you put the rule up at the spec level you have to first check whether the field is present, but you don't need to do that if you put it on the field itself, because it will only be run if that value is present. So there are some conveniences you get by choosing where to scope the rule. We encourage you to scope rules as narrowly as possible, because it makes them easier to write. Sometimes you need to put the rule higher in the schema tree so you have access to more fields, if you're doing things like cross-field validation. In addition to the self variable, we give you access to oldSelf. oldSelf is the value before your update. So if you're updating the foo value from four to five, oldSelf will be four and self will be five.
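Sketches of such transition rules using oldSelf, placed on an individual field's schema (illustrative, not taken from the slides):

```yaml
x-kubernetes-validations:
  # immutability: the value may never change once set
  - rule: "self == oldSelf"
    message: "field is immutable"
  # monotonically increasing: updates may only raise the value
  - rule: "self >= oldSelf"
    message: "value may only increase"
```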
You can use this to do things like an immutability check, which is what we're doing here. You could enforce that a value is only monotonically increasing. You could have an append-only list. There's a large variety of use cases for this. My colleague Alexander Zielenski wrote a nice blog entry explaining a bunch of the use cases you can support with this. One of the things that comes up pretty quickly when you think about adding a programming language into YAML is: what types of abuse are we prone to from this? The good thing is that CEL by design prevents a lot of types of abuse. The way it's interpreted means that you can't really break out of the sandbox. The fact that you can't allocate variables means there are tight memory constraints. But there's still the ability to write programs that could take way too long to run. So we're using a kind of three-pronged safety approach against this. The first is that we take advantage of the fact that CEL is not a Turing-complete programming language, so you can use static analysis to figure out its worst-case running cost. We make some assumptions about how big lists can be, from the fact that you're processing YAML and there's a three-megabyte limit on that YAML. So if you intentionally write a CEL rule that does, say, an O(n³) operation on the longest possible list, we're actually gonna detect that when you write the rule into your CRD. We're not gonna allow you to do it. You'll get an error saying that you've exceeded the estimated cost limit, and you're gonna need to either rewrite that rule or give it hints about the sizes of lists and things. So that's kind of our first prevention, and it happens statically when you author CRDs. The second safety mechanism functions very similarly, but it happens at runtime.
So again, we're gonna use this abstract cost-unit measure, and when your program's running, CEL is going to start accumulating the cost. That's platform independent, so it doesn't really matter if your CPU is running slow or you've got a lot of load; it's still going to run the same programs toward the same completion. But if you do hit that limit, we're gonna halt CEL execution to prevent it from running indefinitely. We even have a third fallback, which is that we wire Go context cancellation into CEL, so that if the request is canceled in any way, either because the client went away or because we've hit a timeout, we're also going to stop CEL execution immediately. So we've put a lot of safety nets in place to try to keep CEL runtime under control. Just to summarize: you can use CRD validation rules to write complex validation for your CRDs. You just use CEL, you put it into these x-kubernetes-validations fields, you can do complex multi-field validation, and you can use transition rules to do things like immutability. We think this is a sufficient substitute for the vast majority of things people are using webhooks for when they do CRD validation. We think this is mostly a solved problem. I'm sure people are gonna find corner cases that we can't support; come talk to us, but I think the vast majority of these are pretty well solved. So once we got that in place, we started looking for other opportunities to use CEL. In fact, one of them was the thing I'd wanted to use CEL for in the first place, which was policy enforcement. When we looked at all the things people were using admission webhooks for, this was by far the largest bucket. Everybody was kind of telling us that, generally, the things they were trying to do fell into this category. This is a big and complicated space. There are many systems built to do policy enforcement, and it took us a while to get our heads around a lot of the things that people were trying to achieve. And we looked beyond just Kubernetes.
We talked to other people doing policy enforcement in other parts of cloud infrastructure. There are entire programming languages written around this. It's a big space. But what we did is try to focus down on what we needed to do in Kubernetes to make this possible with minimal use of webhooks. Ultimately, we wrote a KEP for this, called CEL for admission control. You're welcome to check it out. And what I'm gonna try to do next is explain what's proposed in that KEP. We're working on implementing this right now. We're hoping to get our first alpha into 1.26, which should be out in around a month or so. To motivate what we're trying to do: we wanted to make a clear distinction between two major roles in policy enforcement. You have the policy author, and you have the cluster administrator. These are usually not the same person, and they're usually in completely different organizations. The policy author is concerned with the correctness of a policy. So in our case, they'll be writing CEL. They wanna make sure they wrote it right. They wanna test it. They also want to write reusable policies. Usually they're trying to support more than one organization, so they're very concerned with making their policy sufficiently configurable to handle more than just one customer. Cluster administrators, on the other hand, are very concerned with making sure that a policy matches the goals of their organization, and they're very interested in the operability of those policies. Rolling out a policy can be kind of a scary thing, and they want to make that as safe as possible. So what we've done is introduce some new Kubernetes resources that are aligned with these different responsibilities. The policy author is going to be responsible for writing a ValidatingAdmissionPolicy. Here they'll define, in the match constraints, which resources that policy applies to. This is very similar to how an admission webhook works today.
Then they're going to write CEL rules that express what that policy does. In those CEL rules, they can reference both the object and some other fields, and they can reference the params that they use to make this configurable. Last, they define a params kind, which says what type of resource they're using to parameterize their policy. In this case, they're using a CRD. In a simpler case, you could choose to use a ConfigMap. Next, the cluster administrator is going to create what we call a binding. The binding is what connects this policy definition to their cluster, and they do this with a couple of fields. The policy name says which policy they're binding to their cluster. The params ref says which object they're using to parameterize that policy. And lastly, they can have what's called match resources, which further constrains which resources in their cluster this policy applies to. So in this case, the cluster administrator has constrained this policy to just the objects in their test namespaces, by using this environment label on their namespaces, and they've set the max replicas to three for that environment. They could create an additional policy binding for their production environment and set a different limit there. So you can have more than one binding per policy, and you can have as many params as you want. You can share params between policies if you need to, or you can create new ones for each. Now, one thing you might notice about this, which is a bit of a downside, is that this is quite a few resources to create to get something done, right? Here in this example, you're looking at four different Kubernetes resources. We can simplify that. If you don't need parameterization, we can go down to just two resources. If the policy author doesn't specify any need to parameterize their policy, then the cluster administrator just needs a single binding. You're done; we're down to two resources. We can take that even further in the most basic case.
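A sketch of what this pair of resources looks like under the proposed alpha API (the API group/version reflects the KEP's design at the time, and all resource names, param kinds, and field values here are illustrative):

```yaml
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy        # written by the policy author
metadata:
  name: replica-limit.example.com
spec:
  paramKind:                           # the CRD used to parameterize the policy
    apiVersion: rules.example.com/v1
    kind: ReplicaLimit
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: "object.spec.replicas <= params.maxReplicas"
---
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding # created by the cluster administrator
metadata:
  name: replica-limit-test
spec:
  policyName: replica-limit.example.com
  paramRef:
    name: replica-limit-test           # a ReplicaLimit with maxReplicas: 3
  matchResources:
    namespaceSelector:
      matchLabels:
        environment: test
```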
If you're writing a one-off policy, you can write the whole thing just in the ValidatingAdmissionPolicy. Basically, what you're gonna do is inline your binding information into it, and you can do that as a single resource. We think that's useful for cases where somebody just needs to do something simple, a one-off case, and they're not building a larger policy engine or something like that. We're just getting started with this feature, so we're working to get it out in 1.26, which is in development now. In the future, before we bring this to beta, we hope to support a variety of other things. We hope to support more actions than we support now. Right now you can only admit or deny a request, but we don't think that's enough. We want you to be able to do things like create an audit annotation, or send a warning back to the user without blocking them if the policy fails. We want to support what we call secondary authorization checks. This is an interesting case. The example I like to give is: imagine you want to set up a policy that only allows users with a certain RBAC role to change an enum to a particular value. That's actually really easy to do in CEL, because you can check whether somebody's changing something to a particular enum value. So all we need to do is give you some access to the user's permission sets, and you could write a check like that. We do intend to support that use case. We also intend to add a lot more support for the rollout of new policies. As a cluster administrator, when you first introduce a new policy, you probably don't want it enforcing yet. You probably want to just turn it on in kind of a dry-run mode and see what happens. So we're looking at some ways to do that really safely, so you can gather metrics, and you can see what the policy would do before you turn it fully on. There's a fuller list of features here in the KEP.
I would encourage you to check it out if you're interested. This is still in development, so we're very much looking for feedback from the community on what would be most valuable. All right, I'm gonna start drawing some conclusions. The way that I've been trying to think about the work we've been doing is that you've got this established set of use cases for Kubernetes. Many of these are supported by declarative APIs. This is the stuff you're most familiar with: Jobs and Deployments and RBAC. Then you've got this other category of emerging use cases. This is the kind of stuff I've been talking about, things not yet supported by the declarative APIs; we're using our extension mechanisms to handle them. The way CEL fits into this is that it allows us to move this boundary. Now that we believe these use cases are pretty well understood, we wanna move that boundary and make them supported by declarative APIs, so you don't need to get webhooks involved in things like this. There will always be things that you need webhooks for, and that's okay, but we want to lower the barrier to entry to get things done. You should be able to handle a lot of these use cases without the operational complexity of webhooks. CEL moves the boundary for these use cases. We think there are a variety of other use cases, both ones we have and haven't thought of, that would also benefit. I've identified a couple of them here. Right now, if you have multiple versions of a CRD, you need to do CRD conversion, and today you have to introduce a webhook to do that as well. We think that's another good place to use CEL. The KCP project has already been working on an implementation of this. We've been talking to them. We really like the work they've been doing, and we hope to bring it into Kubernetes. Also, I've been talking mostly about CRD validation, but we do validation of all the native Kubernetes types as well.
So there might be a benefit to the developers of Kubernetes in doing that in CEL also. That could have an impact on users if we do it right, because right now, if you want to do a lot of shift-left validation of your Kubernetes YAML, you don't get all the validation rules that are compiled into the API server. But if we were to express those in CEL and then put them in the OpenAPI, then they would be available to tooling much earlier. So a lot of what static analysis tells us for sure would be invalid could be checked much earlier, in GitOps flows and things like that. The other thing I was going to mention is that right now we're focusing mostly on validating admission, but there's a whole other class of admission called mutating admission, and we are interested in supporting those use cases in the future. This would probably be a separate KEP, but CEL does support the construction and manipulation of data in a way that would work for this. It's one of the things we considered when we chose CEL. So we think it's possible, and we hope to do it sometime in the future. If you want to learn more, here are a couple of links. The documentation for CRD validation rules is in the main Kubernetes documentation for CRDs. The KEP has a lot of useful information. You can learn a lot about CEL; just Google for it. You can reach us at the API Machinery mailing list, and we also have our own Slack channel just for talking about CEL. You're welcome to drop in. We'd love to hear from you. I am going to stop there and take questions. So, one of the co-authors of this KEP is Max Smythe, who works on OPA. We have been working very closely with the OPA team to make sure we build something that they can use. Now, it's possible that they might not be able to get all their rules directly into the API server this way, but we believe the vast majority of OPA use would not require a webhook once we do this.
They intend to work on an implementation. I have two quick questions: do you have some tips around how to debug CEL once it's in YAML, and how to test it once it's in YAML? Yeah, so that's a really good question. I think this is something where we're going to have to do some work. The good news is that because we get type checking, you get a lot early on. I have seen people do things like run little kind clusters or minikube as a way to get a control plane and just test things that way. For a stock CRD, you can actually do a lot with that, because you don't need the rest of the cluster. So that might be a way to do it. We are also providing a lot of our CEL stuff as a pretty accessible library. This is something we want to make better over time, but it should be possible to run some CEL in isolation in a Go program. So, for someone trying to extend it: it's extensible in the sense that whoever brings it into a binary can change the libraries it uses. There are some settings you can change. So as Kubernetes authors, we can extend it by adding more functions and things like that. If you were to use it in your own programs, you could also do that. But we don't have any way for you to register new functions at this time, so in that sense, it's not extensible. I hope that makes sense. Thanks for the talk. Could you speak to when this might get integrated into Kubebuilder at all? So, the x-kubernetes validation rules are already there. I don't have it listed here, but if you just look at the OpenAPI extensions available, you can use it today. Yeah. So when you use CEL today with the expressions you were showing, and it refers to self, is it limited to just looking at self, the resource that's updated, or is there a way to do cross-resource validation, like a name must be unique? No, we can't do cross-resource validation today. So that is a limitation, and that is a much harder nut to crack.
So if you have any ideas, let us know. It has come up before, so we have heard about this problem, but it's a harder one to solve. Can you declare these validating policies on a per-namespace basis, so different policies are in effect in different parts of your cluster? So, we think we want to do that before we go to beta. The way it's defined right now, everything is at the cluster level, but we think it makes sense to also have a namespace-local validating admission policy binding. So you could bind it at the namespace level: you could put that resource in the namespace, and it would automatically apply to that namespace. That's something we have also listed as a feature before we go to beta. For the parameters that are referred to by the validating admission policy, is it the author's responsibility to define the CRD for the parameter kind, or will it auto-generate something there? We leave it up to the author to define the CRD. If you don't need a whole CRD, you can just use a ConfigMap, which is appropriate for really simple use cases. Yeah. If a user fails a policy, or if they apply something that fails a policy, how does that get returned to them? Is it just in the terminal right there when they do the apply, and it just fails? Yeah, you'll just see it. If you're using kubectl, you'll see it just like you would most other validation errors. It actually gets surfaced in the same way. Is there a plan to do something like an HTTP send, where you could wire it up to, say, a Slack webhook or anything like that? Sorry, I didn't follow that. So, like, we use ultima. No, I haven't heard of that. I would love to talk to you more about that and understand what the use case is. Thank you for the talk. I was wondering about a specific case where I need to make sure that a service name exists, cross-resource validation: do you think we can see anything like this in a future release? Cross-resource stuff's harder.
We don't have any immediate plans for it. I would love to talk to people and understand more of the use cases. Right now, we're mostly focused on the per-resource case, because that's really useful. I know from the policy admission case, which I talked about second, that there are people very interested in cross-object checks, so I would love to understand more of those use cases. Maybe we can find a way to do something in the future. Right now I'm a little pessimistic, because it's hard, but who knows. We have time for one more, and it was over here. If you have a question about CEL expressions... All right, this one's better. Okay: webhooks and CEL expressions, what's the order of execution? Do they affect each other? Are there any built-in safeguards? Oh, yeah. So the order of execution is pretty well defined. Validation is comprehensive: CRD validation happens at the same phase that CRD validation has always happened, and it's comprehensive. You get back all the errors, right? It's gonna check all your rules. The admission policies happen in the admission control chain, which has a very particular flow. First it runs all the built-in controllers, then it runs the policy ones, and then it runs the webhook ones last. We always run mutators before we run validators, so that your validator is guaranteed to see whatever's gonna be written. So there's a very well-defined order. All right, I think we are at time. Thank you, everybody, for coming. I appreciate it. Thank you.