Good afternoon everyone, and thank you for joining us. For the next 30 minutes we're going to talk about Kubernetes policy management. We're really excited to cover this, and we have quite a lot of topics prepared. We'll start with some quick introductions, then talk about why you need policies in Kubernetes, what Kubernetes policies are, and what types of policies are supported. Then we'll go into some details about specific policy recommendations, including pod security admission. We'll also talk a little bit about PSPs and why they were deprecated, then go into validating admission policies, which is a new type in Kubernetes, compare that with what dynamic admission controllers can do, and introduce CEL, the language used by validating admission policies. So quite a lot of topics to cover. We'll go through these, and of course we'll try to save some time for questions and answers, so we're happy to have the conversation.
To start off, just some quick introductions on who we are and why we care about this subject. I'm Jim Bugwadia, co-founder and CEO at Nirmata. I'm a co-chair of the policy working group within CNCF and also a maintainer on the Kyverno project, which is a CNCF policy engine.
And I'm Andy. I'm the CTO at Fairwinds. I've been working in the cloud native space for about eight to ten years at this point. I'm a maintainer of several open-source projects that you may be familiar with, Goldilocks, Pluto, and Polaris, and also a co-chair of the policy working group as of late last year.
All right, so just to kick things off, let me introduce what we do in the policy working group. This is a community forum, everybody's welcome to join, and we have bi-weekly meetings.
The charter of the working group is to define, or at least catalog, architectures and different policy implementation types, and to provide some guidance on what users should use. We've done things like the policy reporting API, which is one of the initiatives that came out of the policy working group. There are a number of different producers and consumers, including Kyverno, Falco, and Trivy. All of them can report policy results in a common, structured manner, and we're looking at other initiatives as we move forward, including white papers and other things we can do to help educate or inform the community.
So, starting with what a policy is. There's a definition, and one of the things we contributed from the working group is a chapter on policies in the Kubernetes docs. Quite simply, what we think of as policies are configurations that manage other configurations or behaviors. If you think about that, a policy is nothing more than just another config object, a resource. It could be a custom resource, it could be a built-in resource, or some type of control you're providing, which defines behaviors for other things in your cluster or in what you're managing. There's a more formal NIST definition too, but it's very similar: it talks about making sure you're managing expected behaviors. And what we think about here is that it's not just about restricting or validating; you can also generate, mutate, and do full configuration lifecycle management.
So why do we need policies? We have config objects, you can configure them, you can manage them. What's the value policies bring? The one thing to always think about is that, in many ways, Kubernetes is the first platform that really is built for developers, operators, and security to all collaborate, right?
It's a set of standard APIs where all of these roles can interact, and the idea behind policies is: if you have these shared resources, who's managing them? If you have a pod or a deployment, there are parts of the deployment that security might care about, parts that the application developer cares about, and other parts that the operator cares about. So how do you bring all of these together and have them collaborate on the same artifact, on the same resource? That's where policies can add a lot of value, by providing a way to collaborate across those roles.
But Kubernetes is not designed to be secure by default. You need to add some tooling, some other ways to secure it. Policies become an ideal way for these various constraints to be applied, for security teams to get the compliance and results they need, and for operations and development teams to be able to do their work without stepping on each other's toes. So that's the main value. And of course, if you extend that to policy as code, why not take the same GitOps and other best practices we love in cloud native and apply them to policies as well? Now you can do full lifecycle management: manage your policies across clusters, get reports, things like that, and even manage exceptions, all using the same Kubernetes APIs.
So in Kubernetes there are four policy types. Even as we were going through this and adding it to the Kubernetes docs, it took a while to classify these four main types. There are several other ways of configuring policies, but first there are built-in objects like, obviously, network policy, right?
The name itself says policy. It defines what you can do on the networking side, so it matches that description or definition we talked about. So those are built-in objects, and there's RBAC and several others which would qualify as policy objects.
Then you have admission controls. These are flags or configurations on the API server which you can enable or disable; sometimes there are additional settings for these. For example, if you want to have a default ingress class, you can actually specify this on the API server and say, here's my default ingress class. That will apply to your whole cluster and manage that configuration for you.
That is different from dynamic admission controls, which is the next category, where you have tools like Kyverno or OPA Gatekeeper, which are not built into the API server but can receive requests from the API server and apply policies.
And finally, now we have validating admission policies, which is currently beta and moving to graduation. That is also a built-in type, so it allows you to configure the API server with some customizable policy checks, and we'll talk about that too. Next, Andy is going to go over a timeline of how all of this happened.
Yeah. And before you all start squinting and trying to read that, we're going to break it down here in a second. We set out to write this talk and started talking about all the different historical pieces of policy throughout the timeline of Kubernetes, and even we in the policy working group were like, oh, when did that happen, and when did that happen? So we decided to put together this timeline to show where we started, how far we've come, and maybe talk a little bit about what's coming next.
If we go all the way back, before Kubernetes 1.3, and I think this might have been before I even started working on Kubernetes, we had security context constraints. These are a thing in OpenShift.
I actually didn't know what they were until about two weeks ago, so we're going to skip right over that.
Then we get into Kubernetes 1.3 and beyond, in 2016 to 2018, and a few things start happening. We get our first policy engine accepted into the CNCF; this is when OPA went into the sandbox category. Then PSP was introduced and became kind of the standard for a long time, and we'll see that throughout the rest of this timeline. That gave us granular control over lots of different settings in the pod, and we were able to control it via RBAC. It was rather powerful, but it was kind of difficult to use.
So we step forward into the future, 1.4 to 1.20. PSP was refined, maligned, hated. People talked poorly about it, people didn't use it, people didn't adopt it. And so there was a KEP introduced to talk about pod security admission and get us into the future, and that started becoming another thing that people wanted to talk about. OPA moved into the incubating status, so it started becoming much more popular. Kyverno entered the scene in the CNCF, and its GA release came about. And then we got the policy report CRD from the policy working group in, what is that,
2021. So that was kind of the formative time for what we see now as the current state of policy.
Then we step into the last few years, and a lot of different stuff happened across all of these different projects. PSP was officially deprecated and removed from the API, and pod security admission moved to stable. Then we started seeing validating admission policy. Validating admission policy, using the CEL expression language, is now in beta in the current versions, 1.28 and on. Kyverno applied for graduation just last year, so it's becoming much more stable, and OPA also graduated during this time. So we're starting to see all of these different engines mature, and the in-tree policy engines mature as well.
Now, what's going to happen this year and moving forward? Ideally Kyverno will graduate at some point. We're not entirely certain when validating admission policy will move to stable. We'll also see an extension to validating admission policy, which will be mutating admission policy; that will allow us to mutate objects rather than just reject them. And the policy working group is hoping to finish moving the policy report API definition into a more stable state, a more permanent home underneath SIG Auth.
Next, let's talk a little bit about why PSPs got deprecated and what some of the challenges there were. There's a great article, a blog post, written by Tabitha, who leads SIG Security in Kubernetes, which mentions a lot of the usability challenges and problems, and there were prior KubeCon talks on this too. The main challenge, as Andy also mentioned, was that although everybody understood why PSPs are important, they were just hard to configure and use correctly, especially at scale. There were just too many variations, too many ways to trip yourself up and misconfigure things, which is sort of the opposite of why you want a policy, right?
You want policies to help with configuration, not create more problems in your cluster. Diagnosing and troubleshooting was also a major problem with this.
So with the deprecation, as this was being discussed in SIG Auth and SIG Security, a lot of people felt that yes, we need something basic in Kubernetes, but it's also time to acknowledge that there are other powerful policy engines, there are other solutions, so let them do some of the more complex things that are required. And instead of just having these configuration objects, let's think of this almost like a compliance standard. If you look at PCI DSS or HIPAA or any compliance standard, it's a document, it's well defined, and you can go and see exactly what you need to do.
So one great thing which emerged out of all these discussions is pod security standards. This is part of the Kubernetes docs itself. It's maintained by the Kubernetes maintainers for every release of Kubernetes, and it defines about 17 controls for pod security. Everything we're talking about here is for the security context within pods, right?
Within pods, as well as within containers inside of pods, you can configure a security context with various settings. These controls go into a lot of detail about what's safe to configure, how you should configure them, and what the allowed values are, and they evolve with new versions of Kubernetes. You will see changes in this, so it's not static, it's not a one-time thing. As Kubernetes evolves, these standards also evolve.
Once you have these standards, you can have multiple implementations, and the in-tree implementation of this is pod security admission. The idea with pod security admission was, again, let's provide something basic which can be easily enabled, which may not cover all the use cases but at least lets you get to a secure default. The idea here was to have three levels: privileged, which of course allows everything; then baseline, where any known vulnerability, any known issue or misconfiguration, is covered; and then restricted, which is the recommended way of running highly secure clusters. This allows you to set the level very easily, just using labels at a namespace level, so again it's super easy to configure, but it has some limitations. In the real world, what we've seen is that it often doesn't get used because it's only at the namespace level, and we'll talk a little bit more about this later. Next, Andy is going to go into validating admission policies and CEL.
So we removed pod security policy and then we created pod security admission, and like Jim was saying, pod security admission is not configurable. It's not that it's not highly configurable; it's just not configurable. So we were relying on dynamic admission controllers to do all of our policy, and then some folks brought up the need to write policy at admission time, and some people wanted to do it in tree.
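
As a rough illustration of the namespace-level labels described above for pod security admission, here is a minimal Python sketch that builds a Namespace manifest as a plain dictionary. The `pod-security.kubernetes.io/<mode>` label keys, the `enforce`, `audit`, and `warn` modes, and the three level names are the ones documented for pod security admission. The `psa_labels` helper and the `payments` namespace are hypothetical names used only for this example.

```python
# Pod Security Standard levels, from least to most restrictive.
PSA_LEVELS = ("privileged", "baseline", "restricted")


def psa_labels(enforce, audit=None, warn=None):
    """Build the namespace labels that select a level for each PSA mode.

    psa_labels is a hypothetical helper for this sketch, not a Kubernetes API.
    """
    labels = {}
    for mode, level in (("enforce", enforce), ("audit", audit), ("warn", warn)):
        if level is None:
            continue  # mode not requested, so no label is emitted for it
        if level not in PSA_LEVELS:
            raise ValueError(f"unknown Pod Security Standard level: {level}")
        labels[f"pod-security.kubernetes.io/{mode}"] = level
    return labels


# A Namespace manifest as a plain dict; "payments" is a made-up name.
namespace = {
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {
        "name": "payments",
        # Reject pods that violate baseline, and warn about anything that
        # would not pass restricted.
        "labels": psa_labels(enforce="baseline", warn="restricted"),
    },
}

print(namespace["metadata"]["labels"])
```

Everything here is set per namespace, which is the limitation mentioned in the talk: the labels select a level for the whole namespace and nothing finer.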
What we got out of that was validating admission policy. Validating admission policy uses CEL; we've mentioned it a couple of times. That's the Common Expression Language, a relatively straightforward language for defining policy. It requires bindings to bind the policy to a specific object or set of objects. You can declare, as a separate CRD, parameters that are used by that policy, so it is configurable per namespace or per binding to allow you to modify how the policy behaves. And as we mentioned before, it's currently in beta.
I keep forgetting I have a clicker. All right. If we look at CEL and how these policies are defined, it is fairly developer friendly. It's sort of small; we'll get into that in a second. It's very extensible, so there are quite a few different macros in the current implementation of validating admission policy that allow you to do fancy things like regex matching and optional types, and you can compare different types together. One of the interesting ones that I learned about recently is the authorizer, so you can actually check the groups and users of whoever's performing that admission action. But really, the issue with CEL that we see is that it gets rather verbose.
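
The slide with the CEL policy is not part of this transcript. As a stand-in, here is a hedged Python sketch (not CEL, and not the speakers' policy) of the logic a "containers must not run as root" check has to cover: every container, the optional `securityContext` and `runAsNonRoot` fields, and the init containers as a separate list. It assumes only container-level settings matter; a real check would also consider the pod-level security context. The function name and the sample pod are hypothetical.

```python
def containers_missing_run_as_non_root(pod):
    """Return the containers that do not explicitly set runAsNonRoot: true."""
    spec = pod.get("spec", {})
    offenders = []
    # Regular and init containers live in separate lists, so both are walked.
    for field in ("containers", "initContainers"):
        for container in spec.get(field) or []:
            # securityContext and runAsNonRoot are both optional, so each
            # level is checked for existence before the value is read.
            security_context = container.get("securityContext") or {}
            if security_context.get("runAsNonRoot") is not True:
                offenders.append(f"{field}/{container.get('name', '<unnamed>')}")
    return offenders


# Made-up pod used only to exercise the check.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo"},
    "spec": {
        "initContainers": [{"name": "setup", "image": "busybox"}],
        "containers": [
            {"name": "app", "image": "nginx",
             "securityContext": {"runAsNonRoot": True}},
            {"name": "sidecar", "image": "envoy", "securityContext": {}},
        ],
    },
}

print(containers_missing_run_as_non_root(pod))
# prints ['containers/sidecar', 'initContainers/setup']
```

Even in this form, the existence checks and the second list account for most of the lines, which is the source of the verbosity being described.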
This right here is just one check, to say don't let the container run as root. The reason it gets verbose is that I have to check every single container, I have to check and make sure that the run-as-root section exists, and then I also have to check the init containers separately. So we end up with this very verbose policy for what is a relatively simple and straightforward check in Kubernetes. We could split this into multiple policies, but then we're running multiple checks against every object every time we do that. Also, we don't get great error messages here, because right now it's like, well, one container in this pod somewhere is running as root, but we don't know which one, because we just had to run 12 checks as part of this single check. So this is CEL. It's relatively straightforward to use for simple things and it gets you a lot further than pod security admission, but I think there's more improvement to be had here.
The next thing coming up on top of that is mutating admission policy. There's a KEP out there for this; this is just kind of a side note here. Currently it's not even in alpha yet, right? Yeah, so it's coming, hopefully in alpha in, I think, 1.31, possibly 1.32. So just keep an eye out for that.
We put together this table to compare all of the different policy options and talk about what they're good at and what they're not good at. We've talked about this a little bit already, but it's nice to see it all in one big table. We've got the first column there with pod security admission. We've got validating admission policy, which gets you a little bit further; it gives you more granular control, and it's still built in, just like PSA is. And then we've got a weird little symbol there for exception management, to denote that since validating admission policy is opt in, not opt out, exception management is a little bit odd.
If we say this applies to every object in the cluster, there's no way to opt a single workload out of it. You could write a CEL expression to exempt various workloads or pods or containers from your policy, but then you're modifying your policy every time you need an exemption, which, we've found across the various people we talk to, is not the ideal way. You want to have one solid policy that works all the time and then make your exceptions somewhere else. And then, none of the built-in controllers support the policy report API which the working group has published, and neither of them works in the CLI. So really what we end up at is dynamic admission controllers, still, as kind of the de facto way to do all these things if you want to do them all. I think Jim wants to talk a little bit more about some of the other bits.
Yes. The one question that comes up often, like Andy was describing, is this: you can do checks in CEL, but validating admission policies do have some limitations on what they can check. So what exactly can you do, what can you not do, and when do you need dynamic admission controllers?
The usual answer is, well, for more complex policies use something like OPA Gatekeeper or Kyverno or other tools. But if you dig in in a little more detail, really the way to think about it is this: for any policy that operates on a single object, which is the thing that's being changed at admission control, you can write a CEL expression to do that. There are also built-ins, like Andy was saying with the authorizer. If you want to do something like a subject access review for access control, you can do things like that, because the Kubernetes authors have put extensions into CEL to allow it. But you can't look up some other API object as part of your policy. The classic example: if you want to limit every namespace to a single load balancer, OPA with Rego, or Kyverno, can do that fairly easily. With CEL and validating admission policies you can't do that lookup. So that's one thing to think about: anything that expands beyond the scope of that single admission request is not possible to do with validating admission policy.
Other things, like image verification: you want to check signatures, and to do that you'd have to call an external registry to fetch the signature artifact and compare it. That's not something validating admission policy will support, and it's not something you'd want to do in the API server because of the latency and other things it introduces.
Also, going back to the definition we started with, if policies are configurations to manage other configurations, the keyword there being "manage", you want to think about the full lifecycle of configurations. You want to be able to mutate; simple mutations will be allowed in the API server, but you want to do even complex mutations on existing resources, right?
Let's say some API gets deprecated, and you want to write a mutate policy and update it to a new group, a new type. Why not write a policy for that? It's a great way of handling it. You might also want to generate resources when a new namespace is created: generate secure defaults, generate fine-grained roles and role bindings, generate network policies. All of those things, when you think about managing configuration, are within the scope of policy management, which of course will not be handled within validating admission policy or the upcoming mutating admission policy. You'll need external tools for that. So those are some of the trade-offs to think about as you're comparing these tools.
Ultimately, one of the things to think about is that if you can use validating admission policies, then you should, whenever possible, but you'll still need dynamic admission controllers. So what is the best way to manage these? The good news here is that both Kyverno and OPA Gatekeeper, both CNCF projects, are fully embracing and supporting validating admission policies. With the latest releases of these policy engines, they will generate and manage the lifecycle of validating admission policies for you, so you don't necessarily need to think about which one to choose and when to do what. If you're using Gatekeeper and you use their way of declaring policies, wherever possible they will execute that policy in the API server. Same thing with Kyverno: if you write Kyverno policies with CEL, it will automatically generate and manage the lifecycle of validating admission policies, as well as bindings, for you, so that these policies can execute in the API server rather than being received through the webhook, which is the normal way of processing policies in Kyverno, right?
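
To make the single-object limitation from this part of the talk concrete, here is a small Python sketch contrasting two checks on a Service. The first needs only the object being admitted, which is the kind of rule the speakers say fits a CEL expression. The second, the one-load-balancer-per-namespace example, also needs the Services that already exist, which stands in for the API lookup that requires a dynamic admission controller. All function names, the `owner` label rule, and the sample data are hypothetical; this is not how Kyverno or Gatekeeper implement it.

```python
def is_load_balancer(service):
    """True when the Service is of type LoadBalancer."""
    return service.get("spec", {}).get("type") == "LoadBalancer"


def allow_single_object(service):
    """Hypothetical rule that needs only the incoming object:
    LoadBalancer Services must carry an 'owner' label."""
    if not is_load_balancer(service):
        return True
    return "owner" in service["metadata"].get("labels", {})


def allow_with_lookup(service, existing_services):
    """Rule from the talk: at most one LoadBalancer Service per namespace.

    existing_services stands in for a lookup of other API objects, which
    goes beyond the single object in the admission request.
    """
    if not is_load_balancer(service):
        return True
    meta = service["metadata"]
    others = [
        s for s in existing_services
        if is_load_balancer(s)
        and s["metadata"]["namespace"] == meta["namespace"]
        and s["metadata"]["name"] != meta["name"]
    ]
    return not others


# Made-up cluster state and incoming request.
existing = [
    {"metadata": {"name": "web", "namespace": "shop"},
     "spec": {"type": "LoadBalancer"}},
]
incoming = {
    "metadata": {"name": "api", "namespace": "shop", "labels": {"owner": "team-a"}},
    "spec": {"type": "LoadBalancer"},
}

print(allow_single_object(incoming))          # True: decided from the object alone
print(allow_with_lookup(incoming, existing))  # False: "web" is already a load balancer
```

The second function cannot be written without its extra argument, which is the whole distinction: the decision depends on state that is not in the request.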
So that gives a lot of power in terms of how you can balance between those two.
Just on that, there's a lot of discussion on what the challenges are, or why run something in the API server if I'm still going to use a dynamic admission controller. Why not just put everything there? The problem with dynamic admission controllers is a few things. First off, once dynamic admission controls came about, it's like they say: if you have a hammer in your hand, everything looks like a nail. Every problem in Kubernetes was solved by another dynamic admission controller. Obviously that's not a good scenario. If you have half a dozen dynamic admission controllers running in your cluster, all trying to mutate or validate things, you are going to run into issues. So don't do that. Just pick one or two, and try to keep it as minimal as possible.
But as you're running dynamic admission controllers, these are highly mission-critical workloads. These are things which have to be managed carefully. If you just take the default Helm chart from any of these projects, slap it into your cluster, and expect it to work in production, that's not going to happen. You need to tune this, and you need to understand what is happening in these controllers. In the very early days of even Kyverno, it was naively receiving every request from the API server, and of course, keeping up with that will slow down your API requests and impact other workloads. So there's a lot of tuning that can be done with these projects. The way to think about it is that it has to be a highly available, highly secure, low-latency type of workload, and you want to minimize what goes to these dynamic admission controllers and do as much as possible in the API server itself, right?
That's what both of these projects are now moving towards: very fine-grained configuration of the webhook configuration objects, which is where you can define which requests go to the dynamic admission controller, like the Kyverno workload or the Gatekeeper workload, and which requests should be handled directly in the API server. You can tune that in many ways. There are also defaults for excluding certain namespaces. One common problem is with kube-system or your CNI namespace. If you have, for example, Kyverno policing your kube-system namespace or your CNI namespace, and you're trying to do a CNI upgrade, that might get blocked, and that will of course impact your cluster. To avoid things like that, you want to make sure you have configured it correctly, based on your set of add-ons and on how you're managing Kyverno.
And you continue to monitor that. There are a lot of metrics, even traceability, in these projects, which show you exactly what requests are being handled and how much latency each one takes. You want to keep these within milliseconds; anything that goes over 10 milliseconds is a problem. Webhooks also have these failure modes called fail open and fail closed, and basically what that means is, how do you handle a failure? Fail open means,
well, even if there's a failure, something unknown happens, continue with the operation. But the problem is you have to wait for 10 seconds for that failure to happen. So let's say all your admission controller pods are down, and you have to wait 10 seconds for each API request which should go there. Of course that's going to slow down other things. So these are things to be aware of. Just use best practices, much like you would with a database or with other mission-critical workloads. You're going to apply best practices for monitoring and managing that workload, so apply the same to your dynamic admission controllers, and try to keep them to a minimum, in the sense that, again, if you have too many of these, it starts creating problems in your clusters.
So, in terms of guidance, and some of this we covered already: wherever possible, use what's built in. Start with that. As validating admission policies are going GA, if you haven't started experimenting with them, start using them and try out what can be done. You can look within the Kyverno project, for example.
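
Returning briefly to the fail-open and fail-closed behaviour described earlier in this part of the talk, the following Python sketch models the decision and the cost of waiting. `Fail` and `Ignore` are the two `failurePolicy` values a Kubernetes webhook configuration accepts; the 10-second figure is the wait mentioned in the talk. The `admit` and `worst_case_delay_seconds` functions, the exception class, and the assumption that requests are handled one after another are all hypothetical simplifications.

```python
class WebhookUnavailable(Exception):
    """Raised when the admission webhook cannot be reached or times out."""


def admit(call_webhook, failure_policy):
    """Decide a request the way a webhook's failurePolicy would.

    call_webhook is a hypothetical callable that returns True (allow) or
    False (deny), or raises WebhookUnavailable on an error or timeout.
    """
    if failure_policy not in ("Fail", "Ignore"):
        raise ValueError(f"unknown failurePolicy: {failure_policy}")
    try:
        return call_webhook()
    except WebhookUnavailable:
        # "Ignore" is fail open (continue); "Fail" is fail closed (reject).
        return failure_policy == "Ignore"


def worst_case_delay_seconds(request_count, timeout_seconds=10):
    """Total time spent waiting if every request hits the timeout,
    assuming the requests are processed sequentially."""
    return request_count * timeout_seconds


def webhook_down():
    """Simulates the case where all admission controller pods are down."""
    raise WebhookUnavailable("no admission controller pods are running")


print(admit(webhook_down, "Ignore"))  # fail open: the request is allowed
print(admit(webhook_down, "Fail"))    # fail closed: the request is rejected
print(worst_case_delay_seconds(30))   # 30 requests x 10 s of waiting each
```

Fail open keeps the cluster usable when the controller is down, but each matching request still pays the full timeout first, which is why the talk recommends limiting what is sent to the webhook.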
There's also a library of policies there for everything in pod security, and now there's a CEL version of those as well, so they can be executed directly in the API server. Take a look at that, or just take a look at other examples for what needs to be done. Then, as you pick and choose different policy engines, figure out what applies best to your workloads and your use case, but try to prefer tooling which automatically uses and leverages these built-in policies. Again, there are other tools which may not know how to deal with validating admission policy, so then you're missing out on certain features coming in Kubernetes, because all of this has to be offloaded into some other check or some other configuration tool.
And think about other things, like applying policies in pipelines. As much as you can, if you're following GitOps, and if you're not, you should be looking at how you can leverage GitOps, apply these checks directly in your CI/CD pipeline. But then apply the same policy at admission control as well, and periodically as background checks. Now you have defense in depth at every layer. But the earlier you can apply this, the better, so executing some of these in your pipeline should definitely be something you look at as well.
All right, now it's time for the audience participation part of the talk. We always like to do this when we talk with various folks, and this is the biggest audience we get, because not this many people show up to the policy working group meetings. So I'd love to ask a couple of questions of the audience. How many of you, by a show of hands, are using pod security admission in your workloads? I've got one, two, three, four, five people. It's like three percent of the room. All right. How many people are using some sort of dynamic admission controller for their policy? A lot more people. Sorry? Sure, why not. We still love Python.
There's nothing wrong with it. And how many people are using validating admission policies, even though they're not GA yet? Got one. All right. How many of you are planning to use validating admission policies once they hit GA? OK, a few more, a few more. All right, great.
And then we'd love to hear your questions, so if you have any questions, please raise your hand. Also, as a policy working group, we'd love to see what you would like to hear from us next. What can we do that's valuable for the community? What guidance can we provide, or what content can we put out? Is the policy report API useful? We love all that feedback. We are going to be working on putting together a survey for the entire community, so keep an eye out for that, and please tell us all of your policy wants and woes. Does anybody have any questions? There's a microphone coming to you.
Thank you. Hello. Oh, this worked immediately, nice. So, a question and a request. We obviously like using a bunch of policy stuff. Of the three that you just mentioned, we don't use the alpha one, but we use the other two, and we apply things in the CI/CD pipeline. But we also want to apply them even earlier, like shift left. We want to get it down to this: our developers are writing Java code and they get compile-time errors, so when they write the YAML, they get errors in their IDE that say this is going to break the policy, before they even commit. Do you know if there's further development within the policy tools on plugins for VS Code or IntelliJ or any of the other IDEs that are popularly used, for hooking that in to make sure that policies can work?
There's one project that was attempting that, and I don't recall the name.
I'll have to look it up, but they were using a CLI, both from Kyverno and gator, which is the Gatekeeper CLI, to do almost exactly that, to provide validations. And VS Code plugins, of course, can be developed. As long as you have a CLI, you should be able to do those types of validation checks and apply them as well. But I'll look up the project and maybe try to post it, at least in the policy working group Slack.
Sweet. And then my thing for the future is, we use Backstage, and we see a lot of other people using Backstage as well, and getting those policy reports into Backstage and visible for developers, so they can see, hey, maybe we're allowing things, but you have a bunch of stuff that's a bit of a red flag that you should probably fix soon, and having those reports easily available. So, similar to VS Code plugins, Backstage plugins and the tooling for visibility would be amazing. That's just a general request to the room. We will also probably be doing development work, but if anyone else is interested in that, it would be really appreciated.
Yeah, I think that would be a great addition. And if anybody wants to come work on a Backstage plugin as part of the policy working group, I'm sure we would be happy to talk about that. I think we had a question over here.
Thank you both for the talk. We've been using dynamic controllers now for probably three years, and we've racked up perhaps 100 or 200 different policies, mutating, validating, and generating. Now with VAP, I've not thought about it much, but have you got a view as to whether we should start thinking about refactoring some of these and thinking about CEL? How do you see that?
That is, the tension between the two, in a sense, for someone who's already been doing this for a long time and has a body of policies already in production.
Yeah, I think Jim probably has a great answer.
So, if you're not having any issues or problems with what you have, there's no necessary reason to change. But at the same time, if there are any of these simpler checks which can be moved to the API server, think of it as offloading your dynamic admission controllers and executing these further down in the system, so you're just embedding them at the API server level. It really comes back to how expensive your policies are. Are you seeing issues in the workload which you want to offload? If you don't have any complaints or any challenges with that, it's fine to stay as is.
No, no. All right, we're being told we are out of time. I think they're going to kick us out. We do have a session tomorrow; there's a meet and greet at 11 tomorrow. So again, please stop by if you have any other questions, or if you want to know more about what we do in the policy working group, or even join one of our future meetings. Hope to see everybody there, and thank you. Thanks, everybody.