Hello, welcome to using Open Policy Agent to meet evolving policy requirements. In this talk, I'm going to cover how my team has been using Open Policy Agent, or OPA, for around the last year in order to meet evolving requirements that we've faced as we've moved into new regulated environments. My name is Jeremy Rickard. I'm a software engineer at VMware, and my team has really been focused on doing things in the Kubernetes space for the last few years. I'm also the Kubernetes 1.20 release lead, and I've worked on a number of open source projects like Virtual Kubelet and Service Catalog for Kubernetes. If you'd like to reach out to me after this talk, feel free to ping me on Twitter or the Kubernetes Slack, where you can find me as jerickar. I'm also happy to respond to emails.

So what does my team at VMware do? We're called the VMware Developer Platform, and we've got this long collection of words that describes what our team does. But if I boil it down to a really simple explanation, our team provides managed Kubernetes to VMware SaaS services, along with supporting infrastructure like maybe Vault or the creation of resources in AWS that might support what those teams are doing. This project has been around since mid-2018, generally available for VMware SaaS teams. The genesis of this thing was really focused on deploying clusters for multi-tenant use. So we would deploy shared clusters, and these clusters would have multiple tenants on them. These were deployed into Amazon, so running in public clouds, and available to the SaaS teams to deploy their workloads. Since we were running multi-tenant clusters, we used namespaces as the level of isolation where we enforce the multi-tenancy, and we used RBAC pretty extensively to make that happen. To really facilitate that, we also have a data plane that we run in something we call our management clusters. So when a user is going to use VDP, they use a CLI that we've written to maybe create namespaces, or label namespaces, things like that. Then once they have the namespace created, they're able to do pretty much whatever they want inside of that namespace, and it belongs to them. They can define network policies, things like that.

But we very quickly came to see that that's not really sufficient for all the use cases. There were teams that needed to do more than just exist within one, maybe two namespaces. They needed the ability to create more namespaces, do other things on the cluster. Doing that doesn't really fit into that multi-tenant cluster model, so we also evolved to support non-shared clusters. We call these tenant clusters. In those cases, the tenants get much more access to the cluster. They can do things like create namespaces; they get nearly cluster-admin access instead of having to rely on our data plane to do a lot of that stuff. But at the same time, we have resources that we deploy that are really required for the cluster to operate successfully. So we come to this challenging point: tenants have what amounts to cluster admin, which lets them do a lot of things that might impact the operation of our system, while we also have resources that we want to protect. And RBAC wasn't really sufficient to handle all of those things. Maybe we need to validate that somebody can do something based off of their org membership in an external system.
So thinking about how we might solve that problem, we realized pretty quickly that webhooks, the dynamic admission control that's available in Kubernetes, would really solve this problem, or at least give us a point where we could write some extensions to make that happen. If you're not familiar: when you do something like a kubectl apply, the CLI makes a call to the Kubernetes API. The API server does some basic checks to make sure you're authenticated and authorized to do what you're trying to do, and then the request moves into dynamic admission control. There, you can define mutating webhooks, which might make changes to the request coming in, maybe injecting a sidecar container or changing labels, things like that. Finally, it moves on through the chain until it gets to the validating webhooks. There, decisions are made about whether the request should be allowed or not. Maybe a webhook is checking parts of the request against other things in the system to make sure it should be allowed, or maybe it's checking something against an external system.

So we ended up writing a webhook to do some of this. It was focused on protecting the resources that VDP manages in the clusters, while allowing tenants to do most everything else. When a request comes into a VDP-managed namespace, where we have deployed things, we can analyze who's making the request and either allow it or not. This became a great general extension point for us to add new functionality that we couldn't directly express with role-based access control.

But as our scope started to grow, we onboarded more tenants, and we also started to pick up additional places where we needed to run. We needed to follow our tenants to where they needed to be. And the first place they needed to be, beyond our normal commercial AWS regions, was GovCloud, Amazon's GovCloud. Our tenants wanted to start pursuing FedRAMP certifications, starting with FedRAMP Moderate and moving into FedRAMP High. We recently completed an effort to help the VMware Cloud on AWS team secure a FedRAMP High certification. Doing that meant that we needed to evaluate a lot of what we were doing: look at the requirements for that certification process, find the gaps in what we had already deployed, and start to evolve to fix those things. Shortly after, we started to support a PCI certification effort, again for VMC. Each one of these new environments brought new requirements. When you consider FedRAMP High, there are over 400 different controls that you have to meet in order to get that certification. PCI has a completely different set of requirements; a lot of them are similar, but there are also differences between them. You need to review each one of these things against what you've deployed and how you're operating to make sure that you fit those requirements. Does Kubernetes directly meet all of those things? Probably not. And in our case, we didn't try to justify each one of those things with Kubernetes. One of the nice things about these certifications is that they've realized not every requirement as written can be directly applied to every business case or every computer system, so they allow for what they call compensating controls. A compensating control can be applied to almost all PCI requirements.
A compensating control really says that if a requirement can't be directly applied for technical or business reasons that are documented, you can go ahead and identify additional controls that help mitigate the risk the original requirement was meant to address. For us, that was a great way to take the Kubernetes clusters and the other stuff we've deployed for our tenants and figure out how we can augment those things, maybe with policies or maybe with some additional things we deploy, to help reduce those risks.

As we looked at each of these things, we considered that we have lots of different clusters. We're deploying in the commercial regions: us-west-2, us-east-1, various APAC or Europe regions. When we compare those to the GovCloud deployments, they're pretty different, and the requirements for them are pretty different. We do have a base set of security things that we have to follow for VMware security, obviously: any VMware service that's going to be deployed has to go through a set of security validation to make sure it meets our internal requirements. But when we move into these other environments, more and more restrictive things get put in place. We obviously don't want to force all of these requirements onto the tenants that don't need them, because that would make their jobs harder. We want to be an enabling feature for them and help them be successful. That didn't seem like it fit really well with our webhook model, because we would be adding different features that we would probably have to feature flag in different clusters and keep track of all those different things. It's additional code we'd have to write and test every time we wanted to make one of these new features available, and then it would have to go through our whole rollout process. That's just a little bit more complicated than we think would be great. And thinking about this problem, we also want to make sure our users don't hate us.

Additionally, there were some things we really wanted with this change. We didn't want to require new code for each of these policies, and we didn't want to have to make changes to our existing webhook code. It's written in Go; we build it into a Docker container, we deploy it, it rolls through our pipelines and goes through a full upgrade process. If we wanted to make individual changes to that thing every time we identified one of these new policies we needed to enforce, that would get complicated. So we wanted something that didn't really require new code. We also wanted to make it easy for the team to learn, so we didn't want to require them to learn a brand-new programming language from the ground up. Obviously there's probably going to be some domain-specific language involved, something that looks like code, but we didn't want to force everybody on the team who's not a Go developer to learn Go in order to build new policies like this. And finally, while we didn't want to go through the process of doing a full upgrade and rollout every time, we did want to make these things testable, so that when we're defining these new policies, however they're going to be applied, we can test them before we roll them out and we're not breaking things down the road. So we looked at all of these requirements, including the fact that we want this applied on a cluster-by-cluster basis.
We wanted to make sure that we could satisfy these wants. We did a search across the CNCF landscape, and we identified something that we thought would help us quite a bit. It turns out that was OPA: Open Policy Agent. Open Policy Agent is pretty extensible. It provides its own language, Rego, for defining what policies look like, and we'll look at that in just a second. And it turns out it integrates pretty well with Kubernetes. There's a project called Gatekeeper that I highly recommend you take a look at. We ended up not using Gatekeeper for a few reasons that I'll get into as we go through the talk, mostly because when we started this journey it was pretty early days for Gatekeeper, and we ended up going with the kube-mgmt approach, which runs kube-mgmt and OPA together in a sidecar manner.

It ends up looking something like this. Just like the validating admission webhook we wrote ourselves, this plugs in pretty much the same way. When you deploy OPA and kube-mgmt together, you can register them as validating and mutating webhooks; they plug into the API server just like any other webhook would. So when a user is making a request with kubectl, when CI/CD pipelines are using the API directly or using kubectl themselves, or when controllers inside the cluster are making changes to objects and resources via the API server, everything goes through that normal admission process. An admission request hits OPA, OPA looks at that request, determines whether any of the policies you've applied should result in a deny or a block, and sends that response back, and the API server handles it appropriately.

So let's look at a really simple example of what a policy might look like. Here we want to deny any request that comes in that's labeled with a certain value. You can see that this is really a declarative language: we're stating a series of facts, or in this case really just one fact, and if that fact is true, then we're setting a variable value that gets returned. We start this off with a deny block, so the keyword deny, and in that is a message that's going to be returned. The first line inside it is really the statement that we're checking, the policy we're enforcing. In this case, if the metadata has a label called pants with a value of sweatpants, then the message we're going to send back is "you can't sit with us". If you notice in that line, input.request.object is really coming from the Kubernetes admission request. If you look at the JSON that makes up a Kubernetes admission request, it's got those pieces in it. So it's really great: in this policy you're able to say, I want to look at the metadata of this object that's coming in, or maybe I want to look at the spec of this object, or maybe I want to look at the verb. Is this a create or an update? Maybe I want to apply policies differently that way. It's really flexible and gives you a lot of power without having to go write new code. It's still code, obviously; you're still writing some declarative statements, and you still have to end up putting those in the cluster somehow, but it's a much simpler path forward.
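(Since the slides aren't reproduced in this transcript, here's a minimal sketch of the policy just described. The package name is just the common kube-mgmt convention, and the unit test below it shows the kind of testability I mentioned earlier; you can run it with opa test.)

```rego
package kubernetes.admission

# Deny any request whose object is labeled pants=sweatpants.
deny[msg] {
    input.request.object.metadata.labels.pants == "sweatpants"
    msg := "you can't sit with us"
}

# Unit test for the rule above; run with `opa test .`
test_sweatpants_denied {
    deny["you can't sit with us"] with input as {
        "request": {"object": {"metadata": {"labels": {"pants": "sweatpants"}}}}
    }
}
```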
To test this, OPA provides a lot of tooling, and you can actually take this stuff and put it into the Rego Playground; I have a link to that at the end of the presentation. It lets you test these things without having to run anything locally on your machine: you build out a sample input document, build out your sample policy, and run the validation right in the Playground. It's pretty cool.

So with all of that in mind, let's talk about a few use cases that we have solved with Open Policy Agent and Rego. I'm going to go through three examples, and for each one of these, I'm going to loosely tie it back to some control or rule that we found in FedRAMP or PCI that we needed to apply to our system.

The first of those is the use of external information systems. Inside of this requirement, there's a whole bunch of different rules and a lot of different individual control points, but the one I'm going to focus on is that information systems outside of the authorization boundary qualify as external information systems. In GovCloud, we deploy Kubernetes into those FedRAMP environments, and we deploy a lot of other things in there. We try to minimize our reliance on external resources, things that are outside of that authorization boundary. One of those things is the Docker registry. In production, in our commercial environments, we're using a hosted service from JFrog that's not available for us to use directly as part of our FedRAMP offering. So we needed to run our own registry in boundary. Inside of that GovCloud environment, we have our own Docker registry that we run, and we push all of our images to it. Then when we want to deploy stuff into the cluster, we need to reference those images. We also want to make sure that the cluster isn't running things that it's directly pulling from the internet. There is some connectivity; well, there was originally, we've locked it down since then, but originally you were able to pull things from Docker Hub or from our JFrog-hosted solution. So the first thing we looked at with OPA was how to restrict the use of those other registries. We really want to lock it down to just the one, and make sure that requests coming in only reference the registry we want them to come from.

So one of the first policies we built was a pretty simple one that looks at the image being used by containers. This policy is really cool, and it lets us restrict any request coming in to only those that use certain repositories. In this case, we start off again with that deny block. The first thing we look at is the kind of this request: just like you deal with kinds in Kubernetes, we're checking that here, since an admission request comes with whatever type of object you're dealing with. We really only want to apply this to pods. We could do this at different levels, looking at the deployments or replica sets, but this was the simplest for us: when a pod is created, check whether its containers are using the registry we expect them to use. It's a pretty simplistic check. We iterate through all of the images, since obviously you can have multiple images in a pod spec, and we want to make sure that each one of those is valid. So we start off with the second line of the block, `some i`: for every image that exists in the input.request.object.spec.containers array, grab that image and validate it. Then for each one of those images, we basically just ask: does this image start with our GovCloud repo?
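(Sketched out, that policy looks roughly like this. The hostname is a notional stand-in, and the helper name is illustrative rather than the exact code from the slide.)

```rego
package kubernetes.admission

# Notional stand-in for the in-boundary GovCloud registry hostname.
gov_repo := "vmware-is-awesome.example.com"

deny[msg] {
    # Only apply this rule to Pod admission requests.
    input.request.kind.kind == "Pod"
    # For every container image in the pod spec...
    some i
    image := input.request.object.spec.containers[i].image
    # ...fail if it does not come from the approved registry.
    image_is_not_gov(image)
    msg := sprintf("pod container is not allowed to use the image %q from a non-approved repo in Gov", [image])
}

image_is_not_gov(image) {
    not startswith(image, gov_repo)
}
```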
I've replaced the real registry here with vmware-is-awesome just for notional purposes, but you can see we're making a slightly more complex policy here by calling into that function. When this evaluates true, when the first line says it's a pod and this is not a GovCloud image, we're going to return the message that the pod's container is not allowed to use an image from a non-approved repo in Gov.

So what does that look like in practice? Using the deprecated functionality of creating a pod with kubectl run (we're still running fairly old clusters, so I can still do this), I'm going to try to run an MQ test client from my personal Docker Hub account. So I run that with kubectl run. Behind the scenes, in that version of Kubernetes, that actually creates a deployment, and that deployment will then spin up pods. I don't get an error here, though, because my policy was really applied to just the pod. So to work around that, or to see what feedback you get, let's take a look at the events. We can run kubectl get events and filter down to the lines containing openpolicyagent, and you can see that we can't actually create the pod. If I did a kubectl get pods here, you would see that there were no pods created for this deployment, and it's specifically showing that error message that I created before.

So this was great, and we were able to lock all of the registries down and make sure that we weren't deploying anything from the non-controlled registries outside of the boundary. But now we have some fairly unhappy users. Our goal all along, I mentioned this at the beginning, was to make sure that the users didn't hate us. We wanted to make sure that things were as easy as possible for them, and not every one of our tenants is super versed in Kubernetes. They're using Kubernetes, they realize the benefits of deploying their stuff on the platform, and they're along for the ride for GovCloud. But us adding this constraint makes it a little bit more difficult for them. They either have to maintain a separate set of values files if they're using Helm or some other tool that does templating and overlaying, or maybe their Helm chart doesn't even allow them to template that, because they've not done a super great job of templating that stuff out. So there were changes that had to be made there.

So we thought: what can we do to help with that situation? I mentioned this earlier, but you can actually run OPA as a mutating webhook in addition to a validating webhook. What's the big difference there? When it runs as a validating webhook, we have those blocks that start with deny, and when all of the rules match for a deny, the validating webhook functionality will say this request is not allowed, here's the error message. But just like every other mutating webhook, OPA can also update your resource, and it does that by generating JSON patches. The syntax gets a little bit more complicated, and I'm not going to show you the entire thing here, but I'll show the relevant parts. Here we've defined two variables: the VDP repo, which is whatever our upstream public managed JFrog host is, and the hostname for our GovCloud repo. Then instead of using the deny block, we define a patch block, which is going to return whatever JSON patch needs to be applied. In this case, we have a couple of extra things here that I probably should have removed from the example, but first we want to make sure: is mutation allowed?
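(Roughly, the relevant parts look like this. The hostnames and the opt-out label are placeholders, the mutation_allowed helper is simplified, and the boilerplate that assembles these patches into the JSONPatch of the admission response is omitted here, just as it was on the slide.)

```rego
package kubernetes.admission.mutation

# Placeholder hostnames for the upstream JFrog registry and the
# in-boundary GovCloud registry.
vdp_repo := "vdp.jfrog.example.com"
gov_repo := "registry.gov.example.com"

patch[p] {
    mutation_allowed
    some i
    image := input.request.object.spec.containers[i].image
    # Only rewrite images that point at the upstream public registry.
    startswith(image, vdp_repo)
    p := {
        "op": "replace",
        "path": sprintf("/spec/containers/%d/image", [i]),
        "value": replace(image, vdp_repo, gov_repo),
    }
}

# Simplified: the real policy also checks namespaces and an opt-out
# label; "disallow-mutation" is a notional label key.
mutation_allowed {
    input.request.kind.kind == "Pod"
    not input.request.object.metadata.labels["disallow-mutation"]
}
```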
So we want to validate that the type of resource we're going to mutate is something we want to mutate. We have some rules built around what namespace it's in or what labels it might have on it, specifically a label around disallowing mutation. We have a label that we put in place for some of our components where we don't want to mutate like this, because it could lead to unexpected consequences; we essentially check to see if that exists or not and then move on. Then, just like the deny rule, we iterate over all the containers, check to see if each container matches the upstream public repo, and replace it with the downstream value. If any of that generated a new value, then we actually make the JSON patch. I've removed some of the bits about actually constructing the JSON patch, and I'll link the documentation at the end of the talk. But what happens here is that when we make a request, say we helm deploy some deployment and it references our upstream JFrog repository, this code gets invoked. It looks at the request coming in and says: oh, hey, you're using the upstream version, we can't use that in GovCloud, let me go ahead and mutate that for you. So when this actually hits the API server, or sorry, etcd, it's going to have the Gov repo instead of the upstream repo. The cluster will pull that down, and it'll work just like we would expect it to, but we've made it transparent to the end users. So that's the first use case that we solved with OPA.

As we got further and further into the process, one thing that kind of bit us was this next requirement. It's not just in PCI, there's an equivalent in GovCloud, but I like the wording here a little bit more. This is PCI requirement 6: develop and maintain secure systems and applications. There's a lot to unpack in that terminology, but specifically, 6.1 underneath this requirement says you need to establish a process to scan for vulnerabilities. And when you identify these vulnerabilities, you have to remediate the ones that are ranked medium or higher. You get different severity levels, and these are based off of CVSS scores. And when you find these things, depending on whatever certification you've achieved, you have N number of days to fix them. In GovCloud, we have 30 days to fix things. It's not a long time, but it's also not a short time. We deploy a ton of stuff, and as we went through this initial process, we found a lot of containers we were deploying that actually had a number of vulnerabilities. As I mentioned, these things are based off of CVSS scores. In the PCI case, anything that's medium or higher, which is a CVSS score of 4.0 or more, you actually have to remediate, or you will fail your PCI audit. And in any re-investigation or subsequent audits that you go through, you have to demonstrate that you've been finding these things and fixing them. You can check this yourself; there are a number of tools you can use. We happen to use Twistlock, but you can use an open source tool from Aqua Security called Trivy that does pretty similar things. As a really quick example of what that might look like, I scanned one of the images that we have deployed in our commercial environment, and you can see that it found a number of vulnerabilities.
Two of them were critical, three were high, four were medium, and three were low. So we definitely have to fix those criticals and those highs, and for PCI, we need to fix those mediums too. What kind of things do you find inside of that? These can be OS-level vulnerabilities. Say you're running an Ubuntu-based container and it's got a glibc vulnerability inside of it; that'll come up in these scanners and get flagged by one of the auditors. It may not really be a problem, but for us at least, the least amount of effort is to just fix it. It can also be application-level things. In this case, this is a Java application, and the problem here is actually the version of log4j that it's using.

So how do we fix this? How do we, the VDP team, handle fixing these things, and how would anybody else really handle them? Well, you generally need to build a new container: updating libraries, maybe applying OS updates inside of the container if you're using something like Photon (if you're a VMware person) or Ubuntu or Debian as the base image. So we built a little process around that for ourselves. We take whatever base image we have, maybe some upstream component that's based off of Alpine or Ubuntu, and we write a new Dockerfile that takes the old image as the FROM line; if you're building a Dockerfile out, the first line is FROM whatever. Then we run whatever the OS-appropriate updates are, just to make sure we apply those things. And if it's something we've built, we make sure we've rebuilt it with whatever the updated libraries are. And voila, that results in a new tag, hopefully without any vulnerabilities. Then we need to deploy that to the cluster, so we go run a helm update or a kapp deploy, whatever functionality we're using. And then we have to repeat this process whenever new vulnerabilities appear.

For us, we scan pretty regularly, and we automated that by building these images pretty much every day. Twice daily, actually, we run through that process: we build all of the images that we've identified in our inventory file, which is of course a YAML file, and we update those things to generate new tags. Then we update the inventory file, and then we somehow need to deploy that to the Kubernetes cluster, in a not-really-manual way. We were able to pretty easily automate the front half of that process, where we rebuild these containers: we've already written the Dockerfiles for them, and we have the skeleton of the inventory. How do we then take that and deploy it? One of the cool things about OPA, especially when you're using the kube-mgmt sidecar, is that you get access to other resources in Kubernetes. It acts just like any other Kubernetes client would: it establishes a watch and sees things from the API server. So we put our inventory into a config map in the cluster. It lists out the name of each image and what version we want to run. The kube-mgmt sidecar sees when those things change and makes them available to OPA as a data field, which you can access from your policies. So next we can write a pretty simple policy that looks at that inventory, compares it to what's deployed, and mutates the tag. We again start with a patch block and iterate through all of the images, in this case both the init containers and the regular containers, since obviously we want to update both.
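(A sketch of what that can look like. The data.inventory path and the image-reference handling are simplifying assumptions for illustration; kube-mgmt loads labeled config maps into OPA's data document, and the same admission-response boilerplate is omitted as before.)

```rego
package kubernetes.admission.mutation

# Both lists of containers in a pod spec that we want to patch.
container_types := ["containers", "initContainers"]

patch[p] {
    input.request.kind.kind == "Pod"
    container_type := container_types[_]
    some i
    image := input.request.object.spec[container_type][i].image
    new_image := update_image_version(image)
    # Only emit a patch when the tag actually changed.
    new_image != image
    p := {
        "op": "replace",
        "path": sprintf("/spec/%s/%d/image", [container_type, i]),
        "value": new_image,
    }
}

# Look the image name up in the inventory config map (surfaced by
# kube-mgmt under data.inventory, e.g. {"registry/app": "v1.2.3-patched"})
# and swap in the desired tag. Simplified: assumes a name:tag reference.
update_image_version(image) = new_image {
    parts := split(image, ":")
    desired := data.inventory[parts[0]]
    new_image := sprintf("%s:%s", [parts[0], desired])
}
```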
That update_image_version function returns us back a modified version of the reference, and then we make a patch off of that and return it as part of the mutation process. Just like we did with the repos, we're now updating the tag to match what we have in our CI/CD system. So now our inventory file gets deployed to Kubernetes, we deploy it as a config map, and that gets reloaded by OPA. Then we run a small job that just touches the labels on all our deployments, which forces them to go through the admission process again. That forces the mutating webhook, in this case OPA, to update all the tags, and everything starts up again with all of those new, hopefully vulnerability-free, images.

So the last policy that we really wanted to enforce was running as non-root. We're doing that mostly with pod security policies. Pod security policy works pretty well, and it's pretty easy for our tenants to understand, except when they don't pay attention to the notifications we send and don't make changes to their YAML. So: all of a sudden, my pods won't start, what's going on? Well, did you specify the security context in your YAML? Oh, you mean I have to update my chart again? So again, we bring mutation to bear here. In this case, after again checking to see if mutation is allowed, we look to see if the spec already has a security context defined, and if it doesn't, then we make another patch where we add in the runAsUser and fsGroup to make the thing run as non-root. The great thing here, though, is that when users are making these calls and deploying their stuff, they can specify whatever security context they want, and we won't mutate it. This is just a nice add-on for them when they don't have that done.
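(A minimal sketch of that rule; the uid and gid values here are assumptions for illustration, not our actual defaults.)

```rego
package kubernetes.admission.mutation

patch[p] {
    input.request.kind.kind == "Pod"
    # Only add a securityContext when the spec doesn't define one at all;
    # anything the user specifies themselves is left untouched.
    not input.request.object.spec.securityContext
    p := {
        "op": "add",
        "path": "/spec/securityContext",
        # Assumed non-root uid/gid values.
        "value": {"runAsUser": 1000, "fsGroup": 2000},
    }
}
```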
So recapping: what did we, the VDP team, learn? OPA is really flexible. Validation can get you pretty far; you can write a lot of deny rules to lock your clusters down and do a lot of things. But mutation can get you even further; you can do a lot when you combine the two. Rego is pretty easy to learn. The declarative nature of it makes it easy for people to pick up, and we found it was easy for all the team members to really learn it and start writing new policies, or fixing problems we found in the existing policies. And with those in mind, we were able to balance our security needs pretty closely with our desire to make the user experience as nice as possible in this security world that we're living in.

But let's go back for a second and talk about the mutation aspect of this. I have mixed feelings about mutating webhooks, and if you read the Kubernetes documentation, there are actually some call-outs that say, hey, you should probably be aware of these things, and maybe it's not always the greatest idea. One of them is that users don't necessarily know what's happening. They may be confused: this thing that I created doesn't look like what's in the cluster now, what happened? We actually had that problem with the security context mutation. As we were going through this process, people weren't specifying security contexts on their deployments; some actually needed to run as root and gave us an exception request, where we created security service accounts for them, things like that, but they didn't declare the security context because they were just depending on our automation, our easy-mode access, for that. So think about the things you want to do with mutation, use it judiciously, and think about what impacts it may have downstream.

Here are some great links if you want to follow up on these things. If you want to read more about the FedRAMP High requirements or the PCI standards, I've linked them both here. play.openpolicyagent.org is great for experimenting and messing around with policy. There's a great tutorial linked here as well on how to validate ingresses in the cluster. And finally, one of the reasons we didn't use Gatekeeper was that it doesn't do mutation yet, but there's an open issue here, and maybe by the time the next KubeCon comes around, this will be done. I would totally advise you to follow that issue and give Gatekeeper a try if you're going to look at OPA, especially if you only want to do validating sorts of things at the moment. So at this point, I'll turn it over to questions; if you have any, I would love to answer them now. Thank you so much.