Welcome to our talk on enabling autonomous teams through policy enforcement. Before we get started, a quick introduction. My name is James Alseth, and I'm a security engineer at Yubico, currently focused on cloud infrastructure security. Presenting with me is John Reese, who is a software engineer at Yubico with a lot of experience in Go and Kubernetes. Let's go over what we're going to be discussing today. First, we'll start with a brief history of Kubernetes at Yubico: how we got started on our Kubernetes journey and some of the previous gaps in our Kubernetes tech stack. We're then going to discuss how policy helps to address those gaps and enable more autonomous teams. We're definitely going to talk about the awesome open-source tooling that enables us to enforce these policies. And we're going to wrap up by discussing our journey thus far, where we are now and where we see ourselves in the future. And of course, there will be some time at the end for questions.

So let's chat about how Yubico got started on its Kubernetes adoption journey. For us, that started about two years ago with an initiative led by our infrastructure team to standardize the platform that our services run on. Prior to this, we were mostly running on virtual machines and didn't have too many containerized workloads yet. Like many organizations, we used a managed Kubernetes service to get up and running as fast as possible and to avoid some of the pain points of cluster setup and cluster management.

For us, though, probably the first question that pops into our heads is: how can we ensure that Kubernetes workloads are configured securely? And what that's really asking is, how do we control changes that are made to the cluster? Well, we started where I think most organizations do, leaning on three things: authentication, authorization, and consistent peer review.

Diving into the first of those, we were able to take advantage of the managed Kubernetes offering in that it allowed us to tie into our existing identity provider very easily, so we got up and running with authentication pretty quickly. It probably comes as no surprise to those of you who are familiar with Yubico, but we also require strong multi-factor authentication with WebAuthn and YubiKeys. Additionally, we regularly expire the sessions for those with access to infrastructure, requiring re-authentication with multi-factor authentication frequently.

Moving on to role-based authorization. Thankfully, Kubernetes has role-based access control built in, and it has for quite a few versions now. This allows us to tie users, groups, and service accounts to any role. These roles can either be scoped to a namespace or applied cluster-wide. They allow for extremely fine-grained permissions: you can specify the exact verbs an actor can use, such as create, update, or delete, the types of resources they can act on, such as a Deployment resource, and you can even go as far as restricting access to specifically named resources. Again, our managed Kubernetes offering made this easier by tying groups from our IdP into this system.

And the final tool in our tool belt was peer review. For us, we enforced this using GitHub branch protection rules, ensuring that all changes happened through pull requests, and on each of those pull requests we required at least one other person to review.
For us, most of this peer review work landed on our infrastructure team, because they had the most experience with Kubernetes and had spent the most time learning about all of the best practices, security or otherwise. Peer review, though, definitely has some drawbacks when it's the only way that you're restricting changes to your clusters. For one, it needs to be consistent in order to be effective, and consistent review is a significant time investment for the reviewers. Because of this, it often bottlenecks on the team or individual that has the most experience with the technology, which of course in this case is Kubernetes. And all of this adds up to slowing down the release cycle. This is important, because when teams hit too much friction, they often start to work around your processes, whether you know it or not. That is of course bad for security, because you no longer have control over these configurations, but it's also bad for general cluster consistency and maintainability. None of this was really a surprise to us. We kind of saw it coming from a mile away, but that didn't make the problem any less real when we had to deal with it.

So what we did is we spent some time researching what other organizations were doing and what the Kubernetes community was doing, and for us the answer became abundantly clear: policy was the way forward. When people think of policy, there's usually a negative reaction, as they imagine having more hoops to jump through in order to get their work done. However, in our case, since all of these changes to Kubernetes happen through the API server, working with structured JSON data, we can automate the enforcement of these policies entirely. But what do we mean when we say that in this context? Well, it allows us to enforce what we actually care about. For example, we don't really care that a services team is deploying a new version of their service; that's part of their core job function. However, we do care that when they do that, the resources are configured securely. For example, we probably want to ensure that the workloads aren't running as root and that they don't have any extra Linux capabilities attached to them. These policies can also easily extend past security-related settings, though. For example, we can require each resource in a namespace to have a certain label set that identifies the owner of that resource. That makes it easy when you're troubleshooting an issue or you just need to know who owns a resource: it's right there in the metadata of the resource. With that, I'd like to turn it over to John to discuss the tooling we've selected to enforce these policies.

Thanks, James. So as James mentioned, we knew we wanted to use policy to solve a lot of the problems we were having at Yubico. We looked at a lot of the tools out there that solve this problem, but OPA was the clear winner in this space. It had a lot of adoption, it worked with other tools that we knew we wanted to leverage, and it came with its own policy language. But before getting into OPA itself, it's really important to understand what a policy is, what it looks like, and Rego itself. Rego is the policy language that OPA knows and understands, and on my screen here you can see a policy for a Kubernetes manifest that says it must have an owner label on it; specifically, namespaces must have an owner label.
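A minimal Conftest-style sketch of a policy along those lines (the package name and exact message text are illustrative, not the exact slide contents):

package main

# Deny any Namespace manifest that does not carry an "owner" label.
deny[msg] {
    input.kind == "Namespace"
    not input.metadata.labels.owner
    msg := "Namespaces must have an owner label"
}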
So when a request comes into the cluster to create a namespace, this policy will first check whether this input that's coming in, this input document, is of kind Namespace, and if it doesn't have an owner label on it, it returns a message that says namespaces must have an owner, so the user, the individual trying to deploy this namespace, knows how to fix it. The important takeaway here is the input keyword. The input keyword denotes an input document for Rego, and the input document is just structured data; in our case, it's a YAML file. So anything beyond the input dot should look really familiar: we see kind, we see metadata. The input dot and anything after it is just dependent upon the data that you give it. It could be Terraform, it could be a Dockerfile, any sort of structured data. So again, it's a very generic language, and it's immensely powerful.

And so now that we have this Rego file, we need a way to actually determine whether a document we give it would be in violation. And there are a few ways to do this, right? We could read the input document and the Rego to James and he could verify on a case-by-case basis whether or not the document is in violation, but we're all about automation. So, as briefly mentioned before, we decided to go with the Open Policy Agent. And just look at the logo; of course we did. There's no reason not to choose OPA, it's a work of art. But no, really, the community is great. Again, there's so much adoption around OPA. It's been a real joy to leverage OPA and to work in their Slack channel. Everyone's super friendly, there's always someone there to help you out. It was a really good choice for us.

And so how OPA works as a service: you deploy it somewhere, be it Kubernetes, be it a web server, wherever you want to put it, as long as you can get a web request to it. You also include your Rego files with that deployment so OPA knows which policies you want it to enforce. Then you submit an input document, again, be it a Kubernetes manifest, be it a Dockerfile, anything you want, give it to the OPA service, and it will validate the document, run through the policies, and tell you whether the document violates any of the policies that you have loaded into it. There's also the really nice fact that it can take external data. You can see here in the bottom right the data document; that's external data, and it can come from any number of sources. So if we build on the previous example, where all namespaces must have an owner label, we could also add a policy that says teams can only own a single namespace. In order to do that, we would need some form of external data, in this case a count of how many namespaces each team has already created. So when we create namespaces and when we delete namespaces, we can keep track of that number, give it to OPA, and then OPA can use it when it's evaluating all of its policies.

And so while that was an example of using OPA as a service, you can actually use OPA as a library, which makes it so much easier to enforce policies. There are a lot of tools out there that will take the OPA engine, import it as a dependency, and then run the same checks that OPA itself would.
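Coming back to the external-data example for a moment, a rough sketch of how such a policy can reference the data document (the data path and its shape are made up for illustration):

package main

# data.namespace_counts is assumed to be external data loaded into OPA,
# e.g. {"team-payments": 1, "team-platform": 0} -- the shape is illustrative.
deny[msg] {
    input.kind == "Namespace"
    owner := input.metadata.labels.owner
    data.namespace_counts[owner] >= 1
    msg := sprintf("Team %v already owns a namespace", [owner])
}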
And so we really wanted to leverage that library-style functionality in order to shift our policy enforcement to the left, because we quickly realized that when we deployed OPA, while we had this policy enforcement and policy validation, our engineers didn't really know whether the policies and manifests they were writing would be in violation until they actually deployed them. So we really wanted a solution that we could give them to put on their local machines and just run, without having to worry about any of that silly networking stuff. And so we found a tool called Conftest. Conftest, more or less, lets you run OPA on your local machine. You can do that with OPA itself, but Conftest just makes the experience so much better. Like I said before, you can give any input document to OPA and it doesn't really care; well, that's only half true. It still doesn't care, as long as it's YAML or JSON. Conftest, on the other hand, can take a lot of different file formats. You can see here INI, TOML, HCL, and it can convert those file formats into JSON and then shepherd them into OPA, so you can actually use any of those formats with OPA. There are also some niceties, like being able to take in files from different folders anywhere on your machine and printing the results in a user-friendly way. So Conftest is for local policy validation and for pipeline validation of all of your manifests, and anything beyond that would be OPA deployed as a service for continuous enforcement.

And you can see here, this is another example of a policy file on the left. This time we're looking at deployments, to make sure they're not running as root and that the deployment has an app label for its pod selectors. And on the right we have our input document, the deploy.yaml itself. If we actually run this through Conftest using the test command, all you have to do is pass in the policy.rego as well as the deploy.yaml, and it will take that deploy.yaml and check whether that input document violates any of the policies we've set forth. In this case, there are two policies that were violated. And we can take this same policy and apply it to OPA as a service and get the exact same behavior, so we're checking both the local environment as well as the production environment. And again, this can be run on local machines and in pipelines, and you get that immediate feedback.

There's also the benefit with Conftest of being able to share policies across the organization, because it's a valid use case that your policies are managed and written by a completely separate team, a security team, much like they are at Yubico. We have a lot of different repositories, we want to make sure that our policies are being enforced in all of them, and otherwise there isn't really a good way to distribute those policies short of trying them out in the cluster. With Conftest, you can actually push policy bundles to an OCI-compliant registry and then pull them down for later use. You can see here in the example that we're pulling down a bundle of cluster policies, then running the Conftest test command locally with the bundle we just pulled against that same deploy.yaml, and the result is the same. So this is a way to push policies out there and then pull them down for cross-team usage. This is really huge in pipelines, or in other kinds of approaches where you need to bundle together a lot of different policies.
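A rough sketch of the two deployment checks and the Conftest commands being described (the file names and registry path are illustrative):

# Run locally or in CI, e.g.:
#   conftest test deploy.yaml --policy policy.rego
# Or pull a shared bundle first and test against it:
#   conftest pull <your-registry>/policies/cluster:latest
#   conftest test deploy.yaml
package main

# Workloads should not run as root.
deny[msg] {
    input.kind == "Deployment"
    not input.spec.template.spec.securityContext.runAsNonRoot
    msg := "Deployments must set runAsNonRoot in their pod security context"
}

# Deployments should select their pods via an app label.
deny[msg] {
    input.kind == "Deployment"
    not input.spec.selector.matchLabels.app
    msg := "Deployment selectors must include an app label"
}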
And so Artifact Hub is an attempt to expose a lot of different policy bundles, because more or less the policies that we really care about, things like containers having resource constraints or containers not running as root, should be roughly the same everywhere. But we're kind of in a state now where teams, companies, and organizations are all writing the same policies, and we don't have a good distribution mechanism; Artifact Hub is, again, the solution for that. It's currently a sandbox project, and there aren't a whole lot of policy bundles out there yet, I think there's like one or two. But I would definitely keep an eye on this and contribute where you can. If you do have a public bundle that you want to contribute, I recommend pushing it up there; let's try to make this successful so we can actually start sharing policies with one another and not have to rewrite them over and over and over again for every team that wants to use policy.

And so a lot of the conversation we've had so far has been about policy enforcement in local environments, but we also need to make sure that the policies are enforced in production. Once you get beyond the local environment, we decided to go with a tool called Gatekeeper. Gatekeeper lets us enforce these policies continuously inside of a Kubernetes cluster. It's an admission controller, so when you deploy a resource to Kubernetes, it'll go through the admission controller, which looks at all the policies that you have loaded, and it will either reject the resource you're attempting to add to the cluster or let it through. It also has some audit functionality, meaning it will continually audit your cluster for resources that violate policy. That's useful in the case that you're just starting to adopt Gatekeeper and you want to see, if you were to actually enforce a policy, how many resources would be out of compliance, or just in the case that Gatekeeper went down for a little bit and you want to make sure that there wasn't a resource that got in during that small window. And again, the huge win here is using the same policies: whether you're checking locally or in production, we can continuously use the same policies, and they don't change from environment to environment.

With that said, it's almost the same policies. So we go back to the input document. When you're working with YAML, the input document is going to be a YAML file, so we would expect input.kind, input.metadata, just like we saw before. But when we're in the context of admission control, it's going to be a little different. I bolded input.review.object here because this is what an admission review is going to look like to Gatekeeper; this is the document that OPA is going to receive. So again, it's not quite the same. You would actually have to write a policy that had input.review.object in it, but that would only work for Gatekeeper, and if you did just input.kind, that wouldn't work for Gatekeeper. So what we ended up doing was really adopting this idea of Rego being the source of truth for your policies, because it really is. Rego is generic. It doesn't matter what context you're in; it's really all about whether this input document that you're giving me violates this policy that you have defined. And so everything that's based off of your Rego should adjust to changes in your Rego, and not the other way around.
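To make the difference concrete, here is the same owner-label check written against both input shapes. This is just a sketch; in practice these would live in separate policy sets, and Gatekeeper's real violation rules also wrap the message in an object, which is omitted here to keep the focus on where the fields live.

package namespace_owner

# Conftest: the manifest itself is the input document.
violation[msg] {
    input.kind == "Namespace"
    not input.metadata.labels.owner
    msg := "Namespaces must have an owner label"
}

# Gatekeeper: the manifest arrives wrapped in an admission review,
# so the same fields sit under input.review.object instead.
violation[msg] {
    input.review.object.kind == "Namespace"
    not input.review.object.metadata.labels.owner
    msg := "Namespaces must have an owner label"
}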
And so to solve this problem, we actually wrote a tool called Konstraint, and Konstraint brings three really important things to the table. First and foremost, it provides a library so you can write policies that work with both Conftest and Gatekeeper. It's really just a wrapper, a polyfill that normalizes the input: if you're in the context of Gatekeeper, your policies are evaluated against input.review.object, and if you're in the context of Conftest, it's just input. It handles all of that for you, so your policies can be completely unchanged no matter what environment you're running in.

The other benefit is template and constraint creation and management. On the previous screen, the Rego was actually embedded into the ConstraintTemplate, the YAML, and that's because that's how Gatekeeper loads in your policies; it's done through a YAML file. So you're going to have a Rego file on disk, and if you were to change that Rego file, you would also have to copy and paste your changes into that YAML, which isn't the most ideal situation. What Konstraint will actually do is look at all of your Rego files and generate the template and constraint for you. So you never have to touch YAML; you're focused purely on your Rego files. And then lastly, it will generate documentation for your policies. We really wanted to give our engineers the ability to see what policies were being enforced, as well as how they can resolve a policy violation if they ever run into one.

When it comes to the policies themselves, they'll look relatively familiar. The biggest difference here is definitely the comment header. We added some metadata to the policy in the form of a header comment: the title is the title of the policy, here we're saying images must not use the latest tag, and then there's why this policy exists, or really any other flavor that you want to give the policy, be it a description or anything else. The enforcement type here is deny; the alternative is dry run, in case you just want to test out the policy in your cluster and not actually do any enforcement. And then there are the kinds, the Kubernetes resources that this policy will be enforced on, be it just Ingresses, just Namespaces, workloads, et cetera. So we define here a list of resources that this policy is enforced on, and then we use this metadata to generate all the other YAML that we were previously talking about (a rough sketch of such a policy file follows below).

And in the violation itself, you see we're importing two libraries from the Konstraint library. The important note here is pods.containers: our pods library will actually look at containers from any possible source, because pods can come from CronJobs, they can come from Deployments, StatefulSets, DaemonSets. There's a large list of those, and they're all embedded in Kubernetes resources differently, so the pods library handles that for you. Then we can take the resulting container that comes out of that and ask: does that container have an image tagged latest? And if it does, we give a note to the user saying that images must not use the latest tag. And again, we have two commands for this: one to generate the template and the constraints, and one for the markdown describing our policies. And then this is an example of what the documentation looks like. You can see here the ID P1001.
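Pulling the header and violation pieces together, a Konstraint-style policy file looks roughly like the sketch below. The annotation names and library helpers (data.lib.core, data.lib.pods, pods.containers, core.format) follow Konstraint's documented conventions as best we can reproduce them here and may differ between versions. The two commands mentioned are konstraint create, which generates the ConstraintTemplate and Constraint YAML, and konstraint doc, which generates the markdown.

# @title Images must not use the latest tag
#
# Pinning image tags keeps deployments reproducible, so the latest tag
# is not allowed on any container image.
#
# @enforcement deny
# @kinds apps/DaemonSet apps/Deployment apps/StatefulSet core/Pod
package container_deny_latest_tag

import data.lib.core
import data.lib.pods

violation[msg] {
    # pods.containers yields containers regardless of the owning workload kind
    pods.containers[container]
    endswith(container.image, ":latest")
    msg := core.format(sprintf("%s: images must not use the latest tag", [container.name]))
}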
We also embedded something into the tool so that you can actually assign an ID to a policy and refer back to it instead of just using the title. The severity comes from the severity of the rule, because in Rego you can have warn, you can have deny, you can have violation, and based on the rule name, different things will happen. So we pull the severity out of the Rego and put it in the document for you. Same with the resources that it impacts, and the description that explains what this policy does or why we have it. And then there's the Rego itself, if you want that information; it's completely configurable. If you change your Rego, again, all the documentation is regenerated; none of this is hand-typed. You're also given a link to the source, and that source can either be a relative link, if the policy lives in the same repository as the documentation, or a full URL if it's in a remote repository. So we've gotten a lot of use out of this tool; we use it almost every day, and it's really ingrained into our workflows when it comes to Gatekeeper and policy. If it sounds like something you'd want to use, we're always open to talking about it, and contributions are always welcome. And that's really all I have for the tooling piece. James is going to go into deeper detail on how we've actually leveraged these tools and policies in our pipelines and processes.

Thanks, John. Now that we have an understanding of how the tooling works together, let's dive into Yubico's policy journey. In the beginning, since we didn't have too many workloads on Kubernetes yet and we were sure that we had consistent peer review happening, we started with a simple plan, because we didn't anticipate that too many resources would violate these policies. We would start by writing the policies and their tests, after which we would engage with our services teams to add these checks to their CI flows using Conftest. Simultaneously, we would be working on deploying Gatekeeper to our clusters in audit-only mode. And then finally, after we had worked with the teams to remediate all of the identified issues, we would flip the switch and move Gatekeeper into enforcement mode.

As you might have expected from how I framed the previous slide, this plan ran into some issues. We made it as far as writing and testing the policies, but when we moved to engage with some select teams to add Conftest to their CI flows, the issues became pretty readily apparent. For one, there were more violations than anticipated, which meant that there was potentially a very long window between when we identified these resources and when we would actually be able to apply remediation across all of our clusters. This was compounded by the fact that we were not only migrating existing workloads to Kubernetes, but also in a growth period, hiring and starting up some new services on Kubernetes as well. Additionally, since the policies were essentially always in production, there was no way to safely test new policies or changes to existing policies. This meant that in a CI flow with Conftest, the CI flow would fail immediately if a violation was found, and similarly, Gatekeeper would just reject changes to the cluster if they didn't meet policy. Adding Conftest to our CI flows also wasn't as easy as it could have been. For one, it required our services teams to know the Conftest flags and how the tool worked.
Additionally, it took multiple steps to actually get the policies from a remote source into the repo and then run the tests. But most importantly, the Conftest results weren't surfaced to the teams working on the resources. This meant that unless there was an actual violation, no one would even know which tests were run, or anything like that. This was especially important to us because we do have some policies that we've labeled as just warnings, which aren't blockers for deployment, but are a way for us to communicate to our teams that the way they have a resource configured might not be best practice. Finally, policy admins such as myself had no visibility into the test results, so we didn't really have any way to track which repositories or which teams had started using the policies, or any of the results, like which policies were making the test runs fail and which were just emitting warnings.

With these pain points identified, we determined that there were essentially two things we needed to do. We needed to build a policy pipeline, and that pipeline would ensure that the policies are safe to enforce throughout our clusters, and we had to make policy adoption as easy as possible. If either of these weren't true, there was a pretty good chance that our policies wouldn't be adopted, or it would just be a long uphill battle trying to get our teams to adopt them.

So first, let's focus on what we did to make policy adoption as easy as possible. To do that, we created two GitHub Actions. The first is a wrapper around Conftest itself, and it addresses some of the pain points from the previous slides. It automatically pulls the latest policies from a remote source, it surfaces the violations and warnings in pull request comments so that the teams working on the resources can see the test results, and it submits the results to a remote server so policy admins can monitor the deployments.

The second addresses an issue that we learned about later: some of our teams had started to adopt Flux CD for continuous delivery and were using its Helm operator. The Helm operator has a custom resource that lets you specify the Helm chart source and the values you want to apply to that chart, and then it will go and fetch everything needed and make those changes in your cluster for you. However, since Helm templates are just that, templates, the HelmRelease doesn't contain the rendered resources, and we couldn't use Conftest on it directly. That's because the data structure in the YAML is different, and even if it were the same, with templates you wouldn't actually have everything you needed until after execution. So what this action does is parse the HelmRelease resource, pull the chart info, the version, all of those things, automatically set up a Helm repository so that it can pull the templates from the remote repository, and then execute the templates. With this, we can easily template out these resources before we run Conftest, so it's nice and easy and a solid flow. This makes it easy for our developers because they don't have to remember all of the flags for the helm template command, or the argument order, or anything like that. One thing to note here is that this currently only supports public Helm repositories, but we're working on adding support for private repositories too. So with the ease of adoption addressed, let's move on to the policy pipeline itself.
Early on in the design of the policy pipeline, we had this one rule: we must never break production. And for us, that also includes any of the pipelines leading up to production. This means that for teams that have automated deployment from development to staging to production, using custom metrics, breaking their access to the development cluster is the same as breaking their production pipeline. In order to accomplish this, we used two main methodologies. The first was data-driven policy promotion, and the second was a GitOps deployment flow.

So diving into our policy promotion strategy: we wanted to tackle one policy at a time, which, again, just reduces the window during which new resources that violate the policy can be introduced while you're working on remediating the ones you already know about. Another key component is that we use Gatekeeper's enforcement action property to introduce policies in dry-run mode, where they are not enforced. However, while they're in this mode, we can still use Gatekeeper's audit functionality to audit the resources in our clusters and see which of them violate the policies. We made the decision to only switch to enforcement mode after we've confirmed that all of the offending resources have been remediated, and as a side effect of this, we've avoided setting hard deadlines for our teams to update their resources. One thing we've done to work around that is, in the case where a team says they really can't get something fixed in the next couple of weeks, we can add temporary exceptions to the policy for those specific resources in specific namespaces, but only when necessary. That way we still have good coverage, with 99% of the resources in our clusters adhering to the policy. And finally, we only promote policies to production when they're linked to a change management ticket, which allows them to be scheduled. This of course ensures that all the potentially impacted parties are aware of the upcoming change, and it lets us schedule around times when we may be doing a feature release, a product launch, or something like that.

Moving on to our GitOps deployment flow. It's a pretty standard approach where we use pull requests to move policies through the pipeline. We use branch protection rules to ensure that peer review occurs and that all of our unit tests pass. A couple of things we're doing there: we require that each policy have a unique policy identifier, and we require that each policy have at least two unit tests. One is for the positive path, where Gatekeeper blocks something that we expect it to block, and the second is the negative path, where Gatekeeper allows through something that we expect it not to block (a sketch of what those tests can look like follows below). We also take advantage of the GitHub code owners feature, which ensures that the peer reviews come from policy admins, because the policy admins are the ones most familiar with the Rego language and some of the more intricate details of Kubernetes. And what's probably the most important point is that we use automation to create the Gatekeeper and Conftest resources. This means that reviewers can focus on the policy and not the Gatekeeper resources, those giant YAML files. It also makes it really obvious if someone is attempting to go around the tooling in place, because no human should ever be modifying the Gatekeeper or Conftest resources directly.
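As a sketch of that two-test requirement, here is what the positive and negative paths can look like for the namespace-owner example from earlier, runnable with opa test or conftest verify (names and messages are illustrative):

package main

# Positive path: a namespace without an owner label should be blocked.
test_namespace_without_owner_is_denied {
    deny["Namespaces must have an owner label"] with input as {
        "kind": "Namespace",
        "metadata": {"name": "payments"}
    }
}

# Negative path: a namespace with an owner label should be let through.
test_namespace_with_owner_is_allowed {
    count(deny) == 0 with input as {
        "kind": "Namespace",
        "metadata": {"name": "payments", "labels": {"owner": "team-payments"}}
    }
}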
With that, let's go through the life of a policy in this pipeline. Here we have a high-level view of how a policy flows. We start by introducing a policy into the dev branch through a pull request, and we want to make sure that the policy always starts with the enforcement action set to dry run. After it's merged, Flux will automatically pick it up and start syncing it to the development clusters. Next, we promote it to the staging and production branches, still in dry-run mode. This is to ensure that we have full visibility across all of our clusters, because try as we might to keep all of these clusters configured identically through development, staging, and prod, there's almost inevitably some variation in how they're configured. One thing to note is that when we merge into the production branch is when we actually generate the Conftest resources. And one thing we do there, since Conftest doesn't have the concept of an enforcement action, is parse the enforcement action from the comment header and inline-rewrite any violation or deny rules to warnings, so that when they're used in CI flows, the warnings are still surfaced to the teams working on those resources, but they don't fail the CI flows yet.

So once we have all of this set up and running, we then use the Gatekeeper audit data as the source of truth to identify existing resources that violate the policy, and we open tickets to work with our teams to remediate the issues we've identified, or to add exceptions where necessary. After we've ensured that those are all remediated and the audit data shows it, we switch the development clusters, via the development branch, to actual enforcement. This is done by changing the enforcement action in the header from dry run to deny. Once again, after we make this change and merge to dev, we continuously monitor the Gatekeeper audit data to make sure there isn't anything we didn't expect. Finally, once we've done that, we follow the same flow to promote to staging and then production. It's worth noting again that when we make this change to production is when the Conftest policies will actually be changed back to either violation or deny, so this is the point at which CI flows will start failing if they have resources that violate the policy. The Conftest metrics are useful for us to see what's going on across our organization and to have a fuller picture, especially of newer repositories and newer projects that haven't actually made it into a cluster yet, but they are not the source of truth for when we promote policies.

So where is Yubico now on its policy journey? Well, we have Gatekeeper deployed to all of our clusters, and we're tracking the resources that violate our policies. Additionally, we're making our way through each policy and moving each to enforcement as we go. We've noticed that in a lot of scenarios, these policy violations are actually introduced upstream, whether that's an internal or external project, and whenever we run into that, we try to make our changes upstream as well. Looking into the future, though, there are a few ways we can work on making this better. The first is to enable our teams to write their own policies for enforcing their own specific best practices. This can be anything from a custom label to requiring that every deployment have a horizontal pod autoscaler attached.
Additionally, we're looking into syncing data from outside the cluster into Gatekeeper to be used for more informed policy decisions. For example, we might want to sync our on-call rotation schedule into Gatekeeper so that only the person who's on call for a given team can make changes to production while a production issue or outage is occurring. Looking even further out, we're considering adding mutating admission controllers as well. Those are a different type of controller that the API server can work with, which, rather than just rejecting a resource if it doesn't meet policy, can change defaults inline as needed. One thing to note there is that Gatekeeper currently has an open design for their implementation of that. They haven't started on it yet, so if you want to add your thoughts about how it should be shaped or anything like that, go ahead and search the OPA Slack for the mutation design and it'll come right up. And that about wraps it up. Thank you everyone for attending, and we're going to open it up for questions.