Thanks, everyone, for being here today in this quote-unquote advanced track for Kube Day. I'm Naveen, and I'm going to be talking about securing your Kubernetes empire with Open Policy Agent and Rego. But before we get started, a quick show of hands: how many people here know or have heard of Open Policy Agent? Oh, perfect, lots of folks. How many have heard of OPA Gatekeeper? Perfect. And how many actually use OPA Gatekeeper in production today? OK, much fewer hands. Perfect, so we can get started. I'm Naveen. I go online as Mad Max. I did my master's at IIIT Bangalore, and I currently work at a company called OpsVerse. I spent the first part of my career purely on the developer side of things, and now I've moved to the ops side, where we build ops tooling. So I keep telling my CTO, who's a career DevOps guy, that I spent the first half of my career yelling at my ops guys for being too slow, and now I spend the second half yelling at my engineers for being too fast and not stable enough. A quick thing about OpsVerse: we take open-source tool chains and build managed stacks out of them. We have an observability stack, a deployment stack, Aiden, which is a generative-AI-based DevOps copilot, and we also do a couple of managed solutions. If anyone's interested in chatting about this later, happy to talk. So let's quickly get into the agenda. The goal of this talk is to cover a little bit of policy as code. We'll talk about OPA, we'll talk about Rego, and we'll talk about how to use OPA. Since lots of folks raised their hands, we'll move through that part quickly. After that, we'll focus on the meat of the topic, which is OPA Gatekeeper and Kubernetes and how we actually use it, both within our company and externally. We'll go through a few use cases.
Along the way, we'll stop at two points to go through some code and do a few demos. And at the end, we'll touch a little bit on best practices for policy as code. So first things first: how do you ensure security and compliance requirements today? This is something I've always wondered about, especially when I was on the dev side of things, because it was a completely opaque process for me. It's always been a process where you go to the ops guy and raise a request; in a bigger company, you'd raise a JIRA. You say, this is the access that I want, this is what I want my application to be able to access. And then a few days later you get an update saying, hey, it should work, can you test it now? That opaqueness has always been painful for me, especially coming from the developer side. And then when I started working on the ops side of things, it felt very haphazard as well. So this is where policy as code comes in as a concept. Anyone who's worked with anything "as code", star-as-code as they call it, will understand why policy as code matters. It gives you a centralized place for your policies. It allows you to shift left; I don't love that phrase, it's a little too buzzwordy in my opinion, but basically it means you move your compliance closer to where the development is happening. You get way more visibility, especially if you integrate with Git: everyone can see what your policies are, where they're set up, and when they're updated. Both your ops and your dev teams can work together to create these policies, figure out issues, and come up with better ones. And it's much easier to integrate with CI/CD.
And anyone who's worked with ops teams knows that CI/CD becomes a superpower when you're talking about policies. And then finally, version control. You set up a policy, it doesn't work, and reverting it is as simple as a git revert. You want historical evidence of what your policies were six months or a year back? Very simple to get. Especially if you have a GitOps kind of workflow, policy as code gives your team and your external stakeholders complete knowledge of exactly which policies are in force. So this is where OPA, Open Policy Agent, comes in. It's an open-source project, which is quite obvious from the name, and it's CNCF-graduated, back in 2021. It does a very simple thing, in my opinion, but the way it does it is quite nice. It says: let the policy decision happen in one place and the action somewhere else. Your services do the action, but all the policies are stored in a single place, and OPA does the policy evaluation for you, running the policy and coming up with a decision. So you have a request, you have some data, and you have your policy rule. All OPA does is say: based on this data, this request, and this rule, here is the decision. Your service takes care of the rest. It runs pretty much anywhere, which is another cool thing about OPA. And the nice thing is that it can be run both as a service and as a library. To touch on that a little: a very common mode of running OPA is to run it as an HTTP service next to any other service that you have.
It will make an HTTP request, along with its request query, to the OPA service, which runs it through the policy, which is written in Rego (we'll touch on that), plus your data, and returns a response that your service can act on. In some cases, especially if you're very latency-sensitive, making an additional HTTP request is something you want to avoid, in which case you can also run OPA as a library. Ideally, if you work with Go, that's easier; otherwise there's a Wasm runtime you can use from other languages, where you compile your Rego into Wasm and run it inside your service. So let's touch a bit on Rego, the language. It's declarative and inspired by Datalog, but it builds on Datalog quite a bit by understanding commonly used formats like JSON. It's declarative in the sense that it reads much more like SQL than like, say, C or C++. And it's made up of assertions that determine policy decisions. The nice thing about Rego is that, like SQL, it's very, very easy to get started with, and we'll go through a quick demo of that. But, yes, it can get unwieldy over time, so we'll talk about some best practices later on as well. So let's do a quick demo of Rego, and we'll take the trademark example. For anyone who hasn't played around with Open Policy Agent, the playground they have is a really cool place to test stuff out; it lives at play.openpolicyagent.org. So this is a simple Rego policy where all we're saying is that if one of the user's roles is admin, we let them through. And then we have a set of data where you have a user Archana, you have Bharat, and you have Charu.
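To make the playground walkthrough concrete, here is a sketch of the policy from the demo. The package name and the exact shape of the data are my reconstruction, not the literal slide:

```rego
package play

# Deny by default: nothing is allowed unless a rule below proves otherwise
default allow = false

# Allow if one of the user's roles is "admin"
allow {
    data.user_roles[input.user][_] == "admin"
}

# Second playground example: allow employees who are not also contractors
allow {
    data.user_roles[input.user][_] == "employee"
    not is_contractor
}

is_contractor {
    data.user_roles[input.user][_] == "contractor"
}
```

Assuming data like `{"user_roles": {"archana": ["admin"], "bharat": ["employee", "contractor"], "charu": ["employee"]}}`, an input of `{"user": "archana"}` evaluates `allow` to true, while `{"user": "bharat"}` leaves it false.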
And you have Archana as an admin, Bharat as an employee and also a contractor, and Charu as an employee. So let's take a very simple example. If a request comes in and the user is Archana, it evaluates to true, which says: allow this user to access whatever resource they're trying to access. So if you imagine you have a service, or multiple services, where you want only admins to be allowed in, or only admins to access a particular API, then a simple Rego policy like this will do that for you. In the same way, if you say Bharat, the allow automatically becomes false, because per the policy Bharat is not an admin, hence he doesn't have access. The nice thing about Rego is that you can build on top of it with different rule blocks. So here's another example: allow if the user is an admin, or if the user is an employee but not a contractor. This is a very simplistic example; for this use case you could probably just check that the user is not a contractor, but just to show it: admin, or employee and not contractor. In this example, Bharat should not have access and Charu should. So you see the allow is false for Bharat, and if we go to Charu, the allow is true. The declarative nature of Rego makes it very simple to build up slightly more complex use cases from these simple building blocks. And Rego can go much further: here's a more complex example that does something like JWT decoding.
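The JWT case leans on Rego's built-in token helpers. A minimal sketch; the claim name and the shared secret here are placeholders I've invented, not values from the talk:

```rego
package app.jwt

default allow = false

allow {
    # io.jwt.decode_verify checks the signature against the given
    # constraints and returns [is_valid, header, payload]
    [valid, _, payload] := io.jwt.decode_verify(input.token, {"secret": "replace-me"})
    valid
    payload.role == "admin"
}
```

For RSA-signed tokens you would pass a `"cert"` constraint instead of `"secret"`; either way the verification stays inside the policy rather than in every service.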
And this is something practically every one of us has done: you have a JWT token, and at some point you have middleware in your API where you capture that token and try to answer, is this person allowed to perform this action? The nice thing about Rego and OPA is that there are helper functions that do things like JWT verification very, very simply for you. To dig a little deeper into the options you have: there are a bunch of built-in functions, whether you're working with numbers, aggregates, HTTP, GraphQL, or time-related functions. So, for example, if you want to say, I only want to grant access to this resource if a user has been in my system for more than 30 days, it's quite easy to express policies like that without having to manage it completely separately. And because Rego policies are just text files, you can always commit them to Git, have your GitOps workflows, integrate with CI/CD, and do a bunch of stuff around them. So let's get back to Kubernetes. Like I said at the start, we've spoken about OPA so far. But obviously it's Kube Day, so you care about Kubernetes. The nice part is that OPA and Kubernetes fit together like a match made in heaven. Earlier, when people wanted to use OPA with Kubernetes, the standard workflow was to run OPA as a sidecar container and have your main service call the OPA sidecar for whatever it needed. That's a little messy: managing a sidecar is possible, but it's messy. At the same time, how exactly do you manage your Rego policies? Do you bring them in as ConfigMaps? Do you manage them as Secrets, because in some cases you want parts of your Rego to be secret?
So managing all of that becomes problematic. This is where a newer project called OPA Gatekeeper came in. OPA Gatekeeper basically brings all the goodness of OPA to the world of Kubernetes. What exactly does that mean? It means it's completely Kubernetes-native, and we'll dig into that a little more. It's backed by Open Policy Agent, which means that if you have Open Policy Agent running outside of Kubernetes and you want to use exactly the same policies for your services inside Kubernetes, you can do just that. The thing I personally find most useful, especially when talking to people who are just starting their compliance journey, is that it comes with a well-built library of common policies you can use as a starting point, and we'll go through that in a little bit. The reason this pre-built library of common policies is so useful is that, just as people say "don't roll your own auth", I always suggest you don't roll your own compliance policies unless you know what you're doing. It's very easy to start off building a compliance policy that ends up either too open or too closed, which leads to a lot of work later for the ops team or the dev team, and makes things unbearable on both sides. That's why the pre-built library of common policies is so valuable. Installation is very, very straightforward: a simple kubectl apply (there's a 3.14 joke in the version number if anyone's interested), or Helm, which I'm sure a lot of us are fans of; it's literally two commands and you have Gatekeeper running in your cluster. It runs as a deployment with replicas.
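For reference, the "two commands" look roughly like this; the chart location is from the Gatekeeper docs, and in real use you'd pin a chart version:

```shell
# Add the official Gatekeeper Helm repo and install into its own namespace
helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
helm install gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system --create-namespace
```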
So in case you're running a large cluster and you have many more policies to evaluate, you can always scale up the replicas as you need. It makes it very, very easy to get Gatekeeper up and running. So what exactly does this installation do? When you run the Helm install or the kubectl apply, Gatekeeper creates a bunch of Kubernetes resources. It creates the controller pods, and an audit pod as well. You get a validating webhook and a mutating webhook, and a bunch of policy CRDs get created. Now, because you have a validating webhook and a mutating webhook, any time you create or edit a Kubernetes resource, that becomes a policy point for you, a point where you can ensure compliance happens. What does that mean? It means I can bring certain rules into my system irrespective of whether my pods are set up via kubectl apply, or Argo CD, or any other deployment framework, or a Jenkins job for all it matters. It becomes a standard interface where compliance checks happen. The nice thing about the way Gatekeeper works is that it gives you much more fine-grained control: you can say things like, I want this policy to apply to pods of a certain type, in a specific namespace, following certain rules, which was a little harder with the earlier sidecar-based approach. And we'll go through some examples of exactly how this works. So, we've spoken enough about how Gatekeeper works; let's see some of it in action. Here I have a cluster with Gatekeeper running on it.
OK, this is why live demos are never cool. It looks like the demo may not be working, so let me walk you through what it would have shown. We have SudoCorp, which has a bunch of policies. A simple one: only use the internal registry for your images. We don't want docker.io, we don't want quay.io, we don't want anything else; we only want internal registries for the images running in our pods. The second: every Kubernetes resource should have a team label. Everyone who's part of a cost center in their company knows why this is critical, because all of us have Kubernetes clusters with a bunch of resources that no one knows who owns. You're paying for them, they're taking up space and resources, but no one really knows who owns them, and you don't want to terminate them because they may be running something critical. The third: every pod should have explicit resource limits. Again, this comes from personal experience: someone sets up a pod with no limits and no requests, which means that over time a memory leak ends up taking all the capacity in your cluster, and at some point it's taking down other services running there, and you don't know why. And finally: no container should run as root. I'm not even going to go deep into that one; it's quite obvious why we want it. Now, in the earlier world, with no OPA and no OPA Gatekeeper, enforcing this was always harder, because the way you'd do it would be through some sort of audit of all your resources. You'd spend some time.
You'd have someone from the ops team go through all your Kubernetes resources manually. There are some automated tools available as well, which run once a week or once a day to figure out what's happening. Then you try to figure out who owns things. For something like the team-label rule, you start hunting down which team most likely owns the service, which most likely owns the pod. You go to them, ask whether they can add a label. They say, yeah, we'll do it, and then they don't really do it, and you go back and forth and back and forth. I've been on both sides of this table, so I understand the pain on both sides. With OPA Gatekeeper, this becomes very straightforward. Gatekeeper introduces something called a constraint template, and you can set up a bunch of constraint templates in your system. The thing at the bottom of the template is the Rego. So if you look at the bottom here, it's very simple: it says "internal registries only". It takes all your containers and says that each image used has to belong to a specific repo. Note that this is standard Rego, and we're not telling it what the repository is here, because this is a template at the end of the day. As we mentioned at the start, OPA works with your policy, which is Rego, plus some data that comes in. With Gatekeeper, the constraint template carries the Rego, and then you separately provide the data that any request has to be validated against. You create something like this.
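Reconstructed from the Gatekeeper documentation, the template/constraint pair for the internal-registry rule might look like this; the kind name, parameter name, and messages are my guesses at what was on the slide, not a verbatim copy:

```yaml
# ConstraintTemplate: carries the Rego and declares a new constraint kind
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: internalregistries
spec:
  crd:
    spec:
      names:
        kind: InternalRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            repo:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package internalregistries

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not startswith(container.image, input.parameters.repo)
          msg := sprintf("image %v is not from %v", [container.image, input.parameters.repo])
        }
---
# Constraint: the "data" side -- what to match, and the parameters to check against
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: InternalRegistries
metadata:
  name: pods-internal-registry-only
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["default"]
  parameters:
    repo: "registry.sudo.com"
```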
So if you notice, in the previous step we created a constraint template and gave it a kind called InternalRegistries. That's a custom resource definition we're creating. Later, when you want to create the data you validate against, you create a resource of that kind. So here I'm creating a resource of kind InternalRegistries, and I tell it exactly what I want to match: I'm matching on pods, in the default namespace, and in the parameters I say the repo is registry.sudo.com. If you look back at the template, it reads input.parameters.repo, so this is what it validates against. The whole point is that once the constraint template and the constraint have been created, every API call that matches the constraint gets validated. So this is how the workflow would look. I wish the demo had cooperated, but we can still walk through the examples. Say you had a pod defined like this. Let's just say that, for simplicity, it already has the security context set, which means running as root is not allowed. Then we have the other rules, the three policies set up on this cluster. One says we don't use docker.io or any other external registry, so this pod would fail that. The second says any pod cannot take more than 200m of CPU, which, again, is a deliberately tight limit, and in this pod whoever set it up, dev or ops, has asked for 500m, so that's a policy violation too. And the third constraint was that every pod should have a team name.
A team name as a label, which this pod doesn't have. So if I tried to kubectl apply this YAML, what would naturally happen is three policy violations: the pod is not going to come up, and it gets blocked right at the entrance. It's going to say you don't have a team label, your image is not from an approved registry, and your resources are above the limit we've set. So this is the simple limits configuration we have: a container limit that says every container should have a limit of 200m CPU and 1Gi of memory; those are just arbitrary numbers. And if you look at the constraint, it's again a custom resource based on a constraint template called K8sContainerLimits. So because the container limit is set to 200m, I'd get the violation. So imagine I ran this against my cluster: it would throw those violations. The fixes are quite straightforward. I'd add a team label, say team: sudocorp-dev. I'd point the image at registry.sudo.com. And I'd reduce the CPU to 200m, or let's say 100m. Only after that would it allow me to create the pod. So that's how this would work. Sorry about the demo; I hate live demos, but this always happens, so we'll just move past it. But basically, this lets you say exactly where your compliance happens. The nice thing about OPA Gatekeeper is also that, because these checks happen at apply time, you can set this up at your CI/CD level as well.
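Putting those fixes together, a pod that clears all the policies might look like this; the pod name, image tag, and label value are invented for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  labels:
    team: sudocorp-dev        # satisfies the team-label policy
spec:
  containers:
    - name: payments-api
      image: registry.sudo.com/payments-api:1.4.2   # internal registry only
      resources:
        limits:
          cpu: "100m"          # within the 200m CPU limit
          memory: "512Mi"      # within the 1Gi memory limit
      securityContext:
        runAsNonRoot: true     # no containers running as root
```

Only a manifest shaped like this would make it past the admission webhook.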
So that means every time you create a new YAML that you want to apply, whether via Argo, or Jenkins, or Tekton, or whatever other workflow you use internally, that becomes a compliance checkpoint for you. The nice thing, again, about OPA Gatekeeper is that it allows you to work in two modes. The first is audit mode. For a lot of us, I think compliance is something that comes very late in the journey; it comes up only after someone says, hey, we need to do XYZ compliance. And trust me when I say this: a bunch of folks working even in very sensitive domains, fintech, health tech, with all the compliances in the world, the SOC 2s and the HIPAAs and so on, when it comes to cluster compliance, they've always struggled and always come up with workarounds for how to manage it. The problem with compliance is that it's always hard to do on a post-facto basis; you want to do it early rather than late. But the nice thing about Gatekeeper is that it lets you have violations flagged but not blocked: it will log violations. So you can say, for example, these are our policies, and right now in the cluster there are these five pods that don't follow them, or these ten images that aren't from approved registries. Then you can follow up with folks to get that changed, and later move to an actual block-or-deny workflow. That segregation between deny and log-only makes Gatekeeper very, very useful. This, again, is something I spoke about at the start: with OPA Gatekeeper there's a bunch of pre-built policies.
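That deny-versus-log split mentioned above is a single field on each constraint, per the Gatekeeper docs (constraint kind reused from the earlier registry example):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: InternalRegistries
metadata:
  name: pods-internal-registry-only
spec:
  # "deny" (the default) blocks admission; "dryrun" only records
  # violations in the constraint's status; "warn" admits with a warning
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
```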
For a lot of common use cases, whether you want to validate requests or mutate requests going to the Kubernetes API, both can be handled with these pre-built libraries. To anyone getting started with OPA or Gatekeeper, I strongly suggest using the Gatekeeper library as a starting point, because it has a bunch of rules that you will have come across, that you will have thought of setting up. And again, the nice thing is it's always open for contributors. I recently had a colleague contribute something to the Gatekeeper library based on work he's doing with an e-commerce company; it turned out to be a much more broadly applicable use case, so it ended up as a contribution to the library. So for anyone looking for open-source contributions who wants to get more involved with OPA but doesn't really want to get into building the core of OPA itself, the Gatekeeper library is a great place to start, because it's primarily about building policies that other people can benefit from. So, let's talk a little bit about best practices. As with anything else, the KISS principle is always your friend. It's always easier to start with a simple policy and build it up over time than to start with something you think is all-encompassing and have it become unwieldy very early on. The second thing: deny by default is your friend. Everyone on the ops side understands why deny by default makes sense, and purely logically it makes sense to have a policy that denies everyone by default; then every additional line of Rego you write is an explicit grant of access to some resource.
It's more of a common-sense thing, but very often we see folks say, I want something everyone has access to, except these three people, for whatever reason. So they build something that allows by default and blocks those three people or three roles. In three months' time it becomes those three people plus one more role, and then a few more folks, and that rule set becomes very unwieldy. So: always deny by default. Even in the examples we showed, deny by default has been what we've gone by. Third, Gatekeeper allows you to log policy decisions, which is very nice because it lets your dev and ops teams do policy replay. The caveat is that if you mutate heavily, or you have multiple mutating webhooks in your workflow, then replaying a request, reconstructing what it actually looked like when it hit Gatekeeper, becomes a little harder. It can still be done, but it's harder. So we usually ask people to consider whether they actually need the mutating webhooks in OPA or Gatekeeper; if you can stay away from them, that's always great. The fourth bit: everything as code. The reason everything as code is great, like I mentioned at the start, is that every team has that one ops guy who manages all the access. We've seen this happen at companies with four people, and we've seen it at companies with 4,000 people: there's always that one person who, if he's on leave, everything comes to a standstill for a day. Maybe nothing critical, but a lot of decisions depend on him. When you move to everything as code, Git takes over as your bus factor. And because Git is distributed, all your decisions are documented. Everyone knows where things are. Everyone knows how they work.
Everyone knows when a policy came in. Everyone knows what it does. And if you integrate with CI/CD, it's very easy to go back and revert anything you want. And the final bit: we have dev, we talk ops, we talk compliance, we talk security, and we do DevOps, SecOps, DevSecOps, AIOps, MLOps, all the opses. But in my opinion, and this is something we've seen more and more as we talk to folks in the industry, they're all just facets of the same problem. We're looking at the same problem from different angles, which I think is very similar to the parable of the blind men and the elephant: we're all touching the same thing in different ways, so we come up with different constraints and different ways of looking at it. The core thing is that visibility is the key. And just as with the blind men, if all of them had visibility into what everyone else was seeing, you'd have a very clear picture of what you're building. At the end of the day, that makes putting these policies in place very, very easy. Using something like Gatekeeper and Rego, and actually having your policies documented, especially in Git or somewhere your dev teams, your compliance teams, and your ops teams can all see what's happening, becomes very powerful. One final point on this: you actually want to move your policy-making one step higher as well. We usually look at policies as something that either devs build out, or ops build out, or the compliance team builds out. But why limit it there? Why can a policy not be written by, say, a product manager, or by someone working deeply on the revenue side of things? We actually want that to happen.
And the nice thing about using systems like Gatekeeper, or OPA in general, is that you can build tools on top. There are tools like OPAL, an administration layer on top of OPA, which let you open up the policy-making process itself a little more. That again reduces the load on the ops team to create policies, and you can bring in other members of the team as well, because at the end of the day you don't want anything bottlenecked on one single person. So, yeah, that's the talk. In case anyone's interested in OPA, there's a very nice repo called Awesome OPA from the folks at Styra, who built OPA. It has almost everything you need, including a bunch of the stuff I spoke about in this talk, so feel free to check it out. The slides are online as well, in case anyone's interested. And any time, if anyone has questions or feedback, I'm available on X/Twitter as Naveen Pai, and at OpsVerse as well. Thank you so much.