Okay, we are going to get started. Hello everyone, and thank you so much for joining us for today's CNCF webinar, Managing Your Policies and Standards. I'm Jerry Fallon and I will be moderating today's webinar. We would like to welcome our presenter today, Ahmed Badran, Chief Technology Officer at Magalix. Just a few housekeeping items before we get started. During the webinar, you are not able to talk as an attendee. There's a Q&A box at the bottom of your screen, so please feel free to drop your questions in there and we'll get to as many as we can at the end. This is an official webinar of the CNCF and, as such, is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of the Code of Conduct. Please be respectful of your fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF webinar page at cncf.io/webinars. And with that, I'll hand it over to Ahmed for today's presentation.
Thank you very much. So we'll get started here. Thanks everyone for joining. I know it's a bit early for some of you. Today we'll be talking about managing policies and standards, and we'll dig deep to see what that means. But mainly we're going to be talking about two key technologies that are open source and part of the CNCF: OPA, with its Rego language, and Gatekeeper. Just a little bit about me before we get into the topic. My name is Ahmed Badran. I'm the CTO at Magalix, a startup that's about three years old now and specializes in governance and operational excellence for people going through the cloud native journey. Prior to that, some time ago, I spent a while at Amazon AWS, back in the day when the whole AWS team used to fit on one floor of one building. Now they occupy a big chunk of Seattle.
I moved between different companies, and it was interesting going through this cloud native journey myself, even before the term existed, which is what many people are going through now as they migrate their legacy infrastructure, systems, and monoliths into this new world. So let's jump in and talk about some of those cloud native challenges, to create the context and background before we start talking about policies and standards and why we should even care.
Think about the old world of the monolith, where you have this single big monolithic app where all your requests come in. Operationally, it has some interesting characteristics. It's simple to deploy, because it's one thing. But it's also a bit of a challenge, because you have a lot of different components of your system in the same application: one part fails, everything fails. It's all or nothing. Then people start migrating to a microservices architecture and adopting DevOps methodologies, and with these changes, and the journey people are going through, come some challenges as well. One of the key things that people sometimes don't realize until they're really into the full implementation and productionization of their new microservices architecture is that as you distribute your architecture and divide the responsibilities, you're also dividing and distributing your problems. You're basically trying to decentralize the decision-making in your system. And that comes with its own bit of unease, especially on the operational side, because I used to have one thing that I knew how to deploy. Now I have many different things, and I have to make sure I'm aligned with the dev teams on how these different things should be deployed.
There are now many different ways people configure their services and set things up. The containerization evolution, or revolution if you will, definitely helped with standardization a little bit, but different teams still have different ways of doing their setup and their configuration. That creates a bit of tension between the developers and the operators, because at the end of the day, the reason we moved to microservices, the whole cloud native promise, is agility. We want to move fast. We want to be able to iterate on a lot of innovative ideas as a business. That's why you decentralize your decision-making and let your development teams be creative and innovative, and move without having to synchronize everything in a single monolith. Even organizationally, you become agile and you become distributed. That's what the development team wants to push forward with.
On the operational side, you still care about stability and the operational excellence of your infrastructure and your production environment. So there comes this tension, where engineers want to move fast but operators and the operations team want to make sure things follow certain rules and standards, so it's not a zoo. Everyone needs to be a good citizen of this distributed infrastructure. We all share those resources. One application running away with memory or CPU and impacting somebody else is not a good thing. And it's obviously not a good thing to have your devs and ops not on the same page. That's the opposite of the DevOps culture you want to embed. You want people working together and synergizing their efforts to solve the business problems, as opposed to fighting.
You don't want your operations team to be the bottleneck, trying to review every change that goes out to make sure everybody's following whatever the standard is. So where do these standards or policies come from? Well, they could be best practices established by your team. They could be tribal knowledge you've built over time, maybe a naming convention for your services or what have you. They could be security-related configuration and best practices you want to follow. Or, depending on what industry you are in, they could be compliance: regulatory and legal things you must check the boxes for. Again, it's not a problem that only a few people face. Probably everybody going through this journey struggles with it. Now, this is just one of those challenges. I'm not trying to say this is the only challenge people moving to the cloud native paradigm are facing. But it is certainly an interesting one that tends to slow things down and rear its head a bit as people are doing the migration, and people sometimes hit it a little late in the game if they haven't thought about it, because they're trying to move fast. You adopt microservices, you know you need all the technical things, and it looks like utopia at the end of the tunnel, but you end up facing some of these really practical challenges that affect your culture and could really impede your progress.
So what are we going to cover today? What is the objective? We'll talk a little bit about what this governance is, which I alluded to: how do you come up with a way for all of us to be good citizens of this new distributed world that we're going to live in?
So we'll talk about that, how to establish and how to think a little bit about that governance framework. Then we'll look at simple but descriptive examples of Open Policy Agent, what it does, and the Rego language. We'll also address Gatekeeper a little bit, which is something that has been in the making for some time now but is becoming more productionized than before. And we'll even look at some examples of policies in Kubernetes that you can use.
Now, instead of making this very abstract, I'll go through a true story that happened to us at Magalix and use it as motivation to walk you through the workflow, the process, and how we ended up doing what we do, and hopefully you'll learn from that as well. One day, one of our SREs posted a message in Slack. There was a workload running in our dev environment and he didn't know what it was. He was making some changes and needed to change the configuration of this thing, and he wasn't sure who owned it. He needed to talk to them to ask, is my change going to affect you? And it took a while for people to reply. Now, could this have been something malicious? You don't want to wait too long to know what's going on. Should we shut it down? Who should we contact? Maybe it is something critical. This was dev, but assume it was production. As an SRE, you don't want to shut down something that may be part of some big feature in progress, or something you just happen not to be aware of. So that delay causes problems, and you really want to get an answer to something like this very quickly.
We thought about it a little bit, and eventually we found out who it was: somebody prototyping something as part of a new feature we were working on. So we figured that out. But as part of any operational excellence framework, you want to learn, you want to be able to prevent issues, and you reflect on these sorts of problems. This is just a simple one, but you can think of it as a security violation somebody found somewhere that I need somebody to fix very quickly. Should we then block all developers and force something like "everybody must put an owner name on every application deployed on our Kubernetes infrastructure"? How are we going to do this? Are we going to force a PR and have the SRE manually review all of it? There has got to be a better way, and usually the better way has something to do with automating the process.
So let's call this the owner label problem. We thought, what if we have a best practice or a standard in our organization that says every workload you deploy must have a label with the owner name, or maybe the email of the team that owns that service? Then at least we know who to contact or who to talk to. You can extend this. Maybe you should also have a link to the GitHub repository where the service code lives. Maybe you can have a link to a wiki page in your internal organization that describes a bit more about the service, for somebody who wants to learn about it. Whatever it may be, you can extend this simple idea of the owner label. But at the end of the day, how do you enforce this?
How do you make sure everybody's being a good citizen without blocking the productivity of engineers who want to move fast, while still providing this level of checks and balances to ensure common standards across the organization and make your life easier down the line in production? So this is really a governance problem, in a way. The way to think about it is the idea of policy as code. We can write a document. We can put this in the onboarding. We can have emails going out about the new policy. But that's probably not a scalable, sustainable solution. Just like infrastructure as code, you want to have policy as code.
So what is governance, at least the way we define it here? It's the ability of the operations team to verify and enforce certain policies and standards across the entire organization, or maybe a subset, a certain cluster, a set of clusters, or a set of workloads that meet certain criteria. Whatever it may be, your ability to enforce certain things, and to do that automatically and productively, is what I'm referring to as the governance framework.
There are usually three things to think about when establishing your policies. First, what is your target? In this particular example, the target would be workloads, any workloads. So not ingresses, not services, not volumes, just any object that is a workload, meaning controllers. You can make it more specific. Maybe you care about StatefulSets more than certain other things, or maybe about Kubernetes objects that have a certain annotation. Whatever it may be, for any given policy you want to define the target, and then you want to define the actual policy.
The policy is a set of rules you want to enforce. In our particular example, anything that is missing an owner label is a violation of our policy. So I want to check, I want to ensure, that every workload controller being deployed to my Kubernetes cluster has an owner label. The last thing you want to think about is the trigger. How do you do this check? Is it something you do once a day or once a week, on a schedule? Is it something you do at deployment time, with something like admission control in Kubernetes, for those of you familiar with it? In that case you prevent any deployment of a workload that violates the policy. Is that when you want to enforce the policy, or could you do it even earlier? Think about shifting left, to your build time, your CI phase, or even your git commit phase. Can you enforce it at that level? The further left you move, the better, because your developers and engineers get the feedback early and hopefully solve the problem before they're in the middle of a deployment, things are failing, and they have to go modify some code or update some files and go through CI/CD again.
So those are the three elements you want to think about when you're thinking about a policy. In our case, I want to enforce this on all workloads in our Kubernetes clusters for our dev and prod environments, which are two clusters. The policy is that every workload spec has a label called owner with some text inside. And for the trigger, we were fine with every 24 hours: just once a day I want to get a report about it, and then I'll go harass the people. Maybe later we can make it a bit more dynamic and enforce it at build time, which would be awesome: fail the build if something violates this policy.
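The three elements above (target, policy, trigger) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how OPA or any other tool models it: the `Policy` class, the `daily_report` function, and the sample objects are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model of the three governance elements; not an OPA or Kubernetes API.
WORKLOAD_KINDS = {"Deployment", "StatefulSet", "DaemonSet"}


@dataclass
class Policy:
    name: str
    target_kinds: set                  # target: which object kinds the policy applies to
    violates: Callable[[dict], bool]   # policy: returns True when the object breaks the rule
    trigger: str                       # trigger: e.g. "schedule", "admission", "build"


def missing_owner(obj: dict) -> bool:
    """True when the object has no non-empty `owner` label."""
    labels = obj.get("metadata", {}).get("labels") or {}
    return not labels.get("owner")


owner_policy = Policy(
    name="owner-label",
    target_kinds=WORKLOAD_KINDS,
    violates=missing_owner,
    trigger="schedule",  # the once-a-day report described in the talk
)


def daily_report(policy: Policy, objects: list) -> list:
    """Names of targeted objects that violate the policy."""
    return [
        o["metadata"]["name"]
        for o in objects
        if o.get("kind") in policy.target_kinds and policy.violates(o)
    ]


# Made-up cluster contents for illustration.
cluster = [
    {"kind": "Deployment", "metadata": {"name": "api", "labels": {"owner": "team-a"}}},
    {"kind": "Deployment", "metadata": {"name": "prototype", "labels": {"app": "demo"}}},
    {"kind": "Service", "metadata": {"name": "api-svc"}},  # not a workload, so not targeted
]

print(daily_report(owner_policy, cluster))  # ['prototype']
```

Keeping the trigger as data separate from the rule is the point of the sketch: the same `missing_owner` check could later be run at build time or at admission time without being rewritten.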
All right, so let's talk about Open Policy Agent, which is probably the first thing that should come to mind when you're thinking about enforcing policies in an automated manner. Open Policy Agent is a CNCF project, and I think they just filed for graduation. It's a great tool that allows your organization to define these custom policies and run them. OPA itself is just a policy execution engine, and it uses a language called Rego, which is the policy language. We'll talk about that in a bit.
What it comes down to is this. Take the case where you're deploying things: you make a request to Kubernetes, trying to deploy something or update something. In an admission control case, let's say, OPA can intercept that change, evaluate the policies associated with it, and then say accept or deny. That is roughly the paradigm. I have a policy and I have an object that is being changed, and OPA takes these two and tells you yes or no, deny or allow. Does this object check the boxes for this policy, or these policies, if there's a list of them? Does it violate any of them? If not, you're good to go. If it does violate them, that's a deny. That's the essence of it at a high level. How you deploy it, and its architecture, is something we're going to talk about in a bit, because there are different ways and that depends on the triggers. If you think about it, OPA gives you a policy execution engine. That by itself doesn't decide the other things, like the triggers or the policy.
The policy you define with Rego, the Rego language. Where to enforce it is another thing we're going to talk about, and the trigger, when to enforce it, is the other one. So let's start with just the policy itself. How do you describe the policy using OPA? Well, there is the Rego language, which is a declarative language, and I made a very simple policy here. Again, it just checks that the label exists. It doesn't even check what the value is. You can make it more complex. I won't be going into the details of how Rego works; there are a lot of resources online to help with that. I just want to give you the flavor and put it all into one end-to-end framework so you can see how it all works. You could dig deeper into each piece of this.
So this is the simplest one. Rego is basically statements, like assertion statements, that evaluate to true or false. Even an assignment statement is evaluated, but it evaluates to true by default. Everything else is almost like a conditional statement of a sort, and it evaluates to true or false. If everything is true, that's a pass. If something is false anywhere, the whole evaluation of the policy becomes false. If you look at it here, there is input.metadata.labels. Now, input is the object given to you by the Rego language that maps to the object being passed in. OPA handles that and assigns the incoming object to this input variable. And this is the spec of a workload, a controller object in Kubernetes, so it has .metadata and .labels. Then you look for an owner label inside that YAML or JSON structure, and if it's not there, that will evaluate to a denial. Let me actually show you. If you haven't seen it, there's something called the Rego Playground, and it's available online.
I think the URL is regoplayground.org. I should have put the link here, but it's easy to find. As you can see, the policy is on the left, and on the right I put an example. Again, this is not the full spec of an actual object, just a part of it. You can see this one doesn't have an owner label; it has another label. So deny equals true, because this statement ended up evaluating to true. When you click Evaluate, you'll see the output here: deny is true. This is another example where there is an owner, and now nothing comes back. So what you see here is how OPA would evaluate these policies. And honestly, deny versus allow, how you want to structure it, is up to you, because you will be the one interpreting the result. We'll talk about deployment so you can see how you would deploy this and how you would interpret the result.
There are really three ways to deploy OPA. One is to use it as a Go library: just use the Go module and write your code, like this example here. There is rego.New, you create a query, and you can parse and evaluate, assuming the Rego is in a file called example.rego. But OPA also comes as a container, so you can deploy it as a container inside a pod. Your application can then call it, pass it the incoming object, and do something with the result. And you can also run OPA as a full pod by itself, like in the third example here. It has its own URL, you can create a service for it, and you maintain it as OPA as a service. What we decided in our case was to go with the Go library, because we really wanted full control over creating our own service, getting the object, and interpreting that result, the deny.
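Since the slides are not part of this transcript, here is a plain-Python imitation of the two playground examples described above. It is not Rego and does not call OPA; it only mimics the behavior the talk describes, where the `deny` rule fires when `input.metadata.labels` has no `owner` key and nothing comes back otherwise. The function names and the sample inputs are assumptions made for this sketch.

```python
def deny(input_doc: dict) -> bool:
    """Imitation of the talk's rule: fires when the owner label is absent.

    As in a Rego rule body, every statement has to hold for the rule to fire.
    """
    labels = input_doc.get("metadata", {}).get("labels", {})  # assignment, always succeeds
    return "owner" not in labels                               # the one real condition


def evaluate(input_doc: dict) -> dict:
    """Return {'deny': True} when the rule fires, and an empty result otherwise."""
    return {"deny": True} if deny(input_doc) else {}


def is_allowed(result: dict) -> bool:
    """Interpreting the result is up to the caller; here, no deny means allow."""
    return "deny" not in result


# Made-up inputs in the spirit of the two playground examples.
without_owner = {"metadata": {"labels": {"app": "web"}}}
with_owner = {"metadata": {"labels": {"app": "web", "owner": "team-a"}}}

print(evaluate(without_owner))  # {'deny': True}
print(evaluate(with_owner))     # {}
```

The empty result for the second input corresponds to "nothing coming back" in the playground, and `is_allowed` is one possible way a calling service might interpret it.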
We also wanted a standard within our organization for how we write those policies, those assertion rules. The word deny is just a rule name, and you can create whatever you want. And, like we said, it really works on any object. If it's JSON, OPA can work on it. It doesn't have to be a Kubernetes object like in our particular example. So it definitely extends beyond the one example I'm showing you here, and there are a lot of options. In our case, because we wanted a standard within our organization for doing this, and wanted to help our customers with it as well, we went with writing our own service and using the OPA Go library. But again, at the end of the day, you just need the policy and you need the object. OPA is just a policy execution engine; there is no magic to it. How you write the policies and how you interpret the result is up to you. That is not part of OPA itself.
So how do you manage your policies? These are some of the things that are missing with vanilla OPA. Where do you put those policies? How do you manage them? When do you run them? Somebody has to call this. Even if you run OPA as a service and don't write your own, somebody has to call that service. And that gets us to another way to trigger this, which is Gatekeeper. Gatekeeper is an extensible, parameterized policy library. I think it is now at v3 and still in beta, but there are a lot of contributions from a lot of people and I think it's picking up good steam. It tries to address some of those shortcomings, or basically adds to OPA, so you get the full framework of policy as code. Because, as you saw, you can write a policy, but where do you manage it? Where do you keep it? The other thing is, when do you trigger it?
Who's going to do the triggering? In our case, we ended up writing a service and decided to use events: when something changes, I call the service to validate and verify the policies. Does the change somebody made violate any of our policies or not? Gatekeeper really tries to address this. Let's see if the next slide is the example, but first let's look at the architecture. It uses OPA; think of it as a layer on top of OPA. It runs like an agent in your cluster and registers itself as an admission controller.
And what about the policies? Well, the policies are custom resource definitions. So there are resource definitions now, not just the Rego. You create a policy, which is its own YAML object, a CRD. You persist it using the kubectl command, just like you create and apply any object in your Kubernetes cluster. That's how you manage those policies. Gatekeeper then registers itself as a webhook with admission control, so for any change that happens in your cluster, you get an admission request, or an admission review request. This is the triggering mechanism we were thinking about in the governance framework. We'll talk about how you do the targeting with this, but basically Gatekeeper gets any change from the API server through the webhook, runs the relevant policies, and gives you an admission review response, deny or allow.
So what do these CRDs look like? Well, you create a template, which is the example on the left, what's called a constraint template. All you see there is really the Rego, the missing labels or what have you.
You can also define variables, and that's a very powerful aspect, because that's what makes this a template: you write it once and then you can vary it. I call it the owner label, but maybe you want to call it something else. Maybe for services, applications, or workloads that my team owns I want to call it owner, but for third-party workloads, like databases we get from a third party, we want to use a different label. It's really the same policy, the same template. Then you have what is called a constraint, which is the example here on the right, where you plug in the exact values you want for the variables, and this is also where you define your targeting. If you look at the constraint on the right, in the parameters the label is owner, and for this particular case I have it targeting kind Deployment. So Deployment objects are where I'm going to run this policy.
So Gatekeeper is providing the targeting, with what you see here, and also a decent way to manage the policies we talked about. It's not just creating the policies using Rego, but also a way to manage them. And because you deploy it as an agent and it registers itself, the triggering is that any time something changes, you get a trigger. Gatekeeper does the magic for you: it takes whatever object is being changed, finds the policies that match that object, runs them, and gives you the result, the deny or allow admission response.
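The split between a template (written once, with variables) and a constraint (concrete values plus targeting) can be sketched in plain Python. This is only an analogy for the idea described above, not Gatekeeper's CRD format or API. All names here, including `required_labels_template`, the constraint names, and the `vendor` label, are hypothetical.

```python
def required_labels_template(parameters: dict):
    """Hypothetical 'template': builds a check from parameters, written once and reused."""
    required = set(parameters["labels"])

    def check(obj: dict) -> list:
        present = set((obj.get("metadata", {}).get("labels") or {}).keys())
        return [f"missing label: {name}" for name in sorted(required - present)]

    return check


# Hypothetical 'constraints': each one plugs concrete values and a target kind into the template.
constraints = [
    {"name": "must-have-owner", "kinds": {"Deployment"}, "parameters": {"labels": ["owner"]}},
    {"name": "db-must-have-vendor", "kinds": {"StatefulSet"}, "parameters": {"labels": ["vendor"]}},
]


def review(obj: dict) -> list:
    """Run every constraint whose kinds match the object and collect violation messages."""
    violations = []
    for c in constraints:
        if obj.get("kind") in c["kinds"]:
            check = required_labels_template(c["parameters"])
            violations += [f'{c["name"]}: {msg}' for msg in check(obj)]
    return violations


deployment = {"kind": "Deployment", "metadata": {"name": "web", "labels": {"app": "web"}}}
database = {"kind": "StatefulSet", "metadata": {"name": "db", "labels": {"vendor": "acme"}}}

print(review(deployment))  # ['must-have-owner: missing label: owner']
print(review(database))    # []
```

The same template serves both constraints; only the parameters and the matched kinds differ, which is the reuse the talk attributes to constraint templates.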
You can try it, like this example here, where we have something that was missing the owner label. When you try to do a bad deployment, you get this error from the server, denied by "must have owner", and it tells you about the admission webhook, Gatekeeper, and so on. I'm not sure how we're doing on time.
I just talked about one example, the owner label, but obviously there are other things you can create policies for. You can check for readiness probes and liveness probes; that's just good practice. Every workload you have, every container you have, should really have liveness and readiness probes. You can also enforce basic security hygiene, like rules on allowing privilege escalation, or must run as non-root. You can enforce things like every service having at least two replicas, just for fault tolerance: if one fails, or one node has an issue, you have another replica. Pod anti-affinity is another one, making sure the multiple replicas of your workload are not deployed to the same node. Otherwise, if that node goes down, your whole service goes down even though you have two replicas, because they both ended up on the same node. That's not good. Role bindings, checking container images to make sure you're only getting images from trusted repositories, and so on and so forth. These are just very simple, basic examples, and I'm sure you can come up with more that make sense for your own organization.
Continuing a bit with Gatekeeper: once you run it, when you do a describe on your Kubernetes constraint object, you will actually see the status.
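A few of the example checks listed above (replica count, probes, security settings, trusted image sources) could be sketched as follows. This is a plain-Python illustration over a Deployment-shaped dictionary, not a Rego policy. The field names follow the Kubernetes Deployment spec, while `TRUSTED_REGISTRY`, `check_deployment`, and the sample object are assumptions made for this sketch.

```python
TRUSTED_REGISTRY = "registry.example.com/"  # hypothetical trusted registry prefix


def check_deployment(dep: dict) -> list:
    """Return human-readable problems found in a Deployment-shaped dict."""
    problems = []
    spec = dep.get("spec", {})

    # At least two replicas for fault tolerance (Kubernetes defaults to 1 when unset).
    if spec.get("replicas", 1) < 2:
        problems.append("fewer than 2 replicas")

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for c in containers:
        name = c.get("name", "?")
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in c:
                problems.append(f"{name}: no {probe}")
        sec = c.get("securityContext", {})
        # This sketch treats an unset field as the unsafe value.
        if sec.get("allowPrivilegeEscalation", True):
            problems.append(f"{name}: privilege escalation allowed")
        if not sec.get("runAsNonRoot", False):
            problems.append(f"{name}: may run as root")
        if not c.get("image", "").startswith(TRUSTED_REGISTRY):
            problems.append(f"{name}: image not from trusted registry")
    return problems


# Made-up object that breaks every rule above.
bad = {
    "spec": {
        "replicas": 1,
        "template": {"spec": {"containers": [{"name": "web", "image": "docker.io/nginx"}]}},
    }
}

for problem in check_deployment(bad):
    print(problem)
```

Pod anti-affinity and role bindings are left out to keep the sketch short; they would follow the same pattern of reading a field and appending a message.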
That gives you an audit and a status. Especially when you create the constraint for the first time, it does a quick run for you and shows you all the things that were denied or would have been denied, basically the objects you already have that are in violation of your policy. So, to wrap it up, let's just finish with this. There are also metrics. I think Gatekeeper really takes OPA and the concepts of Rego, these basic building blocks, and tries to put them together into a tool that helps you implement the kind of governance framework we were talking about. The idea is that you need a way to target the changes to the objects you want to enforce your policies against. Then you want to define your policies, which also means you need a way to manage them and put them somewhere. And then you need a way to trigger those policies, to trigger those checks. Gatekeeper does that with admission control. If you need to do other things, again, OPA is open source and has a nice Go library, so you can just use the code and write your own service, which is what we've done. We also looked into Gatekeeper. It's really still in the development phase, and I know some people may go ahead and adopt it, but I think the constraint template and constraint framework that was created definitely moves us forward on that path.
So now you can see the end-to-end story. You discover some issue in your environment, where we needed everybody to be a good citizen of this distributed environment, in this case just having an owner label, or whatever other operational excellence best practices you might have. You take that and you codify it.
So you use Rego to make your policy as code, as opposed to an email that somebody would send, or something you put in an onboarding handbook that probably nobody will follow. Now you are enforcing your policy as code, which gives you an automated ability to ensure good hygiene, good operational excellence, and good stability of your production environment, without putting a lot of strain on the developers, slowing them down, and impacting the agility of the whole process. And then both the devs and the ops can focus more on the interesting problems of the business and on the value of the things you want to do for your business, as opposed to fighting over enforcement and who missed what, with the ops team doing code reviews or PRs, or chasing developers to make sure everybody's following the policy. With that I conclude the webinar, and I'm happy to answer questions. It was just a brief, high-level overview of Rego, OPA, and Gatekeeper, and how you can put them together, with a specific problem to motivate the discussion. So with that, thank you, and I'll turn it back to Jerry.
Thank you, Ahmed, for a wonderful presentation. We have plenty of time for questions, so if you have anything you would like to ask, please feel free to drop your questions into the Q&A box and we will get to as many as we can. Do we have anyone at all?
Many people probably haven't slept since last night. Everything, I guess, was clear this morning.
Okay, so we have a question here. Could you give a bit more background information on Gatekeeper?
Yeah, so Gatekeeper. It is also a CNCF project. It is built on top of OPA; it uses OPA. It is open source, you can definitely find it, and they have nice tutorials.
What it is, at the end of the day, is an agent you deploy that registers itself as an admission controller. If you're not familiar with admission control, the API server allows you to register a controller with it and register a webhook. When something changes, it sends a request to the webhook, and you respond to it. This is called an admission request, and it passes you the object that is being changed, the old and the new object, and you get to decide whether to allow the change or not. So how do you evaluate it? Well, you have the object coming in, and you have policies defined. They are defined as CRDs, the constraint template and the constraint, like the example we showed. Gatekeeper automatically reads those, watches them, and evaluates those policies against the change. The admission response is basically either allow or deny, true or false in a way. If it is false, it fails the deployment, so whoever is running the kubectl command, or what have you, will see a failure like the example we showed here. You will get an error like this.
So say I'm doing this deployment, I have this YAML spec, and it doesn't have the label. When you run the kubectl command, everything goes to the API server. The API server notices that somebody is trying to change some object, and it knows that Gatekeeper is registered as an admission controller. So it sends Gatekeeper an admission request, basically asking, should I allow this change to go through or not? And because I have a constraint template and a constraint with this owner label policy, Gatekeeper, using OPA, evaluates that policy against the change and denies it, because it doesn't have the owner label. At a high level, that's really what's happening.
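The request and response exchange described above can be sketched as a small Python function that takes an admission review request and returns an allow or deny response. This is not Gatekeeper's implementation. The field names (`request.uid`, `request.object`, `response.allowed`, `response.status.message`) follow the Kubernetes `admission.k8s.io/v1` AdmissionReview shape as I understand it, and the function names and sample request are hypothetical.

```python
def owner_violations(obj: dict) -> list:
    """Hypothetical policy check: one message when the owner label is missing."""
    labels = obj.get("metadata", {}).get("labels") or {}
    return [] if "owner" in labels else ["must have owner label"]


def admission_response(review: dict, violations_for=owner_violations) -> dict:
    """Build an AdmissionReview-shaped response for an incoming admission request."""
    request = review["request"]
    violations = violations_for(request["object"])  # the object being created or changed
    response = {"uid": request["uid"], "allowed": not violations}
    if violations:
        # The message is what the person running kubectl would see in the error.
        response["status"] = {"message": "; ".join(violations)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


# Made-up request for a Deployment without an owner label.
incoming = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "example-uid-123",
        "object": {"kind": "Deployment", "metadata": {"name": "web", "labels": {"app": "web"}}},
    },
}

print(admission_response(incoming)["response"])
```

A real webhook would also need TLS, an HTTP server, and registration with the API server, all of which are omitted here.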
That's what Gatekeeper is. It is currently v3 and, if I remember correctly, in beta. There are a lot of people contributing to that project, so we'll see how it matures and when it becomes production ready. I know some people probably already have it in production, and there is some really good, active development on it. So I encourage people to try it out.
Sorry about that, I was on mute. Do we have any other questions at all? If no one has any other questions, I think we will wrap it up. I want to thank Ahmed again for a wonderful presentation and for taking time out of his day to join us. And I want to thank everyone for joining us today as well. As I said before, today's recording and slides will be on the CNCF webinar page at cncf.io/webinars. That just about does it for everything. Thank you all again for attending. Everyone take care, stay safe, and we will see you at the next webinar. Take care everyone, thank you.