It's okay, it's been a long day. The reshuffling of the schedule has caused some challenges for all of us, so I wanted to apologize first for beginning this talk so late in the day. I'll do my best to keep it interesting and snappy, and maybe we can have an interesting conversation about policy, which sounds like a contradiction in terms, but it really isn't.

Let me start by introducing myself. My name is Craig Peters. I'm a program manager at Microsoft working on Azure. At Azure my responsibility is container infrastructure, so basically anything in Microsoft Azure that runs containers. My team is responsible for making sure that all of the upstream open source dependencies are maintained correctly, and we contribute everything upstream. As part of that we develop new projects that enable new capabilities in container tooling, tooling that runs on Azure or anywhere else. So we're going to talk a little bit about how we came to create a new project called Gatekeeper, and that's the subject of the talk today.

Unfortunately I'm up here by myself. You don't see Torin standing next to me because, due to the change in schedule, Torin had a conflict at this hour that he wasn't able to get out of, so he sends his apologies for not being here. I'm also feeling the pain of this, because what you'll find in the discussion today is that I understand policy fairly well, and I have a fair amount of experience with it; I am not, however, an expert in Open Policy Agent, on which this Gatekeeper project depends, so we can only go so far in the discussion. If questions come out of this discussion that need to go deeper into Open Policy Agent, Torin is here at the conference and we can find him tomorrow to dig into those. I'll also share other ways we can collaborate online and through the community to get questions answered. So without further ado, I'm not used to these clickers, let's try that, no problem, okay. I already answered who
I am. I want to know a little bit about who you guys are, so a show of hands: who here is a developer building applications that run on container infrastructure, Kubernetes or elsewhere? I do some of that. Anybody else? Okay, that's great, about two-thirds of the audience. Who here operates Kubernetes clusters for those developers as an infrastructure provider? Me too. Okay, some of the same hands but also some different ones; that's not too much of a surprise. Who here is responsible for security audits in their organization? Anybody? Yes, a couple, that's fantastic. I am not, but I've worked with many people like you, so I think we're going to be able to have a really good conversation here.

Some of the motivation for the work we've done is reflected in this URL, which is actually kind of interesting if you go and take a look. I'm not going to dig too deeply into it, but it's basically documentation of a number of horror stories about what has gone wrong for people running Kubernetes clusters in production, in big environments or small. Basically, any time you're trying to do anything real in a container-orchestrated environment, you can always shoot yourself in the foot, and the kinds of tools we're going to talk about are intended to reduce the risk of those kinds of problems.

So let's paint the scenario for a minute. When we all got started building our clusters, we very carefully planned out how people were going to use them, right? We generated all of our RBAC, we configured it so that everything would work, so that only the right people have access to the right namespaces. Essentially, you think you've got everything planned out: you've got runbooks, you've got ways to handle errors, backup and recovery, all that stuff is ready to go. Let's ship it, let's open it up. All of a sudden the developers come and they start doing stuff, right? And very
quickly, I suspect, some of you in this audience may recognize some of these questions. Has anybody here ever looked at their cluster and said, where did that namespace come from? Has that ever happened to anybody else? Yeah. And then you look in the namespace and there's a whole bunch of pods there; what are they? And then you look in the pods and you see what containers are running there. Oh my god, what is that container, and where did it come from? Seriously, who decided to pull from that particular repo? How did we manage to get here? And then, what do I do next? What happens if I delete this? Who's going to care? Is a production system going to go down? What's the impact of that?

So despite your best plans, you have to make things available to developers; they have to have some freedom to move. You can't lock everything down completely from an RBAC perspective, or you're going to end up back in the old world where developers couldn't do the things they needed to do. So the question is, how can we address these problems?

Okay, I'm going to skip over this slide; the previous talk really covered this cycle. Essentially, when you've got a dynamic environment, you've got all kinds of things happening at the same time. Lots of teams have conflicting goals. You're trying to share resources so that you're not paying too high an administrative cost or too high an infrastructure cost. And then you've got complexity where you're doing similar, but maybe not exactly the same, things across multiple infrastructures, right? Maybe different clouds, on-prem, a public cloud, and things start getting pretty crazy. So how do you limit the use of unsafe images? How do you keep track of who created what resources, and understand the purpose of those resources and who depends on them? How do you keep users
from running into each other, beyond namespaces? Namespaces aren't quite enough there. How do you make sure you've got the right tooling in place to create observability into what's going on in there? And how do you manage the cost? Those are all the kinds of questions that, in my experience, everybody runs into on day two of running their cluster. Day one is fine, day zero is good; we've got that pretty well solved, at least cloud providers do. But what happens next?

So lots of things happen. We all solve problems on a day-to-day basis using the common tools that we use every day, right? The first thing to do is to write things down: when you do something, make sure you document what you've done. You create wikis, you put things in spreadsheets; there's a million different ways to try to do that. The challenge is getting people to do it. Nobody wants to write it down. I created my YAML, so yes, sure. Oh, is it too hard to hear? Okay. Sorry; the question, then, is how do you make it easy for people to document things without having to go through a separate manual step to write them down?

And then, how do you deal with the limitations of RBAC? I only have a limited vocabulary of verbs in RBAC. You can extend that, but it's a slow process of going through the community to define what we all mean by role-based access control. We've got a fairly solid definition of what that is today, but extending it is slow and cumbersome, and RBAC is probably the wrong mechanism for controlling some of these other things anyway. So we can maybe agree that neither of these approaches completely solves the problem; you can cover some pieces, but you're always going to have holes. So Kubernetes has additional capabilities built into
it, on top of authentication. Once you've authenticated somebody, knowing who they are and whether they're allowed to access something is the domain of RBAC. Then you've got admission control. Within admission control, the first thing that happens is resource limits: for that person, do we have the resources, are they allowed to create new things? Great. And then we've got this thing called the webhook. The webhook has access to all of the metadata about the object that somebody is trying to create, and it can implement rules using an external controller. That controller allows you to write very powerful rules to define what can happen in that cluster, for a given object, for a given operation: I'm trying to change an object in Kubernetes; is that allowed? That's a very powerful concept, that I can do that late in the process. And the key thing here is that all of this happens before anything gets reflected in etcd, in the state of the objects in the cluster. That's an important principle that I love that we as a community built into Kubernetes.

If we use this mechanism, it gives us access to all of that metadata about the objects, and then I can build all kinds of policies into these admission controllers, because I have access to all that data. I can block privileged containers, or say that in certain contexts certain people can create privileged containers and in other contexts they cannot. I can block the use of certain image registries in certain contexts, I can ensure that egress rules are only used in certain places, and so forth. There's essentially an arbitrary set of combinations of rules that become possible to implement through this admission controller mechanism. The challenge is that you have to build admission controllers in order to do that.
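As an aside, here is a minimal sketch of how such a validating admission webhook gets registered with the API server. The service name, namespace, and path are hypothetical placeholders for illustration, not Gatekeeper's actual configuration:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy-webhook        # hypothetical name
webhooks:
  - name: validate.policy.example.com # hypothetical
    clientConfig:
      service:
        name: policy-controller       # hypothetical controller Service
        namespace: policy-system
        path: /validate
    rules:                            # which operations/resources to review
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
    failurePolicy: Ignore             # fail open if the controller is down
    sideEffects: None
    admissionReviewVersions: ["v1"]
```

Every request matching those rules is sent to the controller as an AdmissionReview before anything is persisted to etcd, which is exactly the hook a tool like Gatekeeper plugs into.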
It turns out that this is very hard. You're essentially writing a new controller in Go, and for people who love to do that, that's awesome, but the challenge is that it's not very portable. You end up creating controllers that are purpose-built to solve the problem you have today, and the policies actually change over time. You don't want to have to go to a developer to say, I want to change the resource quota for X to Y, or I want to allow additional namespaces to have additional quota. Those kinds of things don't make sense to encode in your code; they need to be flexible and parameterized for different environments, and they need to take advantage of external data. You could write a controller that goes and queries some external system, like an accounting system or an audit system, to make a decision. That's great, but then you also have to think about other use cases. What if I want to validate all of my policies as I change them, in a CI system? Then I need a dry-run mechanism, and all those kinds of things. All of a sudden it starts looking like a very custom set of code that none of us wants to maintain or develop by ourselves.

So I want to look now at a solution, and that solution is Gatekeeper. We're going to do that in the form of a recorded demo, because nobody wants to watch me type, very slowly, while I get that up. So what is Gatekeeper? Gatekeeper is an open source project which you can find on GitHub, and the repo has very straightforward installation instructions. You'll find that it's implemented as a set of Kubernetes controllers and resource definitions, and you can simply use the script to deploy those. What you see is that we created a set of objects in the namespace gatekeeper-system; it's going too fast. We essentially applied a set of resource definitions; those resource
definitions created controllers in that namespace, and now we're going to walk through what it looks like to use Gatekeeper. Let's look at what CRDs got created. It created two, one for the configs and one for the templates; we'll look at the importance of that in a few minutes. This is a ValidatingWebhookConfiguration, the standard way of configuring the webhook, and if we take a look at it you'll see that it's an admission webhook implemented in the gatekeeper-system namespace, and it applies against the set of resources coming in through the Kubernetes API.

In this demo we've got a sample bank, a web-native bank. They've implemented Kubernetes and opened it up to the world. So my developer can now go and create some system: they created a namespace, they created a bunch of objects in that namespace, and then their project moved; they moved on to another project. Then we, the administrators, found this namespace and said, well, who created that? Let's take a look at it. Some of us raised our hands earlier: we've all found namespaces we don't recognize, we have to go talk to a bunch of people, it can take quite a long time, and eventually we find that somebody created this and then moved on, and nothing depends on it in any way. So how do we say: never again? We're not going to allow that to happen anymore.

There's a set of templates in this demo that we walk through, and the important one we're going to look at first is requiring labels. That's a template for rules that require labels, and then there are constraints for the labels, and we apply all of those constraints. Let's take a look at one of the constraints. Here it is: in the constraint we're going to say that all new namespaces must have an owner. It's a constraint of the "required labels" kind.
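For a sense of what such a constraint looks like, here is a sketch in the style of Gatekeeper's library. The kind, label key, message, and regex are illustrative assumptions, since the exact schema is defined by the template rather than shown verbatim in the talk:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels          # the kind is defined by the constraint template
metadata:
  name: all-namespaces-must-have-owner
spec:
  match:                         # which objects this constraint applies to
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    message: "All namespaces must have an `owner` label"
    labels:
      - key: owner
        # illustrative regex: owner must look like user.agilebank.demo
        allowedRegex: "^[a-zA-Z]+.agilebank.demo$"
```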
It says all namespaces must have an owner, and the owner must fit a standard pattern, in this case implemented as a regex. Once we apply that constraint to the system, the next time the developer comes back and says, okay, I'm just going to create an arbitrary namespace, they get an error back: you can't just create a namespace, you also have to label it. The awesome thing about this is that I can simply provide a user-friendly message so they understand what they've actually done wrong. Here is a properly formed namespace with an owner label that matches that regex, so now I can create that namespace.

Now what I'm going to show is that we try to create another set of resources that have no limits, or whose limits are too high, which violates another policy, and we get another message that's very clear about which policy I've violated. In this case I'm pulling containers from the wrong image repository, and it tells me: sorry, you're not allowed to do that. Here is another example; this one is about duplicate services. It's a policy that says if you're going to create a new resource in a namespace, it can't have the same name as another resource in that same namespace. We tried to create a duplicate, and we got back a very meaningful error.

So eventually the developer figures out, through the implementation of those policies, how to get their application up and running, and everything's well, until something goes down. Let me go back for a second; I skipped something important. They finally get their service up and running, everything's going great, and then: why did their system go down? In this case we do a root cause analysis. A common practice for all of us developers is that we're lazy, but we also know that we want to be taking the latest patches and making
sure that we're using the right, latest image, so we often resort to using the latest image tag. Has anybody else ever done that? I use latest way too much. Well, it turns out that latest often has a bunch of stuff we don't yet support, or can cause unforeseen problems, and when you do a root cause analysis of a big outage, that's often a cause. So let's say that maybe what we want to do is disallow the use of the latest image tag.

Now let's look a little bit under the hood. There are two pieces to implementing a policy with Gatekeeper. One piece is the template, which is something an administrator would create. This is a template called "Kubernetes banned image tags", and what it does is implement a deny rule. The deny rule is written in Open Policy Agent syntax, and Open Policy Agent actually does the work of enforcing the policies for Gatekeeper. It says: I look at these attributes of the input, I find the spec, I pull out the image tag, I take a parameter and compare that parameter to the image tag, and if they match, then this is a banned tag. You'll notice the template itself does not specify that latest is the banned tag.

Next we're going to look at the constraint. This is the actual object that supplies the variables, the parameters applied to that template. What the constraint says, up here, is that the tag I'm trying to match is the latest tag. That's the tag I want to ban. It's easier to see in the constraint, and the constraint is actually a very simple object. And that's essentially the end of the demo. So let's take a look under the hood at how that worked.
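The template side of that might look roughly like the following. This is a sketch modeled on Gatekeeper's policy library rather than the exact demo source; note that recent Gatekeeper versions express the rule as a `violation` rather than the `deny` described in the talk, and the names here are illustrative:

```yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sbannedimagetags
spec:
  crd:
    spec:
      names:
        kind: K8sBannedImageTags
      validation:
        openAPIV3Schema:
          properties:
            tags:               # the parameter the constraint will supply
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sbannedimagetags

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          tag := split(container.image, ":")[1]
          tag == input.parameters.tags[_]   # tag matches a banned tag
          msg := sprintf("container <%v> uses banned image tag <%v>",
                         [container.name, tag])
        }
```

A matching constraint would then just set `parameters.tags: ["latest"]`, which is exactly the split between semantics and parameters the demo walks through.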
When we looked at the demo, we noticed there were a couple of pieces. There's a template that implements the semantics, written in a language called Rego. Rego is what Open Policy Agent understands, so one class of users, call them policy authors, needs to understand the Rego semantics. Then there's another class of users who need to understand the policies you actually want to implement using those semantics; those are more the admins, who care about what the resource limits are or which tags shouldn't be allowed. The way it works, happily, is as Kubernetes objects. Gatekeeper implements Open Policy Agent policies as a set of CRDs. Those CRDs are watched by OPA via Gatekeeper, which watches, through the API server, all of the objects created through the API server. It implements the admission webhook, allows a review through the webhook of everything coming in, and runs a query against OPA to apply the constraints to the policy templates. Essentially, the template ends up generating the policy in OPA, and as I said before, it exposes all of the metadata of every object type that goes through the admission controller. In this way we now have an essentially cloud-native way of enforcing policies.

Some people have rightfully asked: why Gatekeeper? Why don't I just implement everything in Open Policy Agent? Open Policy Agent is a very powerful, cross-context policy engine that's been used in all kinds of different control systems. The fundamental difference is that with plain OPA you essentially load all the policies via ConfigMaps, and ConfigMaps are traditionally very hard to maintain over time in clusters. There's also essentially no library of standard policies in OPA; you have to write your Rego from scratch, there's no clear way to do reuse or sharing, and that's not the way we want to standardize across multiple clusters or
multiple environments. So what we did with Gatekeeper is turn these into custom resource definitions, so that you can manage policies as objects. The way that happens is the combination of the templates, which we looked at, which contain the metadata you want to extract plus the Rego semantics, and the constraints, which are put together with a template to create instances of those CRD policies. And one of the things we're building up through the Gatekeeper project is a library of standard policies, standard templates and constraints that you can go fork, modify, and use for your own purposes.

Another feature we added to Gatekeeper is intended to make it easier for companies to get from point A, where we are today with, say, no policies implemented, to point B, where I'm completely in compliance and all of my policies are enforced, without killing your developers or your administrators. The first step is to understand, in my existing environment, which objects conform to my policies and which are out of compliance. So we implemented an audit capability. It allows me to periodically look at all the objects in my clusters and evaluate them against the set of policies I have in place, and that report is then written back against the CRD. This allows me to say: over time, I'm going to start by auditing and understanding where I'm out of compliance; then I may either tune the policy, or decide that something should be allowed by my policy; and eventually I'll start enforcing, so that all new objects have to conform. This gives you an easier on-ramp, which is very important in a lot of environments; in other environments, obviously, you may want to enforce right away.

So where are we in the life cycle of this project? The project is in the alpha stage, and it's working great. I can actually say that we use it.
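Going back to the audit capability for a moment: here is a hypothetical example of the status that audit might write back onto a constraint object. The command, names, and timestamp are invented, and the exact fields may differ by Gatekeeper version:

```yaml
# hypothetical output of: kubectl get k8srequiredlabels all-namespaces-must-have-owner -o yaml
status:
  auditTimestamp: "2019-06-25T08:15:00Z"
  violations:
    - kind: Namespace
      name: mystery-namespace      # a pre-existing object out of compliance
      message: "All namespaces must have an `owner` label"
```

Because the audit only reports, you can watch this list shrink as teams come into compliance, and switch on enforcement once it's empty.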
At Microsoft we use it to implement Azure Policy for AKS, so we actually have a preview service running at Microsoft leveraging exactly this technology. We also have other vendors using this technology in their environments. And we're at the stage where we're trying to build the community around this, so we need more hands who care about this to come get involved, give us feedback, and participate in the development.

There are new things on the horizon. Right now we don't support mutating webhooks, so the next thing we're looking at is doing things like: if somebody requests something from the Docker registry, maybe I want to silently point them to my private registry instead, so they don't have to worry about it. So mutating webhooks are one thing. Another is that we do replicate the data from etcd back into Open Policy Agent so that we can do comparisons, across multiple clusters for example; that will also be useful for comparing against external data, so that you can use additional context in your policies. That's another piece of work. We're also working with SIG Auth on authorization, and using OPA for authorization is likely to be a separate project rather than part of Gatekeeper. The audit right now is essentially an initial take at it; we have a lot of feedback that we need more capabilities there. Right now Gatekeeper has very limited metrics and observability, and we know it needs more maturity there. And right now there's no tooling around creating the policies. There is a dry run, so I can create a set of policies and locally test whether a policy is well formed and which objects don't conform, but we need to do more to make that easier.

I want to quickly say thank you to the community of people who have been the core of getting Gatekeeper started. It's a very interesting cross-section
of people across a number of different organizations, including Google, Microsoft, Red Hat, and many others. So how do you get involved? This is a very important slide. Please join the Slack community for Open Policy Agent; there's a Kubernetes policy group, so come submit your issues and keep track of what's going on with OPA Gatekeeper. Right now the community meetings aren't at a very Asia-friendly time, but that's something we're looking at fixing: I'm pushing for having a Europe-friendly time and an Asia-friendly time, which for me in North America means some mornings or nights. And that is the end of my presentation. I think we have one minute for questions, or I'm happy to hang out afterwards for any additional questions. Are there any questions real quick? No? Well, okay. Oh, there is one question.

[Audience] We're using EKS on AWS and we're trying to use kiam. I wonder if this can replace kiam in terms of roles?

I don't know a hundred percent, because I don't know kiam; I know IAM, and I assume kiam essentially just extends that. So in theory I think the answer is yes; we should test out that use case. Unfortunately we're out of time. Thank you guys very much, I really appreciate your attention.