Hello everyone, welcome to CloudNative Live, where we dive into the code behind cloud native. I'm Anita Alastair, I'm a CNCF Ambassador, and I will be your host tonight. Every week we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer all of your questions, so you can join us every Wednesday to watch live. This week we have Andy here with us to talk about writing Polaris policies. As always, this is an official live stream of the CNCF, and as such it is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code of conduct; basically, please be respectful of all of your fellow participants as well as presenters. With that done, I'll hand it over to Andy to pick up today's presentation.

Awesome, thanks for having me. So today I'm going to do something a little bit different than what I typically do in these: I'm going to be talking about Polaris. If you're not familiar with Polaris, Polaris is an open source policy engine for Kubernetes. Fairwinds has been managing Kubernetes clusters for seven or eight years now, and through all of that we realized that the thing that broke our clusters most often was our customers deploying things into them in ways that broke the cluster. So we wrote Polaris as a policy engine to audit those misconfigurations and also to block them from coming into the cluster. That's what Polaris is. But today what I'm going to try to do is tackle one of our open issues. So if you're curious about contributing to Polaris, or you want to add checks to it, this is the perfect thing to watch: I'm going to do exactly what you would have to do to add a policy to Polaris. Please feel free to jump in with any questions throughout; I'll be happy to answer them as we go along. But I'm just going to go ahead and dive in.

The issue we have today has actually been open for quite a while, and it probably has the most thumbs-ups of any of our Polaris issues at the moment, so it seemed like a good one to tackle. The original ask was to create a check to validate that pod anti-affinities or affinity terms were added to pods in your cluster. If you're not familiar with affinity or anti-affinity, those are just rules that you can specify to affect where your pods are scheduled. The request is basically this: when you schedule a set of pods, it's not guaranteed to be spread across multiple nodes, and it's not guaranteed to be spread across multiple availability zones. The scheduler will attempt to do that, but it's not guaranteed. So we can either guarantee it, or at least push further in that direction of spreading out our pods, by adding these terms. One of our maintainers actually came in and suggested that instead we use pod topology spread constraints, which I think is a great suggestion, so we're going to follow that today. I'm just going to try and add a policy to Polaris to suggest that you add pod topology spread constraints to your pod definitions. So over here I'm running a kind cluster. It's running Kubernetes 1.25; I just created it about 15, maybe 20 minutes ago. And then I've installed a couple of applications in it: I have a deployment running two pods in the demo namespace, and I have several deployments in the yelp namespace. So I have two different apps running in the cluster.
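For context on that original ask, a pod anti-affinity term looks roughly like the sketch below: it asks the scheduler to prefer putting replicas on different nodes. This is for illustration only; the deployment name, labels, and image are placeholders, not taken from the demo manifests.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-server          # placeholder name, not the actual demo deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-server
  template:
    metadata:
      labels:
        app: app-server
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" rather than "required", so scheduling still succeeds on a single node
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: app-server
                topologyKey: kubernetes.io/hostname
      containers:
        - name: app-server
          image: nginx:1.25   # placeholder image
```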
And I know, because I wrote the YAML for these, that the app server, for example, has no affinities or topology spread constraints on it. So the first thing I'm going to do is refamiliarize myself with pod topology spread constraints and get them added to this deployment, so that once I write my check I can actually validate that the check is working. All right, so we'll pull up the Kubernetes documentation here: pod topology spread constraints. It goes right under the pod spec. There's a bunch of optional stuff; I'm just going to copy this straight out for the moment, go down to my pod spec here, and break the formatting real quick to get that in. All right, so let's take a look at what all of this means. maxSkew is the degree to which pods can be unevenly distributed. I don't think I really care about most of this at the moment. So let's see: optional, optional, optional, optional. All right, so we've got maxSkew; I'm just going to set that to one for the moment. The topologyKey, that's the interesting thing: it specifies what we're trying to spread across. Right now I'm going to use the hostname, so we're saying we want it spread across multiple nodes. You might put something in here like the zone label, which would spread across multiple availability zones, or add multiple spread constraints to spread across both nodes and availability zones. I'm just going to keep it simple for the moment. For whenUnsatisfiable, I don't think I can be quite so strict as to say DoNotSchedule, because that is going to break in my kind cluster here: my kind cluster is only a single node, so this is definitely not going to be satisfiable. So I'm going to say ScheduleAnyway. And then let's see, what does the labelSelector do? It selects which pods are counted when calculating the spread. If the label selector doesn't match the pod's own labels, the pod can still be placed — interesting. I think I'm going to leave this empty, but I'm not quite certain, so for the moment I'm just going to comment it out. All right, let's apply that. It looks like that applied just fine. It's rescheduling the pod because I just modified the pod definition, and hopefully we see — let's see, this one's six seconds old. Let's describe it. I'm not sure why it looked like it was failing, but that's all right: it got scheduled just fine, and the old pod is terminating. OK. So we have one deployment with a pod topology spread constraint, and we have several deployments in the cluster without. Now we can actually start to write our policy to verify that we have those; we have both the positive and the negative case here. All right. So I have here a Polaris configuration. The first thing I'm going to do, instead of trying to add this straight into Polaris and making my pull request and all that, is add it to my Polaris configuration as a custom check. This is a Polaris configuration; it's largely the default, but I do have a few custom checks in here already as examples that I've used in other demos and things like that. So I'm going to go ahead and add this as a custom check, and then we can look at how we add it into the Polaris code base. It's a very easy way to get started. So we're going to add a custom check called topologySpreadConstraint, I'm going to add it here in the list of checks that we want to apply, and I'm going to set it to the warning level.
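Circling back to the deployment modified a moment ago: the block that ends up under its pod template spec looks roughly like this — a sketch reconstructed from the description above, with the labelSelector left commented out as mentioned.

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname   # spread across nodes
      whenUnsatisfiable: ScheduleAnyway     # DoNotSchedule could never be satisfied on a one-node kind cluster
      # labelSelector:                      # left out for now
      #   matchLabels:
      #     app: app-server
```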
I set it to warning so that, if I do have the admission controller enabled, it's not going to block anything. So we're going to add a success message: "Pod has a topology spread constraint." That's a difficult thing to say quite so many times in a row. And a failure message: "Pod should be configured with a topology spread constraint." So we've got our success message and our failure message, the information we're going to share with the end user. For the category: when we add this to Polaris, it's going to go under the reliability category, because the built-in categories are efficiency, security, and reliability, and this affects the reliability of pods and the stability of your clusters. So I'm going to put it in that category. Then the target — this is the interesting one, and we're going to pull up the Polaris documentation here. Target specifies what piece of the spec we're looking at. We could look at just the container specification, but I want to be looking at the pod specification, because that's where the topology spread constraint lives. So I want to target the pod spec. That'll look at any pod spec, which is good. And now we get into the complex bit where we start adding our schema. All right, so we have to put in the schema. If you're not familiar with Polaris, Polaris uses JSON Schema to validate what's going on and to do the audit. We've extended JSON Schema a little bit, but it's largely vanilla JSON Schema. If we look down at this other custom policy that I've written, we're targeting the container, then we say it's an object, it has a property called image, which is of type string, and we allow any of these pattern matches. This check is actually going to be fairly similar to that. What I'm going to do is pull up the app server that I just modified here on the left, so that we can get a sense for where we are in the object. We know we're targeting the pod spec, which means we're essentially sitting here in our policy. So just under that, we're going to say type object, then properties, and then topologySpreadConstraints. Now we get into the tricky bit, because we have a list, so that property is going to be of type array. That's an excellent question — I don't know the answer to that. Usually when I get stuck like this on Polaris policy, I start looking for other policies that are similar. On the bottom left of the screen here — I'll make that a little bit bigger — I'm in the Polaris repo, in the checks directory. This is all of the built-in checks that come with Polaris by default. So I'm going to look for something that targets the pod spec. Ooh, priority class. But now I need something with a list. Let's look at run-as-privileged and see what that policy looks like. When we're targeting the pod spec, we create a definition — JSON Schema lets us create predefined blocks that we can reuse. So that's a type object, containers of type array — so array was correct — and then items, then properties. This might work. But that's within the container array, so I don't want properties; I want the array, then items, and then... let's see if I can find a different policy to take an example from, or I'm going to have to go find some JSON Schema examples, because this is where things get tricky: type object, required. It looks like that's the pattern for anything that has an array in it. All right, anyOf on the properties, those are OK — that's the container array again. There we go. OK.
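Putting those pieces together, the custom check in the Polaris config starts out looking roughly like this — a sketch, assuming the `Pod` target is the one that corresponds to the pod spec as described; the schema body gets worked out in the next step.

```yaml
checks:
  topologySpreadConstraint: warning
customChecks:
  topologySpreadConstraint:
    successMessage: Pod has a topology spread constraint
    failureMessage: Pod should be configured with a topology spread constraint
    category: Reliability
    target: Pod              # assumption: this target audits the pod spec
    schema:
      '$schema': http://json-schema.org/draft-07/schema
      type: object
      # properties for topologySpreadConstraints go here (see the next step)
```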
So I can say required. All right, let's do this: let's get the deployment in the yelp namespace, the app server, as JSON, and go find a JSON Schema generator, because that's sometimes the easiest way to do this. Where's the one I've used in the past? Let's see if this one works. All right, generate schema from JSON. There we go, this might help. So let's look for our apiVersion, kind, metadata, namespace... spec... topologySpreadConstraints. And we've got items, then type — that's what I was missing. OK, so we have items, and under items we have type object. Now essentially we are targeting the actual object that is one of the items in the topologySpreadConstraints list. Then we need properties; let's just say the topologyKey has to be of type string. And I don't want pattern; I want... I think it's out of view... maxSkew, topologyKey, string. I could say pattern, but I think what I want is const. Yes, let's try that: kubernetes.io/hostname. So essentially I'm saying you have to have a topology spread constraint, and it has to have one of these topology keys. I need to find the one for zones: topology.kubernetes.io/zone. So it has to have a topology spread constraint with a topology key that is either kubernetes.io/hostname or topology.kubernetes.io/zone. All right, I'm going to save that config and then run Polaris with the output format set to pretty and the config set to my polaris-config.yaml. That should run my custom check against the controllers in the cluster. And I obviously wrote this wrong, because all of these deployments in all these namespaces say they have topology spread constraints, which I know is a lie. So let's take a look at our config. Also, in addition to questions, feel free to point out when I do things wrong, because it's definitely happening in here — if anybody sees the issue, let me know. I do think we have to say... let's see: pod spec, properties, topologySpreadConstraints... I think we need a required in here. Yes. So under the pod spec, where it says type object, we're going to add required: topologySpreadConstraints. That makes this property required as part of the pod spec. So let's run our audit again. There we go — now we see that we're getting a warning from all of the deployments that don't have topology spread constraints, and this particular one, the yelp app server, does. So now our check is working. That's good. Now I'm going to modify the wording here a little bit to say a valid pod topology spread constraint, because we're not just saying it has to have one, we're saying it has to be configured a certain way, and I want to be clear about the messaging there. Then I want to double-check: let's copy this spread constraint and edit the deployment in the demo namespace to add it, but I'm going to give it a different topology key just to make sure the check is working the way I expect. And actually, I'm going to use Polaris a little bit differently here: instead of auditing the cluster directly, I'm going to audit the YAML in the demo app configuration. And there we go — we see this is also warning. It does have a topology spread constraint, but I modified the topology key to be outside the allowed list, so this is now an invalid pod topology spread constraint.
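For reference, the schema that ends up working in this walkthrough looks roughly like the sketch below. The inner `required` on topologyKey is my own addition for completeness; everything else mirrors what was described: the constraint list must exist, and every entry's topologyKey must be one of the two allowed values.

```yaml
schema:
  '$schema': http://json-schema.org/draft-07/schema
  type: object
  required:
    - topologySpreadConstraints
  properties:
    topologySpreadConstraints:
      type: array
      items:
        type: object
        required:
          - topologyKey        # assumption: not explicitly mentioned in the walkthrough
        properties:
          topologyKey:
            type: string
            anyOf:
              - const: kubernetes.io/hostname
              - const: topology.kubernetes.io/zone
```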
This might be something where we consider splitting it into two checks: one saying that you should have a spread constraint, and one saying that it should be configured a certain way. For now, I think I'm just going to leave it the way it is. So we have a working custom policy, and we know it's doing the thing that satisfies the issue. Now we need to actually add that into Polaris. This is where we go from just using Polaris as-is and extending it, to adding it back in and contributing it upstream. So — why do I have a diff here? I'm going to check out a branch. What issue is this? It is 547. So: 547, add topology spread constraint. All right, the first thing we're going to do is add a YAML file for the topology spread constraint check. And we have a question from the audience — someone just asked "documentation?" — so maybe they're looking for resources about this. All of the Polaris documentation is at polaris.docs.fairwinds.com. And if you're looking for information on contributing, if you go to the Polaris repo, we should have a contributing guide in there somewhere; it might be in the documentation as well. But the general process is: file an issue, make sure it's something we think is a good addition, and then feel free to open a PR for it. And if the question is suggesting that we need to add documentation as part of this PR, I also agree with you, and I will be doing that. So hopefully that covers the documentation questions that come up. All right, I'm just going to grab the text that I wrote here and put it in the YAML file in Polaris. So now we have topologySpreadConstraint as a check, and I'm going to go ahead and build this. Then there are a few other things we have to do. First, I'm just going to make sure it works, so we're going to go back to where I was running this locally, and I'm going to use the version I just built. I'm not going to pass my configuration in, because that's what adds the custom check, and let's see if we find the new check in the list here. All right, so that's now working, I think. Oh, maybe it's not. Ah, it's not, because just adding the check YAML to Polaris is not enough: we need to find the default configuration and add it there as well. Where is that? That's a great question, and I believe it is covered in the documentation — all the built-in checks are listed there. Let's see: configuration... the default Polaris configuration is here, and it lives in the examples config. Is that right? So under reliability, I'm going to add my topologySpreadConstraint check at the warning level, rebuild, rerun here, and grep for topology. "Check topologySpreadConstraint not found." Interesting. What am I doing wrong? Let's go ahead and get a real IDE going now that I'm in the code base, so I can get around a little faster. We're going to look at the audit command and just walk through it: checks, optional... we get config, and config is a configuration object, which means we probably have a config package with the checks. Here we are — this is what I need: container checks, pod checks... topologySpreadConstraint. Add that to the list of checks in Polaris here, and let me make sure I got the name right.
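For reference, registering the new built-in check in the default configuration is roughly a one-line addition under the reliability section of the examples config (a sketch; the surrounding checks are elided), plus adding the check name to the list of built-in checks in the config package on the Go side, as just described.

```yaml
checks:
  # ... existing reliability checks ...
  topologySpreadConstraint: warning
```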
Copy-pasting the name makes sure I'm not missing anything specific to the non-custom checks, and that I'm at least getting the spelling right. Let's see where I am. Then I'll pop open a terminal here and grab the path. All right, so what I'm doing here is switching to kind of a different flow, where I'm working strictly out of the IDE. I'm going to say go run main.go audit, with the audit path pointed at that demo app, and the format set to pretty. Hopefully I get my topology spread constraint. "Polaris audited path, did not run any checks." That's interesting. Am I calling it correctly locally that way? It's been a while — I'm not sure I've ever actually added a built-in Polaris policy before. Passing all of our checks? That's not correct. So let's take a look at the logging flags and set the log level to debug, and see if we find anything in the logs. That's not it. Oh, I know the problem: this isn't the right path. There's no YAML in that path — I believe the folder is called demo-app — so there was nothing to audit. All right, now we're on the right path, and we have our warning: check topologySpreadConstraint, category reliability, "Pod should be configured with a valid topology spread constraint." Great. So now we've added the check into Polaris by default, and it's being built in. Next we need to take a look at tests and documentation so that we have a complete PR. All right, so let's go to docs. Is it docs.md? No, not docs.md. If we go to docs, then checks, then reliability.md, we have a table here that lists all the reliability checks. So I'm going to add topologySpreadConstraint, note that it defaults to warning, and say that it fails when there are no topology spread constraints on the pod. And then let's see — this is an interesting doc in that it mostly just talks about liveness and readiness probes. Well, first of all, we'll add another link here to the Kubernetes documentation on pod topology spread constraints, and link that in. And actually, there's some great information in the issue that I might use as background as well. I think I'll divide this up and do a little documentation cleanup: background, liveness and readiness probes, pull policy — let's make that a section about image pull policy — and then I'm going to add a section about topology spread constraints. By default, the Kubernetes scheduler uses a bin-packing algorithm to fit as many pods as possible into a cluster, and it prefers a more evenly distributed overall node load over having app replicas precisely spread across nodes. Therefore, by default, a multi-replica workload is not guaranteed to be spread across multiple nodes or availability zones. Kubernetes provides topology spread constraints in order to better ensure pods are spread across multiple availability zones and/or hosts. Example — let's get some code in here. We'll go get that out of our demo. Take a look — oh, I already have that open. Actually, I'm not going to use this one. All right, grab that code example, drop it into the block here, and do a little cleanup, because we don't need all of this: that's not necessary, that's not necessary. All right, and we need to change this to — what was it? No, I don't want to do hostname; let's use the topology.kubernetes.io/zone key for the example. OK: a pod topology spread constraint across zones. I do feel like, if I'm going to recommend this, I need to understand maxSkew a little bit better before we put it in the documentation.
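The documentation example ends up being roughly this (a sketch): the same constraint as before, but keyed on the zone label so replicas spread across availability zones.

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # spread across availability zones
    whenUnsatisfiable: ScheduleAnyway
```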
Reading the maxSkew docs: you must specify it, and it must be greater than zero. If you select ScheduleAnyway, the scheduler gives higher precedence to topologies that would help reduce the skew. If you select DoNotSchedule, it defines the maximum permitted difference between the number of matching pods in a topology and the global minimum. Well, that's a fun little bit of mental math that I'm not going to necessarily do. If you have three zones with two, two, and one matching pods, the global minimum is one, and a maxSkew of one works out. OK, thank you for the example — that is much better. So one is fine; we're just going to put one in there for the documentation, and I'm going to leave out the label selector and drop that content. All right, so we have an example, and that's probably enough documentation for this new policy. Now let's take a look at tests.

I think we don't have any more questions. No questions so far. Yeah.

All right. So after this, everybody's going to go write more Polaris policies, because we're all experts on it. Just like this, yeah. But keep the questions coming if anyone has any. And I think we'll also have a bit of time at the end if anyone is waiting until then.

Gotcha. All right. So I see here we have a checks directory under test, and each check has a folder. That folder has — let's look at a slightly simpler example — a failure.yaml and a success.yaml. So my assumption is that we go through and run the policies against each of these failures and successes, and assert that the failures fail and the successes succeed, which I'm guessing happens in our CI somewhere. Let's just double-check that: test dashboard, test Kubernetes deployments, the test job... Well, I'm just going to go ahead and add the folder and see what it does. I'm going to add a success.yaml and a failure.yaml, and I'm going to grab that demo app deployment code and copy it into success.yaml. All right, so this one should be a success. No need to keep that part in there. All right, let's add the failure. And for this, we're not just going to call it failure.yaml; we're going to call it failure.no-spread-constraint.yaml, and in this failure I'm going to drop the whole block, because we want to fail when there's no topology spread constraint at all. Then I'm going to do failure.invalid-topology-key.yaml, put the block in there, and change the topology key — oh, it's already bad. We'll save that, and then we'll fix our success, because it's clearly not going to pass as-is. Where's our documentation? Let me go grab the example from that; we're going to use the zone. Thanks. Let's go run our tests and just see. No, that didn't change anything. So how are we running these tests? Let's go ask the documentation; maybe it'll tell us. Contributing, project structure, getting started, running tests... oh, that's old, we haven't had to run it that way in a while. Webhook tests — that is wild, and helpful. So I have an idea: let's go ahead and commit what we have — add a check for topology spread constraints — and I'm going to make a pull request. All right, I'll need the issue in the title: fixes 547. I have signed the CLA, I promise. I think I have; if I haven't, I'll go sign it. I did add documentation. See issue 547. All right, the goal of this PR is to add a check for pod topology spread constraints, recommending these to help ensure you have high availability across zones and/or hosts. What change did you make? Added a topology spread constraint check. What alternative solutions did you consider?
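Going back to those test fixtures for a moment: a fixture like failure.invalid-topology-key.yaml might look roughly like this — a sketch only; the deployment name, labels, and image are placeholders, not the actual fixture contents.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app            # placeholder
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: some-other-label   # not in the allowed list, so the check should flag it
          whenUnsatisfiable: ScheduleAnyway
      containers:
        - name: demo-app
          image: nginx:1.25   # placeholder
```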
As for alternative solutions: there may be other parts of the spec that we want to make recommendations about; right now we only limit the topology key. The thing I'm wondering is whether allowing both of those topology keys is the best solution. It might not be, but I think giving the options is probably best. Also, I talked about possibly splitting this into two checks: one that ensures a topology spread constraint exists, and another to verify its configuration. It turns out adding features to CNCF projects is as much about documentation and writing good notes as it is about actually adding the feature. All right, let's go take a look at our PR. The reason I went ahead and opened the PR is that I'm expecting some tests to run automatically. Or perhaps they're not running — or am I not seeing them because I'm not logged in here? That may be the case. Let me go ahead and sign in; I need my security key. This is what happens when you run your demo in a private window. All right, not going to worry about that at the moment. The Polaris tests are set up and running, so that's good: the go tests ran and passed. I am still curious about those test check folders, but we can check those ourselves quickly by running Polaris against them. So: polaris audit, with the audit path pointed at the test checks for the topology spread constraint. Let's run it against failure.no-spread-constraint: topologySpreadConstraint warning, it didn't have one. Good. Then we'll check against failure.invalid-topology-key, and we see that's also failing. And then we'll just go check our success — actually, if I really wanted to, I could pass the checks flag and pass it just the topology spread constraint check, so I don't have to look through the whole output. And it passes. Good. All right, so if those checks are being run somewhere, they will pass. I am still super curious about that, though. I'm looking at the fixtures here — I don't know why I'm opening this in Vim inside of my IDE — but checks, fixtures... that's mocking pods and such for tests. So this is for testing against a cluster; we're mocking different pieces of the cluster. That's fine, but where are the cases I'm looking for? Mutation... schema test... this looks like it might be it: the schema test, the Polaris check test. We've got an environment variable toggle here driving this Polaris check test, and we'll go ahead and run it. Interesting. I don't know if that's doing much, but it does seem to be pulling the check YAML files and running through them. So in theory, by running go test, we've tested that the check fails the failure YAMLs and passes the success YAMLs. Let's try the IDE's version of testing and see what it thinks. Curious. There's the test directory, and that's what we want to see: schema test, test checks. I think we're good — I think that's all the testing we need to add for this new check. Hopefully my reviewer will tell me if I'm missing anything there. Let's see if we've passed yet. We're still testing the build. I'll have to go look at why this one is failing, but that's not important. So I think we have a valid PR, and with nine minutes to spare. I was nervous — I wasn't sure I was going to make it.

Perfect timing, I have to say, then. Plenty of time now for the Q&A. So, perfect. Yeah, so now is the time: if anyone was waiting until the end of the demo to ask their questions, now is the perfect time to start typing away and sending them in. But is there anything else that you want to share right now, Andy? No.
I mean, in general, Fairwinds has, I think, probably around 10 open source projects that we maintain, four or five of which we consider our flagship open source. So if you're interested in contributing to open source and you like working with Kubernetes — all of our tools are built around Kubernetes — we have Goldilocks, Nova, Polaris, and Pluto; those are the main ones. RBAC Manager and Reckoner as well. If you're interested in working on any open source projects related to Kubernetes, please feel free to go to github.com/FairwindsOps. We also have a community Slack, and I'm in the Kubernetes Slack as well, so if folks want to reach out with any questions about this afterwards or anything like that, I'm happy to chat about open source.

Awesome. So now we're all obviously experts in writing Polaris policies. But if anyone wants to learn more, what would be the next resource they should check out, or anything they should move on to next?

Definitely, if you're interested in Polaris, take a look at the repository and the documentation, and again, reach out in the Kubernetes Slack or in our community Slack, which should be linked in the repo. But definitely take a look. It also functions as a validating admission controller and a mutating admission controller, and I think the mutating piece is super interesting. It's not something we've explored a ton, but it has a lot of potential use cases. So take a look.

Perfect. Final call for questions if there's anything coming in — but I have a question, though. What's in Polaris' future? What are the next steps for the project: the roadmap, things that are happening, and so forth?

Good question. I think at the moment the actual feature set of Polaris is fairly stable. Where we see the most opportunity in the next six months or so is mostly adding additional checks. I know I have a coworker working on building out a set of checks to essentially satisfy the NSA hardening guide that you're probably also familiar with, which came out, I think, about a year ago at this point — I may have my timeline off on that — but it's really about adding checks. Other than that, nothing major from a roadmap perspective feature-wise. I think it's a pretty robust set, especially now that we've added mutation.

Yeah, well, that makes sense. No need to mess with perfection. — I don't know if I'd call it perfection, but it is pretty good. — Makes sense. But yeah, awesome, thank you so much. Since no audience questions have popped up — if anyone later realizes they should have asked something, I think you gave a lot of helpful resources and places to reach out to you, so that's really great. With that said, thank you, everyone, for joining the latest episode of Cloud Native Live. It was really great to have a session about writing Polaris policies, and we loved the few interactions from the audience — and hi to the people who said hi to us, of course, as well. As always, we bring you the latest cloud native code every Wednesday, and in the coming weeks we have more great sessions coming up. Thank you for joining us today, and we'll see you around in the coming weeks. Thanks for having me.