Hello everyone, I'm Sonny, and this is an introduction to Cloud Custodian. I'm here at Stacklet, I've got my socials up here. Hand it over to John. Yeah, and I'm John Anderson. I'll talk about what Cloud Custodian is and how I got into it. I'm one of the contributors on the Custodian project. I got into using Custodian as an SRE: I've been managing Kubernetes clusters and AWS cloud estates at companies like SurveyMonkey and Zapier, and my job was to empower developers to manage their costs while staying productive, make sure that they're in compliance with any governance requirements we have, like SOC2, and that they follow the best practices we've set out within our organization. What Cloud Custodian allows you to do is do all of this in one place. You can set up policies with an easy-to-use YAML-based engine and control your Terraform, your Kubernetes clusters, and any of the public clouds: Azure, AWS, GCP. And it's all the same YAML. You run the same policy engine, which means you're going to get the same reporting. Before I landed on Cloud Custodian, I was running many different policy agents: something for Kubernetes, something for AWS, something for Terraform. And when you're doing that across all these different things, you start getting a different experience between each one. So you might be preventing some SOC2 issues on AWS, but you're not preventing them at the Terraform layer. Cloud Custodian opened all that up. One of the first things that most of us do at the SRE level is manage cost. We want to get those costs down, and we want to give that information to the developers, saying: okay, you have underutilized resources, or you have resources that aren't even being used. So when I was at Zapier, one of the big things we had was some GitLab runners where the health checks were passing, they were up there working. Everyone thought we were utilizing them.
And then we got a report from the developers that pipelines were taking a long time to run. We started investigating it, and it turned out a lot of them were just sitting there doing nothing. So we were just burning $15,000 a month on GitLab runners that weren't actually running CI/CD jobs. And so that's where we use Cloud Custodian today, where we can run these policies and identify underutilized resources, zombie resources. We can actually enforce tags and alert the owners to say, hey, we need to know what this resource is and who it's from. That way, if it is violating any of our rules, we know who to communicate with. And we're not looking to police the developers; we're just looking to empower them to be productive and make the right decisions. And then the next thing is compliance. Most of us have some type of compliance, whether you're running PCI, HIPAA, CIS, any of those. And when you're doing something like SOC2 on an annual basis, you want to be checking it throughout the year, because if you wait until audit time to start gathering all of your evidence, it's going to be really difficult to get it all together in time for your auditor. With Cloud Custodian, you can run in an event-based mode or a pull mode, and you can be running your SOC2 policies all the time. And then finally, probably the most important part of this is correctness. You're hiring a lot of experts. You have them there because they know what the best practices are, so you want them to codify that. You don't want developers writing code without tests. You don't want to deploy infrastructure without Cloud Custodian, because Cloud Custodian is going to be those tests. It's going to enforce that correctness. And when we say correctness, it's things like: don't roll out a public S3 bucket, or an unencrypted S3 bucket, things like that. And so we just want to enforce all of that. Right. So I'll give you a look at what a Custodian policy looks like.
So for one, it starts out very simple. You just specify a name and a resource type. In this case, we're looking at S3. Then you define your filters here. So with Custodian, there are a lot of built-in filters that allow you to do things like look at relationships between objects or look at relationships between different resource types. In this case, we're looking at the actual policy on the S3 bucket itself, saying that we're going to allow access from this account to do these actions cross-account. But probably the most powerful part is the ability to take actions. So here we have a built-in notify action that allows you to send notifications to the resource owner. And that way you can really drive that behavior change at your organization, telling your users, hey, you've misconfigured something, or you need to change something about the metadata of the resource, and get it so that your users are constantly leveling up their usage of the cloud. And finally, to run it, you just run custodian run with the policy YAML file there. So here's another example policy. In this case, it's an IAM role, but as you can see, it looks very familiar to the previous one. Instead of doing a cross-account check, we're just checking that the permissions are not over-provisioned, and here we're notifying the security team. Finally here, I'll talk a bit about the different modes that we have. So with Cloud Custodian, you have a pull mode, which means: I want to take a look at the existing cloud infrastructure environment, querying against your AWS account or your Azure subscription or GCP project. But additionally, we have what are called event modes. So this way you can react in a more dynamic way.
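A policy along the lines described might look like this. This is a sketch, not the exact slide: the whitelisted account ID and the SQS queue URL are placeholders.

```yaml
policies:
  - name: s3-cross-account-access
    resource: aws.s3
    filters:
      # Flag buckets whose bucket policy grants access to accounts
      # outside the whitelist (placeholder account ID below)
      - type: cross-account
        whitelist:
          - "123456789012"
    actions:
      # Notify the resource owner; delivery assumes the separate
      # c7n-mailer tool is deployed and reading from this queue
      - type: notify
        to:
          - resource-owner
        transport:
          type: sqs
          queue: https://sqs.us-east-1.amazonaws.com/123456789012/custodian-mailer
```

You would then run it with something like `custodian run -s output policy.yml`.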
So for example, if you were to have a developer create an S3 bucket that is public, you can actually detect that right away when it happens, change the permissions on that bucket to make it non-public, as well as do things like notify the user or auto-tag it, or any other actions that are available on the resource type. So really this lets you close the gap on how long you're vulnerable for, as well as prevent things that are headaches down the line, like when you've had an RDS database that's been out there for years, and then you're telling developers, hey, you've got to go and fix it, and you're thinking, well, I kind of wish I'd caught that earlier. Finally, just a quick rundown on the software development lifecycle for policies. So one way to do this is you can keep all of your policies in a Git repo and run CI on that, and typically what that means is running the actual policy in a dry-run mode against your resources, and then deploying them. You can have scheduled policies with periodic modes, as well as those automated policy triggers, and then send notifications and perform remediation on top of that. So what does this have to do with KubeCon? Custodian now has support for Kubernetes. One mode is the pull mode that I mentioned before, where we just look at your cluster and run a filter on that. The other mode is the k8s admission mode. So this is an admission controller that you can deploy into your cluster, or outside your cluster, using a MutatingWebhookConfiguration, and it can do all the things an admission controller does, which is allow or deny objects based on filters that you've defined. It's easy to deploy with the Helm chart, and, for example, you can do things like auto-label objects as they come in. So if you want to require everybody to label their resources as being managed by Sonny, but people don't seem to be doing it, you can do it for them and try to move the needle a little bit. So we'll do a quick demo here, and hopefully everyone can see my screen.
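The public-bucket example above could be sketched as an event-mode policy like this. The specific filter, remediation action, and IAM role are assumptions for illustration, not from the talk:

```yaml
policies:
  - name: s3-remediate-public-on-create
    resource: aws.s3
    mode:
      # React to the CloudTrail event as the bucket is created
      type: cloudtrail
      events:
        - CreateBucket
      role: arn:aws:iam::123456789012:role/custodian-policy  # placeholder
    filters:
      # Buckets granting access to AllUsers / AuthenticatedUsers
      - type: global-grants
    actions:
      # Turn on the bucket-level public access block
      - type: set-public-block
        BlockPublicAcls: true
        BlockPublicPolicy: true
      # Tag the bucket with who created it, from the event
      - type: auto-tag-user
        tag: CreatorName
```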
So the first thing we'll do is take a look at the resources here. So here we have a namespace, and we also have this pod manifest. This pod manifest is pretty basic: we have nginx, and we have another pod with a couple of labels here. So let's go ahead and apply this. And then what we'll do next is take a look at the policy that we have here. And here we have a policy that just says we're going to require this label, app.kubernetes.io/managed-by, and if it's absent, to filter it in. So we can just run custodian run with these policies. Oh sorry, got to be in my virtual environment there. So if you run this, you can see that after we applied it, we're getting back some results here. Basically, all of these results show up in our output directory here. So we can see, for example, in this require-managed-labels policy, that in the labels here we only have the foo and bar labels. Basically, you could do a whole lot of other, more complicated stuff with the filter syntax, but this is just a quick look at how you can get started. The next thing we'll take a look at is the admission controller mode. So first let's do a quick cleanup. And like I said, with the admission controller mode, the way that it works is that we'll try to create some resources here. And you can see here at the top, we have a missing-recommended-labels policy that says all pods must have foo and bar labels. So we actually do create the pod here, but you get that warning. The second one here is requiring deployments to have at least three replicas. This can be important because you want people to deploy things in a way that's more HA. And you can see here this is an outright denial. And in the same vein here, with pod-with-service-account, this policy is saying that... Oh, I'm sorry, I didn't apply the Kubernetes manifest ahead of time, but basically this would prevent you from creating a pod with the service account cluster-admin.
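The pull-mode policy in the demo is roughly the following sketch (the policy name is approximated from the demo output):

```yaml
policies:
  - name: require-managed-labels
    resource: k8s.pod
    filters:
      # Match pods that are missing the managed-by label
      - type: value
        key: 'metadata.labels."app.kubernetes.io/managed-by"'
        value: absent
```

Run with `custodian run -s output policy.yml`, which queries the cluster in your current kubeconfig context and writes matches to the output directory.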
So let's actually apply that here. And then we can do... Just a sec, sorry. Right, okay, there we go. You can see we're creating based off that service account, because it now exists. The policies here, if you take a look, use the same policy language as before. The only thing that's changed is that instead of having just the resource description and filters, you now have this new mode section where you can define what to do when you match on that policy. So in this case, we're saying we're going to deny, and which operations we care about. Instead of looking at every single possible event, you can really easily define what you actually care about. So, let's take a look here. If we just create a pod with a kubectl run, you can see, again, we get that warning, but we can ignore that. We can now try to... Oh, that should be just that. And if we try to connect there, it's going to say we denied admission, because you can't connect to any pods with "database" in the name. So now I'll send it back over to John to talk about Terraform. Yeah. And so what he was showing there is where we're able to scan the Kubernetes resources and identify them, but then also block them at the admission controller. We have the same type of functionality when you're talking about Terraform. So our Terraform provider does static analysis. You'll see a lot of policy agents that require the plan to actually run to validate your Terraform; we do static analysis instead, which means you can have your developers run this in an unprivileged mode. They can check on their local computer whether this is going to be accepted, without actually running the plan and needing the privileges that requires.
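The admission-mode policy from the demo looks roughly like this (the cluster-admin service-account check is from the demo; the exact keys are a sketch):

```yaml
policies:
  - name: deny-cluster-admin-service-account
    resource: k8s.pod
    mode:
      # Run as an admission webhook instead of pull mode
      type: k8s-admission
      # Deny the request when the filters match
      on-match: deny
      # Only evaluate these operations, not every event
      operations:
        - CREATE
    filters:
      - type: value
        key: spec.serviceAccountName
        value: cluster-admin
```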
So yeah, so when we run this, you'll see it's going to traverse all of your Terraform configuration, and it's going to actually show you the lines of code that are violating, what module they exist in, and it'll really walk you through how to fix this. So right now we have two policies that are running that say we require KMS encryption, and we actually do want encryption enabled on any SQS queues. And so this allows you to just validate any of the Terraform that you're writing locally against those policies, and those can be correctness policies, governance policies, all that. And so if we wanted to go here and fix this, it told us the exact lines that we need to go into. Put in the diff, and you can see we just need to enable KMS encryption, and that'll fix one of our errors, and then all we have to do is fix our SQS queue and enable encryption on that as well. And so this can be run via CI/CD in your GitLab or GitHub Actions, or just locally on your computer. Yes, all of these are shipped directly into Cloud Custodian. Yeah, and so the way that's working is we wrapped a library from Aqua Security called defsec. It's a Golang library, but Cloud Custodian is Python, so we wrapped that and released it as a new open source project called tfparse. So anyone that would like to parse Terraform via Python, you can get it from the GitHub cloud-custodian/tfparse repository. And the new command for running Cloud Custodian against Terraform is called c7n-left, and it supports actually traversing your entire graph. So you'll see, like one of those policies we were showing, it was saying we require encryption on the S3 bucket, but with Terraform, server-side encryption configurations are actually a separate resource. So it's actually linking those resources together and saying, okay, there is a server-side encryption configuration that's linked to this bucket, I'm going to run the policy against that and either reject or accept it.
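A c7n-left policy for the SQS encryption check might look like this sketch; the attribute name follows the Terraform AWS provider, and the exact filter from the demo may differ:

```yaml
policies:
  - name: sqs-require-kms-encryption
    # c7n-left resources mirror Terraform resource type names
    resource: terraform.aws_sqs_queue
    filters:
      # Flag queues with no KMS key configured
      - type: value
        key: kms_master_key_id
        value: absent
```

You would run it against a Terraform directory with something like `c7n-left run -p policies/ -d ./terraform/`.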
One of the common questions is: how are you different from Open Policy Agent, or OPA? So OPA is a much more broad and generic policy language that has a lot of use cases outside of just the cloud. One of the things that makes Cloud Custodian powerful is that we are opinionated about where we are, and that gives you the ability to have very expressive filters that semantically make a lot of sense inside the cloud environment. On top of that, OPA is much more verbose. So here's an example of OPA requiring encrypted S3, and this is looking at the Terraform, so purely on the infrastructure-as-code side, not looking at the actual resource itself, and these are the equivalent policies in Custodian. So on the left here, this is actually checking in the cloud what your bucket encryption is on all of your buckets in that account, and then on the right is the Terraform equivalent. So basically much more concise and ergonomic, but of course OPA has a lot of different use cases, like if you're using it in other contexts, there are many reasons to use OPA as well. I think the next question is: how are you different from Kyverno? So Kyverno is great because it's built natively for Kubernetes, and it has a lot of great power and features in the Kubernetes space. But if you're looking for a single tool to manage the cloud, Kubernetes, and Terraform, Kyverno doesn't really do that. In the future it may, but as of right now, Custodian is really the only one that's touching the entire stack, going from your infrastructure as code all the way to the cloud to the cluster as well. And so here's another example of what a policy looks like inside of Kyverno, and here's the equivalent inside of Custodian. So again, a little bit more concise, but for different use cases you may want to pick Kyverno. For people that are using Custodian, one of the initial goals with the Kubernetes provider is to provide something that's very native and familiar to existing Custodian users. I think that's it.
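The slides aren't visible in the transcript, but the left/right comparison described would be sketched roughly as the two policies below: one against the live cloud, one against the Terraform code. The exact Terraform-side key is an assumption, since newer provider versions model encryption as a separate linked resource:

```yaml
# Cloud-side: flag live buckets without default encryption
policies:
  - name: s3-unencrypted
    resource: aws.s3
    filters:
      - type: bucket-encryption
        state: false
---
# Terraform-side: roughly equivalent c7n-left policy
policies:
  - name: tf-s3-unencrypted
    resource: terraform.aws_s3_bucket
    filters:
      - type: value
        key: server_side_encryption_configuration
        value: absent
```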
Open to questions. Thanks for having us. We've got someone going around handing over the mic. Hey, so one of the questions I have, and this is maybe a bit speculative: since you can look at Terraform for governance, and you can also look at AWS or another cloud provider, would there be a way to add a requirement that, say, if this S3 bucket exists in AWS, it's required as part of our governance that it's in Terraform? For instance, to enforce IaC coverage? Yeah, that's a good question. The question was: if there's a bucket you have in S3 that's required to be managed by Terraform, could we do that in Cloud Custodian? Yeah, I think the facilities are there. There are some interesting policies that you could write. So for example, with the event-driven policies, you can take a look at the actual underlying event, which means that you can also take a look at who the user is that's creating it. So any sort of modifications to the bucket, if it's not coming from a user that you know is part of your CI... And I believe you can actually also inspect the user agent itself, which would allow you to check, with a regex expression, that the user agent is Terraform. That's how I would do it; I'm sure there are other ways, and other members of the community might be able to chime in. Well, what I was thinking is, because you can statically analyze the Terraform, presumably you could then link that statically analyzed Terraform to an actual resource, so you could make sure that for each resource there exists some Terraform, basically.
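The user-agent idea suggested in the answer could be sketched like this. This is untested and speculative, matching the hedged tone of the answer itself: the event key path, the event selector, and the negative-lookahead regex are all assumptions.

```yaml
policies:
  - name: flag-non-terraform-bucket-changes
    resource: aws.s3
    mode:
      type: cloudtrail
      events:
        - source: s3.amazonaws.com
          event: PutBucketPolicy
          ids: requestParameters.bucketName
    filters:
      # Match only when the caller's user agent does not mention Terraform
      - type: event
        key: detail.userAgent
        op: regex
        value: '^(?!.*[Tt]erraform).*'
```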
Of course you could look at the user agent or whatever, but that seems like a bit of a hack, kind of, because that will work for stuff created on the bucket, but then for modifications it gets a bit... I wouldn't say it's a hack. I mean, if you have stuff in Terraform, there really shouldn't be any other agent going and modifying the resource, and I think that's something that I would want to know about as an SRE or cloud engineer: are people coming in and actually manually modifying stuff, especially if it's supposed to be managed by Terraform? But I think if your organization is in a state where you have widespread infrastructure-as-code adoption, enforcing those kinds of standards is a lot easier than if you're in kind of a 50-50 state, where some developers are really using IaC, Terraform, CloudFormation, and what have you, and other people are still doing click-ops. Well, an example, I guess, is: if you write some Terraform code and create a resource using Terraform locally, okay, it'll see, if you look at the user, that it was created using the Terraform agent or whatever. But is there a way to check: let's look at what's in Git, and let's look at what's in AWS, and see, for instance, that everything that's in the production account is also in the main branch of the IaC repo, or whatever the production branch is? The direct connection between the Terraform HCL and the environment isn't defined by either of them; it's defined by state. Yeah, and the c7n-left stuff, again, is a static analysis of the Terraform code and not taking a look at the state right now. There's another question. Like that, so with your Terraform stuff, does it also work with modules, or is it only if it's all in the main .tf? I'll let John answer that. Yeah, it works on the modules as well, so it'll actually download all the modules
and traverse them as well. So it'll run on there, at least download everything, and it traverses through and works. Do you have any more you want to say? Another question in the back. How would you compare with Azure Policy for Azure resources? That's a great question. So we actually support Azure as well, and as part of the way that we deploy, we get similar questions about AWS Config. With Custodian, it's really the easiest way to deploy those types of rules. I'll let Kapil speak more to that. Yeah, I mean, Azure Policies are powerful as a control-plane primitive, but I think the challenge is expressivity and readability, whereas these are potentially operating both in CI, are much more expressive, can be operated on a developer workstation as well, and can also do analysis post-deployment, so to speak. Whereas Azure Policy, the last time I looked at it, has a fairly primitive comparison model as far as what you can do, and defining exceptions and other real-world things was actually a little bit harder. Yeah, and I'll add: one of the big benefits of Custodian is the fact that it's operating across all these things. It's doing Kubernetes, Terraform, Azure, GCP, all that. And a lot of times, like we said with OPA and Kyverno, they're great in the place that they're sitting, but when you're trying to govern across all of it, it's really nice to have one policy agent that everyone can learn. So like, when we rolled out OPA at Zapier, the people who were writing Rego all the time knew how to write Rego.
For anyone else, it was difficult every single time. And so with Custodian, when it's spread across all of these different environments, people are writing it all the time, and it's YAML, so it's much easier for them to understand. So Azure Policy is a great policy agent, but if you want to go across everything, you're going to want something like Custodian. We've got one over there. Today in our accounts, one problem that we face is folks use tools that create cloud resources, like eksctl and kops, and when we use Cloud Custodian right now, one of our problems is we cannot do a proper cleanup. We have some dangling resources, because eksctl, for example, uses CloudFormation for some things and then does IAM roles and leaves things behind. So I'm kind of just wondering if there's any plan... I do not know where to fix it or how to fix it. So you're asking, if resources are being left over after a CloudFormation stack runs, you want to clean that up, right? Like you want to use Custodian to clean it up. But when you interact with eksctl, for example, it's not just creating resources using that; there are some leftovers around that are the problem. And then kops is a whole different beast, I would say. That's one other use case.
Yeah, and so I'll say, with Custodian we do have actions for modifying all of this as well, so we can remediate any of these resources. So the thing you would need to do, if you want to manage any of these resources that are being auto-generated (because Kubernetes does the same thing: if you're using an AWS load balancer controller, it's going to be generating load balancers out in your cloud), is, if you want to identify those, make sure whatever tool is generating those resources is labeling them, or have some other way to identify that they're created by this tool. Or, like we were talking about earlier, there's the ability to scan for underutilized resources as well, like unused resources, and you can just run policies that say, hey, I've got this unused resource: identify it, remove it. A lot of times what we recommend is actually just marking it for removal: mark it for deletion, send a notification saying, hey, I found this unused resource, I'm going to delete it in a week. If nobody notices, it's going to disappear in a week, and then they'll notice, but hopefully it's an unused resource. And ultimately you really care about the underlying actual cloud resource: the metadata you have on the Kube API server doesn't really cost you any money, but the load balancers that are out there will, and Custodian can target those directly. We have a question over here. When you are looking at an existing infrastructure, are you able to get whatever we want as an output, like a CSV file? Yeah. So, the question was: are we able to basically report on that data in formats like CSV? Custodian comes with a built-in report command where you're able to pass in the policy file that you were using, as well as the output directory, and it will create a tabular view in CSV. On top of that, there's a whole bunch of other tools inside of Custodian as well, beyond just the Custodian CLI: there's the mailer, which is good for notifications, there's policystream, there's trailcreator, and stuff like that.
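That mark-then-delete workflow is typically expressed with the mark-for-op action. A sketch for unused classic load balancers follows; the tag name and grace period are arbitrary choices:

```yaml
policies:
  - name: mark-unused-elb-for-deletion
    resource: aws.elb
    filters:
      # Classic ELBs with no registered instances
      - Instances: []
      # Skip resources already marked
      - "tag:custodian_cleanup": absent
    actions:
      # Tag the resource with an op and a date; a companion policy
      # with a marked-for-op filter performs the delete after 7 days
      - type: mark-for-op
        tag: custodian_cleanup
        op: delete
        days: 7
```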
There's a whole lot of cool stuff in the repo, and yeah, reporting is a huge part of that, because ultimately the people that care about compliance are typically not running CLIs; they usually want an Excel sheet. So I guess, going off this question, does it do a dependency search? So like, say somebody creates an EC2 instance, and then they tie a security group to that, and then maybe they want an ALB on top of that. If I delete the EC2 instance, or before I delete it, will it say, well, this is dependent on this, this, and this, you may want to delete this or not delete this? Yeah, the question was around how Custodian handles interdependencies between the resources. So with Custodian, the query part of Custodian allows you to query on relationships as well. Your question about, before you delete, you know, a security group, you've got to make sure that it's not attached to anything: I would have to look at the specific code for that, but I'm pretty sure we catch that and handle it. The security group itself has an unused filter, so you can use that filter first before you try to delete it, and there's other functionality that's similar to that for other resource types as well. This question back there. These policies don't look very hard to write, but I have to think about all the things that I want. Is there some library somewhere of, like, these are pretty good policies that I could look at as examples? Yeah, so in the documentation on our website we have a bunch of sample policies. You're right, the Custodian language is very easy to understand, but it does help to get some initial traction first, so I would take a look at those, and it covers a wide array of use cases, from simple compliance to stuff like cost efficiency using off-hours and stuff like that. Yeah, and if you go to GitHub, there's a Custodian awesome repo that lists everyone's favorite policies. That's a good way to start, because
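The unused security-group check mentioned in the answer is about as small as Custodian policies get, and the report command gives the CSV view from the earlier question:

```yaml
policies:
  - name: unused-security-groups
    resource: aws.security-group
    filters:
      # Security groups not referenced by any instance or ENI
      - unused
```

After a run, `custodian report -s output policy.yml --format csv` produces the tabular output against the same policy file and output directory.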
there are off-hours and encryption policies and all of that listed there. If you couldn't get an Uber and ran here from across town and missed the presentation, what would be the best way to catch up on what was missed, or if you want to send something to your co-workers or other people that might be interested? Sure, so I would start by checking out our recently revamped website. You can also check out the GitHub repo; it's cloud-custodian. We also have a Slack and Gitter, which you can get access to on the website, and we're in there pretty actively. The Slack is relatively new, so if you see there's only like 150-odd people, don't be alarmed; we're trying to move off of Gitter, or not off of Gitter, but also to Slack. But yeah, we're there. And then we also host a community meeting every other Tuesday; it's 11 a.m. Pacific time. Oh yeah, we also have a YouTube channel. So we recently had Governance as Code Day, which had a bunch of different talks from all sorts of different users, including Intuit, and, you know, Stacklet was there as well. So I would definitely go check out those videos, and I think we have just a bunch of other videos as well; I'm not even sure everything that's on the YouTube channel about Custodian. But to catch you up: we can write YAML to govern all the things. It's YAML all the way down. So yeah. For the admission controller, how are those policies applied to the cluster? Is it through a CRD, and we can kubectl apply the files? So right now it's through a ConfigMap. Basically, the service that's running is going to just mount it into a volume and then read all the policies from there. So how you mount that volume into it doesn't matter, but the Helm chart itself is just going to load it up into a ConfigMap and then read them from there. And eventually we may introduce CRDs for policies so that people can deploy them that way, but right now we're trying to work with people who are using it across many different resources, so we expect them to
not only be deploying Kubernetes policies, but Terraform and AWS and all that as well, so we're assuming they'll have a repo of policies at that point. Cool. And I think we're basically... there's one more question, I think. So, I only joined the org recently. At my current company, we have a lot of resources out there that were created a while back, and we don't know who created them; we don't know if they're still at the company. So the auto-tag only supports tagging on the create API right now, right? So is there support, or are you planning support, for looking back into history? Yeah, so the question was: how can we basically fix tags on resources that are already out there? What cloud are you using? Okay, cool. So for AWS we have a tool called trailcreator. It's not super well known, but it basically can scan your CloudTrail event history and then do resource attribution based off of that. And what you described right now is basically the exact scenario that many, many large companies are in. It's also the scenario that basically caused Cloud Custodian to even exist in the first place: there's a whole bunch of stuff, we don't know who owns it, and we've got to figure it out, because we've got to also govern it. I'd recommend checking out the tools directory; you've got to go one directory in, but yeah. It's using Athena or... Yeah. Thanks, everyone, for coming. We're out of time, so we're going to end there, but we've got many of the contributors and maintainers of Cloud Custodian here in the crowd, so if you have more questions, we can take them offline and talk after. Thanks, everyone.