Well, let's get started. So welcome. This is The Art of Kubernetes Add-on Validation: Secure Strategies for the Modern Developer Platform. My name is Joaquin Rodriguez. I'm a software engineer at Microsoft. I'm based in Austin, Texas. Welcome. So for today's agenda, we're going to be talking about cluster add-ons. I'll be introducing what they are in case you're not familiar with them, why validation is important, and some of the validation strategies that you can take and implement. Also, I'll be talking about secure rollouts. And I will conclude today's presentation with a demo. So cluster add-ons, what are they and why do we care? Cluster add-ons are tools, applications, or services that enhance the functionality of a Kubernetes cluster. They provide essential capabilities that are not included in the core Kubernetes components. So if you think about Kubernetes, we know Kubernetes is great, but a vanilla cluster can only do so much. You need to enhance it. You need to expand it. That's what cluster add-ons are here for. Customization: just like I was saying, add-ons allow you to customize your Kubernetes cluster according to the specific requirements of that cluster. Not all clusters are the same. Some clusters might have different requirements. It really depends on what you're trying to do. Resource management: add-ons help us manage resources to make sure that everything is running smoothly, behaving as you expect, and that your resources are not being wasted. And then there's the add-on ecosystem. These add-ons are meant to interact, well, in some cases, they're meant to interact with one another. So when we are validating, we need to check that those integrations between add-ons are working as expected. So these are some examples of cluster add-ons.
You can group them in different categories, such as monitoring and logging, networking and communication, security and authorization, and storage. I'm kind of curious, by show of hands, who has used any of these? Yeah, pretty much everybody. So that's great. OK, then why do we need to validate these add-ons? The first thing is we want them to be compatible, right? So when you deploy these add-ons, you need to make sure that they're compatible with, for example, the Kubernetes version. As Kubernetes progresses, things might change. APIs might change. Things can just get different, right? So you need to make sure that they're compatible. You want to improve security. For example, you want to identify possible vulnerabilities, just to make sure that you're not putting things at risk. Performance: you want to make sure that you're not doing something that is going to exhaust your resources and make the cluster go crazy. For example, if you deploy something with no limits, like CPU or memory limits, that could be an issue. So when we validate, we want to check those types of things. Facilitate upgrades. Sometimes we want to promote an add-on from one cluster to another, let's say going from dev to test, and before we do that upgrade, we want to make sure that it's working as expected. So if we validate before we upgrade, then we don't hit that issue, right? Well, in some cases, we still have issues, but in the best case scenario, we expect not to have any. And of course, each cluster might have its own configuration, and we want to make sure that we're validating against that. So today, I'll be presenting a few strategies to do validation. I just want to add a disclaimer: these are not exhaustive. You can implement them as needed. You don't have to do them all. In some scenarios, you might need one or two. In some, you might need all of them. So it's really up to you, depending on what you're doing and what you're trying to validate.
So the first thing is static code analysis. Then we're going to be checking Helm chart validation, image validation, and validating using policies and rules. And at the end, we're going to be doing some integration testing using secure rollouts. So let's start with the first one, which is the linter validation strategy, or static code analysis. Even before we deploy our add-ons into the cluster, we want to make sure that they are correct, and that's where linting helps. There are some open source tools that can help you. For example, one of them is KubeLinter. KubeLinter is awesome. You can run this tool against a YAML file, and it will tell you, for the most part, what's wrong with it, if there's anything wrong with it. As you can see here with this example, I'm just validating a simple deployment file, and it found some issues. Basically, I'm not running as non-root, and I forgot to put my CPU and memory limits, and it tells me about that. Another one that is pretty good is KubeConform. It works very similarly to KubeLinter; KubeConform validates against the Kubernetes API schemas. And then the other one is KubeScore. When you run KubeScore, it gives you a little score on how compliant your YAML is. In terms of which one is better, I don't have the answer to that. It's really up to you to test them out and compare them, depending on whatever needs you have. But the cool thing about these tools is that you can run them as part of your CI flow. So whenever you're doing something with YAMLs or you're trying to validate something, you can integrate these tools in your CI workflow, and then you can abort the CI flow if there's an issue. The next thing is Helm chart validation. This is very useful, especially if you're creating your own add-on, or if you're importing an add-on from somebody else and you want to make sure it's working correctly. The first thing is helm lint.
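To give an idea of what those linters check for, here's a minimal deployment sketch (all names here are made up for illustration) with the fields that satisfy the checks mentioned above; without `runAsNonRoot` and the `resources` block, KubeLinter flags checks like `run-as-non-root` and `unset-cpu-requirements`:

```yaml
# Hypothetical add-on deployment; without the securityContext and
# resources blocks below, KubeLinter and kube-score report violations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-addon
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-addon
  template:
    metadata:
      labels:
        app: my-addon
    spec:
      containers:
        - name: my-addon
          image: ghcr.io/example/my-addon:1.0.0   # hypothetical image
          securityContext:
            runAsNonRoot: true
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi
```

Running something like `kube-linter lint deploy.yaml` in CI, and failing the pipeline on a non-zero exit code, is the typical way to wire this in.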
A lot of people don't know about this, but Helm has a built-in lint command that you can run against Helm charts. Basically, it will tell you if you made a typo or things are not looking as you expected. It will just tell you, and it's pretty neat. You can also install different plugins. The first one is helm-unittest. Just as if you're writing code, you can write unit tests for your Helm chart. You can define different checks to run against the rendered chart, and if something fails, it will tell you. And also, just like I was mentioning in the previous slide with KubeConform, you can actually integrate KubeConform as a Helm plugin. What it does is use helm template to render your Helm chart into plain YAML, and then it applies the validation to that. So it's pretty cool. It's pretty, pretty useful. The next thing is the container image validation strategy. Again, if you're making your own add-on, or if you're trying to deploy an add-on that already exists, you want to check for vulnerabilities before they make it into the cluster. There are some pretty cool tools that you can use, such as Grype or Trivy. They're open source, and basically, as part of your CI flow, you can scan your images, and they will report any vulnerabilities. These tools use vulnerability databases to check for issues and then report them to you. And they're integration-friendly: these tools can also be incorporated into your CI workflow. And they're open source and free. So that's pretty cool. OK, something that is very important to keep in mind. Let's say you already deployed your image into the cluster, and then a few weeks later, your image is found to be vulnerable to some flaw that was discovered somewhere, right? How can you continuously check for new issues as those databases get updated? You can use Harbor.
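As a sketch of what a helm-unittest test file looks like (the template name and assertions here are assumptions for illustration, not from the talk), tests live under the chart's `tests/` directory and run with `helm unittest <chart-dir>` once the plugin is installed:

```yaml
# tests/deployment_test.yaml — helm-unittest sketch; template name
# and paths are hypothetical.
suite: deployment checks
templates:
  - deployment.yaml
tests:
  - it: should set CPU and memory limits
    asserts:
      - isNotEmpty:
          path: spec.template.spec.containers[0].resources.limits
  - it: should run as non-root
    asserts:
      - equal:
          path: spec.template.spec.containers[0].securityContext.runAsNonRoot
          value: true
```

If a rendered template fails an assertion, the plugin reports it just like a failing unit test, so it slots naturally into the same CI flow as the linters.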
With Harbor, you can host your own images inside of a Kubernetes cluster. Inside Harbor, you can have plugins that integrate with Grype or Trivy, so if in the future there's a new issue, you can be alerted that, hey, there's a new issue detected in this image, do something about it. The next strategy is using policies and rules. One of my favorite tools is Kyverno. Basically, Kyverno is a Kubernetes-native policy engine that automates validation, and you can also do things like mutation and generation of Kubernetes resources, and you can define policies. You can automate security and compliance: you can enforce pre-defined security standards for cluster add-ons automatically. Real-time validation: the cool thing about this is, even before you try to deploy something into a cluster, if that thing you're trying to deploy violates one of your policies, it will stop that deployment from happening. It will say, no, you cannot come in. You cannot deploy that because you're violating this. So don't do it. And just like that, you can also prevent misconfigurations. Again, if you're trying to deploy something that is not compliant with your standards, it will block it. And again, it enforces best practices. This is an example that I got from the Kyverno docs. It's pretty useful, and it explains how Kyverno works in simple terms. You have a policy, and that policy might have one or more rules. Each rule will match some sort of object. You can think of a namespace, or a deployment, or a label, or something. Either you're going to match that, or you're going to exclude that. And once you do, you can do a validation, or you can mutate something, or you can generate something, or you can verify an image. I have an example here of a policy. Basically, what this policy is doing is checking for resources of the kind Pod.
So if you have a pod that is mounting secrets using environment variables, basically it's going to say, no, you cannot do that. You need to mount them as volumes. Therefore, this pod cannot exist. So just to provide some context on how Kyverno works, you can do a lot of crazy stuff with it. I have another example in the demo that checks for vcluster secrets, and every time there's a new vcluster secret in my Kubernetes cluster, it will automatically pick it up and register it in Argo. Yeah, it sounds kind of wicked, but it works really well. My next strategy is validating using rings and secure rollouts. Just to explain how this works: with a ring deployment, you can think about it as phases. You have a phased deployment strategy that breaks down the rollout into different stages, from a small control group outwards towards the entire infrastructure. This method reduces risk and allows for thorough checks at each step. Why use ring deployments for Kubernetes add-ons? First of all, it's safe. It allows for incremental validation and monitoring to identify and mitigate potential issues early. It reduces the impact of updates or new deployments on production workloads. And also, you can actually use GitOps for ring deployments. If you haven't used GitOps, it lets you use Git as a single source of truth for declarative infrastructure. Basically, your Git repo is the source of truth, and whatever you put in there, a GitOps agent, such as Argo or Flux, will pick it up and deploy it into your cluster. So how does this work? The first thing is you have, for example, a dev cluster, and you're going to do the initial validation ring. You do your deployment into this dev environment, and once it passes and you make sure that the cluster add-on is validated in your dev cluster, then you move to the next cluster, which is the pre-prod environment.
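Going back to that pod policy for a moment, it can be sketched like this; it's close to the `secrets-not-from-env-vars` policy in the Kyverno policy library (the exact layout here is a sketch, not the slide's verbatim policy):

```yaml
# Kyverno policy sketch: block pods that inject Secrets via env vars.
# The =() anchors mean "if this field is present", and X() means
# "this field must not exist".
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: secrets-not-from-env-vars
spec:
  validationFailureAction: Enforce   # reject the pod instead of just warning
  rules:
    - name: secrets-not-from-env-vars
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Secrets must be mounted as volumes, not injected as environment variables."
        pattern:
          spec:
            containers:
              - name: "*"
                =(env):
                  - =(valueFrom):
                      X(secretKeyRef): "null"
```

With `Enforce`, the admission webhook rejects the pod at creation time, which is the real-time validation behavior described above.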
And here, basically, you want your pre-prod to match your production cluster as closely as possible. That way you can do the next phase of the deployment, and then you can validate at this level. And if everything is looking good in pre-prod, then you can move into production, and then you can do the testing. Well, you can't really do testing in production, but you can make sure that things are working as expected. Now, as a bonus, once you move into pre-prod or production, you can use a progressive delivery tool, such as Argo Rollouts or Flagger, just to make sure that things are rolling out smoothly, and you can do rollbacks if needed. And also, instead of just unplugging one thing and plugging in another, you can do it smoothly and progressively in these environments. OK, so next, yeah. So today, I'll be doing a demo of progressive delivery of these add-ons across different phases: dev, test, and production. Let's start with a basic cluster. We can call it a management cluster. What we're doing here is installing Argo. Argo is listening to a configuration repo for my management cluster, and as soon as it does that, it will install a few add-ons. These add-ons are Grafana, basically so I can see some dashboards where I can check the integration; Thanos, for storing my metrics long-term; and then Prometheus. I also have an NGINX ingress controller for accessing Grafana. And then Argo is going to manage a fleet of clusters. This is a hub-and-spoke model. You can go many ways around this. I chose this way for demo purposes, and it's a lot easier to explain. So basically, this Argo instance is managing my dev fleet, my pre-prod fleet, and my production fleet. Of course, when you do this in production, you might do this differently. You might have one Argo instance for dev, one Argo instance for pre-prod, and then another one for production. Or maybe you can have Argo running at each cluster separately.
It's really up to you. But again, for demo purposes, I'm using this. Then on my dev fleet, I have a few virtual clusters running different versions of Kubernetes. As you can see here, I have two clusters running version 1.28 and another two running 1.27. For my pre-prod fleet, I have two virtual clusters and two AKS clusters. And my production fleet is basically a mirror of my pre-production fleet. Then I have Kyverno with a policy that will enable or disable add-ons across this fleet, and I'll show in a second how that works. Each cluster will have a version of Prometheus and a version of Pod Info running. Basically, what we're going to do is increase the version of Prometheus and Pod Info across these fleets, dev, pre-production, and production, just to make sure that they're working as expected. Also something to note: since I'm using Prometheus and Pod Info as my add-ons, each add-on has its own Git repo, and each Git repo has a branch for each environment. That way I have a little more control over what I'm deploying to which cluster. So I might have, say, my dev clusters running Prometheus version 2, and then on pre-production, version 1, for example. By having this type of setup, I can take a look at how things are moving across each environment. And it'll make more sense when I show this during the demo. And last but not least, Prometheus is doing a remote write back to Thanos. Each cluster has Prometheus and is writing back to Thanos. That way I can aggregate all my data and have a nice dashboard in Grafana that shows me how things are integrating. OK, so for my demo, the first thing I would like to show: I have this Grafana dashboard in which I have my environments already defined. So you can see here dev, pre-prod, and production.
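As an aside, that per-cluster remote write is a small bit of configuration. Assuming the kube-prometheus-stack Helm chart, it's roughly this (the Thanos Receive service name, namespace, and cluster label are assumptions; 19291 is Thanos Receive's default remote-write port):

```yaml
# values.yaml sketch — remote write from each cluster's Prometheus
# back to a central Thanos Receive; names here are hypothetical.
prometheus:
  prometheusSpec:
    externalLabels:
      cluster: dev-eastus   # hypothetical label so Thanos/Grafana can tell clusters apart
    remoteWrite:
      - url: http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive
```

The external label is what lets a single Grafana dashboard filter the aggregated metrics by cluster and environment, as in the demo.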
And you can see here in dev, I have my four clusters. All of them are running Prometheus version 2.44, and they're also running Pod Info version 6.6. Now, these clusters are a little different. Well, three of them are version 1.28, and one of them is 1.27. And as you can see here, things are running smoothly. You can see that I have some metrics coming from my app for each cluster. You can track the average requests per cluster. So things are working fine. Then if I move into my pre-prod environment, I have two clusters that are already upgraded to version 2.44 of Prometheus, and the other two are running the older version, which is 2.42. They're running the same version of Pod Info. And again, these are also different versions of Kubernetes. The ones that have K3s are virtual clusters, and the ones that don't are my AKS clusters. For my demo, I actually prerecorded this part, because doing the progressive delivery takes some time and I didn't want to waste your time, so I recorded it and cut out the waiting periods. OK. So here, again, like I was explaining, we have our pre-prod, and right now, my Prometheus version for all of them is 2.42. Then I'm going to go into my repo that manages the Prometheus add-on, and I am in my pre-prod branch. I have a base kustomization, and then I have some overlays. So for my East cluster, I'm going to bump the Helm chart for that Prometheus instance. I'm replacing it with version 46.0, and then I'm doing the same thing for West, OK? Then I'm going to do a commit and push. Now, if you see, I have an Argo instance running, and it's going to pick up that change. So it's saying, hey, I found something in my West US 2 cluster. Let's sync it up. So it's syncing, and then, oh, the same thing for East. I'm going to use vcluster to connect to my East US cluster, and you can see here that I have my old Prometheus instance running that was deployed like two days ago.
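The overlay edit from a moment ago looks roughly like this (the chart name, repo URL, and directory layout are assumptions for illustration; Kustomize's built-in `helmCharts` field needs `--enable-helm`):

```yaml
# overlays/eastus/kustomization.yaml — sketch of the per-cluster
# overlay in the pre-prod branch; names and paths are hypothetical.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
helmCharts:
  - name: kube-prometheus-stack
    repo: https://prometheus-community.github.io/helm-charts
    releaseName: prometheus
    version: 46.0.0   # the bump: previously pinned to an older chart version
```

Committing that one-line version bump to the pre-prod branch is the whole "deployment": Argo notices the drift and syncs the affected cluster.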
And then you can see the new instance starting to be deployed, OK? If I go back to my Grafana chart, now you can see on pre-prod that I have version 2.44 deployed. And now, OK, well, that was the prerecorded part. So I'm going to go back to the live Grafana. Now you can see that they're deployed. I can filter by them. I'm going to just use East US and West US 2. And you can see now, just by looking at my metrics, that this application is working as I expected. Every 30 seconds, I'm getting a request, as expected, and my request duration is very, very low, as expected. So I can make sure, oh, sorry, I can validate that this integration between Prometheus and Pod Info is working. The last thing I wanted to show you: if I go into Argo and into Settings, you can see that I have my fleet of clusters defined, and each fleet of clusters has its own labels. These labels, I'm controlling them via Kyverno. Here you can see, as an example, that I have Prometheus enabled. So if I were to go back to my Kyverno policy and set this to false, then basically I'm shutting down that add-on, which is pretty powerful if you want to do things across large numbers of fleets. It looks like this. Let me go back. So this is my add-on validation repo. In this repo, I have everything that controls the management cluster and also the fleet of clusters. I have my Kyverno policies, one for AKS, one for vcluster. If I open the vcluster one, I have one policy per environment, so one for dev, one for pre-production. And then right here, you have the cluster labels. You can see that I have some add-ons disabled, such as OPA or cert-manager, but I have my Pod Info and Prometheus enabled. By doing that, these labels get injected into the cluster definitions in my Argo instance. And then if I go into Workloads and I open an ApplicationSet.
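A label-driven ApplicationSet like the one I'm opening is roughly this shape (the names, repo URL, and paths are made up for illustration; the cluster generator's selector is what picks up the Kyverno-managed labels):

```yaml
# ApplicationSet sketch: deploy the Prometheus add-on only to dev
# clusters whose labels mark it enabled. All names are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: dev-prometheus
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: dev
            prometheus: "true"   # label injected by the Kyverno policy
  template:
    metadata:
      name: 'prometheus-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://example.com/addons/prometheus.git   # hypothetical add-on repo
        targetRevision: dev          # the per-environment branch
        path: overlays/{{name}}
      destination:
        server: '{{server}}'
        namespace: monitoring
      syncPolicy:
        automated: {}
```

Flipping the label to `"false"` in the Kyverno policy removes the cluster from the generator's match, so Argo tears the add-on down, which is the enable/disable behavior shown in the demo.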
This ApplicationSet in Argo is the one that is installing the add-ons across all my dev clusters. I have this rule here. Let me see if I can find it. Basically, it's saying: OK, from all the labels that you have on those clusters, pick up the ones that are set to true and that correspond to Pod Info and Prometheus, and then deploy those add-ons into my fleet of clusters. By doing that, I have more control over whether I want to deploy these across my fleet of clusters. And once I do that, I can check at each stage, at each environment, between dev, test, and prod: if it works in dev, then I can progressively test in the pre-production environment, and if that works, then I can move on to the production environment. OK, so let's go back to my slides. So just to recap: cluster add-ons are super important, and they need to be validated to prevent issues. There are many strategies to validate these add-ons. The ones I just presented are just a few of them; there are a lot more out there that you can implement. Also, open source is awesome, because you get a lot of free tools that you can integrate into your workflows. And if you use the right tools, then you can secure your environments. And that will be it. Thank you so much.