Hi, everyone. Thank you so much for joining this session today. It really means a lot to me. I came here today to tell you a story: a story about policy-based governance in a multi-cluster Kubernetes environment. As the story unfolds, we will see how an organization transformed itself from a messy, unstandardized environment into a strictly regulated system in which nobody can do anything without big brother watching. Along the way, we will see how we took advantage of open source CNCF projects and leveraged them to take a huge Kubernetes environment, consisting of hundreds of clusters, and onboard it into a governance stack. My name is Mikhail Kotelnikov, and I am a cloud architect at Red Hat in Israel, working in the Professional Services group. In practice, that means I go to Red Hat customers and organizations that work with Red Hat and provide solutions that help them onboard Kubernetes and manage it properly. The story I'm going to share with you today is based on an engagement that I have been part of for the past two to three years, and am still part of. The main things I'm trying to share in this session are, first of all, how we were able to achieve this goal, and secondly, what the challenges, decisions, and risks are when you go through this process with an organization, or in your own environment. Because it's not a simple process to govern your environment and apply policy-based controls to it, and that is exactly the process we're going to cover in this session. OK. So is the mic OK? And is it working? Yeah? OK. So, in order to make this process easier to understand, we'll need to go two years into the past and see how the environment looked when we began working with it.
The organization we worked with had multiple clusters. The first one was based in Tel Aviv, and the main business was around deploying applications on Kubernetes instances closer to the edge. Each cluster had an application, or a set of applications, deployed on top of it, providing access as close as possible to the users consuming them. As the years went by, the organization made more money and started scaling out to other sites in Israel: other cities, other remote sites. You can see that the organization evolved across the whole country, and now we have clusters spread out across many different sites, each providing services to the users who consume them from that site. It really made sense at the time. But over time, clusters began changing from the blue dots you see here into red dots: unaligned, misconfigured clusters. In each environment you could find a certain misconfiguration; each cluster had a different aspect configured on top of it. One cluster did not have a resource quota for its namespaces. Another cluster had unsecured ingress instances. Another had unencrypted traffic going between pods. It was a complete disaster. Even though we tried to maintain the policies throughout the whole environment, each cluster was either misconfigured or somebody had changed something manually, and things were not as expected. At that point we decided we needed to take action and provide some kind of solution. In order to understand what we were trying to solve, we first need to understand what Kubernetes policy-based governance is about. So let's go through how policies should be managed in a Kubernetes instance.
The model we're going to go through was designed by the compliance SIG in Kubernetes, and it was provided as a kind of portfolio that you could implement in your environment and manage as part of your organization. The model states that, first of all, you deploy an administration point where policies are managed: a single point of management where the policies are maintained. There, the compliance officer and the platform team apply and configure the policies together. The compliance officer states the security control that must be addressed, and the platform team usually provides the technical aspects that can be implemented in order to achieve this control. Afterwards, the policy is propagated from the administration point into the Kubernetes cluster itself. On the Kubernetes cluster we have an enforcement point, which applies the policy and, in turn, allows the policy to actually persist on the cluster. As you can see, the policy is deployed here using a certain policy engine, but it also reports to a policy information point, meaning that each violation raised on the Kubernetes instance is forwarded to an information point, which is usually watched by some sort of SecOps team that looks at the violations. And many times the platform team, being in charge of the platform, remediates the violations that were raised by the Kubernetes cluster. If we take a look at the policy lifecycle itself and how it's managed in such organizations, we see that first the control is defined. For example, here we say that all access to the cluster has to be secured via TLS.
Afterwards, as a technical control, we define that all ingresses must be secured via TLS, and we state that all of our Kubernetes clusters should have this configuration. Then we need some mechanism that can take this technical control and enforce it, down to a level where we can see every misconfiguration or violation raised in every namespace in our environment. So we define the control itself, we take the technical implementation of the control, and afterwards we maintain this control as time goes by. This may sound simple at first, because nowadays the CNCF portfolio has a lot of tools that can do such things. We have Kyverno, and we have Gatekeeper, which you can deploy on top of your environment; you can see each policy and how it works, and you can basically track each violation on a single cluster. But when you move towards multi-cluster, some things may change and may need to be adapted. A core concept we need to address when we go multi-cluster is baselining, because you need to be able to control all of your clusters and make them look like soldiers: each cluster should wear the same uniform as the cluster standing right next to it. No cluster can be different from the others. If we move towards a position in which clusters drift from each other, we can no longer control the anomalies that show up in the environment. So if something happens and is being investigated by the team, we may find that some clusters have drifted from each other.
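To make the ingress-TLS control concrete, here is a minimal sketch of what such a technical control could look like as a Gatekeeper ConstraintTemplate plus Constraint. The talk doesn't show its actual policies, so the names and the `warn` enforcement action below are illustrative assumptions, not the organization's real rules:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8singresstls            # hypothetical name
spec:
  crd:
    spec:
      names:
        kind: K8sIngressTLS
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8singresstls

        # Raise a violation for any Ingress that has no spec.tls section
        violation[{"msg": msg}] {
          not input.review.object.spec.tls
          msg := "Ingress must be secured via TLS"
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sIngressTLS
metadata:
  name: ingress-must-have-tls
spec:
  enforcementAction: warn        # report instead of block; an assumed choice
  match:
    kinds:
      - apiGroups: ["networking.k8s.io"]
        kinds: ["Ingress"]
```

With `enforcementAction: warn` (or `dryrun`), existing misconfigured ingresses show up as audit violations rather than being rejected outright, which fits the "observe first, then enforce" rollout the talk describes.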
And it really makes investigating an issue harder. So if we take the policy-based governance model we saw before and look at how it needs to be adapted when we move to multi-cluster, we see that the administration point mainly remains the same: we still need a point where we can manage the policies and address them. The information point will probably remain the same as well: the SecOps team and the platform team still need a single point of view from which the policies and their status can be seen. The main thing that changes is the way we enforce the policies and how we validate them, because as scale grows, we need to see how these policies are maintained and distributed to each cluster. So when we started designing this multi-cluster governance stack, we had multiple requirements for the enforcement point. First of all, it had to be consistent. Meaning that if I'm deploying multiple Kubernetes clusters and I have certain policies that I need to spread across them, the policies have to be consistent across my clusters. If I change a certain aspect of a policy, the change has to be propagated to the enforcement point on each cluster: the policy engine on every cluster has to be updated as time goes by, and stay identical across clusters. Another requirement we had for this enforcement point was flavors. As our organization grew, we saw that clusters became different from each other because of different business needs. We had multiple clusters for different business units, and different business units require different policies.
So if we adapt certain clusters towards different policies, we have to be able to control which policies are deployed on which clusters: certain policies will be deployed on production clusters, other policies on dev clusters, and if we have different business units, we will need different policies for each business unit, based on their requirements from the organization. The next requirement is that our environment is ever-growing. If we are scaling to dozens and hundreds of clusters, the policies need to spread and scale across the clusters as the environment grows. If I had two clusters, then three, then four, then 100, the policies need to be automatically enrolled on each and every one of them. I cannot leave any cluster behind, because that would hurt my business. The last requirement is reports. I need each policy enforcement point to forward the violations raised by its policies to a centralized information point. If you look at the policy information point with the three clusters inside it, you'll see that if a violation is raised on one of the clusters, it has to be reflected in the governance stack that my SecOps team sees. Based on these requirements, my team and I came up with and designed a solution, and the fundamental basis of the solution was quite unique. I must take a note at this point and say that there are many options here: you can take a lot of approaches, and you can use enterprise products. But we decided to treat our policies the same way we treat our applications.
If you think about it, the things I described before are things we have already solved when dealing with applications. We are able to scale our applications automatically to clusters that are being deployed into our environment. We are able to monitor the applications as the environment grows. We are able to provide different flavors to our applications, and to maintain multiple baselines of an application for different environments. So we decided to take this approach and deal with policies the same way we deal with applications. The first thing you might think about when dealing with applications is: where do you store them? At this point we decided that we were going to store our policies in Git. We take the policy itself, the definition, the governance rule that we want to enforce, and place it inside Git, so the administration point is actually fed with policies from a Git instance. Next, we decided that we needed a Kubernetes-native platform for deploying these policies. There are multiple solutions nowadays that allow policies to be propagated on top of clusters, and managing a policy as an extension of the Kubernetes API makes it much easier and much more declarative for users and platform engineers to manage and evolve the policies as the environment grows. If I have another requirement for a certain policy, it is much easier for me, on the platform team, to create the policy as part of my daily routine if it's defined by a YAML that is deployed to Kubernetes. There are multiple products that do this; I chose Gatekeeper in this case, but you could go with Kyverno or any other engine you wish.
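As a sketch of how such a policies-in-Git repository might be laid out, the tree below uses a kustomize-style base/overlay split. The actual repo structure from the engagement isn't shown in the talk, so this layout and all names in it are hypothetical:

```text
policies-repo/
├── base/
│   ├── kustomization.yaml           # lists the shared policy manifests
│   ├── ingress-tls-template.yaml    # ConstraintTemplates shared by all clusters
│   └── managed-namespaces.yaml
└── overlays/
    ├── production/                  # "flavor A": stricter constraints
    │   ├── kustomization.yaml
    │   └── values.yaml
    └── dev/                         # "flavor B": more permissive constraints
        ├── kustomization.yaml
        └── values.yaml
```

The base holds the policy definitions once, and each overlay only carries the per-flavor deltas, which keeps the policies consistent while still allowing flavors.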
The next component we need to take care of is the GitOps-based deployment. It is much easier to deploy our policies the same way we deploy our applications, meaning that if a change is committed to Git, I want the change to be propagated to the clusters in my environment. So if a policy goes through an approval process by the SecOps team, the compliance officer, and the platform team, and is then merged into Git, it needs to be applied and distributed to all of the clusters that are affected by that Git repo. Basically, taking the GitOps approach allows me to scale my policies across the clusters in a much easier fashion. In this case we went with Argo CD, which fetches the policies from the Git repository and spreads them across the clusters as the environment grows: new clusters that are enrolled into Argo CD get the policies that are inside the Git repo. The next aspect we need to take care of is policy customization. Since we are working with YAMLs, and basically with Kubernetes CRDs as an extension of the Kubernetes API, we can use tools that modify YAMLs and customize them, based on templating or on overlays, depending on what the organization chooses. You could take a specific policy template and create different value files for each policy. For example, in this case I have provided one values file for the policy that's going to be deployed on flavor A, and a different values file for the policy that's going to be deployed on flavor B. The flavors can make the policies more restrictive or less restrictive, based on what I chose as, for example, the organization's compliance officer. The next challenge we need to take care of is policy observability itself,
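The Argo CD piece could be sketched with an ApplicationSet: a cluster generator creates one Application per enrolled cluster that carries a matching label, so a newly added cluster automatically receives the policy overlay. The repo URL, label, and paths below are assumptions for illustration, not taken from the talk:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: governance-policies
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            flavor: production        # hypothetical label set when enrolling a cluster
  template:
    metadata:
      name: 'policies-{{name}}'       # {{name}} / {{server}} come from the cluster generator
    spec:
      project: default
      source:
        repoURL: https://example.com/org/policies-repo.git   # placeholder URL
        targetRevision: main
        path: overlays/production
      destination:
        server: '{{server}}'
        namespace: gatekeeper-system
      syncPolicy:
        automated:
          prune: true                 # removing a policy from Git removes it from clusters
          selfHeal: true              # manual drift on a cluster is reverted automatically
```

`selfHeal` is what turns the clusters back into uniformed soldiers: if someone changes a policy manually on one cluster, Argo CD reconciles it back to what Git declares.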
that is, the way we take the policy and provide its status to the SecOps team or to the platform team. In this case, you can take the agent that deploys and manages the policy, scrape it, and export the policy results as metrics to your environment. You could use Alertmanager to forward your policy results to different kinds of ingestors, or you could forward the metrics into a centralized stack like Thanos or VictoriaMetrics and basically allow the user to see these metrics and act upon the violations as they are raised. As you can see from this stack, from the technical implementation we went for, we are able to manage the policy end to end using certain CNCF open source tools. Now let's take this methodology and paint the big picture using everything we went through in this presentation. One of the things we tried to tackle when we first onboarded the multi-cluster policy governance stack was a pretty bad culture of namespace creation across the environment. We saw that multiple clusters had namespaces that should not be present on those clusters. Some of the namespaces were created by malicious actors. Some were created to troubleshoot issues, and people just left them there and forgot about them. And some namespaces were simply not required on those clusters. At the first stage of our policy enrollment, we wanted to create a policy that shows whether we have namespaces that are not managed by my team or by the platform team, and raises a violation if a namespace should not be in the cluster. Basically, the whole idea was to maintain all of the namespaces spread across the entire environment from a single point, and raise a violation for any namespace that is not declared in Git.
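As an illustration of the observability side, a PrometheusRule like the one below could alert on Gatekeeper's audit metric. The rule name, threshold, and severity are assumptions, and the `cluster` label is presumed to be attached by the central scraping or federation layer (for example Thanos), not by Gatekeeper itself:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: governance-violations      # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: policy-violations
      rules:
        - alert: PolicyViolationDetected
          # gatekeeper_violations is the audit gauge exposed by Gatekeeper's controller
          expr: sum by (cluster) (gatekeeper_violations) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Cluster {{ $labels.cluster }} has active policy violations"
```

From here, Alertmanager can route the firing alert to whichever ingestor the SecOps team uses, giving them the single centralized information point described above.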
How we did it is in this repository, and you can take a look at it later, but the final result was something like this. First, our operator or platform engineer edits the values file in Git, which declares the namespaces that should exist in this environment, along with the additional configurations that should be present in them. In this example, we can see an allowed namespace, namespace one, with a resource quota applied to it, alongside default namespaces that I approve of being in this cluster. Afterwards, this values file, this template, is picked up by Argo CD, and Argo CD deploys it to the managed clusters. What gets deployed on the managed cluster is the namespace declared inside the values file, and alongside it we have another default namespace that is also managed by Argo CD. The policy itself, since it has a declarative view of the namespaces that are managed in the cluster, can see which namespaces should exist on this cluster and which should not. Based on its result, the policy can raise a violation to my monitoring stack and allow my platform team to see the result and understand whether it's critical, whether there actually is an issue in the environment. We prepared a small demo for this. My dashboard here shows me all of the clusters managed in this environment: we have three clusters managed by the governance stack, and I can see that I have one violation. Alongside this, I can see the name of the violating cluster, and the namespace name on the violating cluster itself. This whole stack is deployed automatically via Argo CD.
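A minimal sketch of what the values file described here might look like. The schema, the quota numbers, and the default-namespace list are hypothetical, since the talk only describes the file verbally:

```yaml
# values.yaml - hypothetical schema for the namespace-governance template
managedNamespaces:
  - name: namespace-one          # the allowed namespace from the example
    resourceQuota:               # quota applied alongside the namespace
      hard:
        requests.cpu: "4"
        requests.memory: 8Gi
        pods: "20"

# Namespaces approved to exist without being declared above;
# anything on the cluster outside these two lists raises a violation.
defaultNamespaces:
  - default
  - kube-system
  - kube-public
  - kube-node-lease
  - gatekeeper-system
  - argocd
```

Remediating a violation is then just a Git operation: moving the offending namespace name into `managedNamespaces` and letting Argo CD sync, which is exactly the flow shown at the end of the demo.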
So all of the applications that deploy the policy and the governance stack itself are managed via Argo, and we have a specific values file that propagates to these applications and creates the policies and the artifacts that are deployed on the edge cluster itself. If, for example, I run a command to add another cluster to my governance stack, to my Argo CD instance, I can see that it creates the cluster, and that the cluster is added automatically to the cluster list: you can see that I have cluster 4 over here as well. An application is created to enroll the cluster into the governance stack: it creates a monitoring agent, creates the policy itself, and distributes and propagates the policy based on this single action. And if we now create a namespace manually on this cluster, we can see that the namespace is being monitored by the policy stack, and it will raise another violation in my dashboard; it should take a few seconds. Let's give it two more seconds. Yep, you can see that cluster 4 has raised another violation. If I now add this namespace to Git, I can remediate the violation by turning it from an unmanaged namespace into a managed namespace in Git. So, we are finished for today, guys. Thank you so much for joining this session. Thank you.