told me that the key to a successful session is quoting Kelsey Hightower. "Usage matures products faster than hype does." That's definitely true. One more quote, just to be sure this session is a success: "You haven't mastered the tool until you understand when it should not be used." Last quote for now: "Smart people ask dumb questions." Luckily, Kelsey said nothing about dumb answers, so you are welcome to ask me questions later on Slack.

My name is Regina and I am a huge GitHub fan. I am also a cloud platform team lead at Bank of Olim. Ever since I was a little child, I have been the only female somewhere: the only girl in class, one of the few women lecturing in DevOps areas, and lately one of the few female open source contributors. This is a shout-out to you, ladies, to participate as well. My husband and I have two boys, which makes me outnumbered in my family as well.

This session is about a misconfiguration while working with Kustomize. How many people are watching this session to hear about a problem that happened to someone else? I will tell you what caused it, how we fixed it, what we learned from it, and how we improved our methodology as a result. I will be focusing on GitOps for cluster management, especially on managing the Argo CD config with GitOps. I will not be going into GitOps for business apps. What I will talk about today is from the perspective of a cluster admin and from the operations perspective, but the insights and the methodology are also relevant for managing business applications' configuration.

The background: we manage desired state with GitOps. We manage multiple OpenShift clusters. We have clusters with Argo CD that manage other clusters. We manage multiple tools and products per cluster, and multiple OpenShift operators per cluster. Some examples of the tools we manage are Prometheus, Grafana, Loki, VPA (the Vertical Pod Autoscaler), and Litmus.
Some clusters have Crossplane installed, some have Argo CD installed, and many, many more tools. We are interested in configuring the same product differently per cluster. We deal with lots of OpenShift-specific configuration. We upgrade those tools frequently. And many tools that we manage come as plain Kubernetes manifests.

Now you can understand that we manage a lot of stuff, a lot of tools and products. And when we talk about GitOps, the repo structure is key. So let's see what our initial repo structure was and how it evolved. You see that there are multiple Argo CD instances that each manage multiple clusters, many more than the two in this chart. At the beginning, we duplicated that architecture to our Git repos, meaning we had one repo for the configuration of the Argo CDs and of the clusters they live in, and each such control-plane cluster's desired state was in a separate branch. And we had another repo for the desired state of the clusters Argo CD manages, also with a branch per cluster. We had multiple repos with multiple branches and a lot of configuration duplication. When a new product needed to be added to our desired state, people didn't know where to commit it. This structure was a mess, and we decided to move to a mono repo, mono branch, in order to avoid the duplication and so that there is one place to look at to get the picture of the overall desired state. A branch per cluster or per environment is an anti-pattern. Don't do that. Never, ever. You can read more in Kostis Kapelonis's blog on using a single branch for all the clusters' configuration.

But you hopefully remember that we need to manage many products that come as plain manifests, and we are focused on configuring the same product on many clusters differently. One option was to do that with Helm. We all have heard of Helm and use it, but let's recap what Helm is. Helm is a Kubernetes package manager. It is also a CNCF-graduated project.
This means that it is considered stable, widely adopted, and production-ready. Helm automates software packages' life cycles, such as installing, upgrading, and removing those packages. It uses a template engine, and it enables sharing and reusing software packages that are called Helm charts. A chart consists of a templates directory, a values.yaml file with the parameters to be used with the templates, and a Chart.yaml file, which describes the chart. The templates are YAML files with Go template code and template functions.

Another option for configuring the same product on many clusters differently was to use Kustomize. I'm sure that many have heard about Kustomize and use it as well. There are people who love Kustomize and there are people who hate Kustomize. My husband asked me why "customize" is misspelled with a K. So let's quickly cover what Kustomize is and what problem it solves. It is a native Kubernetes configuration management tool, and this is why it is spelled with a K. There is a big community around Kustomize, and it is even bundled into kubectl. It uses a declarative approach to Kubernetes configuration. It enables updating configuration without forking, by traversing a Kubernetes manifest to add, remove, or update configuration. It uses layering over base settings to selectively override default settings without changing the original files, and it avoids configuration duplication. It enables writing shared configuration once and then overriding it per cluster or per environment. The common configuration is referred to as the base, and the overrides of this common configuration are referred to as overlays. Overlays are patches to base manifests, or additional manifests. The final Kubernetes manifests are produced upon running kustomize build. This command processes the base and overlay files, which are basically fragments of the configuration, and produces the final plain Kubernetes manifests.

Why did we choose Kustomize?
You remember that we manage the configuration of many products on many OpenShift clusters, and many OpenShift operators. We need to be able to upgrade those products frequently, so we do not fall behind with old versions. We also need to add OpenShift-specific configuration to many of those products. Manifest change in those products is externally driven and frequent, as opposed to manifests of self-developed business apps, where change is internally driven. We are focused on configuring the same tool on multiple clusters differently. We are not focused on versioned software artifacts in this case. We need flexible configuration without changing the original manifests, because if we changed the original manifests, upgrades of the product would get us into merge hell and would be slow and error-prone. So keeping the original product manifests separate and untouched is crucial.

Kustomize provides a clear separation between what is common and what is overriding. The original manifests of the product naturally fit into the Kustomize base, and our configuration changes naturally fit into Kustomize overlays. Wrapping those tools in a Helm chart seemed an overhead. If we chose Helm, we would need to write the Helm templates ourselves for each such product or configuration. Now, you remember that a Helm template is a combination of Kubernetes manifest fragments and Go template code. So when such a tool has a new version with changed manifests, the merge with our home-grown Helm chart would be time-consuming, because with Helm we wouldn't be able to avoid changing the original product manifests. The technology we use should always be working for us, and not the other way around. In the case of third-party products and tools that come as plain manifests, writing our own Helm chart for each such tool seemed like working for the technology, whereas using Kustomize looked like the technology working for us.
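To make the base-and-overlays idea concrete, here is a minimal sketch of a Kustomize layout; the directory and file names are illustrative, not our actual repo:

```yaml
# Illustrative layout:
#   base/kustomization.yaml           <- lists the untouched upstream manifests
#   base/deployment.yaml
#   overlays/prod/kustomization.yaml
#   overlays/prod/replicas-patch.yaml

# base/kustomization.yaml
resources:
  - deployment.yaml            # the original product manifest, unchanged
---
# overlays/prod/kustomization.yaml
resources:
  - ../../base                 # pull in the shared base
patches:
  - path: replicas-patch.yaml  # strategic merge patch on top of it
---
# overlays/prod/replicas-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # must match the name in the base
spec:
  replicas: 5                  # the only field this cluster overrides
```

Running `kustomize build overlays/prod` then prints the final plain manifests with the patch merged over the base, while the original base files stay untouched.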
We started with Argo CD long before there was a stable Helm chart for it. Migrating to the Helm chart didn't seem worth the effort until some point in time, but we're constantly looking at it and might migrate in the near future. I want to stress that we are only talking about how we manage plain manifests; tools and products that come as a Helm chart we manage with Helm, obviously.

Now that we have covered why we chose Kustomize, let's see how we managed our Argo CD configuration. We have Argo CD running on different clusters. The Argo CD configuration is self-managed by Argo CD. We managed the Argo CD config maps with Kustomize. There is a part of the Argo CD config map that is common to all the Argo CDs, and there are parts that are relevant only for specific Argo CDs and differ between them. The config map part that is common to all Argo CDs is in the base, and the parts that are specific to each Argo CD are in overlays. Here is a visualization of the repo structure: this is the shared part, and those are the specific parts that override the shared one using a patch. Control planes are clusters where Argo CD is installed.

This is the Argo CD config map part in the Kustomize base. We use OpenShift routes, and Argo CD needs to be instructed to ignore some route fields during diffing. The reason is that the route host field is automatically populated by OpenShift at runtime. But those details are not significant; what is important is that this ignore-differences section is responsible for the routes being displayed as synced in Argo CD. A spoiler: the resource customizations section, this one, is a multi-line YAML string. It is important for later.

This is the specific part. The Argo CD URL is unique to each Argo CD, and so is the Dex config. We instruct Kustomize to patch the base manifests with the contents of the overlay manifest using a strategic merge. There are also other patch strategies that can be used.
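As a rough reconstruction (the exact values and hostnames are illustrative), the base and overlay parts of the Argo CD config map looked something like this:

```yaml
# Kustomize base: the part shared by all Argo CDs
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  # Note: everything after the "|" is ONE multi-line string value
  resource.customizations: |
    route.openshift.io/Route:
      ignoreDifferences: |
        jsonPointers:
        - /spec/host
---
# Overlay: the per-Argo CD part, applied as a strategic merge patch
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  url: https://argocd.cluster-a.example.com   # unique per Argo CD
  dex.config: |                               # also unique per Argo CD
    connectors:
      - type: ldap
        id: ldap
        name: LDAP
```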
Now that you have all the background details, I will tell you what happened on November 13. November 13 is World Kindness Day, and what better day is there to break stuff in production and other environments? So, what happened on the 13th of November? We received a requirement to implement an Argo CD custom health check for Ingress. The requirement was relevant for a particular Argo CD, and so the implementation was to add a custom health check to the overlay Argo CD config map. You can see the resource customizations section was added to the overlay with a health section for Ingress. Look at the base Argo CD config map again. Those who are experienced with Kustomize can probably spot the problem. The others will have to stay in suspense for a little while.

These are the tests I did before merging. I ran kustomize build and verified that the word Ingress was in the result. I performed a kubectl dry run and an Argo CD dry run, and I performed a run on a test cluster. All those tests were successful, and so I confidently clicked merge. After the merge, I verified that the Ingress resource now looked healthy, and so the task was done.

This is what happened five minutes later. We had a gazillion tickets opened. The business apps' CD was failing. Most of the apps in Argo CD were out of sync. All the route resources in Argo CD were out of sync. And, remember, we now have a mono repo. So the first thing we do when we troubleshoot a problem is look at the commit history. I looked at the commit history, and surprisingly, the last commit was mine. Fixing the problem with a mono repo is as easy as reverting the commit. Now we will see what the problem was, what impact it had, and what I learned from it.

This slide is for those who are in suspense till now. On the left is the expected final Argo CD manifest, and on the right is the actual final Argo CD manifest. As you can see, the route ignore-differences section is gone. But I didn't delete this section.
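Roughly, the expected and actual manifests on that slide can be reconstructed like this (a sketch; the health check body is the standard Lua example):

```yaml
# Expected final config map: both the Route settings and the new Ingress check
data:
  resource.customizations: |
    route.openshift.io/Route:
      ignoreDifferences: |
        jsonPointers:
        - /spec/host
    extensions/Ingress:
      health.lua: |
        hs = {}
        hs.status = "Healthy"
        return hs
---
# Actual final config map: the whole multi-line string from the base was
# replaced by the overlay's string, so the Route section is simply gone
data:
  resource.customizations: |
    extensions/Ingress:
      health.lua: |
        hs = {}
        hs.status = "Healthy"
        return hs
```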
So what happened? The problem was the multi-line YAML, which got entirely replaced by patching the base with what was in the overlay. Let's see what impact this problem had. The actual application deployment was successful, but the Argo CD status was out of sync. Now, the CD pipeline commits the manifests and waits for the Argo CD app to become synced, meaning this is a synchronous operation. The synced state was never reached, and so the status of the CD pipelines was failed.

The first lesson learned was about overlaying file contents. I'm talking about having a part of a file in the base and a part of the file in overlays, as opposed to having a copy of the whole file in overlays. It requires using a patch, which is sometimes non-trivial. The common use case for that is the deployment image overlay. The question is whether other use cases are a bad practice or a best practice. The pros are that it eliminates duplication, and it eliminates the "changed in one place but forgot to change in another place" problem. The con is that the impact of an overlay change is not obvious. So the conclusion is: use it with care.

Lesson learned number two: patching multi-line YAML. A multi-line value can look like a YAML structure itself. But the multi-line value is a string; it is not a map. And so the so-called keys in the multi-line value are not real keys, and patching such a key will result in the entire multi-line value being replaced. So the conclusion is: if multi-line values have to be patched, consider getting the file out of the base and keeping it in overlays in its entirety, or use Helm.

Lesson learned number three: understanding the impact of a change. Our goal now was to understand how a change in an overlay impacts the overall configuration. When I ran the kubectl dry run and the Argo CD dry run, I made sure that my change was valid from the Kubernetes and Argo CD point of view, but this was not enough.
The missing part of the puzzle is the diff between the final new and the final existing configuration. This is not a diff of the Kustomize file content that I changed; this is a diff of the final Kubernetes manifests. I will explain how we added the missing part soon. Had I seen this diff in addition to the diff of the file I changed, the problem would probably not have happened, and I would have been telling you another story today, or just going to the beach instead.

Lesson learned number four: Git repo protection. We have a mono repo. The GitOps desired state of all the clusters we manage is in this repo, and this includes monitoring tools, cluster configuration, storage operators, and many, many more. The business apps' GitOps desired state is managed separately, and it is out of scope for this session. The repo is mostly Kustomize-based. A change can affect a production cluster. A change can affect multiple clusters at once. Having a mono repo is very convenient for cluster configuration, because it enables us to use tools like Kustomize to efficiently manage what is common and what is different across multiple clusters. Changes can be quickly introduced to multiple clusters at once, and operations can be lightning fast. However, breaking multiple clusters at once can be lightning fast as well. With great power comes great responsibility, so this repo must be protected well.

So we realized that we needed merge request approvals. Merge request is GitLab terminology, because this is what we use; it is basically very similar or identical to a pull request on GitHub. We also realized that we needed automated validation and pre-testing of configuration changes. Now, merge request approvals and automatic tests sound familiar, right? This is what people do for business application development and CI. Well, managing Kubernetes infrastructure and operations has to follow the same path, maybe even more strictly.
Especially in a world where Kubernetes is a control plane for everything, because then you end up managing not only the cluster itself but many other remote resources as well. The same concepts of working with GitLab or GitHub (issues, merge requests, branch methodology, CI tests) apply here as well. Just like our code needs to be tested properly and sometimes reviewed, our configuration also needs to be tested thoroughly and sometimes reviewed.

This is the last quote of Kelsey Hightower, I promise: "Automation is the serialization of understanding." So let's take a look at how we serialized the insights from that incident with regard to Git repo protection, combined with understanding the effect of a change in Kustomize configuration. Here are the steps in our pipeline: a kubectl dry run and an Argo CD dry run. Changes in the base now require an approval before merge. Changes in a control-plane cluster overlay also require an approval before merge. Changes in critical cluster overlays require an approval before merge as well. Other changes are auto-approved, as they are considered non-critical.

Regarding the merge request approvals: they will never replace automatic tests, but having a review from a colleague is a great way to spot problems otherwise overlooked. Unfortunately, waiting for a review is also a great way to slow down or even block operations. So we chose something in the middle. Wide-scope changes, like touching the Kustomize base, and changes to critical clusters require a second pair of eyes, while changes on a playground cluster can happen without waiting for anyone and can be just merged.

This is an example of a critical cluster change merge request. You can see the comment that was automatically added by the merge request pipeline, saying that a critical cluster overlay was changed and that a human approval is required. This is the comment. This is an example of a Kustomize base change merge request.
You can see the comment that was automatically added by the merge request pipeline, saying that the base was changed and that a human approval is required. However, this merge request has no wide-scope changes or changes to a critical cluster, and so it was automatically approved by the merge request pipeline. It can now be merged by the author of that merge request.

But wait, what about the impact of a change in Kustomize? This pipeline is not complete without handling this impact. Do you remember the kustomize build I mentioned earlier? kustomize build produces the final manifests, and the idea was to review the result of kustomize build using the changes section of the merge request. So the solution was to commit the kustomize build result file along with the changes themselves, and the merge request pipeline requires the up-to-date kustomize build result file to be committed along with the changes. This is an example of a failed merge request pipeline. Now look at the comment on the merge request that was created automatically: "You have changed Kustomize configuration without committing the kustomize build result. The impact of your change is not determinable."

This is a visualization of the implementation. This is a familiar merge request changes view. You can see the small green part that was added, and then the huge red part which got overwritten. And now the impact is clear: a change similar to what I performed on November 13 would not have been merged. The aftermath of a change is now visible and easily comparable using the standard merge request diff mechanism. Basically, now, after all the tests in the merge request pipeline have passed, I can also see what exactly I am about to change, and I can make a decision whether this is what was intended or a mistake.
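The pipeline steps described above could be sketched in GitLab CI roughly like this; the job names, file paths, and directory globs are hypothetical simplifications, not our actual pipeline:

```yaml
# .gitlab-ci.yml -- a simplified sketch of the idea
stages:
  - validate

# Fail the pipeline if the committed kustomize build result is stale,
# so the final-manifest diff in the merge request is always up to date
kustomize-result-up-to-date:
  stage: validate
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  script:
    - kustomize build overlays/cluster-a > /tmp/result.yaml
    - diff -u build/cluster-a.yaml /tmp/result.yaml

# Validate the final manifests against the API server without applying them
kubectl-dry-run:
  stage: validate
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  script:
    - kubectl apply --dry-run=server -f build/cluster-a.yaml

# Gate wide-scope changes (base, control-plane overlays) behind a human step
require-human-approval:
  stage: validate
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
      changes:
        - base/**/*
        - overlays/control-plane/**/*
      when: manual
      allow_failure: false   # the pipeline blocks until a human runs this job
  script:
    - echo "approved by a human"
```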
And then, depending on the scope of my change, whether it affects multiple clusters, or a critical cluster, or just a playground cluster nobody cares about, I will either be able to merge immediately or be able to merge only after an approving review from a colleague. And so the overall repo protection is improved.

What about Helm? This template engine and package manager constructs Kubernetes manifests from fragments of configuration: template code and values files. Understanding the impact of a change in any of those can be equally challenging and is equally important, and it can follow the same methodology I have described. You can read more in Kostis Kapelonis's blog; there are additional aspects covered in that article.

To those who survived this far: understanding the impact of a configuration change is important. Configuration testing is just as important as code testing. Merge request pipelines for configuration are just as relevant as merge request pipelines for code. Choosing the right tool for the job is important; the technology must work for us and not vice versa. And don't be afraid of making mistakes.

I want to dedicate this slide to my kids. They attended my session when I gave this talk live on Q-Day. They are 9 and 11 years old, and they do not know what Kubernetes or GitOps is yet. So I added QR codes of jokes, riddles, and Dr. Seuss wisdom to some of the slides of my live session, just for them. We are hiring, so if you want to come and break more stuff with us, learn from it, and then improve what we're doing, please contact me on LinkedIn or on the CNCF Slack workspace. Thank you so much.