Hello, and welcome to this talk, One Million Lines of YAML: Wrangling Kubernetes Configuration for Hundreds of Teams. My name is Katrina Verey, and I work for Shopify. Specifically, I'm a senior staff software engineer working in infrastructure engineering at Shopify, and I've been using Kubernetes since Shopify first started experimenting with it back in 2016. More recently, I've had the honor to co-lead SIG CLI and its Kustomize and KRM Functions sub-projects. The reason you might be interested in my opinion on config management, besides the fact that I work in that area with SIG CLI, is that I worked on Shopify's production platform team since its early days, and I was responsible for both the doing and the redoing of a bunch of the decisions we made related to config management.

This talk is going to be a story about how Shopify has managed an enormous amount of Kubernetes YAML, first in a way that didn't work super well, and then in a way that worked much better, both for our platform team and for the developers that we serve. But the story itself isn't really the main point. All orgs at this scale are different, and I'm not going to try to tell you to do exactly what we did. At the end, I will introduce an open-source toolkit that you can use to take a similar approach if you'd like to. But I'm not going to go through the vast array of tooling options out there; there are a lot of talks that already do that. My main hope is that you'll leave this talk understanding why some configuration management systems will work out better than others and cause you less pain. It's still going to be config management; it's not going to be no pain. But I also hope that you'll take away some principles for designing your own systems.

Before we dive in, though, I want to share why I think this problem is more interesting than it might sound, and hopefully convince you that it's not just a big headache you have to deal with. Yes, managing large volumes of data in a whitespace-sensitive language is a pain, but let's take a moment to appreciate what we're really doing here. We're not talking about managing arbitrary YAML documents. We're talking about managing YAML documents that contain Kubernetes objects, and those objects are really, really important. Because Kubernetes is API-centric, the declarative data we're talking about is really at its core. This data in our YAML files is the means through which we express the state we want our systems to have, so that the various Kubernetes components can make it happen. That's a really smart and extensible way to build a system, and it's a key driver of Kubernetes' success. Managing this desired state is an important part of what it takes to use Kubernetes successfully in a large organization, and that's especially the case if you're doing that config management on behalf of, or in collaboration with, a diverse set of users. As we'll see, it requires more than just choosing a definition format or a packaging format. Kubernetes configuration management is really a journey, from the idea in someone's head about what they want their production system to look like, through that state being submitted to the API server. And it's important to carefully consider what should happen at each step in that journey and, just as importantly, what can go wrong. So with all that in mind, let's dive into the story.
What to do with this data is one of the first questions a platform team might be faced with when starting their platform, assuming they want to follow the golden path of storing and declaratively applying their Kubernetes config. This is not at all the hardest part of the problem you're facing at this stage of your platform building, so many orgs reach for an easy option to get it out of the way. At Shopify, we initially gave developers some checkboxes that allowed them to select what kinds of components they needed to run as part of their deployment. So they could select things like web, jobs, MySQL, Redis, all those kinds of options. And what we did was dump a set of YAML files into their repositories right alongside their application code, where they could easily manage and review it.

But what we gave them wasn't actually just plain YAML, for one particular reason: some bits of our configuration needed to vary on a per-commit basis. Shopify had long been doing continuous deployment, with the Git revision being used for versioning. So we wanted to use the revision for the image tag, and some apps also expected an environment variable with that data. And because the revision identifies a commit in the very repository we were storing the config in, we obviously couldn't commit it to that repository. To handle this, we reached for the familiar tool of templating. Specifically, we decided to use ERB, which is a popular templating language included in the Ruby standard library, and it allows you to embed actual Ruby into arbitrary documents. Since our deploy tool was written in Ruby, and most of these files were being distributed to Rails developers, this seemed like a natural fit and an easy default choice.

In hindsight, though, this was a big mistake, and I am still haunted by a conversation I had with my lead at the time. It went something like this: they saw what I was thinking about doing, and they questioned the sanity of allowing a Turing-complete language in this context. But naively, I thought it would be more empowering than dangerous, and went with it anyway. Now, as a Kustomize maintainer, you might be expecting me to claim that this all went wrong because templating is bad and everyone should just use Kustomize, end of talk. But I don't actually think that, and templating was only part of the problem. My boss was super right about the problem of exposing a Turing-complete language here. People did all sorts of stuff with it. Stuff I never imagined. They created complex, unintentionally designed abstractions within their config. They did complex math to calculate relative values for fields like replica counts and resources. And they even made HTTP calls, fetching data to serialize into their templates, and that added a lot of risk to their releases. All this and more was perfectly acceptable under the ERB system, and it really did cause misconfiguration and outages.

But templating itself was also part of the problem. Templating is a really handy implementation tool, but it makes a poor interface, especially if you're exposing the templates themselves like we were. The documents we were distributing weren't really valid YAML in their unrendered form, so we had to resort to string scanning to update them. And this was, at best, annoying for both us and our customers. At worst, it was extremely error-prone.
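To make that concrete, here's a hypothetical sketch of the kind of ERB-templated file this produced; the registry path and the `current_sha` helper are illustrative, not the actual Shopify internals. Notice that in its unrendered form, this isn't valid YAML that tooling can safely parse and edit:

```erb
# Hypothetical sketch of an ERB-templated deployment (illustrative names)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-web
spec:
  template:
    spec:
      containers:
        - name: web
          # the Git revision must be injected at render time, because it
          # can't be known when this file is committed to the same repo
          image: "registry.example.com/my-app:<%= current_sha %>"
          env:
            - name: REVISION
              value: "<%= current_sha %>"
```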
The other issue is that there was no interface between our team and the developers. In practice, the final deployed configuration was a collaboration between the defaults we were maintaining and the customizations they wanted to make for their app. But to see which was which, you had to dive through Git, which is not ideal. This was also bad UX: even for the most common changes that a lot of developers wanted to make, they had to wade through the whole resource to figure out how to do it.

All that is bad enough, but the ERB part and its consequences weren't actually the worst part. The worst part was that we were rendering this ERB during the deploy. In time, we ended up with critical apps that had a lot of complexity in their templates, and with that complexity comes added risk. Even straightforward changes can be surprisingly difficult to review, and seemingly small mistakes can have dire consequences. This here is a real snippet from a PR that caused an outage. The mistake is on the slide. I'll give you a couple of seconds to take a look and see if you can spot it. If you spotted that the for-loop terminator is misplaced in relation to the document separator, that's the issue. What should have been multiple documents was actually one, the resulting duplicate keys were not caught, and the parser selected among them in an undefined order. That's just how the YAML parser works. In the end, what happened was an unintended deletion of multiple critical resources. Not good.
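The actual snippet lived on the slide, but here's a hypothetical reconstruction of that class of bug, with illustrative resources. When the loop terminator lands after the loop body but before the document separator, every iteration is emitted into a single YAML document with duplicate keys:

```erb
# Buggy (hypothetical reconstruction): all iterations land in one
# document, and the duplicate keys silently collide
<% services.each do |service| %>
apiVersion: v1
kind: Service
metadata:
  name: <%= service %>
<% end %>
---

# Intended: the separator inside the loop yields one document per Service
<% services.each do |service| %>
apiVersion: v1
kind: Service
metadata:
  name: <%= service %>
---
<% end %>
```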
The other big problem with deploy-time rendering is that, at least in our system, a rollback is just a redeploy of a previous revision, and redeploying meant re-rendering. This is another example from a real incident. In this case, an environment variable got misformatted due to a bug introduced into the rendering pipeline itself. And since that bug affected rendering, when the app's owners spotted the problem and tried to roll back to the known-good revision, it didn't work. There was considerable confusion trying to identify what the problem was, and what could have been an extremely brief service disruption was prolonged by this issue.

At this point, we had a bit of a laundry list of things we did not want to have in our new system, or, to frame it more positively, a wish list of properties we did want it to have. Let's summarize them. First of all, the new system needed a focus on the consequences for release safety. It should make all changes reviewable and all releases repeatable, and moreover, it should make it impossible for end users to compromise those properties. We also needed the new system to be maintainable through automation. We compiled a list of changes that we had needed to roll out in the past, and we used the ease of making similar changes as a criterion for assessing our ideas for the new system. We also knew that we wanted formal versioning to be part of the system, so that we could evolve it over time, and if we wanted to make a major change again, it would be so much easier next time. And finally, we wanted to make sure that we designed for the needs of all the users we really had. These ranged from folks who were building simple apps that would only need a couple of straightforward changes in their lifetime, right through to folks running key services that had a considerable amount of unique and very necessary customization. We didn't guess about what these needs were. We actually went out there and looked, programmatically analyzing all of the applications we had at the time we were doing this project. Based on that, we knew we had to provide basically three buckets of features. One, we needed something that just worked without modification when people were getting started. Two, we needed to make it easy to author a specific set of changes that we had discovered were very common. And three, we did need to make it possible to do something more complex when it was necessary.

So what did we build? Here's a high-level view of the new system; we'll go through each piece. Starting from the left, we have a config CLI. What we did was build a command into the existing developer tool our devs were using every day. This command would interactively walk users through all the most common workflows involved in producing and modifying their production configuration. That tool is really just helping them author a particular file that contains a Kubernetes-style resource we called Runtime Manifest. We would then use that Kubernetes-style resource to produce a single-file review artifact that we called the lock file. That file contains all of the real resources that will actually be deployed, and we committed it right in the repo alongside the config artifact. Next, we had a release artifact that gets uploaded at CI time to a standard, highly available location that the deploy tool could pull from. As it happens, we used Google Cloud Storage buckets for that.

So with this overview in mind, let's take a closer look at what each piece looks like. We created the config CLI in service of our goal of making it easy for developers to get started with the new system and ensuring that commonly needed changes would be really simple to make. The config API itself does contribute to these goals, of course, but we decided that wasn't good enough and we wanted to take it further. By embedding a sub-command in the CLI that most developers were using all day already, we were able to give them a familiar interactive workflow that could provide better guidance than any doc. Next, we have the config API itself. It simplifies the user experience by explicitly encoding the main application properties that developers need to control; these go in the runtime info field. This resource is also the main contributor to the automation principle, thanks to its versioned schema. A focus for us in its design was the clear separation of the base best practices our team was owning from the modifications the app owners were making, and that's most visible in the components section we can see on this slide. We did this in a very Kustomize-like style: it's oriented around the concept of having bases that define resources and then modifying them with patches, which are essentially partial resources that only include the fields you want to change. We stored all the bases we wanted to control, as well as some standard patches, in a centralized repo where we implemented strict versioning patterns and immutability controls. This did work well, but in hindsight, I'd say we should have taken the user experience further than this and hidden the default bases and patches behind fields, or even APIs of their own, instead of exposing their files right in the manifest like we did here.
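The slide itself isn't reproduced here, but a manifest in that style might look something like this hypothetical sketch; the group, version, and field names are illustrative, not Shopify's actual schema:

```yaml
# Hypothetical sketch of a Runtime Manifest-style config API
apiVersion: runtime.platform.example.com/v1  # illustrative versioned API
kind: RuntimeManifest
metadata:
  name: my-app
runtimeInfo:
  tier: critical   # the key app properties developers control
components:
  # platform-owned, versioned bases and standard patches from the central repo
  - base: components/web@v2
    patches:
      - patches/autoscaling@v1
  - base: components/jobs@v1
    patches:
      # app-owned local path: full Kubernetes API power for advanced users
      - ./config/patches/jobs-resources.yaml
```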
For example, we could have made a field for the web and jobs types we were defining in this example, or we could have formalized a notion of component kind in general. In any case, as it was, the CLI tool was aware of the collection of default bases and patches, and it was often the one writing all of this anyway. On the flip side, though, notice at the end that we have an example of local paths. It was important to us to continue to allow advanced users to leverage the full power of the Kubernetes APIs when they needed it, and this is how we did it. This customization system also allowed us to do a no-op migration in the vast majority of cases, despite the complexity explosion the old system had enabled.

Next comes the review artifact. This one is just a long file full of standard resources, so it's not super interesting to look at, but it's very important. Having this committed alongside the high-level config artifact made sure that subtle configuration mistakes, which are less likely but still possible in the new system, can be caught during a regular code review. This gives us a very high degree of confidence that what we're going to deploy is exactly what we intended. The review artifact also helps developers understand the changes we're making to their apps, if they're interested, because the full details end up exposed in it. For instance, if we're making a big bump to the version of the web component they're using, it could look something like this. If that's all the app owner cares about, they can look at just that part and move on with their day. But if they're interested, or if something goes wrong and they become interested, then the review artifact always shows exactly what changed, in a much more detailed diff.

Last but not least, we have the release artifact. This part definitively ensures that all our deploys and rollbacks will be repeatable. It also just looks like a long list of resources, back to back in a single file. So why do we need this in addition to the review artifact? Well, it comes back to the fact that we're committing to the app repositories again. We still need that revision data in there, and we still can't commit it, because it's coming from that repo. And now we also wanted to have a completely final, immutable artifact by CI time. So how did we do this exactly? To do it with a high degree of confidence, we used a series of transformations based on explicit metadata. For example, the app developer might give us an image in their manifest. When we generate the review artifact from it, the tool goes through, finds all of the containers that have that image, and adds a special annotation to those resources, identifying those containers. And finally, when we generate the release artifact at CI time, we make the transformation to those containers, tagging them, in a surgical and structured way. That tool is not capable of making arbitrary changes, certainly nothing user-directed. It can only do just that.
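Here's a hypothetical sketch of that metadata-driven flow; the annotation key and registry path are illustrative:

```yaml
# Review artifact (committed): the generator marks which containers
# use the app image, via a hypothetical annotation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-web
  annotations:
    platform.example.com/app-image-containers: "web"
spec:
  template:
    spec:
      containers:
        - name: web
          image: registry.example.com/my-app  # still untagged here

# Release artifact (generated at CI time): the tool's only capability is
# pinning the tag on exactly the containers the annotation identifies:
#   image: registry.example.com/my-app:<git-revision>
```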
So that's what the new system looks like, and it has been working really well for us. But like I said at the beginning, I'm not trying to convince you to do the exact same thing. Even if you stick to a really similar design, there are some really obvious substitutions. Instead of a CLI to help you author config, you might have a UI or an editor plugin. Instead of having a Kubernetes-style client-side custom resource, you could use something else that's more off-the-shelf. Depending on your risk tolerance, you might have an audit system instead of a review artifact, or perhaps something like a review bot that posts its diffs under certain circumstances. Instead of a file in GCS, you can obviously choose another highly available data store that you're comfortable having your releases depend on. Or you might do pure GitOps and commit to a separate repo.

And speaking of GitOps, this setup that I'm describing is not too far off the one Kustomize recommends. This slide is from a presentation that I gave at KubeCon North America 2021 with Jeff Regan, one of the original maintainers of Kustomize. You can see that it describes a two-repo setup: one where you have the Kustomization resource, so that's the config API, and another where you have the fully inflated deployable resources. If we translate this to the diagram style I've been using so far, we can see that the first segments look just about the same. We have that Kubernetes-style API, called Kustomization. We have a CLI with a sub-command that helps you author that object. And then, in the recommended setup, you commit that object to a repo. The first difference is that when you have the tooling generate and commit the inflated, or "wet," configuration, you put it in a different repo, but you're still putting both of these artifacts under full human review. The second difference is that since you are putting it somewhere else, you can have a single artifact for both review and release, because it can have that revision data in it to begin with.
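As a rough illustration of that recommended setup, the first repo holds a Kustomization along these lines; the remote base URL, patch path, and tag value are hypothetical, and committing the output of `kustomize build` to the second repo is what produces the deployable artifact:

```yaml
# kustomization.yaml in the config repo (illustrative values)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # a versioned, platform-owned remote base
  - github.com/example-org/platform-bases//web?ref=v2.0.0
patches:
  # app-owned partial resource containing only the fields to change
  - path: patches/web-resources.yaml
images:
  # the revision can be pinned here, because this isn't the app repo
  - name: registry.example.com/my-app
    newTag: 2b1f4c8
```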
So that's one system that works well to fulfill the goals I set out earlier, along with some ideas for variations on it. But why does this work well? What parts of it are essential versus nice to have? To dig into this, let's distill some principles.

The first principle is about the user experience of the config format itself. This is a big part of how your users experience your platform, so it's actually really important. The specific tooling you choose here, and the format you use, will depend on your organization's preferences, its policies, and its culture. But the universal thing is to talk to your users and find out what intents they need to be able to express. Maybe their intent is close to the Kubernetes APIs themselves, but maybe it's not, or maybe the answer depends on the level of maturity of the project. Inventory the cases that you have, and for each case, make sure that you have a way for the user to easily and declaratively tell you, "I want this." Make sure that you can tell the intent they're expressing apart from the defaults you're providing that they're not really trying to control; they're just accepting them.

The second principle is about making sure you can automate changes to any content you're distributing widely. We ended up going with a schematized, versioned, Kubernetes-style config artifact, but depending on your use case, you might not need something so bespoke or formalized. No matter what you do, though, make sure that you can find everything you're distributing, that it's in a predictable location, that you can read it, parse it, and process it, and that you can dump it back to the repo with a level of confidence that doesn't require your team to go take a look at what you did. This is what makes the choice maintainable at scale, and, just as importantly, it makes it much easier to swap out the solution for a different one as your platform evolves.

The next principle is to make consequential changes visible. Our system arguably goes overboard on this by putting every single change at the deployed API level in front of the developers for review, whether or not they even customized anything. We're striking a balance here, really, between exposing yourself and your users to a diff that might not be meaningful to them, and, on the flip side, making sure that when a change does have a mistake in it, somebody is able to see that as soon as possible in the process. The bare minimum, I think, would be to have a system for easily auditing the diff between release artifacts. With something like that in place, you could also consider a sort of middle ground, where apps that don't have any customizations can have their release artifacts generated straight from the higher-level abstraction, but as soon as they start making customizations, they become required to commit the full artifact for review.

If I had to pick one principle to follow at the expense of the rest, it would be this one: snapshot exactly what you release, and do it as early in the process as you can. If you're doing pure GitOps and committing it, that's great. If you can't, then I recommend doing it as part of CI, like we did; this works very well. If you absolutely can't avoid doing it live during the deploy, make sure that you snapshot the result and store it somewhere. Make sure that place is easily accessible and that you have a plan for quickly viewing the diff associated with a given release when something goes wrong. And make sure that your rollbacks will reuse that state, so that you don't have the risk associated with regeneration in such a critical piece of the disaster recovery process.

The final principle our experience suggests is modularity. This one wasn't actually explicitly part of our original goals, and I haven't talked about it yet, but having it really paid off. It allowed us to extend the benefits of the new system to as many applications as possible, including some that weren't originally in scope, by making it possible to opt in at each stage of the process. For each of the pieces of our design, there was some exception among thousands of applications, or among our own system services, that couldn't use it. Some apps have such a non-standard shape that our CLI really isn't much help to them, because they're not really using the standard bases and patches. But they could still declare the key properties of their application in Runtime Manifest, and use them with their custom bases and patches. For some advanced apps, the safeguards we built into Runtime Manifest were too restrictive, and they couldn't use it at all. So, OK, they did whatever worked for them, and they committed a lock file to the location where we expected it; our tooling would just pick it up from there. In a few really extreme cases, mostly those system services we owned ourselves, we had a genuine and difficult-to-avoid need for deploy-time differentiation that was really tough to design out, so we couldn't even use the lock file. Even for those, though, we could still generate the release artifacts and reuse them for rollbacks.
And we added another risk mitigation in some of those cases, where we would commit test fixtures built from sample inputs and subject those fixtures to code review, to help surface problems as early as possible. As you can see, the boundaries and conventions we included in our system helped make all of these implementations possible.

Now, if you've been thinking, wow, that's way too much work, didn't she say this is not necessarily the most important thing we have to do when we're starting our platform? I hear you. And if you have to choose, here's what I would call the minimum viable setup. Choose a config API that you're going to be able to manipulate programmatically and replace in the future. Then generate an artifact from that config for each release you do; make it immutable, store it somewhere, and drive your releases off of it. If you can go a bit further, generate it as early in the process as you can manage, at either CI or commit time. If you can go a bit further than that, invest more in user experience: make sure your config API lets your users express the intent that they have. If one of the many existing tools fits your use case, that's great. But if not, the final segment of this talk is for you. We're going to take a look at an example that shows how to use tooling from Kustomize's kyaml library to build your own client-side Kubernetes-style config API. I'm going to go through this very quickly, but the key pieces I'm showing here are part of an end-to-end example that I committed to the Kustomize repo, and I'll give you a link at the end so that you can dive in deeper yourself.

But hold on a minute. KRM Function Framework, what is that? Let's break down that definition. First of all, KRM stands for Kubernetes Resource Model. There are a lot of conventions around this, but for our purposes, to follow the KRM, we can say that you need to include the standard metadata that's used for identification purposes. Specifically, you need a top-level apiVersion field, a top-level kind field, and a metadata.name field, optionally also a namespace field. So far in the presentation, I've just been calling these Kubernetes-style resources. Now, a KRM function is a piece of code that modifies a set of KRM resources. That function might edit the resources, it might add to the set, it might remove from the set, and what exactly it does depends on some desired state that's declared in a KRM object, so another resource. Building on this, the KRM Functions Specification is a standard for designing KRM functions that will work together in a config management pipeline like the one Kustomize uses. And then finally, we get to the term we wanted to define: the KRM Function Framework, which is a toolkit for writing functions in Go that follow that specification.

That was still pretty dense, so I'll attempt to synthesize it a bit more. The Functions Framework is a toolkit for writing Go code that generates or modifies local Kubernetes configuration based on some state that is declared in a local Kubernetes-style config API. So it's a toolkit for building a resource like the Runtime Manifest config API I was showing. The complicated terminology comes from the fact that this toolkit helps you do it in such a way that, if you want to, you can also use the result as an extension for Kustomize, kpt, or other similar tools, in addition to being able to use it on its own.
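Before the walkthrough, here's a heavily simplified sketch of what such a function can look like with the kyaml framework package; the `ExampleApp` type, its fields, and the generated Deployment are all illustrative rather than the actual end-to-end example, and the framework calls shown reflect my best understanding of its API:

```go
// Minimal sketch of a KRM function built with kyaml's fn/framework package.
// ExampleApp and its fields are hypothetical; error handling is abbreviated.
package main

import (
	"fmt"
	"os"

	"sigs.k8s.io/kustomize/kyaml/fn/framework"
	"sigs.k8s.io/kustomize/kyaml/fn/framework/command"
	"sigs.k8s.io/kustomize/kyaml/yaml"
)

// ExampleApp is the client-side config API this function accepts.
type ExampleApp struct {
	Metadata struct {
		Name string `yaml:"name" json:"name"`
	} `yaml:"metadata" json:"metadata"`
	Replicas int `yaml:"replicas" json:"replicas"`
}

// Default fills in omitted values; the framework calls this automatically
// when the config type implements framework.Defaulter.
func (a *ExampleApp) Default() error {
	if a.Replicas == 0 {
		a.Replicas = 1
	}
	return nil
}

// Validate rejects bad input (framework.Validator).
func (a *ExampleApp) Validate() error {
	if a.Metadata.Name == "" {
		return fmt.Errorf("metadata.name is required")
	}
	return nil
}

// Filter holds the main business logic: given the current resources, return
// the desired set. Here we just append a Deployment derived from the config.
func (a *ExampleApp) Filter(items []*yaml.RNode) ([]*yaml.RNode, error) {
	deployment, err := yaml.Parse(fmt.Sprintf(`
apiVersion: apps/v1
kind: Deployment
metadata:
  name: %s
spec:
  replicas: %d
`, a.Metadata.Name, a.Replicas))
	if err != nil {
		return nil, err
	}
	return append(items, deployment), nil
}

func main() {
	app := &ExampleApp{}
	// SimpleProcessor unmarshals the function config into app, then runs it.
	p := framework.SimpleProcessor{Config: app, Filter: app}
	// StandaloneEnabled lets you run the binary directly against a config file.
	if err := command.Build(p, command.StandaloneEnabled, false).Execute(); err != nil {
		os.Exit(1)
	}
}
```

In standalone mode, you'd invoke the resulting binary with the config file as its first argument, which matches the usage described in the walkthrough that follows.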
So here's an example of an API that you could build with this framework. As you can see, it starts with some Kubernetes-style metadata, and it allows an app owner to describe some hypothetical app and workload properties. These are completely arbitrary, by the way; beyond the metadata, the framework has no opinion about what fields you define.

To implement this, you might start by defining the corresponding struct type. The framework can take advantage of OpenAPI schemas, and this example shows how you can use kubebuilder annotations to have a schema generated for you, which is pretty neat. You can then implement a Schema method on your type that returns the generated schema, and you'll get validation nearly for free. This method will automatically be called by the framework, which will also call Validate and Default if you define them. Those two methods do what they sound like: they let you implement custom validations and defaulting. In sum, all three of these things help you make sure that the data you got from the user, so an example app in this case, is in good shape before your main business logic gets called. And that main business logic, in other words the core of your implementation, goes into a method called Filter. This is where you create and manipulate YAML objects until you have a set that reflects the desired state the config API expressed, so the example app you were given. The framework has a wide range of tools to help with this task as well, from helpers that make surgical edits to particular fields, to things we call processors, which are higher level. This example demonstrates the template processor, which lets you use familiar templating, Go templating specifically, in the form of resources and patches in Go template files, as your primary implementation detail. What ties all of this together in the framework is something I like to call a dispatcher, which basically looks up the incoming type that came from the user, so the example app in our case, and finds the corresponding implementation, so all that code we just looked at. The dispatcher pattern allows you to handle multiple abstractions, or moreover multiple versions of a single abstraction, so you can evolve your API over time through one binary. It also gives you a hook for post-processing, which we're using in this example to sort fields in a deterministic order before output, no matter which API we just processed.

Once you have all those pieces in place, there are a handful of ways you can actually use your new config API. The framework makes it easy to build a dedicated binary, which you use by passing the config file as the first argument. The framework also has some helpers for building a Dockerfile that can help you create a containerized version. And as I alluded to earlier, this framework was created to help implement the KRM Functions Specification, so you can also plug it into kpt, Kustomize, or any compliant tool. So if that quick walkthrough piqued your interest, please have a look through the complete example, which is at the QR code on the left, and give the framework a try. The link on the left also contains a few resources relevant to this talk. And whether or not you want to use this toolkit specifically, I sincerely hope you learned something useful from this talk, and I thank you for your time.