Hi everyone. Thank you for attending our talk, Eating Your Vegetables: How to Manage 2.5 Million Lines of YAML. My name is Danny Thompson, this is my colleague Jesse Suen, and we're really excited to give this talk. Before we dive in, a brief intro. Like I said, I'm Danny Thompson. I'm a software engineer at Stytch. If you haven't heard of Stytch, it's a platform for simplifying user authentication through passwordless logins. Before that, I was working at Intuit on Intuit's platform team, where part of my responsibility was helping developers deploy their applications onto Kubernetes. And I'll let Jesse introduce himself. Hi everyone, my name is Jesse Suen. Thanks for joining our talk. I'm a principal software engineer at Intuit, on the Intuit platform team, and primarily I work on the Argo project, which is a suite of Kubernetes tools and controllers focused on application delivery. Great, thanks Jesse. So let's dive into it. Let's start with just a single Kubernetes file before we get to 2.5 million. What we have here is a deployment, and this deployment has an anti-affinity rule that guarantees the pods within it won't be scheduled on the same node. What's really great about this is that there's no code you need to write: all of it is configured through YAML. We just hand this to Kubernetes, and the scheduler figures that work out for us. So as a developer, this is one concern taken off my plate. That being said, this YAML is also very verbose. It would be challenging to write it without Googling the spec. And it's pretty long; this YAML isn't even complete. You might notice we left out parts of the selector because we couldn't fit the entire YAML on this slide.
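As a rough sketch of the kind of manifest being described (names and image are placeholders, and the selector is kept short here too), a deployment with a pod anti-affinity rule might look like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # Hard requirement: never co-locate two pods carrying the
          # same "app" label on the same node (hostname).
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname
      containers:
      - name: my-app
        image: my-app:1.0.0   # placeholder image
```

Even this abbreviated version takes roughly 25 lines to express "don't put two copies on the same node."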
And while it's challenging that Kubernetes YAML is this verbose, that verbosity is somewhat necessary for Kubernetes to address all the different use cases it's trying to solve. As a result, we as developers are left with the challenging problem of managing Kubernetes files. And this is just one Kubernetes file. When you run a cluster, you're going to need more than just one deployment; you're going to install a bunch of add-ons. As an example, here's a slide with a bunch of really popular open source projects that are installed across a wide variety of clusters. These add-ons bring functionality to Kubernetes that isn't already there and help you run a production-grade cluster. And this list isn't even exhaustive. Each of these add-ons generally requires modifications to fit into your cluster, so on top of the YAML for the add-ons themselves, you need to layer on your own tweaks. As a cluster operator, that makes it challenging to run a production-grade cluster, because you have to manage all of these add-ons. And once you get to a certain scale, most companies lean toward breaking things up into multiple clusters, and then the problem is compounded, because you need to configure each of these add-ons for each cluster. The point I'm trying to highlight is that editing Kubernetes YAML at the micro scale of one file has its challenges, but managing all of your Kubernetes YAML across an entire cluster, and across multiple clusters, is also challenging. This makes it hard for organizations to adopt Kubernetes, and we're going to talk about different approaches to help manage this large body of YAML.
With that, before we dive into the approaches, I think it's important to touch on the considerations behind them, or at least what you need from a configuration management system. Configuration management is very much a problem that's unique to each organization, so an approach that works in one organization won't necessarily work in another. You need to think about the things that are unique to your org, like what your culture is and what the responsibilities of your teams are, and that will inform the appropriate approach to take. A great place to start is with the different personas. Generally we see two: you have your operators and your developers. Operators are in charge of running your cluster. They generally take off-the-shelf software and install it into the cluster. They want stable, controlled updates, and they use semantic versioning to understand what each change is bringing in. On the other side we have developers, who are focused on building bespoke applications. These applications are generally focused on business logic and solving business needs, without as much focus on the platform. Compared to operators, developers deploy as needed and don't necessarily need semantic versioning for their applications. So these two personas actually have very different needs, and you might decide it's better to use two different configuration management approaches to serve them. For this talk, we're going to focus on the developer experience. Diving a little deeper into the developer persona: developers generally have a wide range of experience and comfort with Kubernetes.
You might have some power users who have previously run clusters and know the ins and outs of Kubernetes, or you might have users who have no interest in Kubernetes and just want to ship a business application. You need to be able to cater to both ends of that spectrum, and the way you do that depends entirely on how you expose your platform to them. For example, if you abstract everything away, you solve the zero-interest use case, but you might alienate power users who have a special use case they're trying to solve. On the other end of the spectrum, if you just give developers raw YAML, the developers with zero interest in Kubernetes will be confused by it, while the power users will run with it. So you need to take your organization's culture into consideration when designing your abstraction. Another thing that's important to point out is control of the configuration. You want to find a balance between centralized control and developer control. Centralized control enables things like standard patterns and best practices, which give you easier maintenance of your cluster and help address security concerns. Developer control enables your developers to do what they need to deploy their application, whether that's onboarding a new environment or choosing a different deploy strategy. There's a balance to strike here: you want developers to be able to do what they need without having to go to the centralized platform team, and you also want the centralized team to be able to roll out changes without having to bother hundreds or thousands of developers to merge a PR. So keep all of these considerations in mind.
That's where you need to evaluate what's important to your organization. At this point, I think we're ready to look at the different approaches and the pros and cons of each. Let's start with an easy one: you just give your developers plain YAML. This is usually the starting point for most organizations, since there's nothing to learn. It's easy, straightforward, and totally flexible. But this strategy really doesn't scale, because it has no configuration reuse: it's really easy to make a change in one environment but not have it carry over to another. So this is generally seen as just a stepping stone. The next approach is usually seen as the natural next step: okay, if not raw YAML, why don't we use templates? Templating is where you take a list of parameters and inject them into a predefined template, and that output is your configuration. An example of this would be Helm, and the image we have here is a Helm chart. What you'll notice is that it lets you define different values that are filled in from another file, which the developer would provide. The advantages here are that it's a simpler configuration and it's very flexible; you might notice we have an if statement within this deployment template. But there are disadvantages too: these templates tend to grow in complexity as time goes on, you start to parameterize everything, and maintaining these templates and understanding what's going on becomes really challenging. So that's templating in a nutshell. The next one I'm going to touch on is a bit of a different approach, and that's using overlays. You can think of overlays as defining a common base that you share across environments.
Then for each environment, you define a file that holds the specific changes you want to make for that environment. When you want to deploy, you take your base and apply those environment-specific changes on top of it. An example of this would be Kustomize, shown in the image we have here. The image in the background is the defined base, which is a deployment, and then we have environment-specific changes: for the staging environment, we only want two replicas, so when you create the staging configuration, the two replicas override the one replica listed in the base. The advantages of overlays are that they're very readable, they encourage configuration reuse, and they're mostly flexible. That being said, some disadvantages of the overlay approach are that it's not always immediately intuitive to developers what's going on (it takes a bit for it to click sometimes), and the lack of parameterization makes things that should be easy hard. So that's overlays in a nutshell. The next approach I'm going to talk about is a bit more of a pivot, and that is abstractions. An abstraction is where you hide the underlying details of the configuration behind a simpler interface. Some examples are Pulumi and cdk8s, and Helm also kind of checks this box. The image we have here is an example of an abstraction where the developer only provides the things they care about: okay, I'm running a web service, this is the DNS name I want, this is how I want to upgrade it, and I want it on the mesh. The user only has to list these fields, and that gets translated into all the different Kubernetes manifests that come along with it. So you can see this is a much simpler configuration.
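To make that idea concrete, here's a hand-wavy sketch of what an abstraction's input might look like; the kind and field names are hypothetical, not taken from any particular tool:

```yaml
# Hypothetical abstraction: the developer states intent only.
kind: WebService
name: checkout
dnsName: checkout.example.com   # external DNS name to register
strategy: canary                # how upgrades should roll out
mesh: true                      # join the service mesh
```

Behind the scenes, a controller or generator would expand these few lines into the deployment, service, ingress, and mesh configuration the user would otherwise write by hand.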
Abstractions also allow organizations to implement standards that can be applied across the org, and the way that's done is by baking those standards into the underlying abstraction. There are some disadvantages: the developer is giving up flexibility, and most of these abstractions tend to leak the underlying details over time. As a result, we've found that the right abstraction for Kubernetes really hasn't been found yet. And that's abstractions in a nutshell. The last approach we're going to talk about is codifying your configuration management. This one's pretty simple: you use a programming language to generate your configuration. Some examples would be cdk8s, Pulumi, and Jsonnet. The image we have here is from cdk8s, and what it's doing is importing a couple of well-known constructs and passing in the different configuration a user would want to apply. The benefit is that you get everything that comes with a programming language, like loops, conditionals, functions, and unit testing. You also get largely the same benefits as an abstraction, because in a sense it is a bit of an abstraction. That being said, you get all the disadvantages of a codebase. This is another codebase you have to manage; you might have bugs, and you have to figure out how to debug them. And it can be a challenge for your developers to figure out how the chart we have listed here translates into the final result. So with all of that, those are all the common approaches we've seen to configuration management for Kubernetes. At this point I'm going to hand it off to Jesse, and he's going to give an example of how Intuit has done configuration management for all of its developers. Thanks, Danny.
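Before the case study, a quick illustration of the codified approach Danny just described. This is a hand-rolled sketch in plain Python, not cdk8s itself; the service names, images, and replica counts are made up. It shows how a function plus a loop replaces copy-pasted per-environment YAML:

```python
import json

def deployment(name: str, env: str, replicas: int, image: str) -> dict:
    """Generate a Deployment manifest as a plain data structure."""
    labels = {"app": name, "env": env}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{name}-{env}", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }

# One loop stamps out a manifest per environment instead of
# maintaining a hand-edited YAML file for each one.
envs = {"staging": 2, "production": 6}
manifests = [deployment("web", env, n, "web:1.0.0") for env, n in envs.items()]
print(json.dumps(manifests, indent=2))  # kubectl accepts JSON as well as YAML
```

The same structure is where the disadvantages show up too: a bug in `deployment()` now affects every environment it generates.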
So now we'll get into a bit of a case study of Intuit's approach to configuration management, and for this section we'll focus on the developer experience we provided as a platform team. A little background on our use case: we have 4,000 developers deploying mostly SaaS applications. These developers are managing multiple environments, and for us, an environment equates to a Kubernetes namespace. These environments are mostly identical, with slight variations in their config: maybe they have different DNS names, use different AWS resources, and have different IAM roles and privileges. Intuit promotes a DevOps culture of "you build it, you run it," so the developers who build an application are the same ones who operate it and are responsible for its uptime and availability. With that, we came up with a set of requirements. As a platform team, we wanted to provide a standard set of patterns and best practices, what we refer to as the paved road. At the same time, we also wanted to empower developers and give them the flexibility to be unblocked in the event they need more capabilities than are provided by default, even if this came at the cost of simplicity. So we made the conscious decision to actually expose our developers to Kubernetes YAML, so they can leverage all the power and benefits of Kubernetes. And finally, any approach we took needed to be GitOps friendly. GitOps is extremely important to Intuit because of a lot of compliance and security requirements, so whatever solution we chose had to fit into our GitOps strategy. The solution we arrived at, after years, is actually to use Kustomize. With Kustomize, since you're exposing your users to actual Kubernetes resources and manifests, you're preserving the full power and capabilities of Kubernetes.
Because Kustomize is Kubernetes native, it's well supported and documented, so our developers are often able to get help from open source documentation and don't always have to rely on the platform team. Because it's just Kubernetes YAML at the end of the day, readability is very clear: you're just looking at manifests. That's important both for the developers and for the platform team supporting the service. If there's some problem or bug in the YAML, it's very clear where the problem is, rather than some abstraction hiding it away. Kustomize's overlay pattern really fits our use case of mostly identical environments with slight variations, which promotes a lot of configuration reuse and maintainability. And finally, what I think is probably Kustomize's best feature is its ability to reference a central, remote base Git repository, which allows us to distribute standard patterns across our organization. So yeah, the central remote base is how we leverage Kustomize. Basically, as a platform team we provide what I call a catalog of generic starter YAML. In a simple example, the developer will get something like a web service, which is composed of a deployment, a service, and an ingress. Later, if the developer wants to leverage some of the more advanced capabilities of Kubernetes, they can include, say, an HPA base or a canary analysis base as part of their service. We semantically version this remote base so that developers can upgrade to a newer feature at their own pace. All of this gives us a standard distribution of patterns and best practices, things like setting pod readiness gates, resource limits, and ingress annotations. What the developer gets is their own Git repository that derives from the central remote base.
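A sketch of what that inheritance chain might look like as two kustomization files; the repository URL, paths, and version tag here are illustrative, not Intuit's actual setup:

```yaml
# app-base/kustomization.yaml -- inherits the platform catalog
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
# Remote base pinned to a semver tag, so upgrades are deliberate.
- https://github.example.com/platform/catalog//web-service?ref=v4.0.0
---
# environments/staging/kustomization.yaml -- inherits the app base
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-service-staging
resources:
- ../../app-base
patches:
- path: ingress.yaml   # env-specific cert ARN and DNS name only
```

Each environment directory only patches what differs, while the shared definitions live once in the app base and once in the central catalog.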
The developers own this repository, and they're free to make changes in it to suit the needs of their service. What we have here is an ingress that we as a platform team came up with that suits our use case. As you can see, we use the ALB ingress controller, and if you've used the ALB ingress controller, you'll know it has a ton of different annotations to control the underlying load balancer. Most of these details are things you don't want to expose to your developers; they don't need to know or care how they affect the load balancer. But there are a couple of things they do need to know, such as, in this case, the external DNS name they want to use and the certificate ARN for their service. So the developer gets this Git repository, structured like this: at the top level, they have the application base directory, which serves as the common definitions they want to apply to all of their environments, and then they get a list of environment directories. Each environment directory corresponds to a Kubernetes namespace and includes only the changes specific to that environment. If you look at the top-level kustomization, you'll see it references that centrally managed remote base I referred to; this one references our 4.0 version. And if you look at the kustomization in an environment, you'll see that it derives from the top-level app base of their own Git repository. With this approach, developers get that same configuration reuse while also inheriting from the central repository. Finally, if you look at, say, the ingress for a specific environment, you'll see it has those two specific annotations the developer cares about: they want to reference a specific certificate and a specific DNS name for that environment. Okay, so let's see how this scaled and worked out for Intuit.
If you think about a single environment, at least in our case, it renders out to about 250 lines of YAML. If a service has four environments, and we have 2,500 services, then the amount of deployed YAML ends up being roughly 2.5 million lines. Now consider how much of that YAML is actually managed by humans in Git repositories; the math works out a little differently. You have a base of 90 lines of YAML, plus four overlays, one per environment, at about 45 lines each, which equates to 270 lines of YAML per service. Multiply by 2,500 services and you get 675,000 lines of YAML, roughly a little over 25% of the deployed YAML. So this has been working well for us, but it hasn't been without its challenges, and these are some of the challenges you'll come across should you take an approach like this. Number one is user support. As I mentioned, we chose to expose our users to Kubernetes YAML, and by making that decision, we also handed our users a lot of loaded footguns. People who may not be as familiar with Kubernetes can make lots of mistakes. Similarly, you have advanced users who want to leverage a lot of the advanced Kubernetes features, and they end up falling off the paved road because they're doing unexpected things, and it becomes harder and harder to support that class of users. The second challenge was the automation necessary to support all this: 2,500 services equates to 2,500 Git repositories, so we had to build a lot of automation to do things like automatically send pull requests in the event, say, we want to deprecate a certain API. And finally, the last challenge is with the tool Kustomize itself. When you depend so much on a tool that's responsible for rendering the YAML you end up deploying, any change to that tool can impact a lot of your services.
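As a quick aside, the line-count arithmetic above can be checked directly; the figures are the rough numbers from the talk:

```python
# Rough figures from the talk.
lines_per_env = 250      # rendered YAML per environment
envs_per_service = 4
services = 2500

# Total YAML deployed to clusters.
deployed = lines_per_env * envs_per_service * services

# YAML actually maintained by humans in Git: one shared base
# plus one small overlay per environment.
base = 90
overlay = 45
managed_per_service = base + envs_per_service * overlay
managed = managed_per_service * services

print(deployed)                      # 2500000 lines deployed
print(managed_per_service)           # 270 lines per service in Git
print(managed)                       # 675000 lines maintained by humans
print(f"{managed / deployed:.0%}")   # 27%
```

So humans maintain a bit over a quarter of the YAML that ultimately gets deployed.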
If we were, say, to upgrade Kustomize and it rendered the YAML a bit differently, which has happened, then we're potentially breaking thousands of different applications all at once. Another thing you should know about Kustomize is that its support for CRDs is a bit limited. It has great support for native Kubernetes kinds, but when it comes to CRDs, you aren't able to leverage some of the convenient features like strategic merge patching of your resources. We actually ended up forking Kustomize in order to run a version that natively understands the CRDs we use at Intuit. And so with that, Danny and I have some final thoughts about the current state of configuration management, but also where we think it needs to go in the future. Yeah, so hopefully, as you can tell, we really think there's no perfect solution to configuration management. No matter what you do, you're going to hit certain edge cases or road bumps; that's the nature of the problem. And whatever solution you choose, it's going to be highly dependent on your organization, how it divvies up responsibilities, and the general culture. What we've taken away is that at a certain scale, just managing YAML files is a lot of work: it's hard for developers to fully understand, and it's hard to maintain millions of lines of YAML. So what we've been thinking is that there needs to be a better abstraction. We think Kubernetes YAML is super powerful and is here to stay, but we need a better way of abstracting the nitty-gritty details of the YAML away from users, because they don't necessarily need to know about them. One way we could potentially do this is through a UI-assisted configuration management tool.
That could allow our developers to easily edit their YAML in a way that gives them more insight into the changes they're actually making, while also enforcing the organizational standards you want to see. A good example of this is the Spinnaker project. One thing Spinnaker does really well is make it really easy to create an EC2 instance: it walks you through all the different fields you need to provide and shows you an opinionated way of bringing up an EC2 instance, with an easy GUI to follow. We're hoping we can find something similar for Kubernetes. With that, thank you for attending our talk. Here's a list of resources, and please feel free to reach out; we're excited to see what we can do with the configuration management side of Kubernetes. Have a great day and enjoy the rest of the conference. Thanks, everyone.