Welcome to our talk. Today we're going to be talking about a paradox of choice: how to pick an application definition that works for you. My name is Anusha Raghunathan, and with me is Kevan Dhani; we're software engineers working at Intuit. Today we'll cover some of the background of why we're even doing this talk and introduce the problems we faced in our platform infrastructure team, supporting millions of customers as well as the thousands of developers that build applications on top of our platform. Then the meat of the problem, the paradox of choice: we'll talk about that, and about the solutions we came up with when we were faced with this problem. We'll finish up with results and takeaways. First, Intuit and our infrastructure at a glance. Intuit is a global financial tech company that provides a vast array of financial products and services built on an AI-driven expert platform. If you've used any of these products (TurboTax, QuickBooks, Mint, Credit Karma, Mailchimp), know that they've been running on our Kubernetes-powered infrastructure. We support about 900-plus developer teams, consisting of 6,000 application developers, and they deliver more than 2,000 services on 245 Kubernetes clusters running 16,000-plus namespaces. Note that these are just average numbers; during our tax peak seasons, these numbers go really high. One of the core tenets of our platform infrastructure team is to accelerate the velocity of our end developers, and we try to reduce friction where and when possible. Let's take a look at an average developer's CI/CD pipeline. An application developer commits code, which triggers a build, which triggers tests, typically unit and integration tests; the code then gets deployed to a particular environment, whether it's QAL, performance, staging, or prod, and eventually gets monitored, using an array of monitoring tools, for the health of the application.
Now, note that the application developer also has to work with a deployment repo for their deployment needs: what are the configurations for my application running in the various environments? That's when they're wearing both the dev and the ops hats. Let's take a closer look at our deployment pipeline. The application developer uses Kustomize base configs and overlay configs in their deployment repos. These get picked up by Argo CD, which is our primary tool for deploying any deployment manifests into our Kubernetes production clusters, and they're all isolated using namespaces. So what could be a problem with that? First problem: Kubernetes and cloud complexities are exposed directly to our application developers. An application developer pretty routinely has to worry about: how do I set up my horizontal scaling? What are the min replicas, and what could be an approximate max replicas, for my HPA? Is 15 seconds too low for a health check interval on my ALB Ingress object? What's the right maxUnavailable for my pod disruption budget? What are good CPU and memory limits for my service? What quota should I set for my application's namespace? Now they're worrying not just about building their Node or Java application; they're worrying just as much about their Kubernetes objects and their cloud complexities. Problem two: our application developers are very savvy and smart. Let's say they get over the first hurdle and understand Kubernetes and the cloud, but Kubernetes deprecations are still exposed to them. The application developer now has to understand that Ingress v1beta1, for example, deploys on a Kubernetes 1.21 cluster just fine, but when it comes to Kubernetes 1.22, it's going to break. Of course we don't let this happen, but at the same time, we have to work closely with our application developers to migrate them from a deprecated set of APIs to a new set of APIs that works.
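To make that deprecation problem concrete, here is a sketch of the Ingress change that removal in Kubernetes 1.22 forced on developers. The hostname and service name are illustrative, but the API shapes are the real upstream ones:

```yaml
# Deploys fine on a Kubernetes 1.21 cluster, breaks on 1.22:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: my-service
spec:
  rules:
    - host: my-service.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: my-service   # flat v1beta1 backend shape
              servicePort: 80
---
# What the developer has to migrate to for 1.22 and later:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
spec:
  rules:
    - host: my-service.example.com
      http:
        paths:
          - path: /
            pathType: Prefix            # now a required field
            backend:
              service:                  # backend is now nested
                name: my-service
                port:
                  number: 80
```

A mechanical-looking change, but multiplied across thousands of services it is exactly the kind of migration work our platform team had to shepherd.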
Again, something that causes friction and reduces developer velocity. Problem three: lack of operational input in the application definition. Beyond the cloud and Kubernetes complexity we already expose, there's operational input such as: how do I enable HA for my service through an application definition? How can I enable active-active DR? What if some of my services need external traffic and I need to specify that? How can I do all that as part of the application specification? We don't provide that today. So these were the top three problems that were holding back our developer velocity. What we did know was that we wanted a desired target state where the application developer specifies the application intent, and that's it; we do some magic behind the scenes, and that gets deployed into a Kubernetes cluster with the right cloud resources. So we knew what the problem was and we knew what the target state was, and we started looking at the possibilities for solving this problem, and we ran into a paradox of choice. If you were Neo in The Matrix, you were given two pills, a blue pill and a red pill. Easy choice, huh? We were given a million pills. Between the CNCF landscape, our own homegrown tooling, and everything in between, there were too many options for figuring out a solution to this problem. This is a snapshot of the CNCF landscape for application definition and image build. So we wanted to take a methodical approach to understanding how we could solve the problem with these existing tools in a way that would also fit into our tool chain and our use cases. But before all that, let's take a look at what the app spec should look like. Our main requirements for the app spec were: it needed to be application-centric, there shouldn't be any leakage of cloud or Kubernetes resources into the application specification, and it had to support both the deployment and the operational needs of the application.
And this was a pretty straightforward choice. The two choices we had were the Open Application Model (OAM) style of specification, which suited our needs pretty well, or a templating-style model, where you provide a bunch of input parameters but a lot of abstraction leaks into the application spec. So it was easy for us to go with an OAM-style specification. To give a very high-level overview: you would take a simplified application specification like the one shown here in the black box, which captures the intent of the developer, saying, hey, this is the image that I want, here are my sizing needs, both horizontal and vertical, and here's how I override these traits depending on my environment. And from that we would be able to generate the Kubernetes resources. So that was the criteria. With that, I'm going to hand it over to Kevin to talk about the solutions. Morning. Thanks, Anusha, for handing off. We'll talk about the solutions we found that matched our application criteria. There were actually four choices we found that met the OAM-style specification requirement and would allow us to generate resources and deploy those resources. And conveniently, they bucketed into two categories: client utilities versus control plane utilities. The way we wanted to dive in and evaluate them was a POC. We divided our team up into four teams, each of which would go figure out what we wanted to find out about one of these. Before we get into the breakdown, we'll go over the utilities really quickly. If you're not familiar with Helm, it packages Kubernetes configs as charts and provides deployment and application lifecycle management. It's also a CNCF graduated project. Kustomize KRM functions: these are client-side functions that operate on the Kubernetes Resource Model.
A KRM function configuration supports generators, transformers, and validators as Kustomize plugins. These are both client utilities that can generate resources. Just to give an overview, we have a CD pipeline for the client utilities. It's very generic: we take the application spec, that's the intent of the user, and we use the client utility to generate Kubernetes resources. Those resources are synced to GitHub and flow through our normal GitOps pipeline using Argo CD. On the control plane side, at a high level, we looked at Crossplane: it's an open source control plane framework that helps orchestrate applications and infrastructure. We also looked at KubeVela: it's an open source application engine based on Kubernetes and OAM. And generically, this is how the CD pipeline would look for either of these control planes: we take the application definition, that's checked into Git, Argo CD syncs it to the control plane, and the control plane is then responsible for continuous delivery and for generating the resources on the end cluster. So let's dive right into the results. There's a trigger warning here if you use any of these tools: these are our takes on the usage of these tools for our application needs. We did a comparison, and we'll just go through all of these as pros and cons. Using Helm for our OAM-like application spec, we found that it was GitOps compatible, and it definitely had an active community. There were definitely a couple of cons, though, such as charts of charts: if you're familiar with Helm, you can get into an explosion of charts, depending on how you size your charts and your packaging, and you might have packaging problems with those. There's also a lot of template logic: the core business logic you'd have to write would be applied at the template layer. Kustomize KRM functions similarly had pros and cons. The pros: it's GitOps compatible.
It was also OAM-style compatible, which fits our needs. And it has a logical, rather than template-driven, methodology: you can use templates with the KRM functions, but only in a limited way. The cons are that it's in alpha status and it has very sparse documentation. Moving on to the control planes: when we looked at Crossplane, we found a lot to like. It was OAM compatible, and it's extensible and multi-cloud. The cons are that we'd have to run multiple controllers now. And I think this is an important point: there was lackluster Argo CD compatibility as of when we looked at this. There was Argo CD compatibility, but the state of the CRDs did not actually reflect the real state of what we were deploying. And there was a limited feature set: what we really wanted was application generation tooling, and this is geared more toward infrastructure tooling. When we looked at KubeVela as well, lots of pros: OAM compatible, extensible, multi-cloud. The cons are very similar: multiple controllers, so more operational burden for us. It's also kind of all or nothing: you get to deploy KubeVela and OAM their way. And one of the things that is sold as a pro would actually be a con for us: the CUE templating DSL. Implementing all of our templates in this other language would be a learning curve for us. So after we got back and presented all these results, we didn't actually have a clear winner. We had multiple camps of developers. We found a lot to like in each of the solutions, and each set of developers came back advocating for the solution they had POC'd. So we really needed a better way of gathering results and comparing them against each other. What we wanted was more of a data-based approach, a qualitative and quantitative analysis, and we wanted to look at each of these from an engineering data perspective.
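An engineering-data comparison like this can be boiled down to a weighted scorecard. As an illustration of the mechanics only (the criteria echo the talk, but the weights and 1-to-5 scores below are made up, not our real data):

```yaml
# Illustrative scorecard: each criterion has a weight reflecting how much
# it matters to the team; each tool gets a 1-5 score per criterion.
criteria:
  - {name: learning curve, weight: 5, helm: 4, kustomize-krm: 4, crossplane: 2, kubevela: 2}
  - {name: effort,         weight: 5, helm: 2, kustomize-krm: 4, crossplane: 3, kubevela: 3}
  - {name: technical fit,  weight: 4, helm: 3, kustomize-krm: 5, crossplane: 4, kubevela: 4}
  - {name: operability,    weight: 3, helm: 5, kustomize-krm: 5, crossplane: 2, kubevela: 2}
# Weighted score = sum(weight * score) / sum(weight).
# Example, kustomize-krm: (5*4 + 5*4 + 4*5 + 3*5) / (5+5+4+3) = 75 / 17 ~= 4.4
```

The point of the weights is that two tools with similar raw scores can rank very differently once the team's priorities are applied.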
And so what we did is, having presented all these POCs to our developer groups, we gave them this scorecard. The scorecard has a bunch of criteria and a bunch of weights, and the weights are important: they capture specifically what we find important about each criterion. We have the four results here, and just from the averages, this is how we ended up. That may not show that well on the slide, but what's happening in this chart is that there's a very clear divide between the client utilities and the control planes, and the clients clearly won; they stood out. Why? Well, you have to look a little bit at each of the criteria and dig in a little deeper. The first thing was the learning curve. As mentioned, the control planes come with multiple controllers, and there's the DSL templating logic, things that would make them hard to adopt. There was also the effort involved, which also comes down to the templates. The technical fit was compatibility, whether it's GitOps compatible or OAM compatible. Operability and scalability: that's client versus control plane, running one binary versus many controllers. The flexibility came down to code versus DSL. And all of this came together to show why KubeVela and Crossplane were not going to work for us. So we moved on to the top two winners here. Now at least we had two potential solutions, and if we just took the raw numbers, we would go with Helm. But we really wanted to analyze, in a qualitative way, why we would choose one solution over the other. So again we assembled the team, and we basically asked: what's the learning curve for Helm and Kustomize? The whole team is familiar with Helm, and they're familiar with Kustomize; there is no learning curve there. How much effort is there?
This is really the key difference here. We have already adopted Kustomize: our entire control plane and CI/CD pipeline is based on Kustomize today. Redoing all of the templates in Helm would be a very large effort. Other effort would be adopting Helm as the packaging format and, you know, proselytizing that across the company. On OAM compatibility: for our needs, the style of application definition that we have is more OAM-like. We want to be able to just specify a CRD-like definition and have that processed, and doing that in Helm requires a lot of helper functions. There is no difference in scalability here; they're both client-side tools. They're both flexible; they both do code and templates. I think there is a key difference between Kustomize and Helm when it comes to community and documentation: Kustomize KRM is pretty new and still lacking in community and documentation. But if you recall from the previous slide, we put more weight toward the top of the chart: the time to market, the learning curve, how much effort there is. This is how we ended up choosing Kustomize KRM. So why did we choose it? Because it supports transformers, generators, and validators as plugins. It supports the declarative spec that we need. It's also GitOps compatible, and that's very important for us: we want one source of truth, and we don't want intermediate YAMLs and a bunch of intermediate pipeline steps that determine where we are in our setup. So with that, I'll hand it back to Anusha for the demo. Thanks, Kevin. We're going to see a quick demo. A short prayer to the demo gods. All right. This demo is of our Kustomize KRM plugin, and we're going to look at a sample app YAML. Note that this is our only interface to the developer. The developer specifies the name of the image and also adds in some hints about the traits.
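To give a feel for the shape of such a developer-facing spec, here is a hedged sketch of an OAM-style app YAML of the kind shown in the demo. The apiVersion, kind, and field names are hypothetical, not our actual schema:

```yaml
# Hypothetical OAM-style application spec: pure intent, no Kubernetes objects.
apiVersion: example.dev/v1alpha1
kind: Application
metadata:
  name: demo-app
spec:
  image: docker.example.com/demo-app:1.0.0
  traits:
    verticalSize: small      # abstract T-shirt size instead of CPU/memory numbers
    horizontalSize: medium   # abstract scaling hint instead of HPA min/max replicas
  overrides:
    - environment: qal
      traits:
        verticalSize: medium # per-environment override of a base trait
```

Everything the developer does not say here (Ingress, PDB, quotas, HPA details) is the platform's responsibility to generate.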
In this case, we're going to focus on the vertical size of this service. Note that the base has a vertical size of small, and the override for the QA environment has a vertical size of medium. Once we have that information, we run kustomize build. So what we just did there was run kustomize build and specify that we want to use our OAM Kustomize plugin to build the demo app that we just saw. What it emits is a whole bunch of Kubernetes-native manifests from that simple app YAML file, using generators and transformers. You can see the several services we have behind the scenes: we have AnalysisTemplates from Argo Proj; we have, sorry, a Rollout from Argo Proj, which is actually a Deployment replacement in our production environments; and finally we have an HPA object that gets vended out by default to all our applications. Now let's copy this over to our deployment repo. Typically this is done as part of the deployment pipeline, but since we don't have a full-fledged pipeline here, I'm just going to do it manually by copying it over to the deployment repo. Let's go to our deployment repo here. There's no diff, because it's just the same. Let's go to the Argo CD instance that's running locally on my cluster here. This Argo CD app is pointing to the URL that we just saw; that was just a clone of this repo, pointing at the manifest directory. Now let's make it interesting and say that the application developer decided to change the vertical size of the application in QAL. So let's go from medium to large, and use our KRM plugin to generate a new manifest. Notice that there is a diff in the size: it's increased to large. Now let's commit this and push it. Now let's go to Argo CD. Notice that Argo CD detects that it is out of sync.
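For readers following along, the Kustomize KRM plugin mechanism behind a demo like this is wired up roughly as follows. This is a minimal sketch of the upstream exec-function mechanism; the plugin path, spec fields, and image are hypothetical:

```yaml
# kustomization.yaml: register the app spec as a KRM function generator.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
generators:
  - app.yaml
---
# app.yaml: the developer-facing spec; the annotation tells Kustomize
# which plugin binary to exec, passing this object in on stdin.
apiVersion: example.dev/v1alpha1
kind: Application
metadata:
  name: demo-app
  annotations:
    config.kubernetes.io/function: |
      exec:
        path: ./oam-krm-plugin   # hypothetical plugin binary
spec:
  image: docker.example.com/demo-app:1.0.0
# Build, then commit the generated manifests so GitOps can take over
# (exec plugins need explicit opt-in flags in recent Kustomize versions):
#   kustomize build --enable-alpha-plugins --enable-exec .
```

The plugin receives the Application object, emits the full set of native manifests (Rollout, HPA, and so on), and Kustomize prints them as ordinary build output.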
You can see that the diff shows we've moved to a vertical size corresponding to large. Let's go ahead and sync it; the sync is in progress. So the essence of this demo is that the application developer only has to work with the app.yaml file that we just saw and make changes to traits such as vertical or horizontal size, which is just one example; we have a lot more traits that can be added. The rest is taken care of by the deployment pipeline. The complexities of the cloud and Kubernetes are completely abstracted from the application developer, and the deployment just works. That concludes the demo. So what are the takeaways? Well, just like us, you might be faced with a paradox of choice when it comes to cloud native solutions, whether it's your own homegrown solutions, something in open source, or something in the CNCF landscape. Don't get overwhelmed: using a methodical, data-driven approach can actually lead you to a solution that works for you. Abstracting application developers from the complexities of Kubernetes is doable. We used a client-side solution because we are a huge GitOps-based shop at Intuit, and we wanted to pick a GitOps-friendly solution that worked with our existing tool chain; KRM plugins are an efficient way to do this. In fact, stay tuned: we're going to write a blog post with more details on the KRM plugin that we implemented and demoed today. If you have a non-GitOps-based setup, then you could go for a control plane solution like KubeVela or Crossplane. And the velocity and innovation of platform teams also increases with application abstraction. When we started the talk, we talked a lot about the velocity of developers, but an application abstraction also helps the platform teams move much faster, because they don't have to write technical service bulletins to make sure that the developer teams are migrating to a new solution.
We can roll out a new service mesh or a new CRI or CSI without having to expose the developers directly to it. So speed to benefit is what matters to us. Know what matters to your organization and work on finding the right solution for you. Thank you. All right, I think we have four minutes left, so there's time for questions. If there are any questions, this is a really large room, so I'm going to run. Oh, shit. Hello. Yes. Did you try making your own operator as an alternative? That's what we ended up doing when we had the same problem, because you just make an operator which spawns all the other resources. Did we try an operator? Is that what your question was, whether we tried an operator approach? Yeah, we evaluated the stuff we showed you. We found that the control planes, Crossplane and KubeVela, were more mature than our own controller would be. We already run our own controllers as well. Part of the story you don't know is that we are very mature in our pipeline right now; we're kind of on v2 of our pipeline already, and everything's in place. We have controllers running for different reasons, for infrastructure or whatever. Adding another controller is just an operational burden for us. That's why. Hi. Maybe I misunderstood when you were talking about polling your organization earlier. Was it your app developers or your platform team that you polled? Platform engineers. So it was the platform engineers that you polled about their selection. What kind of specification would it have been? From the developer's perspective it was the same regardless, right? Correct. Hello. Quick question on alternatives: on the Helm side, did you look at helmfile? And on the other side, did you look at kpt? The alternatives, helmfile and kpt: we are aware of those for sure. I think some of those didn't make it all the way up to our evaluation stage. We did look at them and have a doc.
Like, okay, is this a useful thing for us? I think what it really came down to is that we had narrowed our sights to an OAM-like specification, and we wanted that specification; when you pick one of those other tools, you're kind of adopting their approach. So do you expect the integration between Argo and Crossplane to improve? Are you aware of any efforts there? The answer is yes. I think there's a talk today covering Crossplane and Argo. One of the, yeah, the lackluster parts of it was that we were deploying infrastructure stuff and app stuff as part of the POC, and all it said was "complete" when it was actually in an error state, or not in a fully up state. So hopefully those improve. And it's not that we'd never run Crossplane for something, but I think the main point was that it was more than we wanted; it was more focused on infrastructure. You might actually want to run Crossplane together with KubeVela or something like that. We have time for one last question. Hi. My question is: did you have a situation where OAM didn't fit a use case that you needed and you had to expand beyond it, and if so, how did you approach customizing it in your own way? So we have an OAM-style specification that we have implemented, and the Kustomize plugin is actually extensible enough to let us add our own traits, for example the sizing traits that we have, environment-related traits, and so on. So it's very loosely defined; it works for our needs right now, and the implementation is all in the plugin. All right. Thank you for joining the session, and please share with Kevin and me one last round of applause.