Hi everyone. Welcome to our talk, and thanks for attending. My name is Karim Lakhani, and today we'll be talking about Kubernetes on a budget: how to get pay-per-use right. I'm a senior staff software engineer at Intuit, and I'm here with Vasuki Prasad, who's a staff engineer.

Our agenda for today: we'll discuss our context, the Intuit API gateway, and our journey into cloud-native technologies. We'll discuss the cost overruns we hit as we migrated our workloads in pre-production, then our challenges, learnings, and solutions. And finally, we'll close with our takeaways.

Intuit is a fintech company that operates at large scale: over one million active cores, over 28,000 namespaces, over 25,000 services, and over 300 Kubernetes clusters. We specifically work on the Intuit API gateway. So what is the Intuit API gateway? It's the front door for all the requests that come into Intuit, and a lot of service-to-service communication also goes through it. Our service mesh journey started about three years ago, but even before that we started our microservices journey, so the Intuit API gateway was used for service-to-service traffic, and it continues to be used for a lot of service-to-service communication.

What does the Intuit API gateway provide? It provides routing, and security in the form of authentication and authorization of both the application and the user. It produces the metrics that feed our golden signals, which give us availability and error rates for all the services behind the gateway, along with detailed access logging of all the requests and responses flowing through the system. And finally, it provides quality-of-service features, including rate limiting and traffic dialing.

A few benefits and stats of our gateway. It's highly scalable: at peak we handle over 30 billion requests per day, over 1 million requests per second, on an infrastructure of 30-plus clusters and 250-plus namespaces. It's highly available, with four nines of availability, and highly reliable: it has to be trusted, it has to be up all the time, and it has to produce correct metrics. It's low latency: we target 30 milliseconds of overhead at the p99. And it has deep self-service capabilities. It's integrated into our developer portal, so onboarding and configuration management are all self-service, supporting over 2,000 developers using our API gateway at Intuit.

The Intuit API gateway was built in-house, and I want to quickly discuss why. Our journey started about 11 years ago, in a data center, before a lot of the cloud-native technologies even existed. We wanted to move from a monolithic architecture toward microservices, so we needed an API gateway, and we decided to build one. The value we saw was in building something we could customize for our needs over the next decade, and that's what we've done. Our approach is very similar to Netflix Zuul 2 and the AWS API Gateway: it uses non-blocking, asynchronous communication on Jetty and is written in Java. One of its key features is that partitioning is built in as a first-class feature, meaning different API gateway workloads serve different services. That way we keep workloads separated between QuickBooks, TurboTax, Mint, Mailchimp, and so on. It also has more relaxed quotas compared to the AWS API Gateway, such as longer timeouts and bigger payloads.
So we started in the data center, did a lift-and-shift into AWS, and then discussed whether moving to Kubernetes and cloud-native technologies made sense. Why migrate to cloud-native technologies? The first reason was Istio. As I mentioned, we started our service mesh journey about three years ago and wanted to integrate with it, improving our security, reliability, and observability. As part of that, we released Admiral, our multi-cluster service mesh solution. We wanted to reduce NAT data transfer costs, and we also wanted to improve the client-developer experience by abstracting away whether a client is hitting a service mesh endpoint or an API gateway endpoint; with the API gateway joining the service mesh, that distinction could be made transparent.

The next reason is observability. We had metrics and logging in our old API gateway, but we had gaps: it was difficult to add new metrics, and our custom solutions and technologies added overhead to our deployments. We saw opportunities to improve using Prometheus and other cloud-native technologies. The structure of our system also made cost analytics very challenging, so we saw an opportunity with cluster and namespace partitioning to improve our cost analytics, along with the platform tools built around it. And finally, we wanted to utilize Intuit's modern SaaS platform, which has been built up over the last few years to support a lot of new and existing services at Intuit. We wanted to take advantage of Argo CD, Argo Rollouts, Argo Workflows, the Istio-based Intuit service mesh, and Prometheus, and phase out the legacy tools and deployment scripts that were causing us headaches every week.

At Intuit, we believe deeply in open source and open collaboration. We were the recipient of the CNCF End User Award in 2019 and 2022. We have a few open source projects, including Argo, Numaflow, and Admiral, and we're the end user of many cloud-native technologies: Kubernetes, of course, as well as Istio, Envoy, and Prometheus. And here's a link to our open source community on LinkedIn.

So after migrating to cloud-native technologies, what did our architecture look like? It starts with our applications: our mobile apps, web apps, our users. They call Intuit APIs through the Intuit API gateway, which is now running in Kubernetes with Istio enabled. One cluster might be serving 500 different services. We use Admiral to distribute the configuration for all the cross-cluster communication, covering all the different clusters that the API gateway needs to send requests to. With that configuration, the API gateway can forward a request to the appropriate service, which may be in a different cluster, maybe in a different region, and those services can then talk to each other through the service mesh.

A little deeper dive into that. We have our control plane, which is Istio-based, with Argo, Admiral, and a certificate management system. Then we have our data plane. Requests come in either through the public load balancer or through the Istio ingress gateway (that's our network abstraction). They come into our Istio proxy, which handles mTLS or TLS termination, and then into our Intuit API gateway container, which does all the routing, traffic management, throttling, authentication, and so on. From there, the request goes back out through the Istio proxy to the target service, whichever of our 2,000-plus services that is. We also have an mTLS sidecar that receives updated certificates every hour to do the mutual TLS. With this, we improved our security, our observability, and our reliability.
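To make the mesh-side mTLS concrete, here is a minimal sketch of how strict mutual TLS is typically enforced in Istio with a PeerAuthentication policy. This is an illustration, not Intuit's actual configuration; the name, namespace, and mesh-wide scope are assumptions.

```yaml
# Minimal sketch: enforce strict mTLS mesh-wide with Istio.
# Applying this in the root namespace (istio-system) makes it the
# mesh-wide default; names and scope here are illustrative.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # reject any plaintext traffic between sidecars
```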
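Before the cost discussion, one concrete illustration of the architecture above: a cost-per-usage signal like the one Karim is about to describe can be derived with a Prometheus recording rule. This is a hedged sketch only; it assumes an OpenCost/Kubecost-style `node_total_hourly_cost` metric and Istio's `istio_requests_total`, and the rule name and window are illustrative, not what Intuit actually runs.

```yaml
# Sketch: record "dollars per million requests" hourly, assuming
# an OpenCost-style node cost metric and Istio request metrics.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gateway-cost-per-usage
spec:
  groups:
  - name: cost.rules
    rules:
    - record: gateway:cost_per_million_requests:1h
      expr: |
        sum(node_total_hourly_cost)                          # $/hour across all nodes
        /
        (sum(rate(istio_requests_total[1h])) * 3600 / 1e6)   # millions of requests/hour
```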
Now I want to talk about what we're really here for: our cloud-native cost optimization journey. As we started migrating to cloud-native technologies in pre-prod, we realized that our costs were going to be 2x what we had planned. That was an emergency moment, and we had to figure out what was going on. So we embarked on an analysis, investigating all the different parts of our architecture to see where we could optimize cost and how we could bring it down to what we expected and what it had been in our AWS workload. As a result, we found several areas of improvement, which Vasuki will go into, and by implementing these changes we reduced our expenditure by 50%, bringing us in line with our expectations.

Now I want to talk about how we discovered the cost overruns. It's very important to have the tools and observability to understand and analyze cost. You want month-over-month, week-over-week, and daily cost metrics, and, if you can get them, metrics on how different services contribute to the total cost. That's really helpful for understanding, when costs go up, why they went up and how to address it. Without these, you're basically flying blind. We also have further detailed analysis, breaking cost down at the cluster level, namespace level, environment level, business unit level, and service level. We fortunately had these tools to understand our cost, and that helped us a lot in bringing it down.

The second part is combining cost metrics with usage metrics. With our API gateway, we don't control the number of requests coming through the system; they come from browsers and web apps, and the more requests that come, the more it costs. So just because cost goes up doesn't mean we're doing a worse job. It's important to correlate cost metrics with usage metrics to understand the cost per some unit of work; for us, that's cost per million requests. That's the number we can optimize. And to understand where the cost comes from, you want to see how many nodes you have, how many pods you have, how many pods you're running per node, how much idle capacity sits on the nodes, and the CPU and memory usage of each container in each pod, and establish baselines for what's expected. Then when something goes up, you can address it right away. I showed a sketch of the cost-per-usage idea a moment ago.

These are the results in our largest pre-prod account. Our baseline cost was around $45,000, which isn't shown here because it was in a different account. As we migrated, the cost started going up, and our total cost that month was 2x what we expected, which was not acceptable. So we started investigating and applying optimizations, and the next month we got it down 50%. Since then it's stayed pretty stable. The cost still fluctuates with the workload we're getting, but it's been pretty stable since then.

So with that, I'll turn it over to Vasuki, who will discuss in more detail what we found and how we fixed it.

Thank you, Karim. While we were on this journey, cost was definitely one of the areas that had to be looked at immediately, and we found several areas of opportunity: where were we going wrong, and how could we account for it? The biggest was compute. Within compute there are many aspects: how you autoscale, your node utilization, your pod density. We also incurred a brand-new cost, data transfer, and I'll touch on why that new cost appeared. And as we adopted the latest platform, our monitoring cost and our developer infrastructure cost also went up. Those were additional costs we picked up during this whole migration.

Let's first deep-dive into compute, starting with autoscaling. What is autoscaling? If your application creates demand, pods get scheduled; while they're pending, the cluster autoscaler takes that request and creates capacity for you; nodes come up, and the pods get the resources they need and start running. That's the autoscaling lifecycle of a cluster, and it's the typical scenario for any cluster.

But we had challenges with autoscaling. For the API gateway to support that large scale of requests, we had to solve for rapid scaling and instant surges. The simple HPA we had configured never worked the way we wanted: either we were over-scaling or we were under-scaling, we kept having issues, and we kept having FCIs (failed customer interactions) because of it. CPU oscillated too much with a simple HPA, and we had to run much higher minimums just to accommodate the surges we would get. That by itself was too much: running higher minimums means adding cost to your infrastructure. So we wanted to solve this in a way that optimized the whole usage.

Before I get to the solutions: for any service, first know which metric you need to scale on. It could be CPU, memory, requests, connections, or any custom metric you derive; that should be the key indicator for your service before you start scaling. In our case, CPU was the bottleneck, and we had to look at how to optimize it. Knowing your target CPU, meaning the range in which your application operates most optimally, is crucial to how you autoscale. And avoid aggressive scaling: it has cascading effects on your infrastructure, including your monitoring, because it increases cardinality, and that leads to additional cost.

So the techniques we used were step scaling, which let us ramp up and ramp down gradually, and a pattern we established called the overcapacity pattern, which helped accommodate surges. Let me touch on step scaling first. We overrode the behavior of our HPA to scale in the most gradual fashion; in our case, the sweet spot was between 40 and 60 percent CPU. A sketch of that kind of behavior override follows.
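Here is a minimal sketch of an HPA behavior override along those lines, using the standard `autoscaling/v2` API. The targets, step sizes, and replica bounds are illustrative values, not Intuit's production numbers.

```yaml
# Sketch: step scaling via the autoscaling/v2 "behavior" stanza.
# All numbers below are illustrative, not production values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 10
  maxReplicas: 200
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60     # upper end of a 40-60% sweet spot
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent              # add at most 20% more pods per minute
        value: 20
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300   # wait before starting to ramp down
      policies:
      - type: Percent              # remove at most 10% of pods per 2 minutes
        value: 10
        periodSeconds: 120
```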
Looking at that step-scaling block on the slide, we are never scaling aggressively. We're trying to maintain our state so that we're operating at that target, and it's a very fine-grained, controlled approach. We did a lot of iterations to get these numbers so that we don't have capacity issues and we're not overscaling anywhere within this range.

The next topic I want to touch on is the overcapacity pattern. Say that in your cluster you have an application namespace and a set of shared nodes with the application running at scale. With the overcapacity pattern, we create one more namespace, called overcapacity, and deploy low-priority pods onto the same shared nodes the application uses. In the diagram there's only one such pod on the shared node: we schedule that pod so it takes the entire resource of the node, so only that one pod runs there. And to make a pod low priority, you set its priority class to the lowest level, so it has the least priority with the scheduler.

Now say an autoscaling event happens: there's a surge on the application, the HPA starts triggering, and I get a lot of pending pods. The scheduler is looking for resources. Without overcapacity, it goes to the cluster autoscaler, requests more nodes, and has to wait, and that wait period is exactly what kept hurting us, because the CPU would oscillate. With overcapacity, the scheduler evicts the low-priority pod already sitting on your shared node and makes space for the pending pods. All the pending pods get that capacity instantly: the time it took to accommodate the surge dropped to essentially nothing, because the capacity was already there. The application gets the capacity it needs, and the low-priority pods, now pending themselves, cause the cluster autoscaler to spin up additional shared nodes, where they get scheduled again. That means the overcapacity you run for your service is always available across the entire lifecycle of your application, always giving you that instant headroom. A minimal sketch of this pattern follows.
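Here is a minimal sketch of the pattern, assuming the common pause-pod overprovisioning approach; the names, replica count, and resource sizes are illustrative, not Intuit's actual manifests.

```yaml
# Sketch: "overcapacity" placeholder pods that the scheduler can
# evict instantly when real workload pods go pending.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overcapacity
value: -10              # below the default (0), so these pods are preempted first
globalDefault: false
description: Placeholder pods that reserve warm node capacity.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overcapacity
  namespace: overcapacity
spec:
  replicas: 2           # how many nodes' worth of headroom to keep warm
  selector:
    matchLabels:
      app: overcapacity
  template:
    metadata:
      labels:
        app: overcapacity
    spec:
      priorityClassName: overcapacity
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # does nothing; just holds the reservation
        resources:
          requests:
            cpu: "14"      # sized to roughly fill one shared node
            memory: 24Gi
```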
This was our before state, before we implemented step scaling and overcapacity. In the first graph, which is our HPA metric, the CPU keeps oscillating. The big block you see is the result of not getting that instant capacity, which is why our HPA kept requesting more and more pods than were actually needed. In the second graph, desired versus ready pods, you can see how long it took to even reach the state we needed to operate at; during that time we started seeing issues, because we weren't scaling to the demand. And in the third graph, we were still in the ramp-up stage, but the number of pods in the HPA graph had already started to drop off. That was a clear indication we were overscaling, because we couldn't hold our target at the state we wanted to operate.

And after we implemented this change: in the first graph, the HPA one, we never overscaled for that initial surge, because the overcapacity was protecting us. Desired and ready pods stayed together, the delta was minimal, and we never overshot. We didn't miss the target state; we stayed on target. And across the whole ramp-up of requests, we stayed in a steady state while our demand was met.

So what did we really save by doing this? It was a big saving: 25% of our compute cost, just from improving how we scale our services. At the multi-tenant scale we operate, that 25% is a big amount, saved just by improving how we scaled and accommodated the surges we got.

Moving on within the same compute area, another thing we had was underutilized nodes: cold and idle worker nodes. Say in your cluster you have an application namespace running its whole workload on shared nodes, operating at scale. Fine. Now a scale-down happens, and the number of pods the application needs starts going down. This is the state we got stuck in: we really did scale down the pods, but the nodes continued to stay. We weren't taking nodes off at all. They sat idle and cold, and the time to rebalance was anywhere between 10 and 12 hours, sometimes even days. We were paying for nothing, just for underutilized nodes, because we couldn't rebalance.

So the challenges were slow scale-down of nodes leading to inefficient resource utilization, and what contributed to it was a slow cluster autoscaler that wasn't taking nodes off quickly. The solution was simple: don't rely on the cluster autoscaler's default configuration. Look at all the configuration options that drive the cluster autoscaler and decide how they fit your service, so you can decide how hot you want to run your nodes. There is a set of settings responsible for draining nodes quickly: when to scale down, how long to wait before removing a node, and how hot to run your nodes. The third parameter, the scale-down utilization threshold, is the one where a higher value means you're pushing for maximum node utilization, running hotter, while your pods stay at the target state you want to operate at.

With these changes we got a better pod-to-node ratio, we were able to cordon nodes quickly and more efficiently, and the rebalancing of pods and nodes became quick and efficient. This change gave us another big saving, 15% on compute, cumulative on top of the autoscaling fixes; the default configuration had been leaving nodes idle. A sketch of the relevant autoscaler settings follows.
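For reference, here is a hedged sketch of the kind of cluster-autoscaler tuning being described. These are real cluster-autoscaler flags, but the values are illustrative, not Intuit's settings.

```yaml
# Sketch: excerpt of a cluster-autoscaler Deployment spec, tuning how
# aggressively underutilized nodes are drained. Values are illustrative.
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.3
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --scale-down-enabled=true
  - --scale-down-delay-after-add=5m          # how soon after a scale-up to consider scale-down
  - --scale-down-unneeded-time=5m            # how long a node must be idle before removal
  - --scale-down-utilization-threshold=0.7   # higher = run nodes hotter before keeping them around
  - --max-empty-bulk-delete=10               # drain more empty nodes per loop
```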
Let's touch on another thing within the same compute area: pod density. This one is a little interesting. Take a single node in your cluster, with its total node capacity. I have pod 1 scheduled, pod 2 scheduled, and pod 3 scheduled. Then there are certain system-level resources that need to be allocated, for daemonsets and other system components. And the top portion of the node, about 20% of its capacity, wasn't being utilized by anything; it just sat idle. When you operate thousands of nodes, that 20% is huge: you're looking at 300 or 400 cores wasted doing nothing. The main cause, the real challenge, was that we had been packing our pods inefficiently, so they weren't taking the full capacity of the node. A lot of the time you get sidecars automatically from platform capabilities; you need to resize even those. Never go with the defaults: always resize based on the instance type you operate on, so you can utilize the full capacity. And the sidecars you develop yourself need to be resized for your instance type too.

The solution was simple: for a given instance type, fine-tune your requests and limits and work out how many pods fit on that node. That should be the target you operate at. Once you reach that desired state, you know you're utilizing the full node capacity.

Continuing within pod density, selecting an instance type plays a very big role here. In our case, a larger node gave us better throughput, but it increased the blast radius: our error appetite was low, so we couldn't operate on very big instance types. We had to operate on instance types sized so that our FCIs stay low even if we lose an instance in a failure. Depending on the instance type, keep resizing your pods so you're not wasting any resources within the node. It all comes down to your service; in our case, availability is the biggest factor, so we size so that we get the desired throughput and operate that way. And always look for opportunities to upgrade to newer instance types that give better performance: we were on C5n before the analysis, moved to C6in, got a 15% performance improvement, and it improved our cost as well.

Here's one example of how we calculated pod density, for one specific instance type, a c6in.4xlarge, which comes with 16 cores. In reality the usable capacity is around 13,600 millicores, because the rest of the CPU goes to system components and daemonsets. Then we have all the sidecars (app, Istio, Splunk, or whatever sidecars you run), and each pod comes out to around 3,300 millicores. Pod density is then the usable compute divided by the compute per pod. That's a number you should have on your dashboard: in my ecosystem, I'll be running four pods on that given node. If the factor comes out somewhere like 4.5 or 3.5, you're doing it wrong and need to resize the pod again, because you're literally wasting half a pod's worth of compute. The arithmetic is sketched below.
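Here is that calculation written out, with a hypothetical split of the roughly 3,300 millicores across containers; the per-container numbers are assumptions for illustration, not the actual sizing.

```yaml
# Sketch: sizing pods to a c6in.4xlarge (16 vCPU = 16,000m).
#   usable      = 16,000m - ~2,400m (system, daemonsets) = ~13,600m
#   per pod     = app + sidecars = ~3,300m
#   pod density = 13,600 / 3,300 = ~4.1  ->  target 4 pods per node
# The request split below is hypothetical; tune it to your workloads.
containers:
- name: api-gateway
  resources:
    requests:
      cpu: 2500m
- name: istio-proxy
  resources:
    requests:
      cpu: 500m
- name: splunk-sidecar
  resources:
    requests:
      cpu: 300m
```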
That's what we kept computing, making sure we hit that whole number all the time; it was always our target. This was another big saving: just by resizing pods to utilize the full capacity of the node, we saved almost 10%, again cumulative on top of what we saved with autoscaling.

This diagram is the outcome of all the changes we made on the compute side. Say we get a ramp-up and have to accommodate a surge: we can scale pods up quickly, because the overcapacity is protecting us; nodes scale up because demand is higher; and we match the demand. Then say we get a ramp-down at some point. At that point the pod density will drop, because we bring the pods down very quickly while the nodes linger. The goal, and in this case the result, was an optimized scale-down, because we updated our scale-down settings, and the nodes also get rebalanced. The node graph is crucial: you should see nodes going away quickly while your pods are also going down. The outcome was this quick rebalance. The time it used to take was anywhere from 12 hours to a day or more, and it never reached the state we wanted. That delta was the entire saving we achieved from autoscaling and the rest of the compute work.

Moving on: data transfer. As Karim mentioned, during this migration journey we also onboarded to Istio, and at the scale the API gateway operates, Istio brought a new cost we never knew we would incur. You have a control plane and an application namespace, and all the Istio proxies make connections to the istiod instances sitting in that control plane. Now imagine thousands of pods in a given cluster: they all connect to these istiod instances, and those connections can go all over the place. There will be cross-AZ connections, and they have to keep communicating because they're passing configuration around.

So the challenge was that we hit high-volume data transfer because of this change, and at first we had no observability into it. We established that observability first: which namespaces are contributing, which pods are doing it. We did that deep-dive analysis. Then the solution was simple: we just had to annotate our services with topology-aware hints. You annotate the Service object that sits in front of istiod so that it establishes AZ awareness, and the connections and the data transfer stay within the same AZ. Starting with EKS 1.27, the annotation was updated to topology mode; both forms are shown below.
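Here is a hedged sketch of that annotation on a Service. The annotations are the real Kubernetes ones, but the Service shown is illustrative rather than the exact istiod manifest.

```yaml
# Sketch: keep endpoint traffic within the same AZ via
# topology-aware routing. Service details are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: istiod
  namespace: istio-system
  annotations:
    service.kubernetes.io/topology-aware-hints: "auto"  # Kubernetes <= 1.26
    # On Kubernetes/EKS 1.27+ this is replaced by:
    # service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: istiod
  ports:
  - name: grpc-xds
    port: 15010
    targetPort: 15010
```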
Also on this migration journey, we moved to a new monitoring solution: we ended up on Prometheus. Every cluster we operate has a different workload and different capacity requirements; everything is different, and we deploy Prometheus in each cluster for all our monitoring purposes. And we started seeing increased monitoring cost: our metric cardinality went up because our scaling was too aggressive, and we were scraping too many irrelevant metrics.

The first thing we did was take a T-shirt-sizing approach. We T-shirt-sized all our Prometheus instances to match their actual requirements, so we could run them on the right instance types. We controlled cardinality by protecting the system with limits on the metrics we accept, so that any offenders are dropped at the source itself. And we filtered out a lot of unused metrics; very often, far too many metrics get ingested that are never used, so we started filtering them. There's a simple example from one of our ServiceMonitor CRDs, sketched below, showing how you can drop metrics from a monitor and also cap how much a scrape can bring in. This freed up a lot of Prometheus capacity we had simply been wasting. It was a significant saving from small changes: 40% off our Prometheus cost, because we did the right T-shirt sizing to operate the gateway.
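Here is a hedged sketch of that kind of ServiceMonitor filtering; the metric patterns and limits are illustrative, not the actual configuration.

```yaml
# Sketch: drop never-used metrics at the source and cap scrape size.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-gateway
spec:
  selector:
    matchLabels:
      app: api-gateway
  sampleLimit: 50000              # fail the scrape if a target exceeds this
  endpoints:
  - port: metrics
    interval: 30s
    metricRelabelings:
    - sourceLabels: [__name__]
      regex: "jvm_buffer_pool_.*|jetty_.*_total"  # hypothetical unused metrics
      action: drop                # dropped before ingestion
```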
The other thing was developer infrastructure cost. We're now on a platform with a lot of developers contributing to it. They develop APIs, controllers, operators, and add-ons, and they have the independence to create clusters for testing and certification. So many clusters were getting created with no hygiene process around them, and we had hundreds of dev namespaces that were equally unhygienic, with a lot of unused resources. To solve this, we first created observability for every cluster that was created but not managed; that was the key. We wanted metrics for everything, so we knew how many clusters were being created. Then we built an internal tool, never outsourced, that we call Hangman. In reality it just pauses a cluster: it brings down all the worker nodes and keeps the control plane up and running. The key here is that we didn't want to impact developer productivity, and we did it in a way that doesn't create much operational churn for them. We established the same pattern with a PodSnoozer and achieved the same thing there. This was a big saving in developer infrastructure cost, 50 to 60%, and it was a seamless implementation for all the developers contributing to the platform.

So that was our presentation on how we were able to save in these areas. There are many other areas where we saved as well; we cut a lot of cost in WAFs and logging, and we're still optimizing and looking at more opportunities.

Some takeaways from this presentation. First, design for cost optimization at the time you're designing your system; that's the key. Second, a quote I picked up: don't throw money at the problem, throw ideas at the problem. You do get short-term gains when you throw money at a situation, but try to go back to the drawing board and see how you can actually fix it. And third, the most important: you need observability. Don't rely on the default metrics you get from any monitoring solution; derive your own metrics and make sure you always watch them, because when a metric starts deviating, you're already seeing the problem, and the cost will follow.

Thank you. This is the QR code for our booth; please do visit and share your feedback using it. And we have a minute for any questions.

Yeah. So the question was: what's our process for arriving at requests, limits, and HPA thresholds? That's a very good question, and the answer is basically a lot of testing. It's hard to come up with those values right off the bat. We knew roughly what we had on EC2, but it doesn't map exactly, so we had to run a lot of testing, and even in production we kept observing how much CPU we were using for different workloads and adjusting. It's also very important to watch the CPU throttling metric to see if you're actually getting throttled, because we didn't have that at first. So thank you for the question. I think with that we're out of time. Thank you, everyone. Thank you.