Hello everyone, welcome to our talk. We're here to share our experiences on how we reduced DevOps toil, eliminating the patching and upgrading toil with the cluster autoscaler.

So, I'm Shaik Israel, and I go by Israel. I've been with Oracle for the last three years, and I'm just starting out in the Kubernetes world, and I'm actually really loving it. So I'm a Kubernetes enthusiast, and I'm very passionate about solving distributed systems problems. I'm really enjoying this. This is my first conference, by the way, and I'm very excited to be here, to see so many talks and meet so many people. So thank you, KubeCon, for arranging this.

Hey folks, my name is John Moore. Most people call me JMO. I'm an SRE by trade, probably somewhere close to 20 years now in the industry. I started off in networking, stumbled my way through programming, and eventually got to, basically, data stores. I don't know what it is about them, but I fell in love with them. Loved MongoDB; I don't know what it was about it, but document databases really turned the page for me. Before I knew it, I started playing around with Kubernetes, and honestly, I'm never gonna look back.

So first of all, a little bit about us and why we're here. Israel and I work in OCI, that's Oracle Cloud Infrastructure. We operate the service known as OSS, the Oracle Streaming Service. Much like Kinesis in AWS, we offer a fully scalable streaming environment that customers can use for all of their data-in-motion, real-time use cases. Behind the scenes, it's all run on top of Kubernetes, with a lot of stateful sets. We deal with customers' data, and we take that very, very seriously. So whenever we're talking about scale, we're usually talking about maintaining uptime and availability of these back-end systems, but we're also talking about the sheer number of regions that we offer our service in.

So, a little terminology. The cluster autoscaler is typically called the CA. I call it the CAS, because sometimes when I hear people say "I'm gonna go rotate a CA," I just have, like, a minor cardiac infarction. I'm sure some of you folks know what I'm talking about; this guy's shaking his head, he knows what's going on. Scares you, right? So I just call it the CAS, and it makes explaining stuff to people so much easier. So if I say it and I sound wrong, I'm just weird.

OKE: if I drop that, it's our version of a managed Kubernetes environment, same thing as AKS, EKS, GKE, that kind of stuff. And then node pools, very similar to AWS ASGs: a way for us to templatize the different types of worker nodes we're gonna have in our cluster, and a way to facilitate scaling up and down. That's our interface for doing so.

So today we're going to talk about a few things. First we're gonna dive into what our patching requirements are. Then we're going to talk a little bit about a tool that we developed to automate our patching. And as we grew, the scale that JMO was talking about, we expanded into a lot of regions; that's the scale we're talking about, we had to be present in a lot of regions. With that came a new set of problems, which brought in a new set of requirements. Then we moved on to figure out: okay, how do we fix this automation? How do we manage to automate all of our security patching at this scale?
We're then gonna talk a little bit about the impact we had with this new solution, and then we're gonna deep dive into how we implemented it.

So, let's talk a little bit about the requirements we had. Every month we have a deadline to patch all of our machines with the latest monthly security image. This image comes with the latest security fixes that all of our machines have to run to meet compliance requirements. All of these worker nodes in our clusters have an in-house tool that automatically patches the machines, and we wanted to move into a world that would automate all of our patching for us. We wanted it to be simple, so that it could be operated in disconnected regions, and we wanted it to be hands-off.

So I'm gonna talk a little bit about the solution we first had. Round one, fight! (That's a Mortal Kombat reference.) What we did was we created a tool that got deployed outside of the Kubernetes cluster. This tool would collect a list of all the nodes we had in our clusters, and then go and sequentially patch all of them. It was deployed as a privileged pod on the nodes, and it would run the in-house tool that we already had. This tool would also follow the Kubernetes semantics for doing maintenance on a node: it would first go drain the node, then run the update, and this update would automatically reboot the machine. When the machine came back up, it would get uncordoned, and all the pods that were pending would go back to the node.

With this tool, though, it would take 25 minutes to patch one node, during which time the pods running on the node would be unavailable. And as I was saying about the scale, OCI grew very fast and spread very fast into many regions. We did scale, but our deadlines did not scale, right? We had to do our patching within the fixed deadline, so at times the deadlines were too close for comfort.

The other problems we saw with this tool: sometimes the nodes would not come back up, so it would just be stuck, waiting for the node to come back healthy, because it would not be able to uncordon. It would wait on that stuck node, and we would have to go and check: oh, what's the status? We would go look at the graphs and see, oh, this thing is stuck because the node did not come up.

With the scale that we grew at, we also started to get a lot more compute maintenance notifications. We would get a notification from compute saying: hey, this node needs to be replaced, or it has to be rebooted. So we ran into a lot of these kinds of cases, where we had to go and update our nodes like that from time to time. One other problem was that with the compute maintenance, if we rotated a node out, the node that came up would be on the old image version, and then we had to go and patch it again. From time to time we also had to deal with a lot of Kubernetes upgrades, and at this scale it became a challenge for us.
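For concreteness, here is a rough sketch of what the per-node mechanics of that first-generation tool might have looked like. This is purely illustrative; the names, image, and command are hypothetical, since the actual in-house patcher wasn't shown in the talk:

```yaml
# Hypothetical sketch: a privileged pod pinned to one node, running
# the in-house patch tool against the host. The external controller
# would drain the node, create a pod like this, wait out the reboot,
# then uncordon the node, one node at a time.
apiVersion: v1
kind: Pod
metadata:
  name: patcher-worker-01
spec:
  nodeName: worker-01          # pin directly to the node being patched
  hostPID: true                # needs to see host processes for the reboot
  restartPolicy: Never
  containers:
  - name: patcher
    image: internal-registry.example/os-patcher:latest   # hypothetical image
    command: ["/usr/local/bin/apply-security-patch"]     # hypothetical tool
    securityContext:
      privileged: true         # required to patch and reboot the host OS
    volumeMounts:
    - name: host-root
      mountPath: /host
  volumes:
  - name: host-root
    hostPath:
      path: /                  # mount the host filesystem for patching
```

At roughly 25 minutes per node, run sequentially, you can see how this stops scaling once the node and region counts grow.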
So my friend James Satterfield and I, we've been SREs for quite a while. We tend to sit down every once in a while and have one of those reflective moments, right? We ask a lot of whys when we see something. We're like: why do we do this? Why do we put up with this? It's 2023; we should do something more modern. And at the same time, we also realized that trying to introduce big changes can be scary. But we don't have to reinvent the wheel. You know, I might want to put new tires on my car, but I don't want to have to go get an oval instead of a circle, right? I want to be able to use something that other people use. I want things to be such that if somebody joins our team, it's not just some internal tool that's kind of hard to reason through. And more importantly, James and I had actually done something like this in the past, not really for security patching, but when we really thought about it and sat back, we were like: wow, wouldn't this work here too? And that's where the CAS comes in.

One of the other things that we try to do here: I think most folks have adopted a replace-versus-repair mindset. It's kind of tough to troubleshoot some issues sometimes, and you really don't want to have to keep maintaining these machines over and over and over again. We actually had machines with an age of, like, 768 days. They'd gone through two kubelet cert rotations. At that point, it's kind of like: okay, we should probably let this thing go off to greener pastures.

The other big thing that we wanted to accomplish: we're a big fan of surge. We really want to favor bringing up something new before we touch something that's existing and dealing with customer traffic. It's about minimizing the impact, not only to ourselves and our alarms and our alerting, but also to our customers.

Kubernetes upgrades: I'm sure folks have gone through many of them at this point. I don't know what the oldest Kubernetes runtime is that somebody here has run, but I'm in the single-digit ones, you know, like 1.2, 1.3, somewhere in there. Anybody remember PetSets? Yeah, this guy. Okay, minions? Remember? Anyway, I've been doing Kubernetes for a while, and upgrades are relatively easy if you can keep them going fast enough, and that's one of the issues we had; you know, that two-version bump was a bit of a problem.

And these node maintenances that Israel talked about, they were coming up all the time, and we really should be able to handle this. If a compute notification comes in, why doesn't our system just get rid of the node?
Additionally, I think everybody has been trying to focus on cost, and we knew that we were over-provisioned in several places, several situations. We had several machines that had a little bit of room left, and we really could use Kubernetes to bin pack a little bit better. We just needed our infrastructure to not be so rigid. And most importantly, we needed this to remain simple enough, because in those disconnected regions, where we physically cannot see or operate, we want to make sure that we are good providers to our customers there, who help us maintain and operate our systems in those disconnected regions.

So my buddy James and I decided to write up a little doc and propose something. We called that Project Ecdysis; bonus points if anybody knows what that means. Basically, this was the next stage: round two, fight!

Now we can leverage the cluster autoscaler for all that it's capable of, you know, the standard stuff: it goes ahead and brings up a machine if you don't have enough places to run a pod, if you don't have enough resources defined on the existing infrastructure that's there.

We're huge fans of PDBs. We think they're a great feature, and we emphasize that when engineers want to add new services to our fleet: please think about pod disruption budgets. We are going to have to shoot one of these services in the head, but let's maintain availability and consistency while we're doing it.

These replacements of the fleet: it might be an OS image, but what if it's something else? What if I wanted to change the CPU architecture? What if I wanted to change the network that it's running on, like a CNI type? What if I wanted to change basically any metadata about a node? We should be able to turn that into some sort of trigger. Kubernetes upgrades, that's definitely something we'd have to deal with as well. And most importantly, we needed this thing to parallelize, because the sequential nature, once we really got up to par in some of our bigger regions, was going to kill us. And most of all, and I always joke about this when I talk about it with internal teams: I want to just do drinks on a beach. Man, that's the kind of stuff that I want to run. I want to be able to just sit back and watch my systems run themselves.

So when we implemented this, we were kind of shocked at how fast it actually went. Using the cluster autoscaler the way we did, we went from five days in our biggest regions, with many hundreds of nodes, down to a matter of hours. In some regions it actually finishes before you could go make a cup of coffee. The pod pending time is a huge thing for us. We no longer had to touch an existing service; we got to basically try-before-you-buy with all of our new infrastructure. This also opened up room for us to go back and say: hey, we have an alarm that fires if a machine is going to take a while to come up; maybe we could tighten that, maybe we can actually get a little bit closer to that kind of real-time infrastructure workload that we were looking for. I mean, we moved from 1.18 all the way to 1.25 in, probably, what, less than a month? That was pretty crazy.
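Going back to the PDB point for a second: here is a minimal sketch of the kind of pod disruption budget a service team might define before joining the fleet. The names and numbers are hypothetical, not from the talk:

```yaml
# Hypothetical PDB: never let voluntary disruptions (like the cluster
# autoscaler's evictions during a rotation) drop this service below
# two ready replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: streaming-broker-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: streaming-broker
```

Because the cluster autoscaler drains nodes through the eviction API, a budget like this is what keeps a scale-down from taking out too many members of a stateful set at once.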
And the other thing is the adaptive infrastructure. By using the cluster autoscaler, we gained the ability to scale out as our services needed to scale themselves up, which means we could start doing things like HPA, KEDA, all the fun stuff. The other thing was the migration: when we moved to this, we introduced a completely different infrastructure layout, but none of our services internally had to deal with it. We just changed our node selector. Nobody even knew that we moved to a completely different runtime.

So the next big question was: how did we achieve this? We were in a place where we wanted to adopt a lot of cloud native practices. As JMO mentioned earlier, we were thinking about implementing the cluster autoscaler, and we also used the concept of taints and tolerations. We introduced a new way to version our hardware, using a rotation hash. We're gonna talk a little bit about the infrastructure layout that we had and how the rotation hash is integrated into it. We're gonna talk a little bit about the rotator pods; these pods are basically deployments that we inject the rotation hash into. Then we'll touch a little bit on how we use preStop hooks to make our service more resilient, since we are operating a lot of stateful sets.

So, what is the cluster autoscaler? There are three basic things that it does: scale up, scale in, scale down. It's as simple as that. Israel put together a cool animation here; I'll talk through it. When we scale up, it's typically: I make a pod, I ask for it to be created, and if I don't have the resources to schedule it, the kube-scheduler's like, man, I don't know what to do. Well, the cluster autoscaler is sitting there going, hmm, I want every pod to have a home, so let me go find the right place in my infrastructure to scale this up.

Just curious, show of hands: who here has run the cluster autoscaler? Awesome. It's pretty simple, but there's beauty in that simplicity. You don't really have to overthink every problem; the folks behind this put a lot of time and energy into it. So this scale-up mechanism that we use, just running through it real quick so you can see it: it's going to go ahead and provision a new node whenever a pod can't be scheduled, then it's going to get that node ready, and once it's ready, your pod is running on it. Simple as that.

The scale-down brings a little bit more fun. First of all, when the cluster autoscaler wants to scale something down, you're typically going to deal with something called the minimum utilization percentage, which says: at what utilization does a node become eligible to be gotten rid of, right? I think the default is 50%, I think that's right. But what if you made it a hundred? What if every node in the cluster could be deleted at any point in time? Well, that's the question that we asked, and so we tried to do it that way. So when the cluster autoscaler wants to scale down a node, it's going to look and say: hey, hypothetically, if I were to get rid of this node, can all the pods here go somewhere else? And if they can, it'll go ahead and start rescheduling them somewhere else, utilizing evictions. Boom, checkbox number one: got my PDBs in action. So after the cluster autoscaler goes through this routine, that node is going to end up being empty.
That's the last step: once you have an empty node, you really don't need to keep this thing around anymore. Pretty slow slide; we've got to work on some animations.

So, cluster autoscaler configuration. There are way more options than I want to go over here, but these are the big ones that stand out to me, and I want to call them out.

Scan interval: as your clusters start to increase, and the complexity of your scheduling goes up, you really need to start looking at that scan interval. So far we're at 10 in most of our regions, but I've had to start logging and even monitoring it, to see if we need to go up to 20. One of the big key takeaways there: keep track of how long it takes your cluster autoscaler to run through your fleet, and ask those what-if questions. The bigger your clusters get, the more you need to pay attention to this.

Secondly, the new pod scale-up delay. How many folks have done a deployment where maybe one of your pods has a slower-than-normal shutdown time? Right, something where it just doesn't go away as fast as you wish it did. Well, you have another pod that's going to come and fill in on that node as soon as that guy goes away, maybe due to an anti-affinity or some other resource constraint, right? Well, now I have the cluster autoscaler just churning nodes, just burning our infrastructure, adding and removing, adding and removing. I really recommend this new pod scale-up delay. We used the flag at first when we did this, but they do support annotations, and I highly recommend those as well: using an annotation, different pods can have different characteristics for whether they trigger a scale-up or not.

Balancing similar node groups is a concept that we used heavily here as well, because of those stateful sets. We were able to make sure that when we moved over to having, say, three node pools that all looked the same, they each scaled up individually, so we had good striping across availability zones and fault domains (that's a concept inside OCI for single-AD regions). So we definitely wanted to set that to true.

But as Israel mentioned, we actually ended up using taints and tolerations to do a lot of this cool work, so you need to tell the cluster autoscaler: this taint, you don't need to worry about it; it should not be factored in when it comes to scaling your nodes.

Balancing ignore labels: this one is interesting, because whenever you're getting infrastructure that has labels already applied for you, they may not be part of the labels that are defaulted, or, you know, acceptable to the cluster autoscaler when it decides whether a group is balanceable, whether it's of the same group or set. So make sure you take a look at those labels in order to set this up.

Scale-down utilization: one, right there. Every node is always eligible to be scaled down. And the reason we want that is because this simple rotation that we're going to do needs to be able to run on top of nodes that are perfectly bin-packed by design.

The next thing is the scale-down unneeded time. We turned this down because we actually got to the point where we trusted this thing so much, we kind of wanted it to run at, like, warp speed. So we started tuning all of these things down.
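Pulling those knobs together, here is a hedged sketch of what the corresponding cluster-autoscaler container arguments could look like. The flag names are real cluster-autoscaler flags, but the values, the taint key, and the label are illustrative guesses, not the exact production settings from the talk:

```yaml
# Illustrative cluster-autoscaler container spec fragment.
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.3  # example tag
  command:
  - ./cluster-autoscaler
  - --scan-interval=10s                      # how often the fleet is re-evaluated
  - --new-pod-scale-up-delay=60s             # don't react to brand-new pods instantly
  - --balance-similar-node-groups=true       # stripe the look-alike node pools evenly
  - --balancing-ignore-label=example.com/pool-id    # hypothetical provider label
  - --ignore-taint=example.com/rotation-hash        # hypothetical rotation taint key
  - --scale-down-utilization-threshold=1.0   # every node is always eligible
  - --scale-down-unneeded-time=2m            # trust it; run at warp speed
```

The per-pod version of that scale-up delay is the `cluster-autoscaler.kubernetes.io/pod-scale-up-delay` annotation, which lets different pods carry different delays instead of one global flag.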
So the next thing we're going to talk about is taints and tolerations. I like pod affinity: it attracts pods to nodes. Taints are the opposite: they basically repel pods from the nodes. This demonstration here shows a little bit of how taints and tolerations work. As you can see, the green pod is only going to be scheduled onto the node whose taint it tolerates, right, the green one.

So we came up with a new concept; we called it a rotation hash. With this rotation hash, we basically thought: why don't we try to version our infrastructure? We want to move from one version of infrastructure to another version of infrastructure. Every month, for example, we had to go through the OS patching cycle, and since the image would be different every month, that would be a component that told us: hey, you need to upgrade your infrastructure. So we came up with a couple of parts that we could use for versioning our infrastructure. For example, the OS image. The Kubernetes version: if you wanted to move from one Kubernetes version to another, that also required us to rotate out the whole fleet, so we added it as part of the rotation hash. Similarly, moving from one architecture to another architecture, and, from time to time, cloud-init and all that.

So this is a simple example of how we implemented the rotation hash as part of our monthly patching cycle. Since we have a different OS image every month, you can see on line 28 that the image is the October OS image, and it calculates a rotation hash; as you can see from the rotation hash part, we just calculate a SHA of it, and we get a rotation hash out of it. Now let's evaluate and compare it with the month of November. In November, the OS image changes, and I'll quickly show you the difference between these two months: on the top you can see the rotation hash for October is different from what we have in November, but the only field that changed was the OS image. This basically helped us version our infrastructure, and we applied it to our security patching.

So, talking a little bit about preStop hooks. Running a stateful set is kind of hard at times; you have to deal with what Kubernetes does, and doesn't, know about your cluster's state. Now, we hadn't gotten to the operator stage for our workloads, due to some unique constraints, but what we could do is utilize a lot of the Kubernetes tooling that's already in place, such as preStop hooks. What if we could run a script that informs Kubernetes that we can't quite stop yet? Like, we're in a state where everything looks cool to you, but inside, I need a little cleanup, or I need to make sure some data is replicated first. That three-way-replicated data store is one of the things we use, and we noticed that, from time to time, during certain failure scenarios, we might not want more than one node leaving the cluster at a time. So these preStop hooks became mission critical to maintaining our uptime.

The second thing you've got to take into account, though, is the termination grace period. You've got to tell Kubernetes that it's going to take a little while for this pod to shut down, and then what's unreasonable. In some of our cases we actually started to push this up towards six, seven, maybe eight hours, because we had alarming that said: hey, if you have a pod that's stuck in terminating, and you're in the middle of your preStop hook for a while, which is constantly emitting metrics, we should probably get alarmed and page somebody in to take a look at what's going on. Something is atypical, and we'd like to react appropriately.
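Here is a minimal sketch of that pattern, with a hypothetical script and numbers; the real hook and alarming were not shown in the talk:

```yaml
# Hypothetical stateful-set pod template fragment: block shutdown until
# the data store says it is safe, and allow hours, not seconds, to finish.
spec:
  terminationGracePeriodSeconds: 28800    # up to 8 hours before a hard kill
  containers:
  - name: broker
    image: internal-registry.example/broker:1.0   # hypothetical image
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          # Hypothetical script: waits until replication has caught up,
          # emitting metrics the whole time so a stuck hook pages someone.
          - /scripts/wait-for-safe-shutdown.sh
```

Kubernetes runs the preStop hook before sending SIGTERM, so an eviction from the cluster autoscaler cannot take the pod away until the hook returns or the grace period expires.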
So, basically, we're going to try to put this all together. As Israel mentioned, we go and calculate this rotation hash. Now, that rotator pod he mentioned, that's just the application side of it; the infrastructure side is really key, and it's governed by Terraform. And that was the other aspect of this: how can we do this without introducing new tools, with what's available to us? Like, again, there's beauty in the simplicity. SHA-1, as an aside, is just a simple hash function that's able to tell us that things are changing. So we pass that to our nodes, so that when they come up, they'll have an extra arg that says: hey, start up with this taint. Which is going to repel certain pods.

So there's this rotator pod deployment. Well, hold on a second: how does our pod know about the infrastructure version that it should target? Again, we utilize Terraform. We have to use Terraform as part of our deployment mechanism anyway, so why not just use a data source? Each one of our node pools basically gets a sister, or companion, rotator pod. At the time of deploy, it asks: what is the current infrastructure version that this node pool would spin up if it were to add a new node? Add that as part of the deployment, and make sure that when it goes out, it causes that node, or, excuse me, that pod, to not go on any existing nodes. It forces the cluster autoscaler to get involved.

One of the other things here is that max surge. We don't really care about the rotator pods themselves, but we do get that max surge capability that we're looking for: I get to try it before I buy it. And we've all been there; I'm sure people have tried to make changes to their infrastructure, and everything works in dev, and then you try to bring it up and your node doesn't join in time, or your node comes up with something unique. We have all been there. I'd really like to maintain not touching my existing software, and this worked beautifully.

One big aspect, though, is the progress deadline. For those not familiar, a deployment will basically time out; the deployment controller's like, man, this is taking too long, I'm gonna give up. We had to bump this up a fair bit, and that's okay for us, because this particular rotator pod is the only one we put that on; for everything else, we alarm if we can't get to that progress deadline in the correct amount of time.

Now, Helm is one of the ways that we deploy our apps. So again, same tools we already use today: just a template variable gets thrown in as a toleration, and these rotator pods look for the exact match of this particular hash. That's what couples our infrastructure to this rotation. Again, super simple. What's also cool is that all of our existing fleet is already ignoring that same rotation hash, from an application perspective, so we don't repel any of our other pods, just the rotators.

So Israel's gonna run through a quick little demonstration of this deployment here. But imagine that this is, at the same time, also flowing through the fleet and upgrading our infrastructure. We're using that max surge, and we're gaining the ability to have new infrastructure come up that's vetted before we move forward.
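Here is a hedged sketch of what such a rotator deployment might look like. The taint key, names, and replica counts are hypothetical stand-ins; the rotation hash itself would come out of Terraform (for example, its sha1() function over the OS image, Kubernetes version, architecture, and cloud-init, as described above) and be passed in as a Helm value:

```yaml
# Hypothetical rotator deployment: one per node pool, pinned to the
# *new* infrastructure version via an exact-match toleration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rotator-pool-a
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 1          # bring up a new-version pod before removing an old one
      maxUnavailable: 0
  progressDeadlineSeconds: 3600   # nodes take a while to join; don't give up early
  selector:
    matchLabels:
      app: rotator-pool-a
  template:
    metadata:
      labels:
        app: rotator-pool-a
    spec:
      tolerations:
      - key: example.com/rotation-hash         # hypothetical taint key
        operator: Equal
        value: "{{ .Values.rotationHash }}"    # SHA-1 from Terraform, via Helm
        effect: NoExecute
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9       # the pod only needs to exist
```

When the hash value changes, the rolling update creates a pod that no existing node will accept, the cluster autoscaler surges in a node carrying the new taint, and the old rotator pod, and eventually its node, drains away.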
So, to bring it all together: let's say we have two nodes here, right, and for a particular node pool, all the nodes are tainted in yellow. Now we do a rotator pod deployment; currently the rotator deployment is on v1. As we move to a new version of our infrastructure, let's say v2, we get a new pod that is pending, which cannot go onto any of the nodes because of the toleration it has; it has a new toleration, the pink one. Now the cluster autoscaler sees that, oh, this pod is pending, let's find it a home, and it brings up a new node. You see a new node come up with the pink rotation hash, and now the pod can actually go and settle in there. Once the pod is settled in, the deployment progresses further, so it removes one of the older-version rotator pods.

Now, as JMO said earlier, the cluster autoscaler comes into action and starts its second operation, the scale-down. The cluster autoscaler sees that there are a couple of nodes, and tries to figure out whether their pods can live somewhere else. It evaluates them and sees that these pods can be moved to another node, and since they can be moved, it says: I do not need this node anymore. So that node goes away; it's scaled down by the cluster autoscaler. The cluster autoscaler does this for all the nodes in the system: it evaluates the previous node, this node, and performs this operation on all the nodes.

I want to add a little bit more on that, in closing. Most folks probably run the cluster autoscaler on a dedicated fleet of nodes, outside of the cluster, or at least outside the purview of its own ownership. We tested this heavily, and one of the core requirements we had was to make sure that it's not just the nodes the cluster autoscaler is managing that can be replaced; what about the cluster autoscaler itself? Running those as a deployment with multiple replicas was key for us reaching that goal. At this point in time, there isn't a single node in any of our fleets that isn't governed by the cluster autoscaler, or capable of being killed at any point in time.

So, this is our talk, and I wanted to thank you all for coming out. Really appreciate it, and really appreciate the support. It is also my first KubeCon as well, so thank you so much for your time. I really hope we get some questions, too. Thank you.

Thanks for the great presentation. A question that I have is about the testing process that you had for this; I assume it's going to be challenging to do, and I would love to know a little bit about that. And another question that I had: what would be different in your deployment process if you wanted to do everything in native Kubernetes?

Okay, so those are great questions, or the two that I heard there. First of all, the testing process. Because we didn't actually introduce any new componentry to the system, it wasn't that we needed to go through extensive validation of existing or new tools; we could basically use the same workflows. But what we did start doing is we made sure that this thing runs all the time in several of our testing environments. We're constantly running just a little pod, think of it like a cron job,
that's going through and changing these tolerations. One aspect of the slides you might have missed is that we use something called NoExecute on that taint, which means that when you have a mismatch, we kick off that rotator pod. So as soon as it gets touched, say, for example, we just touch it and say "replace me," we start the process. So we're constantly running this thing through its paces, looking for any regressions that get introduced into the system. We have regions that are being built tens, hundreds of times a day, if not more.

So secondly, what makes this unique to OCI, or, let me rephrase that question: is there a way to use this in any other cloud? And the answer is yes. Because we're using just simple Terraform, which most people are using today to manage their infrastructure, you have access to the same hashing functions. You have access to that same concept. If you really think about it, there's probably a way that you look at your infrastructure today and say: oh, there's this class of nodes for these things, and this class of nodes for those. Think about what those rotation hash parts mean to you. They're unique to us, and we're constantly adding to them; the other day I think we added, what, three or four new things, node labels and all sorts of stuff. It's actually pretty simple, and that's where the beauty of this is, and that's why we wanted to share it. You should be able to look at your infrastructure and really break it down into those small component parts that may change from time to time.

Yeah, I'd like to add: we got a lot of additional side benefits with this. Once we had the system in place, it was easier for us to add rotation hash parts, like we talked about. If you wanted to move our fleet from one architecture of machine to another architecture of machine, we could do it with one deployment, and it would be done; all of your fleet would just be migrated from one infrastructure to another. And it would do it with the max surge and, like JMO said, very importantly for stateful sets: you get a new machine first, and it will not touch your stateful set before you get that new machine. That's very important. Yeah, great question. Any other questions?

Thank you. Yeah, I'm curious about your work organization there. It sounds like you guys are the, you know, innovators, the creative thinkers, right? You're coming up with new ways of doing things. How many people are on the team that runs this infrastructure, and what kind of support do you have at, like, the system administrator level? I mean, obviously you have a lot. You know, like, I work in an academic setting where I kind of wear all those hats, and it can be a little bit difficult to have enough time to do all those things. So how much of your time do you get to devote to this innovation, and how much do you just have to do the grunt work of watching pods spin up and spin down?

Great question. So, as Israel mentioned back in the talk, we used to have folks who had to sit there and watch graphs as we rotated our fleet with that manual tool, 24/7; we had to have somebody there. If it was off hours and somebody wanted to go out to eat or something, they'd have to turn off the rotation. We'd have to slow down.
We'd lose time. Right, when you have a system like this, and hopefully you have things like Prometheus and Alertmanager and the various other alarms that are available, you can get notified if it gets stuck. And the concept of getting stuck, at least in this particular case, is typically a situation of: I've run out of compute, or I got some weird transient error from, you know, my cloud provider that says "pool's closed, man." Those are basically the things that have held us up. But for the most part, we don't actually have to watch this anymore.

Tune your alarms would be my biggest advice. You probably have alarms built around the way that you're currently thinking about your infrastructure today. Once you truly break yourself of that existing model, step back and find out that, say, three minutes is your average spin-up time, then alarm yourself at five. You know, tailor that closer down to (a) your workload and (b) your runtime.

As far as the hats: I wear many of them, mostly my favorites. But the real thing is, we're all over the place. We don't have that kind of dev-versus-ops mindset, so we constantly have the ability to say: hey, I'm currently writing software that manages my infrastructure one way, or manages this part of my application another way. And we have that kind of luxury, where we're at, to be able to take a step back and think of things from a different lens. I think that's actually one of the things that I really enjoyed about working with Israel: I don't know if you saw it in our opening slides, but he's a software engineer, and I am not a software engineer. So I hope that answers your question.

It does, thank you so much.

We don't have to wrap up right now; we are happy to take questions off the stage. Did you have one more? I think we've got enough time for one more. Yeah. Oh, excuse me, I couldn't hear you.

I said, a question about the taints and tolerations. So I was a little confused. When you boot up that new node group, right, it has a taint, yes, the new hash. And that first pod, right, it has a toleration for that new hash, and therefore the autoscaler can then put up that new node. But the old workloads, they don't have the toleration for that new taint, right? How do you get that?

Great question. So, tolerations can be based on key and value, or they can be explicitly just key. And there's also a third piece, which is the effect: whether it's, like, NoExecute, PreferNoSchedule, or, you know, NoSchedule, stuff like that. So specifically, all of our apps already trust that rotation key, and they're willing to run on it regardless of the value. That's the magic: remove that value field from your definition of the toleration, and now, where it's like a three-part tuple, you only care about two of the parts, the ones that are always static.

I see, so in combination with that setting, because I saw here you have, like, a hundred percent utilization?

Yes. So the cluster autoscaler, I might have glossed over it a little bit: the cluster autoscaler says, can I hypothetically move all the pods on this node? Then it's eligible for scale-down. If one of them can't go away, your new nodes never leave you, because that new pod can't live on your old infrastructure; it doesn't trust that hash. Does that make sense? Yeah, okay. I mean, it's so simple, but it actually worked remarkably well for us. Great question, man. Thank you so much.
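A minimal sketch of that key-only toleration, using the same hypothetical taint key as before; every ordinary workload carries something like this, while only the rotator pods pin the exact value:

```yaml
# Hypothetical toleration for regular application pods: trust the
# rotation-hash key regardless of which hash value a node carries.
tolerations:
- key: example.com/rotation-hash
  operator: Exists          # no value field: any hash is acceptable
  effect: NoExecute
```

Only the rotator deployment uses operator: Equal with the full hash, which is what makes it the one pod that demands new infrastructure.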
All right, thanks again, folks. Before people go, I'm sorry, I forgot to mention this: we actually have a lot of swag for everybody who's interested, from the talk. Some helpers over there are gonna be handing out little cards. Stop by the Oracle booth upstairs and you'll be able to get some, you know, t-shirts and all that kind of cool stuff for attending our talk. We really appreciate your time. Thank you so much.