Okay, hey everyone, thanks for coming to our talk today on rapidly scaling for breaking news with Karpenter and KEDA. My name is Mel Cone, I use any pronouns. I'm a senior software engineer at the New York Times in Delivery Engineering, which is a platform team building the shared platform at the New York Times, and I'm specifically on the team that manages, maintains, and adds features to our shared Kubernetes clusters in AWS.

Hi everyone, my name is Deepak, and my pronouns are he/him. I'm on the same team as Mel, working on the same shared platform.

Let's get right into it. So, what's a breaking news alert? A breaking news alert is a push notification that we send our users, and I'm sure you've gotten some of these, or might even get one during the talk. As you'd expect, this leads to a sudden influx of traffic to the website and the mobile app, and that leads to some interesting traffic patterns. We see sudden, spiky traffic; this is an example of an average traffic spike, usually caused by a push notification. We regularly see traffic spike to two to three times normal, and this happens within a minute. In the past we've over-provisioned to deal with this, but that costs a lot of money. We want to minimize the amount of infrastructure we're using, so we need a way to scale down and scale up quickly.

This is another example of a traffic pattern that we see. This one is the result of a daily release of a game, and it also leads to about a 3x increase in traffic in a very short time span. This traffic is regular, so we know when to expect it and how long it lasts, which means we can scale for it specifically. So now Mel will dive deeper into our architecture and what we're scaling.

Okay, so why do we see this traffic spike? The New York Times has a unified HTTP ingress. This is a relatively recent thing that handles some of our internal and external traffic, and the goal is to have
everything eventually pass through the ingress. But the ingress controller needs to scale to handle the traffic spikes from the BNAs (breaking news alerts) and the games, so we often see that 3x traffic increase as a 3x increase in the number of replicas of the unified HTTP ingress, and this is running on the Kubernetes clusters.

So when a user opens the New York Times website or app, the request is routed through our HTTP ingress. The ingress sends the request to the corresponding upstream for the appropriate service. These services often need to make calls to other New York Times services behind our ingress. For example, the front page service will make a call to our auth service to check whether the user is logged in and has a subscription, or a call to our personalization service, which customizes the organization of stories presented to a user, and then those services need to make calls back to the front page service. This means that when there's a spike in user traffic, there's also a spike in internal traffic, which makes the traffic spike for the ingress even bigger.

In this diagram you can see a portion of our shared platform, specifically the shared Kubernetes clusters in AWS. We run a multi-tenant Kubernetes runtime with clusters in multiple regions and environments, including a sandbox cluster we use for testing changes, and teams are provided with a tenant cloud account along with access to a team namespace in the dev, stage, and prod clusters. So whenever we need to scale, we need to scale across multiple clusters; obviously prod is at a bigger scale, but it's all of them. Now we'll talk about how we're scaling our nodes with Karpenter.

So, why wouldn't we just use the Cluster Autoscaler?
First, Karpenter is a native Kubernetes autoscaler. It's installed as a CRD, and we're already using a Kubernetes-based workflow for installing the other parts of our cluster, so we're able to configure Karpenter via YAML as part of that workflow. With the Cluster Autoscaler we had a lot of Terraform that had to be run separately and installed in each of our environments, and that had to be maintained separately. Now I'm going to hand it back to Mel to talk further about Karpenter.

Yeah, so Karpenter is able to scale at the pod level by finding the optimal way to pack pods onto nodes using a bin-packing algorithm, while the Cluster Autoscaler is more naive in its approach: it scales at the node level based on overall resource demand. For example, if you have resource limits on a pod and you need to increase the number of replicas, it looks at the resource limits for those pods and sees where it can schedule them. There's a bit of an algorithm there, but it's more naive. Karpenter also takes into consideration all instance types that meet the requirements you specify. We're going to show an example in a minute, but you can give multiple instance types as options, which is where a lot of Karpenter's power comes from, because Karpenter has multiple algorithms that optimize, from a cost perspective, which thing to spin up.

So if you look at the diagram: in Karpenter, if you have pending Kubernetes pods and they can fit on an existing node, they go ahead and get scheduled by the scheduler; if they can't, Karpenter spins up a new node. Karpenter will also do node consolidation, so if it realizes, okay, I can get a bigger node that's cheaper than the total cost of these other nodes, it will spin up that node and move things over there.

Karpenter uses groupless auto-scaling, so the migration was a little complicated. We went from managed node groups to this groupless auto-scaling that leverages the security groups and the
subnets; it adds a tag so that it knows where to spin stuff up. But it's groupless because of the different instance types. And it can't be used with the Cluster Autoscaler; they explicitly say not to.

So, provisioners, which as of very recently, with Karpenter graduating to beta a week and a half ago, are being renamed to NodePools (that's going to be fun when we upgrade), are the equivalent of a node group in EKS. This example right here is one of the provisioners we use in our cluster, specifically for tenants, so we have a bit more flexibility here than in, say, the provisioner for our unified HTTP ingress. You can see the different instance types, and the part specifically for Cilium, like the startup taint. We also have our nodes expire every 14 days. Part of the reason is that spot instances can't be replaced and changed to a different size, they're just deleted, so we want to cycle the nodes; and it's also fun, you know, chaos engineering, to have our nodes cycle every 14 days. That's at most 14 days; sometimes they go away sooner. And then we have consolidation enabled, which you don't have to do, and I'll go into it a bit more, but we don't do that for our unified HTTP ingress.

So, consolidation. This is the big way that we save money. Karpenter has three ways to consolidate nodes. It reduces cost by either deleting or replacing nodes: if they're empty, if their workloads can run on other nodes, or if they can be replaced with cheaper nodes. And by cheaper, that could mean one node replacing multiple nodes, or one node replacing one node. So, in order, it checks: can it delete any empty nodes? Then it tries to delete multiple nodes and launch a cheaper single node. Then it tries to delete a single node and replace it with a cheaper node. On the right-hand side is something taken from the Karpenter docs; I took a portion of it.
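For anyone following along without the slides, here is a sketch of what a tenant provisioner along the lines described above could look like in the pre-beta v1alpha5 API. The name, instance types, taint, and values are illustrative assumptions, not the actual configuration:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: tenant-workloads            # hypothetical name
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.xlarge", "m5.2xlarge", "m6i.xlarge", "c5.2xlarge"]
  startupTaints:
    # keeps pods off the node until the Cilium agent is ready
    - key: node.cilium.io/agent-not-ready
      value: "true"
      effect: NoExecute
  consolidation:
    enabled: true                   # let Karpenter delete or replace underutilized nodes
  ttlSecondsUntilExpired: 1209600   # 14 days, forcing regular node cycling
```

Listing several instance types and both capacity types is what gives the cost-optimizing bin packing real choices; in the beta NodePool API the equivalent knobs move under `spec.disruption` (`consolidationPolicy`, `consolidateAfter`, `expireAfter`).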
What's shown there is actually the NodePool, which has all these new pieces in the spec, but it lets you define how disruption works. You have this consolidation policy, either WhenUnderutilized or WhenEmpty, then you say how long Karpenter should wait before it makes a decision, and you can also set nodes to expire after a certain amount of time, which is the same thing you saw on the previous slide with our provisioner and its 14 days.

Okay, so this is code taken from the Karpenter GitHub repository. I'm not going to walk through every line, just the high level. This is the cool part with spot instances. Part of the reason we migrated to Karpenter is that spot instances were a lot easier to do there than with our managed node groups, and we really wanted to use spot instances. This "filter unwanted spot" logic filters out spot instance types that are more expensive than the cheapest on-demand type we could launch: if the on-demand type is going to be the same price or cheaper, it doesn't make sense to use a spot instance. For anyone who doesn't know a lot about spot instances, you essentially bid on them, so people can outbid you. We don't bid, so it's kind of all or nothing for us. So it's either it's cheaper
or it's not, the end. Karpenter also finds the cheapest offering for the spot instances from the options that you give it. So if you don't give it many options, you're not going to get as big a bang for your buck, because it doesn't have as much to choose from, and that also makes consolidation harder.

Cool, so now we have a quick demo of how Karpenter works. It should autoplay, and I can't see it on my laptop, so I'll explain it from here. On the left side I have a quick HTTP deployment that I'll be scaling up, and on the right side I'm tailing some logs from the Karpenter controller, so we can see what Karpenter is doing. Right now, it might be a little small, but you can see that Karpenter has some spot instance pricing that it's aware of; it knows which instances it can pick from when scaling. On the left side I'm scaling up. Say we have a breaking news event, something big: scaling up to 500 replicas, and we're going to watch that scale up. On the right side, once the pods scale up and Karpenter is aware, you should see Karpenter decide: it goes through the algorithm, picks an instance type, and sees what it can provision.

Yeah, so now we can see that Karpenter has decided to provision a new node, and it picked a specific instance type. On the left side we can see that the node is there. We can also see it provisioning a second node to meet the demand and schedule all the pods. And if you look at the tags on the node, we can see that Karpenter picked a certain instance type and that the node is tagged by Karpenter. Now, the important part is that we also want to be able to scale down, so I'm going to try scaling back down to one. Yeah, now we're scaling down.
Yeah, this part was me trying to see if there were pods on the node; that command didn't work, but yep, waiting for the node to be ready, waiting for me to scale it down. Sorry we didn't do a live demo, we didn't want to test the demo gods, but this was recorded yesterday. Close enough to live.

Yeah, this is me showing that the pods have been scheduled on the node that was provisioned. Part of what you're seeing is that it takes a little while for Karpenter to decide to disrupt a node and get rid of it; it needs to move the pods around and also get rid of all the DaemonSet pods. And now we're scaling down to one, and we can see that the deployment now has one replica. So we should see Karpenter go through that consolidation algorithm, where it either deletes a node or tries to replace it with a single cheaper node; in this case it's likely just going to delete both nodes, since we've scaled down from 500 pods to one pod. On the right side, soon we'll see Karpenter decide to delete each of the nodes it created, and we'll be back to our original state very quickly.

Part of the other reason we decided to use Karpenter is that it's faster than the Cluster Autoscaler. We did some benchmarking, and I want to say the Cluster Autoscaler takes maybe two minutes, and Karpenter is more like 90 seconds, sometimes 60, sometimes a little longer, to schedule something on a new node. That sounds like not a lot, but when you have the traffic spikes we have, it is a lot, and it's still not really enough.

Cool. That's the demo.
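As an aside, the spot-offering filter described just before the demo, dropping spot types that cost as much as or more than the cheapest on-demand type, can be sketched in a few lines. This is an illustrative Python re-implementation under an assumed data shape, not Karpenter's actual Go code, and the prices are made up:

```python
# Sketch of the "filter unwanted spot" idea: a spot offering only makes
# sense if it is strictly cheaper than the cheapest on-demand offering,
# since spot adds interruption risk for no savings otherwise.

def filter_unwanted_spot(offerings):
    """offerings: dicts with 'type', 'capacity' ('spot'/'on-demand'), hourly 'price'."""
    on_demand_prices = [o["price"] for o in offerings if o["capacity"] == "on-demand"]
    if not on_demand_prices:
        return offerings
    cheapest_on_demand = min(on_demand_prices)
    return [
        o for o in offerings
        if o["capacity"] == "on-demand" or o["price"] < cheapest_on_demand
    ]

offerings = [
    {"type": "m5.xlarge", "capacity": "on-demand", "price": 0.192},
    {"type": "m5.xlarge", "capacity": "spot", "price": 0.077},
    {"type": "m5.2xlarge", "capacity": "spot", "price": 0.210},
]
kept = filter_unwanted_spot(offerings)
print([o["type"] for o in kept if o["capacity"] == "spot"])  # prints ['m5.xlarge']
```

The m5.2xlarge spot offering is dropped because it is pricier than the cheapest on-demand option, which is also why giving Karpenter more instance types improves its odds of finding a cheap spot fit.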
Thank you. And now we'll talk more about how we're scaling with KEDA. Just to give an overview: KEDA is an event-based scaler, and it extends the HPA. It configures the HPA for your deployment rather than replacing it. Scaling is set up via a ScaledObject, and the ScaledObject defines which deployment is the target and what the trigger is; a trigger can be any of a bunch of event sources that KEDA provides out of the box.

So let's revisit some of those traffic spikes. This is the release traffic spike, the predictable one. We needed a way to make sure we can scale up in time, and we know when these spikes happen, so we're able to use KEDA's cron trigger type to scale our ingress controller. Essentially, we specify a time window in which to scale up to a certain replica count, and KEDA patches the HPA to scale up for that time period. This would also have been doable with a Kubernetes CronJob, but KEDA lets us define our scaling logic in one YAML, and it also modifies the replicas for us.

This diagram combines Karpenter and KEDA and shows how we scale with them together. When a release happens, KEDA first adjusts our ingress controller replicas: it modifies the HPA, which scales up the deployment. Then Karpenter has to schedule all of the pods that have been created, so Karpenter adjusts the node pool that manages the ingress controller, picking the best instances for the pods KEDA created. Once the KEDA cron window ends, KEDA scales down the deployment, and in response Karpenter will either delete nodes or replace them with cheaper nodes.

We've also explored other ways to scale.
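As a sketch, the cron-based window just described would be expressed in a single ScaledObject roughly like this; the target name, schedule, and replica counts are made-up examples, not the actual values:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ingress-release-window      # hypothetical name
spec:
  scaleTargetRef:
    name: ingress-controller        # deployment whose HPA KEDA manages
  minReplicaCount: 10
  maxReplicaCount: 120
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 55 14 * * *          # scale up just before the daily release
        end: 30 15 * * *            # hold the floor through the spike
        desiredReplicas: "90"
```

KEDA creates and patches the underlying HPA itself, which is why the whole policy can live in this one object.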
Revisiting the breaking news alert: we know when we're going to send a breaking news alert, and we need a way to scale in response to that event, but these aren't predictable; they don't happen at the same time every day. These spikes are more random and more frequent, so we can't use the cron solution. But KEDA supports an external push trigger, so we essentially set up a server ourselves to send an alert to KEDA. As you see in this diagram, we first have the breaking news alert going out after the article is published, and a webhook server that the team sending the notification posts an alert to. To set up the external trigger, there needs to be a gRPC server that KEDA talks to; that's where the external push goes. So the request gets sent to the webhook server, which sends it on to the gRPC server, and once it reaches the gRPC server, KEDA knows to scale up. It would be a similar kind of solution, where KEDA scales up to some replica count once it receives the alert that a breaking news alert is going out.

And now we'll discuss some lessons learned from this process. Cost savings: we were looking to save money by not over-provisioning as much. This is really hard given the way breaking news alerts work. With the heads-up Deepak mentioned, the heads-up is a couple of seconds, and when your spike happens within 20 to 40 seconds, lasts for only two and a half minutes, and is so huge, it basically means that by the time you scale up, you have to scale down. So over-provisioning was kind of what we were able to do, and to be honest we still do it somewhat, but a lot less now. A lot of that is down to the cron job Deepak mentioned, but also to things in Karpenter, like node consolidation, which means less over-provisioning, right, because it's going to pick the cheaper nodes. It's important that
we have a variety of nodes it can pick from; otherwise, if you say, okay, you can only use this one specific instance type, it can't do anything. It's either spot or on-demand, the end. And we do have some provisioners that are on-demand only, because they run more critical workloads.

Yeah, and so with the cron job, we don't need to over-provision as much, and we can adjust the ScaledObject's max replicas based on observed traffic patterns. For example, recently, with some of the releases of the newer games, there's been a record number of game downloads, and it was like, huh, I wonder what's going on. Oh, okay, cool. Because we scale up ahead of time, we also have fewer alerts and fewer things breaking in the cluster, since it's not attempting to scale on the fly. And we have more instance types to choose from: with a managed node group you get one instance type, while with Karpenter you can give a set of instance types. Oh, and I forgot to mention: we saw a 25% decrease in cost, which is exciting.

Further improvements. Our unified HTTP ingress, because of how important it is and because of its networking and compute needs, has a single instance type it can use, and we only run it on-demand, so obviously we can't really leverage Karpenter for it, and we also don't do node consolidation for it. We're looking into ways we can either have more node options or maybe switch from a network-optimized instance type to a compute-optimized one, and just do more investigation there; a reminder that this unified HTTP ingress is maybe a year, a year and a half old. And then there's also the external BNA scaler.
We're trying to figure out how to do that, because with the little heads-up we get, it's not really worth it to implement a whole scaler. Even a little less over-provisioning would be good, but it's not worth it given the tiny heads-up, and the way the workflow works is kind of unpredictable. There's a whole process of getting an article approved, and I think it's in Slack: they post, hey, please send a breaking news alert now. And there are other teams doing different things, so by the time it gets to us, the window is basically gone.

Then we also potentially want KEDA as a service for tenants. Right now we have KEDA rolled out to the tenants in our clusters, but we don't have many who are using it, and I think part of the reason is that they don't necessarily know how or why. As Deepak said, there are different metrics you can scale on, so there's a lot built in: Datadog, all the different clouds, CPU, pretty much anything you can imagine, and, like we said, you can write your own. Ideally we'd have something in the shared platform where it's like: hey, do you want a scaler for this deployment? Give us a bit of configuration and we'll do it for you.

So with KEDA and Karpenter together, we were able to save significant money, and our lives are a lot easier; we get paged way less often.

Okay, so the other big thing is the shared platform. The shared platform is centralization and standardization, right?
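On the KEDA-as-a-service idea mentioned a moment ago: the configuration a tenant would hand over could be as small as a CPU trigger. A minimal sketch, with a hypothetical deployment name and thresholds:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-tenant-service           # hypothetical tenant deployment
spec:
  scaleTargetRef:
    name: my-tenant-service
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"                 # target average CPU utilization (percent)
```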
Before the shared platform, people were running and managing their own infrastructure. Understandably, there were going to be differences in that infrastructure, not only because of the needs of each team, but also because of the team's expertise and preferences. Now that we have a lot of our services on the shared platform, specifically the shared cluster, we're able to establish a baseline, more easily compare across services, and detect outliers. For example, we can ask questions like: why does this one service have significantly higher RPS? Sometimes the answer is that its scale is larger, and that's fine. But other times it may be because the traffic flow isn't optimal and something needs to change; maybe it's making too many requests, maybe it's not caching, whatever.

My speaker notes are cut off at the bottom, so I guess I'm just going to wing this. I already said some tenants make significantly more requests. I talked earlier about how services make calls to other services within the New York Times, which then make calls back, and how that contributes to the spike. But there are some making more requests than others, and we haven't fully figured out what's going on there; it's just a lot easier to tell now, with us scaling the architecture rather than the tenants needing to. And with this standardization we have a much better idea of: okay, this is potentially something with the cluster, versus something going on with the tenant, versus expected behavior, all of that.

And then it also means tenants don't have to worry about it.
In the past, a lot of different teams scaled up themselves for these breaking news alerts or for the games releases: for example (I have to be careful here) auth, or say personalization, or algorithms, stuff like that. Now they don't need to do that anymore. I think we mentioned early on that not all of our services are even behind the ingress yet, and you don't have to be running your service in our shared cluster to be behind the ingress; you can be elsewhere. So we're hoping that the more people move to the platform, the easier it will be to see some of these outliers, and that will be good for everyone. We can go back to a team, advise them as needed, and say: hey, we're seeing this; can you tell us why it might be happening? Can we help you improve your service?

Okay, so with that, thank you very much. We had a couple of talks earlier this week from some of our co-workers, one at ArgoCon and one at Multi-TenancyCon. They already happened, but I encourage y'all to check them out if you get the chance. Thank you, everyone.

Very, very good talk, one of the best talks of the whole KubeCon, seriously. It's the last day, the last talk, and the room is full.

Appreciate it. It's also both of our first talks at KubeCon.

Still, I have a question. You're using KEDA, and it's using the HPA to scale, let's say, 500 pods. Some of them are going to be scheduled because you already have the nodes, but then Karpenter has to create new nodes, and that can take some time. So do you have a way to keep some resources around? Like, when Karpenter scales down the number of nodes, do you still have some balloon capacity so you can schedule pods directly?

Say that last part again? The last part, you were like, how do you manage to...?
How do you manage the number of nodes, or the compute power you still have when Karpenter scales down, so you can accommodate at least some of the pods when a new scaling event happens?

Okay. So first of all, part of the benefit of the cron job is that rather than going node by node, Karpenter can start up a bunch of nodes at once. And you're specifically asking whether, when it's scaling down, we still have spare capacity. The answer is that it doesn't; they're basically working separately. Karpenter works at the node level, and KEDA works at the pod and HPA level, so there aren't going to be nodes just sitting there unless they're underutilized. In general, because of node consolidation and everything, we don't have many nodes at high underutilization, unless it's the HTTP ingress.

What that means is, as KEDA increases the replicas, it's like: okay, I need to schedule this many more replicas. Karpenter isn't interacting with KEDA; it just sees that it needs to schedule more pods, and then it spins up the nodes. That's part of the reason this is difficult to do for a BNA: we would have to purposely create the nodes ahead of time.

So let me rephrase: can you tell Karpenter to keep that number of nodes when it's scaled down?
Yeah, yeah, you can. You could say, keep them around for longer. You would have to have a different configuration for the provisioner, potentially a different provisioner specifically for this event, but provisioners can't overlap: you can't have two provisioners where a pod could be scheduled based on either one. And I just lost my train of thought, but yeah, basically we don't have extra capacity ahead of time. The nodes are there and then they're gone, and we would have to do extra work to keep them around. In general, we don't really want to keep them around. Karpenter is pretty smart about it; you saw it took a little bit of time to scale back down. But yeah, we have seen some errors on scale-down, because you can imagine traffic going up like this and then down like that, and it's a lot.

Go ahead. Thank you. Hello, I have a question: is it okay to run the Cluster Autoscaler and Karpenter at the same time, or do you have to run only one tool?

So usually what you do is scale the Cluster Autoscaler down to zero, and you do that after you have Karpenter set up. You can't have the two together, because they're both scaling the same thing in different ways, and the Karpenter documentation explicitly says not to.

Okay, thank you.

Thank you. Have you had to make any changes to kube-scheduler, or have you observed any weird interactions between Karpenter and kube-scheduler?
Because with Karpenter doing bin packing and kube-scheduler doing whatever the heck kube-scheduler does, I can imagine those are going to fight.

Yeah, so we haven't made any changes to kube-scheduler. Some of the stuff we've seen that's kind of weird: we had to implement these interruption queues, specifically because there wasn't graceful termination of things, especially with the DaemonSets. Sometimes weird things would happen; it would struggle to drain and delete a node, or for some reason something wasn't scheduled or wasn't at the right number of replicas, and that's how we've gotten around it. But you're totally right; wonky stuff happens. The way it is right now, though, Karpenter can kind of figure it out and fix it. As in, say kube-scheduler naively schedules something on a node, and Karpenter decides, I want one node for all of that instead; then it's able to delete the node and schedule things onto the new one. And you saw in the logs that it shows you how many pods it's scheduling on a node.

Thank you. Yep.

Fantastic talk, thank you. Are you concerned about capacity limitations in the cloud? You're moving from pre-scaling, from being over-provisioned by the sounds of it, to scaling on demand based on breaking news. As we enter the holiday periods the clouds get busier; are you concerned about running out of capacity when breaking news occurs and KEDA and Karpenter can't scale anymore?

I'm going to answer this very carefully: yes. I think in general, well, there are a couple of things to consider. First of all, we're currently migrating more people to the clusters and the ingress, which means more scaling is needed. I would say holidays are not as big a thing for us.
It's really the election, and I would say the first big one coming up is the Iowa caucus, which I think is in January. So in general, what we do is a lot of stress testing and load testing. This is a separate thing, but we have an in-house load testing setup built on k6, which lets people configure their own load tests and run them themselves. And there's a whole team that does election readiness, to make sure we're in a good position for it. But yeah, you're totally right.

That's brilliant. I would say that's a fantastic talk for the next KubeCon. Thank you.

Hi. With mixing different instance sizes, how are you managing DaemonSet scaling? Are you setting their resources statically for the largest instance size you run, or is there any dynamic scaling there?

So, specifically, aside from knowing how much capacity it needs for the DaemonSets, Karpenter doesn't consider them when spinning up new nodes. It just knows: okay, maybe there are seven DaemonSets, so I need seven pods with these resource constraints on that node, but it sizes the new node based on the other stuff that needs to get scheduled.

Yeah, I'm asking because for larger nodes you may need more DaemonSet resources than on smaller nodes. Do you just provision for the largest sizes at all times?

Yeah, so, this is not ideal, but currently what we do is we've been shifting it.
So basically, Karpenter doesn't do things exactly the same way every time, but it seems to pick similar instances, so what we've had to do over time, as more stuff has migrated and the node sizes have increased, is change those resource limits. We've actually recently removed the CPU limit for the Datadog DaemonSet specifically, because of the way CPU starvation fails: it's a silent failure, compared to memory starvation. And ideally (and I know we're doing this in the shared cluster we have in GKE) we want to completely remove CPU limits; this is me speaking, I would like to completely remove CPU limits for all the deployments, tenants and everything, because there's been some gnarly stuff due to CPU starvation.

Thanks.

Hi, I was wondering: as you started migrating to this really fast cluster auto-scaling, how much did your application start-up time play into that? Like, if you had a few applications that were really slow to start up, that were really blocking it, would you see something where all of your applications would be at high CPU usage and then it would scale even more?

Yeah, so that's a great question. Part of the difficulty is that we want to live in a world with the shared platform where people don't have to worry about the infrastructure. That's not the world we live in, and honestly, if you have a shared platform like that, congratulations. So essentially what we've had to do is be really clear in our documentation, and then, when we see outliers, advise people, and basically say: hey, here naively is what Kubernetes is doing; pods can come and go, all of that, and you don't know where they're going to get scheduled.
So you need to be more resilient. We've had situations in the past where tenants had to change things, but usually, because everything is in one place, we look at it, we go help them, we talk to them, we get an understanding; and we've also done some of these migrations ourselves with the team. But yeah, it's a great question. It's going to shift.

Cool. Yeah, thank you.

I have a couple of questions. The first one is about consolidation. I see in your NodePool configuration you set consolidateAfter to 30 seconds. Have you ever seen a node get consolidated and replaced with a smaller node, and then in the next five minutes that node gets terminated and a bigger size spun up, and that triggers a lot of pod rescheduling?

Yeah, so that specific thing, maybe not, but something similar: we've definitely seen situations, especially because of these really fast, short-lived spikes, where nodes get started up and then it's like, oh wait, I now have five nodes that could be one node, I'm going to consolidate them. I would say it doesn't happen that fast. And there's also the fact that we're not doing node consolidation for the HTTP ingress, so we don't have to worry about that happening there, and that's the main thing that has to scale, just because of everything going through it.

I ask because we have continuous delivery, a deployment every minute or so, with around 10k users on our platform, so deployments actually trigger consolidation very frequently, and users do complain about that.

The other question is about your spot instances: how are you dealing with spot instances getting reclaimed by AWS?

As in, when they're recalled, you mean?

When your spot instance goes back to the AWS pool.
Yeah, so the answer is: the interruption queue has been important, the interruption queue I mentioned before. But in general, I think we would have to do significantly more work to deal with that gracefully, because we're not doing the bidding and all of that. So in general what happens is the instance gets recalled, or whatever it's called, that node goes away and gracefully terminates, and then it spins up new nodes to schedule the pods. And to my knowledge it's not going to de-schedule the pods before it spins up the replacement, but I'm not 100% sure on that.

Regarding the KEDA component: that's for the predictable use case, for the breaking news. What about unpredictable news, like an earthquake or a shooting? Is it also possible to use KEDA to scale for that? Yeah, so currently we're mainly using it for the predictable use case, but that custom external push trigger I was showing is what we're prototyping, and that would account for: okay, a team tells us they're about to send out a news notification, they know we're probably going to get more traffic, and we would account for that. Other than that, we use KEDA for the normal CPU and memory metrics that you can do with the HPA. So you rely on some external metric to trigger that? Oh, not currently, we're prototyping it.
That's what we're working on, but we will. Just to walk back to what I said during the presentation: the reason we're not doing it right now is that it buys so little time, it's just not worth it. I think it's literally a couple of seconds, and also we're one of the downstream services, so other things are happening too. I know that the team that sends out the push alerts has an idea of the audience they're sending to, and that's really helpful, but so far it basically hasn't been worth it. There are other things to work on, but we definitely want to do it. And honestly, ideally we would talk to the folks sending out the push notifications, or the ones requesting them, and say we really need a heads-up, because historically, maybe years ago, we got like 30 minutes; now it's, whatever, a couple of seconds, there you go.

Sorry, last question. No, these are great questions. About the Karpenter controller: when I first deployed it, I actually messed up a lot of the configuration, which led it to terminate all of my nodes. Yes, I remember when we migrated. Have you had any alerting, or how do you prevent that? Yeah, so we definitely have alerting. Didn't you set up some of the alerting for us? Yeah, we have alerting, and we also have multiple node groups, or node pools, sorry: a node pool for the actual ingress controller where we don't use spot instances, because that's more resilient, and then other node pools to manage tenant workloads. So that's also separated.
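As a rough sketch of the node-pool split just described, on-demand capacity with no consolidation for the ingress controller, spot allowed for tenant workloads, the Karpenter configuration might look something like the following. The names, taints, and nodeClassRef are illustrative assumptions, not the actual configuration:

```yaml
# Hypothetical NodePool for the ingress controller: on-demand only,
# consolidation effectively disabled so scale-downs don't churn the ingress.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ingress
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumed EC2NodeClass name
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        - key: dedicated         # illustrative taint keeping tenants off
          value: ingress
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: Never
---
# Hypothetical NodePool for tenant workloads: spot allowed, with the
# 30-second consolidateAfter mentioned earlier.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: tenants
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
```

This uses the current `karpenter.sh/v1` API; the talk predates it (Karpenter had just graduated to beta), so the field names in their actual manifests may differ.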
Are they based on the Karpenter logs, or on the Karpenter controller metrics? The controller metrics. Okay. And also, I don't remember the specifics of it, but there are certain anomalies that we look for that we page on, or at least post to a Slack channel that we can look at, so it's like: ah, you know, this node hasn't gone away, what's going on, stuff like that. But yeah, it's not perfect, it's not ideal. Cool, thanks. Thank you.

Karpenter considers the node types in order to do the autoscaling. Is it also considering the region or the zone? You can do that, and it's actually something I didn't talk about; this is a great question. In general, I'm pretty sure it balances things. There's a bit of weirdness, though, especially with these scaling events. Say we have us-east-1a, b, and c, the availability zones: it will balance across them, and I think this is something in the configuration that we did. The problem is that if we have, say, 20 nodes in one, 40 in one, and 20 in the other, it says: well, let's spin up 20 in the other two. But then if it takes them down, it goes: wait, we have to get rid of the other ones and start up new ones. And then it's this constant loop of: wait, we don't have the right number. That was one of the weird behaviors we saw. I'm trying to remember what we did for it; I think it just got into a gnarly state at some point and we had to fix it. But I think part of that was for the regular, predictable spike, and I think there was some setting around rebalancing, so that it wasn't as naive about how it did it, that we were able to use. Cool, thank you.
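The zone balancing being described is typically expressed as a topology spread constraint on the workload, which Karpenter then honors when provisioning capacity. This is a generic sketch with a hypothetical Deployment name, not the actual ingress manifest:

```yaml
# Hypothetical Deployment snippet: ask the scheduler to keep replicas
# balanced across availability zones (max difference of 1 between zones).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: http-ingress            # placeholder name
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: http-ingress
```

The "constant loop" behavior mentioned above can show up when a hard `DoNotSchedule` constraint fights consolidation; relaxing it to `ScheduleAnyway` is one common mitigation.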
Thank you. So with the really short, discrete traffic spikes that you get, and I'm not really sure exactly how Amazon bills, when it starts and when it ends, have you all thought about integrating Fargate, which might have a faster startup and shutdown? I think initially we did consider Fargate, but we never actually used it in production. I think we might have done a PoC, so I'm not sure; I think initially we might have, if you remember, used Fargate to solve a similar type of problem.

Great talk, by the way, one of the most valuable ones I've been in so far, I think, because we use Karpenter; we're just starting our journey with it. Have you lived through any availability zone gray failures, and how does Karpenter handle that? When you say gray failures, can you be specific about what you mean? Oh, I mean like an availability zone goes down or something. I don't think we've had availability zone gray failures, or at least not to our knowledge, and now I'm thinking we should set up an alert for that, because I would not be surprised if that's happening. So maybe I should go do that now. The fun part of that with Amazon is you don't get an alert; you find out about it later. I would set up an alert to figure out their shenanigans, is what I'm saying. I'd have to look at it and see what's going on, but that's a great point, I should go look at that. What I would think you'd probably have to do is tell Karpenter to not deal with that availability zone, and do it manually, which is a problem. But I was just wondering if you've seen that before. Yeah, and for what it's worth, we do have multi-region.
So we have clusters in multiple regions, and as of recently we have a cross-region service mesh, and the unified ingress is set up to fail over. That doesn't exactly cover the case where one availability zone is having problems, but if something really wonky is happening, it will just fail over to the other region. So maybe that would help, but yeah, it's still not great. Got you, thank you.

Thank you for the great talk. Does Karpenter take into account a compute savings plan, if you have one? That's a great question that I actually thought about today while preparing the talk, and I realized I don't know the answer and didn't have time to look it up. I don't think it does, speaking as someone who has waded around a lot in AWS Cost Explorer, which is very confusing when you do savings plans and spot instances and all of that. I don't think it does. And I know for sure it doesn't in the sense that, if we tell it, hey, you can pick from these node types, it's not smart enough to say: you have a savings plan for this specific node type, so I'm going to make sure I use that node type. It's a little more naive. But that's a great question. Can we maybe tune the weights on the node types, to say that this one is preferable? I don't know. I want to say this is something that came up before, and either we couldn't do it, or we couldn't figure out exactly how to do it, because, mind you, Karpenter just graduated to beta. So I'm not sure. Thank you.

I have one more question. You mentioned scaling up by pods or by nodes. I'm wondering what you do with other resources, like a DB or a messaging system.
How do you scale a database or other resources, not only pods or nodes? As in, how do we scale, say, a database that's not running in the cluster but that an application is communicating with? Right, I was thinking that when you get breaking news, it doesn't only require pod-level resources; there may be other dependent resources that you need. How do you scale those? So, there's what's built into the HPA, right, the CPU and memory. But in terms of the database and all that, I'm not sure. I do know that for things outside of the cluster, which is the more common setup, people running their main app in the cluster and communicating out to a database or something, a lot of the time they have to scale up ahead of time. There are multiple other teams dealing with similar problems. Okay, all right, thank you. Thank you.

Sorry to keep everyone here. In your demonstration you mentioned that you use the cron part of KEDA; however, I believe it requires that you give it an actual number to scale up to. So how did you derive that number? Did you use historical data? I'm guessing this is for the regularly occurring spikes. Yeah, we use historical data, and obviously we slightly over-provision, so it's historical data plus about 20%. But yeah, it's a set number of replicas that we scale our ingress controller up to. And to be clear, the number in the slide was not the number we actually use, because I'm not going to put that in a slide. Kind of on the flip side: when you have those sudden spikes, what do you actually use to determine the number of pods to scale up to? Is it a CPU or memory resource thing?
Or is there an external metric? Because you were saying previously you got a 30-minute heads-up, and now you get a few seconds. That's assuming you have some sort of coordination for the breaking news alert, versus, if there's an earthquake, you know everyone's going to check their phone and google: is there an earthquake near me? So how do you manage the sudden spikes, rather than the regular spikes? So the answer is, just hope for the best. I mean, okay, that's a bit of an oversimplification, but we are continuously trying to decrease the amount of time it takes to scale. I think at this point we've gotten most of what we're going to get out of that, but we're hoping things improve over time. And I really do think the main thing that's going to help with this problem is either getting a heads-up from the team that requests the push alert, or this external scaler in KEDA, the custom scaler that subscribes to the webhook, because the spikes are really unpredictable, and you don't necessarily know what's going to go out. I think the only thing we have is a Slack channel, and maybe you could get some idea from that. But in general, if an election is coming up, we know we're going to get more traffic, so we'll probably over-provision some, because, for instance, when the debate was happening, I'm sure we saw a really big spike. I wasn't paying attention because I was on call and I was here, but yeah. And also the expected traffic we were showing, that spike is much quicker to scale for, a few seconds versus 30 seconds to a minute, so we're able to scale for that. Yeah, I was just curious whether CPU or memory was an adequate signal to scale up in that rapid a time period, because you have pod startup and scheduling and all of that. So, just wondering. Thank you. Thank you.
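The cron-based pre-scaling discussed in the last few answers could be sketched with a KEDA ScaledObject like the one below. The schedule, replica counts, and deployment name are made-up placeholders (per the speaker, the real numbers are deliberately not published), not the Times' actual configuration:

```yaml
# Hypothetical KEDA ScaledObject: pre-scale the ingress controller
# around a known daily spike (e.g. a game release), then fall back to
# CPU-based scaling the rest of the time.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ingress-prescale
spec:
  scaleTargetRef:
    name: http-ingress          # placeholder deployment name
  minReplicaCount: 10
  maxReplicaCount: 200
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: "55 9 * * *"     # a few minutes before the expected spike
        end: "30 10 * * *"
        desiredReplicas: "60"   # historical peak plus ~20% headroom
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"             # scale on CPU the rest of the time
```

The prototyped external push trigger mentioned above would be an additional trigger of type `external` pointing at a custom scaler service; that part is still in prototype and not shown here.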