Thanks Thomas for the introduction, thanks everyone for being here, hope you all had a good lunch, and welcome to everyone who's online, happy that you're able to participate as well. Like Thomas said, my name's David, this is my colleague Evan, and we're both software engineers on the compute infrastructure team at Airbnb. We're going to be talking about how our Kubernetes clusters have been evolving into heterogeneous clusters over the past several years.

Quick outline of our presentation: Evan's going to talk a little bit about where we were before and give you some historical context on what our clusters looked like two, three, four years ago. Then we're going to do a deep dive into three specific problems that we encountered. We encountered a lot more than three, but we picked three that we thought were the most interesting. These are both technical and organizational hurdles that we had to overcome on this journey. I guess the other thing I'll mention here is that this journey is ongoing. Some of the stuff we're talking about is done, some of it is in progress, some of it's not complete yet, but I think we've got a good plan at least. And then lastly, we're going to give you some future plans and lessons learned. The TLDR here is that heterogeneous clusters are great. They've already been instrumental for Airbnb and for our team. So with that, I'm going to hand it over to Evan to talk a little bit about some historical context.

Great. Thanks, David. Yeah, as David mentioned, I'm going to start with a brief history of time of Kubernetes at Airbnb. In 2016 and 2017, we started evaluating Kubernetes for production use. Previously, we were running Chef on AWS EC2, where each replica of each service would have its own machine. In 2017 and 2018, we started building OneTouch, which is Airbnb's abstraction layer on top of Kubernetes for developers. The philosophy here is that all of a service's config lives in one place in Git, and this config is then converted and applied as Kubernetes manifests to the cluster. In 2018 and 2019, there was a huge effort across our engineering org to migrate 90% of our 700-plus services at the time to Kubernetes.

Our clusters were separated out by environment, so we had a prod cluster, a staging cluster, a dev cluster, et cetera. But this quickly grew out of hand as we hit Kubernetes' per-cluster node count limits, and we were forced to split these clusters into different cluster types. So we split our prod cluster into multiple prod clusters, and our staging cluster into multiple staging clusters. And these clusters happened to be composed of a single instance type: single instance type clusters.

This is what a single instance type cluster looks like. A pod comes into the scheduler, and the scheduler tries to schedule the pod on a node with enough resources. Initially, we only had c5d.9xlarges as our node type, because that happened to be the latest instance type we were using in EC2. This was enough at first, but as more and more workloads migrated, some specialized workloads required different instance types. When GPU workloads started migrating, we were forced to create a GPU cluster type; for workloads requiring a lot of memory, we created a high-memory cluster type. These cluster types were identical in both setup and config, which allowed us to expand them horizontally.

Yeah, so why single instance type? This was before we had cluster autoscaler, which dynamically scales cluster size based on pending pods and node utilization.
Without this, multiple instance type clusters were not yet feasible. Any scheduling and capacity errors we ran into we would have to resolve manually, which was quite a pain, because we'd have to look at the different pod requirements and manually determine which node types would fit those scheduling constraints. Additionally, no cluster autoscaler means a lack of automation: at the time, we were manually scaling all of our ASGs when we got alerted, which was not great.

Yeah, so back to the timeline, in 2019 and 2020. As more and more specialized workloads migrated, we had a cluster type explosion, and we went from around five clusters to the over 90 that we have now, with over 30 cluster types. This leads to a lot of painful operational overhead for our team. Here's a screenshot of almost 700 alerts that we have just for ASG sizes, which really isn't great.

This is our attempted solution to fix this: multiple instance type clusters, which look like this. We have the same thing as before, where a pod comes into the scheduler, but this time the scheduler has many node types to choose from. It can schedule the pod on any node type that fits the pod's requirements, in this case any of the four that it has. But this time we also have cluster autoscaler, which handles the complexity of managing the ASG sizes; it scales up a particular node type when it's required by a pod but there's no capacity in the cluster.

So yeah, why did we want to migrate? Well, first, it brings our team quite a lot of quality-of-life improvements. Reducing the number of cluster types down to generalized clusters, so that we only have one process for cluster creation, is really beneficial for our team, and it also reduces the number of alerts and the alert fatigue. Secondly, we wanted to be able to control cluster composition at a global level. It unlocks some cost savings for us when we're able to utilize more node types rather than being locked into one, and it gives us some contract flexibility with AWS. In the future, we're trying to utilize spot instances so we can run part of our clusters on spot, and we also get new instance generation upgrades pretty much for free in the multi-instance-type setting. Historically, it has taken us years to upgrade our entire fleet across new instance generations. Yeah, and the next section is solutions and other problems. I'm going to pass it back to David to talk about the first one.

Thanks, Evan. So in this section, we're going to talk about three of the major technical and organizational hurdles that we had to overcome to do this migration to heterogeneous clusters. The first one I've entitled "in which nobody knows what's going on." Specifically, this is going to focus on cluster autoscaler, because it turns out that we didn't know what was going on with cluster autoscaler. When we started this process, one of the key questions that all of our stakeholders asked us is: how do we know what's going to be running in our cluster? How do you know what mix of instance types you have? How do you know where things are going to get scheduled? And we thought we had the answer to that.

So this first diagram here is not actually what happens, but it's what we thought happened. A new pod comes in, and KubeScheduler looks at the pod specification, the resource requests, et cetera, and it's going to do one of two things. In the first case, which I'm calling the happy path, there's a node somewhere which meets all of the requirements, and KubeScheduler just binds the pod to that node.
I'm not going to talk about the happy path anymore. On the unhappy path, there are no nodes in the cluster where the pod can run, and so KubeScheduler puts the pod into a pending state and marks it as unschedulable. Cluster autoscaler then monitors the list of unschedulable pods, and it's going to try to spin up new EC2 instances that match the pod specifications. It looks at things like the resource requests, node selectors, node labels, pod topology spread, et cetera. It's going to iterate through all of the different node groups that it has, select one that matches the pod specifications, and spin up a new host from that node group.

In our setting, node groups correspond exactly to ASGs, because in a node group all of the nodes have to be identical: they have to have the same CPUs, they have to have the same memory, and this goes down to the labels and everything else that's applied to the nodes in that group as well. And so we have a different ASG that corresponds to each different instance type that we want available in our cluster. Anyways, once cluster autoscaler spins up that node and it joins the cluster, KubeScheduler will then see, oh look, there's a node that can accommodate my pod, and it binds the pod to the node. Nice and simple, right? Cluster autoscaler is making all these decisions about which node types to spin up, and we thought, great, we'll just hook into that process. It's not quite as simple as we thought.

This diagram here shows what's actually going on. You can see that I've added a third, middle path here, which I've entitled "pause pods evicted." So what is a pause pod, you might be asking? Well, it turns out that for a variety of reasons, we want to be able to maintain additional overhead in our clusters. This is useful in the event that we have a spike in traffic, or we have some sort of an outage and we need to fail over a service from one cluster to another. Having this overhead gives us the ability to handle those bursts in traffic seamlessly. The way that we maintain this overhead is through something called the cluster proportional autoscaler, which is different from the cluster autoscaler. What cluster proportional autoscaler does is spin up pause pods. A pause pod is exactly what it sounds like: it just sits there on a node doing nothing, but it reserves that extra capacity. All of the pause pods are marked with a lower priority. So we've got our regular priorities, and then pause pods have a lower priority, so that when a new pod comes in, KubeScheduler will look at that and say, oh, here's a pod that has a lower priority; it'll evict that pause pod and make room for the new pod that just came in.

We did a little bit of data analysis here and discovered that this middle path is taken approximately 98% of the time in our clusters, which is great. It means that the cluster proportional autoscaler is doing its job: we have enough capacity, 98% of the time, to handle our bursts in traffic. So what's the problem here? Well, in the single instance type world, the way that we configured things is we applied an instance-type-specific node selector to our pause pods. Since we were running c5d.9xlarges, that meant our pause pods could only spin up on c5d.9xlarges.
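To make that concrete, here's a rough sketch of this overprovisioning pattern in manifest form: a low-priority PriorityClass plus a pause pod Deployment pinned to one instance type. The names, sizes, and replica counts here are purely illustrative, not our real config.

```yaml
# Illustrative sketch of the overhead / pause pod pattern; all names and sizes
# are hypothetical.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overhead-pause-pods
value: -10                # lower than any real workload, so these get preempted first
globalDefault: false
description: "Placeholder capacity that real pods may evict"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause-pods
spec:
  replicas: 10            # in practice, scaled by cluster-proportional-autoscaler
  selector:
    matchLabels:
      app: pause-pods
  template:
    metadata:
      labels:
        app: pause-pods
    spec:
      priorityClassName: overhead-pause-pods
      nodeSelector:
        node.kubernetes.io/instance-type: c5d.9xlarge   # the pinning described above
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "35"      # sized to take up roughly an entire c5d.9xlarge
            memory: 64Gi
```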
That node selector was there for a variety of historical reasons, and when we moved to a multiple instance type world, it started causing us problems, because then we'd create new pause pods for m5d or whatever other instance types, and I'm sure you can see where this is going. When a new pod comes in, it evicts a pause pod and gets bound to the host where the pause pod used to be running. The pause pod then transitions into a pending state, and cluster autoscaler says, aha, here's a pod that can't be scheduled, let me try to spin something up to satisfy all of its requests. It has this instance-type-specific node selector on it, and there's only one node group that can satisfy that. KubeScheduler doesn't know anything about different instance types; it doesn't know anything about ASGs or anything like that; it's more or less picking pause pods to evict at random. And what that means is that cluster autoscaler is entirely constrained in its choices by our pause pods: when a pause pod gets evicted, cluster autoscaler is forced to launch a new node that exactly matches that pause pod. So this was kind of a problem, because it meant that we had all of these interacting control loops: cluster autoscaler, cluster proportional autoscaler, KubeScheduler. All of these things were, in some sense, fighting for control over our cluster composition, and this wasn't great.

So what do we do about it? Well, in this slide, I show a slight modification to the previous slide; I'm gonna highlight the differences here. We ended up identifying two things that we could change. The first is that we moved away from these instance-type-specific pause pods to what I call generic pause pods; I'll explain that in a moment. And the second is that we built a custom expander plugin for cluster autoscaler. So let's dive into what each of those are.

As you might expect, the solution for pause pods is pretty straightforward: you just remove the instance-type-specific node selector. So now we have what we call generic pause pods, and they can run anywhere. This is great. In terms of sizing, previously we had the pause pod just take up the entire node. Now we're going to size the pause pods to be the smallest of the instance types in our cluster. So if we've got node types that have 32 CPUs and we've got node types that have 64 CPUs, we'll size the pause pods to request 32 CPUs. This means that, well, maybe a bigger node might have multiple pause pods running on it, but that's fine; we can tweak our overhead parameters to accommodate that.
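Under the same illustrative assumptions as the earlier sketch, the generic version just drops the node selector and sizes its requests to the smallest instance type in the cluster:

```yaml
# Generic pause pod template (illustrative sizes, not our actual numbers).
spec:
  priorityClassName: overhead-pause-pods
  # nodeSelector removed: these pods can now land on any instance type
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "30"       # roughly the smallest node type in the cluster (e.g. 32 vCPUs),
        memory: 56Gi    # so larger nodes may simply hold more than one pause pod
```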
Now, the only sort of remaining concern here is: what do we do about our SLAs around scheduling? As you recall, 98% of the time we're evicting these pause pods in order to make room for new workloads that are coming in, and what that meant is that most of the time our services were getting scheduled in one to two seconds. We really wanted to make sure that, as we move into this multiple instance type world, we're able to maintain that. This becomes a problem if we start consolidating things like our GPU clusters or some high-memory workloads. If we try to mash all of those different things into a single cluster, you might have a job that comes along requesting 500 gigs of RAM, and none of our pause pods request 500 gigs of RAM. So that job might have to wait five minutes or ten minutes or however long it takes for a new node to come up and join the cluster, and that could be a regression in performance.

So the solution that we've identified (this is a piece that we haven't implemented yet) is that for jobs like GPU workloads or high-memory workloads, we're gonna build in a shadow capacity mechanism, so that if you've got a specific service that has unusual resource requests, essentially you're gonna get pause pods that are specific to that service, so that you can maintain those scheduling guarantees.

All right, so let's move on to the second piece that we're introducing: our custom expander plugin. This is in progress; we have a really basic proof of concept right now, but we expect this is gonna get a lot more complex in the future. What we observed is that cluster autoscaler is performing a lot of common tasks, and we really wanna keep all of that functionality in cluster autoscaler. It does things like talk to the cloud provider, it runs this simulation where it determines all of the different places that a pod could go, and it does a lot of really nice things for us. What it doesn't necessarily do is capture the business logic that we want to use to determine which node types to spin up.

So how does cluster autoscaler actually pick a node type to spin up? Well, first it does a filtering step where it identifies all of the node groups that could potentially run the incoming pod. Then it passes those node groups off to what it calls an expander. There are a number of expanders that are built in. There's the random expander, which just picks one of those node groups at random. The one that we use right now is called the priority expander: you rank all of your node groups by priority, and it picks the highest available priority. There are a few others, but what we predict, and what we're already seeing, is that we're gonna want even more control over the node groups that get selected. What we didn't wanna have happen is that we maintain a fork of cluster autoscaler that has all of this business logic. We want to be good CNCF citizens and be able to contribute some stuff back upstream, and we also want to make our lives easier. And it turns out that, with this custom expander plugin, I think we can do all three of those things.

So what we're gonna do is patch cluster autoscaler so that it can communicate over gRPC with a separate process, and that separate process is gonna encode all of our expander logic. It's gonna be all of the business logic around what instance types we currently want to run. As Evan mentioned, we wanna start dipping our toes into the spot world, and so this custom expander can do things like look at the prices for all of the spot markets that we're in and maybe pick one that's cheap right now, or one that we think has a lot of availability. It can do things like: well, we've got a whole bunch of batch jobs that are gonna be requesting a bunch of memory, so maybe for bin-packing reasons we should request some instances that have more memory. All of that business logic can get encoded in this expander that is separate from cluster autoscaler itself. This is nice for a couple of reasons. It means that, A, we're not maintaining a fork of cluster autoscaler, and B, we can upgrade the two out of band. If our AWS contract changes and we need to fix some of that business logic, we don't have to deploy a new version of cluster autoscaler; we can just deploy a new version of this service that we're running. So like I mentioned, this is in progress.
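As a point of reference, the built-in priority expander we run today is driven by a ConfigMap roughly like the one below; the node group name patterns are made up for illustration. The gRPC expander essentially moves that ranking decision out of a static ConfigMap and into a service we can change independently.

```yaml
# Illustrative config for the built-in priority expander; the node group name
# patterns are hypothetical. Higher priority values are preferred whenever
# several node groups could fit the pending pod.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*m5d-12xlarge.*
      - .*c5ad-16xlarge.*
    10:
      - .*c5d-18xlarge.*
```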
We have a work-in-progress PR for that gRPC expander, which I'll link at the end. But this is really going to be where we control the composition of our clusters. So with that, I'm gonna hand it over to Evan to talk a little bit more about some of the organizational issues we've encountered.

Yep. Thanks, David. Yeah, so David covered some of the technical issues; this section will be a bit more around the organizational issues. This chapter is titled "in which customers are concerned." Because we were coming from single instance type clusters, service owners had the expectation of identical pod performance across different machines. Another issue we thought about was that our service owners have a lot of migration fatigue. We currently have nearly 1,000 services, every infrastructure config change leads to a migration, and every migration wastes a lot of engineering time.

Our first solution here was talking to people. We sent out surveys and talked with a lot of customer teams. We realized there were actually quite a lot of nebulous concerns around performance, without the teams knowing exactly what performance requirements they had for their service, and most of them hadn't really done any performance testing or performance tuning either. They only came to us with relative comparisons to their current performance, whatever that performance may be. Out of these talks we created an extensible API, which is kind of twofold. It allows customers to specify specific instance types if their workloads really are sensitive to the underlying node type. And it is opt-out: the default is opted in, meaning service owners would have to explicitly opt out. This solves a lot of the migration issues, because we expect the defaults to work for almost all services. It also allows for additive requirements when service owners are describing the instance types they want, and it is therefore extensible for our future work, like I said before, such as utilizing spot instances.

To try to address customer concerns, we did a bit of performance testing for these new instance types. We started by running some benchmark tests: we ran a variety of sysbench synthetic tests and found performance between the instance types to be within 20%. But obviously that isn't good enough for production workloads; we needed to use real workloads to test this. So we took services' test environments with production traffic replay and scheduled them onto different empty nodes of all the instance types we were testing, to compare the latency metrics. And we did this for a representative set of the service profiles that we have: across different languages (Java, Ruby, and Node), different latency requirements and SLOs, and across different workload types, so we had some CPU-intensive services and some IO-bound services.

Here's an example of a very latency-sensitive service that we have. The different colors here, orange and purple, represent the different instance types, and each grouping is the P95 and P99 of the latencies. You can see they're really, really similar. We actually found this to be the case for all of our services and all endpoints: performance was within 10%, which we deemed acceptable enough to move forward. And yeah, so as we started the rollout, we found quite a number of very technical issues with cluster autoscaler.
We don't have time to cover them all in this presentation, but we'll deep dive on one of them and then link the bug fixes and issues we filed upstream at the end. I believe these slides are uploaded, so y'all can take a look if you're interested.

So the problem we saw was that during our rollout, certain ASGs weren't launching. This was kind of strange because the logs, when we checked them, showed pod topology spread failures. If you're not familiar, pod topology spread here just guarantees that pods are spread evenly across availability zones. Here's a screenshot of the logs. You can see, if it's not too small, that the m5d.12xlarge ASG in us-east-1b fails the pod topology spread constraints, and then later on in the logs the c5ad.12xlarge in the same zone ends up succeeding and being chosen to scale up. This is really weird, because it's exactly the opposite of what pod topology spread is trying to do. And then this graph shows our gradual introduction of these new instance types into one of our clusters. You can see the c5ad.12xlarges at the top in blue gradually taking up a larger and larger proportion during our rollout, but the c5ad.16xlarges and the m5ds aren't launching at all.

After a while, we realized that the ASGs that weren't launching were the empty ASGs; in this case, the m5d ones and the c5ad.16xlarges. And this was kind of strange to us, because both of those ASGs happened to be at the very top of our expander priority ladder, and we'd expect them to launch before the other ones. So we did quite a lot of code digging, and we learned a couple of things about cluster autoscaler.

So what's going on? We have a cluster here with four ASGs, two of which are populated, the c5d.18xlarges and the c5ad.12xlarges, and the two on the right are empty. And we also have cluster autoscaler, which adds nodes to the cluster when incoming pods come in and there's no capacity. Cluster autoscaler runs a scheduling simulation when this happens: it looks at the possible node types to expand and checks whether those node types will fit the scheduling requirements of the incoming pod. So how does this scheduling simulation work when evaluating whether a pod will fit? Well, it looks at the available node types it has, in this case the four on the bottom, and it creates fake nodes, populating a NodeInfo object for each type of node, and it checks whether that fake node will satisfy the scheduling constraints. But that begs the question: how does it populate this NodeInfo object? Well, it turns out there are two paths. For ASGs that have live nodes in the cluster, it copies the node info directly from an existing node, in this case the two on the left. But for the ones that are empty, we have these things called ASG templates: whenever cluster autoscaler chooses to launch a new node from an empty group, it takes this ASG template, which describes the node config before Kubelet starts on the node. This seems fine, right? And in most cases it is fine, but Kubelet actually applies quite a lot of labels and annotations to a node on startup, including the architecture, the OS, the hostname, the instance type name, and, most importantly for this scenario, the topology zone key. This means that ASG templates don't have that topology zone key, which explains the behavior we were seeing: the fake nodes that were built from the ASG templates don't have the topology zone key, while the ones copied from existing nodes do have that key, because Kubelet has already started on those existing nodes.
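To ground what the simulation is checking, the constraint and the node label involved look roughly like this; the names and values are illustrative.

```yaml
# Illustrative pod snippet: spread this service's replicas evenly across zones.
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # the node label the simulation checks
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-service                        # hypothetical label
---
# Abridged view of a registered node; fake nodes built from a bare ASG template
# were missing the zone label below, so they failed the constraint above.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.ec2.internal               # hypothetical
  labels:
    node.kubernetes.io/instance-type: m5d.12xlarge
    topology.kubernetes.io/zone: us-east-1b
```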
And thus we see the behavior in the logs, where the existing node type, even though it's a lower priority than the empty ones, is the one that gets chosen. Our short-term fix for this is to just manually add the topology zone key to the ASG template. This is pretty counterintuitive behavior, because you wouldn't expect to have to add that, since Kubelet does it for you. The long-term fix here is to modify cluster autoscaler to automatically add all of those Kubelet-applied labels to the ASG templates when it's evaluating them. Yeah, so David, do you want to talk about what we learned?

Sure, thanks, Evan. So what did we learn from this whole process? Well, the short summary is that heterogeneous clusters are really great; we should do more of those. But you might be asking what's so great about them. Well, I think the biggest benefit that we've gotten from all of the work that we've done so far (and we're still kind of in a half-finished state and we're already seeing this benefit) is that it gives our team a huge amount of flexibility. We've gone from, as Evan mentioned, taking years to migrate from older EC2 instance generations onto the most recent ones, to being able to do that with a single config change in a couple of hours. Even earlier this year, it took several weeks just to be able to add a new instance type into our cluster, and that's now a couple-hour task at most. So it gives us a huge amount of flexibility to respond to changes in our demands, in the cloud provider markets, you name it. We can now handle all of these different changes without really overburdening our team. So that's great.

The second benefit is that this allows us to future-proof our systems a lot more, and this really boils down to just being able to add new instance types as AWS makes them available. The third thing is really around cost. Historically, we just picked an instance type and went with it, because that seemed like the thing to do. Now we can actually critically analyze whether we're getting the most bang for our buck with the instance types that we have in our cluster. Evan and I have hinted several times that we also wanna start incorporating spot instances and all kinds of other stuff; we're investigating AMD, we're investigating Graviton, all of these different things. We're now going to have the capability to make these trade-off decisions, where we wouldn't have been able to do that before.

So what's next? We still have a number of changes to cluster autoscaler that we're trying to get upstreamed; I'll have a list of those on the next slide. We're still in the process of rolling this out as well. Like I mentioned, some of these things are still in progress, and we only have a handful of our 50 or 60-odd clusters running in a multi-instance-type setting right now, so we wanna roll this out to the rest of Airbnb. And then our next big project after that is running on spot instances; this was really a prerequisite for that. And yeah, so this is a list of existing PRs and issues that we filed against cluster autoscaler. These are on the slides that are uploaded. I think there's gonna be one or two more coming that we just discovered in the last couple of weeks. And with that, I think I will conclude. I wanna thank everybody again for attending. I'm happy to take any questions, as is Evan, and I just wanna put out a shameless plug that we are hiring.
So if any of this stuff sounds interesting to work on, you can come talk to us after the presentation. So thanks very much.