So maybe some of you, maybe all of you, have your own Apache Mesos cluster running. Everything is going great. That's awesome. And maybe you've got applications running in production as well. So let's ask a few questions. Are you able to guarantee capacity for all your applications exactly when they need it? Are you able to optimize placements, if that's important to you? Maybe there are some batch workloads that care about affinity, or locality, or anti-locality, right? Do you keep your cluster size elastic, or do you provision for peak? If you're in a data center environment, that's probably what you do. And are you able to optimize your maximum footprint? Well, these are the questions I'll try to address today.

My name is Sharma Podila. I've been at Netflix for about four years. I'm part of what's called the edge engineering organization, and I've worked on a couple of projects that are relevant to Mesos. One is called Mantis and the other Titus; I'll briefly touch upon them. We also open sourced the Netflix Fenzo scheduling library. Before that, I spent time creating batch schedulers.

So this is what we're going to cover. I'll start with a brief context on what we were trying to solve, what we started out to solve. The title of my presentation has the word juggle: why do we need to juggle? We'll take a look at a few challenges that are specific to large clusters, then dive deep into what we've created and how it works for us, share some details with you, and then take a peek at what we're thinking about working on next.

So when we started, about three and a half years ago, one of the projects that was looking at Mesos was called Mantis. It's a cloud-native, reactive stream processing system. It was built to answer certain real-time questions like: the Netflix service is running okay, but maybe just a handful of users in one corner of the world are not able to stream a few episodes of a show, which gets lost as noise in the overall availability signal of the Netflix service. So it does high-cardinality anomaly detection, things like that. And for this we built a scheduler on top of Mesos in order to run containers that are doing stream processing. There's a lot of interactive exploration of the data that users can dig into by adding new facets and querying what's happening in real time.

Around the same time another project started, called Titus, and that's a Docker container orchestration system. It runs on top of EC2, specifically in the VPC environment, and it integrates into the rest of the Netflix microservices ecosystem: discovery, monitoring, software-based load balancing, and things like that. And Spinnaker, our CI/CD pipeline and common UI, makes application deployment easy with all its pipelines.

So both of these projects were starting out around the same time, and what we saw was that what we needed from a cluster manager was support for a heterogeneous mix of workloads, because the workloads are going to vary not only in how many CPUs, how much memory, how much network bandwidth, and so on each task might ask for, but also in how critical they are. We have a mix where some are more critical than others, and their runtime profiles differ: some of them are short-running batch tasks, some are long-running batch, and some are perpetually running services. And then the resource demand varies over time.
Netflix usage is definitely dependent on the time of day in terms of how many active users there are, and that means Mantis has a different amount of data to process; it actually varies by about 5x from peak to trough. Similarly, the number of containers running in Titus also varies over time, especially for batch systems where work comes in sporadically in large bursts and then just goes away once it's done.

So, okay, all that sounds fine, but why juggle? We need to get one thing out of the way, which is: if you had unlimited resources, I can tell you there's no need to juggle. Life is simple. And I didn't say infinite, because we just need practically unlimited, as in more resources than we have demand for, right? If we had that, there's absolutely no need to juggle. It's very simple. But the other question that comes up, especially for us, is: since we run on the Amazon cloud and it's an elastic cloud, don't we have an unlimited supply of resources whenever we need them? We just go to the cloud and ask for more, so why juggle? It would seem like that's the case. And if you were to represent that as demand versus supply and look at the aggregate of all the resources we use, maybe it does look like that: the demand is less than the supply, so everything should be fine. But when you start looking at it, there's not just one server type in the cloud. There are many different instance types, right? And then it looks like, hey, for some instance types the demand is less than the supply, but for others it's actually the opposite. So we do have certain workloads that need certain resources for which the demand is higher than the supply. So there is some demand-versus-supply juggling that we will have to do.

The other is efficiency. Some workloads need a lot of memory. By the time you pick an instance type that can satisfy that, you're automatically getting many CPUs that the application is not going to use, so you start seeing holes in utilization. Can we do something about that and improve the efficiency?

And then there are different workload types, not only in criticality. There are a lot of pre-computes that are done for production workloads to do data lookups. There's experimentation, there's testing, and then there's something I call the idle soak, which is: this is not a super critical workload, but if I can run more of it, it helps me maybe converge my algorithm or improve my confidence intervals, so if there are any idle cycles, just run this, right? There are all kinds of workloads.

So, okay, are there any specific challenges to large clusters? By large, I don't necessarily mean only the number of servers, the number of agents that you use. If that's large, well, that's a large cluster, but large is also in terms of how many users you have trying to do different things, how many different use cases you have, and what the differences are in the kinds of resources the jobs need. The variety could be large, the number of servers could be large, or the number of users could be large.

So when you start looking at scheduling, you can think of doing a first-fit kind of assignment, and that keeps things pretty simple. Given the right environment, it can actually be very fast; you can probably achieve constant-order time complexity for first-fit assignments. However, if you start doing the most optimal assignment, that can be computationally expensive. And even if you were not to reassign already running workloads, just figuring out assignments for the pending workload can be computationally expensive.
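To make that trade-off a bit more concrete, here is a minimal sketch contrasting a first-fit assignment with a fully scored best-fit assignment. It's illustrative only: the types and the scoring function are invented for the example and are not how Fenzo implements this.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch: first-fit vs. best-fit task placement.
public class PlacementStrategies {

    record Host(String name, double freeCpus) {}
    record Task(double cpus) {}

    // First fit: take the first host with enough room. With a suitable index of hosts
    // by free capacity, this can approach constant time per task.
    static Optional<Host> firstFit(Task task, List<Host> hosts) {
        return hosts.stream()
                .filter(h -> h.freeCpus() >= task.cpus())
                .findFirst();
    }

    // Best fit: score every feasible host and take the highest score.
    // Better placements, but the cost grows with the number of hosts evaluated per task.
    static Optional<Host> bestFit(Task task, List<Host> hosts) {
        return hosts.stream()
                .filter(h -> h.freeCpus() >= task.cpus())
                .max((a, b) -> Double.compare(score(task, a), score(task, b)));
    }

    // Example score: prefer the tightest fit, i.e. the host whose free CPUs the task uses up most.
    static double score(Task task, Host host) {
        return task.cpus() / host.freeCpus();
    }

    public static void main(String[] args) {
        List<Host> hosts = List.of(new Host("a", 8), new Host("b", 2), new Host("c", 4));
        Task task = new Task(2);
        System.out.println("first fit: " + firstFit(task, hosts)); // Host "a" (first with room)
        System.out.println("best fit:  " + bestFit(task, hosts));  // Host "b" (tightest fit)
    }
}
```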
Ideally, in the real world, we want to stay somewhere in between: as close as possible to the speed end, while getting assignments as close to optimal as possible, toward the other end of the spectrum. That's our goal. And speed is important for various reasons, but two things I like to point out are: if a scheduler is slow, then while it's figuring out assignments there are servers becoming idle, and efficiency is lost. And one could argue that if the assignments were figured out assuming those servers were full, and in the meantime they become empty, the schedule is actually incorrect now. So speed is definitely a concern.

So with all this in mind, our initial goals were that we need a cluster manager and a scheduler that can take multiple scheduling goals into account and achieve them. We need to be auto-scaling our cluster size, as we do with a lot of things at Netflix, trying to be cloud-native and use, and pay for, only the resources we need. And extensibility. Extensibility is something that, well, I've learned, and a lot of people have learned: if you keep adding one new feature at a time into a scheduler, after some time it feels like you're adding a Band-Aid over a Band-Aid. Instead, it's easier to design a scheduler that is easy to extend, so that has to be a goal from the beginning. But very soon we figured out there were a few more goals we needed to add. Security: if we are going to have multi-tenant workloads, different applications, then we need a good security story around that. Capacity guarantees. And can we reason about failures to launch when operating the system? So let's look at each of those.

As far as multi-goal scheduling objectives go, we can think of four different camps that have an influence on how you should do resource assignments, right? The data center or cloud operator is interested in making sure they can do maintenance, keep the cloud footprint small, and so on, so they want the scheduler to behave in certain ways to achieve their goals. The application owner, on the other hand, is interested in making sure their application gets the best performance possible: if I can get a certain performance by buying my own machine and running my application on it, then running it in your system should have similar performance characteristics, if not better, right? High availability is another concern: don't put all my tasks on the same machine, same rack, and so on. And the security concern is that, at least at Netflix, we have certain applications with certain security profiles that can access only certain other applications, right? Everybody's got that security problem. So if you're going to put multiple applications on the same host, you still need to guarantee that that holds. Cost is an interesting one; cost could mean many different things. It could mean, hey, use the cheapest instance size possible to get the work done. It could mean, in a data center, make sure your power and cooling costs are under control — you can actually schedule things to save on power and cooling. Things like that.

So, auto scaling. Like I said, Mantis has 5x or more peak-to-trough variation in the amount of data to process, and we don't necessarily want to size our cluster for the peak.
Similarly, Titus has hundreds or thousands of containers running at different points in time. It turns out scaling up is, well, not super easy, but relatively easy. You can, for example, watch for metrics that tell you, here's the demand, and it's getting close to your cluster size; if the buffer is very small, then you scale up. So scale-up can be a little easier. But it's scale-down that's harder, and it requires you to do bin packing. In the first row of these four hosts, I've got a similar number of CPUs being used, about 50%, but they're spread across all the machines, and if I need to scale down, I'll have to move those tasks somewhere else. But if my scheduling instead took that into consideration and did bin packing, then it leaves two machines completely idle and it's easy for me to scale down. That's the basic idea, and there's a small sketch of it at the end of this passage. There are other aspects to it: batch workloads complete at different points in time, so can you actually look at that, estimate the expected runtime, and place tasks together so that the entire machine frees up at about the same time? There could be more sophistication, but this is the basic idea.

And for security, what we find right now is that we use security groups on Amazon for our applications. Some applications have a single security group, some have multiple, and there could be a mix of them on a host. We need to make sure they are isolated.

Then we look at capacity guarantees. What we're seeing is that there are dozens of applications that are going to run on the cluster; how can we make sure they have the right capacity when they need it? When people think about doing this, there are, broadly speaking, two ways to think about it. One is priority, which is on the right-hand side of the slide, and the other is some kind of quota, to say I'm going to give more quota to the more important workload. When you look at quotas, it's often a static fragmentation of the resources, to say the critical tier, for example, gets 75% and the other one — flexible, or flex for short — gets 25%. But what happens when there's a surge in the critical workload? If it's limited to 75%, well, you're in trouble, because that is the critical workload. So then you need to vary the quotas, maybe. That's one problem we could run into. And priorities may seem to work, because first you make sure the highest-priority workload runs and then the others. But that can starve resources for the flex tier. Even though the flex tier is not critical — for example, it's not user-facing services — you still need to get batch workloads done on time. If you starve them, it's not going to affect you today; it's going to affect you tomorrow, maybe. So there's still a certain throughput you need to guarantee for batch workloads, even though your critical workload is more important. So strict priorities alone are also not going to work. Those are some of the thoughts we go through to guarantee capacity.
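Going back to the scale-down point, here is the small sketch promised above. It's purely illustrative — the numbers are made up to mirror the four-host example — but it shows that at the same overall utilization, a packed placement leaves whole hosts idle and therefore terminable, while a spread placement leaves none.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: why bin packing helps with scale-down.
// With tasks spread evenly, no host is empty; with the same tasks packed,
// some hosts end up completely idle and can simply be terminated.
public class ScaleDownCandidates {

    // host name -> number of tasks currently placed on it
    static List<String> idleHosts(Map<String, Integer> tasksPerHost) {
        return tasksPerHost.entrySet().stream()
                .filter(e -> e.getValue() == 0)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        // Four hosts, eight single-CPU tasks, roughly 50% utilization either way.
        Map<String, Integer> spread = Map.of("h1", 2, "h2", 2, "h3", 2, "h4", 2);
        Map<String, Integer> packed = Map.of("h1", 4, "h2", 4, "h3", 0, "h4", 0);

        System.out.println("idle with spread placement: " + idleHosts(spread)); // []
        System.out.println("idle with packed placement: " + idleHosts(packed)); // [h3, h4] (order may vary)
    }
}
```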
And for failures, I'm going to talk specifically about the failure of a job to get launched. Somebody submits a workload and it's not running. You've got thousands of jobs submitted; how do you know why this particular job is not running? If it's not running, what resources is it asking for that are not being satisfied? And how many servers are even available to satisfy it but are failing at the moment? These are some of the questions we want to be able to answer.

So with that context, what did we create? Here's the overall architecture of what we have. We're on EC2, we've got Mesos running there, and on top of that we are running two frameworks, one called Titus and the other called Mantis, which I spoke about before. Both of them have abstractions for jobs. Jobs are a collection of tasks, and there are a few different types of jobs: batch, service, and then stream processing jobs as well. And both of these frameworks use the Fenzo scheduling library that I referred to before.

So let's look at this in a little more detail. The scheduling strategy in Fenzo is that when we look at tasks and the potential list of agents we could run them on, some of the agents may, at the moment, fit a task perfectly, and some may fit it okay — not bad, but not a perfect fit — and there's a whole spectrum there. Also, some tasks are urgent: maybe they need to complete right away, or they're a critical service type that needs to run right away, and some of them are not so urgent. So the general strategy in Fenzo is that if a task is either urgent or it fits perfectly, let's go ahead and assign it; if not, let's keep it pending and look for other agents where it may fit better. That's the general idea.

Fenzo is actually open source, and it can be used by any scheduling framework — any Mesos framework that runs on the JVM. It provides extensibility: you can write plugins, which are basically simple functions that you hand to Fenzo. It gives you cluster autoscaling, if that's important to you. There are tiered queues with weighted DRF within a tier. It gives you control over the trade-off between speed of assignments and optimality. And, importantly, it also gives you ease of experimentation: if you want to write new plugins or new scheduling algorithms, before you put them out into the wild you can take a snapshot of your production workload, run it through Fenzo, and make sure the new plugins actually work well before you put them into production. So it lets you experiment. That's the URL where you can find it.

This is roughly the algorithm that Fenzo uses for each ordered task — and we do have queuing. For all available hosts, we start by validating what we call hard constraints; hard constraints must be met for the task to run on that host. Then we evaluate scores for what we call a fitness function and for soft constraints. We don't need to do this on every agent; we only need to do it until we find a fitness score that we think is good enough, and that's where the speed comes from. If we relax the criterion a little — hey, a lower score is good enough, I'm okay with that — then you get speed. If you tighten the criterion so the score needs to be high, then you get more optimality. So you have that control. And then we pick the host with the highest score among those evaluated. The constraints, both hard and soft, and the fitness functions are the plugins that you write. Fenzo comes with a few built-in ones for bin packing, but you can write any fitness function you want. So what do we use in our system? We definitely use the CPU, memory, and network bin packing that Fenzo already comes with.
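As a rough illustration of the per-task evaluation loop just described — hard constraints filter out hosts, a combined fitness and soft-constraint score ranks the rest, and a "good enough" threshold trades optimality for speed — here is a minimal sketch. This is not Fenzo's actual API; the class and method names are invented for the example.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;
import java.util.function.ToDoubleBiFunction;

// Illustrative sketch of the per-task evaluation described in the talk.
// Not Fenzo's real API; the types and names here are invented for the example.
public class TaskEvaluatorSketch {

    record Task(double cpus, double memoryMB) {}
    record Agent(String name, double freeCpus, double freeMemoryMB) {}

    final List<BiPredicate<Task, Agent>> hardConstraints;  // must all pass
    final ToDoubleBiFunction<Task, Agent> fitness;         // combined fitness + soft constraints, 0..1
    final double goodEnough;                                // stop early at this score (speed vs. optimality knob)

    TaskEvaluatorSketch(List<BiPredicate<Task, Agent>> hard,
                        ToDoubleBiFunction<Task, Agent> fitness,
                        double goodEnough) {
        this.hardConstraints = hard;
        this.fitness = fitness;
        this.goodEnough = goodEnough;
    }

    Optional<Agent> assign(Task task, List<Agent> agents) {
        Agent best = null;
        double bestScore = -1;
        for (Agent agent : agents) {
            // Hard constraints must all be met for this agent to be considered at all.
            if (!hardConstraints.stream().allMatch(c -> c.test(task, agent))) continue;
            double score = fitness.applyAsDouble(task, agent);
            if (score > bestScore) { best = agent; bestScore = score; }
            // Lowering goodEnough gives speed; raising it gives more optimal placements.
            if (bestScore >= goodEnough) break;
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        var evaluator = new TaskEvaluatorSketch(
                List.of((t, a) -> a.freeCpus() >= t.cpus() && a.freeMemoryMB() >= t.memoryMB()),
                (t, a) -> 1.0 - (a.freeCpus() - t.cpus()) / a.freeCpus(),  // tighter CPU fit scores higher
                0.9);
        var agents = List.of(new Agent("a1", 16, 64000), new Agent("a2", 4, 8000));
        System.out.println(evaluator.assign(new Task(4, 8000), agents));  // picks a2, a perfect fit
    }
}
```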
To understand how the bin packing works, it's very simple; let's take the example of CPU bin packing. When we look at an agent and a task — let's assume these are all single-CPU tasks for this example — we've got five hosts here. Host five is already full, so it's obviously not going to fit the task. For the others, we evaluate the ratio of the CPUs that would be used if we were to launch this task over the total CPUs on the host, and that basically gives us a fitness score for CPU bin packing. Similarly, you can do that for memory and network bandwidth. We also do it for runtime profiles, so we can dynamically segregate short-running tasks from long-running tasks; that helps us in downscaling, like I was saying. And speaking of extensibility, recently we added another fitness function where we can control, or limit, the number of concurrent task starts happening on a machine. This has more to do with the executor side — the whole Docker ecosystem of how many tasks you can concurrently start, downloading images and all of that — so we wanted good control over it and added a fitness function for it.

And then we combine these. You can have multiple fitness functions and compose them into your overall fitness function, and we give them weights to say, hey, bin packing is important to me, but runtime segregation is also important, and so is control over launches, so I'm going to give each of you a different weight. We're still playing with the weights to see what works best for us.

For hard constraints, we use two today. One is GPU server matching, to say that a host that has GPUs will only run a task if the task is requesting GPUs. Simple enough — that's just a hard constraint we put in, and it automatically does that. We also have some resources earmarked for certain kinds of tasks, so we use hard constraints for those as well.

Soft constraints are mostly specified by the users at submit time, and the popular one is to balance tasks across availability zones. Amazon regions have multiple availability zones, and by spreading tasks across them you make the application highly available: even if one zone goes down, the application is still up. So users specify this declaratively, saying balance my tasks across the availability zones. Some services, especially those with a low instance count, also ask us to balance them across hosts. Even though it's a cloud, an instance could die for some reason and come back later, and if one host goes away we don't want too many of those instances to go away with it, especially for small services.

So how do we combine the fitness score and the soft constraints? A soft constraint also gives us a score, saying this constraint is met perfectly, or it's only sort of met — it's okay, the constraint itself isn't failing. So if we have such constraints, we can combine that score with the fitness function. The reason these are separate is that fitness functions are applied more globally to the cluster — the people looking at cluster-level optimization write the fitness functions — whereas the constraints are written by applications, to say, for my application I want this soft constraint. So we combine them, and currently these are the weights we use: 40% for the global fitness, and we give more weight, the remaining 60%, to the user-specified constraints.
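Here is a minimal sketch of that weighted combination, with the CPU bin-packing ratio as the global fitness component and a zone-balance score as the user's soft constraint. It's illustrative only — the names are invented and it is not Fenzo's actual API — but the 40/60 split mirrors what's described above.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch of composing fitness functions and soft constraints with weights.
// Not Fenzo's real API; the names here are invented for the example.
public class WeightedScoreSketch {

    record Candidate(double usedCpus, double totalCpus, double taskCpus, boolean balancesZones) {}

    // CPU bin packing: CPUs used after placing the task, over total CPUs on the host.
    static double cpuBinPacking(Candidate c) {
        return (c.usedCpus() + c.taskCpus()) / c.totalCpus();
    }

    // A soft constraint scored 1.0 when fully satisfied, lower when only partly satisfied.
    static double zoneBalance(Candidate c) {
        return c.balancesZones() ? 1.0 : 0.5;
    }

    record Weighted(double weight, ToDoubleFunction<Candidate> scorer) {}

    static double combined(Candidate c, List<Weighted> parts) {
        double totalWeight = parts.stream().mapToDouble(Weighted::weight).sum();
        double weightedSum = parts.stream()
                .mapToDouble(p -> p.weight() * p.scorer().applyAsDouble(c))
                .sum();
        return weightedSum / totalWeight;  // normalized weighted average
    }

    public static void main(String[] args) {
        List<Weighted> parts = List.of(
                new Weighted(0.4, WeightedScoreSketch::cpuBinPacking),  // global fitness
                new Weighted(0.6, WeightedScoreSketch::zoneBalance));   // user soft constraint

        Candidate packedButUnbalanced = new Candidate(7, 8, 1, false);
        Candidate emptierButBalanced  = new Candidate(2, 8, 1, true);
        System.out.println(combined(packedButUnbalanced, parts)); // 0.4*1.0   + 0.6*0.5 = 0.70
        System.out.println(combined(emptierButBalanced, parts));  // 0.4*0.375 + 0.6*1.0 = 0.75
    }
}
```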
Fenzo also has a multi-tiered queue setup. You can theoretically have any number of tiers, but in practice it makes sense to have a handful. Today we use two tiers, the critical tier and the flex tier. And what we mean by critical and flex is that you can have important applications running in either tier. What we mean by critical is that it's critical for the task to be launched right away, whereas for the flex tier it's okay to launch tasks with a little bit of delay; the work is still important. So, for example, a user-facing microservice that's latency sensitive will be in the critical tier, whereas a batch workload that's okay finishing in the next hour, or overnight, will be in the flex tier. They both need to be guaranteed some capacity; the difference is how quickly they need to be launched. And within each tier we have multiple applications, so multiple buckets, and we do DRF — dominant resource fairness — across those buckets. It's a weighted DRF, so one application may have more weight, and therefore a larger share within the tier, than another application. When Fenzo looks at the queued tasks, it goes through tier zero first, then tier one, and does weighted DRF within each; if there were more tiers, it would go down each tier in order.

We also wanted to make sure the interface for users is simple. Users have to do two things. One is, when submitting workload, they specify an application name. And then, separately and one time, they specify how much total resource this application will need, in the form of: here's the container size I anticipate — it's a four-CPU, eight-gigabyte container — and I think I'm going to run about 120 of them, or a thousand of them, whatever the number is. Once you give that to the system, it's able to look at it and say, okay, I'm going to guarantee this capacity, and based on that I'm going to choose instance types — in this case, instance types that have at least four CPUs. Otherwise I can't guarantee that your container will ever fit, even if in aggregate I have more CPUs than 120 containers need, right? We also want to make it easy for people who want to experiment or who are running really small services, so there's no overhead in creating these SLAs: there's a default catch-all bucket, and we give it some buffer so experimentation runs fine there. Once experimentation is done and they want to go into full production with guaranteed capacity, then they set up these SLAs.

We don't need to go into too much detail, but what we're trying to show here is that if we had, say, dozens of applications, each of which has given us those SLAs, how do we figure out how to size our cluster? Especially since we are going to be auto-scaling the cluster size itself, how do we figure out how many instances we need at any moment? We do that separately for each tier. For the critical tier, since we need to guarantee near-instant launch times, we sometimes end up having some idle capacity there, because if a service spins up more instances we can launch them right away. We still autoscale, with some logic that figures out the buffer we can keep while still guaranteeing near-instantaneous launch times. Whereas for the flex tier, we keep the current size of the cluster exactly where the demand is. That's what we mean by: if you launch more tasks in the flex tier, we will launch them — we're going to guarantee your capacity — but you may have to wait for us to spin up a new VM underneath in the EC2 cloud. So that's where the difference is.
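As a rough illustration of how sizing can differ between the two tiers, here is a minimal sketch. The numbers (containers per agent, the 20% buffer) are invented for the example and are not Netflix's actual policy; only the idea — headroom for the critical tier, demand-tracking for the flex tier — comes from the talk.

```java
// Illustrative sketch of sizing each tier of the agent cluster: the critical tier keeps
// some idle headroom so new tasks launch immediately, while the flex tier is sized to
// current demand and may wait for new VMs to spin up.
public class TierSizingSketch {

    // How many agents does a tier need, given demand expressed in containers?
    static int desiredAgents(int containersNeeded, int containersPerAgent, double headroomFraction) {
        int forDemand = (int) Math.ceil((double) containersNeeded / containersPerAgent);
        int headroom = (int) Math.ceil(forDemand * headroomFraction);
        return forDemand + headroom;
    }

    public static void main(String[] args) {
        // SLA example from the talk: 4 CPU / 8 GB containers, about 120 of them.
        // Suppose the chosen instance type fits 8 such containers per agent.
        int containersPerAgent = 8;

        int critical = desiredAgents(120, containersPerAgent, 0.2); // keep ~20% idle buffer
        int flex     = desiredAgents(120, containersPerAgent, 0.0); // track demand exactly

        System.out.println("critical tier agents: " + critical); // 15 + 3 = 18
        System.out.println("flex tier agents:     " + flex);     // 15
    }
}
```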
So, reasoning about failures. This slide has small-font text, and it's okay if you can't read all of it. The point is that on our control plane we have an endpoint that gives you the failures for a task that is sitting in the pending state and not getting launched. This is just raw JSON output from it. What it's basically saying is: here are the hard constraints that are failing for this task. In this example it says, well, you're not asking for GPUs, so this task is not going to run on these machines, which are GPU-based machines. And the first one, about only running on certain agent clusters, which gives you two cluster types, is basically saying that, like I was referring to before, certain kinds of tasks are earmarked to run only on certain machines; that's how it's set up for us right now, and that's why the task is failing that constraint. And beyond those constraints, there are machines that satisfy the hard constraints but are still failing because they don't have enough resources to run this task right now. So it tells you, for each of those, that the memory requirement is failing on these machines and by how much, and the same for network bandwidth and for CPUs. It's been useful for the cluster operator to figure out why a task is not running at the moment.

So, as I'm coming towards the end of my presentation, what are we thinking about next? There are several things, but I wanted to highlight three here. One is task evictions. Some people like to call them preemptions rather than evictions; they're similar, with some differences, but regardless, right now we like calling them evictions. By task evictions we mean: here are some idle resources that need to be up and running to guarantee capacity, but maybe I can run some other lower-priority workload on them right now and then evict it later. That's one example. Similarly, when we have weighted DRF within a tier, if one application is not running as much, we can let the other applications run more and then evict them later to balance out the weighted DRF algorithm. Those are two examples of where evictions could be useful — there's a small sketch of the first idea below. Evictions could also be useful because of noisy neighbors: you thought two applications would do okay on an instance, but you notice there's an impact on one of them, and then you can evict and maybe migrate some of that workload somewhere else. So there are lots of interesting discussions there. Some of you have probably thought about this, maybe have solutions; we would love to talk to you, in the Q&A after this or after the session, any time.

Okay, so I touched on noisy neighbors already. Besides evictions, I think one of the challenges we have with noisy neighbors is that we're not sure yet what to monitor: how do we measure the effects of noisy neighbors? We have some ways of looking at them, and we've come up with, okay, let's keep these applications apart, but we want to be able to do that dynamically. How do we guarantee that a service workload will meet its latency SLAs, and similarly throughput SLAs for batch workloads? I think there's a lot of work that could be done there that would be very useful.

Automated rollout of new agent code is another pain point for us at the moment. When you've got lots of agents already running and you want to introduce new agents with new code, I wish there were an automated way of doing rolling upgrades, one that can take into account the specific measurements we would need to say whether the new code is working or not, right? Some kind of automation there — we're trying to build that. So there's probably more.
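To make the eviction idea a little more concrete, here is the small sketch mentioned above: flex work borrowing idle capacity that is guaranteed to the critical tier, and being evicted when the critical tier claims that capacity back. This is a thought experiment with invented names, not how Titus or Fenzo actually implements evictions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch: lower-tier ("flex") tasks borrow idle capacity guaranteed to the
// critical tier, and the most recently borrowed tasks are evicted when critical work arrives.
public class EvictionSketch {

    private final int guaranteedCriticalCpus;
    private int criticalCpusInUse = 0;
    private int borrowedCpus = 0;
    private final Deque<Integer> borrowedTasks = new ArrayDeque<>(); // CPU sizes of borrower tasks

    EvictionSketch(int guaranteedCriticalCpus) {
        this.guaranteedCriticalCpus = guaranteedCriticalCpus;
    }

    // Flex work may borrow whatever the critical tier is not using right now.
    boolean tryBorrow(int cpus) {
        int idle = guaranteedCriticalCpus - criticalCpusInUse - borrowedCpus;
        if (cpus > idle) return false;
        borrowedCpus += cpus;
        borrowedTasks.push(cpus);
        return true;
    }

    // When critical work arrives, evict borrowers until the guarantee is honored again.
    List<Integer> launchCritical(int cpus) {
        List<Integer> evicted = new ArrayList<>();
        while (guaranteedCriticalCpus - criticalCpusInUse - borrowedCpus < cpus
                && !borrowedTasks.isEmpty()) {
            int freed = borrowedTasks.pop();
            borrowedCpus -= freed;
            evicted.add(freed);
        }
        criticalCpusInUse += cpus;
        return evicted; // these flex tasks would be killed and re-queued
    }

    public static void main(String[] args) {
        EvictionSketch pool = new EvictionSketch(16);
        pool.tryBorrow(8);                           // flex borrows idle critical capacity
        pool.tryBorrow(4);
        System.out.println(pool.launchCritical(10)); // evicts enough borrowers: [4, 8]
    }
}
```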
I wanted to highlight these three, and then I would love to hear from others what else you're working on, or, if you've worked on or thought about some of these, what you have to offer at this moment. So let's open it up for questions — that's all I had. Does this one work? Yeah. Excuse me.

For Fenzo, are you thinking about putting in support for the maintenance primitives, and how to prevent Fenzo from scheduling on something that has a maintenance window, things like that?

Yeah, that would make perfect sense. We are not using the maintenance primitives at Netflix right now because of how we deploy agents — we have more of a red/black kind of deploy for agents — so that's one reason we don't. However, what we are looking at is the case of bad agents, or agents going bad. Some of them are dead on arrival, but more interestingly, some of them go bad after some time. Maintenance primitives may be useful there. I don't think we have a specific use case for maintenance primitives ourselves, but it would make sense, and if somebody wants to work on it for Fenzo, we would love to collaborate and see a PR. Thanks.

If you know for sure what a workload requires in terms of memory, disk, network, things like that, why would you have a noisy neighbor, if you've properly specified what a container is going to need?

It's probably our inability to specify everything an application needs. It's easy to think of it in terms of what we know: CPUs, the amount of memory, and the amount of network bandwidth. But there are caches on the CPUs, there are memory bandwidth issues, the NUMA-ness of the architecture — there are so many other things happening that impact application performance. So running that application in isolation on a machine, versus mixed with other applications while using the same number of CPUs, has performance implications.

There's a question here.

I can speak to the noisy neighbor problem a little bit. Intel's done some research on that. One of the things it's finding in noisy-neighbor scenarios is that oftentimes replicas of the same task, given exactly the same resource requirements, exactly the same everything, can be each other's noisy neighbors. So once you actually know what something requires and you instantiate hundreds of those, they can actually interfere with one another, which is an interesting point. But the question I wanted to ask has to do with whether you've considered applying predictive analytics to determining noisy-neighbor scenarios in the future, or being able to squelch those before they happen.

Yeah, I think that's a good question — have we considered that? I think we've dreamt about it; we haven't solved it. We touched upon an aspect of that in the user panel this morning, so that would be wonderful to have. Yes.

I was curious about the availability-zone balancing being a soft constraint, and whether that can allow services to end up deployed in a way that's not highly available, if that constraint happens not to be satisfied and a task gets scheduled on a node anyway, even though it's not in another availability zone.

Yeah, that's an excellent catch. It's a soft constraint, not a hard constraint, and that's deliberate for one specific reason, which is zone outages. We use three zones right now, and if we were to have a zone outage and you launch a service, then one third of your service would not be up and running if it were a hard constraint.
That's the reason we make it a soft constraint. The downside is that if a zone is down, then the service is only running across two zones. Right now we haven't solved that completely; what's coming later is dynamically reshuffling tasks to make sure they are well balanced, but we haven't done that work yet.

Fenzo right now has a dependency on Mesos 0.21. What are the challenges to bring it up to 1.4? I think that's the latest Mesos.

So, we run two different frameworks at Netflix. One of them is running on Mesos 1.1, if not 1.2, and the other is running on 1.0-something. When we do the dependency management, of course, that wins over what Fenzo declares. We've not put the later Mesos functionality into Fenzo, so it's able to stay on a slightly older version. If we were to move it to 1.4, I think Fenzo itself would work fine; it does not use any newer Mesos features right now. We haven't moved it because it wasn't needed — there's no other reason. I don't know if you were also implying that there are newer features we need to add into Fenzo, but we haven't done that yet, and we would love to get contributions if somebody's focused on it.

Yeah, mostly just wanting to know the challenges of what it would take to bring the current Fenzo code base into compatibility with the latest Mesos library.

Yeah, it works, thanks to Mesos being pretty compatible across versions and to us not using some of the new features. So it still works; that should not be a problem. Thank you.

I think that's about all we have time for. Thank you very much. All right. Thank you.