So, finally it's done, and we had a really good time at Ruzcon with the rest of the community. I'm really glad that he's back here, and I hope there are more excuses for him to keep coming back. By way of a brief introduction: he has worked at Netflix, where he built cluster schedulers, and he has contributed to open source. He's very passionate about writing about his work, so people who are curious can find his papers, and there's a book in the pipeline, which maybe we'll talk about at some point. Today he's going to talk about this paper that's going to get published. Over to you, Vikram; thanks for doing this.

Thank you. All right, let's do this. I've never done this talk before, and I hope it all works out. Usually I write speaker notes and prepare myself ahead of time, but not this time, so we'll see how this goes.

I'm going to be talking about cluster schedulers today: specifically, the patterns for building cluster schedulers, the patterns that matter for scheduler developers and for operators when they're evaluating a cluster scheduler. The last couple of years, actually more than that, there has been a lot of movement in cluster scheduling. Three or four years back, when I was building cluster schedulers at Netflix, Kubernetes had just come out, and Mesos was the big dog in the industry. Twitter had been using Mesos for a long time, and even before Twitter, the engineers who wrote Mesos had been at Google doing Borg. So there has been a lot of movement in the last few years, and today the conversation is about which cluster scheduler you should use, not whether you should use one at all; I think we're past that phase. The real question is: should I use Nomad, or Kubernetes, or Mesos, or even Docker Swarm? And a lot of the conversations I see are "I use this and it has worked for me", "this is what someone else uses", "we're in production on this many nodes". The conversation I feel we should be having is: how does a cluster scheduler make your life easier as an operations person, as an operator, or as a service owner running a service? What does it actually bring to the table?

So I want to talk about first principles: the problems that anyone is going to face when writing a new cluster scheduler, or when running one in an organization, and through that, the patterns cluster schedulers should incorporate to let you sleep through the night. So, the first one: highly available cluster schedulers. (Audience: does availability apply to the services too?) Yes, and by extension it also applies to services.
What that means is: if you're running something on a cluster scheduler and the scheduler itself isn't available, you'll have other problems in your infrastructure; your service might go down because the cluster scheduler isn't doing what it's supposed to be doing. I'll touch on why that matters later.

The typical software delivery pipeline of the last decade looked like this: you take code from SCM, push it through your build infrastructure, create some artifacts, and then use a deployment automation tool like Chef, MCollective, or Puppet to deploy them onto your infrastructure. Deployment orchestration has been at the heart of the DevOps movement. Whenever you talk to anyone who does DevOps, or any organization adopting DevOps practices, you hear "we use code to deploy our software, we don't deploy anything manually". So the conversation was always about configuration management: what runs on a server, and how do you deploy software? That was the biggest problem of the last decade, because people found it hard to release software, and most of the tooling was about how to make a release and how to make sure that release gets consistently deployed into production.

But deployment, and deployment orchestration in general, is only about 10% of the life of a piece of software. You've deployed, fine; then what happens after that? The answers to those problems weren't really provided by tools like Chef or Puppet or MCollective. Service management was left to supervisors: if a process crashed, something like init.d or systemd, or your favorite supervisor, would restart your service. These tools didn't have runtime hooks into the application either; if your application was performing poorly, no deployment automation tool was going to scale you up. On public cloud there were other things, like auto scaling groups, which did have hooks into the runtime via metrics: the application emitted metrics, and other systems reacted and did stuff.

Most importantly, these tools had a static view of your cluster and your data center. Someone somewhere said you had X machines, and you would always deploy onto those X machines: a fixed subset that would never change. But performance and traffic characteristics change over time, and it was basically up to a human being to understand how things were changing, how the application was changing, and translate that into your cluster topology. And when things failed, an EBS failure, a machine going down, a disk failure, what happened? Your application crashes, your supervisor tries to bring it up again, and it crashes again.
Ultimately those kinds of remediation actions were left to people, to human beings, and that doesn't scale. With one service it scales fine: you can have five people doing an on-call rotation, pagers attached, fixing those problems. But as your organization grows and you do things like microservices, the number of services grows over time. The root problem with having people answer pagers is that you can't hire at the rate your business grows: the amount of talent in the market is never going to be sufficient to continue the practice of people running all the remediation actions.

That's where dynamic cluster scheduling comes into the picture. With a dynamic cluster scheduler, you want software to do the work that people were doing. You want your cluster scheduler to be in charge of all the hardware you have. You want it to understand the services you need to run in your data center, and to take care of the simpler, automatable problems: there's a node failing, so move the service somewhere else. You want a data center supervisor. And as I was saying, service management needs to understand the topology of the cluster in a dynamic manner; that's the crux of the problem. Instead of statically giving feedback to people, you want software to understand the state of the cluster and perform those remediation actions.

So what should a cluster scheduler provide? First, self-healing capabilities: if you want five shards of your application running, it needs to make sure five shards are running. Second, APIs for operators. For example, if a machine is failing, or you don't want to run services in us-east-1a or some particular zone because you know there are problems there, you need the scheduler to give you APIs so you can get out of that zone and stop scheduling any more services onto it. This is very important to remember, and perhaps I should have had a slide about it: people think of cluster schedulers as providing an API to service owners. The whole PR around Docker and Kubernetes has been about making developers faster. That's half the story. The other half, I believe, is helping operators operate the machines you have. Say you had Heartbleed, or the thing going on right now, the KPTI problem in Linux. You have unpatched servers and patched servers. Your service owners don't care; they just want X shards of their application running. But it's up to the reliability people and the security people to make sure no services are running on unpatched machines.
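To make that operator story concrete, here is a minimal sketch in Go. Everything in it (the `Cluster` and `Node` types, the `Drain` method, the kernel attribute) is hypothetical, invented for illustration and not any real scheduler's API; real schedulers expose the same operation through commands like `kubectl drain` or `nomad node drain`.

```go
// Sketch: the operator-facing drain workflow. Hypothetical types, not a real API.
package main

import "fmt"

type Node struct {
	ID       string
	Attrs    map[string]string // fingerprinted attributes, e.g. kernel version
	Eligible bool
}

type Cluster struct{ Nodes []*Node }

// Drain stops new placements on the node, then asks the scheduler to move
// existing work elsewhere.
func (c *Cluster) Drain(n *Node) {
	n.Eligible = false // step 1: no new work lands here
	fmt.Printf("evicting allocations on %s so they reschedule elsewhere\n", n.ID)
}

func main() {
	c := &Cluster{Nodes: []*Node{
		{ID: "n1", Attrs: map[string]string{"kernel": "4.4-unpatched"}, Eligible: true},
		{ID: "n2", Attrs: map[string]string{"kernel": "4.14-patched"}, Eligible: true},
	}}
	// Operator sweep: drain every node still running an unpatched kernel.
	for _, n := range c.Nodes {
		if n.Attrs["kernel"] == "4.4-unpatched" {
			c.Drain(n)
		}
	}
}
```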
So your cluster scheduler needs to provide APIs to your operators, to your reliability engineers: to drain nodes, to drain clusters, to make sure services only run on patched machines, and so on.

The other thing is quality of service guarantees. I'm going to talk about multi-tenancy in a little bit, but in a multi-tenant world, the cluster scheduler needs to provide quality of service guarantees, so that a service uses only the amount of resources it's supposed to be using. And this is kind of interesting: this whole movement started with Docker, and we invented things from the bottom up rather than from the top down. If you're using Kubernetes or Nomad or anything like that, you're probably using a tool like Docker or LXC to provide the quality of service guarantees. In that case the scheduler has to understand when an application is violating its quality of service, when it consumes more than the resources it's supposed to, when it exceeds its quota. So cluster schedulers need to be aware of the underlying resources.

The last part is that cluster schedulers need to provide APIs to higher-level services like DC/OS or Cloud Foundry. Cluster schedulers are fairly low level; your application developers probably need a higher-level API, one that understands the concept of a service, and concepts like racks, and so on.

(An audience member commented that applications shouldn't have too much stickiness on a particular node, too much data or too many host dependencies in any form or shape, and that the application needs to be designed to work with the scheduler.) Sure; if you have a question, let's take it now, but if it's a comment, let's do it after the talk. I think that's an orthogonal question altogether. You're saying that because people didn't understand state very well and had state everywhere, and now people separate stateless and stateful services, that's why you can run cluster schedulers. I don't think I'd agree with that. If you want to run stateful services on a scheduler, you can, provided you have a scheduler which understands state: which understands the concept of a persistent volume, and understands that the lifecycle of a process and the lifecycle of the persistent volume are decoupled from each other. The fact that we struggle to run stateful services on cluster schedulers today is because the schedulers weren't designed for them. And as you pointed out, cluster schedulers have existed for the last 40 years, but they existed in the HPC world, the high performance computing world: Sun Grid Engine and so on. What catapulted the use of cluster schedulers is that more people are trying to do the kinds of things Yahoo and Google were doing over the last two decades.
And to do that, as I was saying, if you don't want to scale the number of people linearly as your business and your infrastructure grow, you have to use something like a cluster scheduler. The architecture of the services we run has, to your point, pushed us towards cluster schedulers, but I wouldn't 100% agree that it's only because people handle state and stateful services differently today. It's more that cluster schedulers are more approachable now, if that makes sense. But yes, let's talk about it more afterwards.

To provide all these things we need from dynamic service management systems, we have cluster schedulers today: Mesos, Kubernetes, Nomad, these are all good. If I didn't include the logo of your favorite scheduler, that's only because of how big the slide can be; insert your favorite scheduler here.

But schedulers are not the silver bullet we want them to be. Schedulers fail all the time. I was on call for a year and a half at Netflix, running the scheduler I wrote, and I got paged every other day, because something or other would fail. And it's not like we'd fail for the same reason tomorrow; we'd fix the reason we failed today, and some other problem would crop up in the infrastructure and cause failures. What remained constant is that we had to plan for failures. We had to assume our scheduler was going to fail.

To give you an example, tied to the point raised earlier about whether we're talking about the availability of the scheduler or the availability of the service: I'd say they're related. Say you have a video-on-demand service, and 7 p.m. is your peak time; people come home from work, they want to watch Black Mirror or whatever, and you can't have your service go down. Now here's a hypothetical failure, and actually we saw this happen. Your cluster scheduler depends on a data store like ZooKeeper, and for some reason ZooKeeper stops working, so your cluster scheduler has lost the state of your cluster. At 7 p.m., at peak traffic, you want to auto-scale; you want more API servers because you're seeing more API requests for starting new videos. And say your API servers use something like Tomcat, which is a threaded server. Since you can't auto-scale, because your cluster scheduler isn't working, all of a sudden more users are hitting the same fixed number of Tomcat API servers. Most of the threads become busy serving existing requests, you see high CPU usage, all your API servers become unresponsive, and now you have a failure. What started as a failure in your cluster scheduler has cascaded into your API, and once the failure cascades into the API, your service is pretty much down.
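As a back-of-envelope illustration of that cascade, here is a small sketch; all the numbers in it (20 servers, 200 threads, 50 ms service time, the 7 p.m. demand) are assumptions made up for the example, not figures from the talk.

```go
// Sketch: why a fixed fleet of threaded API servers saturates under a spike.
package main

import "fmt"

func main() {
	const (
		servers        = 20     // fixed: the scheduler is down, no autoscaling
		threadsPerSrv  = 200    // Tomcat worker threads per server
		serviceTimeSec = 0.05   // mean request service time (50 ms)
		peakRPS        = 120000 // assumed 7 p.m. demand
	)
	// Little's law: each thread sustains 1/serviceTime requests per second.
	capacity := float64(servers*threadsPerSrv) / serviceTimeSec
	fmt.Printf("capacity ~%.0f rps vs demand %d rps\n", capacity, peakRPS)
	if float64(peakRPS) > capacity {
		fmt.Println("threads saturate -> queues grow -> API unresponsive")
	}
}
```

With a healthy scheduler, the remediation is simply to add API servers until capacity exceeds demand; with the scheduler down, the deficit turns into the cascade described above.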
So the availability of the cluster scheduler, or of any service that another service depends on, is the key; understanding that relationship is the key.

I'm going to talk about failures in cluster schedulers, and their remediation strategies, at three levels: first, failures that happen at the node level, at the machine level; second, failures at the cluster level; and third, failures in the control plane, in the scheduler itself.

Node-level failures: capacity planning. Today, when you're running a cluster scheduler like Kubernetes or Mesos, you're not just running your applications in containers; you're probably also running some kind of logging agent, some kind of sidecar, and so on. Most people put quotas and resource limits on their containers, but they often don't realize there are system services as well. The system services have to be treated the same way the applications are treated: they have to be put under cgroups, under namespaces, so they don't exceed their resource usage. Because once they do, since the scheduler doesn't control them and isn't imposing resource constraints on them, they'll impact the quality of service of the actual services running on the node. So it's very important to measure the overhead of a sidecar, a logging agent, or even your Docker daemon.

If you're using a file system like ZFS, understand how the file system cache works. The reason this is important is that in cgroups v1 the mechanism for controlling block IO is very much broken; two containers can run on the same machine and there are only fairly bad ways to restrict their IOPS. So if you've limited the file system cache to, say, two GB and one container does a lot of IO, it's going to exhaust the whole ZFS ARC. I had an outage back in the day where I didn't limit the ARC size, and one fine day applications couldn't malloc anymore; when I looked at what was happening, ZFS was using half the memory of the machine, even though it was supposed to use no more than one or two gigs. That's because I didn't put a constraint there.

The last point: make your garbage-collection logic, things like log rotation, understand the units of the underlying resources. Rotating logs after seven days doesn't mean anything when the disk is a fixed size. Your unit at the disk layer is bytes, not days, so do log rotation in the same unit as the underlying resource. (There's a small sketch of this below.)
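Here is a minimal sketch of rotation keyed to bytes rather than days: prune the oldest log files until the directory fits a budget expressed in the disk's own unit. The directory path and the one GiB budget are assumptions for illustration.

```go
// Sketch: enforce a byte budget on a log directory, oldest files first.
package main

import (
	"os"
	"path/filepath"
	"sort"
)

func pruneLogs(dir string, budgetBytes int64) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	type logFile struct {
		path string
		size int64
		mod  int64
	}
	var files []logFile
	var total int64
	for _, e := range entries {
		info, err := e.Info()
		if err != nil || e.IsDir() {
			continue
		}
		files = append(files, logFile{filepath.Join(dir, e.Name()), info.Size(), info.ModTime().UnixNano()})
		total += info.Size()
	}
	// Oldest first, so the least recent logs go when we're over budget.
	sort.Slice(files, func(i, j int) bool { return files[i].mod < files[j].mod })
	for _, f := range files {
		if total <= budgetBytes {
			break
		}
		if err := os.Remove(f.path); err == nil {
			total -= f.size
		}
	}
	return nil
}

func main() {
	// e.g. cap these logs at 1 GiB regardless of how many days they span
	_ = pruneLogs("/var/log/myapp", 1<<30)
}
```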
Oops. The OOM killer. The OOM killer in Linux is fairly complex, and very few people understand how it works. Over the lifetime of a process, the kernel keeps a score, called the OOM score, and when the kernel is under memory pressure it starts killing processes, more or less randomly from your perspective. So if you have a sidecar using all the memory, and because of that there's an OOM situation, the kernel might kill a process that did not cause it. What I've seen, and what we've done in most of our cluster schedulers, is make sure we can kill processes from user space: when we kill processes in user space, using OOM notifications, we have control, and the cluster scheduler agent has a deterministic way of deciding which process to kill, rather than the kernel deciding. This is key if you're doing a lot of packing, if you have a lot of density on a single machine. And the last thing I wrote here: of course, put everything under a memory cgroup. (A sketch of the user-space notification mechanism follows.)
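Here is a sketch of that user-space mechanism, using the cgroup v1 memory controller's OOM notification interface (an eventfd registered through `cgroup.event_control`). It assumes cgroup v1 mounted at `/sys/fs/cgroup/memory`, and the job cgroup path is made up; on cgroup v2 you would watch `memory.events` instead.

```go
// Sketch: user-space OOM handling via cgroup v1 eventfd notifications.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	cg := "/sys/fs/cgroup/memory/myjob" // hypothetical job cgroup

	efd, err := unix.Eventfd(0, 0)
	if err != nil {
		panic(err)
	}
	oom, err := os.Open(cg + "/memory.oom_control") // keep open while watching
	if err != nil {
		panic(err)
	}
	// Register "<eventfd> <oom_control fd>": the kernel will signal the
	// eventfd whenever this cgroup hits an OOM condition.
	reg := fmt.Sprintf("%d %d", efd, oom.Fd())
	if err := os.WriteFile(cg+"/cgroup.event_control", []byte(reg), 0); err != nil {
		panic(err)
	}

	buf := make([]byte, 8)
	for {
		// Blocks until the kernel reports an OOM event for the cgroup.
		if _, err := unix.Read(efd, buf); err != nil {
			panic(err)
		}
		// The agent now decides deterministically which process to kill,
		// instead of leaving the choice to the kernel's OOM killer.
		fmt.Println("OOM event: agent picks the victim and kills it")
	}
}
```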
Now, this is not my favorite topic, and one people can't do much about: software like Docker is traditionally unreliable. Because of the pace at which the project moves, they add a lot of regressions. Six or seven months back I had an outage where Docker was leaking AUFS layers; who would plan for that? You don't plan for things like that until the outage happens. So we figured out that we have to plan for failures of the Docker daemon itself. It's very important to understand how the Docker daemon is doing, by watching metrics: how much CPU it's using, how much disk, the number of containers actually running. When something goes south, the remediation action can simply be that the scheduler drains that node, or drains the whole pool of nodes with that version of Docker.

And optimize for cluster-level efficiency. At the node level we talk about the number of CPU cores and the amount of memory available, but people often forget, especially in public cloud, that a lot of resources are bound by network IO, because things like block storage are actually on the network. Or maybe you fingerprint the network card and see a hundred-gig NIC, but the actual link might only be 10 Gb, or even 5 Gb. So expose all the properties of the network up to the scheduler as schedulable resources. Even if you have CPU cores to spare, don't oversubscribe the machine in a way that saturates the network, so that overall your applications perform well at the cluster level.

As I was saying, schedulers provide QoS, which sounds nice on paper, when people promise you'll get your 50 megahertz of CPU; but that's not the whole truth, because multi-tenancy in Linux is horribly broken today. Has anyone here programmed with SIMD, say AVX-512 instructions on Intel? These are vector instructions, and they need a lot of power. What Intel does is reduce the base frequency so that the vector instructions don't consume too much power: if your peak CPU performance is, say, three gigahertz, running vector instructions will drop you down to around two gigahertz. So say you're running OpenSSL, or a heavy crypto library like ChaCha, something that uses vector instructions, alongside normal non-vector work, something that takes JSON and writes it to a database, which is most of the workload today. All of a sudden your normal workload suffers because something else on the box is doing crypto with vector instructions. On Linux, unfortunately, multi-tenancy is not a solved problem. What you do in that situation is isolate workloads into clusters: tag your nodes or your clusters, and tell your scheduler to run the IO-sensitive applications on this cluster and the vector-instruction workloads on other clusters. That's one way of solving it.

Let's go to cluster-level failures. Most often, cluster-level failures happen because of bad software, period. And software doesn't only mean application software; it also means configuration. Go look at the outages of Amazon, Google, or any cloud provider: most of the recent ones are because of bad configuration, because someone pushed a bad configuration and the whole thing got impacted. So never release software to the entire population of nodes at once. Always do rolling upgrades, watch the metrics, and make sure there's a feedback control loop, so that the deployment system understands the metrics the scheduler is providing. If, after a new push, the scheduler sees a lot of crashes, it should stop deploying. We did something similar in Nomad recently, where Nomad uses the state from Consul to figure out whether a deployment is healthy or not.

System software failures: as I was saying, deeply scrutinize the system software running on your machines, and if something doesn't work well, just roll back to an older version. The most effective way I've done this in the past is with really long staging periods. Try to run a very stable version; if you have to update, make sure the new version actually runs in production on a small percentage of nodes for quite some time, watch the metrics, compare them, and only then let it get deployed widely.
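A minimal sketch of that feedback loop, with hypothetical functions: `crashRate` stands in for a query against your metrics system, and `deployBatch` stands in for pushing the new version to more nodes. The batch sizes, soak time, and threshold are made-up examples.

```go
// Sketch: rolling upgrade gated on a health signal after each batch.
package main

import (
	"fmt"
	"time"
)

// crashRate would read from your metrics system (e.g. crashes/min for the
// new version); deployBatch would push the new version to n more nodes.
func crashRate(version string) float64      { return 0.0 }
func deployBatch(version string, n int)     {}

func rollout(version string, total, batch int, maxCrashRate float64) error {
	for done := 0; done < total; done += batch {
		deployBatch(version, batch)
		time.Sleep(5 * time.Minute) // soak: let metrics accumulate
		if r := crashRate(version); r > maxCrashRate {
			return fmt.Errorf("halting rollout at %d/%d nodes: crash rate %.2f", done+batch, total, r)
		}
	}
	return nil
}

func main() {
	if err := rollout("v1.2.3", 1000, 50, 0.01); err != nil {
		fmt.Println(err) // stop deploying; operator or automation rolls back
	}
}
```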
For security vulnerabilities, you might be thinking: how can I run an older version of the software? Most responsible software engineers, when there's a security vulnerability in their code, backport the patches to older versions as well. So the staging time for system software can be much longer than for, say, an application.

Depletion of global resources: this basically means that if you're relying on something like AWS to bring up machines, and you have a cluster-level failure, say your application is performing badly, or you're having problems on certain Amazon machines, the remediation step could be that the cluster scheduler or the orchestration system brings up more nodes in other zones or other places. But remember that there are global limits: the number of API calls you can make, or even the number of nodes you can bring up in a given timeframe. By being too aggressive about remediation, you can deplete those global resources and dig yourself into an even bigger hole.

Control plane failures. I think this is the hardest part of the entire story, because it needs a lot of theoretical knowledge, a lot of knowledge about distributed systems. By control plane failures I mean: what happens when there's a failure in the scheduler itself? To debug or plan for failures in the control plane, you have to understand the underlying data store it's using, and its scheduling mechanism. Let's take some examples. Your cluster scheduler uses ZooKeeper, in the case of Mesos; in the case of Kubernetes it uses etcd; and in Nomad we use Raft, which etcd is also using under the hood. All of these are strongly consistent systems. The reason they're strongly consistent is that the cluster scheduler wants a consistent view of the cluster topology so it doesn't oversubscribe without knowing. It can oversubscribe if it decides to; it just shouldn't oversubscribe because it's unaware of the cluster state. Back to our earlier conversation about persistent volumes: if persistent volumes are a schedulable resource and you don't use a strongly consistent system, you might allow two different processes to claim the same volume. If you use something like an AP database for this, things don't work well. So most cluster schedulers use a CP data store; but in general we want our cluster schedulers to be highly available. That's the tension. From an operator's perspective, the operator never wants the cluster scheduler to go down; even if some parts of it go down, the operator wants as much as they can have from it. So what that means is we need to build schedulers in such a way that they can reconcile from data loss. What does that mean? Say you lose ZooKeeper, or etcd, or whatever your favorite scheduler uses.
Can your scheduler rebuild its state from the agents running in the cluster? In Mesos it's possible in part: Mesos has task reconciliation, where the framework can send a message and the Mesos agents respond with the tasks that are running. But there's no upper bound on when the cluster will finish reconciling. In Nomad we took a different approach: if you lose all the scheduler nodes, there's no way to get back to a healthy state without rerunning all the job files. Resubmit all your jobs, and the cluster state will reconcile. Maybe in the future we'll do things like backups. But this is a fairly complex topic; reconciling state from running nodes after data loss is genuinely hard, and it's something operators need to plan and prepare for.

The other thing is the scheduling mechanism itself. Say you have a requirement that you need to scale up your services within, say, 10 or 20 minutes at peak; when traffic hits the roof, you want capacity as fast as possible. Then you have to understand how the scheduler works: is it event-driven, or is it level-triggered? Level-triggered scheduling means the scheduler runs in a loop, looking for tasks that aren't running; it compares the goal state and the current state, and reconciles the cluster state. What that means is there's a minimum time for the scheduler to take any action: if you want to dispatch a job right now, you might have to wait until the scheduler loop runs again. And it can suffer head-of-line blocking. Someone might have submitted a bad batch job with 10,000 units of work, while my work is more important, just five shards of a service. If the scheduler works on everything together, it can spend its CPU cycles scheduling the batch workload instead of the service workload. In that case, event-driven schedulers perform much better, because the scheduler is invoked the moment a new event happens: a node going down, someone submitting a job, and so on. We did event-driven scheduling in Nomad. If you want to learn more about it, the Omega paper is good, and there's another paper, the Firmament paper, written by Malte, who also worked on Omega; he writes good stuff about how to do event-driven scheduling. Mesos is also sort of event-driven; I say sort of because some schedulers wait for a batch of events to show up, and then decide how they're going to schedule or take actions based on those events.
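Here is a sketch contrasting the two mechanisms, with hypothetical types; it is not any real scheduler's code. The level-triggered loop pays a latency cost of up to one sweep period, while the event-driven loop reacts as soon as an event arrives (and, as mentioned, real schedulers often batch a few events before evaluating).

```go
// Sketch: level-triggered vs event-driven reconciliation.
package main

import "time"

type State struct{ Desired, Running map[string]int }

func reconcile(s *State) {
	for job, want := range s.Desired {
		if s.Running[job] < want {
			// place want - Running[job] new instances of job
		}
	}
}

// Level-triggered: bounded staleness equal to the loop period.
func levelTriggered(s *State) {
	for range time.Tick(30 * time.Second) {
		reconcile(s)
	}
}

type Event struct{ Kind string } // "job-submitted", "node-down", ...

// Event-driven: wake only when something changes; in practice the scheduler
// evaluates just the jobs the event affects.
func eventDriven(s *State, events <-chan Event) {
	for range events {
		reconcile(s)
	}
}

func main() {
	s := &State{Desired: map[string]int{"api": 5}, Running: map[string]int{"api": 3}}
	events := make(chan Event, 2)
	events <- Event{Kind: "node-down"}
	events <- Event{Kind: "job-submitted"}
	close(events)
	eventDriven(s, events) // levelTriggered(s) would poll on a timer instead
}
```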
The second thing: implement quotas. Bad days are going to happen. Amazon is going to have an outage; your data center is going to have an outage; there might be power failures. At some stage, you are going to have a resource crunch, for sure. When you have a resource crunch and people are contending for resources, quotas are what save the day. With quotas you can make sure the business-critical services keep running, while services like MapReduce jobs, which can run much later, get backlogged. And in some cases quotas are also useful for the quality of service guarantees on the local agent, because based on them, the agent actually running the containers can determine what to prefer when there are problems on the local node itself. (A small admission-check sketch follows.)
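To make the quota idea concrete, here is a minimal admission-check sketch with hypothetical types: each namespace gets a fixed allocation, and a submission that would exceed it is rejected or backlogged instead of starving business-critical work.

```go
// Sketch: per-namespace quota admission check.
package main

import "fmt"

type Resources struct{ CPUMHz, MemMB int }

type Namespace struct {
	Quota Resources // ceiling this org/BU is entitled to
	Used  Resources // currently allocated
}

func (ns *Namespace) Admit(req Resources) error {
	if ns.Used.CPUMHz+req.CPUMHz > ns.Quota.CPUMHz ||
		ns.Used.MemMB+req.MemMB > ns.Quota.MemMB {
		return fmt.Errorf("quota exceeded: request %+v, used %+v, quota %+v", req, ns.Used, ns.Quota)
	}
	ns.Used.CPUMHz += req.CPUMHz
	ns.Used.MemMB += req.MemMB
	return nil
}

func main() {
	batch := &Namespace{Quota: Resources{CPUMHz: 10000, MemMB: 32768}}
	if err := batch.Admit(Resources{CPUMHz: 12000, MemMB: 1024}); err != nil {
		fmt.Println(err) // backlog the MapReduce job; critical services keep their share
	}
}
```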
So in the end, I want you to take home this message: plan to reboot your data center. No matter which scheduler you use, which deployment automation you use, prepare for the failure of that software itself, and be ready to know how the organization behaves when the cluster scheduler fails. If your cluster scheduler fails when you don't have to scale out, when there's nothing to deploy and your hardware isn't failing, nobody even notices what happened behind the scenes. But if your cluster scheduler fails at the same time your hardware fails, or during your peak time, that's when you need to understand how the organization behaves: who calls whom, what kind of commotion happens. At very large scale, we see very interesting things when schedulers fail: a lot of people ask the same questions, and instead of working on the actual problem in the cluster scheduler, we're often busy talking to service owners and informing them about what's happening. So make sure your incident management workflow, your incident management system, is capable of handling such large-scale catastrophic failures.

So that's pretty much it. And these are some of the papers I think are good and go deeper into the topics we discussed this evening.

(Asked about capacity planning:) Capacity planning is not something you do after something bad has happened; it has to be done from day one. Being very conservative is the key; be as conservative as you want when it comes to capacity planning. Everything that runs needs to be put under a lens. Don't consider any process running on the machine to be whitelisted, because things can go south from anywhere. Understand how things run at the node level from the very beginning, keep everything monitored, and put reasonable constraints in place: start with conservative constraints, and as you understand the software more and gain experience, tune them further down.

(Asked about running the scheduler itself in a container:) It depends whether we mean the scheduler control plane or the scheduler agent. I've seen people do that, but I would never run schedulers in a container. I would always treat the scheduler as system software that has the right amount of privileges. At the end of the day the turtles need to end somewhere, and the scheduler agent is at that level; I would not run it in a container. What happens when that fails? What happens when Docker has a bug? The most reliable thing at the end of the day is your kernel, your distro; everything else is stacked on top of it. If your scheduler has the same dependencies as your applications, then when there's a bug in your container software, your applications and your scheduler fail at the same time. The best thing I've seen during incidents is to have as few dependencies as possible between the things involved: if something fails, and the thing you use to deploy the fix fails along with it, that doesn't work well, in my opinion. (An audience member described their current approach.) Sure, that works if you have 100 servers. Now say you have 10,000 or 50,000 servers; at scale it's different. If you have 10 servers, yes, do that.

(Asked about bin packing:) It depends. Bin packing has been the primary technique for placing software, but I know for sure that bin packing is not the only answer. For example, I'm implementing a new ranker in Nomad right now, it's already running in production and I need to upstream it, which spreads applications horizontally rather than bin packing them. One of the problems you'll see: if you bin-pack workloads that use GPUs, you create a hotspot in part of your data center, and then the thermal effects are really bad, and some hardware deteriorates faster than the rest. What you want in that case is a uniform thermal outcome across the data center, everything consuming about the same power, so the cooling is more uniform; you can't do bin packing there. But there are cases where bin packing is better: jobs that don't have an SLA and can finish whenever they finish, just bin-pack those. There's a case for both. I've always done node pools: certain nodes where we bin-pack, and certain nodes where we spread, applying a spread algorithm or anti-affinity so things are spread evenly. (Can you choose both?) Yeah, absolutely, you should. And this is where the conversation should be when you choose a scheduler or write a new one: capabilities like these, which impact data center performance and not just application performance, have to be incorporated into the scheduler. Absolutely.
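Here is a sketch of the two ranking strategies, with hypothetical types and scoring functions: bin packing prefers the fullest feasible node, while spread prefers the emptiest, which is what gives you the uniform thermal footprint described above.

```go
// Sketch: bin-pack vs spread node ranking.
package main

import "fmt"

type Node struct {
	ID              string
	UsedMHz, CapMHz float64
}

func binpackScore(n Node) float64 { return n.UsedMHz / n.CapMHz }    // prefer full nodes
func spreadScore(n Node) float64  { return 1 - n.UsedMHz/n.CapMHz } // prefer empty nodes

func pick(nodes []Node, score func(Node) float64) Node {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if score(n) > score(best) {
			best = n
		}
	}
	return best
}

func main() {
	pool := []Node{{"n1", 7000, 8000}, {"n2", 1000, 8000}}
	fmt.Println("binpack picks:", pick(pool, binpackScore).ID) // n1: fill it up
	fmt.Println("spread picks:", pick(pool, spreadScore).ID)   // n2: even the load
}
```

In practice, as described above, you can run both via node pools: one pool ranked by bin packing for SLA-free batch work, another ranked by spread for latency-sensitive or thermally sensitive workloads.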
(Asked how to upgrade the scheduler itself:) I don't have a silver-bullet answer for that. In the past, as a scheduler developer, I would roll out one new node at a time; with schedulers, you usually do rolling deploys. Again, it depends whether we're talking about the scheduler control plane or the scheduler agent. The agent is easier. On the control plane, I've always deployed a single node and observed the metrics. In a lot of cases I had simulators: I'd give the scheduler a static view of a cluster and see how it performs. In some cases I'd run benchmarks against the scheduler and see how it performs, for example when I had a change in the ranking algorithm. There are no standard benchmarks, but there are publicly available traces; I think the Quincy paper points to one, so you can take a trace of a cluster and run your scheduler against it. Of course, you then have to make your scheduler able to consume the trace. (Audience discussion about replaying traces against a scheduler.) Yes, the scheduler needs to do that; but you can't do that in production. In production, the only way to test whether something is working is metrics.

We had written a simulator for Mesos, for our scheduler on top of Mesos, and I think someone started work on one for Nomad as well. We can write a simulator for Nomad pretty easily, because the scheduler is just a library. I don't contribute to Kubernetes, so I don't know what the state there is. Every scheduler team has done benchmarks. In 2016 we did a benchmark in Nomad: we ran a million containers in under five minutes, on 15,000 machines. And at the time, on Kubernetes, the most you could do was on the order of 100 or 1,000 machines. The scalability of Kubernetes is pretty low compared to other schedulers; that's because they didn't prioritize it, whereas in Nomad, from the get-go, we made sure our performance was at its peak. I know people who run 100,000 containers on a daily basis using Nomad, some finance and investment banking companies. Mesos has similar benchmarks too.

(Asked about running Marathon on top of another scheduler:) A lot of people say you get the benefits of the new scheduler if you run it on top of an existing one. I would not do that. If you want a fair discussion based purely on merits and demerits, ask: what happens when Marathon goes down? How does Marathon reconcile state? (Audience: they don't want to share those nodes with another team that isn't paying for them.) Sure. Yes, that's a shortcoming of Marathon. Exactly. (Audience: so what's the right way to deal with these kinds of enterprise issues, where somebody says "these nodes belong to me"?)
(Audience, continuing:) Previously, you would allocate physical servers and say: here, I give your developers access, do whatever you want. But now finance wants to come and say these nodes belong to us, and you keep having to ask whose server it is. (Speaker:) Yeah, absolutely. I can tell you what we did in the past, and we did exactly this: we implemented quotas. With quotas, we said that every namespace, every organization, gets a fixed amount of quota, and within that quota you also have your priorities. Say you're a BU and you're paying for 50,000 cores: the scheduler understands that quota is present, and it will allocate only 50,000 cores to you. If the scheduler doesn't have the notion of a quota, that's when the problems you're describing happen. If it were up to me, I would not use a scheduler that doesn't understand quotas.

(Asked whether Nomad is, from that perspective, a better choice for multi-tenancy:) We recently implemented namespaces, and with namespaces we implemented quotas. I'll let you answer that question for yourself, but to my knowledge, and I did Mesos for a long time, Mesos has the concept of a quota and the concept of a role. In the Mesos master, you can basically say that the framework registered for a given role gets this many resources. So one option for you, and I haven't kept up with how Mesos has evolved over the years, is to run different frameworks: you can configure the framework name, create a role for that framework, and say this framework with this role gets only this much quota. Mesos was built for multiple frameworks, so run each Marathon as its own framework, with roles defined, if that's possible.

(Audience:) It just seems like something bolted on rather than something it should natively be doing. The architecture doesn't seem meant for this; we've just figured out one way of doing it, not necessarily the best way. Having high availability and 100,000 containers is all fine, but if it lands things on somebody else's servers, it's no fun. (Speaker:) Check out the concept of roles in Mesos, in the Mesos master; you should be able to configure Marathon to use them, I think. Marathon has a lot of other problems; I won't go there.

(Audience: is that because of CFS?) Yeah, I have no comment there. (Audience comment about cluster orchestration platforms being run without ever implementing scaling, just placing things wherever.) Right. Any other questions? By the way, did I answer your previous question, or do we need more? (Audience: that's a longer debate, there are more dependencies.) Sure. If you're asking how much time it takes to recover from failure at the cluster scheduler level: the way we've done it in the past is always with chaos engineering. You use things like Chaos Monkey and other tools, and you shoot the scheduler.
And you see how long it takes to even bring a scheduler back up, and you watch things like the queue length: the number of jobs that have been submitted but haven't been scheduled yet. These are metrics that have to be observed very closely if you're a scheduler developer or a scheduler operator. In Nomad, because it's event-driven, we even let people decide how many batch schedulers and how many service schedulers they want to run; if you have two batch schedulers running and someone submits a lot of batch jobs, your batch scheduler is obviously going to lag behind. We also expose timing metrics: with this many nodes, how much time did the scheduler spend making one allocation? So the scheduler software itself needs to be monitored; understand how the scheduler is performing and then provision resources for the schedulers accordingly. Observability is, again, the key. I'm pretty sure there are more methodological approaches, but I've always used metrics and observability to figure out the amount of resources that has to be provisioned.

(Audience:) There are lots of stateful services, databases and so on. Mostly, in the case of DC/OS, it's the scheduler vendor who provides the framework for running HDFS or Kafka, the Universe packages and so on. The open source software vendors themselves don't seem to be providing that kind of support. From that perspective, which scheduler is popular among the project owners themselves, one they'd want to support? I just don't see anybody out there doing that. (Speaker:) This is a very political question. The answer involves a lot of politics behind the scenes, a lot of handshakes between corporations and people; there's no purely logical reason why vendors do what they do. The silliest example I can give: a package that ships Mongo without authentication, and then 10gen presumably does the supported one; and they say if you want something more, click here, and you get a referral to DataStax for Cassandra and so on. I can talk about it, but let's do it offline; it's about more than just merits and demerits.

(Audience: which one do you think is most likely to be favored by open source projects?) Because of the various dynamics going on right now: Kubernetes. Because of the dynamics of everything happening around it; and I don't contribute to it, so I'm impartial here.

(On smarter placement:) There's a paper I ran out of time to add to the slide; search for Heracles, a Google paper. If you go to Google's research publications, under the distributed systems section, you'll see a paper where they use deep learning to figure out which workloads to run alongside which, and they're saving a lot of power and improving performance.
This is the future. No one in open source is doing it yet, but it's obviously going to come to open source sometime soon. The reason I think no one has done it yet is the same reason Kubernetes is so popular: most people don't need that much scale. If people needed that much scale, they'd be talking about the stuff I just talked about for the last half hour. Instead the conversation is "what is my friend using", "what is company X using", and not for bad reasons; they just don't need a lot of these things. You start needing them when power is your constraint, when you can't add machines because the government won't give you more power. Otherwise, why would you care so much about improving performance per unit of power? If you're constrained by power, constrained by all the other resources, that's when you do things like deep learning for placement, and only a very few companies are doing it. People are using deep learning across the board now to improve performance, but not at 10-node scale; there's no use for it at that scale.

(On sidecars:) I think this is the thing: all the sidecars and all this stuff are failures waiting to happen. You may need them, sure, but at what scale do you need them, and why? For me, if your goal is to keep the business running, if you want to sleep well at night, keep your dependencies to the minimum, and make sure you know how each one recovers. If you run 10 different things that fail in 10 different ways, you're just increasing the permutations of the number of ways you're going to fail. (Audience: and everything is mounted on NFS, and they say "oh, my NFS will never fail", and then it happens.) Right. (Audience: the point is, most people seem to think that using a scheduler suddenly solves their infrastructure problems; unless the underlying infrastructure itself is highly available, it doesn't matter what you do, yet they treat the entire system as highly available.) I'd make a small correction there: the underlying infrastructure is never going to be highly available; the cluster scheduler has to understand the topology and the failures that are happening. (Audience: but the people operating the cluster are very different from the site ops people; they don't know that, and they seem to think everything is magically working at the back end.) See, that knowledge the data center people have needs to be exposed to the scheduler, as failure domains, as labels. (Audience: the worst cases aren't about our understanding of how infrastructure needs to be managed; we're the rack-and-stack guys. At least here in India, I don't know how it is in the West where public cloud adoption is higher, when things go bad you can't move, and machines turn out to be sharing the same switch and so on. We're a long way out.)
Yeah, and I think this is where educating people about these first principles is important. Cool, thanks everyone.