Hello everybody. My name is Mike Sivak and I work on the oVirt project. But today I will not be talking specifically about oVirt; I'll be comparing three schedulers, at least a little bit, though not in deep detail. You will learn how a scheduler selects a host, what actually happens when you press the "start VM" button, and I'll show you a bit of the state of the art in this area. Since I have a pretty long presentation, I'm going to move through it quickly.

First, before we start talking about how a scheduler works, we need to figure out what problem the scheduler is supposed to solve. What is the goal? The primary goal is easy: somebody wants to start some load, whether it's a container or a VM, and you need to find a place for it. But that's only one of the goals. What other things might you want to solve? For example, you might want to find a place where the VM will keep going even if something happens. Or if you have containers running inside VMs, you need the containers to keep running even if one of the VMs dies, so you might not want two VMs carrying the same container to end up on the same host. You might also want to make sure that your VM gets the highest performance possible. Or, on the contrary, you might want to optimize for power consumption. So there are many different aspects of scheduling; starting a VM is just one of them, and all three schedulers have specific features that help you with this.

Before we go through the specific scenarios, let's think about what you need to consider when you try to write a scheduler, because all three projects basically have to solve the same set of issues, or answer the same set of questions. First, the size of your cluster. It really is different whether you're scheduling a VM into a cluster of 50 or 200 hosts, or a cluster of 10,000; the algorithms are not going to be the same. In the 10,000-plus scenario you might not care about each individual host; when you only have 10 of them, you definitely will. You also need to decide whether you're going to use probabilistic or deterministic algorithms. Probabilistic algorithms are really nice, and AI is all the rage everywhere; everybody loves artificial intelligence and optimization. But if you want to support your customers, that's not the best approach to take. If somebody sends you an issue report saying "I tried to start this VM and it just wouldn't go to the host I wanted it to go to", you need to somehow reproduce it, and if all the logic is hidden in a neural net, you can't. So all three schedulers use deterministic algorithms, meaning there is almost no random function anywhere. If a report like that comes our way, we just simulate the scenario and see exactly what the scheduler is going to do and why it behaves that way. The other thing you need to answer is whether you are going to support migrations, or maybe load balancing; that also changes the algorithm slightly. And one of the big questions is whether you are going to support homogeneous clusters or heterogeneous clusters.
All of this looks very simple when you only have homogeneous clusters, because all the hosts have the same CPUs and the same amount of memory. But with heterogeneous clusters, suddenly comparing two hosts that have very different amounts of RAM and sit in the same cluster becomes much more complicated. Another big question you have to answer is how you are going to treat your VMs. Is your VM a pet, where you monitor it, make sure it still runs and do everything possible to keep it running? Or is it just one cow out of your 10,000 cows, and if you lose one you don't really care, you just start another one? That's another big question.

Now, a scheduler is basically a function; it's the black box on the slide, and this is how it fits into the environment. All three schedulers work with the same structure. Somebody wants to start a load, in this case a VM, but it can be a container, so you have some configuration; that's one input to the function. Then you have your set of nodes, and all three projects have an agent on each node. The agent is the part that does the actual work: you tell it to start a VM and it does everything necessary to start the VM or the container, and it doesn't matter whether it's called VDSM, kubelet, or whatever. All the nodes, through their agents, report their status, their available resources and whatever is running there; that's the third input to the function. So we have the black box, and it really is a function: you have three inputs and it generates an output that tells you where you should start the load.

Now, the three schedulers I'll be talking about: the scheduler we use in oVirt, the scheduler used in OpenStack, which is part of the Nova compute project, and the scheduler that's part of Kubernetes. There are some differences in their goals and in the scale they aim for: oVirt and OpenStack target smaller deployments, while Kubernetes is planned to support 5,000 nodes, maybe more over time; but the scale actually doesn't change the approach too much, it's still pretty simple. You can see all three are written in completely different languages, Java, Python and Go, so you would expect them to be very different. The workloads differ as well: oVirt and OpenStack schedule VMs, Kubernetes schedules containers. And when you look at the feature lists, there are lots of differences between these schedulers. But let's talk about how they actually do things, and we'll see that even though it looks different on paper, all three basically reach almost the same solutions.
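To make that black-box function concrete, here is a minimal sketch of the idea in Python. The Workload and Node shapes and the trivial first-fit choice are my own illustration, not code from oVirt, OpenStack or Kubernetes; the interesting parts, filtering and scoring, come later in the talk.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Workload:            # the VM/pod definition the user sends in
    name: str
    memory_mb: int
    cpus: int

@dataclass
class Node:                # what the per-node agent reports back
    name: str
    free_memory_mb: int
    free_cpus: int

def schedule(workload: Workload, nodes: list[Node]) -> Optional[str]:
    """Pure function: load definition + node reports in, destination out."""
    for node in nodes:
        if (node.free_memory_mb >= workload.memory_mb
                and node.free_cpus >= workload.cpus):
            return node.name   # first node that fits (placeholder policy)
    return None                # nowhere to run it
```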
First, before we get to the actual algorithm, let's talk about resources. That's an important part of the scheduler: it needs to know how much free resource it has on each node. So let's say we have oVirt. oVirt is data center virtualization. As I said, we have the management side with the scheduler, and we have the agent; the agent is on the right side of the slide. Currently we have a host with a single VM running that occupies some of the resources on that host. And we have a request, that's the arrow that just appeared on the left side, asking: please start this VM.

What the oVirt scheduler does is compute everything against the resource reports coming from the agent. It computes internally where the VM is supposed to go, and it remembers that: if another request comes in, it knows there is already a VM that is supposed to start there. It's not running yet, but space has been reserved for it; that's the small box that appears on the management side. Remember that, because it's actually one of the differences in how the three schedulers work. So we record that the VM is starting and tell the agent: please start this VM. The agent then starts preparing it, and that takes some time; starting a VM is not instantaneous, it can take seconds if you have to prepare storage, mount NFS or something, or wait for the network to respond. If another request comes in during that time, we still remember there is a VM starting even though it's not visible on the host yet, so we have no trouble with double-booking the resources. Then, when the VM actually starts, we remove the pending record, because we are already getting all the stats from the agent: the agent sees the second VM and tells us, hey, there is a second VM running and this is the memory it consumes. We know the actual consumption, so we don't need the pending record anymore. So that's how oVirt works.

Now let's look at Kubernetes, Kubernetes behind the scenes. How many of you are actually using Kubernetes? And how many of you have studied the internal architecture, how Kubernetes is orchestrated internally? A couple of people. So, Kubernetes internally has a distributed database; that's the single source of information for everything. You have etcd, and on top of it there's the API server, which has everything; it's the API endpoint. When you are trying to start a pod, that's a group of containers, the workflow is: you send a document describing the pod to the API server. That's why the arrow is now connected to the API server, and it creates a document, the light blue one, with the resource definitions in it. So there is a document saying "I want this pod to be run", but the API server itself doesn't do anything; it's really just an information store. What happens is that the scheduler notices: hey, there's a pod, it doesn't know where to run, and I have all the information, so let's compute the right destination. Here the data isn't being pushed to the scheduler; the scheduler is actually pulling the data. That's a simplification, but the scheduler is basically pulling. It sees a pod that doesn't know where to run, takes all the information about the free resources, computes the proper destination and writes it back to the API server. It doesn't start anything; it just writes the destination back into the pod's document. And then the agent, the kubelet, notices: huh, there's a pod and it's telling me I should be the one running it, so let's do it, and it starts the pod. In this case we have no separate pending record, because everything is part of the database: pending here just means there is a pod that doesn't know where to run yet but already specifies all the resources it needs. So we don't track the intermediate state ourselves, because we have it in the database.
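Going back to the oVirt-style pending record for a second, here is a hedged sketch of that bookkeeping pattern; the class and its methods are mine, not oVirt's actual code.

```python
class ResourceTracker:
    """Tracks reported free memory plus 'pending' reservations per host."""

    def __init__(self, reported_free_mb: dict):
        self.reported_free_mb = reported_free_mb             # last agent report
        self.pending_mb = {h: 0 for h in reported_free_mb}   # reserved, not yet visible

    def free(self, host: str) -> int:
        # what the scheduler may still hand out = report minus reservations
        return self.reported_free_mb[host] - self.pending_mb[host]

    def reserve(self, host: str, mb: int) -> None:
        # a VM was told to start here but the agent does not see it yet
        self.pending_mb[host] += mb

    def on_agent_report(self, host: str, free_mb: int, started_pending_mb: int) -> None:
        # once the agent reports the VM, its memory shows up in the report,
        # so the matching pending reservation can be dropped
        self.reported_free_mb[host] = free_mb
        self.pending_mb[host] -= started_pending_mb
```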
OpenStack does it slightly differently again. In OpenStack before Newton, it was basically very similar to oVirt: the scheduler was asked to schedule a VM and it sent the command to the chosen destination. But they didn't have pending records, and they had issues with double-booking resources because the operation wasn't atomic. OpenStack supports multiple scheduler instances, and when two schedulers were running at the same time they were stepping on each other's toes: the host would eventually reject the VM because the resources were already taken, so the scheduler had to retry, and that was expensive. So post-Newton they introduced a new service, the Placement service, and that's the single point of information about all the resources. Again, a request comes to the scheduler asking for a VM to start. This time the scheduler first gets all the free resources from the Placement service; it asks for the data, then computes the result, but before it starts the VM, it sends a reservation request, if you want to call it that, to the Placement service. And that step is atomic: if another scheduler already claimed the resources, the Placement service will reject the request; if it succeeds, it means the VM can safely start. So the scheduler sends the claim, the Placement service says OK, that's fine, I'm recording that, and then the request goes to the agent and the agent starts the VM.

So all three projects do it differently, but the end goal is the same: you are trying to avoid double-booking resources. You're just collecting the data in a different way and reacting to it in a slightly different way. The difference is actually pretty small if you think about it: one project has a separate Placement service so it can support multiple schedulers, one has a central database, and one keeps the pending records inside the scheduler itself; but that's really a technical detail.

So now we have the resource tracking; the scheduler knows how much free resource it has on its nodes. What's the algorithm, where's the magic? It turns out all three projects follow exactly the same algorithm. It's really the same: even though they use different languages and support different numbers of nodes, the algorithm is the same. If you remember MapReduce: five or six years ago it was all the rage and everybody was using it. How many of you have heard about MapReduce? Yeah, almost everybody. We don't talk about MapReduce that much anymore, but it's a really good, well-understood algorithm, we've been doing things like that for years, and scheduling is exactly the same thing, except we have a filter step before it.

So everybody follows the same simple three steps. First, we filter out nodes that are not able to run the VM or the pod at all: they don't have enough memory, they miss some hardware, whatever. Then we map the remaining nodes to some number, and based on that number we select the best node. One more simplification that holds for all three schedulers: they basically work serially. When you ask for 10 VMs to start, they schedule the first one, then the second one, then the third one. There are cases when it's actually parallel, but all three basically do it like this, serially. So how does it actually work? Let's start with the filter step.
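As a rough sketch of that filter, map, reduce flow, reusing the Workload and Node shapes from the earlier sketch: the example policy functions below are placeholders, not the real policy units of oVirt, OpenStack or Kubernetes.

```python
from typing import Any, Callable

Filter = Callable[[Any, Any], bool]    # (node, workload) -> can it run here at all?
Scorer = Callable[[Any, Any], float]   # (node, workload) -> higher is better

def schedule(workload, nodes, filters: list, scorers: list):
    # 1. filter: drop nodes that cannot run the load at all
    feasible = [n for n in nodes if all(f(n, workload) for f in filters)]
    if not feasible:
        return None
    # 2. map: turn every remaining node into a number
    scored = [(sum(s(n, workload) for s in scorers), n) for n in feasible]
    # 3. reduce: select the node with the best number
    return max(scored, key=lambda pair: pair[0])[1]

# example filter and scorer
def enough_memory(node, workload) -> bool:
    return node.free_memory_mb >= workload.memory_mb

def most_free_memory(node, workload) -> float:
    return float(node.free_memory_mb)
```

The filters and scorers discussed next would simply be plugged into those two lists.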
Each project calls these filters something slightly different, but it's the same thing. For our purposes a filter is a function: it takes a node you have at the moment and the load you're trying to start, and it returns a boolean; it tells you yes or no, this load can start here. The usual checks are CPU compatibility and free RAM, those are the obvious ones, and then network presence or storage: when you're starting a VM, if the host can't reach the VM's disk, you won't be able to do anything.

But there are some filters that are not that obvious. All three projects support them, with some small differences, but the features are pretty common. One of them is affinity. All three support pinning a VM or a pod to hosts, so you can say: this VM or this pod only runs on this subset of hosts. Maybe because the host supports something you want to use, and it's not mandatory for the VM, but if it's there you can use it; or because the host is more powerful, or closer to some other host. For example, if you have MySQL and a web server, you probably want the web server and the database to run close together to get really good latency. So you can use host affinity, or inter-pod affinity. You can specify pretty complex rules, especially in Kubernetes combined with labels, which I put in bold because labels are a really important topic in Kubernetes: you basically have a fully expressive language where you can state complicated rules as logical formulas over labels, and that declares whether a host is good for the pod or not.

The other important topic, related to what I just said, is load isolation. OpenStack, for example, has a special filter that allows you to say: this customer's VMs, if you run shared hosting for example, are never going to be mixed with VMs from another customer. Or: these VMs are processing payment data, so they will never run on a host that also has a web server, just because of Meltdown, for example. That's an attack that appeared recently, and this is exactly the kind of isolation you would use in a virtualized environment, so you never mix VMs from safe and unsafe areas. Again, you would use affinity or whatever isolation logic the project supports.
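Before we move on to scoring, here is a small sketch of what such affinity and isolation filters could look like as boolean policy functions. The node.labels, node.running and workload.tenant attributes are hypothetical, introduced only for the illustration.

```python
def pinned_to_hosts(allowed: set):
    # "this VM/pod only runs on this subset of hosts"
    def check(node, workload) -> bool:
        return node.name in allowed
    return check

def requires_labels(required: dict):
    # host affinity: every required label must be present with the right value
    def check(node, workload) -> bool:
        return all(node.labels.get(key) == value for key, value in required.items())
    return check

def same_tenant_only():
    # load isolation: never share a host with another tenant's workloads
    def check(node, workload) -> bool:
        return all(other.tenant == workload.tenant for other in node.running)
    return check
```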
Scoring is a bit more complicated, but first, yes, a question? The question was whether there is specific support for negative affinity. I'm pretty sure there is for Kubernetes, and for oVirt as well; I don't remember off the top of my head whether OpenStack has it, I think it does, but I'm not 100% sure. So yes, you can specify a rule that says this VM will never run on this host, or these two VMs will never run together. That's important for the case I mentioned at the beginning: when you have a virtualization cluster and you're running containers inside VMs, and you have a container that's highly available with multiple copies, but those copies live in VMs that sit on the same physical node, it's not really highly available. So you can use negative affinity between those VMs to put them on separate physical nodes, and that way you keep the high availability. So yes, you can use negative affinity as well. What we don't have is an inverted pinning, "run anywhere except this node", but we do have negative affinity.

Now back to the story. If you have just one metric, for example free memory, scoring might look like an easy job: you just pick the host that has the most free memory. But what if you want the load spread over the nodes evenly? And, kind of an oVirt specialty, we actually support multiple different policies, and one of them is power saving: you might want to pack as many VMs as possible onto as few hosts as possible, just to save power. During the night all the employees go home, some VMs are still running, but let's put them together and shut down all the other hosts; then in the morning we switch back to the evenly balanced policy, so before people come to work, some of the other hosts wake up again and the VMs spread out. Do you know how the expenses of a data center are split? Usually about half of the energy budget of a data center goes to cooling, so if your VMs are packed together and the other hosts are powered off, you are saving a lot of money. Seriously, a lot of money. So you can have power saving, and in that case the scoring becomes different.

But now we don't have just memory; let's say we have memory and CPU. Which one is more important? You can't really tell; it's the customer's decision, and the scheduler has to deal with that. What I also wanted to point out here is that 100% of one CPU is not 100% of another CPU. Think about it: in a big deployment your machines use different CPUs. Say you have a pretty old processor and an i7 in the same data center: 10% of CPU means totally different performance on each, so you can't really tell which host gives you the highest performance. You could use GHz or time slices, but everybody uses percent, even though technically for CPU it's not as obvious as it looks. For memory you could use percent too, but VMs usually ask for fixed amounts; you don't specify the memory of a VM as 10% of the host, which means it doesn't really make sense to use percent for memory tracking, you use absolute numbers there. And then label presence: if you have a soft affinity, the load should preferably go to a host with this label but it doesn't have to, then the label score is basically a 0 or a 1. So it's actually not that easy; you have to think about how to represent the numbers.
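As a tiny illustration of how the same metric can drive two opposite policies, here is a sketch; this is mine, not oVirt's actual cluster policies.

```python
def even_distribution(node, workload) -> float:
    # spread: prefer the host with the most free memory
    return float(node.free_memory_mb)

def power_saving(node, workload) -> float:
    # pack: prefer the fullest host that still passed the filters,
    # so the remaining hosts can stay empty and be powered down
    return -float(node.free_memory_mb)
```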
So now, back to selecting the destination, deciding which node is the best one. I already touched on it: basically you have to let the user specify what is most important, CPU or memory or the other way around. oVirt has a UI for this, so you can change it dynamically; in OpenStack and Kubernetes you have to change a configuration file, but it's basically the same thing: you give each scoring function a weight multiplier. Still, you end up with some numbers and now you have to actually decide which node is the best one. But how do you sum up 10% of CPU load plus 4 gigabytes of free RAM plus a boolean for a label? You need some normalization to make them comparable.

The normalization algorithm is actually different in the three projects; each project decided to use a different function. They do it in roughly the same spirit, it's just that they use different maximums and different functions. For example, oVirt uses ranking: we sort all the results from, say, the CPU function, and a node's score is based on how many nodes are worse than it. OpenStack uses a dynamic maximum: you take the hosts that survived filtering and got to scoring, and you take the maximum amount of free memory among them. So if you have hosts with 4, 2 and 1 gigabytes free, the maximum is 4 gigabytes, and everything is scaled against that maximum; effectively a percentage, normalized to a 0-to-1 range. Kubernetes is a bit special: it uses the maximum of the single node, so every node has a different scale; when normalizing free memory it uses how much memory there is on that particular node. So the score is basically 50% of the available memory on one node and 25% on another node, and if the amount of physical memory on those nodes differs, I'm not entirely sure the results are truly comparable, but that's basically how it works. Each approach has its quirks; the oVirt ranking, for example, compresses the differences: whether there is a huge gap or a tiny gap between two neighbouring hosts, after ranking the difference between them is the same. But it still works pretty well.
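Here is my reading of those three normalization styles as a sketch; it's an illustration of the idea, not the projects' actual code.

```python
def rank_normalize(scores: dict) -> dict:
    # oVirt-style ranking: a node's score is how many nodes are worse than it
    ordered = sorted(scores, key=scores.get)
    return {node: float(rank) for rank, node in enumerate(ordered)}

def global_max_normalize(scores: dict) -> dict:
    # OpenStack-style: scale by the maximum among the surviving hosts -> 0..1
    top = max(scores.values()) or 1.0
    return {node: value / top for node, value in scores.items()}

def per_node_max_normalize(free_mb: dict, total_mb: dict) -> dict:
    # Kubernetes-style: each node is scaled by its *own* capacity, so the same
    # absolute amount of free memory scores differently on differently sized nodes
    return {node: free_mb[node] / total_mb[node] for node in free_mb}
```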
Another topic I mentioned is balancing. There are multiple different methods, and here the projects really do differ. oVirt supports live migration, and we do whatever we can while the VMs are running: if a host gets overloaded, we migrate VMs away from it automatically. OpenStack supports live migration only when the admin pushes the button; it's not an automatic action, it won't migrate things for you. And Kubernetes doesn't migrate containers at all right now; there is no live migration, except that the logic can decide to start another copy of the container and then maybe shut down the old one, so you can basically get an offline migration: start a second container and kill the first one. As I said, this is mainly for situations that appear during runtime. A host can become overloaded over time: your employees are doing basically nothing, and then one day somebody starts a 3D modeling package, some rendering, and the host goes to 100% CPU. If you're sharing and overcommitting resources, suddenly everything is overloaded. oVirt will try to redistribute the load; in OpenStack, if something like that happens, you have to solve it somehow yourself, and I'll show you another way of doing that in a moment.

Let's look at balancing for a second. Basically you need to find a candidate, the VM you want to move, then find a destination for it and perform the action, and doing that properly is the hard part. In oVirt we have a very simple approach: we have thresholds, which give us overloaded and underloaded hosts. For even distribution we just keep moving VMs from the overloaded hosts to the underloaded ones, so eventually all hosts end up in the middle group. For power saving it's the same thing, except the underloaded hosts we can actually evacuate and shut down: we move all the VMs from them into the middle group, and overloaded hosts are handled the same way, we move VMs from them to the middle group. So we are not doing any artificial intelligence there; there is no magic right now.

For Kubernetes there is something called preemption, and it's pretty new. The scheduler can decide that something with a higher priority, something more important, is about to start and there is no place for it. They use what's called a disruption budget: the pod that needs a place can ask the others, can you release one of your copies so I can use the space for something more important? And it's not even asking that nicely: each pod basically specifies how willing it is to give up its space for the sake of something else. That's a different approach, because containers are slightly different from VMs. Amazon Web Services spot instances are the same idea, except you don't have a score, you have money: you say how much your load is worth to you, and if the current spot price goes higher, Amazon just kills your VM and starts the load that brings in more money. Same principle, and it's pretty efficient.

Now, some other highlights, stuff I didn't talk about yet. oVirt has a component that uses techniques from constraint optimization, the oVirt Optimizer. It's a separate tool, because the computation is expensive: it works out how the cluster could be rearranged, but it never executes anything; it just tells the admin: hey, if you want better performance, please move this VM over there and the cluster will be nicely balanced; or: you can't start this VM right away, but if you first move VM A to host B, then suddenly you have a free spot. So we have a tool like that, but it's a separate, optional tool. OpenStack has the chance scheduler, where you can just say: what the hell, I'll try whatever host comes up; if it works, I'll use it, and if it doesn't, I'll retry. oVirt lets you express fairly arbitrary affinity logic, I already talked about that: you can write complicated formulas to specify where a VM can or cannot start.

Then there is device passthrough; that's a separate topic and it's complicated. This part alone took more than an hour when I rehearsed the talk, so I had to cut it, basically. You can pass through devices, graphics cards, SR-IOV and all that stuff. In oVirt it's tied to the specific host; in OpenStack the system prepares a list of labels in advance and basically says: this device on host A and this device on host B provide the same function, and then you just ask for a function and the scheduler knows which hosts have the necessary capability. And on the Kubernetes side I really like the reactive and declarative way of doing things: there's a single database, everything happens there, everybody reaches in and somehow modifies it. That's really cool, it's really great.
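The capability-label idea can be sketched roughly like this; the host names, capability sets and the helper are made up for illustration and are not the actual oVirt, OpenStack or Kubernetes mechanisms.

```python
# the system prepares the capability labels for each host in advance
host_capabilities = {
    "host-a": {"gpu-passthrough", "sriov"},
    "host-b": {"sriov"},
    "host-c": set(),
}

def provides(function: str):
    # the VM/pod asks for a function, not for a concrete device;
    # the filter keeps only hosts whose devices expose that function
    def check(node, workload) -> bool:
        return function in host_capabilities.get(node.name, set())
    return check

gpu_filter = provides("gpu-passthrough")
```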
So let's get back to the good ideas, because we are running out of time, unfortunately. Labels: the way labels are used is just awesome, you can do almost anything with them. The other projects have different mechanisms for each of these features, but basically all these filters boil down to the same thing once you realize there are labels: the system prepares some rules, labels the hosts accordingly, and it's done; the scheduler just does whatever it needs to do and it's all there. Then, the normalization methods are really important, because they control how the cluster behaves and which host ends up being the best one. I had a very nice comparison of all three, with nice tables and numbers; I had to remove it, but if you want to talk about that, come find me afterwards. And the most important part: atomic resource tracking and reservation. You really can't have a scheduler without that. Then there are the extras, migrations, balancing, preemption, but those depend on what you actually need for your use case.

So, really shortly: there is a lot of idea sharing possible across these schedulers. Each has some really nice points, none of them is the best one, and we can really work together and share what we've learned over the years of using them; the differences are pretty small. Thanks, I think I'm out of time, so that was it. If you have any questions, I'll be around.

So the question is whether we could put everything into a single scheduler. Well, definitely, why not? There are some differences, as I said; Kubernetes is sized for those 5,000 nodes, so for example what we do with pending reservations they do in a much simpler way, and the normalization is different too, they basically have almost no normalization. So we could, but right now we're not exactly in the same place.