Test, yeah, okay. Good morning everybody. My name is Martin Sivak and I work for Red Hat on the oVirt project. I've been on the project for about three years now, and today I'll show you one of our side projects that deals with smart VM scheduling. It's a kind of evolution of what we do when we place VMs. So let's get started. First I'll give you an introduction to how we do scheduling now, and how most other projects, including standard oVirt, do scheduling today. Then I'll show you what our goals were when we were thinking about doing this, and some theory, because there is some computer science theory behind it. And then I'll describe the optimization service and show you some hidden endpoints, which you will understand later. So what is scheduling used for in basically any distributed system? When you have virtual machines and you have physical hosts, you need to select the host that will get the actual virtual machine when you want to start it. You have 50 hosts with different load and different capabilities, you have a VM that has certain requirements, and you need to figure out where to start the VM. That's the first task. The second task: if you support migrations and you click "migrate VM", you need to be able to tell where the VM should be migrated; it's very close to the first task. And the third task is a bit different, because when you have a host that's suddenly too loaded and has too many VMs, you need to select first the VM that will be migrated away, and then the destination. So how do we do it today?
So today we have what many people would call a simplified map-reduce, or filter, approach. This is how it looks. First we get the list of all the available hosts in the cluster and the VM we are trying to schedule. Then we remove all the hosts that are missing some capabilities: they don't have enough memory, some network card is missing, not enough disk space, whatever, and we remove all of those. You can see it here, where's the... oh, sorry, my laser pointer. This is the process: basically we are filtering, and only two hosts remain. And then suddenly we have two hosts with the right set of capabilities, but we need to figure out where the VM will actually go. So we have another step, scoring, or weighting as we call it, where we have multiple functions that compute some kind of score. All those scores are summed up, each multiplied by a weight. I mean, you might be interested more in memory load than in CPU load, so you can assign a higher factor to memory, which means we'll multiply the score of memory; then we sum it all together and the best host wins. So this is how it works right now. It works like this in oVirt, it works like this in Nova in OpenStack, it works like this in Kubernetes. Basically, everybody is doing some version of this. Now, it has some limitations. First, it's one-by-one scheduling, which means you start one VM, it finds a place for it, and then you start another VM, it finds a place for it. Second, you have to track pending stuff, which means you start a VM, it's still not running yet, but your user clicked another "start VM" button, and you need to be aware of the fact that the first VM is going to be started on some host. It's still not there, so you don't see the memory consumption; you somehow need to track the future allocation. We have the pending counters for that. And yeah, the wait-for-launch versus starting, that's the actual issue we are solving with those. Now, it's one by one, and the order matters.
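To make the filter-and-weight pass concrete, here is a small Python sketch. The host fields, weights and scoring functions are illustrative stand-ins, not oVirt's actual policy units:

```python
def schedule(vm, hosts, weights):
    """One-by-one scheduling: filter out unsuitable hosts, then pick
    the best-scoring one (lower weighted penalty = better)."""
    # Filter step: drop hosts missing a hard requirement (capacity here).
    candidates = [h for h in hosts
                  if h["free_mem"] >= vm["mem"] and h["free_cpu"] >= vm["cpu"]]
    if not candidates:
        return None  # nothing fits; the one-by-one scheduler gives up here

    # Weighting step: each function returns a penalty, multiplied by its
    # weight and summed; e.g. a higher memory weight favours idle-memory hosts.
    def score(host):
        return (weights["mem"] * host["mem_load"]
                + weights["cpu"] * host["cpu_load"])

    return min(candidates, key=score)

hosts = [
    {"name": "host1", "free_mem": 8, "free_cpu": 4, "mem_load": 0.7, "cpu_load": 0.2},
    {"name": "host2", "free_mem": 16, "free_cpu": 8, "mem_load": 0.3, "cpu_load": 0.6},
]
vm = {"mem": 4, "cpu": 2}
best = schedule(vm, hosts, {"mem": 2.0, "cpu": 1.0})  # memory weighted higher
```

With memory weighted twice as heavily as CPU, the lightly memory-loaded host2 wins even though its CPU is busier.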
So here you see two different runs with the same set of VMs. It's very simplified, but it's the same set of VMs; you have the left column and the right column, and the only difference is that we are starting VMs A, B, C and D in the left column and D, C, B and A in the right column. You see that the result is completely different. The rule here is a very simple rule that says: use the host with the lowest load. Just because the order was reversed, the situation looks completely different. And we only have one value here, we only use memory, for example, but oVirt uses about ten different values and capabilities to check whether the VM should run on some host or not. So you see that this is not really optimal. But the advantage is that it's very fast: if you have the space to start a VM somewhere, we'll start it in a second or so. Now, what happens when you want to start a VM in the same situation, except there are already some VMs running? In total there is enough space in the cluster, but we want to start a VM that's very big, and it doesn't immediately fit on any host. For that we need better load balancing, because the VM can't be placed directly; we first need to create space for it. Unfortunately the current scheduler only does one-by-one, as I said before, and it doesn't know how to do that; it can't look one step ahead. We need a better scheduler for that. We also want it to be configurable by the current cluster migration policy. And the service is going to use quite a lot of resources, so we need a separate machine for it. So this is the goal.
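The order-dependence is easy to reproduce. This toy version of the "lowest load wins" rule uses two hypothetical hosts and four VMs; one order places everything, the reversed order strands a VM:

```python
def place_all(vms, hosts_free):
    """One-by-one placement with no lookahead: each VM goes to the host
    with the most free memory at that moment (the 'lowest load' rule)."""
    free = dict(hosts_free)
    placement = {}
    for name, mem in vms:
        target = max(free, key=lambda h: free[h])
        if free[target] < mem:
            placement[name] = None  # nothing fits any more
        else:
            free[target] -= mem
            placement[name] = target
    return placement

# Same VMs, same hosts, opposite start order:
forward = place_all([("A", 4), ("B", 3), ("C", 2), ("D", 1)], {"h1": 6, "h2": 5})
reverse = place_all([("D", 1), ("C", 2), ("B", 3), ("A", 4)], {"h1": 6, "h2": 5})
```

In the forward run every VM finds a host; in the reversed run the small VMs fragment the free memory and the big VM A no longer fits, even though total capacity is unchanged.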
We need to solve this situation, and I'll show you how we did that. So now some computer science theory. This whole thing is called the machine reassignment problem. Basically you define the problem by a set of machines and a set of processes; each machine has some resources, and each allocation requires some resources. You have affinity and anti-affinity rules: you might want some VMs to run together, collocated on one host, or some VMs to never run together because they use too many resources. The anti-affinity is actually very important in raising the complexity; it causes the complexity to really ramp up. And because of all that, it's an NP-hard problem. If you have a computer science background, you know that's about as hard as it gets, and it's not really solvable by deterministic algorithms. So brute force is a no-go: if you have more than two or ten VMs, it will just take a very, very long time to compute and it will eat all your resources, but we need reasonable response time. We took a look at a competition where some teams were competing on this exact problem; there's the link to it, from 2012. One of our Red Hat teams, from the JBoss project, actually competed there with their solution, and their solution uses a probabilistic approach. This basically means we are not trying to somehow compute what the right assignment is. We randomly create assignments and then compute a score. So we throw the dice and somehow spread the VMs across the hosts, and then suddenly we see: oh, but here we violated a hard constraint, and here it's not optimal because the memory is just too low. So we give it some kind of score, and then we throw the dice again. What happens is we get another solution and we can compare the scores; we keep the best one. Obviously, this alone is not very efficient, because if we just randomly select solutions we can revisit the same solution we already tried in the past. So that's not the way to go. Not exactly. So what do we have to do?
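The throw-the-dice approach from above can be sketched in a few lines; the score function here is made up (a flat penalty per overcommitted host as the hard constraint, load imbalance as the soft constraint):

```python
import random

def score(assignment, vms, hosts):
    """Score a candidate assignment: big penalty per overcommitted host
    (hard constraint), small penalty for imbalance (soft constraint)."""
    used = {h: 0 for h in hosts}
    for vm, host in assignment.items():
        used[host] += vms[vm]
    s = 0
    for h, cap in hosts.items():
        if used[h] > cap:
            s -= 10000                      # hard constraint violated
    s -= max(used.values()) - min(used.values())  # soft: prefer balance
    return s

def random_search(vms, hosts, tries=200, seed=1):
    rnd = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(tries):
        # throw the dice: an independent, completely random assignment
        cand = {vm: rnd.choice(sorted(hosts)) for vm in vms}
        sc = score(cand, vms, hosts)
        if sc > best_score:
            best, best_score = cand, sc     # keep the best one seen so far
    return best, best_score

best, best_score = random_search({"a": 2, "b": 2, "c": 2}, {"h1": 4, "h2": 4})
```

It works on toy input, but exactly as the talk says: it re-rolls the same assignments over and over, which is why the real solver constrains the randomness instead.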
We have to use some heuristics and some limitations on the randomness. There are a couple of algorithms that can help you with that. The simplest one, and the one that mostly everybody from university is aware of, is simulated annealing. Simulated annealing is basically modeled on how steel is forged: there is some movement of atoms, but as the temperature goes down the movement gets smaller and smaller. In this case, simulated annealing would select a random solution and then modify an assignment: it would move a VM to a different host, but just a single VM, the rest stays the same. Basically you modify variables randomly, but not everything at once, just one by one, and you are moving across the solution space. With progressing time you limit the amount of movement, you limit the distance, so you only make smaller and smaller steps, and then you reach some, usually suboptimal, solution. That's cool, but we couldn't use that, because of those suboptimal solutions; there are local maxima. What we use instead is tabu search. Tabu search is very similar to this; it also moves in the space of solutions, but if we've moved a VM and we see that the score is actually lower than before, we go back, and we remember that moving that VM was not a good way to go, for a certain number of steps, a thousand or ten thousand or so. Then we try a different direction, we get to some other local maximum, and after those ten thousand steps we can try to move that VM again, because the timeout for that move has expired. It's better for us; it got better results. So that's what we are using, but we actually tried simulated annealing too, and it's easier to explain. Then we have genetic algorithms.
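To make the tabu idea concrete before moving on: a toy tabu search over VM assignments. The score function, tenure and sizes are invented for illustration; the real implementation is OptaPlanner's:

```python
import random

VMS = {"a": 2, "b": 2, "c": 3}    # VM -> memory needed
HOSTS = {"h1": 4, "h2": 4}        # host -> memory capacity

def score(assignment):
    used = {h: 0 for h in HOSTS}
    for vm, host in assignment.items():
        used[host] += VMS[vm]
    s = 0
    for h, cap in HOSTS.items():
        if used[h] > cap:
            s -= 10000 * (used[h] - cap)          # hard constraint
    s -= max(used.values()) - min(used.values())  # soft: balance
    return s

def tabu_search(steps=50, tabu_len=1, seed=1):
    rnd = random.Random(seed)
    current = {vm: rnd.choice(sorted(HOSTS)) for vm in VMS}
    best, best_score = dict(current), score(current)
    tabu = []  # recently moved VMs: not allowed to move again for a while
    for _ in range(steps):
        moves = [(vm, h) for vm in VMS if vm not in tabu
                 for h in HOSTS if h != current[vm]]
        # take the best neighbouring move, even if it worsens the score;
        # that is what lets the search escape local maxima
        vm, host = max(moves, key=lambda m: score({**current, m[0]: m[1]}))
        current[vm] = host
        tabu.append(vm)
        if len(tabu) > tabu_len:
            tabu.pop(0)  # tenure expired: this VM may be moved again
        if score(current) > best_score:
            best, best_score = dict(current), score(current)
    return best, best_score

best, best_score = tabu_search()
```

Even on this tiny cluster it reliably finds the balanced 4+3 split, which is the best layout that doesn't overcommit either host.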
We are not using those, because we are running in a single thread; genetic algorithms are a perfect fit for multi-threaded and distributed systems, but that's not what we have. Although it's a good fit for this problem, since we are not running distributed we can't use it easily; it wouldn't give us any benefit. So now about the project that competed in the challenge. The project that implements all these optimizations is called OptaPlanner. It's part of the JBoss suite, well, the JBoss Business Rules Management suite. And behind the algorithms there is the Drools system. Drools is a rule-matching project: basically you write a certain set of rules, of the form "when something happens, do this action", and OptaPlanner builds the heuristics on top of that. Basically the "do something" action part is just: increase the score or decrease the score. These are very interesting projects. We are in the virtualization track here, and JBoss and Java are in a different part of this building, but I think there actually was a presentation yesterday about Drools. So if you have any interest in planning, and not just random planning but even business process planning, Drools is a project to look at. So what we did: we have taken Drools and OptaPlanner, plus our preprocessor, and we have written a service. It's a constantly running service on some isolated VM or separate host, because it really needs a lot of resources, and it has one solver thread per cluster. That's why we don't use genetic algorithms: we only have one thread, and it's constantly improving the solution. At the beginning it scans the cluster, it takes a look at all the VMs, all the hosts, all the resources, and it constantly computes the best allocation according to your rules. If the solution is not improving, it will pause and wait for updates. Every 30 seconds or so we scan the cluster again and we feed the changes to the optimization engine. When it receives the updates, it incorporates them into the current solution, and the solution
suddenly is not as optimal as it was, because some allocations changed, but it can reuse the old solution, and when it reuses it, it converges on a new solution much faster. So we are not computing everything from scratch again; we feed the changes into the existing solution and we just let it tailor the solution to the new state. That's how we save some CPU cycles. So this is the architecture we have. You can see that this part here is the standard oVirt architecture: you have the web admin site, you have a machine that runs the oVirt engine, that's the brain, the part that does the scheduling and keeps track of all the resources. It can be a VM nowadays, but that's not important for this case. Here you have your set of hosts, and on those you run your VMs. That's a standard setup, and here we added a machine, or a VM, with the optimizer service. You see the arrows there; that's the direction of communication. So oVirt internally communicates between the nodes, the engine and the web admin, but the optimizer only reads data from the engine. It's not sending anything back; it sends data, or actually it's pulled by the web UI plug-in, to show the results to the user. The whole optimizer runs as a REST service, so you only have a couple of GET and POST endpoints you can query for data, and it uses the REST API to query data from the engine. So this is the internal architecture.
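The incremental-update idea from a moment ago, reuse the old solution instead of recomputing from scratch, can be sketched like this. The data shapes are invented; the real service feeds Drools facts into OptaPlanner:

```python
def incorporate_update(solution, new_vms, new_hosts):
    """Patch the current solution with fresh cluster facts: keep every
    assignment that is still legal, unassign the rest. The solver then
    keeps improving from this patched solution instead of from scratch."""
    patched = {}
    for vm in new_vms:
        host = solution.get(vm)
        # VMs that just appeared, or whose host vanished, start unassigned
        patched[vm] = host if host in new_hosts else None
    return patched

old = {"vm1": "hostA", "vm2": "hostB"}
# next 30-second scan: vm3 appeared, hostB was removed from the cluster
new = incorporate_update(old, ["vm1", "vm2", "vm3"], ["hostA"])
```

Most of the old solution survives the update, which is why convergence after a scan is much cheaper than a cold start.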
It's a bit more complicated, but what it is is this: one solver, one thread. We get all the facts from the engine, then we preprocess them into other data, apply the factors from your rules configuration, and then this is the solution. We have a couple of predefined steps. At the beginning, when you configure the optimizer, you select how many steps into the future you want to look. Here I only have five, and some of them are not valid, because the VM is not set or the host is not set, but the steps that are valid, we have two, are here and here. Those are returned to the user. The user can actually check whether some solution, even a manually crafted one, is a valid solution for the current cluster. That's actually pretty good for debugging; it's even important for one of the features I'll be talking about in a second. One thing we have: in the management UI you can right-click a VM and select "optimize start". What it does is record that this VM should be running, and it can do the multiple-step start. So it can first create space for the VM and then give you the result, the set of migrations that end with that VM running. That's the goal we had when we weren't able to start a VM because it didn't fit immediately. So how do we get data from the engine? First, we do cluster discovery every couple of minutes, so when you add a new cluster we will discover it eventually and start a new solver thread. Then, every 30 seconds, as I said, we do cluster updates, where we read the list of all the resources, and we use the REST API. For performance reasons we use the REST API in bulk, which means we query for all the VMs and all the hosts in one go. The oVirt engine REST API is not the fastest REST API ever.
So we had to optimize that a bit, and there are still some things to do to make it faster. Now, cluster facts: what we actually do with those. We get the REST entities from the SDK, or the REST API, and we convert them to objects that can be inserted into the Drools fact database, KIE. Sorry, KIE actually stands for "Knowledge Is Everything"; that's an abbreviation the Drools people use. It's a fact database: you have a huge heap of simple objects, or not that simple, but a huge heap of objects, and it doesn't have any structure. But what Drools can do on top of that is pattern matching. So you can ask for a VM with these two fields set to certain values, or you can ask whether some object with certain properties exists, and it's all done using pattern matching. It's all cached, so it's very, very fast, but it eats memory. That's why the optimizer is heavy on resources. So we have a supervised update cycle: every time we change something, we have to update the constantly running optimization service, so it knows to replace or update the caches. We have three different fact sets. First, we have cluster state facts; that's the stuff we collect from the cluster. Second, we have configuration facts; we also collect some of your configuration, whether you set your cluster to be evenly balanced or power saving, which have different sets of rules for which hosts the VMs should be allocated to. We collect that too. Plus user requests; currently
that's the "start VM" request. Some entities are pre-processed, basically because the REST API entities are pretty deep and pretty complicated, and the caching mechanism in Drools only caches top-level objects, or accesses to top-level objects. You can't replace something that's deep in the structure; you have to replace the full object. So we pre-process stuff into VM info and host info objects, just to make it faster, and actually more readable, when we access the data from the rule system. Now, an important difference: any optimization relies on the model perfectly matching the world, the domain. Unfortunately that's currently not the case. We are pretty close, but not as close as you would like to be, because in the engine, the internal one-by-one scheduler, we have policy units, and those are standard Java classes with algorithms and computations, with direct access to the engine database, so we have much more data there. The optimizer uses the REST API; it's not getting all the data the engine actually has in the database. We're improving that, but we're still not getting everything. But it uses pattern matching, it's declarative, and it's pretty easy to read. I'll show you a complicated rule that might not look like that to you, but it's not hard to read. You can use collections and sums of values, but you are not doing any algorithmics; you're not writing your for loops and doing multiplications.
It works a bit differently. So the exact match is not achievable right now, but we are getting closer. So this is how a simple rule looks. This is even valid when you think about Drools in general, but this is a rule for the optimizer. We have just two sections, "when" and "then". Each line in the "when" part is one condition; this one is split across lines, but each line is one condition, and all three conditions have to match for the "then" action to be executed. In this case we are looking for a migration step; that's a common line in basically every rule we have. Then we are checking whether there is a VM that's not assigned anywhere, and RunningVm is one of the objects that represent the fact that the VM should be running. So if we have a VM that's not assigned to any host, and it should be running, that's clearly a violation of a rule, so we decrease the score, by 10,000 in this case. That's a pretty steep price to pay, but then a VM that should be running and isn't is also a pretty steep violation of the rules. And this is a complicated rule. You see it got a bit more dense; there is a collection in here, accumulate. It basically sums the memory consumption of all the VMs on a single host, so we can check that there is enough space for a new VM. This is just an example; it's actually simplified, the real one is a bit more complicated. So you see it's still not that hard to read, but it can get a bit more complicated. It's still no Java. So now, what do we compute? When we have all the information and we ask for the solution, the first attempt was just to compute the ideal solution. I have my VMs, I have my hosts, what's the best layout?
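To give you a feeling for what those two rules compute, here is a rough Python equivalent. The -10,000 penalty matches the talk; the names and data shapes are illustrative, the real rules are declarative Drools pattern matches:

```python
def score_rules(assignments, should_run, vm_mem, host_mem):
    """assignments: vm -> host (or None); should_run: VMs that must run;
    vm_mem / host_mem: memory demand and capacity."""
    score = 0
    # rule 1: a VM that should be running but is assigned nowhere
    for vm in should_run:
        if assignments.get(vm) is None:
            score -= 10000
    # rule 2 (the "accumulate"): sum the memory of all VMs per host
    # and penalize hosts that would not have enough space
    used = {}
    for vm, host in assignments.items():
        if host is not None:
            used[host] = used.get(host, 0) + vm_mem[vm]
    for host, mem in used.items():
        if mem > host_mem[host]:
            score -= 10000
    return score
```

In the Drools version each "when" line is a pattern over the KIE fact heap and the "then" part only adjusts the score; there are no explicit loops like the ones above.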
Well, we got the best layout. You see, this is the rule that says the biggest VM should be on the second host, for example; it's an artificial rule. So the best layout is here: the biggest VM is on the second host, the others were moved to the first host, and from this point of view it's a perfectly valid solution. Unfortunately, it's not reachable. When you try to migrate the VMs over, you see that at one point during the migration you don't have enough space to actually do it. So we improved that, we changed the code a bit, and we started computing the migration steps. That's much more complicated in terms of algorithmics, and much more complicated in terms of resource usage, because suddenly you are not computing one final state, you're computing intermediate states too. But it actually shows you when a solution is not possible. So although the result is desirable, it's not reachable from the system as it is now. It might be in a Docker-based cluster, for example; we don't support that, but if we used something like this for Docker, or maybe for OpenStack, they can shut a VM down and start it again on a different host; for them that's a viable use case. Unfortunately oVirt treats VMs as pets, which means we won't ever kill a VM if we can avoid it. So we only rely on live migration, and you see that with migrations it's not possible to reach the best scenario. It's slower to converge, but hard constraints are checked for every step; soft constraints are only checked for the result, because we don't care if the score temporarily decreases during the migration procedure, as long as at the end we get better cluster balancing. Now, how we report results. We have a REST interface with one endpoint per cluster, ovirt-optimizer/results/<cluster>. It gives you JSON. I'm not sure how well you can read that, because it's pretty small, but basically you get the pre-processed settings.
You get a pre-processed host-to-VM map, you get a pre-processed VM-to-host map, you have the migration steps, you have the current situation. It's all done this way (sorry, there's some kind of noise and echo in the sound) so that your JavaScript, or whoever is processing the result, has to do as few computation steps as possible. You can just take it and display it on the website without any complicated processing. And this is how it looks when we process the JSON in JavaScript. You get a page with the solution status; it tells you whether the solution is doable or not doable at all. It shows you the migration steps that are planned for you, and you can either execute them or cancel them. There is no automation; currently we only create a hint for the sysadmin. We plan on having some automatic stuff in the future, but we were afraid it might start migration storms in production clusters, and we couldn't risk that, since it's basically a tech preview. So we only created it as a hint for the sysadmin, and there are buttons that will start the migrations, but the sysadmin has to confirm them. So we show the destination, well, the resulting situation; that's at the bottom here. For each host we show you the VMs that will end up being there, with all the buttons to start migrations; we show you the VMs that are supposed to be started; we show you the migration steps. That's what you will see. It's all correlated using UUIDs from the engine. As I said, applying the solution is manual at this time, and we monitor status: when you start a migration or start a VM, we'll actually show you what's happening. Now, there is a "solution freeze" button. The solution freeze button is important because when you compute 30 steps and each VM has four gigabytes of RAM, for example, it can take some time to actually migrate a VM, and you want to wait for the VM to be fully migrated before you start another step,
because otherwise you might not have the space you need for some other step. Since we change the solution every 30 seconds or so, the solution you had at the beginning might be completely different from what you get from the optimizer at the end of your procedure. But you might have liked the first solution and you just want to follow it to the end. So you can freeze the solution. If you freeze it, we won't update the web page; obviously the optimizer is still running, but we won't update the web page, and you can keep clicking the buttons. What can happen, though, is that somebody else does some migration too, and your solution is no longer valid; I mean, they took the space you prepared for some other VM. So although we've frozen the solution, we are still constantly sending it back using one of the endpoints I talked about before, and we get the score back. We are not running an optimization on your solution, or on the existing solution; we are just recomputing the score, to see whether it's still valid or not. If it's still valid, we do nothing; if it's not valid, we'll tell you: hey, your solution is no longer valid, you might want to refresh the page or unfreeze it. So even though you freeze the solution, we still constantly talk to the optimizer. Now, we have a couple of hidden gems. You can write your own custom DRL rules using the Drools language. You just put them into /etc/ovirt-optimizer/rules.d; we automatically scan and include them. If you have a typo or syntax error in there, it will blow up.
Obviously it will not start then, but if it's correct, it will start and it will use your rules. You can use all the pattern matching magic. The standard set of rules is part of the project; it's internal, so you can replace it, but you can also add your own, like "a VM for this customer has higher priority", stuff like that. We also have simple scheduling. At the beginning I talked about how scheduling is done in all the widespread projects today; we have the same thing in the optimizer too. If you want to run the optimizer without the optimization engine, to save resources, but you still want to use the rules, you can just give it a number of steps of zero. It will not start the optimization, but it will still allow you to post a specially crafted JSON with all the information to an endpoint, and it will give you a list of hosts and their scores, and then you can select the best host and use it as the destination. That's very cheap in terms of computing power, because it's not running the optimization. And if you just want the same set of rules for the optimizer and for simple scheduling, that's how you can do it: you can have the same service provide you both, the simple stuff and the complicated stuff. It's useful for starting a VM. When you have enough space for starting the VM, you might not want to run the full optimization just to get the one step that tells you "start it on this host", because that can take time. The optimizer is a non-deterministic process, and that means it can take 30 seconds or it can take two minutes, because it can select bad random solutions. But when a user clicks the button, they expect the VM to be started pretty fast, and if there is enough space for the VM, there is no reason to use the optimizer for that.
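In that zero-steps mode the service effectively does one scoring pass and hands back a ranked host list for the caller to pick from. A rough sketch, with invented rules (a hard "fits in memory" check and a soft "prefer idle hosts" preference):

```python
def simple_schedule(vm, hosts, rules):
    """Apply every rule once per host, sum the scores, rank the hosts.
    The caller (the engine) picks the winner as the start destination."""
    ranking = sorted(((sum(rule(vm, h) for rule in rules), h["name"])
                      for h in hosts), reverse=True)
    return [(name, score) for score, name in ranking]

# illustrative rules, not the optimizer's real rule set:
rules = [
    lambda vm, h: 0 if h["free_mem"] >= vm["mem"] else -10000,  # hard
    lambda vm, h: -int(100 * h["cpu_load"]),                    # soft
]
hosts = [{"name": "h1", "free_mem": 2, "cpu_load": 0.1},
         {"name": "h2", "free_mem": 8, "cpu_load": 0.5}]
ranked = simple_schedule({"mem": 4}, hosts, rules)
```

One pass, no search: that is why this mode answers in milliseconds, while the full optimizer may wander for minutes through random candidate solutions.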
So you can use both approaches: simple scheduling for starting a VM when there is enough space, and the optimizer for starting a VM when there is not enough space, and for migrations. Then we also have debug endpoints. You have to enable those in the config file, because they expose all the information about your cluster, but basically you can create a dump of the full KIE database. You get all the stuff in JSON and then you can submit it back. That alone wouldn't be too useful, but when you submit it back you can actually modify it, so you can manually make changes to the JSON file to see what will happen, basically to test the responses to different scenarios you might have in your cluster. So you can use it as a test bed for checking the score when you change VMs or add a host. We use it for debugging, obviously. When you run the master version, because this is not in the released version yet, and you find a bug in the scheduling part, I mean "it's not treating my VM like it should", you can create the dump and send it to us. We can import it into our optimizer without having your cluster; it's not necessary to have your setup, and we can see the rules and everything that was supplied. Now, future plans. Tighter integration with the business rules management system, that's the JBoss part, especially with regards to deployment and RPMs. Currently you have to download a zip file. On Fedora, and in the oVirt world, we would do it for you from the RPM package, with all the checksumming and everything to make sure it's secure, but if you are using the supported version of Red Hat Enterprise Virtualization, we can't do that. So there you have to download a zip file and execute the ovirt-optimizer setup script, which will unpack it and prepare everything. Then, automation of the optimization.
As I said, currently the sysadmin has to click all the buttons. We could provide automation, but we don't have it yet. We want to be pushing cluster data updates to the optimizer instead of polling the engine, because currently we poll every 30 seconds. You can configure the interval, but the default is 30 seconds, and that's not efficient. First, we poll even when nothing changed; second, we only poll every 30 seconds, and when huge changes are underway in the engine, we should be doing it more often. Support for more policy units: the internal rules file is missing some rules we have in the engine; we don't have them in the optimizer yet. It takes some time to write and test them, and we just don't have that many developers working on it, too bad. And we want to review OpenStack's Gantt, Kubernetes, Mesos, all the projects like that, because we might be able to work with them and write pre-processors for the optimizer for them, so they would be able to utilize it as well. So, how you can test it yourself. Well, you install oVirt, you have a couple of hosts, and you install this repository; it will add all the necessary files to your configuration. Then you just do "yum install ovirt-engine" on one machine and "yum install ovirt-optimizer" on the other, and you configure and prepare it. On this page there is a set of steps you can perform to install it manually; that's the upstream version. You see that the feature is called "OptaPlanner integration"; it was named that because OptaPlanner is the library behind it and we didn't have the ovirt-optimizer name when we started the project. It's kind of hard to change the name of a wiki page so that everybody can still find it. So, yeah, the other way you can go: I have a Dockerfile on GitHub which you can use to create a Docker image on your own machine.
You still need the engine, you still need the source of information, but you don't have to install a machine for the optimizer; you can just run the Docker image on your system. And I think that's it. I would like to show you a demo, but I don't have a VPN connection, unfortunately. So just one screenshot of how it looks. You already saw this page, but that was a sub-page in a frame; it's integrated into the oVirt engine. So this is the engine management part showing the list of clusters; you select a cluster, there is a sub-tab with the optimizer result, and that switches to this page, and there you see everything. So that's it. Thanks. And if you have any questions, I have some nice scarves here. [Question from the audience.] Sure. Okay, so the question is: I talked about having constraints like "this host has enough memory", and the question is whether there are constraints like "a storage device needs to have enough space" or "the NICs are present", for example. Of course, we have constraints with regards to memory, CPU load, networking devices, pass-through devices; actually, I don't remember all of them. There is special configuration for the display network, because we support different virtual networks for transferring the display data, if you open your VNC or so, and for the management, and for storage. We don't have any rules for storage, because in oVirt storage is managed separately. We have shared storage, so if you have the VM, you have the storage; we access it over the network, so the host doesn't have to have enough storage. But yeah, we have quite a lot of rules; I think the number is about 20 right now. And the optimizer separates some of those into multiple rule sets, because we also have minimum guaranteed memory and actual memory. The minimum guaranteed always has to be there; the actual maximum value is a nice-to-have, so it's only a soft constraint if that's not there but the guarantee still is.
You can still start the VM. [Another question.] Yeah, sure. Okay, so the question is whether it's possible to define an additional resource to track in the optimization in oVirt. It's possible in the oVirt engine: we have something called the external scheduler, where you write your own policy unit for the oVirt engine, not in Java like the internal ones I was talking about, but in Python, and there you can do whatever you want using the REST API. The integration with that is not yet present in the optimizer; that's one of the goals too, we want to scan for units provided by the Python part. But the Drools language is pure Java; I mean, it's pattern matching, but all the expressions are pure Java, so you can call whatever Java function you want. Of course, if you call something that's heavy on resources or slow, it will slow down the whole optimization. [Another question.] Okay, well, as I said... okay, sorry, the question is whether it's possible to use it for aggressive migration management, like power management during the night. As I said, this is just a rule engine, so it will propose the steps, but it's not actually executing the migrations. Obviously, any migration it proposes can be executed, so yes, you can execute migrations, and by the way, we do have power management in oVirt. Yeah, okay. The second question was whether it's useful for moving VMs from one cluster to another. That's generally not allowed in oVirt; we only allow it for special purposes, like upgrades, when you are upgrading between different versions. But the optimizer is cluster-bound, so it doesn't know about any other clusters; it only optimizes within a single cluster. I'm out of time here, so thanks everybody, and the three people that asked questions can come to me and I'll give them a scarf. Thank you.