All right, it is 3:20, so welcome, everyone. Our next talk is "Automating Load Balancing via Predictive Analysis" by Stephen Rosenberg. I'll be sharing a recording of this, so if you notice any issues, please reach out in chat and I'll make sure to fix them. I'll be sharing my screen now.

Hi, I'm Stephen Rosenberg, and this presentation is on predictive analysis for migration schedulers, also known as automating load balancing and fault tolerance via predictive analysis. This presentation will explore the possibility of using predictive analysis to predict when some processes will complete, in order to start the migration of mission-critical or high-priority processes early, so that the soon-to-be-freed resources provide performance improvements within a distributed system, such as within a data center or possibly across data centers. Early migration may also include fault tolerance solutions, in which there is growing interest. The ability to migrate processes, whether they are virtual machines, worker nodes managed by orchestration applications, pods, containers, or other types of processes, provides many options in conjunction with predictive analysis, which we shall discuss shortly.

So the topics of discussion will be load balancing, fault tolerance, scheduling, types of solutions, live migration, and predictive analysis.

So this is a load balancing example. You can see you have a load balancer, with process one running on host one, process two running on host two, and process three running on host three. Because we only have three hosts, process four will then run on host one, giving an even distribution, and we call this a round-robin approach. With load balancing we can also have priority based upon urgency, with even distribution within categories. We can have urgent priority, high priority, neutral priority, low priority, and no priority. As a simple solution, consider the case when urgent-priority or high-priority processes need more resources.
You can actually bring down the low-priority and no-priority processes accordingly, and that would give the higher-priority processes more resources.

So, fault tolerance. Well, what can go wrong? There are many opportunities for failure. We have network elements that can fail, hardware and resources that can fail, operating system, BIOS, and internal failures, and of course process failure.

So, a fault tolerance redundancy example. Redundancy is one way of mitigating failures and avoiding downtime. You have two hosts here: host one, which is currently the active host, and host two, which is currently the passive host. You see you have two storage elements and you have multiple connectivity. Basically, what happens is that when host two detects that host one is no longer responding, it becomes the active host. The problems with this scenario are, one, cost, and two, latency, because it takes time for host two to realize that host one is not active before it becomes active itself. There is also the issue of split brain: if somehow the two hosts can't communicate with each other, you can have what's called a split-brain scenario. So there are many problems with this type of redundancy.

So, scheduler dispatching concepts.
So this is an example of that. Here you have a process eight on a queue waiting to be launched, and a process nine on the queue waiting for process eight to be launched so that it can then be launched. You can see that we have a free slot on host two and a free slot on host three. So process eight could be launched, or migrated if you will, to host two, and process nine could be launched or migrated to host three. But what if process eight needs two slots? It would need to wait until process five exits in order to be launched or migrated to host two, and if the scheduler is not very intelligent, process nine will have to wait until process eight is launched.

If we want a better scenario and we want to be more intelligent, we could say: okay, let's migrate process five to host three, and then we can do the launching or migration of process eight onto host two. The concept with early migration, though, is that if we can detect that process five will be exiting within the time it takes process eight to launch and/or migrate, we can start the launching and/or migration early. We can then also start the launching or migration of process nine onto host three. So when we're finished, process five will no longer be running, process eight will be running on host two, and process nine will be running on host three. That's an example of early migration.

So, scheduling: the ability to launch processes based upon needed resources. Monitoring the amount of resources each process utilizes is one example of obtaining that. The types of launching and migration scenarios can be initial launching, as we discussed. Migration for maintenance: if you have to add new hardware to some host, you can migrate all of its processes to another host without bringing down the processes, so that you can then bring down the host and add the hardware you need. Resource rebalancing: migrating from one host to another where the other host has more resources, so that the processes can run more efficiently. And we also have fault recovery: migrating to
mitigate system and/or process failure.

Now, policy units and the attributes of scheduling migration. First we have filters. If a process that needs to be migrated needs a certain piece of hardware, those hosts that do not have that hardware will be filtered out, and the rest of the candidates, which do have that hardware, can then be weighted and scored based upon the load balancing policy. For example, for even distribution, hosts that have more resources will have a higher score. For power saving, those hosts that use less energy will have higher scores. We also have prioritizing, which we discussed, affinity, and CPU and NUMA pinning for optimal performance. So there are many criteria for different types of balancing in order to decide which host is the best candidate for migration.

So, the types of solutions for applying predictive analysis. We have live migration, load balancing, as we discussed, and fault recovery. Also, minimizing the live migration pausing: we could use predictive analysis for that as well, and we'll discuss that shortly. We have redundancy as well, which we also discussed, where we have the distribution of processes that are running simultaneously; that duplication means more energy usage. And of course fault recovery: when there are failures, redundant processes and/or hosts can then take over from the hosts and/or processes that are not responding.

So, live migration. Well, there are a few things to consider with live migration. Remember, we're migrating a process from a source host to a destination host. So the first thing to consider is the connectivity.
We also need to consider the remote disk availability. We need to migrate all of the local disk data from the source host to the destination host as well. And we need to copy all of the memory in phases. First we copy all of the memory content while the process is running on the source. Then we pass the differences, because, remember, the process is still running, so it is changing the memory. So we have to pass the deltas, or differences, of the memory until we reach the minimum pause time, where the differences are small enough to allow the process to be paused for the smallest amount of time. Once we reach that minimum pause time, we stop the process on the source, we pass the rest of the memory to the destination, and we copy the CPU states. Again, the goal is to limit the pausing of the process. Then we restart the process on the destination host seamlessly, and then we clean up on the source.

So, a live migration transitioning example. This is the sequence of events. We set up, we synchronize the disk, and we start the memory transfer while the process is still running on the source. We estimate the minimum downtime, and we continue the memory transfer in deltas until we reach the minimum pause time. Then we pause the process, we activate the network on the destination, we complete the memory transfer, and then we run the process on the destination and clean up the source.

So, live migration from host one to host two, transitioning. You can see here that the guest process has already migrated from host one to host two, but the storage data is still sitting on host one. So we have two controllers whose job it is to migrate the data in the local storage, on the local disk, from host one to host two until all of the data is passed. In the meantime, if the process needs data that is still on host one, it can obtain that data through the two controllers.

Now, predictive analysis, and these are the topics for the discussion. Predicting future occurrences via
analysis of past performance: that is the concept. The techniques for predictive analysis will be discussed, including the process for developing a predictive model, the types of predictive models with examples, and then applying these techniques to scheduling.

So this is the predictive analytics methodology. We have historical data, and from that we extract a training set. We develop an algorithm. The algorithm reads from the training set and creates a model. We feed the test set data into the model and obtain a result. We compare the result to the expected result and obtain a percentage of error. Based upon that percentage of error, we then adjust the algorithm, maybe the weights in the algorithm if it's a neural network, and then we do the process over again until the percentage of error is at a minimum threshold that we find acceptable. And that's called machine learning.

There are many techniques for predictive analysis, and I won't go over them all. The key is not to be a solution looking for a problem. You need to define the problem first and then choose the solution that best fits the problem based upon the required criteria.

So this is the process for developing a predictive model. We define the project, we collect the data, we analyze the data, we validate the data, and then we create a model. We deploy that model and we monitor that model. Again, we calculate the percentage of error based upon the results against the expected results, and we redefine the project. Then we do the process cyclically. Again, that's machine learning.

So, the types of predictive models, with examples. We have support vector machine models. Those are classification models, to predict a category. For example, whether stock prices might increase or decrease is one criterion, or one application, that we might want to solve, and that's obviously a binary, yes-or-no kind of model. Then we have predicting a quantity, which is a regression model. Examples can be predicting a person's age based upon their height,
weight, health, and/or other factors. We have anomaly detection: normal behavior versus exceptions, and those exceptions are anomalies. An example could be money withdrawal anomalies. If money is missing from your bank account, that's obviously something that should concern you. And clustering: discovering structure in unexplored data. An example might be finding groups of customers with similar behavior, given a large database of customers containing their demographics and past buying records. So those are some of the examples.

So, for applying predictive analytics to schedulers, we have certain criteria for the data that may be required. We would want to consider processing time and/or processing iterations, and that can be adjusted for resource capacity as well as priority. Then the percentage of resources used, which could also be adjusted for capacity and priority. And we need to adjust for anomalies when calculating averages, which we'll discuss shortly.

We can collect certain ideas or select techniques that have been applied to other scheduling-type applications, such as machine learning and advanced mathematical models. We can consider combining regression-like modeling and functional approximation, using the sum of exponential functions to produce probability estimates. And there are many other examples.

So here's a predictive analysis architecture. We have the historian, and it collects the data from the scheduler. The types of data would be information about the CPU, such as the percentage of available CPU as opposed to total CPU. Memory, storage, and networking can be considered too, not only in terms of size but also throughput. That data could be collected and stored in the historian. The predictor would then read the historian and create a prediction based upon the data, such as whether to perform a live migration, and then it could send a trigger to the scheduler to do just that.

So, tracking historical data. What data would be interesting?
The time each process starts and terminates, for early migration. The resources used by each process. The time each process takes to migrate. And the time and/or iterations that memory and/or disk transfers take, per size.

Considerations based upon analysis can be whether early migration can proceed, when early migration should first start, and error correction and/or anomaly detection for accurate results. For example, anomaly and/or error calculation methods to consider in order to gain more accurate results could be statistical: calculating the percentage of error from the mean and eliminating results outside of a threshold could be one way of filtering out outliers. There are signal processing techniques, such as smoothing filters to eliminate glitches. You can even use machine learning techniques, such as analysis of patterns, to categorize between normal and out-of-range results.

Thank you. I'm open to any questions.

Very cool. Thanks for that, Stephen. So, is Stephen with us, so that we can answer questions now?

Yes, he is, and I'm happy to forward them to him.

Okay. Yes, I read Karen's question about noisy data. I covered trying to eliminate outliers, and there are different techniques that I mentioned, such as statistical approaches where you calculate the mean and the percentage of error from it and then eliminate outliers beyond the threshold. You could also use machine learning techniques for that, because outliers are anomalies, and if you treat them as anomalies, you can filter them out. Signal processing works as well, though that might be a lot more sophisticated. But the idea of reducing noise is that noise is an anomaly, so it's a matter of eliminating them and filtering them out.
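The statistical approach mentioned here can be illustrated with a minimal Python sketch. It is only one possible reading of "percentage of error from the mean": the function name `filter_outliers`, the 50% default threshold, and the sample runtimes are assumptions made for illustration and do not come from the talk.

```python
from statistics import mean


def filter_outliers(samples, max_pct_dev=50.0):
    """Keep only samples whose deviation from the mean is within a threshold.

    samples      -- observed values, e.g. process runtimes in seconds (hypothetical)
    max_pct_dev  -- allowed deviation from the mean, in percent (assumed default)
    """
    if not samples:
        return []
    avg = mean(samples)
    if avg == 0:
        # A percentage deviation is undefined for a zero mean; keep everything.
        return list(samples)
    return [
        s for s in samples
        if abs(s - avg) / abs(avg) * 100.0 <= max_pct_dev
    ]


# Hypothetical runtimes: four normal runs and one run that failed and restarted.
runtimes = [10.0, 10.5, 9.5, 10.0, 30.0]
cleaned = filter_outliers(runtimes, max_pct_dev=50.0)
print(cleaned)
```

Because the mean itself is pulled toward the outlier, a production version might prefer the median or repeat the filter until the result is stable; that choice is outside what the talk specifies.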
For example, with early migration, if something failed and the process went down and didn't complete, that would be the outlier that you would want to filter out. Otherwise your results won't be accurate when the machine has to really learn how long a process might take. Anything else?

Karen is also asking if the data is open sourced and published somewhere. They'd love to explore this data.

Yes. Well, there are a few phases with this, and one phase can actually be that you run stress testing against the CPU, stress testing against the memory, and networking stress testing, and you can generate the data that way. We've done some investigation into that, but we haven't had enough resources to really move forward with it. That's something that we do want to do. Once we generate that data, and there are different ways to do that, we would monitor the CPU and the memory, collect the metrics, and based upon that we can see what the thresholds are. By stressing the system you then see where the limitations are, and those thresholds can be used as part of the learning process.

So the first step would be to generate the data, and that is the beauty of this as opposed to most machine learning processes, where you usually collect the data first. For early migration the system would have to really learn, and you would have to analyze when the processes went up and down. But for fault tolerance you can actually use stress testing in order to generate the data. Then you can clean the data, and then you can use that data for the training. So I have many ideas in that area where you don't have to wait for a real fault. It would take too long to actually generate a fault, and it probably wouldn't be accurate.
It could fault in different ways. But using stress testing and aggregating the metrics is just one example of detecting when something might fail. So actually generating the data, in this case, works well for the concept of migration for fault tolerance. I hope that's clear. Anything else?

It doesn't seem like there are any more questions. Well, again, thank you so much, Stephen, for hopping on. I know it's very late for you. This is some very interesting stuff. For everyone else who attended: the talk will be published later, right? It's all going to be available on YouTube. Yeah. Thank you again, Stephen. Have a good night.

You too.
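The early-migration decision described in the talk (start launching or migrating a waiting process if the process occupying its slot is predicted to exit within the launch or migration time) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function name `should_start_early`, the use of the median of past runtimes as the predictor, and the `margin` parameter are all hypothetical.

```python
from statistics import median


def should_start_early(past_runtimes, elapsed, migration_time, margin=0.0):
    """Decide whether a waiting process can start launching/migrating early.

    past_runtimes  -- historical total runtimes of the blocking process (seconds)
    elapsed        -- how long the blocking process has been running so far
    migration_time -- estimated time to launch/migrate the waiting process
    margin         -- extra safety buffer in seconds (assumed, not from the talk)
    """
    if not past_runtimes:
        return False  # no history in the historian, so no prediction

    # Assumed predictor: the median of past runs. The talk leaves the model open.
    predicted_total = median(past_runtimes)
    predicted_remaining = max(predicted_total - elapsed, 0.0)

    # Start early only if the slot should be free by the time migration finishes.
    return predicted_remaining + margin <= migration_time


# Hypothetical numbers: the blocking process usually runs about 100 seconds.
history = [100.0, 110.0, 90.0]
print(should_start_early(history, elapsed=95.0, migration_time=10.0))
print(should_start_early(history, elapsed=50.0, migration_time=10.0))
```

In the architecture from the talk, the history would come from the historian, and a `True` result would correspond to the predictor sending a trigger to the scheduler.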