Hi, everyone. Thank you so much for joining our presentation. Most people interested in this presentation have probably faced issues with pre-copy live migration: the migration fails to converge, guest performance degrades during the migration, or the migration runs for hours or even days in some cases without any idea of the expected time to completion. So what if we could guarantee that live migration converges for any workload? What if we could live migrate the VM without making it slow? And what if we could predict the migration time in advance? Shivam and I will talk about a microstun-based live migration throttling algorithm that targets these issues. We also call this dirty-quota-based throttling.

This is our agenda for today's presentation. First, I will talk about background, design, and implementation, and then Shivam will take over to talk about results, conclusion, and future work.

Our microstun-based algorithm improves pre-copy live migration. Because pre-copy live migration can recover the VM on the source even in case of network failure or destination host failure, it is the most popular scheme for live migrating VMs. Here is an overview of how pre-copy live migration works. First, there is a preparation phase where we enable dirty tracking. Then we start sending guest memory to the destination host. Pre-copy live migration runs for multiple iterations. In the first iteration, we transfer the full memory of the guest, and from the second iteration onwards, we transfer the memory that was dirtied while the previous iteration was going on. This continues until the amount of dirty memory left is small enough that we can transfer it within the blackout time limit. When we reach that point, we stop the VM on the source host, transfer the remaining dirty memory to the destination, and then resume the VM on the destination host.

But pre-copy migration has some limitations, like non-convergence and long-running migrations. In pre-copy live migration, network bandwidth and dirty rate are the two factors that decide convergence. To ensure convergence, the dirty rate of the VM should be less than the network throughput. In cases where the dirty rate is higher than the network throughput, we depend on throttling to bring the dirty rate within limits.

So how does the current throttling algorithm of QEMU work, and what are its limitations? The current algorithm reduces the CPU run time of the VM. For example, at 99% throttle, all the vCPUs of the VM will be running only 1% of the time. As the run time of the VM is reduced, its total dirty rate is also reduced. But this algorithm has some limitations. Because it throttles all vCPUs irrespective of their contribution to overall memory dirtying, it also penalizes read-heavy vCPUs, even though reads do not contribute to memory dirtying. It also does not guarantee convergence even at 99% throttle: in a very high dirty rate case, even if the VM is allowed to run only 1% of the time, it can still dirty more than the network throughput. Another limitation is that the throttling adjustment is done only at the end of a live migration iteration, because we get dirty statistics only when we do a bitmap sync, which happens at the end of every iteration. Due to this, it does not react to workload changes efficiently and can take up to five or six iterations to find the proper throttle. And we actually start heading towards convergence only after finding the proper throttle; any work done before that is wasted work and time. For example, in the first iteration, which is actually the longest iteration since we have to transfer the full memory of the VM, we do not throttle the VM at all, as we do not have any dirty statistics yet.
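To make the current scheme concrete, here is a minimal sketch, in the spirit of (but not copied from) QEMU's throttling, of how run-time-based throttling can be applied uniformly to every vCPU thread; the period length and all names are our own illustrative assumptions.

```c
/*
 * Illustrative sketch of run-time based throttling: every vCPU thread,
 * read-heavy or write-heavy alike, is put to sleep for throttle_pct
 * percent of each period, so at 99% throttle the guest runs only 1%
 * of the time. Not QEMU's actual code; names and period are assumed.
 */
#include <time.h>

#define THROTTLE_PERIOD_NS 10000000LL   /* 10 ms period (assumption) */

/* Called periodically on each vCPU thread while migration is running. */
void throttle_vcpu(int throttle_pct)
{
    long long sleep_ns = THROTTLE_PERIOD_NS * throttle_pct / 100;
    struct timespec ts = {
        .tv_sec  = sleep_ns / 1000000000LL,
        .tv_nsec = sleep_ns % 1000000000LL,
    };

    if (sleep_ns > 0) {
        nanosleep(&ts, NULL);   /* the vCPU makes no progress while asleep */
    }
    /* The remaining (100 - throttle_pct)% of the period is guest run time. */
}
```

Note that nothing here looks at what the vCPU is actually doing: a vCPU that only reads memory is stunned exactly as hard as one dirtying pages at full speed, which is the unfairness described above.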
Until now, the pre-copy live migration scheme has throttled the guest unconditionally, without knowing what kind of workload the individual vCPUs are running. In the last slide, we discussed a few limitations of the current implementation: migration fails to converge, it cannot adapt to workload changes quickly, and it unfairly throttles read-heavy vCPUs. Now we will discuss the implementation of our microstun-based algorithm and how it resolves those limitations.

Let's take convergence first. To ensure convergence of live migration, we have to limit the dirty rate of the VM to within some factor of the network throughput. The algorithm ensures that the dirty rate of the VM always stays below the network throughput divided by some value x, by putting microstuns in between. By default x is equal to 2, which means the dirty rate limit is always kept at half the network throughput. Assume we have one VM with 16 GB of memory and available network bandwidth of 1 GBps, and we want to live migrate that VM. In the first iteration, we have to transfer 16 GB, but as the dirty rate limit is half the network throughput, the memory dirtied during the first iteration can be at most half of the total transferred, i.e. 8 GB. In the second iteration, we have to transfer 8 GB, so the memory dirtied in the second iteration cannot be more than 4 GB, and so on: after the eighth iteration, the amount of dirty memory left will be 0.0625 GB, which with the available network bandwidth we can transfer in 125 milliseconds, less than the default blackout time of 200 milliseconds. In general, with the dirty rate capped at 1/x of the throughput, the dirty memory left shrinks by a factor of x every iteration, so the total data transferred is bounded by the VM size times x/(x-1). So we can see that if we limit the dirty rate to some factor of the network bandwidth, we can always guarantee convergence. And not only can we guarantee convergence, we can actually predict in advance the maximum time it will take to migrate the VM, given the network bandwidth and the VM size. For example, for x equal to 2, the maximum time to migrate is twice the VM size divided by the average network throughput.

So how do we ensure the dirty rate stays within the network throughput divided by x? We control the dirty rate of the VM by assigning every vCPU a limit on how much it can dirty in some small window of time. We call this limit the dirty quota. Any time a vCPU exceeds the quota assigned to it in the current window, it is made to sleep for the remaining time in that window. We call these sleeps microstuns. The limits are enforced by tracking memory dirtying at the VM-exit level.

Now, what is the ideal window size? A smaller window size makes throttling more adaptive to network changes, and the bound on how long a vCPU can be continuously stunned is also smaller. Smaller windows, and hence smaller stuns, are also better for guest performance. How? Let's take an example. In this diagram, red means a write-intensive workload is running, green is a non-write-intensive workload, and black is time stolen from the guest by stunning the vCPU. This is how the execution of a guest CPU looks. Assume the time slice of the guest scheduler is 10 milliseconds. Some workload runs for 10 milliseconds, and after that the guest OS takes a scheduling decision. Then again some workload or process runs for 10 milliseconds, and again the guest OS takes a rescheduling decision. This is just an example, assuming there are no interrupts in between other than the local timer interrupt every 10 milliseconds. As we already discussed, we control the dirty rate in small windows of time.
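Before walking through the window-size examples, here is a minimal sketch of the VM-exit-level enforcement just described, assuming the hypervisor gets an exit the first time each page is dirtied in a window; the structure and helper names are illustrative assumptions, not the actual patches.

```c
/*
 * Illustrative sketch of dirty-quota enforcement at the VM-exit level.
 * Assumes a KVM-style exit (e.g. a write-protection fault) on the first
 * write to each page in a window. Names are ours, not the real patches.
 */
#include <stdint.h>
#include <time.h>

struct vcpu_dq {
    uint64_t dirty_quota;         /* pages this vCPU may dirty this window */
    uint64_t pages_dirtied;       /* pages dirtied so far this window */
    struct timespec window_end;   /* absolute end of the current window */
};

/* Called on every dirtying VM exit taken by this vCPU. */
void on_dirty_page_exit(struct vcpu_dq *v)
{
    v->pages_dirtied++;
    if (v->pages_dirtied >= v->dirty_quota) {
        /*
         * Quota exhausted: microstun the vCPU until the window ends.
         * (In the full algorithm, the common-quota check described
         * shortly would be tried here before stunning.)
         */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                        &v->window_end, NULL);
    }
}
```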
If the quota assigned for a window gets exhausted, we stun the vCPU for the remaining time in that window. Now, this is how a large window size compares to a small window size. In the first example, the window size is 100 milliseconds. Assume a very write-intensive workload is running and the full quota for that window is exhausted within the first few milliseconds. In that case, the vCPU will be completely blocked, or stunned, for the remaining time in that window. Even though the guest could have run a read workload, it is not given the opportunity. This way we are not only throttling write-heavy processes, we are actually throttling the complete guest OS by not giving it the opportunity to reschedule for multiple time slices.

Now let's look at the second example. Here we take a smaller window size and a smaller dirty quota. Let's assume the window size is one millisecond. Now even if the vCPU is dirtying at a very high rate, it cannot be stunned for more than one millisecond, and every millisecond it gets some opportunity to dirty. This way the vCPU is never completely blocked for a long time, and the guest can still take a scheduling decision every 10 milliseconds and bring a non-write-intensive process onto that CPU. From the guest's perspective, it looks like only the write-intensive processes have become slower, while the rest of the guest OS works as normal. But as we take smaller and smaller window sizes, the accuracy of sleeping and rescheduling the vCPU becomes lower and lower, so we may have to go with busy-waits instead of normal sleeps, and that has its own disadvantages. That's why we have to choose the window size very carefully. We are currently experimenting with different window sizes; we do not have enough data for very small window sizes yet, so the results we will be presenting today use a relatively large window size of 100 milliseconds.

As the dirty quota is assigned to each vCPU independently, vCPUs are throttled selectively based on their individual dirty rates, and only write-heavy vCPUs are throttled. The total dirty quota allowed for any window is the dirty rate limit multiplied by the window time. Now, how do we divide this quota among the vCPUs? In every window, we give each vCPU a quota equal to the total quota for that window divided by the number of vCPUs. This guarantees every vCPU at least that much quota in every window, to ensure fairness. But there is one issue. Suppose we simply divide the total allowed dirty quota equally among the vCPUs, and a VM has 10 vCPUs but only 5 are actively dirtying memory. In that case, the quota allotted to the other 5 vCPUs goes unused: even though the VM had scope to dirty more, it could not, due to this non-optimal distribution of the total allowed quota. We fix this by keeping a common quota. Every vCPU's unused quota is added to the common quota at the end of every window. The common quota can be consumed by any vCPU on a first-come, first-served basis, but a vCPU can consume from the common quota only after it has consumed its individual quota and still needs more quota in the same window. The individual quota is renewed every window as per the equation on the last slide; the bookkeeping is sketched below.
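A minimal sketch of this per-window bookkeeping, under the same caveats as before (illustrative names and data layout; the real enforcement lives in the hypervisor's exit path):

```c
/*
 * Illustrative sketch of per-window dirty-quota distribution with a
 * common pool. Names and layout are assumptions, not the actual patches.
 */
#include <stdint.h>

struct vcpu_quota {
    uint64_t quota;      /* pages this vCPU may dirty in the current window */
    uint64_t dirtied;    /* pages it has dirtied so far in this window */
};

struct vm_quota {
    struct vcpu_quota *vcpus;
    int nr_vcpus;
    uint64_t common;     /* pooled leftover quota, first come first served */
};

/* At every window boundary: pool leftovers, then renew individual quotas. */
void new_window(struct vm_quota *vm, uint64_t rate_limit_pages_per_sec,
                double window_sec)
{
    uint64_t total = (uint64_t)(rate_limit_pages_per_sec * window_sec);
    uint64_t share = total / vm->nr_vcpus;

    for (int i = 0; i < vm->nr_vcpus; i++) {
        struct vcpu_quota *v = &vm->vcpus[i];
        if (v->dirtied < v->quota) {
            vm->common += v->quota - v->dirtied; /* unused quota lapses into pool */
        }
        v->quota = share;    /* individual quota renewed every window */
        v->dirtied = 0;
    }
}

/* Charge one dirtied page; returns 0 if the vCPU must be microstunned. */
int charge_page(struct vm_quota *vm, int cpu)
{
    struct vcpu_quota *v = &vm->vcpus[cpu];

    if (v->dirtied < v->quota) {   /* individual quota is consumed first */
        v->dirtied++;
        return 1;
    }
    if (vm->common > 0) {          /* then the common pool, if anything is left */
        vm->common--;
        return 1;
    }
    return 0;                      /* out of quota: stun until the next window */
}
```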
Now Shivam will talk about the results we obtained and about future work.

Hi, everybody. Before going into the results, let us try to understand the notions of dirty quota and common quota and see how they are used to selectively throttle the vCPUs. This chart will help us understand how an individual vCPU is microstunned. On the x-axis we have time in milliseconds, and on the y-axis we have the number of pages. The red line represents the number of pages the vCPU has dirtied up to a given point in time, and the blue line represents the cumulative value of dirty quota. So whenever you see a bump in the blue line, it means the vCPU has received some dirty quota. t = 0 at the start represents the start of the migration. As you can see, at the start of the migration the vCPU has some initial quota, which is this value, and with this initial quota the vCPU starts dirtying memory, until this point in time when it has exhausted its initial quota. If I had to give an analogy, I would say that dirty quota acts as fuel: for the vCPU to keep dirtying memory, it requires more fuel, that is, more dirty quota. What our algorithm does here is check whether the vCPU has already received its quota for the current time window or not. In this case it has not, so we give the vCPU its share for the current time window, which is this value. With this share, the vCPU dirties memory, this time until this point in time, when it has again exhausted its dirty quota. This time we see that the vCPU has already received its quota for the current time window, so the next thing to check is whether common quota is available or not. Let's say the common quota is not available at this point, which means there is no scope left to dirty. So the only option is to stun the vCPU until the next time window starts, which happens at t = 200 milliseconds, because once t reaches 200 milliseconds, the vCPU can claim its quota for the new time window, the 200 to 300 millisecond window.

Okay, before going any further, let us see how the dirty quota of a vCPU is calculated for a given time window. We already talked about this formula in the previous slides; let us take an example to understand how it works. Say we have a network throughput of 220 megabytes per second and 8 vCPUs. With this network throughput, the dirty rate limit comes out to 110 megabytes per second if we use a factor of 2 in the formula. And with this dirty rate limit and this number of vCPUs, the dirty quota comes out to 352 pages per vCPU. This is how we calculate the share of each vCPU for any given time window.
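Spelling out that arithmetic (assuming the 100 ms window used in these results and 4 KB pages):

```latex
\text{dirty rate limit} = \frac{220\ \mathrm{MB/s}}{2} = 110\ \mathrm{MB/s},
\qquad
\text{quota per window} = 110\ \mathrm{MB/s} \times 0.1\ \mathrm{s} = 11\ \mathrm{MB} \\[4pt]
\text{quota per vCPU} = \frac{11\ \mathrm{MB}}{8} = 1.375\ \mathrm{MB}
= \frac{1.375 \times 1024\ \mathrm{KB}}{4\ \mathrm{KB/page}} = 352\ \text{pages}
```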
Note that this share is only valid for the given time window: if the vCPU does not claim it within that window, the share lapses and gets added to the common pool, i.e. the common quota. An example of that scenario can be seen in the 400 to 500 millisecond time window, where the vCPU was not able to claim its dirty quota, and so the dirty quota gets added to the common quota at the end of the window. Another important point is that the network throughput information is updated after every time window. So in this case, after every 100 milliseconds we have an updated network throughput, and since we use the updated network throughput for all our calculations, any change in network throughput is reflected in our throttling levels.

Okay, let's get back to the point where the vCPU has received its quota for the current time window and has started dirtying memory. Say that this time the vCPU was able to dirty memory only until this point in time, where it has again exhausted its dirty quota. We see that the vCPU has already received its dirty quota for the current time window, so the next thing to check is whether common quota is available or not. Let's say that this time the common quota is available. We will try to take 352 pages from the common quota, but if that much is not available, we give the vCPU whatever is available in the common quota. And with this share obtained from the common quota, the vCPU can dirty memory again. So this is how we use dirty quota and common quota to microstun the vCPUs.

Next, we'll talk about the results we have obtained with microstunning and compare it with the current migration scheme in QEMU. First, convergence and total migration time. In the next three slides, I'll walk you through six different types of workloads which we have come up with in our experiments; as we move from workload type one to workload type six, we get more and more difficult-to-converge cases. Each of these workload types has two to three sample cases, and each sample case is represented by four values: N, M, L, and B. Basically, each sample case means N threads repeatedly writing M gigabytes of memory at a rate of L megabytes per second, with a network bandwidth of B megabytes per second. A sketch of one such dirtying thread is shown below.
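Here is a minimal sketch of what one such rate-controlled dirtying thread could look like; the talk does not show its load generator, so the pacing scheme and all names here are purely our assumptions.

```c
/*
 * Illustrative sketch of one synthetic dirtying thread in the spirit of
 * the N/M/L/B cases: loop over a buffer of mem_bytes (M), touching pages
 * at roughly rate_bytes_per_sec (L). Purely an assumed load generator.
 */
#include <stdlib.h>
#include <time.h>

#define PAGE_SIZE 4096
#define TICK_NS   (10 * 1000 * 1000)    /* pace writes in 10 ms ticks */

void dirty_at_rate(size_t mem_bytes, size_t rate_bytes_per_sec)
{
    char *buf = malloc(mem_bytes);
    size_t pages_per_tick = rate_bytes_per_sec / PAGE_SIZE / 100;
    size_t off = 0;

    if (!buf) {
        return;
    }
    for (;;) {
        for (size_t i = 0; i < pages_per_tick; i++) {
            buf[off]++;                  /* one write dirties a whole page */
            off = (off + PAGE_SIZE) % mem_bytes;
        }
        struct timespec ts = { 0, TICK_NS };
        nanosleep(&ts, NULL);            /* sleep out the rest of the tick */
    }
}
```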
For type one and type two, we see that the migrations converge with both schemes, but with microstunning the total migration time decreased significantly, by around 50% in most of the cases. For type three and type four we see a similar trend, with one exception where the difference is not significant because the dirty rate is very low, so the throttling logic does not play a crucial role. For type five and type six, the difference gets more pronounced, because in these cases the migrations did not converge with the current scheme at all. With microstunning, the migrations did not just converge, they converged within reasonable amounts of time: for type five, the migrations converged within around six minutes, and for type six, in around half an hour, which is great. Note that, if not aborted, the current scheme would try to converge the migrations in these cases by throttling the VM to 99%, which means that the VM would be running 99% slower for a very long time, which is not good guest performance.

Next, we'll talk about write throughput and read throughput on the guest during migration, for this particular case. Starting with write throughput: in this chart, the blue line represents the write throughput with the current migration scheme in QEMU, and the red line represents the one with microstunning. As you can see, the current scheme in QEMU struggles to find the optimal throttle, taking a number of iterations, while microstunning finds the optimal throttle right away and throttles the VM optimally from the very beginning of the migration, and so it was able to converge the migration much sooner. As for read throughput, microstunning does not affect the read throughput at all, while with the current scheme the read throughput follows the write throughput trend: it decreases significantly and finally reaches about 1% of its initial value. This happens because the current migration scheme in QEMU does not distinguish between the read and write processes running on the guest, and so it throttles them indiscriminately.

In summary, microstunning has a lot to offer. It significantly improves guest performance during migration; it helps migrations converge in cases where the current scheme cannot converge even at the cost of guest performance; it helps migrations converge in less time; and, last but not least, it is very adaptive to network bandwidth and workload changes. Say a very heavy write-intensive workload starts on the guest in the middle of a migration: microstunning finds the right throttle right away and throttles the VM optimally, if throttling is required, while the current scheme only takes a decision after the current migration iteration is finished.

Talking about the future, we are working on experiments to find the right migration failure conditions, so that we can converge the migration in as many cases as possible, without terribly affecting the guest performance, of course. On the implementation side, we will be sending out patches to the open source community for review soon. We are also running a lot of experiments to find the right window size to microstun the VM optimally. The idea is to stun the VM in such microscopic time windows that only the write-intensive processes running on the guest feel that CPU time is being stolen from them, while the read-intensive processes, and the other threads of the guest, are least affected. We are also planning to run experiments on some real workloads, and we are very confident that we will see even better performance with real workloads, because the workloads we used in our experiments did nothing but dirty memory. And lastly, we would like to extend this dirty-quota, or microstunning, idea to provide predictability for live migrations, so that one can predict beforehand whether a migration is going to converge and, if so, how much time it is going to take.

That brings us to the end of this presentation. Thanks a lot for your attention, and do reach out to us if you have any suggestions or feedback. These are our contacts.