Hello everyone, my name is Michał Jastrzębski, this is Michał Dulko and Paweł Koniszewski, and we're going to talk to you about live migration. Let's start with a question: why even bother with it? What can live migration do to improve your cloud? We're going to talk about three use cases: imminent host failure, maintenance mode, and optimal resource placement.

First, let's start with imminent host failure. Imminent, in this context, means that we still have control over our host, but we presume it may go down at any moment, and we want to play it safe and migrate out of it. That may mean, for example, that the temperature is spiking, which may point to cooling issues. A disk may start to show problems, and we don't want to lose the data. One of the network cards may break, but we still have control over our management network, so we want to migrate away before the network problem gets worse. Or your data center might get struck by a flood, which actually happened to one corporation in Poland; in that case, you may migrate your virtual machines above the water line.

The most common use case for live migration is maintenance mode. It's a scheduled maintenance in which we intentionally want to take down a host: for example, to upgrade firmware or the BIOS, do a hardware upgrade or even replace the whole node, or maybe upgrade the kernel. We simply move all the workloads off the host, and then we can proceed with whatever we want to do with it.

Another interesting thing to do with live migration is to place your resources more optimally than before. For example, we can reduce costs by moving virtual machines closer to each other, or closer to their storage, to cut the latency between the two and gain a bit of performance at very little cost. Or you can stack your virtual machines on a single host or a few hosts, free up the remaining hosts, turn them off, save power, save money. On the other hand, you can increase the resiliency of your cluster; noisy neighbor separation is one example. A noisy neighbor is a virtual machine that consumes resources you can't partition, for example CPU cache: you can't put a quota on CPU cache, so if one virtual machine consumes all of it, it may hurt the performance of the other virtual machines on the same node, and you might want to separate this problematic virtual machine so the others on that host won't be affected by it. Or you can spread your workload across as many hosts as possible, so a failure of a single node causes less damage than it would before. So let's talk about how it actually works; over to Michał.

So Michał told you what live migration can solve, and let's talk about how it actually works. There are a few assumptions. Live migration is, of course, live: the VM keeps running throughout the whole process. It is consistent: the state of the VM on the source host is the same as on the destination host when the switchover moment comes. It is transparent: the VM doesn't know, and doesn't need to know, that it is being live migrated. And we target minimal service disruption, so we need to keep the downtime of the VM as low as possible.

There are various types of migration. First of all, there's non-live migration, also called cold migration.
This is simply shutting down the VM on the source host and booting it up on the destination host; this is something we don't want to cover in this presentation. We will be talking mostly about true live migration, which can be based on shared storage or volumes: you either need shared storage for your VM's disks, for example Ceph or even NFS, or you need to boot your VMs from volumes. There's also block live migration, an option that doesn't require shared storage, because your ephemeral disk is transferred over the network. The problem is that the VM is suspended for the duration of the disk transfer, so the downtime is quite high, but it is still a kind of live migration. And there are some compatibility constraints. What's most important in this table is the bottom right: if you have any read-only devices attached to your VMs, for example CD-ROMs, you can only live migrate such VMs if their disks are on shared storage. So that's the most flexible configuration; you probably want to go with it. Let me hand over to Paweł, who will talk about how it works under the hood.

So, let's talk about the live migration process. It basically consists of five stages. The first stage, pre-migration, belongs to OpenStack; the second stage, reservation, belongs to both OpenStack and the hypervisor; and every later step belongs to the hypervisor. Let me briefly walk you through each step. At the very beginning, we have two compute nodes, A and B, and a virtual machine running on compute node A. In the first stage, we need to choose which host our VM will be live migrated to; we can do it explicitly, or let the scheduler do it for us. When we know the destination host, OpenStack and the hypervisor need to reserve the needed resources on it. This is to be sure that, throughout the whole process, the new compute node will be capable of hosting the VM. When resources are claimed, the hypervisor starts to iteratively pre-copy the VM state to the destination host. In the very first iteration, the whole memory is transferred; in every subsequent iteration, dirty pages are copied. Notice that the VM continues to run on the source host, so it may still write to memory. If it does, the corresponding memory on compute node B becomes outdated and we need to retransfer those pages; this is what we call dirty pages. When the VM state is nearly the same on both hosts, the hypervisor decides to pause the VM on compute node A and transfers the remaining state to compute node B. And when both states are actually the same, it starts the VM on compute node B and removes the VM from compute node A. What is worth mentioning here is that, in case of any failure, everything can be rolled back, so the VM will continue to run on compute node A without any disruption.

Now let's talk about performance and reliability. We don't really live in a perfect world, and there are quite a few pitfalls we want to shed some light on; we'll talk about each of them later on. Let me walk through the list. Currently, OpenStack doesn't allow you to perform any operation on a virtual machine during live migration, not even cancelling the live migration. Pretty much the only thing you can do with it is destroy the virtual machine, which you probably don't want to do. And that's one of the problems with OpenStack, especially since, as Paweł mentioned, we transfer the state, and when the state changes, we have to retransmit the changes over the network as well.
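To make that concrete, here is a rough back-of-envelope sketch; the numbers are illustrative assumptions of ours, not figures from the talk:

```
# Convergence check for pre-copy live migration (hypothetical numbers):
#
#   guest dirty-page rate:   ~2.0  GB/s   (memory-write-heavy workload)
#   migration link rate:     ~1.25 GB/s   (one 10 Gbit/s NIC)
#
# Each iteration transfers at 1.25 GB/s while the guest dirties memory at
# 2.0 GB/s, so "data remaining" grows instead of shrinking: the migration
# never converges unless the guest is throttled, the tolerable downtime is
# raised, or more bandwidth is provided.
```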
So if the state changes too fast, we may end up not being able to migrate at all, because our network simply may not suffice. We may end up with a live migration we can't pause, we can't cancel, and that will never end by itself. On top of that, live migration can generate a very heavy load on the network, and the worst part is that OpenStack, by default, uses the management network to perform live migration. So if, for example, we end up with a virtual machine that eats up all of our bandwidth, it might disrupt other services, like RabbitMQ, and that's a critical thing. There are also some issues between compute nodes with different CPUs, because to migrate the state of a virtual machine, the instruction sets have to be compatible with each other; there are a few ways to achieve compatibility, and we'll talk about them later. And there's a problem with the scheduler and memory oversubscription; we'll get to that later as well.

Okay, let's get into details. First of all, as Michał said, you cannot schedule any operations on an ongoing live migration through OpenStack, but you can still use virsh instead. If you want to know what's going on with your live migration, you can use the virsh domjobinfo command on the source compute node. Here are a few pieces of information it produces. One of the most important is "time elapsed": this one is around 30 minutes, so it has been going on for quite a long time. The second important one is "data remaining", which tells you how much data remains to be transferred. If that value doesn't change over time, your migration is probably stuck.

To resolve that, you may try these two commands. First, you can cancel the live migration, so the VM on the destination host is destroyed and the VM on the source host continues to run; that's the virsh domjobabort command. You can also force the live migration to finish; the trade-off is downtime, of course. You suspend the VM on the source host, so it stops writing to memory and the remaining state can be transferred over the network. We increase the downtime, but the migration will finish.

Those were kind of last-resort solutions; there are also some mitigations. First of all, we can tune the maximum downtime, that is, the maximum tolerable downtime a VM may suffer: the higher the value, the earlier QEMU can decide to actually stop the VM on the source host and transfer the remaining state. There are two commands for this: a QEMU monitor command, migrate_set_downtime, and a virsh command, migrate-setmaxdowntime. The difference is that the QEMU one can be used only on a live migration that isn't in progress yet, and the virsh one only on a live migration that is in progress. Also, the QEMU one accepts values lower than one, while the virsh one accepts only values greater than or equal to one.

Another idea to mitigate the never-ending live migration problem is the auto-converge feature of QEMU. You can enable it by adding the VIR_MIGRATE_AUTO_CONVERGE flag to live_migration_flag in nova.conf. It works like this: when QEMU notices that the live migration isn't progressing, it throttles the guest so that memory writes drop to 25% of the initial performance. If the rate of memory writes is lower, your VM is more likely to migrate.
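As a quick cheat sheet for what was just described; this is a sketch assuming Kilo-era option names, a libvirt/QEMU deployment, and a hypothetical domain name of `instance-00000001`, so adjust to your environment:

```
# Watch an ongoing migration on the source compute node
# ("Data remaining" stuck across calls usually means a stuck migration):
virsh domjobinfo instance-00000001

# Last-resort actions:
virsh domjobabort instance-00000001   # cancel: destination copy destroyed, source keeps running
virsh suspend instance-00000001       # force completion: pause the guest so it stops dirtying pages

# Raise the tolerable downtime of an in-progress migration (milliseconds):
virsh migrate-setmaxdowntime instance-00000001 500

# Enable auto-converge by extending live_migration_flag in nova.conf ([libvirt] section):
#   live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,
#                         VIR_MIGRATE_LIVE,VIR_MIGRATE_TUNNELLED,VIR_MIGRATE_AUTO_CONVERGE
```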
The problem with auto-converge is that it doesn't keep degrading performance further, so if throttling down to 25% isn't enough, your live migration will still hang.

Another thing we noticed in our tests is that the VIR_MIGRATE_TUNNELLED flag is on by default. It makes the data travel from the hypervisor to the libvirt process, then over the network to the libvirt process on the destination host, and then back to the hypervisor on the destination host. When we disabled this flag, the hypervisors talked to each other directly, and we found that this increased live migration performance four times. This probably depends on hardware and network capabilities, but you can still try it to improve performance. The problem is that you lose encryption when you make your hypervisors talk to each other directly.

You may also tune the bandwidth used by a single live migration process. There are two ways to do that: a virsh command, migrate-setspeed, to set it for one VM, and the global nova.conf setting live_migration_bandwidth, which sets it globally for a compute node. The default value of this option amounts to something like 7,000 petabytes per second, so that's effectively infinity.

Another idea to lower the bandwidth used by live migration is this algorithm, XBZRLE: XOR-based zero run-length encoding compression. Whoa, I said it. There's a nova.conf setting for this as well: you need to add the VIR_MIGRATE_COMPRESSED flag. The name is quite hard to pronounce, but the idea is very simple: you transfer only the deltas, the diffs of the pages. The destination host gets the diff, or patch, applies it to its version of the page, and ends up with the updated page. This should lower the bandwidth used.

If you rely heavily on live migration, you may want to put it on a dedicated network. Normally all the traffic goes through the management network, as Michał said, and there's an idea for a workaround. If you want a dedicated live migration network, you can add a suffix to the live_migration_uri option. This option is used by libvirt to decide how to connect to the other compute node, and its %s placeholder is replaced with the hostname. So if you add a suffix there and set up your DNS to resolve hostnames with that suffix to IPs in your dedicated network, your live migration traffic will go over the dedicated network. That should work. I'll hand over to Paweł, who will talk a little more about other issues.

Okay, so Michał told you that the CPU instruction set of the source host needs to be a subset of the instruction set of the destination host. Consider a situation where a VM is running on compute node A and you want to live migrate it to compute node B. It will pass, because compute node B supports both instruction sets, MMX and AVX. However, if we reverse the situation and want to live migrate the VM from compute node B to compute node A, it will fail, because compute node A will probably not understand the CPU state of the VM, as it may use instructions from, say, SSE2. To mitigate this problem, you can explicitly set the VM CPU model in nova.conf. We do not encourage you to use this across your whole cluster, because you really want to let VMs use the newest possible instruction sets for performance. However, where your environment is heterogeneous, you can, for example, create a host aggregate and explicitly set the CPU model to a specific one there.
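Here is a minimal, illustrative nova.conf excerpt pulling together the knobs mentioned above; Kilo-era option names in the [libvirt] section, with the values and the "-lm" DNS suffix being assumptions of ours, not recommendations:

```
[libvirt]
# Drop VIR_MIGRATE_TUNNELLED so hypervisors talk directly (faster, but unencrypted)
# and add VIR_MIGRATE_COMPRESSED to enable XBZRLE delta compression:
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_COMPRESSED

# Cap the bandwidth of a single migration in MiB/s (effectively unlimited by default):
live_migration_bandwidth = 1024

# Dedicated-network trick: %s becomes the destination hostname, so a DNS suffix
# (the hypothetical "-lm" here) can steer migration traffic onto another network:
live_migration_uri = qemu+tcp://%s-lm/system

# Pin the guest CPU model inside a heterogeneous host aggregate:
cpu_mode = custom
cpu_model = SandyBridge
```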
The list of supported model names can be found in libvirt's cpu_map.xml; it also shows which instructions belong to which CPU model.

There's also a problem with memory oversubscription, but before I start talking about it: how many of you use a ram_allocation_ratio different from 1.0 in a production environment? Okay, so he uses it. The problem is that ram_allocation_ratio is something that belongs to the scheduler. If a compute node reports two gigabytes of memory to nova-conductor and nova-scheduler, nova-scheduler takes this value and multiplies it by ram_allocation_ratio, so it ends up seeing more memory than the compute node reported. To mitigate this, you can set reserved_host_memory_mb to a negative value. This option basically says how much memory nova-compute may keep for the host itself, and it is subtracted from the available memory; so if you set it to a negative value, the compute node will report more memory than it actually has. That way nova-conductor and nova-scheduler understand what the memory oversubscription is; however, nova-compute no longer knows how much memory is really available.

Now let's talk about secure live migration. Yeah, security matters. Everything can be sniffed, especially in light of the recent zero-day affecting virtual machines; you can't rely on your hardware being safe. And consider that live migration transfers the whole memory over the network. That means keys, users, maybe passwords, pretty much anything that is in memory goes over the network, so if you have a man in the middle, you may lose some valuable data. Your machines may contain sensitive data, or there may be legal issues with unencrypted data transfers, for example when you have PCI compliance requirements and you can't allow that data transfer to be unencrypted.

There are three ways to encrypt live migration. First, when you turn off tunnelled mode, the hypervisors talk to each other directly, not via libvirt; if your hypervisor supports an encrypted transfer natively, you can use that, but QEMU doesn't, so we won't dig into it. Second, to get encrypted transfer with QEMU, you may want to use tunnelled mode: libvirt itself supports encryption. You just need to change the protocol libvirt uses to transfer its data to SSH or TLS, and turn on the VIR_MIGRATE_TUNNELLED flag. The problem is that you then use only one core per migration, so that's a performance issue. Third, a dedicated live migration network can be encrypted too: you can use some L2-layer encryption that is completely transparent to the hypervisor.

Let me dig a little deeper into tunnelled transport, because it's the most flexible. We set up a small test to compare the transfer rates of TCP and SSH tunnelled transmission. What's also problematic about tunnelling is that it relies heavily on memory access and on transferring memory between processes, so that can limit your transfer rate. We made a comparison on two processors, a Xeon E5 v2 and a Xeon E5 v3. The v3 has DDR4 and a few more features that accelerate memory access, and as you can see, the performance is about 20% better with the v3, even though, if you look closely, it has a slightly lower clock rate: 2.4 GHz versus 3 GHz on the older processor. And it still achieved an about 20% better rate, so that's something to consider.
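Here is a sketch of the tunnelled, encrypted setup described above, assuming the libvirt SSH transport and Kilo-era option names; qemu+tls:// with proper certificates would be the TLS variant:

```
[libvirt]
# Keep VIR_MIGRATE_TUNNELLED so migration data flows through libvirt...
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_TUNNELLED

# ...and switch the libvirt transport from plain TCP to SSH so the tunnel
# is encrypted (at the cost of the single-core bottleneck mentioned above):
live_migration_uri = qemu+ssh://%s/system
```

The L2-encryption alternative needs nothing in nova.conf at all, which is exactly what makes it transparent to the hypervisor.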
So let's talk about the future of live migration: what we're going to do and what's going to happen in the following months. We told you what live migration looks like, how it actually works, and how you can interact with the process. Now let's talk a bit about what's in its future.

The first thing is multithreaded compression. Michał told you about the compression based on run-length encoding; this is a bit different, because it compresses every page sent during live migration instead of only the pages that are resent. zlib is used for the compression, and the number of threads and the compression level are configurable. The point of this solution is to reduce the amount of data transferred over the network, so live migration will probably finish faster, and it will also help mitigate the never-ending live migration problem.

The next thing is post-copy live migration. Everything we told you about before is called pre-copy live migration: if you remember the iterative pre-copy stage of the process, the VM continues to run on compute node A. This is a bit different, because the workload is immediately moved to the destination host, so the VM continues to run on compute node B. It's something of a trade-off: it completely removes the problem of dirty pages, but in case of any failure the VM state is lost and the VM needs to be restarted, and there is also a heavy impact on performance, because if the VM asks for memory that hasn't been transferred yet, that memory has to be fetched over the network, so memory access will be much slower.

The next thing is active live migration monitoring. Right now the user needs to track every problem manually: to tell whether a live migration will ever end, the user has to check the progress with the domjobinfo command. The monitor will help with this, because it will watch the memory transfer to the destination host and, if it detects any problems, in the first implementation it will abort the live migration, and maybe in the near future it will instead continue the live migration with the VM paused.

The last thing is actions on an ongoing live migration. Michał told you that there is no way to interact with the live migration process through OpenStack. We want to make every command he described usable through OpenStack, so the user will be able to pause the VM during live migration, abort the live migration, check the progress of the memory transfer, and also change the configuration on the fly.

As you can see, there's a lot going on in OpenStack as a whole, not only around live migration. We want to encourage all users and operators to give feedback: any opinion, any bug reports, any ideas for new features are welcome. You can use the mailing lists for that; both will happily take proposals for new features. Apart from that, we are members of the OpenStack Win the Enterprise working group, so you can catch us on IRC and ask your questions if you want, and we can try to drive any features you need in your OpenStack cloud. We needed to show that disclaimer, so, okay. Let's go to Q&A. Any questions? Yes. Could you please use the microphone at the back? Here it is.

Since post-copy can be awfully devastating to performance, have you tried something like doing a pre-copy first, to get a first pass and most of the memory over, and then initiating a post-copy? No, we haven't tried it.
We're basing on QEMU, and post-copy is actually not yet implemented in QEMU, but that sounds like a good solution.

Nice presentation, by the way, thanks. My question is: do you know anything about features like migrating a VM that is in the suspended state? Currently it's not possible, OpenStack doesn't allow it. Since Kilo, it is possible to live migrate VMs that are paused. But suspended? It's the same, actually; it's a matter of libvirt translation: what OpenStack calls paused, libvirt calls suspended. Thanks.

How do you verify that a migration was successful? Do you do some check, and how long does the check take? Well, the problem is that nothing observes what workload runs on a VM, so the only way, as far as I know, to verify whether the process will eventually complete is to check it through the domjobinfo command. Okay, thank you.

Have you done any work, or do you know of any work being done, on migration for VMs using SR-IOV? We haven't tried it, but there is work being done in OpenStack to support it. I don't remember the name, but there is a person working on SR-IOV support in live migration. Are you saying there's a blueprint for that? Yeah.

I was wondering, does it make sense to you for live migration to be made its own project, like Nova scheduler, sort of a "Nova Migrate", so it has a daemon with plugins? So, if I understand the question correctly: getting live migration out of nova-compute into another project? Yeah; well, how is Nova currently implementing migration? It's all inside the Nova API, which manages it, but Nova scheduler is moving into its own project, so does it make sense to have a separate project here? It's so tightly tied to the hypervisor, and nova-compute mostly talks to the hypervisor, so I don't even see the point of getting it out of Nova; I wouldn't expect that. I mean, OpenStack just initiates the live migration: when it's initiated, OpenStack actually only selects the host, and from then on it's all on the hypervisor. And it's only supported for KVM, right? Yeah; or libvirt, basically. libvirt, QEMU, yeah. Thanks.

You mentioned that it's possible to live migrate between hypervisors directly, bypassing libvirt. Yeah. What is the performance improvement if you do it like that, and how can you actually set it up if you don't want encryption? We can explain how we tested it. We measured the network bandwidth used by a single process. In the case of tunnelled migration, on a 10-gigabit network, only about 14% of the bandwidth was used, and we suspect the problem is the memory transfer between the processes. When we disabled the tunnelling flag in nova.conf, the improvement we saw was that actually 70% of the bandwidth was used, which means the live migration progresses much faster. So basically, if you don't necessarily care about encryption, you can go hypervisor-to-hypervisor by turning off the flag in nova.conf, and it gets much better. There is one other thing we didn't mention: the URI trick won't work then. I mean, the hacky thing we did with the suffix for the dedicated live migration network requires tunnelling to be on. But if you don't care about that, then the tunnelled option is just degrading performance. And it's only on by default, so that's fine.
But this is tightly coupled with the hardware you use, so the performance increase may be completely different on your hardware. Thank you.

Not sure if somebody asked this before, but did you benchmark the memory consumption, sorry, the bandwidth consumption of pre-copy versus post-copy? No, we haven't. As I said before, we're basically working with QEMU, and post-copy is not yet implemented in QEMU, so I don't have any information about the difference in bandwidth used by post-copy and pre-copy. Okay, thank you.

Before migrating, is there a way to check resource compatibility, meaning how many virtual CPUs and what amount of memory the target node has? Because if it's incompatible, then maybe it's not worth it. Okay; if there is such a check, I don't think we know about it. So the CPU compatibility thing isn't checked; that's the problem with the scheduler, that it doesn't check it in the case of live migration, and this has been a problem for a few releases now, and we are trying to solve it somehow. So when you're doing the migration, you're assuming that it's compatible? Yes, OpenStack assumes that. And then the libvirts will talk to each other and, boom, we have a problem. Yeah, the validation is at the nova-compute level, not at the scheduler level, so it may cause rescheduling. Also, libvirt doesn't really check what instructions your virtual machine actually uses, so the workload may not even touch the problematic instructions, but libvirt assumes it uses all of them, so it may still break, right? Hence the feature-set limitation: you can limit the feature set to the features the VM is actually using and thereby enable the migration, but again, that's a trade-off, right? Sure, okay, thank you, yeah.

Hi there. What is required to clean up when something goes wrong? During live migration? Yeah, during the migration itself. Nothing, it just works. I mean, the only moment that is actually vulnerable to this is the commit phase, right? Up until then, it just copies, copies, copies, and if it stops copying, the source virtual machine will still be working on the same host as it did before. I think the answer is: libvirt takes care of that. libvirt takes care of the cleanup; if anything goes wrong, it reverts the process and your VM will still run on the source host. Yeah, great, okay, thanks.

Hi, should we expect problems when we want to live migrate between hypervisors that run different versions of libvirt? Well, it should be backward compatible, but there are some versions of QEMU that do not allow live migrating to an older version. Also, I don't know if it was removed from Nova, but I think there is a check that the destination hypervisor is at the same or a higher version, so it probably won't be possible to live migrate to a lower hypervisor version. Okay, thank you.

Another question: the improvements you mentioned in your last slides, are they foreseen for Liberty? Is that what you're aiming for? I expect at least one to land in Liberty, the active live migration monitoring. For every other feature, we're still discussing when it will be possible to make it happen in OpenStack. Thank you, great talk. Thank you. Are there maybe any more questions? That's kind of funny.
Yeah, so the problem is that we actually noticed we need to exchange SSH keys between the root user on the source host and the nova user on the destination host. So from the source host, the root user tries to SSH as the nova user on the destination host. Well, it's done that way. Any more questions? Yes, thank you. Thank you.