Good afternoon. Welcome, everyone. My name is John Garbutt, and today I'm here to talk to you about live migration. Hopefully you're here to hear about live migration.

Before I start talking about live migration and all that kind of fun stuff, I wanted to quickly introduce myself for those of you who don't know me. My Twitter handle is johnthetubaguy and my IRC handle is johnthetubaguy. Just to clear things up, I do actually play the tuba; it's not just a name randomly picked out of the air. One day we'll have a summit where I can actually bring a tuba, but we'll see.

I wanted to quickly go through my OpenStack journey, just to give you context on where I'm coming from when talking about live migration. I started working on OpenStack in December 2010. At the time, I was packaging OpenStack with Citrix XenServer and making that all work together, so doing the packaging and working with Puppet and various things to get that going. Soon after, I started to look more closely at how Citrix XenServer works with the whole OpenStack ecosystem: how do we get it working with Nova, with Cinder, with Neutron, all these different bits? That was my introduction to the upstream community. I got more involved, came to design summit sessions, and really enjoyed that. After that, I got talking with Rackspace, and it turned out they were running the code I was working on at massive scale in their data centers, which I thought would be absolutely awesome to get involved with. So I moved over to the Rackspace public cloud to work on that. Most recently, I've been stepping up into some of the upstream leadership: I became Nova PTL for the Liberty and Mitaka cycles. I guess that's not so recent anymore, but I've been Nova core, doing lots of reviews, and the latest thing was being the tech lead for the OSIC group working on Nova. As part of that, we did some interesting work looking at live migration, and that's why I'm here today.

As with any good talk, you start with a cute picture you found on Wikipedia. So why live migration, why is it important? You may know that if you're in here, but the particular use case I was interested in was at the Rackspace public cloud: at some point you actually do have to upgrade the hypervisor, and it turns out there's a whole mix of workloads on these kinds of clouds, and an awful lot of them really don't want any downtime in the VM. So the basic idea is: let's upgrade the host in a way that doesn't affect the VMs running on it. As you might expect, the general approach is that you move the VMs off one host so you get a free host, you patch the host, you reboot the host, and then you repeat the process, because you've now got somewhere to move the next host's VMs, and you keep shuffling. We actually did that to quite an extent to update bits of the cloud. That was my original introduction to live migration.

Some people are using live migration to help with the rolling upgrade of clouds: you've got your N and N+1 nodes and you can switch between them. If you want to upgrade the hypervisor at the same time as all the code, you can use live migration to do that kind of shuffle as well. There have also been many use cases around workload rebalancing. On one side, there are people who want to make sure they're minimizing their power usage.
They've got bursty workloads, so they try to power down some servers to save some money. That's a use case some people like. The other side is when there's a particularly busy, noisy neighbor and you try to get other people out of the way. So live migration: kind of interesting.

So how does live migration work? I wanted to delve into this. The picture here is just a little joke about Nova configuration options; it reminded me of that when I was trying to find a picture for how stuff works. I'm going to start with a look at what makes up a virtual machine. This is totally obvious, but I just want to make sure we're all using the same words for the same things. You've got a networking connection to the virtual machine. You've got CPU and memory on the virtual machine. And you've got disks on the virtual machine: some of the disks may be local on your hypervisor, and some of the disks are remotely connected. That's the place we're starting from with the virtual machine.

Now let's look at live migration and its different phases. The idea is that if we understand a bit more about how live migration works, it'll be way more interesting when I start talking about the results and about changing live migration. Simply speaking, if you look at the virtual machine during live migration, there are three key phases. We have a source host over here, and we have a destination host over here. The key phases for me are: the VM is running on the source host; there's a point in time where the VM is paused and not really running anywhere; and there's a point in time when the VM is running on the destination host. Those are the three key phases I'm thinking about.

So what happens when we start the live migration process in Nova? Going back to the previous diagram of the virtual machine, what we need at the end is that all of those things, the disk, the network, the CPU and memory, have moved from the source node to the destination node so that the VM can run on the destination node. That's our aim.

One of the first things we do is think about the networking. We need the VIF set up on both the source and the destination node, so that when we move, the networking can switch between the two. So there's a setup step on the Neutron side: Nova has to go talk to Neutron to get the destination set up. The other part is the disks. If you've got remote disks through Cinder or other things, you need to make sure the remote disks are set up on both sides, source and destination. So we've got that sorted, and that covers remote disks.

The second piece we need to think about is what we call block migration, where there's a local disk on the source host and we need to get that local disk to the destination host. The rough idea is that we make sure any writes on the source host get mirrored to the destination host, so effectively the disk is kind of the same on both sides, and in the background we copy the rest of the disk, be that downloading from Glance or copying across the network. We get the disk synced in the two places, with all writes going to both. So effectively we have the remote disks and the network connected, and the local disks all moved across. We're getting there. We're almost there.
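To make that sequence concrete, here's a toy sketch of the flow described so far, with the memory piece that's coming next included for completeness. Every name in it is a made-up stand-in for real Nova, Neutron, Cinder and libvirt work; it's the shape of the process, not actual Nova code.

```python
# Toy sketch of the live migration flow; every helper here is a stub.

def step(msg):
    print(msg)

def live_migrate(vm, source, dest):
    # 1. Pre-migration setup on both hosts.
    step(f"ask Neutron to set up {vm}'s VIFs on {dest}")
    step(f"connect {vm}'s remote (Cinder) disks on {dest}")

    # 2. Block migration, only needed for local disks.
    step(f"mirror new local-disk writes from {source} to {dest}")
    step(f"background-copy existing local-disk blocks to {dest}")

    # 3. Iterative memory copy, then the pause-and-switch
    #    (this is the piece expanded in the next section).
    step("copy memory in passes until little is left")
    step(f"pause {vm} on {source}; copy final memory and CPU state")

    # 4. Run on the destination, tidy up the source.
    step(f"resume {vm} on {dest}; clean up {source}")

live_migrate("vm-1", "host-a", "host-b")
```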
The next piece is where a lot of the fun happens, shall we say? Yeah, you can see seasoned operators giggling at me from the front. The next piece is the memory.

What happens is we copy the memory from the source to the destination in iterations through the whole memory, effectively. Bear with me. First, we try to copy all the memory from the source host to the destination host, so we get the first copy. Now clearly, if that VM is running, more memory is being dirtied at the source host that we then need to get over to the destination host. So next we copy just what's changed since that initial copy: the next memory iteration. And then we go back, and that naughty old VM is still dirtying the memory, so there'll be a bit left that we try to copy, right? This is the memory iteration process: we're gradually copying more and more memory across.

At some point, we need to decide to stop doing this, because we could copy memory for the rest of our lives and that's not very interesting. So we calculate how long it's taking to copy memory between the source and destination hosts, and once we know there's only a small amount of memory left to copy, we pause the VM to make sure it's not going to dirty any more memory, and copy the final piece. This is a decision point, right? We have to decide how much downtime we allow for that VM, i.e. how long we're going to pause it so it isn't dirtying memory during that final copy.

Once that's done, let's think through the destination node. We've got the CPU state we copied at the end. We've got the memory state. We've got the disks, local and remote. We've got the network connected. So we can now start the VM on the destination host and do some tidy-up. We've moved the VM from the source host to the destination host. And the key point to take away is that, because of the way we iterate through the memory as we copy it across, there's a decision point based on how much downtime we allow the VM: how long do we pause it in that phase?

At this point, I wanted to dig a bit deeper, because there are trade-offs during live migration. The most extreme form, if you want it to complete quickly, is to just pause the VM so you know it's not dirtying any memory. It will move pretty quickly, but you get quite a long downtime. That's one extreme. The other extreme is you just keep copying memory forever and never bother finishing, which is why the process needs a decision point to make it finish.

So let's talk through what you can do to minimize the VM downtime. Nova's current preference in this trade-off leans heavily towards minimizing downtime rather than the always-complete-quickly side. One of the things we actually do is ramp up the allowed downtime. There's a configuration option that sets the maximum allowed downtime for the whole migration; I think the default is about 500 milliseconds. Many operators find they want to increase that to 1,000 to get a better completion rate. But rather than allowing the full 500 from the start, Nova slowly ramps that value up.
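Here's a rough sketch of that iterate-then-pause decision. The 500 millisecond figure is the [libvirt] live_migration_downtime default from nova.conf just mentioned; the copy bandwidth, dirty rate and the ToyVM class are invented for the example, since the real numbers live inside QEMU.

```python
# Toy model of the pre-copy loop and its downtime decision point.

ALLOWED_DOWNTIME_MS = 500                  # nova.conf [libvirt] live_migration_downtime default
BANDWIDTH_BYTES_PER_MS = 10 * 1024 * 1024  # assumed copy rate for the example

class ToyVM:
    def __init__(self, memory_bytes, dirty_rate):
        self.remaining = memory_bytes  # bytes still to copy
        self.dirty_rate = dirty_rate   # fraction of copied memory re-dirtied per pass

    def copy_pass(self):
        # While one pass copies, the guest dirties some fraction again.
        # Note: with dirty_rate >= 1.0 this never converges, which is
        # exactly the "copy forever" failure mode described above.
        self.remaining = int(self.remaining * self.dirty_rate)

vm = ToyVM(memory_bytes=8 * 1024**3, dirty_rate=0.8)
iterations = 0
# Keep iterating until the leftover fits within the allowed pause.
while vm.remaining / BANDWIDTH_BYTES_PER_MS > ALLOWED_DOWNTIME_MS:
    vm.copy_pass()
    iterations += 1
# Decision point reached: pause the guest and do the final copy.
print(f"pause after {iterations} passes; "
      f"final copy takes ~{vm.remaining / BANDWIDTH_BYTES_PER_MS:.0f} ms")
```

In real Nova the allowed value isn't fixed either: it ramps up in steps over the life of the migration, which is what the [libvirt] options live_migration_downtime_steps and live_migration_downtime_delay control alongside live_migration_downtime itself (option names as in recent releases).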
So if we go back to that memory copying piece, Nova is deciding, given how much memory is left to copy, how long is that going to take, and comparing it to the allowed-downtime value, right? That's what's happening with the ramp-up: we increase that value over time so there's more chance of the migration completing with a very small pause. One of the gotchas here is that it's very normal for people to say, oh, live migrate always works, and there are a lot of complaints that in Nova you can get a timeout exception where it just says it's given up. The trade-off we're making is that we could just pause the VM and force it to move, but what we're actually saying is we don't want to disrupt the VM for more than a certain amount of time. That's the choice being made.

Now let's talk about how we make live migration complete more quickly. There are some interesting features that have been added recently to help with this. One of them is auto-converge: QEMU throttles down the guest's CPU so that there's less memory dirtying, making the move quicker, and over time it throttles more and more. That's one option. Again, it's a trade-off: you're lowering the guest's performance to allow the move to happen.

The other feature we've recently exposed is post-copy. The idea of post-copy, going back to the memory copying approach, is that at some point you can say: let's run the VM on the destination node now, but instead of waiting for all of the memory to be copied so it can be accessed locally, whenever the VM touches memory that hasn't arrived yet, we fetch it from the source host, until the background copy has slowly brought all the memory across. So rather than doing more iterations, you take a performance impact when the VM accesses memory that isn't there yet. Basically, post-copy is running on the destination host while pulling the memory from the source host.

Now, picture the VM running over here on the destination host and some of its memory over here on the source host. Clearly there's some latency, but if someone comes along with a big pair of comedy scissors in the middle and cuts the link, like they sometimes do at previous keynotes it seems, things go bad. The risk is that if you can't get to the source host, basically the only option left is to reboot the VM. So on a network outage there's a risk that you have to reboot the VM. Hopefully there aren't people with big comedy pairs of scissors walking around your DC, otherwise you've got bigger issues. Or, well, that's the QA guy: chaos monkey with scissors. Anyway, I got distracted.
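Both of those features are opt-in through nova.conf. A sketch of the relevant [libvirt] options as they appear in recent releases; whether they actually take effect also depends on your libvirt and QEMU versions:

```ini
[libvirt]
# Let QEMU progressively throttle a busy guest's CPU so the memory
# dirtying rate drops and pre-copy can converge. Trades guest
# performance for a better chance of completing.
live_migration_permit_auto_converge = true

# Allow switching to post-copy: run on the destination and pull the
# remaining memory from the source on demand. Completes faster, but
# losing the network between the hosts mid-migration means the VM
# has to be rebooted.
live_migration_permit_post_copy = true
```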
So I've been through auto-converge and post-copy as features. There are also some API things we added. The advertisement here is that down on the first floor there are lots of interactive sessions where I, as a developer, get to talk a lot and argue with lots of operators about the features they need and how best to do them. Lots of these API features have come from those fruitful conversations.

When a live migration is running long, sometimes you know that you're going to have to reboot the host regardless: if it doesn't move now, I'm going to reboot the host anyway. So you now have the option of calling an API to force the live migration to complete. What force does is pause the VM to make sure the copy finishes now. You take extra downtime on that particular VM, so that if it's taking forever your automation can choose to force it to move. In a similar way, if you decide it's taking too long but you don't want the VM to have that level of downtime, you can do the opposite: abort the live migration, go back to where you were, and decide to move some other VM instead if that's what you want. Both options are now available in the API.

Another interesting myth a lot of people repeat is that live migration can't possibly work if you have a local disk. Some people have said that to me. It turns out that with the Nova API it was very easy to get confused about this. In the early versions of the API, you had to say: is it a block migration, i.e. is there a local disk to move, or is it not a block migration because it's all remote disks? There was no need for that, because Nova knew exactly what was needed, and it just told you when you were wrong. So if you said you wanted a block migration, it would go: there are no blocks to migrate, you're wrong, bye. That was a failure. Or, probably the more common one: you say please live migrate me, and it says no I can't, you've got a local disk, when what it actually meant was please pass true into the special parameter to make this work. After discussing it with the operators, we decided it was probably bad to be like that. So there's a new API which, by default, works out what we already know and does the right thing for you. We thought that would be nice; hopefully that's good for you. It's an interesting journey, because when people go, oh, there are local disks, I just need to turn this flag on in libvirt, you go: okay, so you can just pass that in the API and it turns the flag on in libvirt. That sounds great in isolation but is a rubbish user experience. So feedback is really welcome, and you can see operator feedback evolving Nova quite directly here.
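For the curious, here's roughly what those three calls look like at the REST level, as a sketch using python-requests. The endpoint, token and IDs are placeholders; microversions 2.22 (force complete), 2.24 (abort) and 2.25 (block_migration "auto") are where these landed, and in practice you'd normally use a client library rather than raw HTTP.

```python
import requests

NOVA = "http://nova-api:8774/v2.1"       # placeholder endpoint
TOKEN = "..."                            # a valid Keystone token
server_id = "REPLACE-WITH-SERVER-UUID"   # placeholder
migration_id = 1                         # placeholder; see GET .../migrations

def headers(microversion):
    return {"X-Auth-Token": TOKEN,
            "X-OpenStack-Nova-API-Version": microversion,
            "Content-Type": "application/json"}

# Live migrate, letting Nova work out block vs. shared-storage migration
# itself (microversion >= 2.25 accepts block_migration="auto"):
requests.post(f"{NOVA}/servers/{server_id}/action",
              headers=headers("2.25"),
              json={"os-migrateLive": {"host": None,
                                       "block_migration": "auto"}})

# Force a long-running migration to complete by pausing the VM
# (microversion >= 2.22):
requests.post(f"{NOVA}/servers/{server_id}/migrations/{migration_id}/action",
              headers=headers("2.22"),
              json={"force_complete": None})

# Or abort it and leave the VM running on the source host
# (microversion >= 2.24):
requests.delete(f"{NOVA}/servers/{server_id}/migrations/{migration_id}",
                headers=headers("2.24"))
```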
One of the other recent things in Nova has been an awful lot of work on the scheduler. I'm not going to deep-dive into that now, but when you do a live migration, the destination host has to come from somewhere, and one of the options is that the scheduler picks it for you. Until recently, Nova had a really bad memory of what you asked the scheduler for when you first booted the VM. Now, for all new boots, we record what the scheduler hints were, so that when we live migrate we can use the same hints to try to get you to a good place. Not all scheduler hints make sense when you live migrate: if you have the special force-to-a-particular-host hint, that's not so useful when you're trying to find another host, right? But in general we try to remember the scheduler hints so that we move you to the right place.

Looking back at live migration while preparing this talk, I also discovered it's surprisingly recent that you can decide which network your live migration traffic goes over. There's a lot of network traffic from all this copying, and you can now choose where it goes so that it doesn't interrupt any of the other traffic; you can put it where you want it to be.

So there's a lot of feedback on live migration. Recently, the feedback from operators has been: hey, it's working for me in my particular situation, or: in my particular situation it's not working for me. There are a lot of people using it in production right now and it really does work, although there's a lot of talk about a small number of errors that happen. So what did we do? We did some testing. We didn't get the test tubes out; we got Rally out, but same difference.

I'm told that to look clever I have to put a graph in my slides. This one actually means something, though it's a little bit boring, but I'm going to go through it anyway. Let's first concentrate on the red line at the bottom. What we did for this testing is run an awful lot of live migrations: the test run that generated these six points did about a thousand live migrations to get the green and red lines. I think the graph says min and max; I believe they're actually one-sigma lines, if I remember it rightly. The raw data is hopefully going to become available after the talk; we're going to try to get a blog post out that links to all the raw data and all the scripts we used, so you can go have a look at it and maybe even reproduce it if you want.

We had a 20-node cloud; that's not particularly interesting, we were really just live migrating between hosts lots of times. There were six VMs on the host: six small VMs, six medium VMs or six large VMs depending on the run, and we were moving them between hosts. Generally speaking, all six VMs were moved from A to B 40 times; that's how we got all the repeats in.

The key point is that if you have more memory, because you're a larger VM, then, surprise, surprise, it takes longer to copy that memory across the network. So small VMs move faster than large VMs. It works exactly how you'd expect. That's good; tick. The other thing to note is the red line versus the green line. The red line is when you've got remote storage. Again, as you'd expect, with remote storage it's a lot quicker to move between the source and destination because you're not copying all the disk. On the green line, we're copying the disk, and you can see the green line is effectively dominated by the disk copy, because it keeps going up. When I get to the specifics of the sizes we used: basically half the disk was full in each of the flavors, and the flavors stepped up in size, so you can see that the time taken is basically proportional to the size of the disk. Back to my original point: the green line is there, it did work. You can move local disks, but it takes significantly longer because there's more data to copy between the two places.

As with statistics, there are lies, damned lies and statistics, or whatever Churchill said; I'm the Brit, I should probably know, right? I didn't put the failure rate on there. That particular run had zero failures across all thousand live migrations. Admittedly, it being an artificial-ish environment, the workload was such that we knew it would move, and it reliably moved within that infrastructure.
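If you do want to reproduce this kind of run, Rally ships a live migration scenario. A minimal task-file sketch might look like the following; the flavor and image names and the counts are placeholders for illustration, not the actual OSIC configuration:

```yaml
# Hypothetical Rally task: boot a server, live migrate it, repeat.
NovaServers.boot_and_live_migrate_server:
  - args:
      flavor:
        name: "m1.medium"
      image:
        name: "ubuntu-xenial"
      block_migration: false   # true for the local-disk (green line) runs
    runner:
      type: "constant"
      times: 40                # repeats per data point
      concurrency: 6           # six VMs in flight at once, as in the test
```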
If I stopped there, though, I would be lying. We found some problems, which we've now got fixed upstream. The first one was about a progress timeout. The problem was that the top-right point on that graph, i.e. the large VM with a local disk, had a couple of failures during the runs. When we looked in the logs, it was failing with a progress timeout error. When we dug into the code, it turned out we were checking whether the live migration was progressing by checking whether the memory copy was progressing. That seems all well and good, except the memory copy isn't progressing while you're busy copying the disk. Oops. So what we've done is disable this progress timeout: my big fix was to default it to zero rather than 100, and we backported that and put release notes in. So if you're having problems with local disk migrations, I'd look at this progress timeout configuration value.

Talking with some of the people much closer to QEMU than myself when we dug into this code, people were a bit worried about the checking here anyway. It turns out what we were doing was sampling a sawtooth. Remember the memory iterations: if you look at the memory progress, each iteration it goes down, then jumps back up. Now, if you sample that at random, not knowing when the iterations happen, you can get a flat line even inside the sawtooth and still think there's no progress, or you can even get the line going the wrong way if you sample really badly. So we need to work with the upstream QEMU community and figure out how to get better information out of it, or how to better use their APIs for progress. But right now, we've just turned it off, because dragons.

Another thing: I was talking with the ops team doing the testing, and they said, oh yeah, it's working great, but there are a couple that seem to get stuck and not go anywhere; they still seem to be live migrating now. That was bad. So we dug into it. There was one particular bug hitting them here: we called libvirt to undefine the domain, and libvirt said, no, you can't undefine the domain, it's not defined. So there was certainly some race in there where the error handling wasn't quite right, and we got rid of that bug. But looking deeper, if we go back to the live migration process, the VM runs on the source, it goes to the destination, and then we do the tidy-up. Any errors during the tidy-up were not being handled correctly by Nova, and just left the live migration in a migrating state. So I fixed that too: we now correctly move it to the error state, so you can see something bad happened with that VM.

A lot of the time, on errors during live migration, we do our best to roll back, but if we've spent an awfully long time trying to move your VM and then fail, rolling back would mean moving the VM all over again back to the other side, and that seems like a terrible thing to do. So for now we just go to error, but at least it doesn't get stuck in the migrating state. I know lots of people complain about it getting stuck in the migrating state, and we managed to get this test suite reliably reproducing those bugs. It turns out that if you do a thousand live migrations across all these different flavors, you can actually reproduce them.
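For reference, that first fix boils down to a single nova.conf value, which is also the workaround if you're still on a release from before the default changed. A sketch:

```ini
[libvirt]
# 0 disables the "is it making progress?" check entirely. The old
# behavior aborted a migration whose *memory* copy looked stalled,
# which misfires while a block migration is still busy copying disks.
live_migration_progress_timeout = 0
```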
So, with those patches in hand, we applied them to the cloud and did a new test run, and we got rid of all the failures. It wasn't quite that simple: we thought the first fix was all of them, did a test run, still hit the other bug, got rid of that one too, and then we were in a happy place. So we've got some real stability improvements. Just to set the context here, the error rate we were seeing was well under 1% anyway, so what we're doing is going from an under-1% error rate to a theoretical zero. It's not really zero, right? It's just that within our environment, with our test, we saw zero failures, which is great.

Outside of the OSIC testing, one of the things I've seen as an operator, at the Rackspace public cloud, is that Nova uses user tokens to talk to the other services. We get the token through the API, and then we use that token to talk to Cinder, to Neutron, and to other places. There's a problem if that token expires between you hitting the API and Nova hitting some other API. So one of the things we've done is use service tokens: when the user token is sent along with a service token, it doesn't matter if the user token has expired, because we trust the service token to say it was already validated. Anyway, lots of waffle, waffle, waffle: there were token expiry errors that meant you got live migrations stuck in the migrating state. We've stopped that error happening by using service tokens, so if you configure service tokens, it goes away. And if something else bad happens, you don't end up in the migrating state anymore; you correctly go back to the error state.

This next graph I don't really understand, but I'm going to walk you through it anyway because there's an interesting fact behind it. Going back to the use cases about evacuating a host, and I purposely used the wrong word there, because that's a different Nova operation: don't use evacuate for emptying a host, that's a different thing, and there's a great blog post on that. When you're trying to empty a host of all its VMs, in this case 12 medium VMs on the same hosts we were using before, I wanted to work out how many migrations we should run in parallel to move all 12 VMs as fast as possible. So this graph is: how long did it take, on average, to move all 12 VMs? We did about 40 repeats of this test, back and forward, so 20 moves from A to B and back again.

When discussing the results with the operators running it, I asked: you know that flippant comment I made, that Nova has a config value saying you can only do one live migration at once by default? Did you tweak that? The answer I got back told the story: this test was done with the default configuration, which says only do one live migration at once. So if you want to do parallel live migrations, you might want to tweak that config. Because with that config set to one, you really are only doing one at once. If you ask it to live migrate two at once, there's no API failure; the second one just gets queued up on the compute node, silently. You're welcome. Serialized, yeah, although the bit that gets serialized is actually the bit where we activate the live migration; all the setup happens in parallel. But it turns out that running two migrations that get serialized at that point takes longer. It looks like we're tending towards four and a half minutes, and I'm guessing it just keeps going up from there. There's some overhead we're hitting; it could have been in our benchmarking tool for all I know, but yeah, that was weird.
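Both of the knobs this section touches on live in nova.conf; here's a sketch with option names as in recent releases. The [service_user] auth values are placeholders for whatever your deployment actually uses.

```ini
[DEFAULT]
# Defaults to 1: extra requests don't fail, they silently queue up on
# the compute node. Raise it if you actually want parallel migrations.
max_concurrent_live_migrations = 4

[service_user]
# Send a service token along with the user token, so a long-running
# migration survives the user token expiring part-way through.
send_service_user_token = true
auth_type = password
auth_url = http://keystone:5000/v3    # placeholder values from here down
username = nova
password = secret
project_name = service
user_domain_name = Default
project_domain_name = Default
```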
I put lots of details on this slide, and there's going to be a blog post describing all of them. Basically, we used the small, medium and large VMs, and the point I was trying to make is that the disks were 40, 80 and 160 gigabytes. Since each was about half full, the actual amount being copied was roughly 20, 40 and 80. For the workload on the VMs, we were using Apache Spark just to generate memory dirtying during the process, to make sure nothing moved really quickly, because that would be really boring. During the testing, we ran ping tests and kept a TCP stream open to a VM, to see whether the networking was interrupted at all. I don't have any results on that, because we didn't see any TCP drops or any ping drops. Admittedly, we were using regular provider networks. Lots of interesting information, but I'm not going to go into all of it right now.

So, future things: what's happening in Pike, Queens, Rocky. We spoke about the problems I found with the progress monitoring during live migration. At the PTG we had some great discussions about ways we can replace that with some interesting APIs, and we've got some specs up for that right now; that work is happening. I would also love for people who want to dig into live migration to spend a bit more time with very busy workloads and how they move. Clearly that's a different problem that we haven't really addressed here, and we'd love to dig in and see if the APIs we have work well for those kinds of use cases.

There are also some ideas about tracking progress using memory iterations. We might be able to use memory iterations, rather than time values, for deciding when to give up by default, because that naturally scales with both the size of the VM and the bandwidth available within your system, in interesting ways. So there are some interesting thoughts to dig into there.
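To make that last idea concrete, here's one possible shape of an iteration-based rule. This is purely speculative, a design idea from the PTG discussions rather than merged Nova behavior, and all the names are invented:

```python
# Speculative sketch: intervene after N memory iterations instead of
# after a wall-clock timeout. Not real Nova code.

MAX_MEMORY_ITERATIONS = 10   # hypothetical cap

def should_stop_waiting(migration_stats):
    # An iteration count self-scales: a big VM on a slow link still
    # gets the same number of passes before we force or abort, whereas
    # a fixed timeout has to be tuned per VM size and per network.
    return migration_stats.memory_iteration >= MAX_MEMORY_ITERATIONS
```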
The other heads-up: I was in a session earlier in the week and I promised to put this point in the slide deck, because this is sort of me promising to go review a patch. We have some issues if you start pinning resources. The live migration will work great, but you don't necessarily have things pinned where you'd hope they'd be; we pin them in exactly the same place they were on the source node, even if there's something already there on the destination node. Oopsie daisies.

My final slide is about takeaways. In the UK, fast food like this is called a takeaway. This is fish and chips. And these are chips, by the way, not those things they had downstairs for lunch; those are crisps where I come from. I just thought I'd say that. Maybe we should call it fish and fries and then it would translate better, I don't know. Anywho, sorry, British rant finished. Live migration works. There are caveats, but it does work. People are using it in production. Give it a whirl, and give us feedback. When you've got remote, shared storage it's quicker, but local disk is definitely possible; it does work, particularly when you turn off the progress thing. Whoops. And generally, I just wanted to give you an interesting look inside the tin so you can better understand the trade-offs involved. The analogy I thought of: if you've got one of the new Raspberry Pis with the official case, it's a bit like taking the case off and having a look at the chips inside and going, ooh. It doesn't actually help you, but I've given you a guided tour inside the little box to hopefully help you with live migration. Okay.

Questions. I'd love to go to questions. I see one waiting.

John, thanks for... John, John, John. John, thanks for the presentation, it was great. As an operator, I've got something to share and a question.

Absolutely.

We've actually had compute hosts that have lost their root disk. It's usually a controller issue; we do have two root disks, they're a RAIDed pair. But we were able to live migrate off of them, everything except the post-commit, kind of the tidying up, I think you called it.

Got you.

Because libvirt can't write to that local disk. So what we do is monitor with an Ansible command to make sure it's really running on the new node and not on the old node, and then I go do a database cleanup manually. Is there anything in Nova that audits where instances are actually running? Because we've written tools to do that, and I don't think anything exists. Am I out in the field here?

So there is a periodic task that runs, with a configuration option for what it does. If I remember rightly, the default behavior is just to moan at you in the logs. You can set it to reap and just delete things; I'm not sure I like reap, because it seems a bit whack-a-mole, but there's definitely information in there. Should we have more interesting tooling so you can make a more informed choice? Yes. There was a really interesting forum discussion this morning about health APIs and the health of the compute node. That would be a really interesting thing to put in there: how many foreign bodies have you currently detected in this place?

We call it ghost detection.

Oh, right, ghosts, yeah. That makes sense. Or orphans, but that seems a bit Annie. Yeah, that's a good question. The other thing is that now we've fixed some of the error handling, the instance should go to the error state even though just that little bit of tidy-up failed. I think we do that bit last, but I can't remember. And you'd actually be able to use reset-state to put it back to active. So it might actually help you.

Hey, Chet. Hey, I got a question. So Chet's question, for the video, was: hey, you say you're setting it to error; you're not killing the VM and hurting me, are you? No, we're not killing the VM. It's literally just the API state that goes to error. Previously, the problem was that the API state got stuck in migrating and it looked a bit limbo-y. Yeah, so for the video: when you do the cleanup, you can fix it.

Cool, any more questions? Well, for the more shy people, like I certainly was when I first came to the summit, feel free to come and talk to me afterwards. But yeah, thank you very much for listening. Thanks very much.