Hello everyone. Thanks for joining me this afternoon. My name is Steven Gordon. I'm a senior technical product manager at Red Hat, working primarily on OpenStack Compute and its interaction with related technologies in the underlying hypervisor. Today I'm going to talk specifically about the features in the compute service for moving instances around an OpenStack cloud: primarily the differences between them from a user-facing point of view, some of the things that have been done in Liberty to improve these facilities, and some of the things planned and being discussed for Mitaka at the moment. I'll focus first on defining what we're actually moving. Spoiler alert: it's instances. We're moving compute instances around, but what does that really mean in terms of the components of those virtual machines? Why are we moving them? How are we moving them? Then I'll cover the user-facing APIs used to actually initiate the moves, and the new enhancements coming in both the current and upcoming releases. In terms of what we're moving, if we talk about our instance, or our server as it's referred to in the Nova API, we have our guest configuration. That covers the simpler or more obvious things, like how many CPUs and how much RAM are associated with the guest, but it goes all the way down to the device profiles that are actually exposed by the hypervisor. Today I'll be focusing largely on the libvirt/KVM driver and the way migration works with it, and the guest configuration there includes a lot of different options, for things like which disk device drivers should be used and so on. There's also the guest storage. Primarily here I'm talking about the initial image or volume from which the guest was booted, and I separate that out from the disk part of the guest state, your ephemeral storage.
So in terms of guest state, we're talking about both the in-memory state of the guest and what's stored on that ephemeral disk. All of the paths we have for moving instances around a compute deployment involve moving some subset of these elements, but which mechanism you use will vary depending on which of these elements you care about. You may, for example, not care too much about the on-disk ephemeral state or the in-memory state if you have particularly cloud-ready applications; you may be willing to throw that away, as long as you know the instance will come back up somewhere else in the initial state captured in the image.
So why are we moving instances? In the OpenStack world, all of the APIs I'm going to talk about are marked as admin APIs in OpenStack parlance. That means they're not necessarily exposed to normal users, but they are available to tenant administrators, and that becomes important shortly, because even though these APIs are only exposed at the tenant admin level, in their current form they still expose back-end details to that tenant administrator that they shouldn't necessarily have to know. So why might we be performing these moves? We may be doing them proactively, in advance of node maintenance: maybe we're adding or removing hardware, or we know from our monitoring tools that a hardware failure is imminent, so we preemptively move our instances off that host. It may be reactive: the node has already failed for some reason, it's lost power or there has been a hardware failure. It may also be for capacity management reasons. The way the scheduler works in Nova, we basically place an instance once, and once it's running somewhere there's no real concept of dynamic rescheduling in reaction to changing capacity needs or something like a noisy neighbor effect. So the tenant admin may want to move instances around, either to spread them out or to consolidate them, depending on their goals for the OpenStack deployment.
Getting into the meat of it: how are we moving instances? I mentioned there are a number of different APIs in Nova for achieving a similar goal, and that is in fact part of the reason for the confusion, which I guess is what spurred me to do this talk in the first place. Even in the development community this is little understood, the API documentation around it isn't great, and we've provided a number of similar mechanisms that expose to the user back-end details they shouldn't have to know about how the move actually happens. If we look in the command line help, we see there are seven different actions in the Nova command line client that relate to moving instances. One of those is a list operation, so we can exclude it, but that still leaves us with six different ways to potentially move an instance from one host to another. The issues with this range from minor nits, like the fact that we use "migration" rather than "migrate" when talking about live migration, to the fact that we include "servers" in the API call in one case but not the rest, or even, in the last example, the case where we conflate live migration with evacuation, which are actually completely different things. A user looking at this doesn't necessarily know whether host-evacuate-live does an evacuation or a live migration, and we'll get to that in a moment.
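For reference, here's roughly what those seven actions look like in the python-novaclient help output (wording approximate, from a Liberty-era client; exact descriptions vary by version):

```console
$ nova help | grep -iE 'evacuate|migrat'
    evacuate              Evacuate server from failed host.
    host-evacuate         Evacuate all instances from failed host.
    host-evacuate-live    Live migrate all instances of the specified host
                          to other available hosts.
    host-servers-migrate  Migrate all instances of the specified host to
                          other available hosts.
    live-migration        Migrate running server to a new machine.
    migrate               Migrate a server.
    migration-list        Print a list of migrations.
```

Note the inconsistencies called out above: "live-migration" versus "migrate", "host-servers-migrate" versus "host-evacuate", and "host-evacuate-live" naming an evacuation when it actually performs live migrations.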
So trying to step back a little, the primary three mechanisms are evacuate, migrate, which is sometimes called cold migrate as a way of being a little more specific, and live migration. Evacuate specifically rebuilds an instance that is currently on a compute node that is down onto a different compute node. That distinction is important, because in many ways it acts the same way as migrate, but migrate only works when the source host is up. It's a little more nuanced than that, and we'll get into it in a second, because by "rebuild" in the migrate case we really mean resize: the migrate command is in and of itself a path through to resize, which was originally designed in the Nova API for resizing an instance between two flavors and which, coincidentally, as part of that operation, puts it on a different host. So migrate is abusing that functionality in some ways, and that also results in some other oddities, which we'll see in an example in a moment. Live migration is the case where we're moving an instance while keeping all of its state, not just the on-disk state but also the in-memory state, and trying to do it with as little downtime as possible, to the point where that downtime should be unnoticeable to the guest operating system and applications. The other commands listed, host-evacuate, host-servers-migrate and host-evacuate-live, are all actually helpers and don't necessarily map one-to-one to a Nova API call at the back end; some of them are actually implemented in the client. host-evacuate does a rebuild of all of the instances on the specified host and puts them on a new compute node. host-servers-migrate does the same with the migrate command, so again there's that distinction between whether we're talking about a host that is down or a host that is up. And finally, host-evacuate-live doesn't actually do an evacuation, but live migrates all of the instances on that host to a new place.
Drilling down on each of these individually: when we talk about evacuation, we have the nova evacuate command, with optional password and on-shared-storage arguments, the mandatory server name or ID of the instance we're trying to move, and optionally a target host. As I mentioned, evacuation only works when the compute node hosting the instance is down, or more importantly, recognized as down by Nova, and it will rebuild the instance on a new compute node. Obviously, since the node is down when we're doing this rebuild, we can't copy, for example, the in-memory state, or, if we're not on shared storage, the ephemeral disk state, to the new location. But there is still some benefit over just starting completely afresh: you get to keep the UUID, the IP address, and a couple of other details of the instance when you use this mechanism. If we are on shared storage, and we specify that flag, we can keep the ephemeral disk as well. One other thing: because we are doing a rebuild, there is the opportunity to inject a new admin user password at this point. If we don't specify one, it will be randomly generated for us anyway; in the example here I'm just doing a nova evacuate without shared storage, and you can see it generates a new password and prints it on the command line for me. The last thing I should mention is that evacuation allows us to bypass the scheduler. If you don't specify a target host, the scheduler is going to pick one for you, which is the default behavior, but if we specify a target host we are bypassing the scheduler completely, and that's one of the reasons this is an administrative API that's not exposed to the normal user.
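As a sketch, a Liberty-era evacuation looks roughly like this (instance and host names are made up for illustration, and exact output varies by client version):

```console
# Node hosting myinstance is marked down; rebuild it elsewhere.
# Without --on-shared-storage, the ephemeral disk is rebuilt from the image,
# and a fresh admin password is generated and printed:
$ nova evacuate myinstance
+-----------+-----------------------+
| Property  | Value                 |
+-----------+-----------------------+
| adminPass | <new random password> |
+-----------+-----------------------+

# On shared storage, with an explicit target host (bypasses the scheduler):
$ nova evacuate --on-shared-storage myinstance compute-02
```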
When we talk about cold migration, it's a little different. It doesn't have nearly as many options available, and part of the reason is that, as I mentioned before, it actually goes through the resize API behind the scenes, and that resize API is available to normal users. That's why, for example, it doesn't allow bypassing the scheduler, and it doesn't require you to know what storage is involved either. As I mentioned, it only works when the compute node hosting the instance is still up, and it rebuilds the instance on a new host selected by the scheduler. That involves actually shutting down the instance, copying the disk, and then starting the instance on the new hypervisor; after it has successfully done that, it also removes the instance from the original hypervisor. One of the weird things about it is that, because we're using the resize path, the resize API call has a manual confirmation step: someone has to manually confirm that the resize worked before the instance goes back to its normal operational state. The same applies to migrate, because it's going down that path, which is a bit of an oddity in the current implementation. In the shared storage case, migrate doesn't actually know about the shared storage, so it will do the copy anyway, which is problematic in and of itself. This is one of the places where information a tenant admin shouldn't have to know is filtering up through the API: Nova is not even trying to make a determination, in these cases at least, as to whether it's on shared storage or not.
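The flow just described, including the leftover resize confirmation step, looks roughly like this (hypothetical instance name, output abridged):

```console
$ nova migrate --poll myinstance

Server migrating... 100% complete
Finished

# The instance is parked in VERIFY_RESIZE, not ACTIVE, until confirmed:
$ nova list
+--------+------------+---------------+
| ID     | Name       | Status        |
+--------+------------+---------------+
| <uuid> | myinstance | VERIFY_RESIZE |
+--------+------------+---------------+

$ nova resize-confirm myinstance
```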
So for that reason, the tenant admin initiating this command, and some of the others, has to know and tell Nova, and that's one of the problems with the current implementation that we've been discussing. As a user of the cloud, even as a tenant admin, you shouldn't need to know or care what back-end storage is in use, because you may just be administering for your particular department or whatever it happens to be. In the cold migration example here, I used the poll option, and that ticker goes gradually from 0 to 100 percent as the migration completes. Then I do my nova list and, that's weird, I have this VERIFY_RESIZE status rather than just ACTIVE, which is what you might expect when running this for the first time. So from the command line I just run a quick nova resize-confirm. It's also available from the Horizon dashboard: if the instance is in that VERIFY_RESIZE state, there will be an extra button or option there to do the confirmation as well. I'm going to move on now to live migration, which is probably what I'll spend the vast majority of the remaining time talking about, both because it is one of the ways of moving instances that people find more interesting, and because, given what it's trying to do, it has more complex prerequisites to actually get working. Live migration, as I mentioned before, moves the virtual machine from one host to another without any noticeable downtime. We'll get to what "noticeable" means in a moment, but I say it that way to highlight that there is technically a brief outage as we complete the copy. There are two approaches to live migration, supported both at the QEMU/libvirt layer and via OpenStack Nova.
The first is using shared storage, and I include volume-backed instances in that. It means either you're using shared storage to back the Nova instances path on each of your hypervisors, so that any individual hypervisor can access the same set of disk images, or, if you're using boot from volume, you effectively have shared storage supplied by Cinder. We still obviously need to sync the memory state while we're doing this, but we've obviated the need to copy the disk storage across, because it's already there. The other alternative is block migration, which in older versions of QEMU was a little bit shaky. It has a completely new implementation in newer versions of QEMU, quite a few people use it, and it does a direct transfer of not just the memory state but also the disks. The trade-off you're making is that it takes longer, since you're transferring a lot more data between the two nodes, but on the flip side some operators prefer this approach because they don't use migration that often and don't want the overhead, performance or administration wise, of running shared storage if they only need it for occasional maintenance events. That's the trade-off: shared storage migration will typically be much quicker, but it has that extra overhead to actually set it up and keep it working. In terms of how it works at a high level, initially the Nova scheduler selects a destination host, although again, like with some of the other commands, you can override that by specifying a target host. I should also mention that if you want block migration you actually have to specify it; the default is to attempt a shared storage migration.
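Both approaches go through the same client command; the names below are hypothetical, and the flag spelling is from a Liberty-era python-novaclient:

```console
# Shared-storage (or volume-backed) live migration; the scheduler picks a host:
$ nova live-migration myinstance

# Block migration must be requested explicitly:
$ nova live-migration --block-migrate myinstance

# Specifying a target host bypasses the scheduler:
$ nova live-migration myinstance compute-02
```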
So the scheduler selects a destination host, unless you specified one yourself, in which case again it bypasses the scheduler. Weirdly though, in this particular case, and this is an oddity worth highlighting because it will come up again when we talk about issues we're still working on, there are additional checks done in the libvirt driver on both the source and destination host, covering disk, RAM, CPU model, and also any mapped volumes that may be connected to the instance. RAM in particular causes some issues at the moment. For those familiar with the concept of overcommitting memory: in the default configuration Nova overcommits memory at 1.5 to 1, and operators tend to change that substantially, but regardless, a lot of people are running OpenStack with some level of overcommit enabled. This check happens on the compute node, and overcommit is a scheduler-side setting, so when we do the calculation on the compute node we're not factoring in overcommit at all. That means that although by your overcommit calculations the destination host should have enough room to take the instance, the migration may fail anyway. That's an issue we're currently working to resolve. The mapped volumes issue comes up when we have an instance with a mixed storage model, where we've used an image to boot it and then attached volumes to it: the migration will currently try to copy those mapped volumes across as well, which is also not good. So that's again on the list of things we're trying to resolve, and I'll come back to it in the Mitaka piece of the discussion.
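To make the RAM mismatch concrete, here's a toy sketch of the two views of capacity. This is not Nova's actual code, and the numbers are invented for illustration: the scheduler scales host capacity by the overcommit ratio, while the Liberty-era compute-side check looks only at raw RAM.

```python
# Illustrative sketch (not Nova's actual code) of why the libvirt driver's
# RAM check can reject a migration the scheduler would have allowed.

def scheduler_allows(total_mb, used_mb, instance_mb, ram_allocation_ratio):
    # Scheduler-side view: capacity is scaled by the overcommit ratio.
    return used_mb + instance_mb <= total_mb * ram_allocation_ratio

def compute_check_allows(total_mb, used_mb, instance_mb):
    # Compute-side pre-migration check: raw RAM only, no overcommit applied.
    return used_mb + instance_mb <= total_mb

# A 64 GB host already committed to 56 GB of instances, receiving a 16 GB guest:
total, used, incoming = 65536, 57344, 16384

print(scheduler_allows(total, used, incoming, ram_allocation_ratio=1.5))  # True
print(compute_check_allows(total, used, incoming))                        # False
```

The same host passes the scheduler's arithmetic but fails the destination-side check, which is exactly the mismatch being discussed.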
Anyway, assuming our source and destination hosts check out, we move into stage three, which is what we call the iterative pre-copy. What this means is we start copying memory pages from the active virtual machine to a new virtual machine, in a paused state, that we create on the destination. Obviously, while we're doing that, the virtual machine on the source host is still running, still dirtying pages as it writes to memory. So we take one big block, copy it across, keep going, and then when we get to what we think is the end we take another look and say, OK, we have more dirty pages, and we keep copying. Gradually the idea is that the delta gets smaller and smaller, to the point where QEMU calculates that, given the transfer rates it's getting, it will be able to copy all of the remaining dirty pages in one step. That's when we pause the source VM, copy that last set of pages, typically in a matter of milliseconds, and then fire the VM up on the new host. Finally, once that's worked, we clean up the source.
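The convergence logic above can be sketched as a toy simulation. This is not QEMU's implementation, just the shape of the loop: keep copying the dirty delta until it's small enough to finish within the allowed downtime window, and note that if the guest dirties pages faster than the network can copy them, the loop never converges.

```python
# Toy simulation (not QEMU code) of the iterative pre-copy loop.

def pre_copy(total_pages, dirty_rate, copy_rate, max_downtime_pages):
    """Iteratively copy pages; finalize once the remaining delta is small
    enough to transfer within the allowed downtime window.
    Returns the number of passes, or None if migration never converges."""
    remaining = total_pages
    passes = 0
    while remaining > max_downtime_pages:
        passes += 1
        # Time to copy the current delta, during which the guest keeps
        # dirtying pages; the next delta is whatever got dirtied meanwhile.
        seconds = remaining / copy_rate
        remaining = min(remaining, int(seconds * dirty_rate))
        if passes > 1000:
            return None  # dirty rate >= copy rate: we never catch up
    # At this point the source VM is paused and the last delta copied.
    return passes

# Guest dirties 10k pages/s, network copies 100k pages/s:
print(pre_copy(1_000_000, 10_000, 100_000, max_downtime_pages=1_000))  # 3
```

Each pass shrinks the delta by the ratio of dirty rate to copy rate (here 10x per pass), which is why migrations of busy guests hinge on network throughput and on tuning the downtime window, as discussed shortly.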
In terms of gotchas, or how it doesn't work: I'm fairly active on ask.openstack.org, for example, and there are two recurring Nova questions there. This is one of them; the other is "why did I get a No valid host error?" You get a lot of karma points for answering those. So, CPU mode and model compatibility. With the libvirt/KVM hypervisor we have this idea of exposing to the guest a virtual CPU, which may or may not match what's on the physical box. The maximum performance comes from exposing every single CPU feature that exists on the physical CPU die. The downside, however, is that live migration will fail if the source and destination hosts don't have that exact same set of CPU features, and the problem is that even within certain named model ranges of CPUs, which as a consumer of the hardware you would expect to match exactly, there are occasionally differences between them that can cause this problem. So effectively you have to make a performance-versus-flexibility trade-off: you want to determine how much performance you're willing to trade away for a larger effective live migration domain, the set of hosts you can live migrate between from any given host. Nova exposes some configuration keys to influence this choice. The host-passthrough option is what I talked about first, the idea of just passing through every single feature that's available, for maximum performance but minimal ability to live migrate. Then we have host-model, which is an approximation of the host CPU model: it takes a look at the physical features being exposed, compares them against the list of models already predefined in libvirt and QEMU, and picks a best fit, which is generally the lowest common denominator within that model family. That at least allows you to migrate if all of the CPUs in your cluster are based on, say, Sandy Bridge. Where you can hit trouble, obviously, is with heterogeneous hardware: maybe you have some Westmere, some Sandy Bridge, some something else, and you have to find a lowest common denominator across all of them if you want to migrate between all of them. That's where custom comes in: the ability to specify the exact CPU model you want to use. One of the example questions that came up was someone trying to use an i386 box and an x86_64 box in the same cluster, and to get that working you basically have to go down to something like a Pentium II. That's probably an outrageous example, but it's the kind of situation where you really have to trade off performance if you want to migrate between machines that are pretty vastly different, even if they're both x86 architectures. To find out what models are available, we can use virsh or "qemu-kvm -cpu help" to get the list, and then at the bottom there we drop the chosen model pretty much straight into our Nova config. In terms of other ways to fail, I'm not going to go into great detail on these; there have been a couple of good presentations, particularly in Vancouver, covering a lot of the details of how to set things up to avoid them specifically, and I have those linked on the references slide at the end for people who want to dig
into those if they missed them. Similar to CPU models, we have a concept of machine types in QEMU, which effectively defines what hardware we're exposing to the guest. The machine type in use has to be available on both the source and destination hypervisor; typically your distribution, if you're getting packages from one, will deal with that for you, and it mainly becomes an issue when you're trying to upgrade between distribution releases and need to migrate as part of a rolling upgrade. Inconsistent network configuration: the source and destination hypervisors have to be able to talk to each other on the specified network. Inconsistent clocks can cause a lot of issues while we're trying to sync things up as we go. VNC listening addresses: if we're too specific in the way we bind VNC on the source host, it may not work when we move to the destination host, and we may not actually be able to connect via VNC. And if we're doing secure live migration, using SSH tunneling, we need to make sure that's set up correctly and that we've distributed keys correctly. In terms of other operator issues that are a little more OpenStack specific — the list on the previous slide is pretty general to anyone doing migration with libvirt/KVM, things you need to be aware of as a deployer — OpenStack specifically: migrations can take too long, or fail to complete. If we think about what we talked about before, we're basically at the whim of how big the virtual machine is and how active it is in terms of whether we're going to be able to migrate or not. If we can't keep up in terms of network throughput, and we're not doing some tuning around the edges to compensate, we're not going to get anywhere, and until very recently Nova wasn't doing any of that tuning; I'll talk about that more in a moment. We also need to use virsh, bypassing Nova, to do a lot of things. You can't, for example, via the Nova API, throttle migrations, cancel migrations, or monitor how they're going, so there's very little to no feedback, and you can't tune the migration max downtime. What that means is the finalization step I talked about, where we pause the source VM to copy those final few dirty pages across and restart on the new host: there's a value associated with that called the max downtime, and that's what QEMU evaluates against when it makes the calculation of "can I finish the migration in this time?" If you have a particularly large virtual machine, you may want to extend that time a little to give yourself a higher chance of actually finalizing the migration, and Nova hasn't been tuning that up until this point, so that's something we've been missing out on. There are also certain instance configurations that can't be migrated at the moment. Mixed storage is probably the most glaring one, which I touched on before: if you're mixing image- and volume-based storage, or if you have a config drive attached, these things currently won't live migrate, and that's something we have to work on. There's also the use of passthrough devices. This is less likely to go away any time soon: if you're using physical SR-IOV cards, GPU passthrough, anything like that, there's no live migration available with that at the moment either. There are some things we can potentially do in the future with SR-IOV, particularly if you're using a macvtap device, although there's a performance trade-off with that as well, but in general we can probably assume that for passthrough devices live migration is out of the question for the foreseeable future. I mentioned live migration doesn't correctly account for overcommit, and that's also an issue for operators currently; if they're doing a lot of migration, it really impacts the memory utilization of their cluster. And then, as I mentioned, the tenant admin currently needs to know whether shared
or block storage is in use. So what have we done about all this in the Liberty release? First, long-running live migrations. I mentioned the factors involved here: the amount of guest RAM, the speed with which that RAM is being dirtied, and the speed of the migration network. We were using a fixed maximum downtime with QEMU; as of Liberty, we have the ability to scale up the downtime over the course of the migration, to allow a better chance of completing it. We also now have a limit on the number of outbound live migrations running at once. Think about it: if I have a host with, say, three live migrations all going outbound at the same time, each is getting, approximately, 33% of the network bandwidth, give or take, so I'm decreasing the chance that any one of the three will finish, compared with doing them one at a time. That's what we're getting at there. As related work, indirectly related to migrations, we're also now limiting the number of inbound build requests a single host will take on; the default for that is a little higher, and I'll talk about it more in a moment. Combined, the idea behind all these changes is that we're maximizing the chance the finalization step will actually complete, particularly for large VMs. We also have a number of new configuration keys to influence this behavior, because the right settings depend a lot on how much throughput you're getting on your hosts, what type of network cards you're using, how frequently you use live migration for that matter, and how you want these things to work.
In terms of the scaling steps: live_migration_downtime is that maximum length of time for the finalization step, effectively saying what's the maximum amount of time the VM can be paused on the source to complete the migration. live_migration_downtime_steps is the number of incremental steps we take to reach that value: we don't initially set the downtime to the maximum allowable, we start optimistically and then gradually work toward that more pessimistic value, which is obviously longer. Finally, live_migration_downtime_delay is the time to wait between incrementing through those steps. All three work in unison to control how we scale the downtime value, and I'll walk through an example in a moment, because the way they combine is a little non-obvious to explain. In terms of timeouts, we also have some overall totals, outside of the scaling process: how long do I allow a live migration to run before I assume it's not going to finish? We now have a value for that, and it's scaled by the amount of guest RAM in gigabytes. So if I set live_migration_completion_timeout to 800, for example, which is the default, that's actually multiplied by the amount of guest RAM to get the value that's actually used. There's also a progress timeout: if we don't see progress, in terms of data being copied, for a certain amount of time, 150 seconds by default, then at that point we assume the live migration has failed catastrophically and we try to clean up. In terms of concurrent operations, I mentioned you can control the number of concurrent live migrations: max_concurrent_live_migrations defaults to 1, which is pretty pessimistic, but it's recommended that operators who want to run concurrent live migrations test that first before increasing it. And max_concurrent_builds limits new instances being built on a single hypervisor, with a default of 10.
So, the stepping example. Here I'm using a 400 millisecond maximum, so again, that's the maximum amount of time for the completion step while the VM on the source is paused, with 10 steps, a 30 second delay between steps, and a 3 gigabyte guest. First of all, the delay between steps is the configured 30 seconds multiplied by the number of gigabytes of RAM, in this case 3, so you can see the increments go from 0 to 90 to 180 seconds. We scale exponentially across those steps until we hit 400 milliseconds: you can see we start off at 37 milliseconds, then 38 milliseconds, and then gradually go all the way up to 400 milliseconds. It's a little hard to read on the chart here, but the blue line going up on the diagonal is the delay, and the red line toward the bottom is the max downtime as we ramp it up to give that guest a chance to finalize. What we're effectively saying is that if, at the end of the day, the transfer can't be finalized in 400 milliseconds, then the migration is going to fail and we fail it out, but we're giving those larger guests the best opportunity we can to actually finish the live migration.
One other addition in Liberty, relevant more to evacuate than live migrate: we now have the ability for external tools to report into Nova that a host is down. The reason we might want that is, as I alluded to when discussing evacuate, Nova is not going to immediately notice that a host has gone down; it does this via a periodic task, which means, depending on how you set things up, host failure detection can take one to two minutes before Nova picks it up in the database. We have external tools like Pacemaker that are much faster at detecting these things, and have existed for a long time for that matter, so there's now the ability for those to call into Nova and say "hey, this host is down", which means Nova then puts it in the down state and we can immediately initiate an evacuate at that point, rather than having to wait that extra period of time. In terms of Mitaka and beyond, for that matter, that discussion is obviously going on this week, but going into the
summit, there's a group of people in the Nova community coalescing to work on live migration and the issues related to it, and there's an etherpad associated with that; I think in the session after this we're going to be discussing it, potentially in the unconference session. In the short term: CI coverage. It's only fairly recently that we've had a job in the OpenStack infra that can use multiple hosts, which is obviously a prerequisite to being able to test, in the CI infrastructure, a live migration or any kind of migration between hosts. The goal is to get that job voting, but also to expand the test coverage to ensure that not just the basic paths are covered but also things like boot from volume and so on, so we have full coverage. Next, agree on and improve the API documentation: one of the big problems in this area, and one of the reasons I wanted to talk about the differences between evacuate, migrate and live migrate from the API perspective in the first place, is that there is very little documentation of those differences available outside the code base. Within the development community there's a desire to agree on what the semantics are and improve the documentation of what currently exists before we try to move to what we might want in the future. Support for migrating instances with mixed storage: in particular, I think the config drive case is the big one we really want fixed as soon as possible, so that you can have an instance with a config drive attached and actually migrate it, and then eventually do the same for an instance with volumes attached. And support for pausing, and potentially cancelling, migrations. Pausing in particular is a very helpful one if we can implement it.
The reason for that is it gives the admin the opportunity to effectively stop the guest dirtying pages, and based on our discussion of how live migration works, if they pause then obviously it makes that finalization step much easier to complete; you're no longer constantly putting yourself behind. So if you have a really big VM and it's taking a really long time to migrate, it gives the admin the option of pausing that VM to make sure it actually gets where they want it to go. Better resource tracking in this area in general is needed; we have a number of edge-case and race-type conditions that can cause a long-running live migration to fail for various reasons. Using libvirt storage pools instead of SSH for the migrate or resize case: that's an enabler for migrating suspended instances in particular. At the moment, if you have a suspended instance you can't actually migrate it; you have to bring it up and then migrate it, which is a little bit strange because from a theoretical perspective it should actually be easier to migrate one that's suspended, but anyway. And, as I mentioned, correcting the memory overcommit situation and the way that's tested when we're doing the "will a migration between this source and destination work" check. Medium to long term: things like not just using the current tunneling system for securing live migration but also doing TLS encryption, as there's work underway in QEMU around that, which is obviously a gating factor for this to actually be done. Auto-convergence: this is semi-related to some of the concepts we talked about, but there's also this way, through auto-convergence, to adjust effectively the amount of CPU the guest is getting, so not having to do a complete pause but just effectively giving it fewer cycles to slow the speed with which it can dirty pages and therefore, again, make that finalization step easier to complete.
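To make the convergence problem concrete, here's a toy back-of-the-envelope model, not Nova or QEMU code, with made-up rates and sizes, showing why throttling guest CPU helps a pre-copy migration finish: if the guest dirties memory faster than the link can copy it, the remaining data never shrinks into the downtime budget, while a throttled guest eventually does.

```python
def migration_rounds(ram_mb, dirty_mb_s, xfer_mb_s, downtime_ms,
                     cpu_throttle=1.0, max_rounds=1000):
    """Simulate 1-second pre-copy rounds. Each round we copy xfer_mb_s
    of data while the guest re-dirties dirty_mb_s * cpu_throttle.
    Returns the number of rounds until the leftover data can be moved
    within the downtime budget, or None if the migration never converges.
    """
    remaining = float(ram_mb)
    # How much data we can still have outstanding and finish inside the
    # allowed downtime window.
    budget_mb = xfer_mb_s * (downtime_ms / 1000.0)
    for rounds in range(1, max_rounds + 1):
        remaining -= xfer_mb_s                    # copied this round
        remaining += dirty_mb_s * cpu_throttle    # re-dirtied meanwhile
        remaining = max(remaining, 0.0)
        if remaining <= budget_mb:
            return rounds
        if dirty_mb_s * cpu_throttle >= xfer_mb_s:
            return None                           # can never catch up
    return None

# A busy 3 GB guest dirtying pages faster than the link can move them
# never converges...
print(migration_rounds(3072, 1200, 1000, 400))
# ...but the same guest throttled to half its dirty rate finishes.
print(migration_rounds(3072, 1200, 1000, 400, cpu_throttle=0.5))
```

The auto-convergence support in QEMU does something in this spirit by progressively reducing the guest's vCPU time when it detects the migration is not making progress.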
And finally, post-copy migration: the idea of flipping the way we do live migration on its head and, instead of putting the destination instance on a host in a paused state, actually bringing it up straight away and then copying the memory across as it's accessed, which is a little bit of an inception-type idea but something that's being looked at as well. So at that point I'm pretty much done and I can take questions, but just in terms of some general information: the slides I've already uploaded to SlideShare, I believe; I'm informed you can submit anonymized feedback, or abuse, using the Summit app; and for pictures of cats with stuff on their heads and so on I think you still have to use Twitter and email, so those are there as well. For people who want to get involved there is that etherpad at the bottom, and in terms of references, I mentioned that there have been a couple of other talks in this area, there's a long list of bugs in a Google Docs spreadsheet, and so on, so there is a references slide that will be on the SlideShare page as well; that's all available there. So, questions. Yes, basically. Well, I think one of the things that's actually being discussed at the moment on the mailing list, as more and more developers themselves realize that evacuate and migrate are different things, is trying to at least consolidate the client side of all this, if not the API side, and make a generic move API, because the users shouldn't necessarily care. I think for the most part I don't see the SSH requirement going away, because for most real use cases people actually want some kind of encryption over it. Nope? Taking that as a no. A migration might fail and still leave me with the VM? So you mean the source VM? Effectively, yes, it will in the cases where the VM is still up, so with live migration it should stay there.
Where it gets complicated is primarily the migrate case, I think, where you can end up, and I actually did this in my testing, with half the VM on one host and half on the other, and that's not a good place. But with live migration, if we find instances where you try to live migrate, it fails, and you can't get back to the original instance, I think we treat those as bugs; migrate is the weird one. Does that make sense? The networking causing issues, you know, it can't recover the tap on the source one? Hmm, I don't know. Most of my failures that I've seen were exactly in that pattern: it would create an instance on the destination, something would time out or for whatever reason it felt that it should fail it, but it would never get back the network tap on the source. In theory the network part is supposed to be one of the last parts we do, for that reason, but still, because of some of the other foibles in this area of the codebase, it wouldn't surprise me if there are race conditions in there that need to be fixed. Resize methods on the hypervisor driver? It's a little bit special, because migrate does use the resize API. Evacuate uses rebuild, I believe, which is different again, just to confuse people. I was going to ask, is there a set that you get for free? From the point of view of a driver that implements resize, is there a lot of extra work to do? Yeah, so if you look at the hypervisor support matrix, which I originally had a slide on and actually pulled because it wasn't particularly useful in a lot of cases, one of the reasons for that is that migrate is not actually treated as a separate call on that chart. So when you look at that chart you actually want to look at resize, because if resize is there as supported for a given hypervisor then effectively migrate should work as well. It is actually the same code path.
It's like the migrate bit in the API is a really thin shim which is basically just passing the parameters straight through to the resize call. If you configure resize on the same host, does migrate then stop working? Yes, and the thing I'll say about that is that typically we don't recommend turning allow_resize_to_same_host on in the first place. It's primarily there for people who are doing a single-host test install, just to try it out, so that's the reason I don't think people are too concerned about that; that was, I think, one of the asterisks on that slide. So if you've got an ephemeral disk and a volume on shared storage, would live migration work? Yes. Well, there's a check in there, that mapped-volumes check I mentioned, in the "will a migration between this source and destination work" test: there is something in there to catch a mixture of volumes associated, and it won't let you do the live migrate, it will fail out of the API call, because previously it would really mess that up. I was looking at trunk today when I was double-checking that, but I think it's been there for a little while, because these issues in that etherpad have been percolating for a while, so people have already been fixing some of those things. Any others? The fact that there's resize and rebuild stuff, am I correct that there's no real plan to get rid of that? There's actually, well, I didn't have this in the slides because, as with the Mitaka-and-beyond stuff, I'm just forecasting based on what's being discussed, but there is a discussion thread that originally kicked off on openstack-dev talking about the idea of having a migration state machine. It actually went a little bit elsewhere, which was: there's a fair bit of agreement that the way migrate in particular works at the moment is kind of dumb.
In particular, having a confirm step where there's nothing you can really confirm: either it worked or it didn't. So I don't know; there's not an agreed-upon plan for how to deal with that from an API or even a client perspective. There were some suggestions on the thread, like I mentioned, of having a move API which calls evacuate or migrate depending on the state of the host, and potentially reworking the migrate call to be a little bit more specific to migration instead of just being a pass-through. But I wouldn't say there's agreement around those things; there's agreement that there's a problem. I would say I expect it to go somewhere, just based on the people involved and the amount of discussion going on about it in various forums, but it may take a while. That's all. Yeah, so I haven't looked much at it outside of Keystone in terms of the verbs they're using at the moment; that doesn't surprise me. I know they've tried to do a little bit of abstraction to try to make those things nicer and not necessarily bind themselves to how it is in the API, or assume that how it is in the original client is how it has to be there, but I can imagine that. I wanted to use it specifically to try to live migrate without specifying a target host, so the scheduler takes care of it, and I just couldn't figure it out, so I ended up using the API. Yeah, in preparing for this I primarily stuck to the Nova client, and obviously for the Nova developers at the moment that's what they're still largely focused on. I would say that while there's been a big push in the Keystone community to move to the OpenStack client, that hasn't really happened in many of the other projects, including Nova, so that in particular may be a ways away. All right, thanks everyone for coming.