Hi, my name is Stefan Hajnoczi, and I'm going to talk about live migrating VFIO, vhost-user, and vfio-user devices. We're going to have a look at how devices are live migrated in QEMU today, as well as the current approaches for migrating VFIO and vhost-user devices. Then we'll look at how to address some of the gaps, like supporting stateful vhost-user device migration, extending VFIO migration to allow migrating between different implementations of the same device, and checking whether migration is actually possible and compatible without initiating a migration. So let's get started.

How are devices migrated in QEMU today? QEMU migration works by saving the device's state on the source and then transferring the state to the destination, where it's loaded into the destination device so that it can resume execution. This is transparent to the guest, so the guest driver and the applications are not interrupted by this process, and anything that's happening will continue seamlessly. The device state itself has a serialized binary form called the device state representation, and we'll talk about some of the details of that later on.

One of the important questions is, what goes into the device's state? What makes up a device state? This includes the register contents of the device. Basically, anything that is guest visible needs to go into the device's state. In addition, there might be some internal state that the device has that we really need in order to resume execution from where we left off. Not everything is necessarily exposed to the guest via hardware registers. A good example of this is when you have a ring buffer interface, where the guest driver adds elements to that ring buffer and the device consumes elements from it. The device is going to have a read index where it keeps track of the last request it processed.
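As a minimal sketch of that ring buffer example, the toy device below keeps a read index as internal state and serializes it for migration. The class name, the ring size, and the 32-bit little-endian encoding are all assumptions made for illustration; this is not QEMU's actual device state format.

```python
import struct
from dataclasses import dataclass, field

RING_SIZE = 8  # assumed ring size for this toy example


@dataclass
class ToyRingDevice:
    """Hypothetical device that consumes requests from a guest-filled ring."""
    # The ring lives in guest memory, which is migrated separately from device state.
    ring: list = field(default_factory=lambda: [None] * RING_SIZE)
    # Internal state the guest cannot read back: the next slot to process.
    read_index: int = 0

    def process_one(self):
        request = self.ring[self.read_index % RING_SIZE]
        self.read_index += 1
        return request

    def save_state(self) -> bytes:
        # Serialize the read index as a little-endian 32-bit integer.
        return struct.pack("<I", self.read_index)

    def load_state(self, blob: bytes) -> None:
        (self.read_index,) = struct.unpack("<I", blob)


# The source processes two requests, then its state moves to the destination.
guest_ring = ["req0", "req1", "req2", "req3"] + [None] * 4
source = ToyRingDevice(ring=list(guest_ring))
source.process_one()
source.process_one()

destination = ToyRingDevice(ring=list(guest_ring))
destination.load_state(source.save_state())
next_request = destination.process_one()  # continues with the third request
```

Without the `load_state` call, the destination would start again at index 0 and process the first two requests a second time.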
If we don't migrate that internal state, then on the destination we won't know where we left off. Potentially, we will have to restart the entire ring, and we may process some of the elements twice, which would be a problem. So that's an example of internal state that needs to be migrated.

Now, there are a few things that don't go into the device state, and this might not be obvious. They include the device creation parameters, as well as other host-side state that's inaccessible to the guest and therefore not relevant to the continued operation of the device. Now, why isn't everything part of the device state? This is an interesting point to keep in mind because it's kind of fundamental to the rest. There's no fundamental technical reason why QEMU's device state is split and limited in this way; it's just how it was implemented. The reason is that the destination QEMU is actually launched with a full QEMU command line that specifies all of the devices that you want on the destination VM. Live migration doesn't involve starting an empty shell QEMU that has nothing and receives all its information from the source. Instead, the destination is actually a full-fledged VM that has all the devices configured. So that means the device state is really just the runtime state of the device. The destination's command line is usually quite similar to the source command line.

Okay. Now that we've talked about how devices are migrated in QEMU, let's have a look at the challenges that are unique to out-of-process devices. With VFIO, we have kernel drivers and physical devices, like PCI devices, that can be passed through into the guest. With vhost-user, we have host userspace processes that emulate some of the VirtIO devices. And with vfio-user, we have a host userspace process that can emulate PCI devices and, in the future, maybe other types of devices. So this is a little bit different from having everything built into QEMU itself.
First of all, on the destination, we're going to have to launch or instantiate these devices because they're not part of QEMU. QEMU's command line alone isn't going to create the devices necessary for our migration destination. Second, we need to integrate into QEMU's live migration workflow so that we can send the device state along with the rest of QEMU's migration stream. When we do this, the device state representation becomes important, and that's what I mentioned before, this binary serialization of the device state. We need to design it so that it can support migration between old and new versions, support adding fields, and things like that, without breaking compatibility whenever possible, because it would be disruptive to users if they had to reconfigure their VMs.

Finally, a nice goal would be if it were possible to migrate between out-of-process devices that have the same device type. Say we have a virtio-net PCI device implemented in hardware. With VFIO, we can give it to the guest. But what if we want to migrate it to a vhost-user virtio-net device, for example, because we've decided that this workload doesn't need the performance of a hardware implementation and we have another, more important workload that's going to need it now? Could we migrate in order to change the quality of service? That's the goal of being able to migrate between implementations of devices.

So we've set out some of the goals and some of the challenges that we're going to look at. But let's start now by looking at how this works today for vhost-user. vhost-user migration takes an approach where QEMU actually stops the vhost-user device on the source before migrating. Then QEMU takes over the vring. It takes over the state of the device, and then it uses QEMU's common virtio-net migration code in order to migrate to the destination. Once QEMU has loaded the device's state on the destination side, it restarts vhost-user.
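The stop, take over, migrate, restart sequence described above can be sketched with a toy model. Everything here is hypothetical: `ToyBackend` stands in for a vhost-user device process and `migrate` for the VMM's orchestration; neither mirrors the real vhost-user protocol messages or QEMU's code.

```python
from dataclasses import dataclass


@dataclass
class ToyBackend:
    """Hypothetical stand-in for a vhost-user device process."""
    running: bool = False
    last_avail_idx: int = 0  # generic vring position, the only state migrated

    def start(self, vring_base: int) -> None:
        self.last_avail_idx = vring_base
        self.running = True

    def stop(self) -> int:
        # Stopping hands the vring position back to the VMM.
        self.running = False
        return self.last_avail_idx


def migrate(source: ToyBackend, destination: ToyBackend) -> list:
    """Return the ordered steps the VMM performs.

    The two backends never talk to each other and never define
    their own device state; the VMM carries the generic vring state.
    """
    steps = []
    vring_base = source.stop()
    steps.append("stop source backend")
    stream = {"vring_base": vring_base}  # migrated by the VMM's own code
    steps.append("VMM saves and transfers state")
    destination.start(stream["vring_base"])
    steps.append("start destination backend")
    return steps


source_backend = ToyBackend()
source_backend.start(vring_base=5)
destination_backend = ToyBackend()
performed = migrate(source_backend, destination_backend)
```

Note that the `stream` dictionary has no slot for device-specific internal state, which is the gap for stateful devices.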
So that means that vhost-user is not directly involved in the migration. It doesn't control the migration, and it does not get to define its own device state. Finally, there are some additional steps that could be done after migration, or maybe before migration, depending on the device type. For virtio-net, there is a special vhost-user protocol message to send out the gratuitous ARP packet that announces the network interface after migration. Things like that need to be done separately, and they're orchestrated by QEMU.

We can look at some of the pros and cons of this design. One of the great advantages is that it's easy to migrate between any implementations of the device, because the device state representation and the migration itself are all managed by QEMU. There's no worry about incompatibilities with other implementations because really there's only one implementation of this migration. Additionally, QEMU's own infrastructure for device states, called VMState, is a framework that has the ability to safely add fields, make changes in compatible ways, and so on. So we don't have to reinvent that when we're implementing vhost-user devices, which is also an advantage.

The disadvantage is that this approach doesn't support stateful devices today. What I mean by stateful is devices that have internal state. Today it only migrates the vrings themselves, the generic VirtIO state of the vrings, and nothing device specific. So when you have a device like virtiofs, which has a significant amount of internal state, we have a challenge: we can't do that today.

Let's have a look at some of the ways that this could be overcome. There's an existing interface called D-Bus VMState that QEMU supports. It lets external processes add blobs of data into the migration stream. They can save them on the source and load them on the destination. This could be used by vhost-user programs in order to participate in migration and save internal state.
The D-Bus VMState approach is not very integrated into the overall vhost-user migration code, though. So it might be cleaner to take a different approach, like defining the device state as part of the VirtIO or the vhost-user specification and then adding messages to the vhost-user protocol that allow us to save and load it. I think this is the future direction that we're going to see. One of these approaches will be needed in order to migrate devices like virtiofs and maybe some of the other device types that today can't migrate with vhost-user.

Okay, now let's look at VFIO. VFIO is very different in how it approaches migration. With VFIO, QEMU reads the device's state from a migration region, a special region made available by the device, and transfers it to the destination as part of QEMU's live migration stream. It's then loaded into the destination VFIO device by writing it back to this special region on the device. What this means is that the implementer of a VFIO device has full control over their device state representation. They can define their own device state, and in fact it's completely opaque to QEMU. QEMU doesn't inspect, parse, or understand the device's state. That's really up to the device implementation.

One of the advantages of this approach is that it doesn't require modifying QEMU when you add new device types, whereas the vhost-user approach manages all the migration in QEMU and therefore requires you to add code for every single vhost device type that you want to support. With VFIO that's not the case. You can just pass through an external, out-of-process device, and QEMU doesn't need code to support your particular device type. This has some disadvantages too though.
Because the VFIO device implementer is the one who is responsible for the device state representation and the migration, they need to solve the extensibility and compatibility issues that you have when you invent a binary serialization format that is supposed to migrate between old and new versions of a device and so on. It's very easy to make mistakes. QEMU has evolved, and it has come up with documentation and guidelines on how to do this, but if you're on your own then it's easy to hit issues. What ends up happening is that you can end up with versions of the device that can't easily migrate from one to another, where you essentially would have to reboot or hot plug a new device into your machine in order to move on to a fixed device.

The other issue with this approach is that migrating between different implementations of the same device type, let's say virtio-net PCI, is really difficult, because if everyone is coming up with their own device state representation then these devices won't be able to save and load each other's device states. We don't really have a way of standardizing this yet, and that's something that we're going to look at now.

One more question that I want to add before we get to addressing the VFIO migration issues: can we check migration compatibility before we actually migrate? You might think that's not too important if I have a specific destination that I'm migrating to and a specific source and I know both of them. But there are use cases where this is important, because you might have a pool of machines and a cluster scheduler that's trying to decide where to live migrate to. What's the destination machine?
Maybe only a subset of the machines in our pool actually have the capabilities to be a migration destination, because either their out-of-process device software versions are not the right versions, or we are migrating to a physical PCI device that has certain features and is only available on some of those machines.

The first approach, and the one that's traditionally taken, is to just try to migrate to a machine that we think will work. The migration might fail partway through if it turns out that the destination machine is not compatible and isn't able to take the incoming migration. But this is slow and it consumes resources, so it's not an approach that makes sense once you scale out and you have a whole pool of machines and a cluster scheduler.

Another approach might be to come up with an algorithm that automatically checks compatibility, letting the cluster scheduler confidently decide what the right destination machine is. The risk of this is that it could be complex. We need device implementers to advertise information about their devices and their capabilities, and all of this could result in a lot of metadata, so it might end up being a very complex and tedious approach. That's the risk.

Finally, we could also manually tag machines in the cluster. This is another common approach, but it's error prone. Imagine you take out a PCI device that's in one of your nodes and then you forget to update the tags. In that case the scheduler is going to make the wrong decision because its information is out of sync. So in the long run this approach is always going to lead to human errors and problems. What I'm going to present here for VFIO includes an algorithm for automatically checking migration compatibility.
Okay, so let's investigate a little bit more the challenges of migrating between two implementations of the same device type, say two virtio-net PCI devices that are implemented differently, like a hardware implementation and a software implementation, or maybe two software implementations that were written by different implementers. In order to ensure compatibility, what is it that we need? We need to make sure that the devices are the same device type. This is the obvious one. We cannot migrate a virtio-net PCI device to a virtio-gpu PCI device, because they're simply different device types; they're not compatible. But it goes further than this. They need to have the same device creation parameters in order to be compatible. Let me give you an example. Imagine you have a virtio-net PCI device with 64 queues. If you try to migrate to a virtio-net PCI device that only has one queue, the migration won't make sense; it won't work. So the device creation parameters actually need to be the same as well, not just the device type. And finally, they both need to use the same device state representation, so that the binary serialized form that they exchange is something they can both deal with.

So I think we've laid all the foundations here. I'm going to go over a proposal that I've been working on for VFIO and vfio-user migration. The idea is to build on top of what's already there. Today you can just migrate point to point. There's no compatibility checking, and there's no concept of a device type really. So we don't have the ability to check whether migration will succeed, and we do not have a good way to migrate between different implementations confidently, because the only way to do that is to try and see if it works, and it will fail if it doesn't.

Okay, so in this model we need two concepts. First of all, every device model has a unique string, a domain name and a path, that identifies it, like qemu.org/virtio-net-pci. Implementations that support virtio-net-pci would advertise that they support qemu.org/virtio-net-pci. The reason there's a domain name in there is that it allows you to have multiple device model definitions for the same device type. That might be useful in case the same device is developed independently, or people try to fork it and want to add new features and experiment with things. So there's the freedom to do that; we just need a unique identifier. That's the device model.

The second concept is the version string. Device models can have a set of version strings like v1, v2, v3, and so on, and this basically identifies the migration compatibility. If you are using the same version string, then you will be able to migrate from the source to the destination. If the version strings are different, even though, say, both are qemu.org/virtio-net-pci, you cannot migrate from v1 to v2 and expect it to work. So that's the purpose of the version string. It captures not just the version; it also captures the variant of the device. If you think back to the 64-queue and the single-queue virtio-net devices we talked about, for virtio-net you might just decide to define a standard q64 v1 version string to identify a 64-queue virtio-net device and a q1 v1 version string for a single-queue device. This is how you can differentiate between incompatible device creation parameters.

Okay, so how do we check for compatibility with this approach? The algorithm is trivial; it's really simple. All we need to do is query the device model and the device version string from the source device, and on the destination we just need to check whether that device model and that version string are supported. That's how we determine whether migration is possible.

The actual steps for migration are also simple. We follow the same VFIO migration approach that we had before, but we add one initial step. In order to migrate to a destination, we need to find an existing instance or instantiate a new device instance with the same device model and
version string. Then we know the migration is going to work.

I do want to add a note here about upgrading and downgrading versions, because this is one of the most important things about supporting live migration of devices. We need to make sure that it's possible to upgrade, say, your device or your hypervisor, and migrations are often used to do that. If the change you make to the device state representation is compatible, because your serialization format actually has some extensibility built into it, then you don't need to bump the version string that I mentioned. You don't need to go from v1 to v2 if it's a compatible change. If it's incompatible, then you do. So here in this example we have a version 3 which we determined was broken; it was missing something we forgot to add. We need a v4 because we needed to make an incompatible change.

In order to do that, the device needs to provide an interface for setting the version string at runtime to an older or newer one. The thing is, some of these version strings might not be compatible. You can't change from a 64-queue virtio-net device to a one-queue virtio-net device at runtime; it doesn't work that way. So the device itself can refuse a version string update, but if it accepts it, then it means it's compatible. This is how you can manually move and upgrade if you really need to do that. But a good device state representation will allow you to make a bunch of changes without having to bump the version string.

So how does this work? Well, I don't have time to go into the details of the mapping, but the proposal that I sent out recently to the mailing list shows how to map these high-level concepts to VFIO mdev sysfs attributes that the mdev drivers can expose, and that potentially VFIO core drivers will be able to expose, as well as to a vfio-user command line interface, so that vfio-user programs can also participate in the same scheme. What this does is it provides
a standard interface that allows migration tools, or management tools generally, to migrate these devices without actually having device-specific knowledge. They don't need to have hard-coded information about specific device types anymore. That's the whole point.

I'll quickly go through some of the advantages and disadvantages. We're able to check migration compatibility, which we can't do today with VFIO. We have the ability to upgrade and downgrade versions explicitly if necessary. We can migrate between implementations of a device, because we've now published a way of uniquely identifying different device types. It's also easier for implementers than rolling their own, because if you're implementing virtio-net or virtio-blk, a bunch of other people are doing that too. You might as well reuse the same device state representation and not reinvent it, not make mistakes.

The disadvantage is that it is a somewhat limited approach. Sticking to just a few of these version strings means that we can't customize the device creation parameters; they're baked into the version strings. A previous revision of this patch series actually did allow that, but it was much more complex. So what we need to find out is whether this approach satisfies enough requirements. If the community is happy with it, then it's a simpler approach that I think would make a lot of sense.

So finally, the future directions for this. For vhost-user, we need to support stateful device migration, because there are some device types that today cannot be live migrated with vhost-user. For VFIO and vfio-user, we need to finish the proposal discussion that I mentioned here and then implement it in tooling so that we can support these things. At that point it'll be possible to actually develop device state representations for the popular devices, so we can migrate between implementations of the same device type. Thank you very much.
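The compatibility check described in the talk, matching a device model and a version string, can be sketched as follows. This is only an illustration under stated assumptions: the `MigrationId` and `Host` types, the host names, and the hyphenated spelling of the version strings are hypothetical, and the sketch does not reflect the sysfs or command line interfaces from the proposal.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MigrationId:
    """A (device model, version string) pair; both names are illustrative."""
    device_model: str
    version: str


@dataclass
class Host:
    """Hypothetical pool member advertising what it can accept."""
    name: str
    supported: set = field(default_factory=set)  # set of MigrationId


def can_migrate(source: MigrationId, destination: Host) -> bool:
    # Migration is possible only if the destination supports exactly the
    # same device model and the same version string as the source device.
    return source in destination.supported


def candidate_hosts(source: MigrationId, pool: list) -> list:
    """Names of hosts a scheduler could pick without attempting a migration."""
    return [host.name for host in pool if can_migrate(source, host)]


MODEL = "qemu.org/virtio-net-pci"
source_device = MigrationId(MODEL, "q64-v1")  # assumed spelling of the version string
pool = [
    Host("host-a", {MigrationId(MODEL, "q64-v1"), MigrationId(MODEL, "q1-v1")}),
    Host("host-b", {MigrationId(MODEL, "q1-v1")}),
    Host("host-c", {MigrationId("example.org/other-net", "q64-v1")}),
]
destinations = candidate_hosts(source_device, pool)
```

Because the variant (64 queues versus one queue) is folded into the version string, a plain equality test is enough, and a scheduler can filter a pool before any migration traffic is sent.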