So, let's get started on my session on hardware-friendly vhost-vDPA. This talk continues a discussion we had with the vDPA vendors, in particular a meeting where the hardware vendors gave feedback asking us to propose more advanced APIs for the device model, as well as the new API.

First I'm going to look at the problems with the current implementation, that is, the software-based shadow virtqueue implementation. Then I'm going to propose some new API additions to restore the device state in a way that is device-neutral and also future-compatible. I'm also going to present a way to resume the device and get it recovered, which requires some amendments to the PCI transport specification. And last, I will try to come up with an update to the device model, so hopefully it can benefit any future endeavor to migrate between a hardware vDPA device and a QEMU software device.

I think most people here are already very familiar with vDPA. vDPA is hardware acceleration for virtio; basically it's a hardware implementation. The idea is that we only pass through the data path, not the control path, so that we can enjoy the benefit of offloading to hardware. The good thing about vDPA is that it gives you a very good way to implement live migration: for one thing it has good performance, and for another it can leverage whatever software is in the guest, the virtio driver, which is very common.

The shadow virtqueue is the way to implement live migration for vDPA with a software-based solution. It has no dependency on the hardware, because it just reuses the existing interfaces that are common between the software implementation of devices and hardware vDPA.
The idea comes from DPDK's implementation, which intercepts used ring accesses, which involve page write accesses. The original idea was just to intercept the used ring, but the actual implementation of the shadow virtqueue also intercepts the doorbell kicks and the interrupts, so that you get full mediation. That allows emulation and also translation between different ring layouts, so that if we want to support a legacy device on top of a modern vDPA device, it's still possible.

The shadow virtqueue works for migration like this: on the left-hand side you can see the normal case, without live migration, and on the right-hand side the case once live migration has started. The shadow virtqueue switches from passthrough mode to mediation mode. It sits in the middle between the device and the guest, translates all of the data path requests, and relays the responses back to the guest.

As we just mentioned, because the current interface for virtio is software-based, there are a lot of implications. For example, there's no resume support at the virtio level, so all restoration of the state needs to go through a device reset, then relay the state from the migration stream and replay the control queue messages. The example I want to demonstrate here is the network device: if we want to apply those control messages, we need a fully functional control queue, and that depends on DRIVER_OK and also on the queue-enable status.
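To make that dependency concrete, here is a minimal toy model of the restore-by-replay sequence. It is only a sketch under stated assumptions: `ToyNetDevice`, `restore_by_replay`, and the command names `mac_add` and `vlan_add` are hypothetical illustrations, not a real virtio, vDPA, or QEMU API. It shows that control commands cannot be replayed until DRIVER_OK is set, and that in this model the data queues are live from that same moment.

```python
class ToyNetDevice:
    """Hypothetical stand-in for a virtio-net style device (not a real API)."""

    def __init__(self):
        self.reset()

    def reset(self):
        # A device reset wipes all state, including what the control queue set up.
        self.driver_ok = False
        self.data_queues_running = False
        self.mac_table = []
        self.vlans = set()

    def set_driver_ok(self):
        # DRIVER_OK makes the control queue usable, but in this model the
        # data queues go live at the same moment.
        self.driver_ok = True
        self.data_queues_running = True

    def ctrl_cmd(self, cmd, arg):
        if not self.driver_ok:
            raise RuntimeError("control queue is not functional before DRIVER_OK")
        if cmd == "mac_add":
            self.mac_table.append(arg)
        elif cmd == "vlan_add":
            self.vlans.add(arg)
        else:
            raise ValueError(f"unknown command {cmd!r}")


def restore_by_replay(dev, saved_cmds):
    """Reset the device, set DRIVER_OK, then replay saved commands one by one."""
    dev.reset()
    dev.set_driver_ok()
    for cmd, arg in saved_cmds:
        dev.ctrl_cmd(cmd, arg)
    return len(saved_cmds)  # one control-queue round trip per saved command


# Example: state that the source side recorded as control commands.
saved = [("mac_add", "52:54:00:12:34:56"), ("vlan_add", 10), ("vlan_add", 20)]
dev = ToyNetDevice()
round_trips = restore_by_replay(dev, saved)
```

In this model the number of round trips grows with the amount of state to restore, and the data queues are already running while the replay is still in progress.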
The queue-enable status: once DRIVER_OK is set, the queues are already fully running, so that will conflict with any other requests you have in parallel. That is not very typical. It kind of works for the networking device, but maybe it's not a good way to restore device state through this kind of interface. So that's one of the limitations, because the spec has no resume support. On the other side, suspend support has been added to vDPA as a new API, which is kind of out of spec, but I think there is some ongoing work to add support for that to the virtio spec as well, through the device stop interface. So that is a good starting point for thinking about whether we have a more efficient way to implement the other part, the resume.

So basically, for this kind of sequence we can see that it's not very efficient to implement the data path enablement. In particular we see this problem with the NVIDIA device, because they have a different implementation from virtio, given that vDPA itself is supposed to be vendor-specific. It has to apply some tricks to work around their hardware limitation, so it's also kind of out of spec. It kind of works, but it's not ideal. This slide just illustrates the problem I described before. The other requirement comes from live update for QEMU.
We have a very committed use case for QEMU live update, and the work from our company, Oracle, is ongoing for this project. We want a very efficient way to live update QEMU, and the core ask is that we really want sub-second latency. Because of the hardware limitation, it's probably not very easy to implement a very efficient way to achieve this goal, because of all the cycles: to apply those control messages, if the device has a lot of queues, you have to do it many times.

Another thing about live update is that it has different modes. One is the exec mode: you don't need to reboot the kernel, you just update QEMU. The device state in the kernel is still there, and you just have to restore the software state in QEMU. But with the reboot mode you have no such option: you have to resort to the reset and then replay all of the control messages. So we really want the resume interface to be applicable to this reboot mode, so we don't have to go through the full cycle.

Another burden comes from the device model. Today's virtio device model is totally software-based, and the vDPA device is not directly exposed but rather acts as a device backend for virtio. Because of this deviation between hardware and software, the migration stream for virtio is actually governed by the QEMU-specific software implementation, not by the virtio spec. So if we want to build a pure hardware device for vDPA, we really need to define some self-contained virtio state excluding all of the software state. But this is not defined by the spec, so we are going to see whether there's a way to abstract this interface for hardware devices, so that maybe in the future it's possible to migrate between different device models. OK, so here we
come up with the idea of how to do that. Basically we need to separate the device state into device-level state, virtqueue-level state, and feature-based state. The working sequence is like this. We want to add another API for resume, which is backed by a backend feature, the resume backend feature. Resume also requires as input the migration state that was saved at suspend time, and that is backed by another backend feature. So this is the working sequence: you get the size during suspend and save the state somewhere, and on the destination you just put the device into the suspended state and load all of the state from there.

These are the types for the different kinds of device state, and I can share some examples here. The idea is very straightforward: we just put TLV (type-length-value) headers on top of the device state. When it comes to parsing the state, say on the migration destination, they can just look at the header without having to know the length in advance, and they don't have to parse the state in a specific order. The other good part is that the pieces are completely independent and composable: a piece of state can be optional, there is no required fixed set of state, and you can have a variable length for the device state in the migration stream.

This is an example of device-level state for a block storage device, which can have its custom device state for in-flight data. And this is the virtqueue state, which is very similar in concept to the existing fields in the migration stream. The config space is pretty straightforward: we can just duplicate the config structures in the config state. And there's also some guest-invisible config state
which is not present in the config space; it's actually configured through control queue messages. So we can encode these control queue messages in the type of the state, including not only the feature but also the class and the command, making it a sub-feature. That makes it more compatible with the case where some hardware vendors implement only one or a few of those control messages, so we have a sub-feature-level means of identification. These are examples of how to represent this with the MAC table and the VLAN table.

OK, once we have this kind of API and the data structures defined for saving device state, we can also present this on the PCI transport, so that it is more self-contained at the spec level, and it will also benefit nested guests. It's also possible to extend this to other transports, such as the admin queue or the transport virtqueue, depending on the need.

Backward compatibility. To work with software virtio, we have to implement this new interface for migration state, but we have to stay compatible with the older software implementation. For any migration side that has an older QEMU running, we can do some translation work to make the state match the existing migration state of the QEMU software implementation, and the new QEMU can finally just transfer the real, more efficient device state blob as is in the migration stream. We also need to take care of compatibility between different vDPA versions. The idea is that new versions will be able to recognize the old version's state, with its shorter lengths. But an old version that doesn't understand some of the longer lengths or the types will just fail the resume for that specific request. So the basic
assumption is that the admin user has to match the two sides based on the features that we find in the device state. If they find a match, there should be no circumstances in which we end up incompatible or get errors in migration.

OK, last, I'll try to summarize the benefits of both the shadow virtqueue and our resume API. The good part of the shadow virtqueue is that it works with the existing spec, and it doesn't need to implement all of that translation between the old API and the new API, so I think that's a particularly good benefit. The other benefit is that it has a good way to translate between different ring layouts. The downside is that it involves some CPU usage, though only during migration time, and it will have some slight impact on performance.

The resume API addresses that, because it has no overhead: it works in passthrough mode, so you don't need to mediate the ring accesses, and together with device-assisted dirty tracking you can live without any software-based mechanism. But on the other hand, we really need to listen to vendors' feedback on whether it is feasible for them to implement the interface. Also, the current interface is very vendor-specific, I guess. I would like to see more devices and more vendors share their needs, because I think resume is very easy for them, it's not a new concept, but the device state might not be easy for them to implement. That's what I have for today. Any questions?

Hi, Michael.

Great. So you're building a rich interface for vDPA to support migration, and there's now some effort on the virtio side to add migration there. I wonder whether you think it's reasonable to share the structures used for that?

OK, the question was
whether it's possible to share the same structure between software virtio and vDPA, right? I think that's doable. The difficult part is that the QEMU-specific implementation currently has to live with the migration stream data, which is software-based. But I think there might be ways to make that optional, such that when we come to migrate between different device models, we just need some new flags stating that, OK, I want to be more compatible with the hardware, so I only care about the standard device state backed by the virtio spec rather than the software state. That's one way I can think of.

The other way is that we can gradually migrate our software-based model towards the hardware model, or introduce new options to enable that. The model would be that the device itself is still software-based but has a hardware backend, a kind of vDPA backend, and we gradually move the device model to a new API with that migration stream. It would have some transition, so that you can first migrate to the newer QEMU and after that switch to the new mode that is only compatible with hardware vDPA. I won't go into the full details of whether that's applicable, but those are the two ways I can think of.

Hi. You mentioned you're interested in hardware vDPA. Have you talked to any of those vendors about this?

Yes, that was a good question. The question was whether we've talked to any device vendors for feedback. Currently we partner with NVIDIA for their devices, and we also have regular meetings with other vendors like Intel and Xilinx, and they also provide very good
feedback during the meetings. The thing is that currently the shadow virtqueue may work sufficiently well to get very decent performance, so I'm not sure that practically we need to use this big hammer of enhancing the spec, because that will take a very long cycle to get into the spec. So maybe the more feasible plan is that we come up with the API first, which is vDPA-only, then we summarize all of the feedback from the vendors, and maybe get started by implementing this interface for only a few device types, such as storage and network devices. Then we can see whether this is applicable to other devices, like crypto or file system devices, because they have more implications on that part. That's a very good question.

Hi. This approach is basically VFIO-like migration. Have you considered just exposing an interface to get the state, instead of saying you have to serialize and deserialize the state?

Yeah, that was a very good question. So basically the question was whether there's a way to avoid serializing or deserializing the device state in the migration stream. I think there was a proposal from NVIDIA to use the admin queue to do that. I think the first obstacle is that right now we have the software device model, so we have to live with the migration stream. That means we still have to keep this new API compatible with the old migration stream. So to be practical, I think we need to get started from there. But if we come up with a new model, say a generic hardware device for vDPA, we can go that way, maybe not depending on the deserialization but on some new layout for the migration stream. Probably, yeah.
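For reference, the TLV framing of device state described earlier in the talk, which is what this question about serialization refers to, could look roughly like the following. This is a minimal sketch, not the proposed wire format: the header layout (little-endian 16-bit type, 32-bit length), the type numbers, and the function names are all assumptions made for illustration. It shows records being parsed in any order, and a receiver failing the load when it meets a type it does not understand.

```python
import struct

# Assumed header layout: 16-bit type, 32-bit length, little-endian.
HDR = struct.Struct("<HI")

# Hypothetical type numbers, for illustration only.
T_DEVICE_STATUS = 1
T_VQ_STATE = 2
T_NET_MAC_TABLE = 3


def encode(records):
    """Pack (type, value-bytes) pairs into one blob; order is up to the sender."""
    return b"".join(HDR.pack(t, len(v)) + v for t, v in records)


def decode(blob, known_types):
    """Walk the blob record by record; fail on truncation or unknown types."""
    out = {}
    off = 0
    while off < len(blob):
        if off + HDR.size > len(blob):
            raise ValueError("truncated header")
        rec_type, length = HDR.unpack_from(blob, off)
        off += HDR.size
        if off + length > len(blob):
            raise ValueError("truncated value")
        if rec_type not in known_types:
            # An older receiver that does not know this type fails the resume.
            raise ValueError(f"unknown state type {rec_type}")
        out.setdefault(rec_type, []).append(blob[off:off + length])
        off += length
    return out


# The sender may emit records in any order and omit optional ones.
blob = encode([(T_VQ_STATE, b"\x05\x00"), (T_DEVICE_STATUS, b"\x0f")])
state = decode(blob, {T_DEVICE_STATUS, T_VQ_STATE})
```

Because every record carries its own length, a receiver can walk the blob without knowing the layout of each value in advance.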
That's doable, but the real question is whether there is any use case for that. With software we have a lot of benefits, such as falling back to user space, but I'm not sure whether there is any use case for a hardware-only device. So that's my answer. Any more questions?

This one? This one? Or the last one? Oh, I see what you're talking about. Actually, the CPU usage is the CPU cost that gets introduced by some other component, which is not there in normal mode. Say you have a workload running without migration, and you get X percent CPU on the host side. If migration kicks in and the CPU increases to Y percent, that gives you the evidence that the CPU usage has increased. I think for the DPDK implementation, they mentioned their CPU usage bumps up to 40 percent at peak time. So it depends on the workload that you have.

Yeah, that's a very good point. I think the background here was that we asked whether there's a way to improve the shadow virtqueue performance, because we care about the latency and also the PPS performance. I think zero copy has a very limited impact on the throughput, right? But mostly it's the latency and the cost of having to translate and mediate in between. For some workloads that cost is not negligible. So that's somewhat of a concern, and it needs to be benchmarked and backed up by some data. We'll see.

OK. Yeah, that's good. We can probably take it offline, because I think we've already run out of time. All right. Thank you for today's session, and enjoy the rest of the sessions today.