I'm Michael Tsirkin. I work at Red Hat, and I'm the chair of the virtio technical committee, so that's what I'm going to talk to you about today: our work over the last year. Our greatest achievement is the release of virtio 1.2, so I'm going to talk about that, but also about what didn't make it into the release, what we're still working on, which will hopefully be included in the next version. And yes, I used Stable Diffusion to generate a bunch of illustrations for this; hopefully it will be fun, a little silly.

You might be thinking: 1.2, didn't we talk about this a year ago? Yes, and I told you then about a bunch of new devices that we introduced. It's a long list: file system, RPMB, IOMMU, sound, I2C, GPIO, memory, SCMI, and a ton of new features as well. What I don't want to do is just repeat all that, so I'm going to talk about the new things. It's kind of hard to decide on the order, so I just went chronologically.

And without further ado: it's three years of work. It took us a long time to prepare that release, but now we're committed to yearly releases, and hopefully this kind of commitment will help us keep to the schedule. It's the biggest release of virtio so far: nine new device types; we almost doubled the number of supported devices. This is because we are trying to address new challenges, new use cases: not just cloud, not just pass-through of hardware features, but also automotive, Internet of Things, desktop, and mobile. Lots of organizations participated. People definitely reviewed each other's work, and a bunch of features are collaborations between multiple companies and multiple people.

Here's one new device that I didn't talk about a year ago: persistent memory. It's probably most useful if you have a DAX device on the host; it then integrates with the DAX subsystem in the guest. But it also solves the problem of double caching.
If you have storage in the guest, you'll have a cache in the guest. You also have a cache on the host; they're not synchronized, which is often suboptimal. One way to address that is to say: all right, let's not cache in the guest. That's what this device does for you. Contributed by Red Hat. Of course, you still have to forward cache flushes from the guest, because whether they're needed depends on the workload and on the application, so there's support for that.

Then the I2C device: I talked to you about it a year ago, but it has been kind of rewritten, mostly simplified. We ripped out support for submitting multiple requests as part of a single buffer in the virtqueue, and now each request is just a single buffer, which is much simpler. If you have a complex transaction that you want to submit, consisting of multiple requests, there's support for chaining them together. This work was done by Linaro.

The GPIO device has also been enhanced, not rewritten, and the big item here is support for events. Say you have a physical device and someone changes the voltage level on one of the pins. You want to report that to the guest, and now that's possible using a special event queue, so the virtio GPIO device now becomes a source of interrupts.

Okay, the IOMMU. The big item here is support for bypass. That's important for several reasons. One is performance, really: sometimes you have a trusted device, a trusted application within the guest, and you don't want to waste time looking up all these translations and permissions. So you can now bypass these translations for specific devices in the virtual IOMMU. But it's also important for legacy guests and for booting with firmware; this has been important for supporting x86. The reason is that these guys often just don't have a driver for the IOMMU at all. Previously, by default, you would always disable everything.
Devices don't get any access, and you have to bring up a driver, and the driver has to set up bypass to allow devices access to specific areas of guest memory. Now you can opt in to enable bypass by default, then you boot up and eventually bring up a driver that limits access to specific areas of guest memory. Also work by Linaro.

Secure erase is probably most useful for pass-through. This is for when you want to permanently delete some data on the device. If you have hardware that supports it, you can now expose this to the guest, and the guest can decide: all right, I want to erase this specific block permanently. Work by Intel.

Okay, vsock. For a while, the only mode it supported is called stream. You are really just sending bytes to the device, or getting bytes from the device, and it's a stream of bytes; there are no specific boundaries. Even if you create a message and put it on the socket, it will be a stream of bytes on the other side; multiple messages get glued together. In other words, there is reliability, but message boundaries are not preserved. SOCK_SEQPACKET is a new mode where message boundaries are actually preserved. Another thing that we added is support for disabling stream mode if you like, so a device can support seqpacket, the old stream mode, or both. Work by Kaspersky; I guess they somehow need these message boundaries to be preserved in their application.

Now we're getting to changes in the core of the virtio specification. Historically, if you have virtqueues, you select a bunch of them and then you enable them: you say, okay, these virtqueues are enabled, these are not. Once a virtqueue is enabled, that's it; you cannot disable it again until you reset the whole device. Likewise, once you take some buffers and put them in the virtqueue, you cannot take them back. If you want some buffers back, you reset the whole device.
Resetting the whole device is a really, really slow operation. First of all, you have to set everything up again; that takes a lot of time, lots of VM exits. Besides that, it's also complex for the driver: the driver has to maintain a shadow, a copy of the device state in its memory, so that it can restore it after the reset, and that's complex. Another use case is when you've kind of gotten stuck. Maybe the virtqueue is not progressing; you're guessing there's some kind of bug, maybe on the host, maybe in the device, maybe in the driver; they've lost synchronization. Again, we could not get out of this situation without a whole device reset.

The new functionality is that you can reset individual virtqueues. It's already being used for resizing virtqueues, actually: if your queue is suddenly full, you say, all right, let's create a bigger one. And correctly sizing queues is pretty important, because queues that are too big fill up the cache, while queues that are too small tend to underrun. So that's all good. There was a bit of drama around this feature, because it was all set to go into the spec, we actually merged it, and then suddenly the TC decided to change the register interface. So we had to push the release out by a couple of weeks so we could get it right this time. This work was done by Alibaba, with feedback and changes from the NVIDIA guys.

Yeah, that was a quick summary; I think I went a bit too fast. So, what's going to be in the next release? It's kind of hard to know the future, but the way the virtio spec process works is that you reserve some IDs or feature bits, then you work on them, and then you submit your spec proposal using those IDs. This avoids conflicts, but it also gives us a hint about what people are working on. So let's just take a look. Here's a huge list of new devices. My crystal ball seems to be fuzzy, so I'm not sure they will all be in 1.3, but here's the list.
I'm pretty sure the Chromium guys are working on the video encoder and decoder. This is an active project, so I hope it will be in 1.3. The simulated wireless device has not seen any progress in a while, so I'm not sure; this was proposed by Intel. The Nitro security module is a kind of TPM, proposed very recently; I hope it will make it. That's work by Amazon, and I'm very happy that Amazon joined the effort. Then we have a watchdog device. This will detect whether the guest is responsive: you just make the guest do a little bit of activity with the device, and this way, you know. Then we have the automotive applications: that's the Controller Area Network, the bus used in a bunch of cars. That's work by OpenSynergy; they have been very active in the community. There are parameter server and audio policy devices. I don't know what they do, but they were proposed by Google, and when the spec lands on the list, I hope to find out. And finally, Bluetooth. I'm not sure; I'm hopeful. The driver has been in Linux for a while, but it's not well maintained, and we've been hearing "real soon now" for a while. I hope that we get it in 1.3.

So these are just the new devices, and I don't know much about how exactly they will operate, except the encoder and decoder; the spec there has been on the list for a while. But we also have a bunch of new features in existing devices. Zoned block devices are a kind of hardware beast where the block device is partitioned into zones, and each zone can only be written sequentially. If you have a device like that on the host, you will now be able to expose it to the guest as a special virtio block device, once that lands, hopefully in 1.3.

We keep working on improving performance for the virtio network device, and one such proposed feature is header split. Sometimes it's beneficial to split the transport header from the payload on incoming packets; that's what it does.
The header can go into one page and the payload into another page, and they are separated like this by the virtio device when incoming packets are written to memory.

And more performance features. Interrupt coalescing is really difficult to support with a software device, because it requires timers to run, and programming timers on the host requires VM exits, so you lose the performance that you might have gained by coalescing the interrupts. But now that we have hardware offloads for virtio, hardware devices can actually implement interrupt coalescing without these issues. That's where this feature comes from. Another virtio network performance idea is GSO for tunneled packets, because GSO creates batches of packets, and you can pass a single batch through the networking stack, improving performance.

Then, one of the things that has been going on, on the driver side of things, is improving support for secure encrypted virtual machines. That means we are adding a bunch of code in the drivers to validate the data given to the driver by the device. If you just look at the initialization sequence, the way it works right now is that you find out which features the device supports, then you acknowledge them with the FEATURES_OK status bit, then you go and read the related data from the device's configuration space. What we would now like to support better is validation of this data that we are getting from the device. This means we're working to add more documentation of what the expected values are, so the driver can validate them and actually bail out if something is wrong. There is also a question of forward compatibility: what if we specify that these are the legal values right now, and then we kind of extend them down the road? How is that going to work? I'm going to talk about that a little bit later, in one of the following slides.
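To give a feel for the hardening idea, here is a toy sketch of a driver validating device-supplied configuration data and bailing out on anything out of range. The field names and legal ranges are invented for illustration; they are not the actual virtio configuration layout, just the shape of the check:

```python
# Toy sketch of driver-side validation of device-supplied config data.
# Field names and legal ranges are invented for illustration; a real
# driver validates the fields its device type's spec defines.

LEGAL_RANGES = {
    "queue_size": (1, 32768),
    "num_queues": (1, 65535),
    "mtu":        (68, 65535),
}

def validate_config(config: dict) -> None:
    """Raise (bail out) if the device handed us nonsense."""
    for field, (lo, hi) in LEGAL_RANGES.items():
        value = config.get(field)
        if value is None:
            raise ValueError(f"device config missing field {field!r}")
        if not (lo <= value <= hi):
            raise ValueError(f"{field}={value} outside legal range [{lo}, {hi}]")

# A well-behaved device passes validation...
validate_config({"queue_size": 256, "num_queues": 2, "mtu": 1500})

# ...while a buggy or malicious device is rejected instead of trusted.
try:
    validate_config({"queue_size": 0, "num_queues": 2, "mtu": 1500})
    accepted = True
except ValueError:
    accepted = False
assert not accepted
```

The point is simply that the driver treats the device as untrusted input, which is the new assumption under secure encrypted virtualization.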
Then we started working on migration, which was previously kind of out of scope for the virtio spec. But some people actually want to build SR-IOV devices which are virtio devices. So let's just take a look at this picture. Normally you have a physical function; the hypervisor is connected to that, and it's driving the physical function plus multiple virtual functions, and then virtual machines are connected, each virtual function normally to one virtual machine. Now, what if we want to migrate one of these VMs? We need to take the state of the virtual function, save it, and send it over to the destination of the migration. And that means that suddenly you need the hypervisor to drive the virtual function.

Now, one of the issues is that we have the IOMMU in the picture, and that limits us: it limits the virtual function, preventing it from accessing the memory of the hypervisor. That's actually an important part of the security guarantee of SR-IOV. So with that in mind, how is the hypervisor going to drive the virtual function to get its state? One of the ideas is: okay, let's just go through the PF. The hypervisor will send commands to the PF, and through the PF drive the VF. These came to be called admin commands. The VFs are then members of a group, and each group has an owner, which is the PF in this case. The PF has a special virtqueue, which we call the admin virtqueue, and we send commands through that virtqueue. We have a standardized format to say which VF, which group member, we are talking about, and to get back some standard return codes.

And that seems to be useful beyond just migration. For example, there are standard things that you want to do from the hypervisor side, like attaching a VLAN to a specific VM so that all networking traffic of this VM goes inside this VLAN. That's not something you allow the guest to do; you actually want to do it from the hypervisor side.
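To make the group/owner shape concrete, here is a toy sketch of an admin command flowing through an owner's admin virtqueue. The structure is loosely modeled on the idea described above; the field names, opcodes, and status codes are made up, not the spec layout:

```python
from dataclasses import dataclass

# Illustrative only: fields and opcodes are invented to mirror the
# group/owner idea, not the actual virtio admin command format.
@dataclass
class AdminCommand:
    opcode: int           # what to do (e.g. fetch VF state for migration)
    group_type: int       # which kind of group (e.g. SR-IOV VFs)
    group_member_id: int  # which VF within the group
    payload: bytes = b""

STATUS_OK, STATUS_EINVAL = 0, 1

class PhysicalFunction:
    """Group owner: executes admin commands on behalf of its member VFs."""
    def __init__(self, num_vfs: int):
        # Pretend each VF has some opaque state blob to migrate.
        self.vf_state = {i: b"state-%d" % i for i in range(1, num_vfs + 1)}

    def admin_vq_submit(self, cmd: AdminCommand):
        # The hypervisor talks only to the PF; the PF acts on the member VF,
        # so the VF never has to touch hypervisor memory through the IOMMU.
        if cmd.group_member_id not in self.vf_state:
            return STATUS_EINVAL, b""
        if cmd.opcode == 0:  # hypothetical "get VF state" command
            return STATUS_OK, self.vf_state[cmd.group_member_id]
        return STATUS_EINVAL, b""

pf = PhysicalFunction(num_vfs=4)
status, state = pf.admin_vq_submit(
    AdminCommand(opcode=0, group_type=0, group_member_id=2))
assert status == STATUS_OK and state == b"state-2"
```

The design choice this illustrates is indirection: commands addressed to one device (the owner) talk about other devices (the members), which is what later makes the same mechanism reusable for sub-functions.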
So that's another use case for these admin commands. Or, say, whether you want to allow this VM to program its own MAC, or you want to limit it to a specific MAC address. But once we solved these questions, another use case turned up, and this has to do with scalable virtualization. Scalable virtualization is a technology pioneered by Intel where you have a single device and you can partition it into sub-functions, and there could be millions of these guys. One of the implications is that suddenly using memory-mapped registers for the control path, for programming these guys, becomes very wasteful, because you would have to spend megabytes or even gigabytes of memory just to maintain that control path data on the device. Because of that, it would be much better if we could do all of this control path over a virtqueue.

So, thinking back, we said: all right, wait a second, it's the same thing, right? We have these sub-functions; those are members of a group. We have the device itself; that's the owner of the group. So we'll have this virtqueue and we'll send commands over it: sending commands to one device, talking about other devices. Hopefully we'll reuse these admin commands for this use case as well. This is called virtqueue as a transport: all of the control path operations from the driver can be emulated, trapped, and translated to virtqueue operations by the hypervisor, or we can expose the virtqueue directly to the driver; it's kind of up to the system designer. But from the spec point of view, we really have this new transport, which is a virtqueue, so you can nest virtio devices inside each other. And since it's a virtqueue, conceivably you can use one of the virtqueues inside the sub-function as a transport again, so it allows infinite nesting, which is nice.

To talk to you about the next feature, I need to give you a bit of background. Currently, virtqueues support two modes of operation.
One is called in-order, the other out-of-order. In-order is very simple: you submit a bunch of buffers, and when the last one has been processed, you are notified. With out-of-order, you get notified about each buffer, which is more expensive. But on the other hand, if one of the buffers is blocked, by a fault on the PCI Express bus for example, then it's no problem: we can just proceed to the next one, and eventually we'll get back and process the one that caused the fault. So it's nice, but again, it's a little bit wasteful: you have more of these notifications, so it's not good from that point of view. Faults are hopefully rare, yet we are paying for all of these notifications all the time. So the idea is: how about we start out doing things in-order, with a single notification for multiple buffers? If we get a fault, we switch and start getting individual notifications, but once things stabilize and there are no more faults, then we can hopefully go back to in-order. That's another feature we are working on.

Then there's a bunch of new features driven by hardware changes. People are doing embedded; some people really want small and cheap devices, and these sometimes struggle to process all the complex data structures required by the virtio spec. So we are adding features where you can limit the device in a bunch of ways, and limit the driver accordingly, so that you can create these cheap and simple devices. For example, limiting the number of indirect descriptors that you can stick in the virtqueue. Again, we are trying to work out some better approach to forward compatibility. We are thinking maybe there will be another stage during feature negotiation: you initially set some feature bits, then you go and discover what the device supports, then you do all of the validation, and if it didn't work out, maybe you can turn off some of the feature bits and then try again.
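A very rough sketch of what such a renegotiation loop might look like. This is purely illustrative: the staged negotiation is still being worked out, and the feature names and the validate step here are invented, not taken from the spec:

```python
# Illustrative sketch of feature renegotiation with fallback.
# Feature names and the validation rule are invented; the actual
# staged negotiation for virtio is still under discussion.

DEVICE_FEATURES = {"indirect_desc", "event_idx"}            # what the device offers
DRIVER_WANTED = ["indirect_desc", "event_idx", "ring_reset"]  # what the driver supports

def validate(features: set) -> bool:
    # Stand-in for driver-side validation of device behavior under the
    # chosen feature set; pretend this one combination exposes a bug.
    return not ({"indirect_desc", "event_idx"} <= features)

def negotiate() -> set:
    # Start by requesting everything both sides support, then turn off
    # feature bits one at a time until validation passes.
    chosen = [f for f in DRIVER_WANTED if f in DEVICE_FEATURES]
    active = set(chosen)
    while not validate(active):
        if not chosen:
            raise RuntimeError("no acceptable feature subset; give up")
        active.discard(chosen.pop())  # turn a bit off and try again
    return active                     # here a driver would set FEATURES_OK

print(negotiate())
```

The interesting part is the loop: today's negotiation is one-shot, while this sketch retries with a smaller feature set instead of failing outright, which is the forward-compatibility idea described above.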
This all ties together with all of the validation and hardening that I talked to you about previously. We also want to hopefully not break things for people.

And that's about the last slide. What we did recently is switch most of the tracking to GitHub. If you want to contribute something, you're supposed to create a GitHub issue so we can track it, but the discussion still takes place on the mailing list. It's a little bit awkward; some of this is just because our hands are tied: our bylaws require that things happen on the mailing list. To tie things together, when you post your patch, you include a link to the GitHub issue. This way we'll remember that this thing exists and is still work in progress, and eventually you ask for a vote. There is a README file in the GitHub repository for the spec that has all of the small details about how to do this. We also have the development mailing list, and that's for questions from people who are developing drivers and people who just want to discuss specific implementations. It's not a big deal if you post to the incorrect mailing list, but still.

And that's it. Thank you, any questions?

Question: For the list of new devices that you mentioned, what's the state of driver development inside the Linux kernel?

Okay, let's see. virtio-mem, we have a driver. pmem, we have a driver. Bluetooth, we have a driver. I don't think we have drivers for GPIO and I2C yet. RPMB and SCMI, I don't remember; I think we don't yet. File system, we have a driver. IOMMU, we have a driver. You'd have to take a look at the list. Sound, we don't have a driver yet.

Question: Quite related to that slide, are there also some things that have been proposed in the past that have completely ceased, that are not alive anymore? Or are all of these still alive and coming in the future?
These are still very active, yes.

Okay. Thank you very much.