Welcome to the first day of the track here, and let me introduce Alexey Kardashevskiy, talking about VFIO, Virtual Function I/O. Thank you.

So my name is Alexey. I work in the LTC department of IBM, in the Canberra-based team, and I'm part of the PowerKVM team, which runs KVM, the kernel-based virtual machine, on PowerPC processors and servers manufactured by IBM. Most of my time I spend supporting QEMU in this environment, which is the user space tool to run KVM. And part of my job is to support VFIO, Virtual Function I/O, virtual PCI functions, in the KVM environment. And this is what this talk is about.

Why are we doing this? When this project was started, the only hardware we had was simple network adapters, or VGA, or USB. So we take a single device, pass it to the guest, and this is how we test. But this is not what we want in the end. In the end we want to support really powerful PCI hardware such as SR-IOV devices, which are basically big PCI cards with a built-in Ethernet or SCSI switch. And the good thing about them is that they can create virtual PCI functions on the fly. So on the host you run some proprietary tool and it creates as many PCI functions as you want. And you can pass every single PCI function to a different guest and get maximum performance from the guest. The competitor to this VFIO technology is virtio, with vhost and macvtap. But I still believe that VFIO is a little bit faster, a little lower latency, and it loads the CPU less. So we still want to use it even if we have virtio, vhost and all of this stuff.

A little bit about terminology, because you might not be familiar with all of this. QEMU is a quick emulator, a user space tool which fully emulates a lot of different architectures: the CPU, the hardware, memory accesses, all of it. Nowadays it supports an accelerated mode where it can interact with KVM and use KVM to run the guest a lot faster. So it's like 99% of users use KVM with QEMU, but not 100%. KVM is the kernel-based virtual machine, a driver in the Linux kernel which can accelerate a specific guest. It actually accelerates only the CPU and memory accesses; it doesn't do anything about hardware or peripherals, that's still up to QEMU. VFIO is a technology to provide user space access to real PCI hardware. A lot of effort was put into making it secure, to make sure that if one guest gets some PCI hardware, it cannot do any harm to other guests or to the host, because we want our guests to be fully isolated and able to run even malicious code, so the host won't suffer. Also, most of the users of VFIO are QEMU and KVM, but not 100%: it still works in fully emulated mode, and it can also be used for some weird user space applications which require low latency networking, fast trading, something like this, which we would call a user space driver. But this talk is more about QEMU.

One more thing I will be referring to during this talk is PAPR. We are working on the PowerPC architecture, and it has two sub-architectures: one is embedded PowerPC and the other one is server PowerPC. I'm working on server; embedded works a bit differently, and this talk has nothing to do with it. So today I will use PAPR to mean server PowerPC. And the specific thing about this architecture is that our guests are paravirtualized. That means they know all the time that they are guests. They expect the hypervisor to provide some configuration about memory, devices and everything like that, and the hypervisor provides an API for different services like PCI bus discovery or interrupt handling.
And in order to make it all work: well, if devices didn't do DMA, then we wouldn't need anything special, it would just work. But today even graphics adapters want DMA for some reason. That means the device should be able to access main memory without interacting with the CPU. Every operating system expects this mapping to exist in some form, and we have many operating systems on the same host but one PCI bus, so we need to provide some way to map every guest's memory onto the actual PCI bus. And this is what an IOMMU is used for.

Two more things to mention before I show... Awesome, it crashed. It's incredible. I love LibreOffice. This crashed LibreOffice. Awesome.

Here is kind of an example of the system. On the left there is a big box representing a PowerPC CPU, a POWER8. And on the right you can see several PCI cards; in this example it's an Intel 1000 two-port adapter. This adapter actually has two PCI functions: when you do lspci you actually see two devices, and each port gets one function. And when we started, you could pass every single function to separate guests. Then we enforced the limitation that we can only pass the whole adapter in the slot to one guest, because if we pass it to different guests, we have to trust the adapter that one function cannot access memory available to the other function and thereby corrupt the other guest.

So the IOMMU group term was introduced to describe this. The easiest way to think about an IOMMU group is as a pluggable PCI slot. It includes the PCI root complex on the host, in the CPU in this case, and the slot. And if you have a PCI bridge behind the slot and multiple functions, this all goes into one group. You can pass the whole group to the guest; you cannot pass a random function. User space can read which device went to which group by looking at /sys/kernel/iommu_groups (sketched in code below). Every group has an ID, which is a zero-based integer. Devices get assigned to a group by the platform code when it does PCI discovery or PCI hotplug; there is no generic rule. For example, on x86, multiple PCI functions are implemented right on the chip or chipset, and each of them gets a separate group, because we know that these functions don't interact with each other and it's safe.

Another thing was introduced called the VFIO container. On x86, it would be an IOMMU domain, which is a feature supported by x86 CPUs, Intel or AMD. The difference between PowerPC and Intel here is that from the outside it looks the same: a bunch of PCI Express links, well and good. But internally, on PowerPC we have multiple PCI host bridges, totally independent PCI buses. On Intel, they all live under the same virtual bridge which sits on the CPU. So they have one PCI domain, we have multiple PCI domains. I'm not sure why they keep doing this; that's the way it is. So in order to separate some PCI functions, some slots, some PCI links from this PCI network, Intel introduced IOMMU domains. We can tell the hardware: please create a domain and put this IOMMU group into this domain. Then it sets up an IOMMU translation table, programs it into this virtual domain, and runs the guest with this domain. On Power, we don't have to do this; instead, we have the container. The container thing is a pure software abstraction. On x86, AMD, Intel, Freescale and many other architectures it does this domain thing; in our case it just creates a translation table and programs it into every PHB (PCI host bridge) we want to use with the guest. So in this picture, we use PCI slot number one and number two.
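As an aside, here is a minimal sketch (not from the talk) of what that sysfs layout looks like from user space: it walks /sys/kernel/iommu_groups and prints which PCI devices ended up in which group. The paths are the standard kernel layout; everything else, including the lack of real error handling, is just illustrative.

```c
/* Minimal sketch: list IOMMU groups and the devices assigned to each one
 * by walking the standard /sys/kernel/iommu_groups layout. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    const char *root = "/sys/kernel/iommu_groups";
    DIR *groups = opendir(root);
    struct dirent *g, *d;

    if (!groups) {
        perror(root);
        return 1;
    }
    while ((g = readdir(groups))) {
        char devpath[512];
        DIR *devs;

        if (g->d_name[0] == '.')
            continue;               /* skip "." and ".." */
        printf("IOMMU group %s:\n", g->d_name);
        snprintf(devpath, sizeof(devpath), "%s/%s/devices", root, g->d_name);
        devs = opendir(devpath);
        if (!devs)
            continue;
        while ((d = readdir(devs))) {
            if (d->d_name[0] == '.')
                continue;
            printf("\t%s\n", d->d_name);    /* e.g. 0000:01:00.0 */
        }
        closedir(devs);
    }
    closedir(groups);
    return 0;
}
```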
We pass them to the guest. So in the guest we have kind of a proxy PCI device and a virtual PHB; this is what the guest gets to see. And this container thing handles DMA operations like map and unmap. They are not addressed to the device; they are addressed to the container, which represents the IOMMU.

A few words about the VFIO driver stack. There are basically three modules running. One is the vfio-pci module; it's a PCI stub driver, and when you do lspci you will see vfio-pci. Then the vfio module, which implements containers and also provides user space access to the IOMMU groups. And the IOMMU driver, which is not visible from user space. Anyhow.

So basically how it works: you unbind the device from the existing driver, whichever it is, nouveau, nvidia. Then you bind it to vfio-pci and check that it's bound. Now we are ready to go, and we run QEMU. On x86 it's as simple as this. On sPAPR we create another virtual PHB for the VFIO devices, but one for all of them; it's more for convenience of debugging, there is actually no fundamental reason why we couldn't do the same thing x86 does. There used to be a limitation, and what we have upstream now still has this limitation, but I'm reworking the stuff to remove it.

How does this all work? When we start QEMU with this, QEMU actually takes the device and looks at /sys/kernel/iommu_groups; there is a symlink in the device's sysfs directory to the IOMMU group folder, so it knows which group it is. Then it... where is it? Right here. Then QEMU opens /dev/vfio/vfio to get an empty container. Then it opens the group, /dev/vfio/<group ID>, and while it's open, no one else can open it; this way we make sure that we have exclusive access to the group. Using a special API, QEMU attaches the group to the container, and it makes sure that every single function in the group is bound to vfio-pci and not to some other driver. And it runs. (There is a rough sketch of this sequence in code below.)

Inside QEMU, to emulate a PCI device you have to care about four things here. You have to provide config space access, to let the guest actually discover what's on the PCI bus, how many devices there are, what they can do, what the BAR registers are. You have to provide interrupt handling. These two things work the same way as for emulated PCI devices in QEMU, and you just reuse this stuff for VFIO. BAR accesses are either emulated or mapped into the guest directly; we don't need mmap for emulated devices. Again, this works the same way for all architectures; we didn't have to bother about it.

DMA is the problem. And there is also a problem with big/little endian, because on PowerPC we support both big endian and little endian hosts and guests, and various combinations. Kind of tricky. The trickiest part was where exactly to do the byte swap, which is actually not really a technical issue. DMA works differently for fully emulated guests, which x86 guests are, and for paravirtualized guests, which our guests are. The difference is that an x86 guest doesn't expect to see any IOMMU; there is no API for this. So the IOMMU basically maps the whole guest onto the PCI bus the way the guest expects to see it. At that moment the entire guest RAM gets mapped and pinned, so guest pages stay in RAM all the time. In our case, the guest gets to see an IOMMU. It receives some information about a DMA window, which is an address range on the PCI bus where DMA is allowed, and it can use hypercalls such as H_PUT_TCE and others to make the actual mappings. I believe when this was introduced, it was considered a cool feature that you don't have to pin the entire guest in RAM.
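Below is a condensed C sketch of the sequence just described (open the container at /dev/vfio/vfio, open the group, attach it to the container, pick an IOMMU backend, and query the sPAPR DMA window), loosely following the example in the kernel's VFIO documentation rather than QEMU's actual code. The group number 26, the device address and the 1MB mapping are made-up placeholders, and most error handling is omitted.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
    int container, group, device;
    struct vfio_group_status status = { .argsz = sizeof(status) };
    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
    struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };

    /* 1. An empty container, which represents the IOMMU context. */
    container = open("/dev/vfio/vfio", O_RDWR);
    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
        return 1;

    /* 2. The group; holding it open gives us exclusive access. */
    group = open("/dev/vfio/26", O_RDWR);
    ioctl(group, VFIO_GROUP_GET_STATUS, &status);
    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
        return 1;   /* some function is still bound to another driver */

    /* 3. Attach the group to the container and pick an IOMMU backend:
     *    VFIO_SPAPR_TCE_IOMMU here, VFIO_TYPE1_IOMMU on x86. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
    ioctl(container, VFIO_IOMMU_ENABLE);    /* sPAPR only */

    /* 4. On sPAPR, ask where the (small) default DMA window lives. */
    ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);

    /* 5. DMA map/unmap is addressed to the container, not to the device;
     *    here we map 1MB of anonymous memory at the start of the window. */
    map.vaddr = (unsigned long)mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    map.size  = 1 << 20;
    map.iova  = info.dma32_window_start;
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

    /* 6. Finally, a file descriptor for the device itself. */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
    return device < 0;
}
```

On x86 the flow is essentially the same, except that the backend would be VFIO_TYPE1_IOMMU, there is no separate enable step, and there is no fixed DMA window to query.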
So with this scheme you can still swap guest pages out if they are not registered as DMA pages. But today it creates more problems than it solves. Well, time runs fast.

The problem with this window is that it's quite small: it's 2GB, with 4K page granularity, which is really fine-grained. If you run a 64GB guest, it can only use 2GB of it for DMA, and it has to map and unmap all the time, which has an effect on performance. Recently we got dynamic DMA windows, where we can create another, huge window and map the entire guest onto the PCI bus at some higher address, like 0x8 followed by many, many zeros. And we can choose the page size for the IOMMU and all of this stuff. And it works just fine, except it only works for devices which are capable of 64-bit DMA addressing. This device, for example, cannot do that normally, without hacks.

I have some time left, so I will complain a bit about performance on sPAPR. Why is this small DMA window slow on our system? Normally, the guest would do a DMA alloc, and that would basically be a kmalloc; the guest can take any page and use it for DMA, because it expects it to be mapped onto the PCI bus and to just work. And the proprietary IBM hypervisor called PowerVM, the thing which we are replacing with PowerKVM, works in real mode, and when it gets this hypercall it handles it really quickly, so there is no performance degradation there either. But on KVM it's slightly different.

So here is the sequence. Every time the guest does a DMA alloc, it causes a hypercall. KVM traps it with the MMU switched off, so it's in real mode. Then it has to switch to virtual mode, then it has to pass this request to QEMU. QEMU does the translation of the guest physical address to the host physical address, because the actual translation table contains real host addresses, because the actual hardware doesn't care whether it's a guest or the host. Then QEMU does an ioctl to the kernel again, and only then does the host driver update the actual IOMMU (TCE) table. And we have to pin pages while doing all of this, and it takes time. I ran some tests on a 10 gigabit Ethernet card, where I would expect something like 1050 megabytes per second, and without any acceleration I could only get 180 megabytes per second. That's because of this really slow handling of the DMA alloc call.

So I had to take an extra step, because normally, as I mentioned, VFIO doesn't have anything to do with KVM; it just provides user space access, maps everything, and that's it. On PowerPC that's not enough, it's too slow, because we have these hypercalls. So we have to implement some optimization in real mode and virtual mode. That's the biggest difference, I believe.

I was told I have to have a conclusion and a plan. There is no conclusion. I'm planning to push everything upstream again, for the 20th time, and maybe I'll change my code to be more like x86 and avoid having the additional virtual PHB if I don't have to. Does it make any sense to any of you? So we have quite a bit of time for questions.

This diverts a little bit into the space of KVM being 99% of the cases and there being a few others. I work a lot in the finance space, where there are a lot of low latency type solutions, and there's a trend towards moving device drivers into user space and just talking directly to devices via memory mapped I/O. A bit slower, please. I work a lot in the low latency space and there's a trend toward writing device drivers in user space and just talking directly to the buffer on the device via memory mapped I/O. Is this a mechanism by which you could do that as well?
Could you use this for writing user space device drivers? Yes, you can write a user space application, get the exact PCI device, and you don't have to use interrupts; you can do polling and get the maximum out of it. Everything is just mapped and handled by hardware (see the BAR-mapping sketch at the end). That's that 1% of VFIO users which we never hear about, but we always remember these people.

Intel has a toolkit called DPDK. I think it also works together with VFIO. Can you make that work across platforms? The Data Plane Development Kit, it's a toolkit to write user space device drivers. As far as I know it works based on VFIO, or on whatever way device access works directly from user space. I would be very interested in whether this toolkit would work across different platforms, on PowerPC as well as on Intel. Christoph was just talking about something called DPDK, which is the Data Plane Development Kit that Intel has. It's a framework for doing user space drivers. You're right, someone added VFIO support to that. But from a Power perspective, I'm pretty sure we've submitted patches, I'm not sure if they've been merged yet, but you can run it on Power. With that and VFIO, you should be able to build on that on top of a Power box. Do we want to support it? It's an Intel project. Okay, it's open source, we can compile and run it. Anyone else?

What do you see as the typical use cases? A PowerPC box with 2,000 cores in it and a bunch of PCI devices, each of them with SR-IOV capability, and you can run like 100 guests, and each of them has a PCI device in it, and it doesn't bother the host with all of this, and you can run a really fast network in it. It's not for enthusiasts, I believe. The other use: the maintainer of it is Alex Williamson, and he keeps trying to run guests with a 3D adapter and then run games in those guests. It kind of works. With every new driver version or hardware version of NVIDIA cards or ATI cards there are new tricks; they detect KVM and stop working in KVM. But that's more for fun, not for making money.

That brings me to a question about a project which I've seen. It's not open source currently, but may eventually be, where they're doing... Could you please speak louder? Where they're doing essentially router offload, so they've got two NICs, and they go via an NVIDIA card to do content detection and straight out the other NIC. So you have three PCI devices in one group. Yep. What's your support for going directly between devices? I didn't get the question, sorry. I gathered that you're talking about having... On Intel, do you have a group with lots of devices in it, or is each one separate? No, on Intel every group has one PCI device. I mean, from the CPU it's a bunch of PCI Express links, and basically every link connected to a slot gets a separate group. Right, so that would make it quite difficult to implement what I'm talking about, whereas on Intel one group would be easier, right? We can talk about that afterwards. I think there are no more questions to be asked, so thank you for your talk. Thank you.
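For reference, here is a minimal sketch of the BAR-mapping part of such a user space driver, using the standard VFIO region ioctls on a device file descriptor obtained as in the earlier sketch. The function name and the assumption that BAR0 holds the interesting registers are purely illustrative; a real driver would also check region sizes and access flags more carefully.

```c
/* Sketch: given a VFIO device fd (from VFIO_GROUP_GET_DEVICE_FD), mmap BAR0
 * so a user space driver can poll device registers directly, no interrupts. */
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

void *map_bar0(int device)
{
    struct vfio_region_info reg = {
        .argsz = sizeof(reg),
        .index = VFIO_PCI_BAR0_REGION_INDEX,
    };

    if (ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg))
        return NULL;
    if (!(reg.flags & VFIO_REGION_INFO_FLAG_MMAP))
        return NULL;    /* this BAR can only be accessed via read()/write() */

    /* The region's offset into the device fd tells mmap where BAR0 lives. */
    return mmap(NULL, reg.size, PROT_READ | PROT_WRITE, MAP_SHARED,
                device, reg.offset);
}
```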