So, good morning. My name is Geert Uytterhoeven, and I will present to you how to get device pass-through working on embedded ARM. First, something about me. I started on real computers a long time ago as a hobbyist, on the Commodore 64 and the Amiga, and about 25 years ago I switched to Linux on various platforms. Most of that was on real hardware; there was some virtualization on an IBM mainframe at university, with VM. When I worked for Sony, I worked on many things, one of them being Linux on the Cell Broadband Engine in the PS3. That was also a kind of virtualization, because Linux ran on top of a hypervisor. About five years ago I became self-employed, offering consultancy services for embedded Linux work, and I have mostly been doing upstream kernel work for Renesas on their ARM-based SoCs. In 2001, I went to an interesting small conference here in Brussels called OSDEM. It was really great. The next year it changed its name to FOSDEM; I attended, of course, and I have continued every year. In 2004, I joined the embedded track program committee. This year, unfortunately, there is no embedded track due to some miscommunication, but next year there will be one again, and next year will be the 20th anniversary of FOSDEM, so you'd better be there. So, LEDs. I guess you all know LEDs, and you all like LEDs. If you take an embedded engineer, the first thing he wants to do is blink an LED. So I made a presentation about that, and I got the inspiration for the title, "Getting to Blinky", from a video series by an electronics consultancy company. They have a video series about making your own board with an LED using KiCad. That's really fun and cool, but this is the virtualization devroom, so I'm not going to talk that much about hardware, but about virtualization. So, how did I get to this? As I said before, I'm doing consultancy work for Renesas. They have ARM SoCs with lots of components inside, like GPIO controllers, IOMMUs, and SATA (more about that later), and they were interested in doing virtualization.
They didn't really know what the use cases were, so they said: let's start with something simple. And what can be simpler than blinking an LED? If you have an LED connected to a GPIO, you can control it from Linux user space by exporting the GPIO in sysfs and writing values to it. Yes, some people will probably comment that I should use the new character-device GPIO API, but this works fine. So that's how we can do it on real hardware. Now the question is: can we control the LED from a virtualized guest too? And that turned out to be a bit more complex, because there didn't seem to be any existing solution for that yet. So let's first talk a bit about virtualization. One of the ways to virtualize systems is using QEMU. You have real hardware, on top of that you run your operating system, which is of course Linux, and then you can run lots of applications on top, one of them being QEMU. QEMU will create virtual guest hardware for you: it will emulate CPU, RAM, and I/O blocks. On top of that you can run Linux, just like on real hardware, and applications as well. Of course, because everything is emulated, this is not that high-performance. Fortunately, we later got the Kernel-based Virtual Machine, or KVM: if your CPU has virtualization extensions, the guest can access CPU and RAM directly, without any emulation involved. KVM and the virtualization extensions make sure that the normal memory management unit is used not only to maintain separation between the OS and user space, and between multiple user processes, but also between the guest and the host, and between the multiple applications and operating systems running in the guest. But that's only for CPU and RAM. With the advent of Virtual Function I/O, or VFIO, you can also have direct access to the hardware. VFIO uses the IOMMU for that.
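The sysfs steps mentioned here can be sketched as a small shell script. This is a minimal sketch, assuming the LED hangs off GPIO 12; the real number is board-specific (gpiochip base plus pin offset), and the writes only work as root on actual hardware.

```shell
# Sketch: blinking an LED through the legacy sysfs GPIO interface.
# GPIO number 12 is an assumption; look up the real number for your board.
GPIO=12
SYSFS=/sys/class/gpio

if [ -w "$SYSFS/export" ]; then              # real hardware, as root, only
    echo "$GPIO" > "$SYSFS/export"           # make the GPIO visible in sysfs
    echo out > "$SYSFS/gpio$GPIO/direction"  # configure it as an output
    for v in 1 0 1 0; do                     # blink twice
        echo "$v" > "$SYSFS/gpio$GPIO/value"
        sleep 0.5
    done
    echo "$GPIO" > "$SYSFS/unexport"         # release it again
fi
```

The newer character-device API (`/dev/gpiochipN` plus libgpiod) replaces this interface, but the sysfs variant is what the talk uses.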
So, VFIO uses the IOMMU so that I/O devices that use direct memory access, and thus have access to RAM, can only access the RAM that is allocated to the guest and cannot interfere with the rest of the system. Because the guest accesses the hardware directly, performance can be much better. Now, there are multiple flavors of VFIO. The most common one is the PCI-based flavor, which is already used on servers. It works well; it's a mature solution, and it's fairly standardized thanks to the PCI configuration space: you have vendor and device IDs, so you can easily identify the hardware you have; the base address registers indicate which parts of the address space the hardware uses; and there are capabilities in the config space as well. If you want to use a device from a guest, you first have to make sure that Linux no longer uses the device itself on the host. So the first thing you have to do is unbind the PCI device from its driver on the host. For that you need to know the PCI domain, bus, device, and function it is represented by. Then you tell Linux that you want to bind this device to vfio-pci, and do the final binding. After that, you can launch QEMU and just pass some special options specifying the device to use, and that's it: QEMU takes care of the rest. Now, I have to say there are other flavors as well, like vfio-ap and vfio-ccw, which are used on IBM mainframes. There's also something for the ARM AMBA bus, but there doesn't seem to be any support for that in QEMU yet; maybe they use other virtualization techniques. On embedded ARM systems you typically don't have a PCI bus inside the SoC. There may be an external PCI bus where you can plug in PCI Express cards or something like that, but the on-SoC devices are described in the device tree. So there's a compatible value that specifies what kind of device it is.
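The unbind/bind/launch sequence for vfio-pci can be sketched like this. The PCI address `0000:01:00.0` and the QEMU machine options are placeholders; find the real address with `lspci -D`, and everything needs root on real hardware.

```shell
# Sketch: handing a PCI device over to vfio-pci, then passing it to a guest.
# The address 0000:01:00.0 is a placeholder; find yours with "lspci -D".
DEV=0000:01:00.0
SYS=/sys/bus/pci/devices/$DEV

if [ -w "$SYS/driver/unbind" ] && [ -w "$SYS/driver_override" ]; then
    echo "$DEV" > "$SYS/driver/unbind"         # 1. detach the host driver
    echo vfio-pci > "$SYS/driver_override"     # 2. let vfio-pci match it
    echo "$DEV" > /sys/bus/pci/drivers/vfio-pci/bind   # 3. final binding
fi

# 4. QEMU takes care of the rest (command shown, not executed, here):
echo "qemu-system-x86_64 -enable-kvm -m 1G -device vfio-pci,host=$DEV"
```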
There are special properties to specify which resources the device uses, like the register address space and interrupts. That's quite similar to PCI, but there can also be lots of other properties that may be device-specific. You can have phandles, which are basically references to other device nodes in the system, for interrupt controllers and other things. Especially for multimedia devices this can become quite complicated: you can have endpoints and things like that, and you can have subnodes describing things connected to the device. All of this is much more complex than a simple PCI device, and it's also less standardized. And so far there's very limited hardware support for this: I noticed there is support for two 10-gigabit Ethernet adapters, from AMD and Calxeda, in both Linux and QEMU, but that's about it. To export a device from the host to the guest using VFIO platform, the procedure is very similar to vfio-pci: you unbind the device, override the driver matching, and explicitly bind it to vfio-platform. After that you launch QEMU and pass it the device. That's very similar, and it sounds simple. So how does this work in practice? On the Renesas board I had, there is a GPIO block, or actually multiple GPIO blocks, and one of them is connected to an LED. It's described in DT like this: first you have the compatible values, the register range, the interrupts, and #gpio-cells to indicate that it's a GPIO controller. Those are the most important parts. There are other things, like gpio-ranges, which is related to the pin controller, but we don't care about that. And there are some clocks, power domains, and resets, but they don't seem that interesting at first. The most important part is the register range, and if you don't use inputs, you don't care about the interrupts either. So let's try that. First you tell Linux which GPIO controller you want to unbind.
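As a sketch, such a GPIO controller node in an R-Car Gen3 .dtsi looks roughly like the following. The exact addresses, interrupt, clock, and reset numbers vary per SoC and are given here only as an illustration, not copied from the slides.

```dts
gpio2: gpio@e6052000 {
        compatible = "renesas,gpio-r8a7795", "renesas,rcar-gen3-gpio";
        reg = <0 0xe6052000 0 0x50>;                  /* register range */
        interrupts = <GIC_SPI 6 IRQ_TYPE_LEVEL_HIGH>; /* only for inputs */
        gpio-controller;
        #gpio-cells = <2>;
        gpio-ranges = <&pfc 0 64 15>;          /* link to the pin controller */
        clocks = <&cpg CPG_MOD 910>;           /* module clock */
        power-domains = <&sysc R8A7795_PD_ALWAYS_ON>;
        resets = <&cpg 910>;                   /* dedicated reset line */
};
```

The `clocks`, `power-domains`, and `resets` phandles are the properties that turn out to matter later in the talk.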
You get a scary warning from the system when you unbind it, because some other GPIOs on that controller are in use for other things, but we don't care. Then you tell the vfio-platform driver to bind to that specific device, and that's it. Oops, we get an error: there should be a reset function. What's that? Let's talk about resets first. In virtualization, people want to play it very safe. The idea is that before you export hardware to the guest, you want to reset it, so it's in a known state. After the guest has used it, you also want to reset it, so you know it's in a sane state for further use. Currently that's solved by requiring a device-specific VFIO reset driver to be written for each and every device to be exported. So in the Linux kernel today there is support for resetting those two 10-gigabit Ethernet adapters, and nothing else. Can we do better? If you look at a typical modern SoC, you have lots of components, and in many cases there is actually a single reset controller that can reset all of those modules. But you have to be a bit careful, because the topology can be annoying. In case A, you just have a module with a single dedicated reset; that's fairly simple: if you assert that reset, the module is reset. But you can have a reset that's shared by two modules, like B and C. That's definitely a no-go for virtualization: if you want to export device B to the guest, you cannot reset it without also affecting device C, which may be in use on the host, or exported to a different guest, or something like that. You could have a device with multiple resets; that's also complicated, because in the general case you don't know which reset you have to assert first, and there may be ordering requirements. Or you could have two deeply interconnected devices, E and F for example, with multiple resets impacting both devices; that's complicated too.
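The unbind/override/bind sequence that runs into the reset error can be sketched as shell, mirroring the vfio-pci case. The platform-device name `e6052000.gpio` is a placeholder (on a Renesas board it is derived from the DT node's address and name), and the last write is where the missing-reset error would appear on an unpatched kernel.

```shell
# Sketch: the vfio-platform equivalent of the vfio-pci dance.
# "e6052000.gpio" is a placeholder platform-device name.
DEV=e6052000.gpio
SYS=/sys/bus/platform/devices/$DEV

if [ -w "$SYS/driver/unbind" ] && [ -w "$SYS/driver_override" ]; then
    echo "$DEV" > "$SYS/driver/unbind"           # unbind the host driver
    echo vfio-platform > "$SYS/driver_override"  # override driver matching
    echo "$DEV" > /sys/bus/platform/drivers/vfio-platform/bind  # may fail: no reset
fi
```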
Now, Linux has a reset subsystem that can reset devices, provided the reset topology is described correctly in DT. So I was wondering: do we really need a device-specific reset routine? For case A, we can just use the standard Linux reset subsystem. I wrote a patch for that; it's still not accepted upstream, because people are worried about the implications on systems with coupled reset topologies. But the solution I have will only use the reset subsystem in case A, where you have a single dedicated reset line for the device, which should be safe. With that patch we can continue and see what happens. Oops, we need an IOMMU group. As I said before, the IOMMU is what allows devices that do DMA to be used from a guest: such devices read from and write to memory, and of course you want to prevent a device controlled by the guest from accessing memory that is not allocated to that guest. The IOMMU takes care of that, very much like a normal memory management unit does for accesses from the CPU to memory. Now, in our case, the GPIO controller doesn't do DMA. Fortunately, there is a VFIO "no-IOMMU" mode in Linux, and we can enable that. This GPIO device does not use DMA, but in general, can you know whether a device uses DMA? You need an IOMMU only if the device does DMA; otherwise you can do without. For PCI, it's always assumed that an IOMMU is available, and the IOMMU hierarchy, how it's connected to the system, is well known from the bus hierarchy, so that's very simple. For platform devices described in DT, it's a bit more difficult. There is a special "iommus" property in DT, which you have to use to indicate that the device is connected to an IOMMU. If that property is present, the device uses DMA for sure. Some devices don't do DMA themselves; they use a DMA controller, which is indicated in DT by a "dmas" property.
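Enabling the no-IOMMU mode is a one-line sysfs or modprobe operation. The module parameter name is the real one in the VFIO core; the mode taints the kernel and is deliberately called "unsafe", so it should only be used for devices, like this GPIO block, that cannot do DMA at all.

```shell
# Sketch: enabling VFIO's no-IOMMU mode for a DMA-less device.
# This taints the kernel; only safe because the GPIO block does no DMA.
PARAM=/sys/module/vfio/parameters/enable_unsafe_noiommu_mode
if [ -w "$PARAM" ]; then      # real hardware, as root, with vfio loaded
    echo Y > "$PARAM"
fi
# Or, equivalently, at module load time (command shown, not executed):
echo "modprobe vfio enable_unsafe_noiommu_mode=1"
```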
And if that property is present, then you know the device uses DMA. You have to be careful there, because the DMA is actually performed by the DMA controller, so ideally that one should have an "iommus" property too; otherwise it will do DMA without an IOMMU, which is unsafe. If neither property is present, it is not safe to assume that the device doesn't do DMA. Perhaps it's a device that is not connected to an IOMMU, and then it's unsafe to use from a guest; or perhaps IOMMU support is just not yet enabled for the device in the device tree, which is also a problem. So, better safe than sorry: we only want to export devices to the guest when they have IOMMU support. But for this GPIO example, it's safe to do without one. So, with that patch to use VFIO without an IOMMU, Linux is happy, and we have the device exported to the guest. Let's launch QEMU. As I said before, for now there's only limited support in QEMU for instantiating platform devices, so I wrote some code to instantiate a minimal GPIO device node for the R-Car GPIO controller, which basically adds that node to the device tree of the guest. Then we launch QEMU and pass it the device we want to use. It fails: oops, it cannot open a VFIO file. Right, we don't have that file: there's a /dev/vfio/noiommu-0 file instead of a /dev/vfio/0 file. So while Linux has support for this VFIO no-IOMMU mode, that never got accepted into QEMU, so I'm wondering who's using it; perhaps some other virtualization solutions. Fortunately, you can find a patch in the mailing list archives to enable support for it. And after that, yes, Linux in the guest initializes the GPIO driver, and we're happy. We can control the LED from the guest, just like we do on the host. And... we wait a bit, and nothing happens. That's annoying. It turns out we forgot some important details: SoCs do power management. Typically you have lots of devices, and they're all controlled by a power controller.
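The QEMU invocation at this point can be sketched as follows. The `-device vfio-platform,host=...` syntax is how VFIO platform devices are passed to QEMU, but the device name, kernel image, and machine options here are placeholders, and this setup also needs the guest-DT instantiation code and the no-IOMMU patch described in the talk.

```shell
# Sketch: launching an ARM guest with the GPIO block passed through.
# All file names and sizes are placeholders for illustration.
DEV=e6052000.gpio
QEMU_CMD="qemu-system-aarch64 -M virt -enable-kvm -cpu host -m 512M \
 -kernel Image -append console=ttyAMA0 \
 -device vfio-platform,host=$DEV"
echo "$QEMU_CMD"    # shown, not executed, here
```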
And the power controller can turn off the power to a device when it's not in use, to save power. Usually it's embedded people who care about power consumption; in the server space they care too, but less. Of course, you cannot use a device while its power is turned off. Another thing you may have is a clock controller, which supplies a clock to each module. If the module is not in use, its clock is turned off, and of course the module won't work without that clock. You can have really complex systems with hierarchical power areas, clock domains, and whatnot, and that's actually what we have on the Renesas embedded SoCs. So we need to control the clock domain, the power domain, and so on, because if the device is off, the hardware manual calls accessing it undefined behavior. If you're lucky, like in my case, nothing happens; if you're unlucky, it can cause exceptions and even crash the whole system, including the host, which is definitely not something you want. So how do you control clocks? Many drivers still do that using explicit clock management, but that doesn't work well here, because Linux has this nice system called runtime PM, which takes care of it automatically. For power areas there's no other option than to use runtime PM: you cannot just tell Linux to power on a specific power area. So the solution was to add runtime PM support to the VFIO platform driver in Linux. That one did get accepted upstream, and after that, we have clock control. One question to ask is: can't we delegate this to the guest? The guest knows best when the device will and will not be used, so it could power the device off when it's unused, things like that. That sounds interesting, but since getting it wrong may crash the system on some SoCs, it's not a good solution. So what did we get? It's working, great: we have an LED that works. Now, does this work for input buttons too?
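With runtime PM support in the VFIO platform driver, the effect can be observed from the host side through sysfs, since the kernel exposes each device's runtime PM state there. A small sketch, with the device name again a placeholder:

```shell
# Sketch: inspecting the runtime PM state of the exported device.
# "e6052000.gpio" is a placeholder platform-device name.
DEV=e6052000.gpio
STATUS=/sys/bus/platform/devices/$DEV/power/runtime_status
if [ -r "$STATUS" ]; then    # only on real hardware with the device present
    cat "$STATUS"            # "active" means clocks and power are on
fi
```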
And yes, it did; we needed a patch that fortunately was being developed in parallel by somebody else. Can we get this to work for other devices as well? Some people tried, and it works well. I tried it for Serial ATA, and then I discovered some other issues: it only works with standard ARM IOMMUs in the system, so we need patches for that, and the SATA driver still uses explicit clock management instead of runtime PM. All things we can fix. So we can add device support in QEMU, but all of that required adding new code. Can we do it in a generic way? I wrote code that tries to do it generically, just by copying the device node from the host to the guest. The reg and interrupts properties are remapped. All simple properties can be copied from the host; basically, simple properties are the ones that do not contain phandles, so they don't refer to other nodes and can be copied as-is. Unfortunately, DT does not provide type information for properties, so that was a bit difficult. The solution I took was to copy only properties whose size means they cannot contain a phandle (a phandle is a 32-bit cell, so for example a property shorter than four bytes cannot hold one), or properties that are on a whitelist. Properties handled by the host, like power domains, IOMMUs, resets, and pin controls, can be ignored. Clocks I reject, unless they are handled by power management. Subnodes: no, that's too complicated, but for simple devices it works. So I have a patch for that, and it worked with the Serial ATA device I had: it transforms the device node shown for the host into the one below it for the guest. So, our conclusions. On SoCs we usually don't have PCI, but we do have IOMMUs, and people are interested in virtualization, so it makes sense to use device pass-through there for high-bandwidth devices. For simple devices, I've shown that we can have generic code to handle it.
For more complex devices you will still need device-specific instantiation code, but probably the best approach there is to provide some helpers so that it's easy to write those routines. There are still some small difficulties to handle. Clocks are not always used just for power management: some drivers really need to know the clock rate, and maybe need multiple clocks; for that we may need some other solution. This does not yet support devices that don't do DMA themselves but use a DMA controller. And then there's the typical question: is your hardware suitable for virtualization at all? Are there devices that share resources, so that you cannot export one of them to the guest because it would interfere with another device using the same resources? We really have to teach the SoC and board designers to take virtualization into account: which devices are suitable for exporting to a guest? Maybe the new DT binding schemas can help there. Now, back to GPIOs for one more minute. In the real world, GPIOs can be useful for guests as well: you may have a guest that wants to control a relay or power switching on the host. But device pass-through is probably not the best solution there. I did a proof of concept using emulation, where I took the existing PL061 GPIO controller that's part of the standard QEMU ARM virtual machine and wrote support for a libgpiod back end on Linux. You can then just tell it on the command line, for example, that you want to map GPIOs 11, 12, and 13 of the specified host GPIO controller to the virtual GPIOs 0, 1, and 2 of the PL061 in the guest. Probably we want something better than emulation: some paravirtualization that mimics the libgpiod API would be good to have. And we may want to expand this to similar subsystems, like the pulse-width modulation (PWM) subsystem in Linux, which exports PWM outputs to sysfs in a similar way as GPIO did.
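The host-side operations that such a libgpiod back end would perform can be sketched with the stock libgpiod v1 command-line tools. The chip name and line numbers are assumptions matching the example in the talk.

```shell
# Sketch: driving host GPIO lines the way the PL061-with-libgpiod-backend
# proof of concept does, using the libgpiod v1 CLI tools.
CHIP=gpiochip0
LINES="11=1 12=0 13=1"
if [ -e "/dev/$CHIP" ] && command -v gpioset >/dev/null 2>&1; then
    gpiodetect               # list the GPIO chips on this machine
    gpioset "$CHIP" $LINES   # drive host lines 11, 12 and 13
fi
```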
And then we can do motor control and control RGB LEDs. So, basically, that's it. Thanks and acknowledgments. Questions? I still have time to take them. Yes? "You mentioned in your talk that Renesas were interested in virtualization but didn't know the application. Did you find it?" No, not really. So the question was that Renesas was interested in virtualization, and whether I have found out in the meantime what they really want to do; so far we haven't. They have many customers, and we're not always aware of what the customers actually want. But... Thank you.