Hello. My name is Martin Koenig and I work in the Wind River Technology Office on emerging technologies, especially those relating to intelligent edge compute. We have recently been using KVM and doing some unusual things with VirtIO that I want to share with you all. I work for a company that has a rich history and experience helping other companies build embedded systems. Back in the day, traditional embedded systems were often designed, implemented, and tested by the same team, and then manufactured and shipped, never to be seen again. Nowadays, system architects prefer a divide-and-conquer approach using software partitioning, due to the amount and diversity of code that must be integrated into intelligent systems. Some of the needed services and applications are custom written, some might be licensed from independent software vendors, but the majority nowadays are likely built using open source. Regardless of where those services and applications come from, it is now common for them to require their own operating system instance to avoid host contamination caused by conflicting operating system configurations, such as subsystem version dependencies. This is why host operating systems are now integration platforms that can run multiple separate runtime instances, whether they're containers or virtual machines. This is desirable not only for better understandability and manageability of the larger amounts of software, but also for security reasons, to enable the provisioning of dedicated compute resources to certain subsystems, and for fault containment purposes. This is one reason why hypervisors like KVM are such an important part of intelligent edge devices.

I know this is a Linux Foundation event and we're all Linux aficionados here, but I also want to zoom out and quickly underscore why Linux is so important to intelligent edge compute by reviewing some relevant megatrends. Firstly, edge devices are connected devices, and connected devices rely on large amounts of open source to securely communicate with the internet. When we need to implement all of the connectivity protocols, manageability features, middleware framework services, and applications required to deploy edge devices, we're at a point where we need to integrate large amounts of open source to build solutions in a timely manner, and that open source is written for Linux. Secondly, the hardware enablement code provided by the selected silicon and board vendors is for Linux, and since hardware is becoming so complex and low-level code is so expensive to develop, it is easiest to just use Linux for deployed systems. And finally, porting code from Linux to other operating systems is increasingly problematic, because as Linux becomes richer and more optimized, APIs and services specific to Linux are increasingly used at all levels of the software stack. As Linux becomes good enough for more use cases, open source increasingly becomes Linux open source. Long story short, intelligent edge devices, especially those built around complex heterogeneous multicore processors, will increasingly contain an instance of Linux. Meanwhile, the intelligent edge requires low-latency reactivity for many categories of devices, from 5G connected infrastructure to autonomous mobile devices, robots, and various cloud-connected control systems. So if intelligent edge devices will have a copy of Linux in them, where will the real-time and safety-related workloads run?
Well, the easiest solution is to just run them natively on Linux, when Linux can meet the latency requirements. This is done by provisioning dedicated memory and compute resources for those workloads and making sure they are not interfered with by other resource hogs. We can call this software-based workload partitioning, because we use software techniques in Linux to partition the computer so that real-time and safety workloads can make progress in bounded time. Another approach is to put Linux in a virtual machine on a hypervisor that has real-time or safety features and run the real-time or safety workloads beside Linux. We can call this virtualization-based workload partitioning, because a hypervisor is required to partition and enforce the virtual machines, and in some cases to schedule them too. It is becoming more common to see multiple compute islands instantiated in the same SoC. They're basically multiple complete computers, and they're often connected with shared memory so they can collaborate. We can call this physical partitioning, since there are dedicated physical computers for the real-time or safety workloads. For completeness, there is another way to partition hardware to achieve a hard real-time engine beside Linux, which is to offline cores from Linux and reactivate them with a real-time workload running from some dedicated physical memory that Linux doesn't manage. We've been calling this whiteboard partitioning, since you have to figure out ahead of time exactly which compute resources and devices will be used by which partition to make sure they don't step on each other, and that planning is typically done on a whiteboard. This approach gives you a lot of rope, so be careful with it.

Here's a potential use case for a real-time workload that runs beside Linux using KVM. Time-sensitive networking (TSN) is a technology that makes it possible to physically separate sensing, control, and actuation while still enabling them to operate together with very low latency. When an application mandates a very low worst-case latency, it may have been implemented with an RTOS on a dedicated system. The question is, can that real-time workload be brought into the Linux environment, run using KVM, and still retain good real-time performance? If we pin the virtual CPU thread onto a reserved core, run an RTOS on it, and map the real-time device, in this case the TSN NIC, into the virtual machine using VFIO, it is possible. To understand why, consider KVM running a real-time workload on that reserved core. If 100% of the instructions are executing inside the virtual machine, those instructions are all written using hard real-time techniques, and they are using a mapped-in real-time device, hard real time should be achievable. So the question becomes: when is that core not running instructions inside the virtual machine, and can those instructions be minimized and made deterministic? To achieve that, the payload needs to avoid hypercalls and minimize traps to KVM. Let's come back to that a bit later and zoom out now to consider more scenarios for running real-time and safety workloads with Linux.

This is a bit of an eye chart, but it's another way to see the landscape of scenarios for partitioning real-time and safety workloads with Linux. The software partitioning approach boots Linux on all the cores, and then we use Linux features to isolate a real-time workload, typically on its own core.
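As elaborated in a moment, one way to do this is from a user-level process with CPU reservation. Here is a minimal sketch of that kind of pinning; the core number and real-time priority are assumptions chosen purely for illustration, and a real deployment would also keep other work off that core, for example with the isolcpus boot parameter or with cpusets:

```c
/* Minimal sketch: pin a SCHED_FIFO thread to a reserved core.
 * Core 3 and priority 80 are assumptions for illustration only. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *rt_loop(void *arg)
{
    (void)arg;
    for (;;) {
        /* poll the real-time device, run the control loop, etc. */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    cpu_set_t mask;
    struct sched_param sp = { .sched_priority = 80 }; /* assumed priority */
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);

    /* Pin the real-time thread to the reserved core (core 3 here). */
    CPU_ZERO(&mask);
    CPU_SET(3, &mask);
    pthread_attr_setaffinity_np(&attr, sizeof(mask), &mask);

    if (pthread_create(&tid, &attr, rt_loop, NULL) != 0) {
        perror("pthread_create (real-time priority usually needs privileges)");
        return 1;
    }
    pthread_join(tid, NULL);
    return 0;
}
```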
This can be done with a user-level process using CPU reservation features to pin one or more threads to a core. That process could also be a unikernel, which helps reduce the number of system calls into Linux. It could also be a KVM-based virtual machine with one or more vCPU threads running under CPU reservation. We'll call this soft real time, because it can achieve reactivity but it's hard to guarantee it, since there is some non-determinism in Linux and the core can still enter the Linux kernel for system calls, traps, exceptions, and interrupts. The whiteboard partitioning approach leverages the Linux CPU hotplug feature to take a core offline (a minimal sketch of that step appears a bit further on), and then we can restart it with a new payload in physical memory. That payload could be a native polling loop, although we've been using an RTOS running under a separate hypervisor that only runs on the offline cores, to protect Linux in case the workload crashes. This scenario can achieve hard real time if the payload is hard real time, but it isn't safety capable, since Linux can still interfere with the hypervisor and thus with the workload, so you wouldn't want it to be a safety workload. The third scenario here is often referred to as a mixed-criticality scenario, because if you use a safety-capable hypervisor, you can mix safe and unsafe workloads on common hardware that the hypervisor partitions. And finally, we have the compute island scenario, where you have Linux running on a first CPU cluster and, beside it, a real-time or safety workload running on a secondary computer in the same SoC.

All right, here is some general guidance on how to integrate real-time and safety workloads with Linux, depending on your requirements. If your use case only needs tens of microseconds of latency on average, and it can recover from an occasional missed deadline, then it's perfectly reasonable to run Linux across all the cores and use CPU reservation to dedicate a core to a soft real-time thread in a user-level process, or to a vCPU thread, perhaps running an RTOS in a VM as mentioned. Keep in mind that it may be difficult to map your real-time device into the user-level process or virtual machine, and your mileage may vary with respect to the best approach, depending on the device in question. If you need microsecond-level hard real time or safety with no missed deadlines, then you need to run your workload either on a dedicated compute island, or in a virtual machine running on a real-time or safety hypervisor, and let Linux run beside it.

Okay, this is where KVM and VirtIO come into the picture more. For the workload partitioning scenarios that have Linux with auxiliary runtimes, we need ways for those auxiliary runtimes to collaborate and integrate with Linux. In particular, we want them to be able to send any printf or console output to Linux, and we want them to be able to read and write files from Linux file systems when those file systems are available. And since those auxiliary runtimes are likely more brittle to program on than Linux, we probably want to run most of the applications on Linux and only have the real-time or safety functions running on the auxiliary runtimes. That means we will need to split our applications across Linux and the real-time or safety auxiliary runtimes, and thus we need some way to send messages between them. The traditional way to do that in the cloud is to use TCP/IP between the partitions; for edge devices, a WAN protocol like TCP/IP is a heavyweight approach.
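Coming back for a moment to the whiteboard partitioning scenario mentioned above, the core-offlining step it relies on is the standard Linux CPU hotplug interface in sysfs. Here is a minimal sketch, assuming core 3 is the core being donated; restarting that core afterwards with a new payload is platform specific and not shown here:

```c
/* Minimal sketch: take a core away from Linux via CPU hotplug.
 * Core 3 is an assumption for illustration; requires root, and the
 * core must support hotplug. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/system/cpu/cpu3/online";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    /* Writing "0" asks the kernel to offline the core. */
    if (fputs("0", f) == EOF) {
        perror("fputs");
        fclose(f);
        return 1;
    }
    fclose(f);
    printf("cpu3 taken offline; it can now be restarted with a dedicated payload\n");
    return 0;
}
```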
Since TCP/IP is heavyweight for this, it is interesting to have an IPC mechanism that is better optimized for local communication, and for shared memory in particular. VirtIO to the rescue. It's already available for Linux and also for many of the runtimes that could be deployed as auxiliary runtimes beside Linux. VirtIO has an open specification and flexible connectivity options. It has AF_VSOCK, which is particularly nice for local IPC. It can, in theory and now in practice, be run over shared memory without a hypervisor, and it provides not only low-level device access but also higher-level services like file systems and IPC.

The hypervisor-less VirtIO proof of concept that we developed at Wind River for 64-bit Intel and ARM processors targets all the partitioning scenarios we just discussed where Linux runs beside auxiliary runtimes, and it provides the workload integration features mentioned. Instead of using hypercalls down to a hypervisor-based VirtIO backend to notify it of virtqueue or config register changes, we use interrupts over to a Linux daemon process where the VirtIO backend is implemented. In our project that daemon is a modified version of kvmtool, and it maps in the shared memory that the VirtIO front end uses for the virtqueues and data buffers. kvmtool was also modified to handle vsock and virtio-net vhost offload for scenarios not involving KVM-based workload virtualization.

Here's the generalized hypervisor-less VirtIO architecture, where Linux owns the general-purpose cores and devices, and the auxiliary runtimes run on dedicated real-time or safety cores with just the devices required by their real-time or safety workloads. Each auxiliary runtime is a VirtIO front end with its own chunk of shared memory, its own notification mechanism in hardware, and its own kvmtool-based daemon as a VirtIO backend. The interesting aspect of this is that this architecture is common to all the partitioning scenarios that involve auxiliary runtimes, whether software partitioning using CPU reservation, whiteboard partitioning using core offlining, virtualization-based partitioning using a real-time or safety hypervisor, or physical partitioning with compute islands. It is helpful to have a common approach and common APIs that can unify multiple system architectures to enable software and knowledge reuse.

Here you can see the differences between standard VirtIO and hypervisor-less VirtIO. With hypervisor-less VirtIO, we have to locate the virtqueues and buffers within the shared memory; otherwise the backend can't see them, because there is no hypervisor-based backend that can map and access guest memory. Also, when important device configuration feature bits and device status bits are changed by the guest, the hypervisor in the standard configuration gets a trap to detect it, whereas for hypervisor-less VirtIO the auxiliary runtime has to kick the backend with an interrupt so that it is aware of the change. The other difference is that the VirtIO device configuration now includes information on where the device-specific shared memory regions are located. Hypervisor-less VirtIO uses a device tree blob to share the VirtIO device information with the auxiliary runtime. The shared memory region is laid out with a device tree fragment that describes the VirtIO devices, followed by the device configuration descriptors, as per the VirtIO specification, and then the shared memory for the virtqueues and buffers.
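As an illustrative sketch of that ordering, and nothing more, the shared memory region could be modeled like this in C. The names and sizes are assumptions for illustration and not the actual layout or sizes used by the proof of concept:

```c
/* Illustrative only: shows the ordering just described (device tree
 * fragment, then device configuration descriptors, then virtqueues
 * and data buffers). Sizes and names are assumptions. */
#include <stdint.h>

#define HVL_DTB_SIZE     (16 * 1024)  /* assumed room for the device tree blob   */
#define HVL_DEV_CFG_SIZE ( 4 * 1024)  /* assumed room for virtio config structs  */

struct hvl_shared_memory {
    uint8_t dtb[HVL_DTB_SIZE];         /* device tree fragment describing the virtio devices   */
    uint8_t dev_cfg[HVL_DEV_CFG_SIZE]; /* device configuration descriptors, per the virtio spec */
    uint8_t vrings_and_buffers[];      /* remaining space: virtqueues and data buffers          */
};
```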
It turns out not a lot of shared memory is needed by VirtIO, although for auxiliary runtimes on some resource-constrained compute islands this could still be considered a significant amount. In that case, it would be possible to only configure vsock and run the console and 9P file system over vsock, leaving only about 64 kilobytes or so of shared memory needed for this configuration, plus the size of the device tree blob. Here you can see where memory is actually shared between the auxiliary runtime and the Linux-based kvmtool daemon. This is shown for each of the scenarios involving auxiliary runtimes, namely core reservation, core offlining, the mixed-criticality hypervisor, and compute islands. With VirtIO set up like this, an auxiliary runtime with a small RTOS or executive can offer ANSI standard I/O for the purposes of output and file access, and also POSIX socket APIs for AF_INET or AF_VSOCK IPC. Our approach to enable socket-based IPC was to start with AF_INET over VirtIO, and then switch over to use AF_VSOCK, which really reduced the amount of code in the auxiliary runtime while at the same time increasing IPC performance, which we measured as about a 10x improvement over TCP/IP.

Basically, the way hypervisor-less VirtIO works is by using hardware interrupts between the runtimes to signal changes to the shared memory. When an interrupt arrives at the Linux kernel, it notifies the physical machine monitor (PMM) using an eventfd. The PMM then determines why it is being notified and handles the request. The PMM also relays notifications to vhost when the request can be offloaded, for vsock or virtio-net. As mentioned, during the development of the proof of concept we realized that once an auxiliary runtime has vsock, it does not need virtio-net for stream socket communication. To achieve that transparently to Linux, it is possible to modify kvmtool to proxy TCP/IP ports to vsock ports on a one-to-one basis in either direction, so that Linux can use TCP/IP or vsock, and the auxiliary runtime can use just vsock and still be reachable over TCP/IP from Linux. The auxiliary runtime services we think are worth running over vsock are a GDB server for debugging, a shell for command-line access, a 9P file system over vsock instead of directly over VirtIO, and any other client-server or pub-sub application connections in either direction. We also want to note that such an approach might be worthwhile to minimize code for safety-certifiable systems, although there will have to be some changes to how virtqueues are implemented so that fault propagation cannot occur from a Linux-based backend to a safety-critical front end.

A side note worth mentioning is that when we adapted kvmtool's MMIO-based VirtIO transport for hypervisor-less use, we noticed that it was not as fast as kvmtool's PCI-based VirtIO transport when used as intended with KVM. When we dug in, we determined it was because the backend was taking a lot of traps due to front-end VirtIO register accesses, which also required the backend to do more work to determine why it was being interrupted. By removing those traps and using MSIs (message signaled interrupts) in the MMIO transport, like the PCI transport does, we were able to double the performance of the MMIO transport in kvmtool, which benefits both KVM deployments and hypervisor-less deployments. You can see that in the numbers here: with MSIs, the backend gets twice as many notifications during the test cycle, which reflects the doubling of the bandwidth.
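As a concrete illustration of the AF_VSOCK-based IPC described above, here is a minimal sketch of a Linux-side client connecting to a service in the auxiliary runtime. The CID and port values are assumptions for illustration, since the actual values depend on how the backend daemon is configured:

```c
/* Minimal sketch: a Linux-side AF_VSOCK stream client.
 * CID 3 and port 9999 are assumptions for illustration only. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

int main(void)
{
    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
        .svm_cid    = 3,      /* assumed CID of the auxiliary runtime */
        .svm_port   = 9999,   /* assumed port of the service */
    };
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("vsock connect");
        return 1;
    }
    /* From here on it behaves like any other stream socket. */
    const char msg[] = "hello from Linux\n";
    write(fd, msg, sizeof(msg) - 1);
    close(fd);
    return 0;
}
```

Once connected, it behaves like any other stream socket, which is why so little application code has to change when moving from AF_INET to AF_VSOCK.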
Comparing the updated VirtIO MMIO transport performance to the VirtIO PCI transport performance, you can see that with MSIs, VirtIO with the MMIO transport is at about the same performance level as VirtIO with the PCI transport. The interesting aspect of this is that the MMIO transport implementation is a lot simpler than the PCI transport implementation and is also more compatible with ARM and RISC-V processor architectures, as they don't have standardized PCI controllers like the Intel architecture has. So the conclusions from this work are: first, there are use cases for auxiliary runtimes beside Linux at the edge; those runtimes need ways to integrate with Linux, and VirtIO can help, such as for console, network, file systems, and IPC. Second, compute islands can remove the need for virtualization to enable real-time or safety workloads with access to Linux-based file systems, and they can still use VirtIO for multi-OS integration using hypervisor-less VirtIO. Third, VirtIO over MMIO with MSIs is as fast as VirtIO over PCI and has a smaller implementation, making it potentially more suitable for safety certification. And lastly, AF_VSOCK sockets can be 10x faster than AF_INET TCP/IP sockets, and AF_VSOCK also has a much smaller implementation.

Here are some links to further information and sources on what we have done to kvmtool to enable hypervisor-less use cases and also to enable MSI support. I should mention that this work is being done under guidance from the Linaro OpenAMP community project working group on application services. I would also like to give an honorable mention to Stefan's great presentation on virtio-vsock from KVM Forum 2015; you can likely find that easily on the internet. And finally, I should mention that this activity is sponsored by my employer Wind River, as it is aligned with our vision for mission-critical intelligent systems, including those that leverage Linux at the edge. Wind River Studio is a cloud-native platform for the development, deployment, operation, and servicing of intelligent systems, and you can get a tour of Wind River Studio at the link on the screen. Thank you.