All right, welcome everybody. My name is Martin Koenig and I work in the technology office at Wind River, where we work on technologies that are a little bit in front of, or beside, where the company is currently operating. The reason I'm here at the Linux Foundation Open Source Summit is that we are slowly but surely becoming an open source company. We have the number one commercial embedded Linux, based on Yocto, and we have a hardened cloud platform based on Linux, StarlingX, Kubernetes, and OpenStack. The traditional RTOS use case, where a standalone VxWorks, a standalone RTOS, runs on the CPU cluster, is being relegated more toward safety and hard real-time and less toward general-purpose compute, for a number of reasons I'm going to cover.

I'm going to start by talking a little about the meta trends that are driving CPU architecture for embedded, especially for connected edge devices: how that is affecting the way we do system architecture, and what it might mean for how we assemble software in the future. We'll start with that broad context and then go a little deeper on some proposed partitioning technologies that leverage virtio, hypervisorless virtio in particular, which I think is quite interesting for assembling multiple CPU clusters into collaboration fabrics within the SoC.

Historically, of course, we in the industry would develop, test, and ship our embedded devices and never see them again. Now, with connected devices, they live on, so we have to update them, and that brings requirements around security, connectivity, and protocols. The fact that you can update them can also mean you're bringing in new features, and sometimes you need to partition out those new features. So the platforms are slowly but surely becoming integration platforms, and that means it's interesting to have a little more separation between the software and the hardware, so that you can assemble these partitions into working systems and reuse the components from one variant of a product to the next.

Back in the day, something like an infusion pump would be a fixed-function medical device that patients would be connected to for delivering fluids, and it would do that very well and very safely. Now those same devices have to be integrated into the hospital's infrastructure: for credential management, for connecting to a patient database to validate settings, and not just for IT but as part of the whole business of running a hospital, probably for billing too. That means the amount of software you need is a lot more than the fixed-function, dedicated system operating an infusion pump; it's becoming an entire world unto itself, integrating into a business deployment. It's the same across all the vertical markets: we're seeing robot arms that have to integrate into factory infrastructure, and cars becoming transportation services platforms, so that one day we'll just be subscribing to a transportation service. Rather than worrying about charging your car when it runs out of juice, maybe at the charging station you just get into the next vehicle that belongs to the company operating the fleet you subscribe to.
These business paradigms are changing how we need to think about how we create value, how we deliver it, and how we capture it in the software. So the problem is basically: how do we engineer huge amounts of software into edge devices, given that these modern multi-core SoCs are specialized, complex, and in many cases heterogeneous, and the software that has to go into them is quite diverse? You have open source, you might have real-time elements, safety pieces, even licensed software, and you have to bring them together in a structured way, and ideally enable the componentry to be carried forward and reused so you can amortize the NRE across multiple product variants.

This brings us to element separation. We need some way of saying, well, this is a software element and that is a software element, and we can combine them. The reasons for this separation: first, CI/CD, because when you update something you want to update only the thing that changed, not the whole system; similarly, if something is stable you don't want to keep updating it, you want to just carry it forward into your next product. Fault propagation prevention is another reason to have a partitioning technology: if something fails, the fault can't propagate into some other active entity. Resource allocation is another good reason to put a box around your software, so you can say, I want this part of the system to have this much CPU, this much memory, or this much bandwidth. And similarly for privilege management: you want to follow best practices such as least privilege, so that a function doesn't have the keys to everything; it has only the privilege needed for what it's actually doing.

So we have all these requirements, and then there's what's going on in the whole computer industry. Moore's law is ticking along (I didn't think I'd be mentioning Moore's law again, but somehow it's relevant), and there's the Moore's gap. That's the expectation we have around performance; I know Moore's law doesn't talk about performance, but this Moore's gap is the expectation that's driving us toward multi-core SoCs, so I want to talk a little about that so we can look a bit into the future. Software complexity: as mentioned, software partitioning is the solution there. Time to market and cost of ownership are driving us all to open source. Hardware enablement is driving us to Linux. And the chip shortage, the fact that it's getting difficult for device manufacturers to get chips, might actually motivate some designs to move more toward hardware consolidation, so you have fewer parts in your devices, and to value silicon independence a little more, so you can substitute parts more easily and not be beholden to a specific SKU.

Now, about this Moore's law thing. Back in the day, when we got more transistors to put into an SoC, the hardware designers used them for creative new features in the chips, like caching, pipelining, superscalar designs, all kinds of multi-threading, out-of-order and speculative execution, and for pulling peripherals into the chip. Then we got this Moore's gap, where Dennard scaling stopped helping us. Dennard scaling basically states that power density stays constant even as transistors get smaller, so you can lower the voltage and the current, increase the frequency, and get better performance.
That broke down a long time ago, something like 15 years ago, so now the way to leverage more transistors for performance is specialization: having exactly the right kind of cores for the workload. That has brought GPUs, NPUs, real-time processors, compute islands, safety islands, and in some cases lockstep cores into the SoCs, and it's just going to continue. We're clearly in the era of complex heterogeneous multi-core SoCs at the edge.

Meanwhile, this is a slide from the automotive industry: cars are basically software-defined vehicles now, and the claim is that lines of code grow 10x every decade. Continuing that trend, some are predicting 500 million to a billion lines of code by 2030. So there's a lot of complexity being driven into some of these devices, and can you imagine, with hardware consolidation, not having a good technology to integrate all of that software together? It would be an untenable situation.

The remedy, of course, is partitioning, and as software architects we want a strong foundation for partitioning our software so that we can accommodate future requirements, reuse components, manage constraints, compose, configure, and enforce policy, and have some structure and organization to the system. So we love partitioning technologies, whether it's libraries, kernel modules, programs, packages, containers, virtual machines, or even multiple runtimes running in the SoC. It helps us divide and conquer.

At the same time, there are all these other things happening to us as software architects. Free and open source software is becoming, in effect, Linux open source software. Portability is waning a little, because open source projects are using the features of Linux, so there's less use of things like configure scripts to adapt software to alternate runtime environments. Bring your own OS (sorry about that acronym), where people want bespoke OS instances configured exactly for their application and integrated into a system, whether via a virtual machine, a container, or some kind of compute island, is in vogue. There are all these abilities, which I'm not going to go through in full, that we as software architects have to consider and deal with. And meanwhile, ready-made software is coming: platforms, binaries, middleware, trusted software with provenance that we're going to be able to stream into our systems, and so on. We have a lot to deal with as software architects.

So the reality is that edge devices are going to increasingly contain Linux, and I have a little proof here around that. Edge devices use all this open source middleware and these ready-made applications that are increasingly only available on Linux. Board support packages for edge devices are increasingly only available for Linux, because the board manufacturers and SoC vendors are doing BSPs and drivers for Linux and not for other operating systems.
Meanwhile, the SoC designers are drag-and-dropping the HDL and cranking out SoCs faster and faster, so it gets more and more difficult for any OS vendor who is not leveraging Linux drivers and Linux BSPs to consider supporting the matrix of pain across all these SoC variants. That's driving things toward Linux, and porting code from Linux is increasingly problematic for the reasons mentioned: the drivers are written for Linux, they're GPL so you can't bring them into commercial operating systems, and the low-level code uses features of Linux. Therefore, devices will increasingly contain an instance of Linux. QED, which is, you know, English for quite easily done.

Meanwhile, intelligent edge devices need to deal with reactivity, real-time, and safety: planes, trains, automobiles, drones, robots, medical devices. We need the ability to run these payloads. So if edge devices will contain Linux, where are these payloads going to run? Well, they're going to run on Linux, if they can, and over time Linux will be able to run more of them. If Linux can't run the payload, then we need to run it somewhere else, and there's still a window of opportunity for alternate runtimes to help Linux. We can call those auxiliary runtimes. They could run, for example, in a virtual machine beside Linux with a hypervisor. They could run on a compute island, whether a real-time compute island or a safety compute island, maybe lockstepped. They could even run on a dedicated core in the same CPU cluster that Linux is running on, and I'm going to talk a bit about that one; it's a little tricky. Some call it static partitioning, or whiteboard partitioning, where you sort out, well, Linux is going to have these peripherals, and I'm going to map this one into some bare-metal engine on a core. It is a possibility, so it's there more for completeness.

Let me walk you through how that one works, because it can be a little confusing. Basically, you boot Linux across all the cores. You use CPU hotplug to take a core away from Linux. Then you load a bare-metal image into memory that that core has access to, and you use some architecture-specific technique to activate the core at the entry point of that image: on Intel we've been using kprobes, and on Arm you can use PSCI. Now, as long as that image only uses per-core resources (the per-core timers, interrupt controller, MMU, and so on), plus whatever real-time device you've mapped directly into the payload, which hopefully doesn't conflict with anything running on Linux, you can achieve a real-time environment on a core in the same cluster as Linux. We've done that, so it's definitely an option. But it's there more for completeness, because it's a lot of rope, it's quite dangerous, and it's susceptible to configuration error.
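To make the hotplug step concrete, here's a minimal sketch. It assumes core 3 is the one being donated; loading the bare-metal image and waking the core at its entry point (for example a PSCI CPU_ON call on Arm) are platform- and firmware-specific, so those steps are only indicated in comments.

```c
/* Minimal sketch of the hotplug step only, assuming core 3 is donated.
 * Image loading and core wake-up are platform-specific (see comments). */
#include <stdio.h>

int main(void)
{
    /* 1. Take core 3 away from the Linux scheduler via CPU hotplug. */
    FILE *f = fopen("/sys/devices/system/cpu/cpu3/online", "w");
    if (!f) {
        perror("cpu3/online");
        return 1;
    }
    fputs("0\n", f);
    fclose(f);

    /* 2. Load the bare-metal image into memory that core 3 can access
     *    (not shown; e.g. through /dev/mem or a platform driver).
     * 3. Activate core 3 at the image entry point: PSCI on Arm, an
     *    architecture-specific mechanism on Intel, as described above. */
    return 0;
}
```

Run as root, this takes core 3 out of the Linux scheduler; from Linux's point of view the core is simply offline.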
So here's the whole landscape of options for running real-time and safety workloads with Linux.

The first is core reservation, where you use Linux features to pin a thread to a core. That thread could be running a user-level process; it could be a unikernel, which hopefully offloads system calls from Linux and deals with them locally to reduce traps into the kernel; it could even leverage virtualization and have a vCPU running a real-time workload pinned to a reserved core. Of course, PREEMPT_RT is helpful in all of those cases, if not outright necessary.

Another scenario is core offload, which we just covered, where you take a core away from Linux and run your bare-metal image on it. That bare-metal image could be a small executive; it could also be an executive running on a hypervisor that's just on that core. The hypervisor won't stop Linux from resetting the image, so you can't use that for safety, but it will stop the low-level image from killing Linux, which is probably the more likely failure: Linux is quite field-tested, and if you're developing some custom application you're likely to have bugs in that low-level payload more often than in the Linux kernel.

Another scenario is to take those runtimes, whether a real-time payload or a safety payload, and run them on a hypervisor that spans all the cores in the cluster; we'll call that virtual partitioning. And the final option is physical partitioning, where you use compute islands. These complex heterogeneous SoCs increasingly provide real-time and safety compute islands that have their own cache and memory and are completely free of interference from the main CPU cluster.

As a general guideline: if you're looking for tens of microseconds of real-time, you can use Linux on the main CPU cluster with the PREEMPT_RT patches, which actually aren't patches anymore now that they've been integrated into the mainline at kernel.org, which is great. If you want tighter latency than that, say microsecond-ish hard real-time, then at this point you're probably looking at offloading the payload and leveraging an auxiliary runtime on a compute island or in a VM running on a real-time hypervisor. And if you want safety, it's the same sort of thing: you deploy your workload on a safety island or in a VM with a safety hypervisor.

Here's a use case that's kind of interesting, maybe a little counterintuitive, and possibly even slightly controversial: think about running an auxiliary runtime on KVM, on a core reserved for the one vCPU that's running the RTOS. Now you might think, hang on a second, if it VM-exits into KVM and KVM is running some non-deterministic code, that's a problem. But look at it the other way around: what if a hundred percent of the instructions run in the RTOS, and the RTOS takes direct interrupts and has its own local peripherals for its application? Then it's not VM-exiting in its normal data path, and since it's an RTOS, it can be real-time on that core.
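To illustrate the core reservation option, here's a minimal sketch. It assumes core 3 has been set aside for real-time work, for example with isolcpus=3 on the kernel command line (my assumption for the example, not something prescribed here); the same call pins whatever the thread hosts, whether a user-level process, a unikernel, or the vCPU thread running the RTOS in the KVM scenario above.

```c
/* Minimal sketch of core reservation: pin the calling thread to one
 * core, assuming that core has been isolated from general scheduling. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);  /* the reserved core */

    /* Pin the calling thread (pid 0 = self) to core 3. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* The real-time loop (or the vCPU thread) now runs alone on core 3. */
    return 0;
}
```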
Now of course the challenge is receiving and acknowledging interrupts without doing any VM exit, and this is a place where maybe we need to provide some feedback to the silicon manufacturers to enable that. But it would allow us to have a virtualized real-time payload with Linux running across the entire CPU cluster. It's up there as an interesting scenario: we have done this and measured the real-time performance, and it does take some VM exits, but it does provide the 10-ish microseconds you'd expect with PREEMPT_RT. So that's a perfectly valid use case for enjoying the simple, easy Linux boot-up experience across all the cores while still deploying an auxiliary runtime within the cluster.

Once you're partitioning like this, the question is: how do your runtimes collaborate? printf is incredibly important; today you need to be able to see what's coming out of your console, and you might even want access to the serial port or the serial stream to issue commands and interact with your real-time or safety payload. That payload may want to read or write Linux files, and it might want to send messages to the part of the system that is not real-time or safety. What you want is to run only the real-time or safety part of the payload on the RTOS or safety OS, and keep as much of the application as possible on Linux. So you'll be split across the different runtimes, and you want some common way of having them collaborate.

The de facto approach for that is TCP/IP. That's a pretty heavyweight thing to be doing within an SoC: you're using a WAN protocol to send messages between runtimes that could be in the same CPU cluster, or from a compute island to the main CPU cluster. So the question is: can we use virtio for intra-SoC workload integration? Virtio is already available both in Linux and in many runtimes. It's an open specification that's transport-independent; it can run over PCI or MMIO. It has AF_VSOCK, which is quite interesting since it's similar to AF_INET, so you can provide a socket API with it, and our experiments show it's 10x faster than TCP/IP over virtio-net. And it can run over shared memory without a hypervisor. That's why this talk is called hypervisorless virtio: instead of doing traps down to a hypervisor, you use virtio over shared memory and go across from the shared memory to a daemon process running on Linux on another core. Then you don't need a hypervisor, and you can leverage the virtio devices and the virtio specification without one. That's a somewhat novel way to use virtio, but it has promise and it does work.

Virtio also has both low-level devices and higher-level services, like file systems, so it's a good multi-device open specification for integrating runtimes at a low level and at a slightly higher level as well.
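To show how closely AF_VSOCK mirrors AF_INET, here's a minimal client sketch; only the address family and address struct differ from a TCP client, and the CID and port are illustrative values.

```c
/* Minimal sketch of the AF_VSOCK socket API, mirroring a TCP client. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

int main(void)
{
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
        .svm_cid    = VMADDR_CID_HOST,  /* CID 2: the Linux/host side */
        .svm_port   = 9999,             /* illustrative service port */
    };

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        write(fd, "hello\n", 6);        /* exactly as with a TCP socket */
    else
        perror("connect");

    close(fd);
    return 0;
}
```

Swap in AF_INET and a sockaddr_in and you have the TCP version, which is exactly the one-to-one mapping that makes carrying existing clients and servers forward so attractive.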
So the thesis for the POC we've been working on was: can we use virtio on both x86 and Arm without a hypervisor, using just interrupts between the cores, leverage a VMM as the back end for virtio, and get good performance? We chose kvmtool as the VMM. That meant we had to add MSI support to kvmtool and enable vhost-based vsock and networking (virtio-net) in the system. Then we took out the opening of /dev/kvm for the configurations where you're using compute islands and talking over shared memory to a compute island. We still call it LKVM even though it doesn't open KVM, but that's how we leveraged kvmtool as the simple C-based back end.

Here's the generalized architecture. Basically, you have shared memory between your auxiliary runtime and the kvmtool daemon running on Linux, and the partitioning can be any of the four scenarios: core reservation, core offload, virtual partitioning, or compute islands. You have the same system architecture regardless of how you're assembling your system at the hardware level, and as software architects we like flexibility and reuse, so it's quite interesting to unify all of those partitioning technologies with a single mechanism, hypervisorless virtio.

With standard virtio there are calls down into the VMM; on KVM, for example, there are VM exits and traps happening as you interact with the virtio devices. With hypervisorless virtio there's nothing to trap to, so instead you write into the shared memory and update the device registers for the virtio devices, and something has to process that. So after you do those register updates to the virtual virtio devices, you send an interrupt over to what we call the PMM, the physical machine monitor, and it goes and processes the update. It has to poke around and look at the device registers to see what's changed, which can add some overhead, but that can be mitigated by using MSIs, and I'll talk about that on another slide.

The shared memory is basically laid out with device tree fragments for each of the device headers. The devices we've been leveraging are 9p for the file system, vsock for IPC, the virtio console, and virtio-net. We started with virtio-net, but once we got vsock working we preferred it, because it's faster and it seemed we could reduce the amount of code needed in the auxiliary runtime by eliminating the TCP/IP stack.

This might provide some clarity as to where the shared memory sits in each of the four scenarios: core reservation, core offload, mixed-criticality systems with virtual partitioning, and compute islands. Basically, the non-real-time, non-safety services are all provided on Linux in this kind of system architecture, and the real-time and safety services are provided on the auxiliary runtime, which reaches across to access file systems, to provide its serial console, and to do IPC. We started off with printf, then we added the AF_INET socket family with virtio networking, and then moved to AF_VSOCK and removed the IP stack requirement.
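Here's a minimal sketch of that notify path from the front-end side, under a couple of assumptions: the virtio-mmio register image for one device lives in the shared memory window, and notify_pmm() is a hypothetical stand-in for whatever inter-core interrupt or mailbox doorbell the platform provides.

```c
/* Minimal sketch of the hypervisorless-virtio notify path. The device
 * register image lives in shared memory; notify_pmm() is a hypothetical
 * placeholder for a platform inter-core interrupt, not a real API. */
#include <stdint.h>

#define VIRTIO_MMIO_QUEUE_NOTIFY 0x050  /* register offset per the virtio spec */

static volatile uint8_t *dev_window;    /* device registers in shared memory */

/* Hypothetical helper: raise an interrupt toward the core where Linux
 * and the PMM daemon are running. */
static void notify_pmm(void) { /* platform-specific doorbell */ }

static void virtio_kick(uint32_t queue)
{
    /* There is no hypervisor to trap to, so the front end writes the
     * register image in shared memory... */
    *(volatile uint32_t *)(dev_window + VIRTIO_MMIO_QUEUE_NOTIFY) = queue;
    /* ...and then interrupts the PMM, which scans the device registers
     * to see what changed and processes the virtqueue. */
    notify_pmm();
}

int main(void)
{
    static uint8_t fake_window[0x100];  /* stand-in for the real mapping */
    dev_window = fake_window;           /* real code would map the shared region */
    virtio_kick(0);                     /* kick queue 0 after filling it */
    return 0;
}
```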
That could actually be interesting for safety systems, because it would be a lot less code to certify: you wouldn't need a certified network stack to do socket-based communications in these complex SoCs where you have Linux; the safety runtime could just communicate over vsock to Linux. There would have to be some changes to the way the virtqueues are implemented, so that you have safe IPC with one-way channels and no fault-propagation paths between your safety runtime and Linux. We haven't done that yet, but it's an interesting area to pursue for safety systems.

In a hypervisorless deployment, as I covered, the hardware mechanism is interrupts, which carry the virtqueue device notifications between the runtimes. Upon receiving the hardware notification from the front end, Linux delivers a notification to the PMM via eventfd. The PMM then looks at which device status fields have changed, updates its copy of those registers, and handles the request, or, if the offload can be handled with vhost, it acts as a vhost proxy and punts the request over to the Linux kernel for processing.

Similar to how ncat and socat do port proxying, it could be interesting to add a feature to kvmtool to proxy TCP ports to vsock for hypervisorless virtio situations. Then you could, for example, GDB your auxiliary runtime by connecting to its TCP port: kvmtool would accept the connection and proxy it over vsock to the auxiliary runtime, or vice versa, so you could do client-server in either direction without a TCP/IP stack on the auxiliary runtime. That's probably one of the next things we'll be implementing in our kvmtool work; I'll show a little sketch of what it could look like in a moment.

A side note on the performance of virtio-mmio with MSIs. Without MSIs, a lot of traps happen when you modify virtio registers, with VM exits into the VMM that has to process them. To reduce those, you can use MSIs on a per-device basis: instead of all these device register changes, you send the little bits of information you want to signal to the PMM in the MSIs, and that way you eliminate a lot of the VM exits. You can see here that the number of signals reaching the back end can actually be doubled with MSIs; we go from 660,000 in the second column from the right to almost 1.2 million, because you're sending more useful packets across. That's why the number increases. And if you compare this to the performance levels that virtio-pci gets (PCI of course being the preferred way to do virtio on x86), you can actually achieve that with MMIO plus MSIs. That's quite interesting, because the amount of code needed to implement a PCI driver and transport for virtio is a lot bigger than simple shared memory. So if there's interest in reducing the amount of code in a VMM or PMM scenario, the recommendation would be to use the MSI-based MMIO transport instead of PCI.
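Returning to the port-proxy idea, here's a minimal sketch of what such a relay could look like; this is not a feature kvmtool has today, and the TCP port, CID, and vsock port are illustrative values. For brevity it pumps bytes one way; a real proxy would poll() both sockets and relay in both directions.

```c
/* Minimal sketch of a TCP-to-vsock relay, like a tiny socat. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

int main(void)
{
    /* TCP listener that a client (e.g. GDB) connects to. */
    int lsn = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in in_addr = {
        .sin_family = AF_INET,
        .sin_port   = htons(1234),              /* illustrative */
        .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
    };
    if (bind(lsn, (struct sockaddr *)&in_addr, sizeof(in_addr)) || listen(lsn, 1)) {
        perror("listen");
        return 1;
    }
    int tcp = accept(lsn, NULL, NULL);
    if (tcp < 0) {
        perror("accept");
        return 1;
    }

    /* vsock connection to the auxiliary runtime (CID/port illustrative). */
    int vs = socket(AF_VSOCK, SOCK_STREAM, 0);
    struct sockaddr_vm vm_addr = { .svm_family = AF_VSOCK, .svm_cid = 3, .svm_port = 2345 };
    if (connect(vs, (struct sockaddr *)&vm_addr, sizeof(vm_addr))) {
        perror("vsock connect");
        return 1;
    }

    /* Relay TCP -> vsock so the runtime needs no TCP/IP stack. */
    char buf[4096];
    ssize_t n;
    while ((n = read(tcp, buf, sizeof(buf))) > 0)
        write(vs, buf, n);
    return 0;
}
```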
All right, some conclusions. Partitioning systems at the OS-instance level, using virtual machines, containers, and auxiliary runtimes, helps deal with edge-device software complexity. Linux-based system architecture is increasingly used at the edge, and auxiliary runtimes for real-time and safety partitioning can sometimes help, perhaps less so in the future as Linux gets better at real-time and maybe even safety. Compute islands can avoid the need for virtualization to enable real-time or safety workloads with Linux-based systems. Hypervisorless virtio can help unify workload integration across the various partitioning scenarios involving auxiliary runtimes on multi-core SoCs. And the socket API can unify TCP/IP communication and higher-speed vsock-based local IPC.

This work was done as part of the Linaro OpenAMP application services working group. The kvmtool changes can be found at this GitHub URL, and there's more information under the OpenAMP project in the news. If you're interested in the MMIO MSI support for kvmtool to accelerate that transport, you can find it at this URL here. This work was and is sponsored by Wind River. Wind River Studio is a cloud-native DevSecOps platform for the development, deployment, operation, and servicing of intelligent systems, and if you're interested in learning more about that, and how it can do payload generation for complex heterogeneous multi-OS systems, you can find information at the link on the slide. So that's the end of my presentation. Thank you.

[Audience question about named pipes and FIFOs.] So the question is: are named pipes, or FIFOs, an alternative to vsock? They could be. They could be. Vsock is particularly attractive because you can do a one-to-one mapping to TCP/IP and carry forward all of the clients and servers that use TCP/IP without making code changes, and that has huge value. But I do like the simplicity of FIFOs, especially for safe IPC.

[Follow-up comment: with a FIFO, the guest writes and the host sees it immediately, a FIFO in and a FIFO out on the host, a character device on the guest, whereas vsock goes through a work queue.] Yeah, so the follow-up comment was that the vsock implementation in the guest goes through a work queue, which causes a context switch and has overhead in the guest. However, that's actually a property of the guest, and it's possible to implement vsock in the guest using alternate implementations without work queues and without context switching. You must have been looking at a Linux guest. Right. Thank you. Any other questions? Yes, Stefano?

[Question from Stefano about Argo, whose API is a lot simpler, just one step above a simple ring buffer.] So the question was: have we looked at Argo as a way to simplify the transport at the lowest level between the runtimes? Because instead of using a virtqueue implementation, it could simplify things and potentially have a path to safety. I do think that's something that should be looked at; it's a very interesting question. Thank you. Any other questions? All right. Thank you, everyone.