I will do a brief introduction as well, just to give you an overview of what this thing is about. So Jailhouse is a tool to run applications, safety applications, RTOSes, but also other Linuxes beside Linux. It's a partitioning tool for multiprocessor platforms which is designed for AMP, asymmetric multiprocessing. But nowadays we also do Linux beside Linux, so it becomes broader. So the idea is to have a strong, clean isolation between these partitions. We want to achieve bare-metal-like performance — we will see today what this can mean in latencies — and we don't want to modify Linux for this. So there's no paravirtualization involved; well, almost none, at least not on the primary Linux side, a little bit on the other side, we will see this. And we open-sourced it. The project is now three years old, and being open source and working in the community, we already got quite some nice contributions. What makes it different compared to full-fledged hypervisors, compared to the many embedded hypervisors out there? Well, we use hardware virtualization, we rely on hardware virtualization — that in itself is not really new. But we focus on simplicity over features, so this thing doesn't come with all the shiny features you may know from your desktop or cloud virtualization. We basically just do resource access control. So if a resource is not available in the amount you need, you can't really partition it. That's the whole story behind it. We do one-to-one resource assignment: one CPU, one guest — maybe one guest on multiple CPUs, but never multiple guests on one CPU. So there's no scheduler involved; we don't have a scheduler. The same for the devices: you assign a device completely to one guest or you leave it to another, but you don't share it. And we also don't invest much in hiding our existence.
And the final difference: we partition the booted system, whereas many of the embedded hypervisors boot first, as the primary workload, and start their guests after the system boot. We just turn this around. We rely on Linux to boot the system, and that means we can also offload certain tasks to Linux. So besides the system boot, the whole command-and-control interface that we have — that every hypervisor needs to have — we offload to Linux. So we have a command line in Linux to operate the hypervisor, and also runtime control and monitoring of the system. We will see examples for this. So this is a multi-core system that we booted Linux on, and we want to run something beside Linux. Let's start with an RTOS. We inject the hypervisor underneath. So the hypervisor is then really in control of all the hardware; Linux is no longer in control once it's running. So it can establish this kind of clear partitioning, a clear barrier between the two. Then you can run, for example, safety-critical workloads in one of the cells. This is one of the primary scenarios that we started off with, but nowadays we have more of them. Some terminology. We talk about the root cell: that is the partition, the Linux instance, that booted up the system and initially started the hypervisor. And then we have non-root cells. There can be many of them, they can also spread across multiple cores — there can be more than one non-root cell. That's basically the terminology. As I said, we started off with partitioning for these bare-metal workloads, for the classic RTOS that you may still have in your basement. And you want to reuse it.
But then some folks came around and said, ah, it would be a cool thing to do on these kinds of monster machines with plenty of cores, because we don't really see Linux scaling that well on these machines. There are scenarios, specifically when it comes to real time, where Linux becomes a bottleneck, and we could solve that just by running multiple Linuxes side by side. Or you can imagine that one of the Linuxes is actually doing some critical control task, and the other one is just doing the fancy UI stuff. And you really want to be sure and keep both separate. So you run one Linux for control — PREEMPT_RT, for example — and you run another Linux for doing the nice, shiny UI things. Also a scenario, and that's also possible now with Jailhouse. So, the late partitioning concept. That means we have the classic boot phases — firmware, bootloader, whatever. Linux starts up and takes control over the hardware. That's what we rely on; that's the precondition. And only then does the hypervisor start, not the other way around. That means Linux loads the hypervisor underneath itself. It lifts itself up, puts the hypervisor underneath, adds some fragments to this — the system configuration, the cell configurations, and the images for the cells — and then you end up with a partitioned system. That makes the actual task of partitioning much simpler than what you have to do in a full system boot-up. For example, on x86, if you want to boot up the whole system, you need to take care that the guest actually sees all the virtual or physical hardware that it expects from a physical machine. And if it's not there, it won't boot. Because Linux is already booted, we don't have to care about this. We just freeze the configuration of the hardware and no longer let Linux mess around with it — we freeze it in the state that we find. That's basically the concept behind it. That was the question.
Well, we have to solve the same problem — we will see this later on — that KVM has to solve. So we currently rely on Linux installing the hyp stub after taking over control from the bootloader. The bootloader starts the kernel, the payload, in hyp mode; Linux installs the stub; and then later on either KVM comes around and installs itself, or we come around and install ourselves. We will see this later on. So yeah, there are some preconditions for this, a little bit different than on x86. x86 can simply turn on virtualization and continue running; on ARM it's a bit more complicated. And we will see what is coming on ARM in the future, which will make it even more complicated. But it still works, with some effort. Well, this is all embedded, serious embedded. That means, even on the server side, you currently have to tweak a lot of things manually and work around some preconditions. You will see this. We don't have really shiny new tooling to tune your systems; you have to do a lot of tuning yourself on the low-level side. We will slowly improve on this, and this tutorial is part of that, basically — showing you how you have to boot up your system, and where the traps and the pitfalls are. Another question in the back? No, OK. So by the way, just interrupt me if you have any questions right away, because it's a long session and you may not remember all your questions at the end. Just jump in and we will move on. Oops, this is the wrong slide, by the way. OK, so let's get our hands a little bit dirty, but not too dirty. One of the first things you can actually try out is to run Jailhouse in a virtual environment. This is actually how Jailhouse was born: it was born in a virtual environment and not on physical hardware. So we did the first bring-up in Linux KVM with QEMU. All the necessary steps for this are documented in the README.
So if you download Jailhouse, there's the top-level README, and there's a chapter on how to do the bootstrapping on QEMU/KVM. The advantage is that this is a predefined hardware platform, so all the other things you have to do later on physical hardware — like tuning the configuration for the system — you don't have to do. But you have to fulfill some requirements to get this benefit. First of all, of course, you have to have virtualization available for KVM. So for Intel hosts, which is the more or less common case these days, you need VT-x, CPU virtualization — the standard these days. Don't take a notebook which is maybe six years old; it should be a recent one. You don't have to have VT-d, meaning an IOMMU. That's not necessary for this demonstration, for this virtual setup. We will use one, but it's emulated. That's the advantage: you don't have to have it on the physical side. You need a rather recent kernel, 4.4 or so, just to have all the bugs fixed in KVM to do this kind of nested virtualization setup. And what you need these days is a very recent QEMU, because we now really enforce having an IOMMU emulated by QEMU, and that's fully available only with version 2.7. So really use that version. And actually, the configuration file that we ship depends on the QEMU version, because QEMU tends to change its layout once in a while and add or remove devices, and that's the only way to make sure it's really compatible. Linux guest-side, take a more or less arbitrary Linux image — I'm still running a very old SUSE image for my tests — but you need a more or less recent kernel in it, so I would recommend something in the 4.x range as well. And you will have to do a build against this kernel. Part of Jailhouse is actually a kernel module, and it has to be built against the kernel in the guest. So you have to have your build environment available for this.
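A quick way to sanity-check these prerequisites is a small script like the one below. This is a sketch, not part of Jailhouse: the flag name and the version threshold come from the talk, and the cpuinfo file is passed in as an argument so the checks can also be run against canned input.

```shell
#!/bin/sh
# Sketch of the host-side prerequisite checks described above.
# Pass /proc/cpuinfo on a real host.

# VT-x shows up as the "vmx" CPU flag on Intel hosts (AMD's SVM is "svm").
has_vmx() {
    grep -qw vmx "$1"
}

# True if a "major.minor[.patch]" version is at least 4.4,
# the kernel the talk recommends for stable nested virtualization in KVM.
kernel_recent_enough() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 4 ]; }
}
```

On a real host you would call `has_vmx /proc/cpuinfo` and feed the output of `uname -r` into the version check; the QEMU 2.7 requirement you would check against `qemu-system-x86_64 --version` by hand.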
And yeah, we don't release that many versions. The last version, I think, is from last year. So better take master, the GitHub master version of Jailhouse. And yeah, I've prepared something. So what I'm going to run right now is a virtual machine with four virtual cores, Linux on top of this, and we will insert the hypervisor. The hypervisor will talk to a virtual UART just to dump out its messages and maybe also its errors. And we will also start up one test cell, which is measuring timer interrupts — the latency of timer interrupts. So it's programming a timer to fire every, I think, 100 milliseconds, measuring the latency, and reporting this via the same UART to the console. That's the simple scenario. The same test you can also run on physical hardware later on. Yeah, that's the question: the control tools actually run in this root cell, so in the cell which booted up the system. Yes, we will see this right now, and I will tell you the restrictions later on — what kind of control models you can run and what the limitations are. Because of course these tools are powerful: they can enable the hypervisor, they can disable it again, they can do a lot of nasty things to the system. So there are some measures available to confine these tools and their power during the runtime of critical workloads. We'll see this later on. Yes — no, no. OK, let me switch the output so that we can see everything together. It's gone. OK, I hope you can read this. I'm not sure if I can. No? Who said no? Zoom in. So yeah, this is QEMU — well, you don't see it. Let me check if I can make this larger. Oh, come on. Yeah, OK, it's still improvable. Getting better, OK. So this is now the root cell in the guest — I see the guest is even a graphical one. So the first step — well, oh, wait. OK, I prepared something, as I said. This actually just mounts back the host side where I've built Linux and where I've built the hypervisor.
That saves us the last steps; this is just the preparation. And now I will do the first step: basically, you load the driver into the kernel. No magic happens — this is just a loader and, later on, a command-and-control interface. What you get is a /dev/jailhouse for this. You even get a sysfs jailhouse entry with some information, which the command-line tool can also use later on. Well, you can explore this; I don't want to go too much into detail. So the next step is to enable the hypervisor, and what you pass in here is the system configuration file for this particular machine, the system .cell file. We will later go on and produce one for the physical hardware, but here this one is already prepared. That's it. Oh yeah, something happened. So this here in the back, this is my terminal console where QEMU is running, and this is actually showing the virtual UART output. And so it was reporting: Jailhouse has been started on four cores. Some physical devices, PCI devices, are under control now and have been assigned back to the root cell. So the root cell is still running on all the cores, and the root cell still has all the devices it had before — otherwise things would go wrong; that is the precondition for this. But now we, as the hypervisor, are in control of the hardware. So what you can do now, first of all, is — oh wait, cell list. You can list what is running. Well, just one cell on all cores, everything is still fine. Now you can create a cell. That means you pass in, again, a configuration fragment that describes a cell which consists of a single CPU and some resources, some memory. And it's now running — not yet, actually. It's just the container, and the container is empty. So you need to load something into it.
And this is just a tool which takes an arbitrary binary image — or whatever you want to pass in — and loads it into the cell at a specific address. On x86 it has to be a specific magic address, because we will boot up from that point. Let's load it. And then we can start it. And it's running, and you see the output on the other screen here. OK, the latencies are not that impressive, simply because it's a virtual machine: you see the latencies of KVM and not the latencies of the hypervisor. Well, you see both, but anyway, these numbers are not really demonstrating the performance, but they demonstrate it's functional, at least. So we see now, OK, the cell is running, and it still has a CPU. You can also do something interesting, actually: we can look at some statistics. That shows how much the cell interacts with the hypervisor during runtime. And you see it did some interaction — it had some exits, some drop-outs of the virtual machine into the hypervisor — but only during startup. So this particular cell runs without any kind of hypervisor interaction during runtime. You can imagine, if this is your RTOS, you now have full control over the hardware, and you don't have to talk to the hypervisor in order to use the hardware. That gives quite nice performance numbers. It doesn't look that good on the QEMU side — sorry, on the root cell side. On the root cell side, you have some interaction there. Well, the hypercalls are currently statistics calls, actually; otherwise it's MMIO and other stuff. So there is ongoing activity on the machine, and Jailhouse once in a while has to look into some of these accesses and decide if they are valid or not. So you have some exits, simply. So yeah, it's running. Now, if you want to get rid of this, you can destroy the cell again. Oh — not permitted. What's going on here? So actually this cell, as we will see later on, has some magic power.
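For replaying this on your own, the whole sequence so far condenses to a handful of commands. The sketch below is from my memory of the demo — treat the config file names, the load address handling, and the cell name as placeholders, and check the README and the tools' `--help` output for the authoritative forms:

```shell
# Load the loader/control driver, then enable the hypervisor with the
# system configuration for this (virtual) machine:
insmod driver/jailhouse.ko
jailhouse enable configs/qemu-vm.cell        # system config; name is a placeholder

jailhouse cell list                          # only the root cell so far

# Create an empty cell, load the timer-latency demo, and start it:
jailhouse cell create configs/apic-demo.cell
jailhouse cell load apic-demo inmates/demos/x86/apic-demo.bin
jailhouse cell start apic-demo

jailhouse cell stats apic-demo               # hypervisor exits, per reason

# Tear-down, in reverse:
jailhouse cell shutdown apic-demo
jailhouse cell destroy apic-demo
jailhouse disable
rmmod jailhouse
```

These commands obviously only do something on a machine where the hypervisor can actually be enabled; they are listed here as a cheat sheet, not as something to paste blindly.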
It can vote against being destroyed, and therefore this demo cell says: no, try again. We can, however, shut it down. And now the cell is gone, because after the shutdown it says, OK, fine with me, you can kill me — it permitted the operation. And this is probably answering the question: we have some control, or we can enable some control, for certain cells to vote against what Linux on the root cell side is doing in reconfiguring. I also wasn't able to create a new cell while it was locked. You saw it before: up here, the cell state said "running, locked". So it not only locked its own shutdown, but also locked the reconfiguration of the system with respect to creating further cells, which might then have an invalid configuration. And therefore we said, OK, we lock down the system. That's one model of doing it. So now we have all CPUs under Linux control again, and we can even shut down the hypervisor. Not very spectacular — it's just unloaded, and Linux is in control of the hardware again. So you can do a quick development cycle this way and reconfigure things. So yeah, that was just the starter. You can basically replay it, as I said, based on the README that we have. Where was I? Let me check if I covered everything. Loading, yeah. Let's start. Yeah, the basic steps. So actually, there's more possible — just to give you an idea of what we want to do later on. As I said, Linux can also run as a non-root cell. So the second setup I'm going to show you is actually a non-root cell with three virtual cores, running Linux inside, and even doing some inter-cell communication. This is basically the setup that I'm going to produce now. Need to enable things again. I need to — oh, is it correct? Yes, this is correct. So I prepared something. This is a lengthy command line which — well, it creates a cell according to the description of this x86 demo cell, it loads a kernel image here, it loads an initrd image.
It provides the kernel some command-line parameters to do some printing, and it even provides an IP for it. And... it doesn't work. Let me think. Why doesn't it work? Why doesn't it work? Probably because I forgot to set a link, yeah. Damn it. So this one is actually implemented as a Python script, in contrast to the basic commands, which are part of a C-built tool. Ah, wait, does this use a local libexec? I hope this works. And I'm afraid it will be — yeah, there's more broken. Great. Let's try something else. It's a path issue; I didn't properly install it. Ah, yeah, now it works. I didn't properly install it in the virtual machine. So, it was too quick for us; let's scroll up. So it actually created a cell here, the Linux demo cell. It assigned three CPUs to the cell, it loaded some kernel fragments, and then it booted the kernel. So you see the kernel prompt here — the boot log of the non-root cell on the virtual UART — up to the point where I can log in to this thing. And looking at the CPU info, I have a three-core machine running here, apparently, which is also what we should see here. Yep: three cores assigned to the secondary Linux cell, one core remaining for the root cell. That's the setup. And I should even be able to SSH to this thing. Yes, this is my machine again. So this even has an inter-cell communication channel implementing a virtual network, and that network is now exposing an SSH daemon on the other side, so I can SSH over to the other side. That's just to give you a perspective of where we are heading — this is possible on a physical machine as well, and also this example. Well, except that the fragments are not available in the source tree — the kernel and the initrd you have to build yourself — but the configuration and the other tools are all there, the example as well. So you can reproduce this case, too. Okay, so let's go back to the slides. Yeah.
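The lengthy command line behind that Python script looks roughly like the sketch below. The cell config name, the paths, and the option letters are placeholders from my memory of the demo; `jailhouse cell linux --help` has the authoritative set:

```shell
# Create the Linux non-root cell, load kernel + initrd, and pass a kernel
# command line plus an IP for the virtual inter-cell network, in one step:
jailhouse cell linux configs/linux-x86-demo.cell \
    /path/to/bzImage \
    -i /path/to/initrd.img \
    -c "console=ttyS1 ip=192.168.19.2"
```

As noted in the talk, the kernel and initrd images themselves are not shipped in the source tree; you build those yourself and only the cell configuration comes with Jailhouse.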
So let's go for some real hardware. You probably can't see it — I can hold it up — but yeah, anyway, there's a mini-ITX board lying on the floor, actually running. This thing is a Supermicro, well, micro-server board. It has a Xeon D on it. That's an Intel embedded — well, kind of embedded — server CPU: eight cores, two threads each, up to two gigahertz. The nice thing about this regarding real time: it has no on-chip GPU, and that gives lower latencies. You will see this. It has plenty of RAM, it has plenty of interfaces: besides two gigabit Ethernet ports, it also has two 10-gig Ethernet ports. And yeah, I have a UART attached to this. So this is our target system that we want to run Jailhouse on. So what do we have to prepare? One of the first things — because otherwise we'd be tapping in the dark — is some UART connection. I know that not all of these boards have UARTs anymore these days. Sometimes it's critical, sometimes problematic. You can also use some PCI adapter for this if you don't have one, but this board fortunately has a UART, so I plugged it in. Furthermore, of course, as I said before, you need VT-x and you need VT-d. This has to be enabled in the BIOS — sometimes it's off. You may have to disable the Trusted Execution Technology, usually used for secure booting and things like this — a common conflict. Yeah, kernel-wise, again, it should be more recent, 4.4 or later. Currently we don't need patching for this. For some features we need additional drivers, but just to boot it up, we don't need a patched kernel. Furthermore, what you should prepare before or when booting the system: provide some kernel parameters on the kernel command line, disabling the IOMMU — the Intel IOMMU — on the Linux side, but enabling interrupt remapping. We will see later on why this is required. Then you can start up the system. Let's see where the console is. Let's get rid of this one first. Why isn't it working? Yes, okay.
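On the kernel command line, the two settings just mentioned would look roughly like this. Treat the exact option spellings as an assumption to be verified against Documentation/admin-guide/kernel-parameters.txt for your kernel version:

```
intel_iommu=off intremap=on
```

The memmap reservation that also has to go on the command line comes up a bit later in the talk.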
It tries to reverse-DNS me and I'm not in a network. Okay. So I hope you can see this from the back. So I'm logged in on this machine. It's — well, in this case, I'm somehow a SUSE guy — a standard SUSE machine. What do I have running here? I think it's a — yeah, it's a 4.8 kernel, so pretty recent. As it's such a big machine, I can easily do the compilation on the same machine. So I already have Jailhouse here, and as I'm building natively, I can just do a make. Let me check. Well, someone already prepared something. Yeah, it's already there. So what we need now, first of all, is a system configuration file. And on x86 we did some work already, so we have a config file generator. Because what we actually need for Jailhouse is a loadable description of the resources that we're going to assign to the guests and, well, that we also use on the host side, on the root cell side. So we need a description at the level of memory ranges, of CPUs, of course, and of interrupts and things like this. And all of this — or most of this — can fortunately be generated by looking at the information that Linux already collected about the system, and then generating a configuration file out of that. So let me check what I had here. Yeah — by the way, if you noticed, we have command-line completion, thanks to a student of ours who did this work. It's quite convenient. So, config create: we want to create this config. There's a special folder, jailhouse/configs, and everything you put there which looks like a C file will be compiled into a config file later on — at least the build system tries to do this. So we will do it in this folder here. And for reasons I'm going to explain later on, we did some tuning here: we tuned the hypervisor memory to be only six megabytes, and the memory for the inmates, for the non-root cells, to be 25 megabytes. Okay, this just ran and didn't report anything.
So what it did, basically: it tried to generate something permissive, a permissive policy — enabling Jailhouse to start without immediately throwing violations or crashing or whatever. If you really want to confine the hardware usage of the root cell and the other cells, you have to do some fine-tuning later on. So this is not a really detailed configuration with minimal privileges, but it's the one you can use to get started. Furthermore, the script immediately sets up the first UART, the first PC UART, as the debug console. You can also tune this later on, depending on your system setup — you just have to know this. In this case, it usually matches. Furthermore, what you can do now — what you should do now, because we just created this thing — is build the configuration. And you see, we created from the C file an object file, which is then again our .cell file, which is just a raw binary description of the configuration. Well, it could surely be done in a nicer language — not talking about XML, but other languages are there to do this. But yeah, this is how things evolved, and they work quite well so far. Yeah, what you can also do right now, if you're not sure whether things will actually work in the end — because on x86 we have some dependencies on certain hardware features to make it work — there's now a built-in hardware-check script, where you pass in the config file that we just generated. And that script will analyze whether your hardware, whether your VT-x, has all the nice features underneath that we need. And the check passed. Lucky. It also checks, of course, whether you forgot to enable something in the BIOS and things like that. So yeah, all fine. We can actually start off. Yeah, just to give you a brief idea of how this config file looks: it's somehow a C file, as I mentioned. So it basically just defines the structure. There are a lot of fields in it; we will look into the details later on.
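The generate-build-check round trip from the demo boils down to three commands. The flag names and file names below are how I remember them from the demo, so double-check against `jailhouse config create --help` before relying on them:

```shell
# Generate a permissive system config from the running Linux system,
# shrinking the hypervisor and inmate memory pools as in the demo:
jailhouse config create --mem-hv 6M --mem-inmates 25M configs/sysconfig.c

# The build system compiles every C file in configs/ into a binary .cell:
make

# Verify that CPU and BIOS provide all hardware features Jailhouse needs:
jailhouse hardware check configs/sysconfig.cell
```

The hardware check is worth running before the first `jailhouse enable` on a new board, for exactly the reason shown a moment later in the talk: when something is wrong, the failure modes are not gentle.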
And it's actually pretty long. If you wanted to write these 2,400 lines yourself, it would take you a longer afternoon. So luckily we generated it from the system description. Yeah, okay, so we can do the same as in the virtual machine. We just did — oh, wait. No, the first thing we have to do is insmod the jailhouse driver, of course. Okay, and then we do jailhouse enable. If you are doing these kinds of things and you're not really sure whether it's going to work, it's better to sync to disk first, because if things go wrong, they go really wrong. There we go. And that's the physical UART I'm attached to here on the other side. You see that there is something booting up on quite a few cores — the threads also count as cores, so we have 16 cores running. A lot of devices are being assigned to the root cell. And oops, something went wrong. So this is what you get, well, if not everything works out of the box. And this is what we are here for. So the hypervisor started up; it said "Activating hypervisor". And for a while I was actually able to type. But then something went wrong here — some error report from the hypervisor: a read access to some memory-mapped I/O or RAM, you can't tell, at that physical address. And that was the register state of the guest at this point. And that's also interesting, or important, to know: right now there can be only one cell running, the root cell, but later on, if you have multiple cells and one of them crashes, it may be the root cell — that's not a good thing — or it may be another cell. Then you may recover from it, and you have to look into the right cell configuration, of course. So in this case, we would have to look into the root cell configuration. So yeah, that machine is now — oh, it's still working. Yeah, we have enough cores, so one core less is not really that critical, luckily. So we can just cross fingers and do a cell list. And yeah, one core failed. Too bad.
So you can no longer disable the hypervisor. You can basically just wait for Linux locking up, because one core is now dead. So let's just reboot — if you still can; otherwise you do the hard-reset thing. I'd better do the hard-reset thing. And as this is a nice server board, we now have plenty of time to do the recovery, because this takes an eternity in the BIOS. A question in the back, yes? Well, we don't take away the hardware. We just take away the control over the hardware, but we give it right back to Linux. That's what this initial cell configuration is about: it describes the hardware as Linux saw it before and re-grants the access to the hardware. But what we can do later on is remove the access to certain parts of the hardware, like some cores. And that, of course, would either make Linux very unhappy — because it would continue to access the hardware and then crash — or Linux can step back from the hardware. And that's what we do: we offline a CPU, we unbind a driver from a device, and then you can take away the access to it and grant it to another cell, and that cell will be able to use it. So right now, the configuration is supposed to be permissive, and everything should work — except what is not covered by the configuration script. And that's what we saw just now. So apparently there is some gap in the description, and therefore we saw the crash, or rather the violation, that the hypervisor reported to us. Ah, right — I forgot one important step, by the way. But let's go this way to resolve this first. So, I already pointed out that we saw an access violation — a read from RAM at this reported address. We also saw that the CPU which caused this access violation has been parked: basically nothing is running on it anymore, specifically not Linux. And that was the cell, basically, which was in control of the CPU at the time.
These kinds of crashes can have further reports associated with them — specifically if the access happens via an assembly instruction that Jailhouse doesn't support. It only supports a very small subset of x86 instructions, which is sufficient to run Linux, but not necessarily sufficient to run Linux when it's violating its access rights. So you may get additional reports; ignore those. Important is, first of all, the information about what has been accessed and by whom. And then, as we will see later on, we have to look into the configuration of the system and see why we have no coverage for this. When the system is up again, we will look into /proc/iomem, and we'll see: OK, associated with this address here, there is an address range that is reserved. And reserved addresses are, per the policy of the configuration script, not given to Linux. Although there is something behind it: it's the ACPI Platform Error Interface, that thing, and the driver behind it in Linux probably accesses this region, only for reading. And that caused the violation. So the first step, of course, is to permit the access. We will do this later on, when the machine is up and running again. Meanwhile, I can briefly explain the step I forgot, because after creating the configuration file you actually have to do one further thing. The hypervisor takes away a lot of resources from Linux, but we don't do memory hot-plugging or hot-unplugging. So the memory that the hypervisor, as well as the other cells, is supposed to use later on has to be reserved in one way or the other, prior to booting the system or during boot. There are two ways to do this. On x86, it's done with memmap: you can specify a specific base address of RAM and a specific size to be reserved — taken away. That is the syntax you see below.
You take this away, basically, from the memory Linux uses as RAM. That's one way. And the other way, which you also use on ARM, is simply to say that the RAM is smaller than it actually is. So you take away a bit of the top memory and leave it unused. But of course, you can specify later on, in the hypervisor configuration, that this is actually being used by the hypervisor or by other cells. So the layout that we are using here on x86 looks like this: the whole thing is physical RAM. Right in the middle, we usually use a fixed offset, simply to make the configuration files reusable — specifically those of the non-root cells — on different hardware platforms, virtual and real ones. So the offset here is more or less fixed. Then we use a certain range of RAM for the hypervisor code and data — by default it's way too much right now, I think the default is 64 megabytes, but we can run, as I specified, in only 6 megabytes. And we also reserve some RAM here for the cells to live in. And you can have more RAM underneath or behind this on x86. Important is that you give this piece of RAM — as well as everything else, including this one — to the root cell in the root cell configuration, so it can access it, while you reserve this whole region during boot-up, either via memmap or via mem=; on ARM, you can also do device tree tuning. So the step I actually forgot to do on this machine is to prepare this kind of reservation. And grub — grub is a nice tool which gives you a lot of fun with the latest versions: if you specify a dollar sign in your command line, you have to quote it. So the common mistake: if you don't properly escape this dollar sign — and even escape the escape sign — you won't get the proper command line. And if you don't get the proper command line, there's no reservation. And it still works, because we have 32 gigs of RAM, so usually Linux is not using this part.
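The dollar-sign pitfall can be checked outside of grub. The snippet below simulates the two rounds of shell-style unescaping that a value in /etc/default/grub goes through (once by the grub-mkconfig shell, once by grub's own parser); the base address and size are example values, not the ones from the demo machine, and the triple-backslash form is the commonly documented one for GRUB 2 — verify it for your grub version.

```shell
#!/bin/sh
# What the kernel must ultimately see on its command line:
want='memmap=82M$0x3a000000'

# What you would write in /etc/default/grub (triple backslash, because the
# value is unescaped twice on its way to the kernel):
grub_default='memmap=82M\\\$0x3a000000'

unescape() { eval "printf '%s' \"$1\""; }   # one round of unescaping

round1=$(unescape "$grub_default")   # after the grub-mkconfig shell
round2=$(unescape "$round1")         # after grub's own parser

echo "$round2"   # -> memmap=82M$0x3a000000
```

If you drop one level of escaping, the `$` and the base address get eaten, the kernel sees a truncated `memmap=`, and — as the talk warns — everything still appears to work until Linux actually touches that RAM.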
We're just giving it to the guest or the hypervisor. But when Linux actually uses this RAM, it will cause a violation, and you will get a crash as well. OK, so let's go back here. I should have turned off DNS reverse lookup. So let's try this in /etc/default/grub... and then regenerate grub.cfg — ah no, wait, on that machine I'm still old school, it's actually grub 1. Luckily, I already had the entry in. So you see here: memmap, reserving some half gig at the default base address we use on x86. So we were on the safe side and nothing could go wrong — though for educational purposes, I should have done this interactively anyway. So that went well. What didn't go well was the configuration. We have to go into the jailhouse tree and open the generated config file. I wanted to show you first where this comes from. So this is what you get; this is what the config generator script produces. It skips all the reserved regions, and if you're unlucky, one of those reserved regions is actually used by some BIOS-related driver, usually a firmware-interface driver. In this case, we missed out this whole region here; it should have been part of the configuration file. So let's fix it up — we will look into the structures later on. What you basically have to add here is another entry of this struct jailhouse_memory type. We already have some 53 of them; we need one more. Let's add one and find some space for it. The list should be sorted — that's not technically necessary, but helpful for finding things. And I've prepared something... yeah, here we go, we can just enter this. That's basically the description: we enter the physical address of the region and the virtual one. For the root cell, we do a one-to-one mapping, so physical equals virtual; on non-root cells this may be different.
The size of the region is derived from what /proc/iomem reported, plus the kind of access you want to permit: read, write — and execute just to be safe, although you probably won't execute anything from this region. And it should be DMA-capable; well, Linux probably won't use it for DMA, but better safe than sorry and crashing again. Make it, compile it, hopefully. Then insmod and jailhouse enable. And — proven by the absence of errors; well, you'll have to believe me — it will work, at least long enough for the talk, I hope. So, let's look at the format of the configuration file, which is nicely undocumented. But you can read source code, I suppose, and a bit of it is covered by the following slides. As I said, we are generating configurations from C source files right now. That's not a requirement; it's just the tooling we chose at a time when there were only a few lines of code. The format itself consists of several C structures, which you can find in the header file given here. We differentiate between two types of configuration files; both are called .cell — a good thing? I don't know. One kind of .cell file is actually a system configuration, consisting of a system configuration header followed by the actual cell configuration. The other kind is just a cell configuration. Thanks to Ralf in the back, we now even have a tag at the beginning to differentiate the two kinds of files, so at least the hypervisor will bark at you if you pass the wrong file to the wrong command. The system configuration header basically describes where in physical RAM the hypervisor is located and how much memory was reserved for it. It also specifies the debug console: its address and size, depending on its type — we will see later that you can have both classic PC UARTs and memory-mapped ones.
Furthermore, depending on the platform, there is a description of key parameters the hypervisor needs to know. We used to have ACPI support in the hypervisor, i.e. parsing ACPI tables, but it's actually much simpler this way: the config generator script extracts all the static information we need, puts it into this static structure, and leaves it there for the hypervisor to consume. In the case of x86, that's for example the location and size of the IOMMU units and some other parameters you simply have to know. On ARM, you would use a device tree, but we don't have a device tree parser, so we also extract the few parameters we actually need and put them in this kind of header. Then there's also a strange field called interrupt limits. Usually you won't have to tune it unless you really have a lot of devices, or a really big machine with a lot of interrupt lines or MSI interrupts. It basically defines an upper limit for some static resource allocation, so if you hit some strange error messages, that's the knob to tune; normally you won't hit it. After the system configuration header comes the root cell header — the cell header, in this case for the root cell — containing the name of the cell, a flags field (zero for the root cell), and the number and size of the resources; then the resources follow. So what kinds of resources do we have? First of all, a bitmap describing the CPU set you want to give to a certain cell. Of course, if you are booting the root cell, you had better give all the CPUs in your system to the root cell; otherwise something is missing and you will see crashes. Later on, of course, CPUs are removed as you create new cells — you will see this. Same for memory regions, of which we already saw a fragment.
Each memory region contains a physical address, a virtual address, a size, and a flags field. There are various flags, controlling the kind of access you want to permit: read, write, and execute access — execute is, I think, not really enforced on all platforms; that could be fixed, but we don't do it right now. Then there is the DMA flag, which says: this region is also relevant as a DMA target, so devices should be able to access it. All the devices of the cell gain access to memory regions that have the DMA flag set; if it isn't set, you will get an I/O violation from the IOMMU. The IO flag marks a region as a memory-mapped I/O range — on x86 it's not really that important; on other architectures it is. Then, if you're unlucky, some I/O resources may not be aligned to page granularity — a page usually being a 4K thing — so you can't hand out a full page just to grant one device access. Fortunately, we can nowadays split pages into smaller pieces. This comes with a performance penalty, of course, but at least it can be modeled, in case you really have this kind of weird hardware — usually not on x86, but on ARM you see it once in a while. The flags IO_8, IO_16 and so on specify the permitted access widths on such a page, because if you access a device — specifically on ARM — with the wrong width, you may trigger interesting hardware effects or faults. So you specify here what the hardware actually implements and is capable of; it's a safety measure. Same for unaligned accesses: not an issue on x86, but on ARM it is. This flag says unaligned accesses are possible; otherwise they will be rejected and the system continues to live.
Then we have a special communication region: an info and signaling page between the hypervisor and a non-root cell — the root cell doesn't make use of this. If you set this flag, the region, usually one page, is tagged as that communication page. Then an interesting, well, important one: you saw that we can load images into non-root cells after they have been created. This is done by the root cell, so the root cell needs access to the memory of a non-root cell before that cell is started. That is controlled by the loadable memory region flag: if you set it, the root cell gets access to that part of the non-root cell's memory as long as the cell isn't running. That is required to load in some binary — code and data — initially, and also later on: you can shut the cell down again, and it becomes loadable again, and you can load something new into it. Without that flag, the root cell will get a nice violation message if it tries to load something into the cell. Another flag is rootshared. That's used for shared memory — it's also used in nasty hacks for sharing a device between two cells, which is not really an official solution, of course, though there are corner cases for it. Usually you use it for shared memory: if you declare a memory region rootshared, the region will stay mapped in both cells. Because what normally happens is: if you assign a memory region to a non-root cell and create the cell, the memory is taken away from the root cell at that point — no more access possible. With this flag, it stays in both address spaces. Okay — this is basically what we already did, and it works: fixing the missing memory region, adding the entry for the APEI region. Here is another one I wanted to show you.
You can also get violations if a cell on x86 accesses some port-based I/O. In this case, I just revoked access to the RTC, and of course the root cell eventually stumbled over that and caused this kind of error. I didn't include the error messages for the IOMMU in the slides, unfortunately — if you forget to mark some memory region as DMA-capable, you also get a nice error. Maybe I can make this up quickly — well, we'll see how quickly this works. System RAM, system RAM... there's plenty of it, and it may not trigger immediately, but let's just try it. Of course, if you change the root cell configuration, you have to disable and re-enable the hypervisor. Apparently I'll have to take away even more of the RAM. What else? This one is small, small — there must be a bigger one somewhere... yeah, here. There's a good chance that if I revoke this access, it crashes. Okay, disable, enable — yeah, bingo. Okay, this is not on the slides, but also interesting: this is now the IOMMU complaining that some device tried to access a physical address that is not in its page table, and you get this kind of fault message. What you can read from it is, most importantly, the device that caused it — the PCI ID, i.e. bus, device, function. So you can look up the device in your cell configuration and check whether it's really part of the cell, or whether you messed something up there. What you also get — and here you really have to go to the specs; we don't have a nice parser for it — is the raw information as VT-d reports errors. Reason 6, if you look it up, is probably a read or write access, and this here is the page address of the faulting region.
Anyway, usually it's about checking that the device really belongs to the cell, and that the RAM is properly marked as DMA-capable; otherwise you get this kind of violation. There is also some nice reporting if interrupts are not properly remapped, but that's all low-level stuff — if you're totally lost on it, just drop us some mails and we'll try to resolve it. So, back to the configuration format. The port I/O that can cause these violations is described with a bitmap: a clear bit means the access is permitted — modeled after the hardware. You'll notice all these things are modeled to make the hypervisor's life convenient, not the user's. Right now we have a rather permissive PIO bitmap, specifically for PCI config access: those accesses are all permitted to the cells. This is going to be reworked later on so that only the proper ranges are permitted to the right cells. That's it for port I/O. Then, interrupt pins. This is not that important for x86, because you normally don't take many interrupts via the pin-based legacy path — the IOAPIC — anymore; most are MSI or MSI-X these days, but some still have to. In that case, you can assign interrupt pins as long as they belong exclusively to a certain cell. There's a structure for this, describing which interrupt controller chip you want to partition, identified by its MMIO address. The ID field behind it is also filled in by the config generator, to identify the chip on the IOMMU level — don't mess with it. What's interesting for you is the pin bitmap: per entry, per record, you can specify up to 128 pins. On x86 you usually have only the 24 pins of an IOAPIC; on ARM, of course, you have plenty of them. And even on x86, I recently enabled a 192-pin IOAPIC setup — I didn't know such a thing was possible; it is. So pins can add up.
So you may need multiple of these entries, and therefore we have not just the pin bitmap but also an offset — a first-pin field — describing where the bitmap starts in the pin range. Good. We also have, but this is only for advanced users and is also going to be reworked, a description for CAT, Cache Allocation Technology — Intel's partitioning mechanism for caches. Currently it only supports L3 partitioning; L2 is, well, work in progress. You describe your cache range — for example, on this board it's 12 megabytes — and you can assign cache fragments, at roughly one-megabyte granularity, to a cell: exclusive access to certain parts of the cache for a specific cell. That's useful, of course, if you have very tight real-time control loops and want to avoid the root cell evicting the non-root cell's data from L3 or other shared caches. Last but not least, there are the PCI devices. You saw the long list we had — the large file consisting of several thousand lines, mostly relating to PCI devices and their capabilities. I won't go into all the details of what is described there, because most of it is filled in by the config generator, so you usually don't have to deal with it. You just have to copy certain fragments if you want to transfer a device to another cell, as we will see later. A PCI device entry consists of a number of parameters, including an index — a link, so to say — into a PCI capability list. PCI capabilities live in a special range of the PCI device's configuration header. They can control certain parameters of the device, and some of those parameters unfortunately have global effects. For example, you can program certain physical addresses there, and for various reasons you shouldn't allow such direct accesses.
Therefore we have an access control mechanism where you describe which capabilities of a PCI device a cell may access and which not; that's the whole idea behind it. These entries are prefilled from a knowledge base of capabilities we know to be safe — but not all may be, and we will see on an example that some capabilities actually require tuning. Another thing to look into is the debug console. Right now it is, by default on x86, configured to be the first PC UART, but nowadays embedded systems, also on x86, come with memory-mapped UARTs, so you can configure that too. Furthermore — there's a documentation file for the details — you can even use a VGA console for basic output if you don't have a UART. It mostly works; there are some corner cases, some tricky things to get it working, so look into that documentation file if you're really out of UARTs. But I can only recommend using a UART, simply because the VGA console scrolls away and you won't get all the messages if you're unlucky. Okay, so now we want to create a non-root cell. We have a working setup, so let's split it up and risk making it non-working again. The non-root cell, as I said, consists of the header and the resources we saw before, and creating it removes those resources from the root cell. Not all resources need to have been part of the root cell before, but usually you configure the root cell that way — except CPUs, which really do have to be part of the root cell, simply because of the setup process: the setup has to run on all CPUs initially, which means the root cell initially has to control every CPU in the system. So when you create non-root cells, they take away the resource access.
Technically, the hypervisor doesn't care whether the root cell actually stops accessing those resources — that's out of scope for the hypervisor. It's something we handle in the kernel driver we've written: when the driver creates a non-root cell, it offlines the CPU cores that cell is going to use — very helpful — and it also unbinds the PCI devices that are going to be assigned to the other cell. That also works automatically, so we're already on the convenient side there. And if you destroy the cell again, the resources are given back to the root cell: you can start using the CPUs again, you can start using the PCI devices again. If Linux didn't do this, well, you'd get the violations. In the QEMU example earlier, we saw the case where a cell said: no, I'm not going to be shut down right now. That's the voting mechanism, and it's controlled by a cell flag. By default it's on, so you really have to set this flag to make a cell non-voting. That is useful, of course, for debugging and any kind of testing, because if a voting cell crashes, your root cell may no longer be able to shut it down. There are some cases which are detected — if a cell crashes visibly to the hypervisor, the hypervisor says: okay, this cell can't vote anymore, you may shoot it down. But if the cell just spins in an endless loop and doesn't reply to the voting requests, you won't be able to shut it down anymore. That's the feature — but for development it's not a useful one, so set this flag in your own cells. That's basically the voting model: the permissive, open model is what you probably want during development, and maybe also in other scenarios where you really trust Linux as the managing environment.
In that case, you leave out the voting feature for all the cells. But for our safety scenarios, you really want this feature, because it lets a cell control the point at which it is shut down — the point where it releases certain critical hardware, or stops controlling it. Usually this is used to bring the physical process into a safe state before really shutting things down. That's why there are two models. Now, for actually creating a non-root cell configuration — I'm sorry, there is no tool support for this yet. It's possible, of course; someone just has to write it. Right now, you really have to use the source code and do it yourself. There are many existing examples from the demos: a tiny x86-specific timing demo, APIC, IOAPIC, whatever. You can look into all these C files, see what they do and which fragments are there, and derive your own from them, using the information about the configuration format and your specific use case. Basically it's about copying fragments over from the root cell: if you want to assign a specific device, copy its entry over and add it to the non-root cell configuration. That's the current process — a bit tedious, but it could be improved, and anyone who wants to hack on it (currently we do Python scripting around this) is welcome. It's not really rocket science, just work. Okay, let me see — I'm online again. First I have to rework some demo things here to make them work again. Oh, some error — let's make it build first, you'll see. Yeah, there is something... oh yeah, okay, I see. Something is not correct... yes, this one. Yep, let me briefly look into the script. So, first of all, let's see how far we get on this machine. Enable — oh wait, not yet. We should have a working system again.
Yes, let's do the same thing we did before in QEMU, but now on real hardware. Load and start — and it's running. You see the numbers are more decent. Actually, I think it's currently slightly mis-tuning the timestamp counter, so in reality the latency is lower than a microsecond. This is a timer interrupt running on the assigned core: whenever the timer fires, it measures the scheduled time versus the actual entry time of the interrupt handler. It's a bare-metal demo — not a full Linux operating system or an RTOS — but it shows basically how low you can get on this machine. Which is quite decent, I think: around a microsecond of latency. It's okay. That's one thing; let's destroy this cell again, same as before, because what we actually want now — damn it — is to start the non-root Linux. But it complains about something. Ah, yeah, another interesting case to look at. Where's my mouse... So what happened? This configuration already contains a fragment that assigns a physical PCI device to the non-root Linux cell. And because it contains that fragment, the Jailhouse driver tries to be kind and removes the device from the root cell first. While doing so, it apparently triggered a corner case which our configuration didn't cover yet. Specifically, as I said before, there is access control over the PCI config space — which accesses are permitted and which not — and this is a case apparently not permitted by the default configuration. I have a slide for this. So we have a config space write — that is the interesting information — on the root cell, yes. And the other interesting information is here in the last part: this value.
Again, this encoding is not really convenient for the reader, but it is convenient for the hypervisor, so we have to dissect the value ourselves — and I prepared something for this. The highest bit can be ignored; it's just an enable bit. The interesting part is the next nibbles: bits 23 to 8 are the BDF, the bus/device/function encoding. If you can read hex, you can read from this that it's bus number 6, device 0, function 1. So now we know which device apparently caused the access. And the last byte encodes the config space offset of the access. Okay. So, if you look into our configuration file again, you will find there is already something prepared for this capability: the config generator detected it and left an entry here. There is a capability starting at offset 0xA0, with a length that apparently covers this space, and it has ID 10 — which, if you look it up in the PCI specification, is a capability structure that is apparently harmless. So let's declare it harmless. And if it's harmless, we can just set a flag here which permits the access. That would be the step to permit this specific access, and then you're fine in this regard. Of course, the real story behind this is making an educated decision about these kinds of resources. In the long run, we will try to extend our automatic knowledge base to cover these cases too, so that you have to interact less with the hardware here. But sometimes this information is not obvious and really depends on the concrete device — what you can permit and what not. Okay, so this machine is broken again and has to restart. Yeah. So, while we are waiting for it...
Right — so let me briefly look at the non-root cell Linux setup. Basically it's again cell create, cell load, cell start, but now with fragments that together can boot a full Linux system. As these are multiple fragments at different addresses, we also wrote a Python wrapper script for this: the cell linux command is, behind the scenes, a Python helper script which issues the same basic low-level commands, but with the right knowledge — this piece is a kernel to be loaded at this address, this is an initrd to be loaded over there — and it creates all the necessary commands and issues them. That's the magic script to start up a non-root Linux cell. On x86, you need some patches for the Linux kernel to be able to boot in this environment, because it is very restricted: a lot of hardware is simply missing. This is not a full PC — it's a core with some memory and a few arbitrarily hanging-around devices, not a real machine. Fortunately, the patches are pretty non-invasive and sit in our patch queue; eventually we will push them upstream, and then this becomes a standard feature. There's also a documentation file in the source tree about the steps to get this running — read it for the details; it also describes what you need configuration-wise in the kernel, and things like that. So, let's see if we're up and running again... not yet, it takes a while. So, what we were trying to do before this capability crash came in: the mission was to add a physical PCI device to our non-root Linux cell. We have plenty of gigabit Ethernet ports on this thing, so the mission would have been to hand one of the gigabit PCI adapters to the non-root cell.
The steps required for this: look at the root cell configuration and identify the resources associated with the physical PCI device. Obviously, that's the PCI device entry in our table, which has to be copied over into the non-root cell configuration file — you don't have to remove it from the root cell, because that happens automatically when the non-root cell is created. Along with the device entry, you also need to copy over the associated PCI capability table entries. Then you have to identify and add, on the non-root cell side, the memory-mapped I/O regions of the device — you can check the /proc/iomem entries for the specific PCI device for this. And depending on the kind of device, you may also have to enable certain port-based I/O ranges in the non-root cell configuration. Watch out for adjusting the counts for the cell, of course: the number of devices, the number of memory regions, the capabilities. If you copy over the PCI capabilities, you also have to adjust the index, because, as I said, the PCI device entry points into the capability area at a certain index, and that index will probably change when you copy things over to the non-root cell — so adjust it. Another trap to stumble over: when you assign MMIO to the non-root cell, do not map the page that contains the MSI-X table. That is usually part of one of the memory-mapped I/O BARs of the PCI device; its address is part of the PCI device structure — you can read it up there, and also in /proc/iomem. This mapping is also not present in the root cell configuration you generated, so unless you really tweak something, you normally can't break it. But if you do, things will go crazily wrong, because this table basically defines where the interrupts from that device go.
If you let the guest configure that table natively — well, the system stays safe, it will just crash the guest, because these entries have to be adjusted by the hypervisor, and therefore this page has to be intercepted so that control stays there. That's one of the pitfalls you can run into if you're not careful when transferring the information. Given the time, I guess I'll skip the actual demonstration of this. Some more brief notes, also because of the advanced time. We also have a shared memory device. This is pre-configured if you take the Linux x86 demo cell configuration. It's not pre-configured in the root cell configuration that you generate with the config generator, but it is in the QEMU cell description, because there we already prepared something. So if you want to enable a shared memory device between your non-root cell and the root cell in a self-created configuration, you basically go through this: the shared memory device is a PCI device as well, so it has to be listed in the list of PCI devices. It comes with a specific flag declaring it a virtual device, and the association between this device and its peer in the other cell is based on the BDF — the bus/device/function ID — and the physical memory region. Both have to match, and then the two form a channel. That's all the magic behind it: if you create these entries on both sides, the cells have a connection and can talk to each other over shared memory, and via interrupts as well, this way. I'm not sure — ah, Mans is not in the room. Over this we run a virtual network: shared-memory networking. Mans Rullgård wrote the Linux driver for this; it's still work in progress, not completely perfect, but it's working nicely.
You saw it earlier when I SSHed into the other cell. For this, assign a decent amount of memory to the shared region, because that's basically where your packets go, and it shouldn't be too small. That's it for shared memory devices. Okay, just a brief look at the pitfalls waiting for you. With as little tool support as we have right now, everything you do can break things spectacularly. We are thinking about some kind of checker for the configurations, but until then you really have to look twice — and things can break in ways other than the clear violations I showed before. The typical mistakes: configuring overlapping regions whose access permissions then contradict each other, and permitting direct physical access to resources which actually have to be moderated by the hypervisor. If you hand out direct access to the APIC, crazy things can happen, same with the others. So double-check your memory configuration and memory regions against these addresses. Things can also go wrong with indexes: the shared memory device, for example, has a field that is an index into the area of memory regions, describing which physical memory is supposed to become the virtual shared memory. That index changes, of course, if you add a region above it and everything shifts — another easy trap — and the same goes for the PCI capability index I mentioned. So this is really screaming for better tool support, and if someone is willing to work on it, we would be very happy. For now, it's about being careful and thinking twice. Good — enough x86, and since it's moving so slowly, let's look at an ARM board. This is now my new toy here: a HiKey board.
This is a Cortex-A53 eight-core machine with more or less decent performance and a little bit of RAM — I picked the bigger one with two gigs of RAM — and some built-in flash memory. It even has Wi-Fi and Bluetooth on it. It doesn't have Ethernet, so I need this dongle here, but at least it has USB. Now let's try to boot this up. Where are we? Oh yeah, it's already bubbling. Where's the power? I was lazy because I only enabled it just a week ago, so it's still the original Debian on it, working fine. What I actually did is replace the kernel, because I don't like vendor kernels. Fortunately the 4.8 kernel is booting fine here, except that the Wi-Fi driver throws some strange exceptions once in a while. But anyway, we don't use Wi-Fi today. So it's running, it's fine. In this case I cross-compiled the whole thing and already installed some stuff, so it's not insmod — it's actually modprobe, properly installed, so it works. And maybe let's go for the Ethernet connection, so that we don't mess with the same serial console we are starting the hypervisor on. This will take some time again. Hopefully it works. Okay... no, no time for this, I know it works. So, just the same thing, now in green, on ARM64. Things are working so far. The key differences on ARM and ARM64 compared to x86 are, first of all, that you don't even have a config generator here — so it's even more fun than x86. Fortunately we don't have PCI on this thing, which makes it a little bit simpler again. But it's also a bit tricky to get the initial setup running, because you usually really depend on the UART: if there's no output, you don't see why this thing locked up. So the first thing you really have to do is get the UART running.
There's a larger variety of UARTs on ARM than on x86, although x86 is trying to catch up on this. So this is one of the first things you have to look into and get running. Furthermore, ARM unfortunately requires some build-time configuration. Hopefully some of it we can reduce; some is probably inherent. So you have to tune more than on x86; it's less regular. ARM64 support is actually pretty new — thanks to Tony and his folks here; it has been contributed by Huawei. It was quite a long development, over a year by now, I think, since they started with it. I was about to merge it, and then I found some interesting defects on ARM 32-bit — and the code is shared — so I thought, let's first understand what's going wrong there. A lot of things went wrong, like interrupt handling and PSCI, pretty basic things. And after fixing all this, his patches no longer applied, so I was waiting for a rebase; that has been done now. So this board is running the latest version. It's still in a work-in-progress branch, so it's not upstream yet — and I'm not done with my talk yet, come on. But it's working nicely so far, and I think it's pretty much ready for upstream. So probably after I have flushed out the next branch, I will push it for review, and I would appreciate review on it, definitely by the original creators, but also by other folks. Also thanks, in this context, to many of the ARM folks who looked into some of the interesting corner cases I discovered on ARM, like cache management and things like that. I thought x86 was interesting, but yeah. Okay, so while it's still fresh: it's working nicely on the HiKey. We had it working on the AMD Seattle before, and maybe I will show the Seattle tomorrow at the booth crawl — I'm not totally sure, but I have it with me. It's working decently well on both right now, but the HiKey is less tested; it has only had a week of testing by now.
On ARM, in contrast to x86, we really do have to patch the kernel prior to using it — also the root cell kernel — but it's trivial: the patch just exports a symbol that is involved in the detection of the boot mode of a CPU. As we are messing around with that mode, we really need access to this symbol, and since our module is not part of the kernel, the symbol is not exported right now; there is no other external user. So a trivial patch is required, but it's maintainable. There are also some configuration adaptations you have to do statically: in contrast to x86, where you can load and enable KVM on demand, on ARM it's a static configuration. If you enable KVM on ARM, it is automatically enabled during boot, and as there is then a hypervisor already running, we can't run Jailhouse — so you have to disable it. Furthermore, upcoming — but I think not really in silicon yet — are the Virtualization Host Extensions, which enable Linux itself to run in hypervisor mode on ARM, and then there is no space for Jailhouse anymore. So if you are on that kind of hardware, you really have to disable this feature, because we rely on running the hypervisor in hypervisor mode. Then prepare the system, attach some UART — this thing here has a nice low-voltage UART attached, so that's working — prepare the boot kernel, and, as I said, you have to reserve some memory again. On ARM we don't have memmap; we basically just reduce the memory by capping off some of it at the top. And you also have to look into what kind of UART this board has and at which address it sits, because that is one of the first things you have to configure in the hypervisor configuration.
Yeah, so the steps to prepare: as I said, there is some build-time configuration needed, in contrast to x86, and it currently lives in a config header which does not exist in the source tree — one of the nastiest things we still have. You have to create this file, and if you bring up a new board, just look at the AMD Seattle as a reference and make it compile by adding these kinds of parameters. Then you can build it; on ARM, this is the command you have to pass for cross-compiling the whole thing. There is also nice documentation for bringing up the Banana Pi — a 32-bit ARM board — and a lot of it applies to 64-bit as well, at least conceptually. So if you look into this detailed description, you can probably derive a lot of valuable information for bringing up an ARM64 board. On to the system configuration. As I said, there is no tool, so we use the AMD Seattle configuration as a template. Adjust where the hypervisor is actually located, depending on your memory layout. The interesting thing on ARM right now — and I'm still not totally sure we really need it, I have to think about it again — is the JAILHOUSE_BASE parameter that we set here. This value is not arbitrary: you really have to put in the physical address where the hypervisor will be loaded later on, and that of course depends on where your reservation leaves off — where the reserved memory starts, that's the best place to put the hypervisor. So adjust not only the root cell configuration but also this define. And again, we need around 8 MB or less of RAM on the hypervisor side.
Next, the GIC: there are system parameters you have to transfer from your device tree for the generic interrupt controller on ARM. There are different regions — you can look them up in the device tree — and these values have to be entered into the system configuration for your target device. Then, and this is very simplistic here, you need at least two memory regions. Usually on ARM it's more or less regular: you can just assign the RAM in one chunk, and you can permissively assign a lot of MMIO space in one chunk, and that actually worked out nicely on this board. But one interesting thing — and some engineer at TI went through this very painfully recently: do not give a cell physical access to the GIC. If you do, things partly work, interestingly, because the hypervisor and the cell then try to acknowledge interrupts at the same time. That works for maybe two interrupts in a row, and then things just lock up, and it was very hard to debug until I spotted the overlap in the memory regions. So better check this twice before you enable things; if something goes wrong in this direction, that is really important. Something else you have to do more intensively on ARM than on x86, because interrupts on ARM are still mostly pin-based: you have to grant the interrupt pins to the root cell. Check how many interrupt pins the target actually supports — there is a limit. There can be up to about 1,020 interrupt IDs, I think; the first 32 are always assigned per core, and the others are shared peripheral interrupts. If you assign too many, you can actually make the hypervisor access registers that don't exist — that probably works, but it's undefined and may cause some interesting effects. So check what is there.
Regarding the UARTs again: maybe we are consolidating on ARM64 as well — luckily the PrimeCell, the standard ARM UART, was picked both by AMD and by HiSilicon as the UART on these boards. So there is a simplistic driver available; just pick it and be done. We have some others, but they are probably ARM 32-bit specific; maybe one of them could become generic as well, but we still have to find a board for that. If you have to write your own, it's not that hard, because we only need polled output: that's basically checking whether the write register is available, then writing something to it, and that's it. So look at the existing code; it's not that hard. Then compile, cross fingers, and try the new board bring-up — and that actually worked quite well on this one, so I'm pretty happy about it. This is how a violation can look on ARM: a little differently formatted, a little different information, but the key information is the same. So here we apparently have — this is a made-up thing, I provoked the access — a data read at a specific physical address; this is the size of the access, and that one is not permitted, so it's complaining. You can even see where it was happening: on ARM you also get a small look back into the call stack, so you can possibly look up in Linux where this came from, if it's really required. And again, this is the root cell which was causing it. Same story here: check /proc/iomem for why that region wasn't included; if it should be included, grant permission and move forward. We don't have an IOMMU implementation on ARM yet — it will come — so this kind of error can't be caught there, but of course if you then create a non-root cell configuration, you have to keep one-to-one mappings of the memory, otherwise DMA will go crazy.
That's because there is no one translating the device's DMA accesses from the virtual addresses to the physical addresses, so you have to have a one-to-one mapping. Fortunately, in contrast to x86, you can do this easily with a device tree: you can move the physical RAM in the address space more or less freely, and that is very important. If you look at the existing examples for the AMD Seattle board — and soon also for the HiKey, I will push this shortly — you will find the same pattern there and can derive from it. Let's briefly, just to show you that things are really working, run the GIC demo. The GIC demo is kind of the equivalent of the APIC demo on x86: a timer event is programmed in the GIC, fires regularly, and then the latency is measured. Load, start — and it's working. I think the granularity of the timer is a bit rough here, so you see at worst four microseconds of latency, which is not that bad, actually. On the AMD Seattle, for whatever reason, the latencies are much higher; apparently there is some hardware involved in this. On this one — I haven't run it for a day, but at least for some longer period — it didn't go up even under load, so it looks quite decent. You could probably build something out of this. So, that's the tips-and-pitfalls edition for ARM, a little shorter, because we are still lacking some features on ARM that already cause problems on x86. They will come soon; we are working on it. As I mentioned, ARM64 is going to be merged. I also mentioned that there is no PCI support yet, also no virtual PCI bus, and therefore no shared memory device yet. There are work-in-progress branches and some hacks for this, so hopefully this isn't a matter of months anymore, and then we have PCI support on ARM as well.
SMMU support will probably take a bit longer, but there are workarounds right now. The config generator is also a topic that should be worked on, but there are lower-level topics, at least for us on the core team, to address first — so that's a typical area where contributions are welcome. Then the Linux ARM inmate: for 32-bit there is a branch for this; for 64-bit all the fragments are there — 64-bit works already, 32-bit still requires a little loader. It would also be nice to have the wrapper script that we have on x86 available on ARM as well, because specifically on ARM 32-bit it's tedious work to inject the initrd and the kernel command line — you have to tune the device tree for this. So there is more work to come. Otherwise: more SoC support on ARM, more boards, is welcome. Currently TI is working on support for one of the 32-bit ARMs, a dual-core A15 I think, and we can extend on this. But there is a bit to be refactored in order to make extensions more easily possible, with less boilerplate code and fewer duplications coming up with each board we add — so some refactoring is needed there, too. I already saw some patches today to make at least something more consistent; that would be nice to add. That's it in a nutshell, and if there are questions — ah, I'm nearly done, go ahead.
So, Android would be Linux as well, so Android in a non-root cell would be more or less straightforward, on ARM specifically, because you only have to tune the device tree for it; you don't have to tune the system much otherwise. And yes, you can't use completely unmodified systems in the non-root cells: they have to be aware that something is missing. Either they are passed a device tree — then it's probably easier — or they have some hard-coded resources, and then you usually have to comment out some resource accesses in these cells. But we already have several enabled: there is a FreeRTOS port available on the net, we did it with RTEMS once — not upstream, unfortunately — and we ported an in-house RTOS more or less quickly to this environment on x86, which worked quite well. Other people are doing this as well, so there is not much effort involved. It's possible, but you need to have access to your RTOS sources — so no general restriction, but you probably won't get Windows running in this, because you can't modify it.
So, the question is whether we support older systems, like the Cortex-A9. We really depend on hardware virtualization support, and the A9s didn't have hardware virtualization support and can't get it anymore. So Cortex-A7, Cortex-A15 and anything more recent — that's basically the limit, it's simply a requirement. No A9, no plan. Conceptually you could use the secure world, the monitor mode, but that is a very restrictive thing to do, because it's one-way protection only — it's only one cell, or two cells at most — so I don't think we will ever support this, simply because time is moving on. And the hardware on 64-bit is way better now than on 32-bit: it usually even has an SMMU, while many of the 32-bit systems we have come with no IOMMU, no SMMU. The newer ones usually have an SMMU, so we will see more support on that side; it's a matter of time, things are moving on. One of the next interesting platforms, for example, is the Xilinx Zynq UltraScale+ MPSoC: while the old Zynq has a dual-core A9 and can't be enabled, the UltraScale+ has an A53 and also an SMMU and all the nice things — and it's under my desk, I just need some time to hack on it, and it will probably run soon on this as well. That depends on which projects we are currently working on. It's about two to three people, but not full time, and then we have external support: Huawei at the time had quite a few engineers working on this, and hopefully they will continue. So there is no fixed team doing this all day, but depending on the topics to address — over the past months a lot of work has been done on the ARM side, both inside Siemens and with external support, and that pushed us forward, and we will continue to work on this. The other thing we are actually working on is a non-technical part: the certification of the whole thing for safety scenarios. So from that point of view we are also working on this continuously; there will be some impact on the code base, to fix
some issues and improve some code, to make it more reviewable, more analyzable — so there will be continuous work on this. Upstreaming? Yes, eventually, but not yet. One thing still missing is functional: we want the use case of running Linux beside Linux to work smoothly, with everything required for it, and the last thing technically required is the shared memory network device. Once this is settled and we have all the patches ready, we will go and discuss this upstream. The guest support — the non-root cell support, so to say — may go in earlier, if we are not pushed back. The other thing is really upstreaming the whole subsystem, Jailhouse as a subsystem of the Linux kernel. I personally would like to have that eventually, one day. But we would first like to sort out the certification aspects of it — how much we have to modify the code to make it easily certifiable for safety scenarios. It's easier to discuss this once we basically know where we may have to adjust our development processes, and what the implications of doing this in the kernel community will be; there will be some implications, and probably not all will be useful for all scenarios, and you need that knowledge before you push it upstream. But I think there are a lot of use cases now for Jailhouse without any safety in mind, and that also makes it interesting for upstream: partitioning of large machines, for example, or someone who recently approached me and said, okay, I want to do some robotic control with PREEMPT_RT on one part, run an Ubuntu system on the other part, and I would really like to partition this thing — Jailhouse looks perfect for this. So with these other use cases in mind, in the background, I think we can eventually propose it for upstream. Okay — conceptually, why wasn't it possible to adapt KVM and QEMU, why does it have to be something new?
I don't have it on the slides, but it's code size. QEMU itself is, I think, about a million lines of code by now — at the very least a few hundred thousand; that code base is growing amazingly. And even if you exclude QEMU from your critical code path, the KVM part of the kernel — even just the relevant part — is a few hundred thousand lines of code. On x86 we are currently around 8,800 lines of code, on ARM around 7,000-something, and the goal is always, as we say, about 10,000 lines of code per architecture. That is palatable for a safety scenario; it is not palatable if you go in with a million lines of code. There was also a question in the back about the certification process: yes, it's in process. We already discussed the first concept with the TÜV, and we are continuously discussing it — specifically the approach we want to take in the certification process, the formal method to choose, and the arguments we want to make. You don't want to do this afterwards, otherwise you will lose time and money. It looks quite good so far, but of course, in the end it's only proven when it's done — when you really go through the certification process with a complete system, and that takes quite a while in this domain. So we can't really promise to have a result by next year; it may take longer. Do we have to leave the room? Yeah, here? Oh yeah, okay. Thank you.