So, welcome everyone. I'm just briefly here to warm up the stage; most of the talk will be held by Ralf. It's about building a mixed-criticality Linux system with the Jailhouse hypervisor, and I'm here because I kind of created this hypervisor. So I'll give you a brief introduction to what this is about and then hand over to Ralf to show what he was building out of it. The introduction and the current status, that's the part I'm going to begin with, and then we look into the further details.

The motivation to start this project is that we now have SMP systems all around the world, specifically in embedded devices as well. That opens up some chances in our domain: where we used to have a lot of separate devices doing single tasks on single-core processors, these tasks can now be consolidated on a bigger machine. Many of these devices have Linux on them, but not all of them, and not all of our workloads are Linux-based. There are a lot of legacy software stacks around which expect a single core, or which expect bare metal. Then there is the area of safety-critical systems; we will hear more about that later on. And there are also workloads which run DSP-like, where you really want to squeeze the last cycles out of your hardware, which these days means multicore high-performance processors.

So this is the setup. In order to avoid running, like on the picture, Linux bare metal alongside some other workload without proper isolation, which is technically possible but not really clean, we created the hypervisor Jailhouse, which is a little different from the other hypervisors out there. First of all, it only targets static partitioning of the system; that's our main purpose. We use hardware virtualization for this, that's the technology. But there are things we don't do in comparison to other hypervisors.
We don't schedule. There are no CPU cores getting shared; there is a one-to-one assignment of devices and resources to one guest, so to say. Another major difference is that we split up an already running Linux system, as shown on the graphics: we let Linux boot the system completely, then we load the hypervisor underneath and let the hypervisor split up the hardware into the pieces that we need for running additional workloads. So we basically shrink Linux and extend the system for other use cases. All in all, it emphasizes simplicity over features.

Just a brief status of where we are right now. In January we released version 0.6, so it's not quite 1.0 yet, there's still some bit to go, but we have made major steps forward already. We merged the ARMv8 support, by that time finally official, nicely contributed by Huawei. Along with this we reworked the ARMv7 support. We now have a nice way to communicate between the guests, which we obviously need, via shared-memory devices, and on top of this we have virtual networks available. We now also have support, well, I showed you the graphics where we run something other than Linux besides Linux, but of course we can now also run Linux besides Linux. So we can partition large Linux systems into many smaller Linux systems; we will see later on why this is useful. And we have integrated more support for interesting technologies like CAT, the Cache Allocation Technology of Intel, which allows us to control in a more fine-grained way which latencies we can get from the hardware, and more and more support is becoming available these days.

What's coming next? A brief look into the future.
Well, we want to further advance the shared-memory devices. That's more of a detailed technical requirement, but in the end it will help with one of the final goals for Jailhouse: certification. Furthermore, we are most likely participating in Google Summer of Code; we have two projects where we are hosted co-located, one with QEMU and the other one with [unintelligible]. The links to the idea proposals are up there, and we have already received some feedback from students interested in these topics. And last but not least, we are heading for safety certification of Jailhouse. Well, not of Jailhouse alone; we think it is only possible to certify a whole system. But first of all we have to look into what Jailhouse is able to contribute here, and that looks quite good so far. Now the question is what the hardware can deliver to us, because we are building on hardware, so we can't really say the problem is solved just by solving it in Jailhouse. We have to look at both parts, and that's actually the interesting part. So stay tuned; maybe we'll hear more about this later on. Now I'm handing over to Ralf and his interesting activities on this.

Thank you very much for the introduction. Now I'm going to explain how we can use Jailhouse to build mixed-criticality systems. Mixed-criticality systems are usually systems where you run payloads of different levels of criticality, critical and uncritical payloads. A typical candidate for such mixed-criticality systems is a car, where for example the engine control has a higher criticality than the instrument control. Currently those control units are located on separate physical systems, and with the use of Jailhouse we could think of consolidating them back together onto one single hardware unit. To demonstrate and prove the suitability of Jailhouse for mixed-criticality systems, we implemented a demonstration platform.
We call it the Jailhouse copter. It is a typical mixed-criticality environment: we have real-time requirements there, it is a safety-critical environment because the flight stack that is controlling this device is critical for us, and we have the typical requirements that we have in any industrial product, like reliability, robustness and maintainability of the whole system. And we want to show that it is feasible, with reasonable effort, to port an existing payload application to run as a Jailhouse guest.

Let's have a look at the classical approach to how those technologies are implemented at the moment. You'd usually have a hardware control unit where you typically run a minimalistic real-time operating system. In this case the controller unit runs the NuttX real-time operating system, and on top of it the ArduPilot flight stack, an open-source flight stack. This device acquires sensor values at a high sample rate, values from gyroscopes, compasses, accelerometers, GPS or the remote controller, and it uses those values to calculate new settings for the motors. This is the flight-critical part of the system.

Then you have the mission-critical part of the system, where you typically use bigger, stronger hardware that is able to run Linux, so that you can benefit from the whole Linux ecosystem, those millions of lines of code. For instance, you want to run an OpenCV application to survey the landscape or to follow some waypoints, then you use some Python application to process those data, and in the end you present them using some web server application, for example. So we have the critical world and the uncritical world, and there is some kind of communication channel between those two worlds, for example to dump logging
data, for example, or to send new setpoints to the flight control.

So now we want to consolidate those two worlds into separate, isolated domains running on one single hardware unit, and we do this with Jailhouse. Jailhouse is going to ensure that those two domains will not interfere in an unacceptable way.

In the beginning we had to make some architectural decisions, and we chose to use an octocopter frame. This is from the MikroKopter project. It is originally driven by an AVR microcontroller, but as you probably know, an AVR is not a candidate for running Linux on top of it, so we dropped the original hardware and put our own hardware on top. It consists of an Emlid Navio2; this is just a sensor board, with GPS, accelerometers and compasses, so everything we need is on that board. And then there is the core of the system, an NVIDIA Jetson TK1. This is a quad-core ARMv7 32-bit CPU and it comes with two gigabytes of main memory. It's not the latest model, it was released in the middle of 2014, but it has a feature-rich expansion header, so it exposes everything that we need to the outer world, such as SPI, I2C, UART and many GPIOs. What is especially important for us is that it has the ARM virtualization extensions that we need to run Jailhouse on top of it, so it is Jailhouse-enabled. And what is pretty nice about this board is that you can run a latest mainline Linux on it.

We want to split up our system in the following way. On the left side you can see the critical world, where we run Jailhouse, so we divide our hardware in the middle. We assign two CPUs and all the critical devices that we need: we need an I2C device for controlling the motors, the barometer and the RC decoder, and we assign an SPI device to the critical cell to get values from the gyroscopes, the accelerometers, GPS and
compasses. Then we assign a GPIO device for driving some LEDs that indicate the current flight status. On top of it we run a PREEMPT_RT-patched Linux kernel, and on top of that Linux kernel, as the user-space application, we run the ArduPilot flight stack. On the rest of the system you can run any payload, and it is guaranteed that it will not interfere in an unacceptable way with the critical world.

Our approach was as follows. We first wanted to develop the application without Jailhouse, on bare metal, so we did everything that you would have to do in any case. We had to patch our kernel, so we first applied the PREEMPT_RT patch stack; later we modified some device drivers to meet our requirements, and then we ported the ArduPilot flight stack. We had to implement some motor drivers that were not supported before, and we had to implement battery sensing, because our hardware was not supported at this point. This is the stuff that you would have to do in any case. Later on we enabled Jailhouse, moved the critical software to the critical cell, and added some uncritical payload running in the rest of the system.

What was given is that Jailhouse already supports running unpatched mainline Linux on ARM, and it runs a PREEMPT_RT-patched kernel as well, as I already said. We provide tiny, tailored device trees to describe the configuration of the hardware. If you think of a typical device tree of a platform, which is pretty huge: we really only describe the parts that we are actually using, so we describe the CPUs, the memory region and the few devices that are assigned. That's actually all that we define there. The user land, the ArduPilot application, is provided as an initial RAM disk, so we run everything in memory; we have no solid-state storage there. And an inter-cell network driver, implemented by Mons (I think he's here, hello Mons), acts as a
communication channel between the critical and the uncritical world. For development this is very useful, because we can simply SSH into our critical cell and run all our development stuff there.

So this was already given. Now we want to assign devices to the critical cell, and this is the part where it got interesting. If we look at the device tree entry of a device, we see that it consists of a memory region, like this one here. Jailhouse already supports remapping that memory region to our critical cell, and if this memory region were aligned to the page size, there would be no need for the hypervisor to trap, no need to intercept. Otherwise we need to intercept, and in this case we do, because the size here is not aligned to the page size. Jailhouse also supports re-injecting interrupts: this device uses the shared peripheral interrupt 38, and we can simply state this in our Jailhouse cell configuration.
Okay, please remap this interrupt: it shall arrive in that cell. And with a little bit of overhead, the interrupt will eventually arrive in that cell.

But then we have those three guys left: clocks, resets and DMA. So let's talk about clocks and resets. Peripheral devices are typically driven by clocks, and device drivers typically gate clocks when the device is not in use. This is mainly done for power-saving reasons, which is especially important for battery-driven environments. Clock drivers also support selecting different speeds or baud rates for the device, and the reset lines help us assert or de-assert resets to the devices, to bring them back to their initial state. This is what the driver code typically looks like: we first enable the clock, then assert and de-assert the reset line, do all the driving stuff here, and later, when we are finished, we disable the clock again.

Those clock and reset controllers are typically organized as memory-mapped I/O regions. So this is one memory region, and every bit there describes the gate of a different clock. The problem is that this memory region controls the whole platform, our whole platform, so this is really hard to partition.

In our context we have real-time requirements and no low-power requirements. So we first thought: okay, let's statically ungate our clocks before we start our cell. We do not want to dynamically change the speed or baud rate of the device, so there is actually no need for clocks at all; let's just not define clocks in our cell and in our device tree. But it turned out that we must not ignore clocks, because device drivers make the strong assumption that clocks are always available, and they want to be aware of the state of the clock. The same is true for reset lines.
So we must not ignore reset lines either; the driver will complain if we are not providing the reset line. So there is no way out: we have to somehow paravirtualize our clock and reset controller. We did this by providing a Jailhouse clock and reset controller to the critical cell. The hypervisor will trap when we access that memory region and decide whether this cell is allowed to gate or ungate the clock or not. The same holds for the Linux that was running before, the uncritical cell: it will also trap, so that the uncritical cell cannot gate or ungate a clock of the critical part, and vice versa. The problem here is that we must define access bitmaps on a bit-granular level, and this is really tedious and exhausting, so we are still thinking about better ways of defining that.

So now we have the clocks and resets defined in our cell; then there is one thing missing, the DMA controllers. Again, in the context of real-time systems, where latency usually matters more than high throughput, we just do not need DMA. So our idea was: okay, let's simply not provide the DMA channels and try to get rid of them. But here again the drivers make the assumption that a DMA channel is available, even if they do not use it. So we would have to patch the driver, and we didn't want to patch the driver. Fortunately, on this platform we were able to exclusively assign the DMA controller to our critical cell, because there was no other device using the DMA controller in the uncritical part.

So now we have a real-time operating system running inside Jailhouse.
We have MMIO-based devices assigned to the critical cell, we have a virtual clock interface, and we have a Jailhouse-independent execution environment. So we can run the same application inside Jailhouse that we were running before without Jailhouse, and this makes it easy for us to port existing legacy applications.

While the uncritical cell is under full load, we start our platform and we are ready to fly, and I would like to show you a short video. Let me go back to the slide; you're asking about this one. So this is part of the critical cell. This is not really part of the hypervisor; it is something we have to provide to Linux. Before booting Linux we have to put it into memory, and we have to define the same memory regions and interrupts in our Jailhouse cell configuration.

So this is now our platform starting. It might still look a bit clumsy, but that's not because of Jailhouse, it's because of my bad flying skills.

So now we got it running, but let's talk about the performance of Jailhouse when running a real-time application inside it. As I already said, we are re-injecting interrupts: if an interrupt arrives, it first arrives in hypervisor context and then we re-inject it into our cell. We wanted to know the latency that is introduced by Jailhouse, so we made a measurement. We externally toggle a GPIO, wait until the interrupt arrives, then toggle another GPIO, and measure the latency between those two events.
So the green line here is our stimulus and the blue line is the answer. If we are not running Jailhouse, and we run a bare-metal application whose only task is to answer by toggling the other GPIO, then we get a bare-metal latency of 420 nanoseconds, and this is the bare minimum of the hardware. So what happens if we now enable our hypervisor? Of course we get a performance drop, and we go from about 440 nanoseconds to 1.2 microseconds. Yes, please? Say it again? No, this is really a bare-metal application: we toggle the GPIO in assembly inside the interrupt service routine, so this is the bare-metal minimum that we can get on this platform. And we let the same application run inside Jailhouse, and then we get 1.2 microseconds. So Jailhouse introduces an overhead of about 800 nanoseconds on average on this platform, and at maximum, which is what counts, we get 2.8 microseconds, which is fairly okay for our requirements.

Now it gets interesting: what happens if we stress the other CPUs in the uncritical context? Because the system bus is still shared, we get a performance drop: on average we drop to about 1.4 microseconds, and to 7.5 microseconds at maximum. This measurement was done at a frequency of 50 Hertz and over four hours. It was a stress test where we stressed memory, I/O and CPUs.
Yeah, just keeping the bus under load, and keeping I/O under load as well.

Then, as I already told you, the page size is the minimum granularity if we want to do paging. So if we want to assign a memory region that is smaller than the page size, we have to trap into the hypervisor and decide whether this guest is allowed to access the memory region or not. We made a simple measurement where we used the ARM performance monitor counters to see what kind of overhead is introduced when we have to dispatch those accesses. If we assign a full page, so if we don't use subpaging, then, including the measurement overhead, we have eight cycles, and it doesn't matter whether we run Jailhouse or not. If we then put the rest under load, so the system bus is shared, we have about 217 cycles for a memory access. Now it gets interesting when we enable subpaging: then we get a performance drop to about 136 cycles per access, for getting the response through the hypervisor, and a maximum of about 1000 cycles. If we again put load on the rest of the CPUs, we get another performance drop, to about 3000 cycles. This is still okay in our case, but it is a performance drop that would actually be preventable if hardware vendors placed every device on a separate page.

And this brings me to the next point: the requirements on hardware for hardware partitioning. Memory-mapped I/O: if we look at the I/O address space of our TK1, we can see that there are several different I2C devices located on the same memory page, and even worse, we have devices of different functionalities on the same page here. So subpaging leads to performance impacts. And actually this platform has two gigabytes of
memory, so we have another two gigabytes of address space left for placing devices in the I/O address space. This would give us enough space for placing 52,000 devices, even if we make the pessimistic assumption that we need 10 pages per device. So please: it would be nice if every device were located on a separate page.

The next point is the clock and reset controllers. If it is possible, it would be nice if the clock and reset lines of a device were located in the device's physical memory space, and not squashed together into one single contiguous memory region that controls the whole platform. This would really help us to partition the hardware, because it makes the clock and reset controller partitionable for us. Otherwise, again, we need costly paravirtualization.

And direct memory access: as I already said, latency matters more to us than throughput, and we don't want to use DMA channels. So if you develop device drivers, please keep in mind that someone might want to partition the hardware, and please allow the absence of DMA channels if they are not really necessary for the driver. Otherwise, again, we need costly paravirtualization.

And then I stumbled over an interesting bug: hardware misbehaves. When implementing our platform, I realized that if you touch a physical I/O memory region while its clock is gated, the whole platform will immediately freeze and stand still. No kernel panic,
nothing. And I sent this report to the Tegra mailing list, and they said: yeah, well, we are sorry, this is the way the hardware behaves. And another developer from NVIDIA added that it applies to the whole architecture before a certain generation; in the latest version it got fixed. But this bug is known, and we need to be aware of it: when someone gates a clock, we need to hide the corresponding physical memory pages from that guest, so that we can prevent the whole platform from standing still, because an uncritical cell must not influence a critical one.

And this brings me to the conclusion. In the future we have to pay more attention to hardware/software co-design. At the end of the day, it's the hardware that our software runs on, and it's the software that runs on the hardware, and every software-based workaround leads to latency impacts and more overhead. But still, we were able to show a solid approach for implementing real-time, safety-critical, mixed-criticality applications with the Jailhouse hypervisor. So thank you very much, and if you have any questions, do not hesitate to ask me now. Yes, please?

It supports arbitrary partitioning, but you have to assign at least one CPU to a cell. So the minimum configuration of a cell is one CPU and a region of memory; that's the bare minimum.

Yes, sure. So there's no limitation. Under ideal conditions there is no influence between those cells, so you can actually run any criticality on top of it. Yes, there is influence, but you have deadlines, and if you know that you can meet those deadlines, that's okay. And yes, there is some overhead, but you can keep it as minimal as possible, and you can keep it deterministically minimal.

Yeah, so the question is where the influences, the interference, come from.
So caches are one reason, and as I mentioned before, on Intel for example we now have the cache partitioning mechanism available in hardware; that will probably spread to other architectures as well in the future. That's one reason, but of course there is more going on inside the hardware. Some of it you see on the diagrams as blocks, okay, understandable, but there are also channels that you don't see, because they are well hidden and maybe not documented. And this is very important, as I said before: if you really want to go for a critical system, certified for a certain functionality, for a certain behavior, you have to include these kinds of channels in your considerations as well. Either you argue: okay, there is influence, there's always influence, but it doesn't matter, because the deadline is still fulfilled even if that influence is maximally exposed. But you have to know this; you have to detect it if something goes completely wrong. Or you argue: okay, we know because of the architecture that this can't happen. That's basically the whole story. You can't really do this in software alone; you have to co-design it with the hardware for these kinds of scenarios.

Okay, yeah, we saw one case today. It's about the clocks, the gates and the reset lines, where you basically have one resource controlling the whole platform, or at least multiple devices.
Another case is power control. You have this already on the core level on some devices, on some architectures: if you go down with the power level of one core, the other one may also go down, or at least have a higher latency. So you have these kinds of interference channels, and you have to identify them. If the hardware is not capable of allowing some kind of control, then you may have lost. There's no general answer to this, except that you have to look at the individual cases.

Okay, oh yeah. So this is the kind of scenario of a shared bus: you have multiple devices on the bus, and only one cell can control it. There are always cases where this static partitioning doesn't work. Well, you can always fall back to doing something in software, to split up this kind of interrelationship in software, but this is not our first step. It's the last resort on a platform which is otherwise nice but maybe has some shortcoming in this regard. Then you may consider a software solution, but this is always a trade-off between how much effort you spend on the software and how much this costs you, for certain certification efforts, or simply in maintenance efforts.

The whole hardware?
Yeah, okay, so this is not the model we have here. Wherever possible, we hand out the hardware directly to the guests. The hypervisor is in charge of controlling the CPUs with respect to the privileges that you need to give to the guests, or not give to them. It's in charge of certain interrupt aspects, because in some cases you can route interrupts directly, but in many cases today the hypervisor is in charge of redirecting them, and that's one task for the hypervisor. And then the page tables: the memory accesses that the guests issue, and also the memory accesses that DMA-capable devices issue, are controlled by the hypervisor. In this particular case, the TK1, we don't have the SMMU support implemented yet, but it would be possible; we have it on x86. So this is another task for the hypervisor. And then there are some minor details, I would say the specialities of your architecture, the specialities of your board, where you have to add something. But in general we want to keep it very simple. So these are the major aspects, the ones I mentioned already, and they really hold across architectures: the concepts apply across architectures. That doesn't make the code generic, but at least the concepts are.

Well, this is not rocket science, so to say. There are some minor details which are new, but you mentioned one, and there are dozens out there implementing this on ARM, implementing it on x86, maybe on both, maybe on something else. Even in the embedded space you have dozens of them. So I have heard the stories from automotive hypervisors.
There are many of them, and they all have their pros and cons. But there are not that many, if you look at the open-source market for example, with such a small code base. In the case of ARM, I think we have some seven or eight thousand lines of code for this architecture to implement this logic. Right now it's not complete, but pretty close; on x86 it's basically the same situation. And as far as I know, all of the commercial ones do scheduling, and we do no scheduling at all. So this is really designed for this specific purpose, and we don't want to go beyond that. There are other solutions which can do more and which already exist, so that's not our goal.

Yes, both. Yes, of course. Because we do not want the critical cell to somehow interfere with the uncritical one, and vice versa: we do not want the uncritical one to interfere with the critical one. The question was whether we are intercepting every memory access on a shared page for both cells, just to repeat it for the recording. Yeah, correct.

So the question was what happens if the critical cell accesses something outside its scope. Well, we don't just assume this, we can actually enforce it. We can apply the same enforcement, the same containment of memory accesses, of the CPUs, of whatever resources you hand out, on the critical as well as on the uncritical cell.
So there's no privilege right now in this regard for the critical cells. There are other privileges, but this is not in scope, because there are also scenarios where you have two critical cells basically monitoring each other. If you now start dropping the access control for these cells, you can no longer argue that these two mutually monitoring cells are independent of each other: they could actually override each other, and then you have basically lost your argument. Not trusted, no, at least not from the hypervisor's point of view.

Sorry, I didn't get the last one. Okay, yeah. So the question is what mechanisms we apply on the shared-memory communication channel to ensure isolation. Right now, the model which is in the release, the 0.6 release, is basically a symmetrical read-write memory region: both sides see the same memory region physically and can happily interfere with each other on this memory region. That's the current model implemented. But we also have a staging tree where we implement two independent channels which are unidirectional: one side can read and write on a channel, while the other side can only read from it. That of course makes the argumentation a bit easier for protocols built on top, that they are not really interfering. Logically it is of course still an interference channel, so you have to apply certain additional measures in the software built on top, so that it does not violate the protocol you want to implement. That's the basic model. What Mons is now implementing on top of this device is basically a Linux driver providing a virtual network device on both ends.
And that virtual network driver is what we are running here. It works both over the bidirectional channels and over the unidirectional ones. But you have the same problem if you have an architecture with separate physical devices: even then there is some communication channel between those two devices, and you have to be aware of that.

The question is how to port it to a different architecture. Well, if you look at the README, we have some architectural requirements on the target platform. We require, for example, the ARM virtualization extensions; on Intel, VT-x and VT-d; and on AMD, the corresponding features. That's basically the fundamental requirement, so currently we are in the x86 and the ARM world with the support. If you go for a different SoC or a different board, these days, on ARM at least, it's pretty simple: on ARMv8, for example, you basically just have to describe the hardware, and if you're unlucky you maybe also have to implement, mostly for debugging purposes, a driver for the UART, so that the console of the hypervisor is dumped on a physical device and you can see what's happening if something is not working initially. It's the normal case that bring-up is not smooth and you have to tune a little bit, but beyond that the dependencies are very minimal.
So there are currently, at this stage, no further hardware dependencies. We don't have drivers currently, except for the console drivers, because we haven't solved the clock topic yet; that may lead to drivers, or, if we're lucky, to a more generic access-control pattern, but there might be some need. On x86, for example, we have a driver, so to say, for the interrupt controller: it interprets the accesses and then decides whether an access is allowed or not.

No, we don't expect too much from the pre-configuration of the hardware. We just expect that the current user of the hardware, Linux, is not further configuring it. Basically, we take over control of the hardware in the hypervisor, we freeze the state, and we only intercept or control the accesses which we really need to have, or which must remain dynamic, during runtime. So there's no further assumption on this. But of course, when we look into the details later on, I suppose there will be further checks required and further validation of the platform state needed before really declaring it to be safely partitioned. That's work in progress, and it depends of course also on the information that we are just starting to collect from the hardware vendors in this regard.

No further questions? Then thank you.