Hello and welcome. I'm Jan, and I work for the Open Source Automation Development Lab (OSADL). We are a German cooperative, basically a non-profit organization, and we support our members in using open source in their products. That covers many different aspects, and one important aspect for our members is real-time, specifically with PREEMPT_RT, and this is basically what I'm going to talk about within the next, let's say, 40 minutes. Today I've picked a very special context for using PREEMPT_RT, because another topic we've seen approaching the embedded industry market is virtualization. There are different technologies when it comes to virtualization, but we really see this coming to the industrial market, and since real-time is usually an important requirement there, we had to get a good understanding of how virtualization and real-time with PREEMPT_RT can work together.

So let's have a brief look at what I'm going to talk about. This presentation basically has three parts. We're going to get started with containerization, so we'll look into Docker containers and how they behave with regard to real-time. I just took Docker as an example, but as you know, most container engines and technologies work on the same principles, so the knowledge shown here can be applied to any other container engine as well. In the second chapter we're going to look into hardware virtualization, specifically KVM and Jailhouse. Last but not least, when we look into virtualization, separation is usually a very big topic, and therefore I would also like to have a look at how shared hardware resources, specifically a shared level 2 cache, can have an impact on the real-time behavior and on the influence between the host and the guest operating system.

At the end of the day, for all the technologies we're going to talk about, I looked into three things. First, the real-time behavior on the host: does the mere fact that you run a few guest operating systems on a system influence the real-time behavior on the host? Second, is it possible to have real-time behavior within the guest system or systems? And third, as already mentioned, what is the level of isolation? Once again, this is pretty important to look into, because when people start to look into virtualization, in many cases the motivation is to get a level of separation or isolation between different parts of their system, and therefore it was crucial to look into that as well.

Well, we're going to start with containerization, and before we dig a bit deeper into my test setup, let's briefly recall what PREEMPT_RT does, as a very brief and simplified overview for those of you who haven't worked with PREEMPT_RT so far. PREEMPT_RT is a so-called single-kernel approach, which means it makes Linux itself capable of real-time, basically by introducing an additional preemption model: the fully preemptible kernel.
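To make the single-kernel idea tangible, here is a minimal, hedged sketch of how one would typically check for a PREEMPT_RT kernel and give an ordinary process a real-time priority; the /sys/kernel/realtime node is added by the PREEMPT_RT patch, and ./my_rt_app is a hypothetical application of mine, not something from the talk:

```sh
# Check whether the running kernel is a PREEMPT_RT kernel
uname -v | grep -i 'preempt.rt' && echo "looks like an RT kernel"

# PREEMPT_RT-patched kernels additionally expose this node (prints 1)
cat /sys/kernel/realtime 2>/dev/null

# A real-time task is just a normal Linux task: an ordinary
# program gets SCHED_FIFO priority 80 with a standard tool
chrt -f 80 ./my_rt_app
```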
That means that a real-time task, in comparison to other real-time solutions for Linux, is just a standard Linux task. Basically you need nothing special: you work with the standard C library, the standard tools, the standard POSIX API. It's a standard Linux task; you just need to follow a few rules. Now, since a real-time task is just another Linux task, let's look into the topic of containers and real-time: a container, in a simplified view, is also just a set of processes running in an isolated environment. So, since a real-time task is just a normal Linux task, any process inside a container can have real-time priority; there's no problem with that.

And this is basically what we did to evaluate the real-time behavior. We took one system out of our so-called QA Farm, where we do latency monitoring on roughly 200 boards. You can actually access these systems: go to osadl.org, go to the QA Farm real-time pages, pick rack 0, slot 2, and you'll find the test hardware. On this hardware I put a Docker container, and on the host and in the container I ran the real-time smoke-test tool cyclictest to evaluate the real-time behavior. If you want to learn more about the load scenarios we're using and what the tests really look like, just go to osadl.org and check the QA Farm pages for slightly more details.

If you haven't worked with cyclictest before, to give you a rough idea of how you can use it to evaluate the real-time behavior of your system: cyclictest is, at the end of the day, a pretty simple tool that gives you the possibility to start a given number of threads that wake up at a given interval, say every 200 microseconds. Every time one of these measurement threads wakes up, it takes the current system time and compares it to the time when it wanted to wake up; the delta is the wake-up latency. This latency is continuously reported to a master process that monitors the worst-case latency for you, can create a nice histogram, and so on. That's pretty much what cyclictest does: set up a given number of threads that wake up at a given interval, and this is what we used. It's a simple command-line tool with just a few parameters, like the wake-up interval, the priority, the number of threads, and which CPUs to run on. Most importantly, it tells you the worst-case latency reached over the runtime, which is basically the number you're interested in: the worst case you see under different load scenarios.
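As a concrete illustration, a typical cyclictest invocation could look like the following; the exact parameters here are my own choice for illustration, not the ones used in the QA Farm runs:

```sh
# 4 measurement threads at SCHED_FIFO priority 99, waking every
# 200 us, pinned to CPUs 4-7; -m locks memory, -q prints only a
# final summary
cyclictest -m -q -p 99 -i 200 -t 4 -a 4-7
# In the output, "Max:" per thread is the observed worst-case
# wake-up latency in microseconds - the number you care about.
```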
Before we dig a bit deeper into my test setup: it might sound simple, but to understand the tests we need to remember a few key principles of setting up a real-time system. First of all, and once again this might sound pretty simple, but it is crucial to understand and crucial to follow: if I want to guarantee real-time behavior, the corresponding real-time task needs, during its entire runtime, the highest priority of all tasks running on the same core, because otherwise it might get interrupted, and that might violate its deadline. If we want to optimize things a bit further, we can also restrict the task to run on a particular core, and we can even isolate that core from additional noise to improve the latencies. These are two important points, but number one is the most important one, and these general principles also apply when we talk about real-time in virtual environments. The main message here is that I cannot just blindly deploy virtual machines with real-time workloads and assume it works because I have a real-time kernel; I need the overall picture, and I have to make sure that I do not overcommit the system.

This can be shown quite simply with cyclictest. I know this might look a bit silly, but it demonstrates the principle. In this case I'm starting 12 tasks with an interval of 100 microseconds, all running on the same core with decreasing priority. As you can see, with the increasing number of tasks and the decreasing priority, the reported worst-case latency gets worse and worse. What's the reason? The lower a task's priority, the higher the chance that it gets interrupted and violates its deadline. What we did here is heavily overcommit the system, and it just doesn't work. Once again: you really need to think about the design of your real-time system; randomly deploying workloads and enabling real-time priorities just doesn't work, and the same applies in a virtualized environment. What's the fix in this situation? I can simply ask cyclictest to run each task on a separate CPU, and I'm back to the key principles: each task is the highest-priority task on its core during its entire runtime, it won't get interrupted, and it can meet its deadlines. That's pretty much it, and this is what we also applied to our test setup; a sketch of both runs follows below.
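A minimal sketch of the two scenarios, with parameters chosen for illustration (CPU 3 and the thread counts are my assumptions):

```sh
# Overcommitted: 12 threads, all pinned to CPU 3, waking every
# 100 us; without --smp, each further thread gets a priority one
# lower than the previous one, so the low-priority threads are
# constantly preempted and their worst case explodes
cyclictest -m -i 100 -d 0 -t 12 -p 99 -a 3

# The fix: one thread per CPU at the same priority, so every
# thread is the highest-priority task on its core
cyclictest -m -i 100 -p 99 --smp
```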
To zoom in on the test setup I showed you: it's an eight-core machine, and as you can see, I reserved four cores on the host to run cyclictest, and another four cores within the container to run cyclictest there as well. When running cyclictest we made sure the threads run on dedicated CPUs and that there's no overlap between host and guest; once again, that's simply following the key principles of designing a real-time system. So we had a proper partitioning, and for all technologies I'm showing in this presentation we applied this key principle of properly partitioning the system.

Now you might be interested in the results we reached. This is the latency histogram of the cyclictest run on the Docker host, and it actually looks pretty nice: we saw a worst case of 30 microseconds. All the tests always ran in parallel, so cyclictest was running in the container and on the host at the same time. What we can tell from this is that the mere fact that some real-time load runs in the container doesn't have any influence on the host; there's no interference here, which is already good news. Let's look at the guest: as we can see, it looks pretty much the same. We were at 26 microseconds, which is in the same area as we reached on the host, and the distribution of the latencies also looks pretty much the same. At the end of the day that's not surprising, because a container is just a bunch of processes running in an isolated environment, so there's no root cause for additional latencies, and these measurements confirm that.

These measurements ran under heavy CPU load. In the QA Farm we use different load scenarios, because sometimes latencies are caused by load and sometimes by an idle system, when you hit some weird power-management issue, enter C-states, or whatever. Therefore we also have idle measurements, and here we actually messed up a little detail in the configuration, an experience I wanted to share with you. This is the idle measurement on the host, which is in the same area, 29 microseconds; even the distribution looks pretty much the same as under heavy load. Then we looked at the measurements in the idle Docker container and, oops, that looks quite different, and we were really surprised about what happened here. To zoom in for better reading: we hit 93 microseconds in the container, compared to twenty-something on the host, and at first that doesn't seem to make any sense. Why should the idle scenario have an influence in the container?

The thing you need to be careful about is that the container needs the appropriate privileges, and it needs access to the required system resources. The configuration is not hard, but it's a bit tricky, and you need to take care of a few things. What happened here is actually pretty simple: the container didn't have access to a specific device node which is used to disable the C-states. So on the host measurement C-states were disabled, but the container didn't have access to this device node, so there C-states were not disabled, and at the end of the day the measurements were simply not comparable: those worst-case latencies were triggered just by coming back from a C-state. We fixed that in the setup, and as you can see, with the fixed idle measurement we were back to the native host latency. It's pretty simple and nothing unexpected, but nice to see that it's actually quite easy to run a real-time application within a container. So these numbers are already quite good news; the sketch below shows the kind of container configuration this requires.
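As a hedged illustration of such a container setup (the CPU set, the priority limit, and the image name are assumptions of mine; the device node cyclictest uses to disable C-states is typically /dev/cpu_dma_latency):

```sh
# Run a container whose processes may use real-time priorities,
# pinned to dedicated CPUs, with access to the PM QoS device node
docker run -it \
    --cpuset-cpus=4-7 \
    --cap-add=SYS_NICE \
    --ulimit rtprio=99 \
    --device=/dev/cpu_dma_latency \
    my-rt-image cyclictest -m -q -p 99 -i 200 --smp
```

Depending on the kernel configuration (real-time group scheduling), Docker's --cpu-rt-runtime option may additionally be needed before real-time scheduling works inside the container.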
The next thing I told you I wanted to look into is the level of isolation: is there any possibility that the host system could influence the real-time behavior of the guest, and the other way around? Just think of a misbehaving, buggy driver with locking issues, disabled interrupts, whatever. This is something we wanted to emulate with a driver called blocksys; we basically wrote blocksys to break things, and that's pretty much what it does. At the end of the day it's a very small driver that just blocks the system for a given number of cycles by disabling preemption and local IRQ processing. It basically introduces artificial latencies into a system so you can monitor the behavior. And this is what we did in this setup: we used blocksys to emulate a misbehaving driver, and we introduced this blocksys scenario on the host while the measurements were running on both systems. It's quite obvious: if I load that driver and block the system, and you can see we were blocking for roughly four milliseconds, it just breaks the system completely. Now you can take a wild guess what happens to the guest system. We know how containers work: I introduced the noise on the host just by loading the driver, and I think it's obvious that this disturbance from the host goes straight through to the container, because they share the same operating system kernel. If I block that kernel, it has a clear influence on the host and the guest. So the recap would be: you can have native host latencies in the guest, but the level of separation has its limitations in the container setup. That would be the brief summary.

Moving on to the next topic, switching from containerization to hardware virtualization: one obvious technology we had to look into together with PREEMPT_RT was KVM. I did these measurements on an x86 platform, basically a Core i7 six-core machine; I reserved three cores for the host and three for the guest. There are a few things you need to know about how to set this up; setting up KVM for real-time is not impossible, but it's a bit tricky. Since we have limited time, what I can recommend is what I've been using: the tuned profiles from Red Hat, realtime-virtual-host and realtime-virtual-guest. I took these as an inspiration for the recommended settings, so if you want to look into further details of what you can tweak for KVM, I would highly recommend looking into those tuned profiles. Most importantly, remember what I told you about proper partitioning of the system. What I had to do is isolate the cores for the guest system. There are a few ways to do that; I took the simple way of doing it on the kernel command line: isolating the CPUs, enabling nohz_full, and setting the default IRQ affinity so that interrupts are routed to the host CPUs by default and won't hit the guest CPUs. That's pretty much the main setup: trying to get as much noise as possible away from the guest operating system.

That's the first step. As a second step you need to configure QEMU so that you can control where the vCPUs run; you want to make sure that the vCPUs really run on the three cores we isolated for that purpose. You can easily do that with virt-manager in the CPU tuning section: there's a pinning you can do for the vCPUs, and you can also change the scheduling policy for a specific set of vCPUs. In this case I set it to SCHED_FIFO with priority 1. Be a bit careful with the RT priority here: I gave the vCPUs a real-time priority, but at a very low level, because a vCPU will run more than just the real-time load. If you picked a randomly high real-time priority, you would risk starving a housekeeping thread that still runs on one of the isolated CPUs. So the recommendation is basically: take SCHED_FIFO and put it at a low real-time priority; that's how people usually do it. These are the most important settings; a sketch follows below. As I mentioned, there are also a few KVM tweaks you can do on the host. KVM is really flexible, and there are so many use cases for it that it takes some time to figure out what you really need; once again, if you want to see what's really important, look at the tuned profiles from Red Hat, as they contain the most important settings.
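A minimal sketch of this host-side setup, assuming a six-core machine where CPUs 3-5 are reserved for a libvirt domain called rtguest (both the CPU split and the domain name are my assumptions):

```sh
# Host kernel command line (e.g. appended in the GRUB config):
# keep CPUs 3-5 free of scheduler noise, timer ticks and
# device interrupts
#   isolcpus=3-5 nohz_full=3-5 rcu_nocbs=3-5 irqaffinity=0-2

# Pin the vCPUs of the domain to the isolated cores
virsh vcpupin rtguest 0 3
virsh vcpupin rtguest 1 4
virsh vcpupin rtguest 2 5

# Give a vCPU thread SCHED_FIFO priority 1 (low on purpose, so
# housekeeping threads on the isolated cores are not starved)
chrt -f -p 1 <vcpu-thread-id>
```

The same pinning and scheduling policy can be expressed persistently in the libvirt domain XML via `<cputune>` with `<vcpupin>` and `<vcpusched scheduler='fifo' priority='1'/>`, which is essentially what virt-manager's CPU tuning section writes for you.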
That was pretty much the summary of the setup. The measurements were pretty much the same as before: we used cyclictest with exactly the same load scenarios and looked at the real-time behavior. Looking at the host latency, this machine also behaved quite nicely: a host worst case of 26 microseconds, with a pretty small distribution of the latencies. So here, too, we can say that the mere fact that we run guest operating systems, and run load in them, does not influence the real-time behavior on the host at all, which is already good news. And this is already a pretty common use case, because there are use cases in industry where you use the host system for real-time and just run some other operating system for your HMI, so you have the HMI and the real-time system on one controller. For this use case we could already show that this really works: just running the guest doesn't have any bad influence on the host system. Well, there are bad things you can do that you cannot avoid, but we'll talk about that later in the presentation.

So much for the host system. Before I switch to the next slide: don't be scared, it looks slightly worse, but give me some time and we'll have a closer look, because it's not as bad as it might seem at first glance. This is what I measured in several runs within the guest, but let's go over it one by one. You see the three vCPUs here. Most interestingly, vCPU 0 behaved quite well: I was really able to keep most of the noise away from that vCPU, and we hit a worst case in the area of 50 microseconds. That's slower than the host, but across many different measurements I didn't see any latency above those roughly 50 microseconds. What's obvious, though, is that I wasn't able to get rid of some noise on the other vCPUs, which caused some latencies there: vCPU 1 was at roughly 200 microseconds, and on vCPU 2, most interestingly, you see these three peaks, with a pattern at a distance of about 100 microseconds. That should be easy to trace down to a root cause, which I haven't done so far; it's most probably still a configuration issue. From experience, what I can tell is that it is possible to isolate the vCPUs and have real-time response in the guest, but you really need to make sure you keep the noise away from the vCPUs, and the configuration is a bit tricky.

Just to compare, there's another measurement where I actually messed up another detail: I did one test run where I forgot to enable nohz_full, and just look at the distribution of the latencies in the guest; it gets really wide on all vCPUs. That easily shows how important it is to get as much noise as possible away from the vCPUs. But it's not impossible. At the end of the day, the conclusion would be: with the proper configuration, and accepting a few limitations, you can have real-time response in the guest, but the configuration is a bit tricky.
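Since a single forgotten option like nohz_full visibly widens the latency distribution, it can help to sanity-check the isolation settings before measuring. A small hedged sketch; these procfs/sysfs nodes exist on current kernels, but paths may vary with kernel version:

```sh
# Did the boot parameters make it to the kernel?
cat /proc/cmdline

# Which CPUs does the kernel consider isolated / tick-free?
cat /sys/devices/system/cpu/isolated
cat /sys/devices/system/cpu/nohz_full

# Default IRQ affinity mask and current per-IRQ routing
cat /proc/irq/default_smp_affinity
cat /proc/interrupts
```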
The other topic we need to look into is isolation, exactly as I did with the Docker setup: blocking the system on the host and in the guest, imagining a misbehaving driver, emulating that behavior, and seeing what the influence could be. The first try seems obvious here too: if I block the system on the host, it clearly goes straight through to the host and the guest, because at the end of the day I'm completely blocking the hypervisor, so this goes straight through to the guest operating system. The remaining question is what happens with misbehavior in the guest operating system: does that make it through to the host? And this is actually not the case. Having this weird behavior in the guest does not influence the real-time latency on the host; even if the guest is really misbehaving, doing really bad things, the host keeps working. So at the end of the day the isolation is much stronger here: it's clear that if I block the hypervisor it breaks things for everyone, but if the guests misbehave, I have a very high level of isolation with KVM.

Talking about the level of isolation, there was another technology we really had to look into in this context: a project called Jailhouse. It's a partitioning hypervisor, GPLv2-only, originally written by Siemens and still maintained by Siemens. It basically supports x86 and Arm, and just like KVM it makes use of the virtualization features of modern CPUs. But it does a real hard partitioning of the system. Once again, time is a bit limited in this presentation; in the uploaded version of the slides there are a few additional slides for further reading on what Jailhouse can and cannot do, so I'm going to skip those right now. But I think I can explain a bit of how Jailhouse works with the setup I've picked.

I did the measurements on an i.MX8M Plus quad-core CPU, and then I started partitioning the system. What you basically do with Jailhouse is: first of all you boot a Linux system, and then you load the driver, jailhouse.ko, which kicks off the hypervisor, and the hypervisor takes over. The system you initially booted is what we call the root cell. The root cell is a completely working system; I chose to dedicate two CPUs to the root cell, and I equipped the root cell with a PREEMPT_RT-patched kernel. Out of the root cell you can now start to create the guest operating systems, which in Jailhouse are called inmates. On the next CPU I created an inmate running Linux with PREEMPT_RT, so I had two independent Linux systems running. From the user's perspective this really looks like having two different boards on your desk: these systems run independently, and you could even crash the root cell and the inmate would keep running. The system is really physically partitioned. And just to add some additional noise, I started a bare-metal application on the last CPU, to see whether that could have any bad influence on the PREEMPT_RT systems. So once again we had two PREEMPT_RT systems running, one on the host and one in the guest, with measurements taken in parallel. From the root cell you can ask the hypervisor which cells are running; the command-line flow looks roughly like the sketch below. As usual, we did all the measurements with cyclictest.
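For a rough idea of what this partitioning looks like on the command line, here is a hedged sketch; the cell config and image names are hypothetical placeholders, not the files from my setup:

```sh
# Load the driver; the Linux system you booted becomes the root cell
modprobe jailhouse

# Enable the hypervisor with a system configuration for the board
jailhouse enable imx8mp.cell

# Create, load and start an inmate cell
jailhouse cell create imx8mp-inmate-demo.cell
jailhouse cell load inmate-demo demo-image.bin
jailhouse cell start inmate-demo

# Ask the hypervisor which cells are running
jailhouse cell list
```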
Let's look at the results. These are the latencies on the host system: we ended up with a worst-case latency of about 45 microseconds, which is pretty much what I would expect from this i.MX8, so pretty good latencies here. Now, drumroll: how does it look in the guest operating system? This actually looked really, really good, even slightly better; at the end of the day the worst-case latency is pretty much in the same area as on the host. In all of the measurements we did, we reached native host latencies within the guest operating system. So with Jailhouse we can say: running the guest doesn't have any influence on the host, and in the guest we can even have native host latencies. Once again, it's a completely partitioned system; these systems run completely independently.

That brings us back to the question: how does it look with the level of separation? If I play the same game with the blocksys driver, emulating a misbehaving driver, and inject the noise on the host, you can actually see that even if the root cell is disturbed, the guest doesn't see any influence at all. Now we can really see that the systems are completely independent; I could even crash the root cell and the guest would keep running. So we have a very, very high level of separation here.

Before we start with the next chapter, let's briefly summarize the technologies we've looked at. What we can basically say is that with all of them we came to the conclusion that you can have real-time behavior in the guest. For KVM it was a bit tricky to configure and you might need to accept a few constraints, but at the end of the day it's doable. With Docker we pretty much hit the host latencies; with KVM there may be somewhat longer latencies in the guest; and with Jailhouse we again reached host latencies. Looking at the separation: for Docker it was very limited, for obvious reasons. For KVM, remember, in one direction we could get a disturbance, but in the other direction, looking at the guest operating system, we had a very high level of separation. And with Jailhouse it was really excellent: we had complete isolation of the systems.

But does this give us complete independence from the host system? Well, the answer is no, for all three technologies we evaluated. Now the question would be: why? I just told you that at least with a hypervisor like Jailhouse you have a very high level of separation, so why do I tell you now that you cannot get complete independence from the host? The main reason is that what you can really achieve depends on your hardware, and this is why we also looked into the impact of shared hardware resources. We separated the CPUs, but there are still resources, at least on the Arm architecture we used, that are shared, like the level 2 cache. To figure out what impact that can really have, I picked another board out of our QA Farm: a quad-core Cortex-A53 with 500 KB of shared level 2 cache, and I tried to stress that cache a bit and watch the influence on the behavior. This is how the system looks without any memory stressing: quite okay, with a typical worst-case latency on this particular board of 100 microseconds. Now, as always, I ran cyclictest on all four CPUs; remember, the latency was usually in the area of 100 microseconds. The only thing I did in addition was to run stress-ng, doing memory allocations on just one of the four CPUs, not on all of them: a single malloc stressor pinned to CPU 0, with a maximum of 32 allocations in parallel, allocating memory at a very high frequency. And I was really mean to the system: I picked exactly the cache size as the allocation size. A sketch of such a run follows below.
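A hedged sketch of such a stress-ng run; the 500K value stands in for the shared L2 cache size of this particular board, and the exact parameters are my reconstruction of the scenario described above:

```sh
# One malloc stressor pinned to CPU 0, keeping up to 32
# allocations in flight, each up to the L2 cache size, so the
# shared cache is constantly thrashed from a single core
stress-ng --taskset 0 --malloc 1 --malloc-max 32 --malloc-bytes 500K
```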
The impact you can actually see is this: I'm generating the load on CPU 0, but it has a very significant impact on all the other CPUs; the latency goes up to six times the number you would usually see. And this is why I said there is no complete separation, or rather that the separation really depends on your hardware, because there's actually not much you can do in that scenario. This could be any unprivileged process doing these memory accesses, which once again brings me to the point: you cannot just randomly deploy workloads and expect real-time to work. You need knowledge of the overall system and also of how your hardware works. There is hardware that handles this differently, but in this case, no, there's no workaround.

Just to prove the theory that this is caused by the shared last-level cache, I picked a different CPU: an i.MX8QuadMax, which has two CPU clusters, where the cache is only shared within one cluster. So we have four Cortex-A53 and two Cortex-A72 cores, and each cluster has its own separate cache. I did the same memory stressing once again, but this time it was running just on the Cortex-A53 cluster, on two of its cores. The effect is obvious here as well: the latencies on that cluster go up to very high values; on that CPU I usually also see values well below 100 microseconds, so there is a significant impact. But as you can see, on the other CPU cluster there's no effect at all, because it has a separate level 2 cache. This is a pretty nice example of how shared hardware resources can really have an impact on the behavior, be it the latency, the runtime, whatever; the caches are really a big factor for performance, and if you mess them up, you're in trouble.

Okay, that pretty much brings me to the summary of the presentation, first with respect to the real-time capabilities. We've learned that a guest system may have real-time capabilities, depending a bit on the technology you're using. For containers, we've learned they can show the same latencies as the host; for KVM we've seen that the guest may have longer latencies; and for Jailhouse we've also seen the same latencies as on the host. But in any case, as we've just learned, keep in mind that shared hardware resources can have a significant impact on what's happening, and on the level of separation. As for the summary with respect to separation: we've learned that the guest system may or may not be well separated from the host, depending on the technology you're using. Containers, based on how they work, are not well separated. Full hardware virtualization with KVM may provide a higher degree of separation: we've seen that a misbehaving driver in the guest doesn't have any influence on the host, so there we have a very high level of separation. And last but
not least, with Jailhouse we had a very high level of separation: we could even crash the guest or the host and the other system would keep running, and we saw no influence on the latencies. But once again, shared hardware resources can, also in that setup, lead to significantly different behavior when it comes to runtime and latencies. That's pretty much what I wanted to tell you. Thank you so much for attending my presentation; I'm now happy to answer your questions.

Q: Hi, thank you for the presentation. In one of our projects we see a similar issue, that the shared cache has a significant impact on the latency. Other than the clustering, do you have any suggestion for what we could do in our project? We tried the clustering and that works, which is a good thing, but we couldn't manage to handle the shared cache without clustering.

A: It depends a little bit on the architecture you're using. You really might want to check whether there's a possibility to color the caches or to lock something, which is not possible on the A53. In theory at least, the Armv8 spec has a feature called cache lockdown, where you can lock a specific cache area, but to be honest, up to now I haven't found any SoC that really implemented it. That would be a solution, so really check on your SoC whether there's some cache configuration you could use. Apart from that, there's not much more you can do beyond looking at the software design and avoiding random memory allocations. In this case I was really mean to the system, allocating exactly the cache size; just avoid that kind of behavior. But there's not much more you can actually do than looking into the architecture. It's not the case that you can configure the system and then deploy any random load; once again, check what your hardware can really do. Apart from that, unfortunately, there's no recommendation I can give you.

Q: I think in Jailhouse they implemented some cache coloring technique. Did you try your test with that, so that the virtual machines are separated at the cache level? Maybe it could solve such an issue.

A: No, I actually did not try that at all. It can limit the effect, but once again, it is highly dependent on what you can really do at the hardware level. In this scenario I didn't try it out.

Any more questions, maybe from remote? One here.

Q: One question regarding your KVM case and blocksys: did you run blocksys in all three virtual machines, or did you just run it in one virtual machine, and did that influence the other two?

A: I actually did one run where I had two virtual machines running, not just one, and I couldn't see any influence between the guests when blocksys was running in a guest. But I definitely always saw the influence going straight through to the guests when it was running on the host, which once again makes sense. Does that answer the question? Okay, thank you guys.