My name is Ilana. This is a joint talk together with Shuah, about real-time Linux and safety-critical systems. We decided to call it "the potential and the challenges," because even the title pairs terms that may look odd together at first glance: safety-critical together with real-time Linux. Shouldn't that be a proprietary RTOS? So we are looking at it from the angle of the potential, the power of Linux, and how we deal with the challenges it presents. Shuah, from the Linux Foundation, will take over in the second half of this talk and speak about her work in this area.

Okay, I'll go a little quickly because time is a little short. As we are all aware, Linux presents challenges for safety-critical systems. We don't have bulletproof solutions, but we would like to present some of the work we have been doing, how we are facing those challenges, and how we expect to be able to deal with them. Part of this grew out of work in the ELISA project, a Linux Foundation project which is trying to enable Linux in safety-critical applications.

So, to get started: what are we trying to achieve? The first and most important thing is that we are investigating the very rich set of features provided by real-time Linux, with the obvious understanding that nobody is going to apply every single one of those features in any single system. Because of that rich set of features, for any particular system we can do a careful analysis — the same kind of analysis we would do if we put safety-critical needs aside for a moment. When you want to design a real-time system, you have to define your real-time requirements very carefully. You have to understand your use case and your architecture very well: the requirements for latency, for determinism, et cetera.
Then you look into the set of features, what's supported, and you decide how you architect your system, how you design it, implement it, and test it to ensure that your real-time requirements are going to be met. We do the same thing for the safety-critical features. You have to do a very careful analysis: what are your safety requirements? Not every bit of data is safety-critical; not every process, task, or thread is going to be safety-critical. There are different levels of safety criticality, and when you design your architecture you have to decide what exactly your focus is and what your requirements are. Then you go through the same sort of process: you look into the set of features which are available and design your system so that your requirements are met. And at the end of the day you have to do a lot of testing — as anybody who works on real-time systems knows — to ensure as much as possible that those requirements are met.

So that's basically what we are trying to do. We are investigating the various features and trying to come up with — and this is ongoing work — guidelines for when one option may be more reasonable than others, and for how we can combine the different features in ways that support specific requirements. That's the goal of this work.

What is not going to be in this work: I don't think that when people develop real-time systems they expect some magic wand you wave over the man pages, after which you automatically have a real-time system. But I think that for safety-critical systems there is sometimes a notion that some magic will make things safe. It's really the same process: good engineering, good system design, and a lot of testing to make sure that your system meets those requirements.
We want to make this clear because it's something we've been repeatedly asked in ELISA: whether we're going to provide an off-the-shelf kernel release which is "safe." Something like that doesn't exist, and I don't think anybody who is realistic about engineering work should have such expectations. Basically, you have features; you have to use them in a sensible, reasonable, and effective way, and demonstrate by testing that your requirements are met — that you have a system which is safe and which meets your real-time requirements as well.

So what is our methodology? You have to know your use case very well and understand your expected workload, because those details will shape how you define your requirements, then your architecture, then how you implement and test it. And what I found absolutely amazing — it's only fairly recently that I moved into this area in my work designing systems — is how much Linux offers. At the beginning I put aside the safety attitude that you have to keep things down to a minimum, and said: let's look at what's available, that whole huge set of features, and then zoom in and focus on what's really relevant and what can support what we need in our particular use case. Then you have to validate — and this is very important — to confirm that your design actually meets those requirements, both real-time and safety. And then test, test — I could have written "test" forever, because it should be clear that testing is going to be very important to prove not only that the real-time requirements are met, but also that the safety-critical requirements are met.
Okay, and this is something which works. I'm not looking at this from the angle of safety qualification or anything like that, but from experience: when a system is well designed in this way — it comes with clear requirements, and the features chosen for integration are configured and defined appropriately, so that you can demonstrate they satisfy the relevant safety and real-time requirements — you're already halfway toward acceptance for safety qualification. That's really the point here: good engineering takes you very far, whether you're using Linux or an RTOS. And the opposite is also true: without good engineering you won't have a safe system, and you won't have a system which meets real-time requirements. It all revolves around that.

Now, I don't have much time, and there's a lot of basic material here. The work I've been doing is trying to investigate these features and read them in a new way: thinking of them not only in terms of how they help us meet real-time requirements, but also how they can help us meet safety requirements.

This first one is fairly basic for anybody who works with real-time systems: the different scheduling policies. Very briefly, for anybody who doesn't come with that background: in a real-time system you can assign priorities to your different tasks, and the way you assign them is that real-time tasks with very strict real-time requirements should get the highest possible priority. How you assign those priorities goes back to the architecture: you have to understand all the tasks you are working with and allocate the priorities accordingly. And those priorities have implications for the safety requirements as well.
Okay, so again: we're taking the same basic real-time guidelines and now looking at them in light of how they impact safety requirements as well.

Memory allocation: as a general rule, dynamic allocations are considered less safe, so as much as we can we try to allocate at compile time — or, if we must allocate dynamically, we allocate a static pool up front and do dynamic allocations only within that protected, pre-sized pool. That also has implications for real-time design, because dynamic allocation tends to — and these are rules of thumb, which have to be tested and verified in specific use cases — impact your determinism, latency, and so on. Therefore we try to avoid it as much as possible. That's a common goal, and we have to understand how to actually apply it in a particular architecture. There are other related features: you can try, as much as possible, to resolve symbols already at system startup, and you can lock pages in memory after allocation. Where that's relevant and possible, it's a good thing to do, because it helps both memory safety and meeting the real-time requirements.

Resource management: we have cpusets and the other features Linux supports for defining a separation architecture. In the world of safety — I come from automotive safety, so this is focused there — there's a concept called freedom from interference, which basically means your architecture has to be partitioned and designed in such a way that one software component provably cannot impact the execution of another software component. There are plenty of basic tools and features in Linux which support freedom from interference, and again we have to
look at all these types of features with an eye on meeting real-time requirements as well. We can define cpusets to isolate CPUs — say, if we want a CPU dedicated to a particularly sensitive real-time workload — and we can also set affinity masks. All these features are offered; once we understand what our real-time and safety requirements are, we can use them in an appropriate way to meet those requirements.

I don't have much time, so I'm just going to go over these quickly. For real-time workloads there are features which help ensure jitter-free CPUs. Jitter can also be a problem for safety, and the same features can be used — and I know we are using them — to support safety requirements on systems as well. Clocks, interrupts, timing issues: Linux helps us with all of these. The TSC is something specific to Intel; it's not always relevant and it's not really portable, but it's a feature which exists and has its use cases, and there are more generic timing features as well. The same goes for power management and system tuning. There are different safety nets which can be removed if we want — for example, how we handle RT throttling — and in Shuah's work she's going to give some more specific examples of newer use cases here.

I'll finish up in two minutes because I have to hand over to Shuah already, but for example: the usual recommendation in real-time systems — and to every rule there is an exception — is to disable the soft lockup and other lockup detectors. However, in safety-critical systems these are features which we may actually need to support the system safety, so we have to balance the different requirements and decide what is most relevant. Another thing I found: hardware errors which are
self-corrected by the hardware should normally be ignored. Especially in the automotive domain — I'm not going to get to it, but we have a slide on handling SMIs — I have found that sometimes the hardware itself has features which support how we handle them: we can investigate the different types of SMIs, which are well defined, and decide how each should be handled. If we work together with the hardware and the Linux features, we can find good solutions for problems which may be very, very difficult to resolve otherwise. That's what I said about combining with hardware features found in automotive-grade components: they help us define a correct architecture and manage SMIs properly, which may otherwise be a very difficult problem.

Then there are priority inversion, debugging tricks, and some kernel configurations. In short: we have a lot of features, and all of them can be used judiciously. They should be seen as assets which can help us support our safety and real-time requirements. Designing the system well and testing it well is the best assurance we can give for a good system.

[Shuah] Okay, so Ilana talked about our scope and what we're trying to do here. Another goal we have is to put together guidelines for system integrators: how they can understand the system, the platform, and the workload running on top of the PREEMPT_RT kernel. Here is the kernel repository where we have the development branch of the Linux PREEMPT_RT kernel. My goal has been to take the latest PREEMPT_RT development branch, set it up, and enable the PREEMPT_RT option, so that you are actually running the PREEMPT_RT kernel. Then there are some guidelines on disabling RT group scheduling, because for the workload I'm using I have to
disable that to be able to tweak the priorities on the threads. So you will see that I'm playing with these two repositories: the PREEMPT_RT kernel development branch, and the rt-tests repository. I'm using rt-tests as a workload, to evaluate and to develop guidelines for system integrators so that they can do the same thing: take the PREEMPT_RT kernel, put their workload on top, and follow the same kinds of steps I have done here to evaluate how their workload behaves on the platform. They understand the hardware platform — by hardware I mean hardware and firmware — then you have the PREEMPT_RT kernel in the middle, and on top they put the workload they should know best; then they can understand how that workload behaves on this PREEMPT_RT kernel.

I have chosen two tests: cyclictest and cyclicdeadline. You will see that I run these with and without a lot of load; in some cases I have run them with six kernel compiles happening in parallel — meaning the disk, I/O, and memory are under pressure — to see how real-time behaves in those cases. The number I focus on is the "Missed" count you see right there. That's important to me because I want to see deadline misses: is my workload missing deadlines? Then there's a quick usage of the deadline test, and on this slide I run it on a PREEMPT_RT kernel, then take the same workload and run it on a vanilla kernel without PREEMPT_RT enabled. What I'm looking for is missed deadlines — if my workload is missing deadlines, that's an indication that I have to go and look at why. So let's go one further. The previous one was the run on a PREEMPT_RT kernel, and this is the run on the vanilla kernel. By vanilla kernel I mean that I don't have PREEMPT_RT enabled at all — so this is the main
line kernel, by the way, from Linus. I am comparing mainline with the equivalent RT kernel, because the RT development tree keeps merging mainline up. For cyclicdeadline you want to run a one-hour stress test; you can't just run it once for ten minutes and figure out what's going on. You want to run it at least one hour, maybe longer, depending on your workload's behavior, and then compare those two runs.

The goal here is to understand your system: your hardware and firmware, with the PREEMPT_RT kernel running, and your workload on top of that. You're looking for a couple of things. You're measuring latency, and you're also looking for operating-system noise — I have tools here that can tell you that — and for hardware noise, if you enable the hardware-noise tracer in the kernel configuration. There are two tracers, osnoise and hwlat; you can enable them in the kernel configuration when you build the kernel. Then you can look at how your system is behaving: you can measure the maximum latency, and you can look at the operating-system noise and the hardware noise. There may or may not be much you can do about either one, but understanding them is helpful. For example, hardware noise: can you reduce it in any way by configuring your firmware and hardware a certain way? It is useful to understand what you're looking at.

And then, understanding your workload — that's very important, because how the platform, hardware, OS, and workload interact with each other is what determines whether you reach your real-time goals. So let's look at what we are doing to understand the workload. First of all, you probably have some idea about your workload: whether it is allocating
memory and locking memory, for example. In any real-time workload you don't want allocation stalls — waiting for memory and missing your deadline. So you want to allocate and lock the memory up front; you probably noticed that when I ran cyclictest earlier I had the mlockall option on, asking cyclictest to lock its pages. Think about your workload: is it a hybrid workload with RT and non-RT threads, and if so, do the priorities work together toward your deadlines and the RT behavior you're looking for? The other thing to look at is whether you have any dependencies between the RT threads and the non-RT threads that prevent you from achieving your deadlines and goals. And of course you want to run a soak test — a continuous-operation test — on your workload, to understand how it behaves over a longer period of time.

Okay, so now you're understanding your system and your workload, and I have given you a number of tools that can be used to do both. Now, what do you want to gather? You might ask: okay, you asked me to run all these tests, but what do I do with them? What you do is record your system stats. You want to do that before your workload starts, to get a baseline; while the workload is running — maybe at a midpoint, or more readings than one if you want; and after the workload stops. The stats you're looking for: is your workload fragmenting memory? You look at the vmstat counters — the OS provides those files for you. General system health: am I seeing lots of interrupts, am I seeing NMIs, anything I don't want to see? You want to look at the scheduler stats, and then the pressure stats: CPU, I/O, and memory. So what do the memory stats tell you?
You're looking for fragmentation, of course, and the other thing you're looking for is allocation stalls — those would indicate that your RT workload, or maybe the non-RT threads, could be interfering. You don't want to see allocation stalls, or at least not too many of them. You're also looking at compaction failures, and at how many times the compaction daemon, kcompactd, is waking up; and when kcompactd isolates pages for migration, are you seeing failures in that? So you're looking at all of these, plus the page-migration success and failure numbers, and at nr_mlock, which tells you how many pages are locked.

So with all these stats, what are you paying attention to? Is memory fragmented after the workload run? Am I seeing allocation stalls or compaction stalls? These tell you whether your workload is behaving as an RT workload should — and if your workload is not an ideal, or even a good, RT workload, that needs to be fixed as well. Are you seeing a large number of page-migration failures? You want to compare the pgmigrate_success and pgmigrate_fail stats against compact_isolated, the number of pages kcompactd isolated for migration. You're going to see some failures — that's a given, for various reasons — but you want those failures to be a small percentage of the pages that were slated for migration; mostly successes is what you want to see. There's also a good indicator of how many times kcompactd
wakes up: a counter that gets bumped whenever the daemon wakes. That tells you your compaction daemon is healthy — running and not blocked by anything. The workload should lock memory: it should allocate memory initially and lock it, so that it's not spending time allocating memory during runtime. Depending on your workload size, you can estimate what nr_mlock should be — a high number, depending on the workload. Again, you have to know your workload and how it interacts with your platform.

This is the test I did with the six kernel compiles happening in parallel, running cyclicdeadline for an hour. You can see what I'm talking about: the page-migrate failures are a small percentage of what was isolated, and the successes are quite high — which is what you want to see.

After memory, also look at the scheduler stats: do they look okay? While you're running the workload, are you seeing MCEs, interrupts, et cetera? Ilana mentioned that in some cases ignoring correctable errors — things that firmware and hardware correct — is a good idea, so that while running the workload you're not trying to handle something that has already been handled, or that your hardware could handle. Look at your pressure stats: CPU, I/O, memory. If it's a CPU-bound workload, do you understand your workload in terms of the numbers you're seeing — do they track and align with your workload and its resource usage? You can tweak a lot of these things through the sysctl settings, that file over there. You want to be able to only handle the
things you absolutely have to handle. So: pay attention to deadline misses, using the deadline test we talked about. Tune hardware and firmware as suggested by your vendor — turning off power-management SMIs, if that's the right action for your workload, and adjusting P-states the same way; you want to do the right thing for your workload.

RT throttling: the premise of Linux scheduling here is that a well-behaved RT workload doesn't exceed 95% of the CPU time. That means five percent is strictly reserved for non-RT, enforced all the time. For example, if you have a runaway RT thread, what are you going to do? The sysadmin has to have enough bandwidth to be able to go in and take action — that's where the five percent comes in. You do not need to tune RT throttling; don't mess with it, let it be. As you understand, Linux RT is soft real-time.

Then look at the deadline scheduling class: if your workload is deadline-based, that's probably what you want to use, because it ensures deadlines are met. There is work in progress now on the deadline server. There is a delay in enabling the deadline server, and that is being controlled; the idea is that we have this five percent of reserved time — can we take advantage of it and avoid idling? If the CPU is going idle, can the deadline server use that five percent? There are two different ideas being discussed for how and when the deadline server could come in and schedule tasks. I'm following that work, and I'm also using both patch sets under discussion for my testing. Okay, I think I'm on track here — I've had my five-minute warning — so should I go forward? Okay.
Future work: I think Ilana and I will come up with something. These are my references, and that's pretty much it. Discussion, questions — I have the microphone; any questions? I'll pass the microphone to you.

[Question] In the first part of the presentation you mentioned that disabling some lockup detectors is not recommended. Maybe repeat the motivation — why is it bad to disable the lockup detectors?

[Ilana] In safety-critical applications there are sometimes strict requirements for exactly the kind of support those features provide. But they have an execution overhead, so you have to balance that — your requirements for real-time versus safety. Overhead, basically, yes.

[Question] And the second point was regarding the hardware-corrected issues. For example, I have ECC memory and errors are hardware-corrected; currently we use them as an early warning — we don't panic, but we log them and then analyze the logs, because usually it starts with correctable errors and over time they become uncorrectable. So what is the motivation to disable this as well?

[Ilana] No, I didn't say to disable it. What I'm saying is that the application layer doesn't have to handle the correctable errors which are handled by the hardware. We are finding that for automotive-grade hardware, more and more vendors are supporting correctable-error hardware features, which are very helpful: we avoid the overhead, the performance impact, and the implementation of those safety features in the application layer — in the software itself. I mentioned the same point with SMIs: it's only recently that we are finding we can get more clues and information from the hardware. In different hardware registers, information is being tracked for us, so we can get a much better handle on and understanding of where those interrupts are coming from, why they are generated, how they should be handled properly, and then to
understand what's left for us to do at the application layer. A lot of this working together with the hardware is something we should be aware of, so that each layer does what it does best, and we don't do work in the software layer — which could be drivers, anything above the hardware — that is much better handled by the hardware. That was the general point, and there are a lot of examples. You should have that awareness. Again, I know this from automotive-grade hardware, but I think it's probably true of hardware in general nowadays: take advantage of hardware features which can support the design and implementation, so we focus on what we really need to do.

[Question, from a virtual attendee] What is hardware and operating-system noise?

[Shuah] This goes back to NMIs and correctable errors, and also some other hardware-related noise that could make the workload miss deadlines — any hardware-related activity that could cause a miss. When you have NMIs and such happening, the OS usually has to go and get information on the status of the hardware. That's the kind of thing you're looking for when measuring hardware noise.

We still have a couple of minutes, so: really what we are trying to do is put together a guidebook for system integrators, so that they understand the PREEMPT_RT kernel and the things they can look at to figure out the behavior of their workload on a system. From the ELISA perspective we use generic workloads, because we are not looking at specific hardware, specific firmware, or any specific workload. We have a couple of workloads we might work on, but currently, for this purpose, we are looking at a generic RT workload. So, Ilana, do you have one
last comment?

[Ilana] It's really an invitation to the community. I work, as you know, as a system architect designing software, designing these features, and it's a new journey for me in our work — but I want to invite people to think of the potential, and the challenges, in particular in safety-critical systems, and not to run away from it. I'm not talking about that dedicated microcontroller with very, very strict safety requirements, where it's almost never relevant to use Linux and it's not needed — only a very focused set of features is required there. But when you have systems that really need a lot of power, a lot of features supported, with strict requirements both real-time and safety: understand how Linux, rather than hindering us, gives us that power, and see how we use it in the right way. Okay — so thank you.