Hello, good morning and good evening, everybody. My name is Shuah Khan. We are starting our ELISA mini-conference at the Open Source Summit today, and we have a panel of speakers who will talk about various aspects of ELISA and share the work we have been doing over the past several months. Let's start with what ELISA is to begin with: ELISA is about enabling Linux in safety-critical applications. What does "enabling" really mean? We are looking at assessing whether a system is safe, and whether we understand our systems sufficiently. If your system's safety depends on Linux, you need to understand Linux sufficiently for your system's context and use. The keywords here are understanding your system context and how to use Linux safely on that system. So we are taking a safety-critical process approach to Linux: we look at the Linux development process and then examine how well we can map that development process to the safety-critical standards, so that we can come up with guidelines for understanding your system and making sure Linux can run safely on it. Our mission statement: we are defining and maintaining a common set of elements, processes, and tools that can be incorporated into specific Linux-based safety-critical systems amenable to safety certification. That is a lot of big words, but essentially this is our development process.
Then we have a set of safety standards, and we map our tools and processes onto them, to show how the development process can be mapped to the elements that safety certifications require in terms of tools and processes. We have several working groups. The kernel development working group looks at kernel processes, identifies gaps, and determines where we can fill those gaps to meet the safety standards. There are several safety standards we keep looking at and try to map to. The safety architecture working group does the same thing in a different way: it looks at safety-relevant elements and platform safety analysis, and maps them, so that we work through a path to closing the gaps, to be able to say that Linux can be used safely, and to define a path forward for identifying gaps and outlining what is necessary in terms of processes and tools. The working groups we currently have are the kernel development working group, the safety architecture working group, and the tooling and development working group. The tooling group is doing a lot of work as well: identifying tools and checking whether those tools give us a way to map what is happening in the development process. For example, if we identify a kernel subsystem, say kernel memory, as safety-critical for us, we look at the development activity happening there, figure out the changes being made, and define a process for translating that development activity and putting a safety context on those changes.
With that, I am going to hand it off to Christopher to kick off the mini-conference with his topic.
Thank you. Let me just share. So, yeah, we thought we would organize the conference with some lightning talks as well. What is a lightning talk? A lightning talk is meant to throw a bolt of lightning into the discussion and see if any thunder comes back, so this is really reaching out to everyone on the call to engage. You can use the chat; I'll show you in a second what we have thought about. One of the big challenges in enabling Linux for safety-critical applications, and for pretty much any system, is managing complexity. Complexity is a significant issue, and one we struggle with on a daily basis, because of the position an operating system occupies in a complete system context. On the one hand, the operating system provides an interface towards the hardware architecture. The people writing the operating system, the kernel developers, sometimes have a better understanding of the underlying hardware architecture than one or the other hardware architect who actually developed the hardware, because the kernel programmers see the architecture as it is active: all the cogs are spinning and interacting with one another. That is a far more complex picture of a hardware architecture than you get by just reading the reference manual, where everything looks like either this happens, or this, or this; on a modern multi-core CPU architecture, a thousand things happen in parallel. So there is a level of complexity towards the hardware; there might be a hypervisor layer in between; and there is complexity in the interface towards the application.
So we have been asking ourselves: would it help the objective of ELISA to somehow get control of the complexity, and what can we do? One area we have been discussing quite intensely is what complexity levels are brought into the problem by the application itself. This discussion emerges from different directions. On the one hand, we continuously have parties joining the working groups and saying: what are you doing? I know of a system where Linux is already used in a safety-critical application, so the problem is solved. And then you ask: but how is it used, in what ways, and what specificities does the application have that can be exploited? So this is a very crude separation into two classes. In the beginning I had called them class one and class two, but that makes it hard to explain the classes in between, so we gave them a neutral labeling, a low end and a high end, and you can imagine any shade of gray and any rainbow color in between. At the low-complexity extreme we have low safety complexity, and that is clearly helpful for enabling Linux in safety-critical applications. Those systems are typically ones where transient faults are less critical or not critical; the system has a low-pass characteristic. Permanent faults are critical; you have to leave some safety aspect in, otherwise the problem gets trivial. The fault tolerance time is long compared to the execution speed of the software on the system. Very frequently you have a human in the loop, which is great, because the fault tolerance time interval, the time between when a fault occurs and when something dangerous happens, is long, so the human can intervene. That takes off a lot of pressure, because a human is always an ideal place to put some plausibility checks.
If he is awake, that is. You can put end-to-end plausibility checks in place because, unlike in an autonomous-driving context, in many cases you are not moving a lot of data around the system, so you can put integrity wrappers around the data using some kind of hash or CRC function. You less frequently see mixed criticality integrated on one system, where you try to get multiple systems operating in the context of, for example, one Linux instance without interfering with one another. And lastly, you are looking at lower levels of safety integrity. On the high-end side, transient faults become critical, so you need some way of engaging quite tightly with the hardware, because in many cases the hardware will detect the transient faults, or the software has to do it; in both cases you need tight coupling. Permanent faults are equally critical, but this time the fault tolerance time is short: it could be 10 milliseconds or 100 milliseconds, which is by far not enough time to get the human into the loop. End-to-end plausibility checking in the extreme high-end case is not feasible anymore, because you are moving huge data volumes, and if you start putting end-to-end checks over all the data you move around the system, your performance just gets crippled. In many cases you are looking at mixed criticality, and you are starting to look at the highest level of safety integrity, which is ASIL D. So, on the right-hand side I have tried to illustrate class L with a use case we are looking at at the moment in the context of ELISA: IVI telltales. Imagine the TFT dashboard in your car. You need to be sure that an important piece of information, a warning indication or whatever, is shown on the TFT to alert the driver of something. You have the driver in the loop, so we are already seeing that this qualifies for class L.
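Sketched very roughly, such a telltale check amounts to comparing the region of the framebuffer read back from the display path against a reference bitmap of the symbol. This is only an illustration of the idea; the function and names below are hypothetical, not part of any ELISA deliverable, and a real monitor would also handle pixel formats, tolerances, and anti-aliasing:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative telltale plausibility check: compare the region of the
 * read-back framebuffer where the telltale should be rendered against
 * a reference bitmap of the symbol. All names are hypothetical.
 */
int telltale_visible(const uint8_t *fb, int fb_w, int x, int y,
                     const uint8_t *ref, int ref_w, int ref_h)
{
    for (int r = 0; r < ref_h; r++)
        for (int c = 0; c < ref_w; c++)
            if (fb[(y + r) * fb_w + (x + c)] != ref[r * ref_w + c])
                return 0;   /* mismatch: telltale not correctly shown */
    return 1;               /* region matches the reference symbol */
}
```

A monitor of this kind would run every safety cycle and escalate (chime, dashboard reset) on a mismatch.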
Typically that problem is solved by obtaining the bitmap from the TFT, running, for example, a graphics algorithm over the bitmap, and checking whether the telltale, the exclamation mark or whatever your telltale symbol is, is actually shown on the TFT, and then taking some corrective action: maybe alerting the driver with some chimes, switching off the dashboard, or other measures; there are many solutions being considered in the industry. In those cases the requirements on the operating system itself are rather low, which makes it quite attractive to enable Linux for those kinds of applications, because the bar is lower. Then you have E-Gas-type applications. E-Gas is a safety architecture used for electronic throttle systems in cars. It starts to get more sophisticated in terms of the requirements on the fault tolerance time, but even in E-Gas there is still the assumption that there is a human in the loop, because at least for a combustion engine in a normal car the acceleration a car can produce is very limited, and there is still the idea that the driver will realize the car is suddenly taking off and can step on the brake; and brakes are always designed so that even the strongest cars can be stopped by the braking system. Then you are looking at gateway systems. With gateways it gets harder, because you no longer have a human in the loop; you are translating data from one subnetwork of the system to another. If you pass safety-critical data over the gateway, you want to ensure the data does not get corrupted inadvertently, lost, or modified, and does not get stuck, where you continuously repeat sending stale data that still carries a valid end-to-end protection. So you need to start including timestamps and other aspects.
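The end-to-end protection just described, a checksum plus an alive counter so that stale repeats are caught, can be sketched in a few lines. The frame layout and the CRC-8 polynomial here are illustrative assumptions, not any specific standardized E2E profile:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical end-to-end protected frame: payload, alive counter to
 * catch stale/stuck data, and a CRC over both. */
struct e2e_frame {
    uint8_t payload[8];
    uint8_t alive;   /* incremented by the sender every transmission */
    uint8_t crc;     /* checksum over payload + alive */
};

/* Simple CRC-8 (illustrative polynomial 0x2F). */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0xFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x2F)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}

void e2e_protect(struct e2e_frame *f, const uint8_t *payload, uint8_t alive)
{
    memcpy(f->payload, payload, sizeof(f->payload));
    f->alive = alive;
    f->crc = crc8((const uint8_t *)f, offsetof(struct e2e_frame, crc));
}

/* Returns 1 if the frame is intact and fresh, 0 otherwise. */
int e2e_check(const struct e2e_frame *f, uint8_t last_alive)
{
    if (f->crc != crc8((const uint8_t *)f, offsetof(struct e2e_frame, crc)))
        return 0;            /* corrupted or modified in transit */
    if (f->alive == last_alive)
        return 0;            /* stale: sender stuck repeating old data */
    return 1;
}
```

The alive counter is what catches the "stuck gateway repeating valid-looking data" failure mode that a CRC alone would miss; timestamps extend the same idea to bounded latency.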
So the problem is getting more sophisticated, and at the far right-hand side, what I see as probably the most sophisticated safety problem in the industry at the moment is everything around autonomous driving. What we also see now is the option of enabling Linux in a safety application potentially even through qualification. What does qualification mean? Qualification means you have a high-quality development process that produces software whose failure modes are reasonably stable. You characterize the failure modes, but then you put the mitigation of those failure modes around the software; you do not put it into the software itself. For some application classes, and IVI might be one where that is feasible, you could actually get away with this qualification path. But that enables Linux only for a safety application of class L, and as already explained, it is tricky: you do not want the case where someone enables Linux for one class and someone else transfers that into a different class where the assumptions are no longer valid. And the key question, where I would be interested in feedback from the audience, if there is someone here from the OEM or Tier 1 side who actually wants to use Linux, is: what are the characteristics of the application classes you have in mind, where you would say this is the immediate, direct use case in which I want to use Linux and the results of the ELISA project? We are always open and welcome to look at that. At the moment we are looking at IVI, and we are contemplating how to take the arguments to a higher level of safety complexity. But that definitely is an interesting question, as is the lower end of the sandwich I showed on the previous slide:
what kind of hardware are you envisioning? Are you looking more at a lower-end single-core product, or at really highly sophisticated multi-core systems? When someone from Arm talks about highly sophisticated multi-core products, we start at 32 or 64 cores or beyond; those are massive platforms, and they introduce problems of their own that have to be addressed. So, now I need to see how this works myself: how do I see the chat window with questions, can someone help me?
We do not have any questions right now.
OK. So the ask is out there to the audience: if you have specific features or anything you want to see addressed, get in contact with us, because we are always interested in hearing use cases and putting some consideration into the effort it takes to address them. If there are no questions at the moment, there will probably be a short Q&A session towards the end if questions emerge. Otherwise I'll hand over the baton to the next speaker. Gabriele, I think you're next; do you want to share your screen?
Thank you, Christopher. One second. Can you see my screen?
Yes.
OK, so first a quick introduction. I'm Gabriele Paoloni; I actively work in ELISA and lead the safety architecture working group. Today I would like to present to you a use case, a problem, that we have been discussing for quite some time within the architecture working group. The focus of the working group is to analyze the technical safety requirements that derive from domain-specific working groups. However, as of today we are still waiting for technical safety concepts to come in, so we have put together this, let's say, sample use case,
which is what we have been analyzing so far. OK, so what you are looking at in this picture is an x86 system. In the working group we are also looking at the Arm architecture, but here I have done the analysis extensively for x86. We have a memory read error that causes a synchronous machine check exception. This exception is handled by the machine check handler, and it can have two possible outcomes. One: the exception happened in user space, in which case a signal handler will terminate the user-space application. Otherwise, if it happened in kernel space, today we go and panic. And on top, we have a very, very dummy safety function that is petting an external watchdog every safety cycle. So effectively, at regular intervals, we have this watchdog pet, and as an exception comes in, we need to stop this petting, either through the termination of the application or by killing the system. In that case, the watchdog will time out and drive the system into the safe state. Along with this analysis we made a strong assumption: we assume that the memory error itself does not affect the proper execution of the error-handling code in Linux, because our analysis is focused on Linux itself, and we have no control over broken hardware. OK, so, regardless of the usability of this specific example, during the analysis we found some challenges that I would like to share with you today, and I would also like to hear your opinion on them. So if there are no questions on the overall introduction and on the use case, I can go to the second slide.
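The watchdog arrangement just described can be simulated in a few lines: a cyclic safety function pets an external watchdog, the petting stops when the fault terminates it, and the watchdog timeout is what actually drives the system into the safe state. This is a toy model of the concept, with made-up names, not real watchdog driver code:

```c
#include <assert.h>

/* Toy model of an external watchdog: it must be petted at least every
 * timeout_cycles safety cycles, otherwise it trips and forces the
 * system into its safe state. */
struct watchdog {
    int timeout_cycles;
    int counter;
    int tripped;
};

static void wdg_pet(struct watchdog *w)
{
    w->counter = 0;            /* safety function is alive: reset */
}

static void wdg_tick(struct watchdog *w)
{
    if (++w->counter > w->timeout_cycles)
        w->tripped = 1;        /* missed pets: drive safe state */
}

/* Run n safety cycles; the safety function stops petting at
 * fault_cycle (fault detected, application terminated). Returns the
 * cycle at which the safe state is entered, or -1 if it never is. */
int run_cycles(struct watchdog *w, int n, int fault_cycle)
{
    for (int c = 0; c < n; c++) {
        if (c < fault_cycle)
            wdg_pet(w);        /* still alive: pet every cycle */
        wdg_tick(w);
        if (w->tripped)
            return c;
    }
    return -1;
}
```

The point of the model is the one made on the slide: the safety reaction does not depend on the failing software doing anything correct after the fault; it only has to stop petting.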
The first challenge: as I said, the focus of the working group is to deliver safety analyses for specific use cases, so that as the outcome of the safety analysis we can refine architectural assumptions, identify the safety-relevant Linux components, and eventually also improve the code, or upstream changes to add safety mechanisms and so on. Now, the very first challenge was that in Linux, as we know, there is no architectural description. In fact, what every Linux developer mainly does is go and read the code to understand what it does. Whereas in the safety world, safety analyses are usually performed on UML diagrams or, in general, on architecture documents that describe the system design and its behavior. So the first challenge is that, right now, there is no architectural description of Linux. What I did was start from the do_machine_check handler. I opened the code and used a top-down approach: starting from the main entry point, which is do_machine_check, I tried to identify the pieces of code that could affect my safety goal. Practically speaking, I wanted to identify the functions involved in either the system going into panic or the termination of the safety application. Using this top-down approach, I drafted an extensive analysis that you can find at the link at the bottom. In doing so, I clearly discarded the parts that, in my view, were not meaningful with respect to the safety goal itself. For example, I discarded the part about the serial logging,
because I assume that, from a safety perspective, we are not relying on a human being reading the serial log to, say, shut down the power. So this was a rather lengthy process, and the question now is: how can we improve it? A short answer I found while doing the safety analysis: I realized that a lot of information is currently missing, even in the kernel-doc headers of the functions. Most of the functions are missing a kernel-doc header altogether, and even for those that have one, the description does not seem to be enough to support a safety analysis. So, in my view, one outcome is that we can definitely use this safety analysis to improve the kernel-doc headers of the functions, so that a future safety analysis would be considerably easier than what I had to do. Another challenge is: can we guarantee the termination of the safety application within a deterministic maximum amount of time? We can see here that there are two different paths. For a kernel-mode exception, we call do_machine_check, then mce_panic, and then panic. In doing so, we do not rely on the scheduler; it is a pretty self-contained piece of code running in interrupt context. So, from a real-time point of view, this seems at first glance quite deterministic. Obviously we need to run tests and so on, but from an architectural-analysis point of view I did not find too many risks associated with it. The user-mode exception is different: there we effectively rely on the scheduler to deliver the signal and then kill the application.
So, in this regard, we have two options: from a safety perspective, you can either change the user-mode behavior to behave the same as kernel mode, or you extend the analysis to the scheduler itself and the other components involved; and those are quite different levels of complexity. The third point: in doing the safety analysis, what we did was scope out the most relevant pieces supporting this use case. Going back to the kernel-doc documentation that is missing today: another advantage of enhancing the kernel-doc would be the ability to define testing specifications against a counterpart architectural specification. If, within the kernel-doc, we extensively explain how a function is supposed to behave, the tests can probably be more reliable. Reliable means that we do not need to rely on the code understanding of the person who wrote the test; we can rely on a sort of architectural description. Also, by doing this safety scoping of the code, we know where to focus the effort on static checkers, I am thinking of the tooling working group for instance, and on the code coverage figures that would result from the test campaigns we run. So, these are the three bullets that I wanted to present today and that I would like to get some feedback on. You can find all of this, which is work in progress in the safety architecture working group, documented at the link at the bottom. So, if you have questions... I don't see a Q&A window, I see the chat.
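As a concrete illustration of the kind of kernel-doc enrichment being proposed: the function below is hypothetical (it is not an existing kernel symbol), but the comment follows the real kernel-doc format, adding the behavioral, context, and return-value information a safety analysis or a test specification could be written against:

```c
#include <assert.h>

enum mce_action { ACTION_NONE, ACTION_KILL, ACTION_PANIC };
enum mce_severity { SEV_CORRECTED, SEV_RECOVERABLE, SEV_FATAL };

/**
 * mce_severity_to_action - map a decoded machine-check severity to an action
 * @severity: severity grade decoded from the machine-check banks
 * @user_mode: non-zero if the exception interrupted user-space execution
 *
 * Context: would run inside the machine-check handler with interrupts
 * disabled; must not sleep and must not depend on the scheduler.
 *
 * Return: %ACTION_PANIC if the error is fatal or hit kernel mode,
 * %ACTION_KILL if the affected user task can be terminated instead,
 * %ACTION_NONE if the error was corrected and only needs logging.
 */
enum mce_action mce_severity_to_action(enum mce_severity severity,
                                       int user_mode)
{
    if (severity == SEV_FATAL)
        return ACTION_PANIC;
    if (severity == SEV_RECOVERABLE)
        return user_mode ? ACTION_KILL : ACTION_PANIC;
    return ACTION_NONE;
}
```

With a header like this in place, a test writer can assert the documented contract directly instead of reverse-engineering the intended behavior from the code.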
We can maybe give the audience a couple of minutes to pull together a question, either in the chat or in the Q&A window. I found it interesting that the use case actually seemed quite simple when we started with it, and it turned out to uncover a whole bunch of complexity: on the one hand, understanding the architecture; on the other, as I was just looking at this do_machine_check panic, the question of whether it is possible, for example, to do some form of recovery to keep the machine alive. It is the same thing we are experiencing now with COVID-19: from a safety perspective, it is easy to shut down. The safety people and all the medical people are saying the numbers are going up, so shut everything down, because from a human health-and-safety perspective that seems the intelligent thing to do. But it comes with a consequence, and it is the same consequence here: when you shut down, a lot of other things become unavailable and new problems emerge. And again, to come back to the question of the application class: if you are looking at a dashboard, you might consider being more adventurous in trying different recovery techniques, because the consequences of those recovery techniques are not that drastic; you still have the driver in the loop. Whereas if you are driving at 70 or 80 miles an hour on the interstate and the kernel is going into a machine check, the question is whether it is possible to recover from that machine check or not. If you do recover, you have to recover in a reliably safe way, with a high level of integrity; otherwise you might introduce an even bigger problem by attempting a recovery that still has some side effect.
So, for anyone in the audience who is interested in these trade-off kinds of discussions, rather than in right or wrong, which does not exist here, these problems of trade-offs under constraints are really quite intriguing.
Yeah, exactly. And effectively, the bond here is quite strong; the connection between safety and real-time deterministic behavior is quite clear, and that is a challenge in Linux as of today.
I think there was one comment from Stefan, who said: to me, as of today, the most promising approach is having supervisor instances; running those on the problem-solving Linux system as well would be beneficial from a cost perspective and would also decrease hardware complexity. Yes, I agree. With supervisor instances you can go a long way; it is questionable what diagnostic coverage you are able to achieve in the end. There is always this weakness: if you make the supervisor window too broad, there are lots of cases that are actually critical but slip through; and if you make the supervisor window too narrow, you get a lot of false positives and an impact on availability. So the supervisor approach is definitely an important one, but it equally drives a trade-off. It could be something, by the way, that is worth putting onto the shelf of our ELISA discussions: are we able, for example, to put a supervisor framework in place? Say the supervisor framework is managed by the operating system, users put their own supervisor functionality into that framework, and the ELISA project shows how you can substantiate integrity for that framework. I think that is a really good point and one we should keep on our record.
And there was, besides, the second challenge that we identified in this use case: the whole question of the hardware-software interface, especially when we are looking at such low-level kernel features. What does it even mean to provide evidence and proof that this function works as intended? That was one of our intense discussions. Gabriele identified that do_machine_check calls panic, but then we face the problem of: well, what does that even mean? Inside panic, a lot of instructions are executed that, from a software perspective, have no meaning if you do not know the hardware. Then you have a huge interface of complexity in front of you that just tries to explain how the system is halted. That is really one of the challenges for which we do not yet have a good answer: how to present this in a reasonable, understandable way, so that others are convinced that it is fully understood how it executes and are aware of the consequences.
But from a Linux perspective, I think it is understood what happens; you can just read through the code.
Yes, I agree: from a Linux perspective, and again we are talking about software, so from a source-code perspective, it is fully understood, and I think the developers know exactly why they implemented it the way they did. But if we are trying to build up evidence and an argumentation that it actually works, we have to pull together strings from various documents and various understandings, which is a non-trivial exercise; it is not that you just read one page.
I've just learned a new feature: we can actually invite participants to talk, and we do have a hand that has gone up. So, I'll see if this works; I'll allow Dave to join us in the discussion.
I think it's not available, because Dave is using an older version of Zoom.
Okay.
It looks like there is one question here.
Yes; you can probably turn on video to be able to ask a question, I do not know. You have to raise your hand and then we can allow you in, unless, as we have now just learned, it doesn't work with an older version of Zoom. So, Dave, maybe put your question or comment into the chat; anyone else, if you raise your hand, we will invite you in, and you can join us and we can do this orally, without having to turn it into a typing exercise. There is one comment from Stefan; I am going to read it out: to me, as of today, the most promising approach is having supervisor instances; running those on the problem-solving Linux system as well could be beneficial from a cost perspective and would also decrease hardware complexity. Gab, from a software architecture perspective, do you have any comments on decreased hardware complexity? What are your thoughts on that?
Yeah. Decreased hardware complexity: I think it goes in the direction of what Lucas was saying. Obviously, the more complex the hardware is, the more difficult it is to make an extensive safety analysis that covers the whole system, including the hardware. As for supervisor instances, I guess what he means is having a sort of simple supervisor application that runs on top of Linux, similar to what we are doing in the telltale use case, for example.
So, for me, one of the challenges with this, and we have heard this idea of a supervisor quite often, is the following. If I look at this concrete example that you presented, Gab: you want to make sure that the system shuts down when you get a specific machine check exception. How would a supervisor simplify that? The machine check exception will be delivered to some instance.
In our case we assume it is the Linux kernel; you could say it is delivered to a hypervisor instead. Then you want to shut down the system, so you want to make sure it panics adequately. OK, the hypervisor does that, but then we have the same question we just tried to answer: we get a machine check exception, we handle it correctly, and we shut down the system. It is now just deferred completely from the kernel to the hypervisor, but the question remains the same. You just add complexity without solving anything. It is one of those questions of why bring in technology: if the question you have to resolve and the challenge at hand remain the same, then you did not gain anything.
Yes, sorry, thanks. Actually, I definitely misunderstood the comment. In my specific use case, having a hypervisor instance supervising the MCE: to be honest, from what I have analyzed, I do not think it would make much difference, because from the analysis I did, the exception handler is quite a self-contained piece of code. And if the hardware is broken in a way that cannot support the MCE and Linux, I do not see how we could take advantage of a hypervisor.
Yes, I agree with you: if it is broken under Linux, it is probably broken under the hypervisor as well.
Right. But in other use cases, like the telltale use case, there is an advantage in that approach, but then we are missing the goal of ELISA. In the telltale use case, if I say, OK, I run the telltale drawing application on Linux, and then I have a separate supervisor monitoring that the telltale shown is the right one, the expected one, then obviously you have solved all the problems of freedom from interference, and you do not allocate any safety claim on Linux at all.
In that case, practically speaking, I don't think you need ELISA for that; you can use Linux as it is today. The challenge here is that we want a user, an integrator, to be able to make a safety claim on Linux. And there are different levels of complexity, as Chris was saying before; depending on the application there are different levels of complexity that we need to face. But the main goal at the end of the day is to have a safety application running directly on Linux. Any other questions? Christopher, would you like to translate what you typed for us? We haven't quite managed to sort out including participants in the discussion. I was curious whether Stefan has practical experience with using supervisor instances in the context of Linux and safety applications — that was more or less the question the other way around, back to the audience. So, as Stefan says: it is about logical supervisors; today they are easy — easy in terms of safety — to implement as an external device, with a penalty of complexity: hard to deploy, hard to debug. If Linux could become a safety-trusted execution environment, it would simplify things a lot. I agree with that assessment. Yes, it's hard to deploy and hard to debug, and in particular the verification is tricky. And the verification does depend on the product in question, and that product is different from instance to instance. So, isn't that a coupled thing that, at some level, the product people — the system integrators and product engineers — are responsible for? Yes, you have to figure out what is relevant for that system: what are the components you need in the kernel to support your particular hardware, because there are device-driver differences.
And one size doesn't fit all, in some ways; that is the complexity we are looking at in ELISA as well. One more thing about Stefan's comment: we have been thinking about this intensively too, but there is a problem. Suppose you have an application running on the Linux kernel, and it's producing some kind of output, and you want to supervise whether that output somehow makes sense given the input. You mistrust the kernel — you say, okay, the kernel could impact my application, and I just want to detect that. Then you very quickly come to the point that the kernel could impact every computation the application makes: it can modify the stack pointer, it can modify the memory the application is working with, it can modify the program counter. So then you say, okay, to actually supervise this application with sufficient coverage, you are going to transfer each and every result and various internal states to your supervision entity. At that point, your supervision becomes more complex than the application you actually want to execute, and that turns the whole idea of a lightweight supervisor into a monster that you cannot continue to work with. So I'm also wondering: for which kinds of applications does that reliably work? I think the IVI use case is a nice special case where you might get away with it. But once you go into more complex systems, it's going to be more and more difficult to make plausibility checks on a highly complex computation. Maybe Stefan can tell us whether he has done that and succeeded or failed. Stefan says: no, not done yet. Okay, so yes, please try, and if you succeed, let us know. Join us. Join us.
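To make that coverage problem concrete, here is a minimal toy sketch (hypothetical illustration, not ELISA code): an output plausibility monitor catches gross range violations, but a subtle corruption that stays inside the plausible range passes unnoticed — exactly the coverage gap described above.

```python
def plausible(value, lo=0.0, hi=100.0):
    """Supervisor-side plausibility check: accept only outputs in the expected range."""
    return lo <= value <= hi

def supervise(outputs, lo=0.0, hi=100.0):
    """Return the indices of outputs the supervisor would flag as implausible."""
    return [i for i, v in enumerate(outputs) if not plausible(v, lo, hi)]

# A gross fault (150.0) is flagged, but a kernel-induced corruption that
# turns 50.0 into 49.0 stays inside the plausible range and is missed.
```

To detect the second case, the supervisor would have to re-derive the output from the input and internal state, which is where it starts to rival the application itself in complexity.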
Or, yes, join us in building a kind of safety-trusted execution environment, which we are building up step by step: we'll start with making sure a panic shuts down the system, and then go on. This is a great discussion, so I don't necessarily mind keeping it going; we have about 40 minutes left in the session. Would you like to go next with the tools? Elana is not here, so I was thinking we could have a discussion around safety architecture. Okay, is that okay? Yeah, that's fine. Is Elana joining later, or not going to join? It looks like we have a good discussion happening around safety architecture, so I would like to keep it going, then tie it to your tools, and we can talk about kernel configurations a little later. I have to provide a bit more introduction — should I stop sharing the screen? Any other questions quickly from the audience? Looks like we have kind of wrapped up; I'm not seeing any questions in the chat. Yeah, so we started from a system perspective. Christopher talked about how you can reduce the consideration for Linux by knowing the system and understanding that some fault classes really don't impact your system. And Gab showed that only a specific part of the kernel is relevant. The next question is, of course: how do you show quality of the kernel for specific aspects? What we want to discuss is how you could provide evidence and arguments addressing certain bug classes. So, we are starting with the motivation to show the absence of a specific bug class in the kernel, confidently. That is, in some way, the derived safety goal that we want. And of course the question is whether all bug classes are similarly important or not — and this really depends on the system properties we are considering.
And just to give you an example: if we consider the system that Gab showed, a null pointer exception would probably terminate the safety application, and then it wouldn't pet the watchdog anymore. A null pointer exception in the application, or one in the kernel, would most likely just terminate the system and hence be safe. So there are certain classes that are not relevant, but there are other classes that are. If we consider a multi-threaded system, there could be concurrency bugs, and of course these concurrency bugs do impact the functional correctness of those applications, so we might be interested in removing all concurrency bugs. Again, it depends on the kind of system we're looking at. What we intend to do is look at the system and identify the relevant bug classes; the list of bug classes to consider is the Common Weakness Enumeration — I guess Shuah will talk about that in a bit more detail. And once you know those classes, you can think about which tools address each one. One of the investigations we started was running those tools to identify the bug classes, and finding out how to do that with reasonable effort, collaboratively — if we look at the kernel and glibc, how can we do such efforts collaboratively and sustainably? I'll just mention two things I've observed. One is that you can find a number of people in the kernel community who use static analysis tools. They look at the results and provide patches for the true positives. By doing that year over year, supposedly only false positives remain — modulo the mistakes they overlooked. If someone new comes into this community, or someone with a new tool runs it on this source base...
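The watchdog scenario a few sentences back can be sketched as a toy model (the names here, like `Watchdog.pet`, are illustrative, not a real kernel API): the safety application pets an external watchdog each cycle, and if a fault such as a null pointer dereference terminates the application, petting stops and the watchdog drives the system to its safe state.

```python
class Watchdog:
    """Toy external watchdog: expires if not petted within `timeout` ticks."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.since_pet = 0
        self.safe_state = False

    def pet(self):
        self.since_pet = 0

    def tick(self):
        self.since_pet += 1
        if self.since_pet >= self.timeout:
            self.safe_state = True  # watchdog expiry forces the safe state

def run(watchdog, app_alive_cycles, total_cycles):
    """The application pets the watchdog each cycle until it crashes (e.g. NULL deref)."""
    for cycle in range(total_cycles):
        if cycle < app_alive_cycles:   # application still running
            watchdog.pet()
        watchdog.tick()                # the external timer ticks regardless
    return watchdog.safe_state
```

This is why a crash-type fault class can be argued safe for such a system: termination of the application is detected externally, with no claim on the kernel itself.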
...you'll find many false positives, because all the previous and parallel attempts have already found the true positives, and you don't know about their findings. The second observation I've made is that some people who want to use static analysis tools then focus on an older released version, instead of engaging with the current community. So you find, with a static analysis tool, a bug in a version that's two years old — and that often doesn't help much, because you then find out that this code has been rewritten in the present version, or the bug has already been fixed. These kinds of results don't lead to a sustainable activity in the end. I want to raise a number of questions for the discussion. One is whether the quality standards you consider expect the use of static or dynamic analysis tools for quality assurance. Do you expect this to be applied to the open source components you're using? And do you already apply such tools internally on open source base components like the kernel, glibc, and others you might be using? With that, I'll go over to the chat. One comment on what you said, Lucas: the current-kernel-version concern applies equally to other bugs, not just static analysis findings. If you find a problem in an older revision — even a bug unrelated to static checkers and code-coverage tooling — you still have to reproduce it and submit a fix through the upstream process. That's how bug fixes and changes funnel through the kernel development process: they go upstream first and then into the stable trees. Just a comment to say it's not specific to the static analysis experience.
No, I don't think it's specific to the static analysis findings, but for other kinds of bugs you'll come up with a program that makes the bug reproducible, you'll share that with the community, and then you can fix it and backport it. With static analysis findings, we encounter very many false positives, and these false positives for an older version can't just be transferred to a newer version — we tried that a couple of times, and it's quite tricky. Whereas with a reproducer program it's clear: you have a program, you run it on the old version, you run it on the new version; if it still crashes on the new version, it's still a valid bug. With static analysis findings it's really tricky. We have a comment or question from Stefan; I'll start reading again: "Some answers from my side: no, I don't expect analysis tools to be applied to open source software; in practice those tools do not contribute that much. From the standards' point of view, more important is the development process." Okay, that's interesting. I'm just going to rephrase the statement, and maybe we'll continue the discussion from there. The expectation is: there is no need to use static analysis tools to increase quality — that's what I understood. And the second point: I want a solid development process, and that will provide the quality for me. I actually agree with the first statement in some way. But for me, the question is: what, then, is the definition of a development process that leads to quality? Because others might answer: well, the use of sophisticated tools to find bugs, or bug classes. But I guess Stefan can answer. Was it Stefan? Yes — that's the subject, you're right.
What is a development process that induces quality? And if you refer back to the standards, we're going in loops, because the standard is going to refer again to analysis tools, for example. Wasn't there something in the chat? No. Maybe there are other answers — people who have an expectation? I guess the question goes back to: what is the definition of a good development process? That can vary from person to person. Is it good design, or good implementation, or code reviews, or testing, qualification, integration testing? All of those coupled together make it up — do you have regression tests? What does it mean? I think that is the question. Yes, and of course we're only looking at one aspect here, and all these other aspects have to be addressed as well. I'll read the question out as it was typed: "Static analysis tools reveal not so much; the process requires checks like: have requirements been tested at the functional level, has the architecture not been violated — much of this is about testing and integration testing." Okay, so the focus should be on testing. This is Shuah. With the CWE analysis — I posted a couple of links on the Common Weakness Enumeration you can take a look at — we are looking at this from multiple different angles. We are also looking at testing and fuzzing and how they couple to the tools we have in the kernel right now, to address the top 25 weaknesses published by MITRE; I just posted that link too, it's in the chat. We look at those common weaknesses and ask: do we have a means to make sure each common weakness is addressed in the kernel development process and the kernel itself — do we have a mechanism for it?
So: do we have a detection mechanism for this weakness? And if we do, do we also have a mitigation? Mitigation can mean multiple things. There is runtime mitigation, and there is mitigation during the qualification process: can this detection and mitigation be used during the debugging phase or the product qualification process, so that we make sure the bits we put out don't contain the weaknesses we identified? The other aspect is runtime: what kernel mitigation techniques do we have that can be enabled? Not all of the detection and mitigation mechanisms are applicable to a given product — that's where the product side comes in: can we safely enable this mitigation feature for our product? Once you determine that, you can put detection, qualification-time mitigation, and runtime mitigation together and figure out what we have in the kernel. For example, Lucas just alluded to concurrency issues, and then there are use-after-free type issues. We have mechanisms such as KASAN and KMSAN: we turn on the kernel address and memory sanitizer debug options and then run fuzzing tools on top. We find various problems and we go fix them. Those are the use-after-free type things that show up as number eight in the CWE top 25. That's a good example of how we are going to do this: bringing the kernel development process into it, finding problems and fixing them, addressing all of these mechanisms, and, from the ELISA point of view, providing a guideline for system integrators: hey, these are all the resources available to you.
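As a rough illustration of the detection idea behind the sanitizers (a toy Python model, not how KASAN/KMSAN are actually implemented), one can track allocation state in shadow bookkeeping and trap any access to freed memory — which is CWE-416, use-after-free:

```python
class ShadowHeap:
    """Toy allocator with shadow state: accesses to freed objects are trapped."""
    def __init__(self):
        self.shadow = {}      # object id -> "allocated" or "freed"
        self.next_id = 0

    def alloc(self):
        obj = self.next_id
        self.next_id += 1
        self.shadow[obj] = "allocated"
        return obj

    def free(self, obj):
        self.shadow[obj] = "freed"    # poison the state instead of forgetting it

    def access(self, obj):
        # A real sanitizer checks shadow memory on every load/store.
        if self.shadow.get(obj) != "allocated":
            raise RuntimeError(f"use-after-free detected on object {obj}")
        return obj
```

In the real kernel, fuzzing drives the workload while the sanitizer performs this kind of check on every memory access, turning silent corruption into an immediate, reportable failure.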
Yeah, and as you said, KASAN is one tool that you can use during your integration, or during kernel development. We've also been using fuzzing actively, and again you face the challenge of many false positives — and of course you can't fuzz everything to a sufficiently high level of coverage. So you rely on static analysis and dynamic analysis and various methods adding up, to argue that we have taken a number of mitigations to address a certain bug class. And then, as you said, runtime mitigations are possible as well. Just recently there was new functionality proposed, KFENCE, which is a kind of address sanitizer intended for production use — use within the product — by being a bit more imprecise, but without the performance impact of, let's say, the previous sanitizers. So we're seeing many things coming together to argue that there is some absence of a bug class. But of course: how do you build this together? One of the questions really is how you collect the evidence. What I was suggesting here is one type of evidence: assessing the findings from static analysis tools together, and engaging in collaborative work rather than individual attempts. I might have the outlier position here, because I think the techniques are really valid, but what I also see is that there are some prerequisites that have to be met.
And again, you always have all these shades of gray in these discussions, but on the conservative side: the techniques we're looking at — code analysis and everything else — work quite nicely for code that has more or less one entry point and one exit point and can have a very complex mission in between, like scheduling algorithms. Where I think it gets really hard — and from Arm's microprocessor side we see these challenges — is where you start seeing a tremendous amount of functionality interacting in ways that aren't predictable: everything that's driven off interrupts, where you get an interrupt and then another interrupt, and you need to make sure you have your context under control. The difficult thing when you operate with interrupts is that it's very easy to lose the context. On the architecture side of a processor, a tremendous amount of effort is put into validating context integrity. Once you've lost the context of your processor, you've more or less lost the ability to predict deterministically what happens — it may work, it may not, nobody knows — and the bugs that emerge out of subtle context corruption are very, very difficult to find. I think they are probably outside the scope of what can be achieved with these testing techniques. So the techniques are really valuable for a certain problem space — and I hear words of acknowledgement coming from Lucas — but I don't think you're going to be able to take these techniques alone and argue, for example, the safety capability of an operating system like Linux, because of all these other interaction aspects. With interrupts you in essence have infinite entry points and infinite exit points, because you don't know when they occur, and then you start reaching limits.
Yes, but from my experience of investigating kernel bugs: the bugs you're talking about are certainly the nasty ones — hard to reproduce, hard to even pinpoint and identify. But I think the majority of bugs that happen are simply software bugs where certain contracts of the functions just haven't been fulfilled. You call certain functions in a certain order — you first use get_cpu(), then you use put_cpu(); you don't use something after you free it — and these kinds of bug classes you can address. They have a similar impact on functional correctness. Of course, these tools by no means replace testing. I'd go beyond testing: I think for certain types of bugs you actually need architectural measures; it becomes almost impossible to argue that code analysis, and however much testing you do, will find them. And the question is: can you argue that those cases are residual and you don't care about them? That's where I would say no. Again, due to the nature of my job in the architecture group at Arm, I see a lot of those problems; they're very subtle, and as you rightly said, almost impossible to reproduce, but at the same time we know they exist. We have one comment from Stefan again, responding to my CWE detection-and-mitigation point: "Now you lost me — the list is about security, not safety. Yes, there are overlaps." So that's where we agree on at least one mitigation measure. Yes, you're right: the CWEs tend to be security focused. However, what we are looking at is taking these and identifying the detection and mitigation measures that are relevant to safety.
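The function-contract bugs mentioned above — get_cpu() must always be paired with put_cpu() — are the kind of thing even a very simple runtime checker can enforce. A hypothetical sketch (illustrative names, not kernel code):

```python
class PreemptContract:
    """Toy checker for the get_cpu()/put_cpu() pairing contract."""
    def __init__(self):
        self.depth = 0

    def get_cpu(self):
        self.depth += 1          # models disabling preemption

    def put_cpu(self):
        if self.depth == 0:
            raise AssertionError("put_cpu() without a matching get_cpu()")
        self.depth -= 1

    def balanced(self):
        """True only if every get_cpu() was paired with a put_cpu()."""
        return self.depth == 0
```

Static analyzers and tools like the kernel's lockdep apply the same basic idea — tracking acquire/release pairing — at much larger scale, which is why this class of contract violation is far more tractable than the subtle context-corruption bugs discussed just before.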
There is going to be an overlap. The example we took is use-after-free, which is fairly easy to reason about; there is a clear overlap between safety and security in this respect. Yes — security is always summarized as confidentiality, integrity, and availability. And if you ask where the overlap with safety is: for some safety systems, availability is irrelevant. As Gab showed, if the system doesn't react anymore, you assume an external watchdog that identifies that and leads to a safe state — a general assumption you can often make. And confidentiality is not relevant in the safety context, at least as long as you look at safety in an isolated way. So the overlap is really just the integrity aspect. But the integrity aspect is quite a big class: there are various bugs that would lead to violating the integrity of your application, and all those bug classes you can combine with the security groups' work and apply the same methods security groups might use. I think the big difference is really just that a security group might say: I have the mitigation, and it works — that's fine for me. Whereas in the safety area we are always interested in how we document, and how we provide the evidence, that we did this in a proper way. Which is a specialty of safety. That's a good point about integrity — that's what we are aiming for by combining different tool sets to achieve the goal of integrity and quality. One example of combining tools: take kcov results and funnel them into fuzzing tools to drive further testing.
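One way to picture that coverage-plus-fuzzing coupling is a minimal coverage-guided loop (a toy sketch, not syzkaller or real kcov): mutated inputs are kept in the corpus only when they reach coverage not seen before, so the coverage signal steers the fuzzer deeper into the code.

```python
import random

def coverage_guided_fuzz(target, seeds, rounds=300, rng=None):
    """Minimal coverage-guided fuzzing loop.

    `target(inp)` must return the set of coverage points (e.g. kcov-style
    branch ids) reached by that input.
    """
    rng = rng or random.Random(0)
    corpus = list(seeds)
    seen = set()
    for inp in corpus:
        seen |= target(inp)
    for _ in range(rounds):
        mutated = rng.choice(corpus) + rng.choice("abc")  # trivial mutation
        cov = target(mutated)
        if not cov <= seen:          # reached new coverage: keep the input
            seen |= cov
            corpus.append(mutated)
    return seen
```

With a toy target whose deeper branches require specific prefixes, the loop discovers them incrementally — the same feedback principle that kcov provides to kernel fuzzers.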
So, combining that information, and figuring out the combined power of these two tools to achieve a goal — that's one example. What we're really trying to do is bring together a multitude of different things: architecture, and how we qualify the kernel with detection and mitigation techniques. Are we going to achieve 100%? I don't think so; nobody will say that. But do we get to a place where we can be comfortable with the safety? At least, that's my thought on it. Yeah, right — when I say the absence of a certain bug class, that's not absolute absence; it's having confidence that all the state-of-the-art methods have been applied, and that you understand and can explain to someone else how these methods are effective in your system, and how they in some way complete each other. With that, you have a good argument. That's what we're looking at: static analysis being one piece, testing results another, architectural mitigations another — combining them gives you the puzzle pieces. There was one more comment from Stefan: to what extent have we considered, for example, something like soft lockstep? Soft lockstep has been around for a long time, yes. I was involved in the development of one of the first lockstep systems, which was started in 2007 and then got certified in 2010. The big value of hardware lockstep is really the ability to argue a certain diagnostic coverage for random hardware faults, independent of software. That creates big value; it comes at a cost, but in my opinion this decoupling is the main benefit. Once you start doing soft lockstep...
...you have basically two problems. First, you need to create an architecture that facilitates replica determinism, so your two software replicas don't diverge too far: you continuously keep them coupled to one another, even in the event of random faults occurring. That typically involves some time coupling; the whole world of time-triggered systems feeds on that principle and says: we enable replica determinism and soft lockstep and the other nice features that come along with it. The downside is that once you use software-based lockstep to argue diagnostic coverage for random hardware faults, it gets very, very tedious, because the semiconductor manufacturer produces the hardware and understands how all the gates and everything are connected, but then you have to integrate the operating system and the processes on top until you have an entity big enough that you can actually start demonstrating you have achieved diagnostic coverage. And in a typical automotive supply chain, the integrator at the far end — where enough integration has been achieved that soft lockstep execution becomes possible — has no interest in ascertaining the diagnostic coverage of the underlying hardware. So there is this challenge of only being able to establish that your coverage is given once you've reached a fairly high degree of integration. That might be different in areas with a shorter integration pipeline — in some industrial systems I think it's possible, and in medical systems — but where you have a longer supply chain and a longer supply path, it gets tricky. Still, it's a very good technique, and it adds a whole bunch of benefits.
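The replica-determinism idea can be sketched as a toy software lockstep (illustrative code, which assumes perfectly time-coupled replicas — exactly the hard part in practice): two replicas execute the same step function on the same inputs, and a comparator flags the first divergence so the system can enter its safe state.

```python
def soft_lockstep(replica_a, replica_b, inputs):
    """Run two software replicas in lockstep over `inputs`.

    Returns the index of the first divergent step (where the comparator
    would drive the system to its safe state), or None if they agree.
    """
    for i, x in enumerate(inputs):
        if replica_a(x) != replica_b(x):
            return i
    return None

def healthy(x):
    return 2 * x + 1

def faulty(x):
    # Models a random hardware fault corrupting one computation.
    return 2 * x + 1 if x != 3 else 0
```

The diagnostic-coverage argument discussed above is about showing which hardware faults actually manifest as such a divergence — which is the part that requires knowledge spanning the silicon, the OS, and the application.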
I mean, you have to explain to the integrator, and in some way to the application developer who creates this soft lockstep, what kinds of faults you are considering in the various components; and they have to determine what the impact would be and how they could observe such faults in the application running in lockstep mode. Various initial attempts at that have resulted in very strange plausibility checks. Maybe that was just due to limited effort, but it certainly showed the challenge involved in such systems: communicating fault models in a structured and complete way, so that you can actually build up an argument. And yes, we're facing that challenge quite often with these highly complex systems nowadays. I think that's a good closing — we're almost out of time, three minutes left. In some way we're saying that we are dealing with complex systems, and to Chris's point, it has to span multiple layers: we're thinking about the architecture, the kernel components you pick for that architecture, and even the user space that resides on top, to bring all of this together. So this has been a great discussion. Thank you, Stefan, for your active participation and for bringing your insights — it's been very valuable. I posted the link to our elisa.tech website; please join us. We are continuing our efforts and engaging the kernel community in various ways: we have been engaging the kernel community directly with the static analysis work, bringing in some of the problems we are finding — detection and mitigation type things — and we are also reaching out in forums like this at various conferences. So please be on the lookout for our next mini-conference or our next engagement.
Thank you to all the panelists here — thank you, Lucas, Gab, and Christopher. Thank you. Yeah, thank you. And hopefully we'll see more of the audience in our discussions. Yes, thank you — engage with us, and learn more at the elisa.tech link I posted in the chat channel. Thank you. Thanks. Bye.