All right, welcome back everyone to Automotive Linux Summit Europe, which is part of Embedded Open Source Summit Europe. A warm welcome to Matthias Rosny from KPIT, and he'll talk today about bridging safety gaps in graphics. Take it away. Hello everyone, welcome to my talk. But before we start, a few words about myself and the company I'm currently working at. I am a solution architect, and I have worked for the major part of my career on graphics-related projects. Currently I'm working at KPIT, which is an automotive supplier providing solutions for infotainment, ADAS, electric drivetrains and other things. Okay, today I will give you an introduction into ASIL and the graphics pipeline. I will talk about typical safety gaps in the graphics pipeline, and I will show you how to bridge them for safety-critical use cases such as telltales and the rear-view camera. What is ASIL? It is the abbreviation for Automotive Safety Integrity Level, and it is a system to categorize risks by severity, exposure and controllability. There are four different ASIL levels from A to D, where A is for the lowest and D for the highest risk, and there is also one off-scale level which is called QM, which is for risks which are so low that the normal quality management measures are sufficient, so nothing special is needed. For each of the ASIL levels from A to D, ISO 26262 gives technical recommendations and recommendations for the development process. A technical recommendation for ASIL D, for example, would be to have parallel independent redundancy, but for the lower ASIL levels like A and B that is not recommended anymore. There is another recommendation, for example: recursion should be avoided in general for all ASIL levels, because recursion is a possible source of stack overflows. This ISO norm also gives recommendations for the development process, testing strategy and other things, and if you want to read a little bit more in detail about that, I can recommend the document down here. That is
a very good summary of the ISO norm. So how can we categorize a risk? That is done by the three items: severity, exposure and controllability. Severity means: what could happen in the worst case? Is it death, is it injury, is it only material damage? The exposure is: how often is the risk relevant? Please do not mix that up with the probability of a failure; that is something different. Exposure here means how often we are in a situation where a certain bad thing could happen. And the last point is the controllability: can the driver detect the failure and handle things in a safe way? I will give you some examples. The electric steering: the severity is very high, because if the electric steering does not work as expected, you might cause a crash, and if you're driving fast the crash might be fatal. The exposure is also very high, because the electric steering is needed permanently while driving. And the controllability is low, because the electric motor of the electric steering is usually stronger than your arms, so as long as you are not a bodybuilder, you have no chance to control the situation. Because of the high severity, the high exposure and the low controllability, that has the highest possible ASIL level, which is ASIL D. Another example: the air conditioning. It does not sound dangerous, but let's analyze it. What could happen in the worst case? If it gets too hot in the car, then you might lose consciousness and you might end up in a fatal crash.
So the severity is also very high here. The exposure is medium; in some countries you need it and in others you don't. But the controllability is very high: if it gets too hot in the car, you can open the window; if it's still too hot, you can stop driving and get out of the car; and if you are feeling really bad, you can call an ambulance. So it's easily controllable, and therefore the air conditioning has only ASIL A. And another example: what if the airbag does not open? Of course the severity is very high, because if it does not open and you have a crash, you will probably die. The controllability is basically non-existent, because if the airbag does not open, there is nothing you can do against it. But the exposure is very low: most people never need an airbag in their whole life. So this risk would normally be ASIL A. But we did not consider something: it could happen that the airbag opens without a reason. Let's analyze this case. If an airbag pops up in front of you, you will probably be shocked and maybe, or very probably, cause a crash. So the severity is also very high. The exposure is very high, because it could happen at any time while driving. And the controllability is low, because it's very improbable that you will react properly when such a thing happens. Therefore this risk is categorized as ASIL D. We will not talk about specific ASIL levels later in this talk, but it is important that you understand: if you want to reduce a risk, you have three options. You can reduce the severity, you can reduce the exposure, or you can increase the controllability. The severity and exposure usually cannot be changed, but in many cases it's possible to change the controllability, and that will become important in the graphics-related use cases which I will talk about later. Okay, now let's talk a little bit about the graphics pipeline. So how does any graphics output usually work?
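As a brief aside before the pipeline: the severity/exposure/controllability reasoning walked through above can be sketched in a few lines. The ratings assigned to each example below are my reading of the talk, and the summation trick is a commonly cited equivalent formulation of the ISO 26262 determination table, not something stated in the talk itself.

```python
# Illustrative sketch of ASIL classification from the three factors above.
# Ratings (higher = worse): severity S in 1..3, exposure E in 1..4,
# controllability C in 1..3. Summing the three ratings reproduces the
# ISO 26262 ASIL determination table: 10 -> D, 9 -> C, 8 -> B, 7 -> A,
# anything lower only needs normal quality management (QM).

def asil_level(s: int, e: int, c: int) -> str:
    assert 1 <= s <= 3 and 1 <= e <= 4 and 1 <= c <= 3
    return {7: "A", 8: "B", 9: "C", 10: "D"}.get(s + e + c, "QM")

# Examples from the talk (my assumed ratings):
print(asil_level(3, 4, 3))  # electric steering: S high, E high, C low -> D
print(asil_level(3, 3, 1))  # air conditioning: easily controllable    -> A
print(asil_level(3, 1, 3))  # airbag fails to open: exposure very low  -> A
print(asil_level(3, 4, 3))  # airbag opens without reason              -> D
```

Note how the three risk-reduction options map directly onto the three parameters: lowering any one rating by one step lowers the resulting level by one step.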
Of course there are some exceptions, but here I present the standard workflow of how graphics is put onto the screen. First we draw the content into the framebuffer of a window; the framebuffer is just a rectangular area of pixels. Then the multiple windows for each screen are merged; this process is called compositing. And finally we show the result on the screen. So how does it look in detail? In this example here we have two applications, and both applications use a hardware-accelerated rendering API, for example OpenGL or Vulkan, which will implicitly use the GPU to draw the application content into a window framebuffer, which is sent to the compositor; in the case of Linux that is usually Weston. Application number two does the same, and Weston can use OpenGL to merge all the windows of a specific screen, which will use the GPU. Then the screen framebuffer is read out by the display processing unit and sent to the display. So that is how the graphics output usually works. Probably you have noticed that some boxes here are green: those are components which are either ASIL compliant or can be developed in an ASIL-compliant way. For example, the display processing unit of most SoCs is already ASIL compliant, ASIL-compliant displays are available on the market, and you can develop your own applications in an ASIL-compliant way. But there are a couple of gray boxes here, and they are a kind of safety risk. For example, the graphics processing unit is usually not ASIL compliant. There is one exception as far as I know: NVIDIA has ASIL-compliant GPUs, and also the corresponding drivers are, but that's not the case for all manufacturers. And also Weston, as far as I know, is not ASIL certified at all until now. So we need to build a safety bridge from the applications to the display processing unit. But before I talk about that, I will also show you a possible shortcut: in some scenarios it is possible that Weston uses the display processing unit directly to merge multiple
windows. But there is a technical limitation: the number of windows that can be merged by the display processing unit is limited, and if the number of software windows exceeds the number of so-called hardware planes on the display processing unit, Weston will choose a hybrid compositing approach, where some windows are merged with OpenGL and the GPU, and some other windows are merged by using the display processing unit. However, for our safety considerations we must consider the worst case. That means we have to go through all the boxes here, and I will show you how to make this thing safe. The display processing units of modern SoCs are able to calculate checksums of certain areas on the screen. So the telltale application knows the CRCs of the telltale icons. Then the telltale icons are rendered into the framebuffer of the application; that can be done with OpenGL, Vulkan or some other technology. Then the windows are merged by the compositor: for example, we have the telltales, we have the speedometer and other windows, and they can be put together by using Weston. Then the display processing unit can calculate the CRC of the region on the screen where the telltale should be shown. That means for each telltale we have a so-called region of interest, for which the display processing unit can calculate the CRC, and then the CRCs of all the telltales are sent to the telltale application. If the CRCs match, everything is probably fine. If the CRCs do not match, then we have a problem. And if we detected a problem, then we have three options: we can try to recover, we can use a fallback, or we can bring the system to a safe state. A possible recovery option would be, for example, to reboot the system and hope that the error does not occur anymore after that. If the error still occurs, we can use a fallback solution: for example, we could display the telltale on another screen, or print a text message on another screen. Many cars even have physical
telltales as a fallback solution. So they do not only have the display, but below that there are usually some physical telltales which can be used if the nice display telltales are not available for some reason. Okay, what about the camera? Here things are a little bit more complex. Here we cannot use the one-size-fits-all solution that we have for the telltales, because of several problems. Of course we could calculate the CRC of the whole camera image, but that will not work if we add an overlay: if we add an overlay, it will change the CRC. The same happens if we scale the camera image; it will lead to a different CRC. So in the case of the camera we have to analyze each risk separately and find a solution for it. So which risks do we have? We have the risk that the image on the screen is not updated, meaning the camera image is frozen. A possible solution: if the CRC of the screen rectangle of the camera image does not change within a certain amount of time, for example a quarter of a second, the system must react accordingly. React accordingly again means either try to recover, try to get a fallback solution, or bring the system to a safe state. Of course we can try to reboot the system, but if the error still occurs we need a fallback. In many cars we have multiple displays: we have the central display, and we have a display in front of the driver with the instrument cluster. And of course, if the rear-view camera, for example on the central display, does not work properly, then we can display it on the instrument cluster display. But not all cars have an instrument cluster display; some have a traditional instrument cluster with physical needles. Also, we must find a solution for what happens if the image is frozen on the other screen as well. So that means the last option is to bring the system to a safe state, and a safe state in the case of the camera use cases can be just to turn off the displays. Because if the display is switched off, then the driver will see that
something is wrong with the camera. Then the driver can use the mirror, turn the head, or ask another person for help to drive backwards safely. Okay, which other risks do we have? We could have obvious garbage on the screen, but there we do not need a technical solution, because the driver will recognize the error and react properly. So now let's assume the following use case: we have a rear-view camera and nothing behind the car. Let's drive backwards. Oops, there was a child behind the car, as the image was cropped. So how can we detect that? We do not have a solution for that until now. And there is also the risk that the image could be delayed. So how can we detect cropped images? Assuming that we have some kind of overlay on the screen, we cannot take the CRC of the whole image, but we can use the CRCs of two diagonal corner pixels. We just define a one-by-one pixel region of interest on the display processing unit, for which the CRCs are reported, and we also calculate the CRCs of those single pixels in the camera application. If the CRCs match, then the image is probably not cropped; if the CRCs do not match, then the image is cropped or broken by some other means. So here is the data flow again: the camera application calculates the CRCs of the incoming camera image pixels in the diagonal corners, the image goes through the graphics pipeline, and the display processing unit reports exactly those two regions of interest back to the camera application. If the CRCs do not match, then we again have these three options: try to recover, which is a reboot in most cases; try to use a fallback solution; and if that is not available or not working as well, we can bring the system to a safe state. But there is a problem when we try to verify scaled images. The red grid represents the pixels on the screen, and below that you see the pixels of the camera image, in order to calculate the color of the top-left screen pixel, which is used for the CRC.
We see that there are four underlying camera image pixels. When the camera image is passed through the GPU to be drawn onto the screen, the GPU does bilinear filtering, and bilinear filtering is basically a weighted average of all the pixels which contribute to the corresponding screen pixel. We do not know how precisely the GPU calculates the weights of the four pixels. As you see, this pixel of the camera image has the highest weight, this one and this one have a lower weight, and this one has the lowest weight. We do not know how precisely that is done by the GPU, so the color value of the screen pixel is not deterministic. And there is another problem: in the past, some GPUs have cheated. That means if they see that a certain input pixel has only a low contribution to a certain screen output pixel, then that input pixel is ignored completely, so the color value of the screen pixel is not deterministic at all anymore. So how can we solve that? A possible solution is to calculate the average of the two-by-two pixel regions in the corners and write back the same value to all four pixels. That is only a minor change of the image that will not be visible to the driver.
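A minimal sketch of this averaging step, in plain Python for illustration (a frame is modeled here as a list of rows of RGB tuples; in a real system this would run inside the ASIL-compliant camera application, before the frame enters the GPU path):

```python
# Replace each corner block of the frame with its average color, so that
# bilinear filtering of those pixels gives the same result no matter how
# the GPU weights (or even drops) the contributing samples.

def flatten_corners(frame, block=2):
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]  # copy each row so the input stays intact
    corners = [(0, 0), (0, w - block), (h - block, 0), (h - block, w - block)]
    for y0, x0 in corners:
        pixels = [out[y][x] for y in range(y0, y0 + block)
                            for x in range(x0, x0 + block)]
        n = len(pixels)
        avg = tuple(round(sum(p[c] for p in pixels) / n) for c in range(3))
        for y in range(y0, y0 + block):
            for x in range(x0, x0 + block):
                out[y][x] = avg  # every pixel in the corner block shares one color
    return out
```

A weighted average of four identical values is that value again, whatever the weights, which is exactly why the on-screen corner color becomes deterministic and can be CRC-checked.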
That is not relevant for safety, because it's only in the corners, and this change will be very, very minimal, so we can use that small trick. If we apply the bilinear filtering now, we see that all the underlying camera image pixels of the screen pixel in the corners have the same color. And if we calculate the weighted average of four exactly equal values, even if the weights are not calculated properly, the result will always be the same color. So we have managed to make the color in the corner deterministic. We have found a solution for three of the four risks. What is still open: we must detect if the camera image is delayed. But before we go to that, let's think again about the image-is-cropped risk. We have a small problem with the timing of the CRCs. When the camera image comes in to the camera application, it will take some time until the new image is visible on the screen: it has to be rendered, it has to be composited with other windows, and then it's finally shown on the screen. So there is a certain time interval where the CRC which is known to the camera application and the CRC which is reported by the display processing unit do not match. Then, when frame number one is shown on the screen, there is a short time interval where the CRCs match, and then they do not match again. But things can become even worse: if we have a complex graphics pipeline, it could happen that the processing of frame number one takes longer than receiving frame number two from the camera, which would result in CRCs which never match. In order to solve the problem, I suggest the following solution. We can have a list of pairs of the corner CRCs and the corresponding validity time, and for each set of CRCs which is received from the display processing unit, we delete all outdated CRCs from the list and then check if the received CRCs are in the remaining list. If not, we trigger a safety warning, because it means the screen content is either
wrong or outdated. We wanted to detect delayed images, so maybe we can use that algorithm for that as well. Let's see. First, let's think about where delays could occur. Delays could occur between the physical camera and the camera application, and between the camera application and the display processing unit. Let's think about the first delay. If we produce a camera, then we can develop it according to the ASIL recommendations; we can make sure in the source code of the driver that the frames are never queued anywhere, so it is not possible to have a delay here if everything is done properly. But delays can occur between the camera application and the display processing unit, and we must be able to detect them. So can we use the previous algorithm for that? Let's evaluate it. Obviously the algorithm would work to detect delayed corner pixels. But what if only the center part of the camera image is changing and the corners stay the same? Then it would not work. But here the noise of the camera sensor might be handy: even if the real corner content does not change, the camera will send different corner colors for almost each frame. So we can use the combination of two, or even four, corner pixels as a kind of identifier for a certain frame. Now let's have a look at the mathematics. Assuming that we have a low-noise camera, where each of R, G and B alternates only between two adjacent color values in the range between 0 and 255, we have two to the power of three, that is eight different color values per corner. If we use two corners, we have eight to the power of two, that is 64 combinations. If we use all four corners, we have eight to the power of four, that is 4096 combinations. If the list of pairs from the previous algorithm has four items, the probability that a delayed combination on the screen is in the validity list by accident is only one divided by 1024. Four frames later, the probability is approximately one divided by a million, and another four
frames later, the probability is approximately one divided by a billion. So we are able to detect delayed frames quite reliably. In this case, having a low-noise camera is a kind of worst case, because the more noise we have, the more combinations we have here; a noisy camera would increase the range of possible IDs for the frame. And even if the camera has no noise at all, we can still apply the same algorithm, because the camera application can inject some noise only into the corner pixels, so that the frames can still be identified. It is basically the same thing as we had when we calculated the 2x2 average in the corners: it is only a minor change to the camera image, nobody will notice, and it's not safety relevant, so we can do that safely. So we have found solutions for all the risks that we identified. If the image on the screen is frozen, we can check if the CRC changes or not, and if it does not change, then the system must react accordingly. If we have obvious garbage on the screen, we don't need a technical solution, because the driver will recognize the error. If the image is cropped, we can check the CRCs of single pixels in the diagonal corners. If the camera image is scaled intentionally, we can replace a block of pixels in the corners with their average to get predictable filtering results. And we can use the same algorithm also to detect delayed images, but in this case we should use all four instead of only two corners, and maybe we can add some more reference pixels. So we have successfully built a safety bridge across the unsafe components in the graphics pipeline, and that means we can use Linux also in many safety-critical scenarios. That is the end of my talk. Thank you. Do you have any questions?
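Before the questions, a small sketch of the list-of-pairs check described earlier. Class and method names and the 0.25-second validity window are illustrative assumptions; the talk does not prescribe a concrete API.

```python
import time
from collections import deque

# The camera application remembers the corner CRCs of the last few submitted
# frames together with an expiry time. A CRC set reported back by the display
# processing unit is accepted if it matches any entry that has not yet expired.

class CornerCrcMonitor:
    def __init__(self, validity_s=0.25, max_entries=4):
        self.validity_s = validity_s
        self.entries = deque(maxlen=max_entries)  # (crcs, expires_at), oldest first

    def frame_submitted(self, crcs, now=None):
        """Record the corner CRCs of a frame handed to the graphics pipeline."""
        now = time.monotonic() if now is None else now
        self.entries.append((tuple(crcs), now + self.validity_s))

    def dpu_reported(self, crcs, now=None):
        """Return True if the DPU-reported CRCs match a still-valid frame."""
        now = time.monotonic() if now is None else now
        while self.entries and self.entries[0][1] < now:
            self.entries.popleft()  # delete outdated entries, as in the talk
        return any(entry == tuple(crcs) for entry, _ in self.entries)
```

On a False result the application would trigger the safety reaction from the talk: try to recover, fall back to another display, or bring the system to a safe state.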
Yep, wait for the microphone. Thanks for the talk, it was a very interesting one. One question I have: implementing those algorithms would mean that you need, let's say, direct access to the output of the camera sensor, where you can then implement the checking of the CRCs and so on. Do you have any experience with using any additional graphics stacks, like for example GStreamer? Is there any support for checking the checksums, or would one have to write a custom GStreamer plugin, for example? Okay, so I'll just repeat the question in my own words and show the corresponding slide. Here it is. So your question was whether we can calculate the CRCs somewhere here in between, or how directly the camera application can access the physical camera, and whether it is still possible to capture the CRCs if you use GStreamer in between. Yeah, okay. That is unfortunately not possible, so we need direct access to the camera driver, so that other components which are not considered to be safe are avoided in this case. Yeah, okay. Thanks. Thank you for the presentation. I have a question regarding potential errors on the camera pipeline side. Camera pipelines like to reuse the same buffers, and I can imagine a scenario where everything looks really good, but the camera does not actually fill in the recycled buffers. So you may end up, let's say, having six or eight buffers of a past camera state just cycling through the pipeline, visually creating what looks like an updating image, but just cycling those, let's say, eight buffers. How do you deal with this type of problem? Is this a simplified way of dealing with it, and do you actually have to do tracking per buffer on those CRCs, looking up that, you know, a particular CRC does not keep repeating? I'll go to the corresponding slide first. Okay, I think that was one too far. Okay.
Here it is. So you mean that the camera itself fills the buffers in a wrong way? Exactly. So basically you could queue the buffers back to the camera saying, hey, please give me a new buffer, but the camera just keeps the old content. So if you don't zero out the memory, you will possibly get back... Yeah, that is true. So the memory must be zeroed out. Yeah, so you zero it. Okay, thank you so much. So just one question here: you are asking the DPU to give you the CRCs for different corners based on different situations. Exactly, yeah. So that means you need to tell the DPU to give you that data. How do you ensure that the DPU has received that request and is sending you the CRC of that particular part that you requested? So that is just a matter of the corresponding DPU driver. There must be some communication between the application and the driver, and then the DPU will send the corresponding data. Yeah, but there can be some delay, and the DPU might send you a delayed CRC. That should not happen, because the DPUs of modern SoCs are designed in an ASIL-compliant way, and even if there is a delay, then the algorithm that you're seeing here would detect it. Of course it would be a false positive, but we would not come into an unsafe situation. Right, okay, thanks. The gray bit in the middle: are there people who want to safety-certify that as well, or is there just no reason to do it? Pardon? Like, say, Wayland. So, all of the gray bit in the middle... I'll just go to the slide so that everybody sees what we're talking about. Yeah, we can use that example. Is there, I don't know, an end goal to actually try and safety-certify the other bits in the middle, or is there just no need? Like, for someone to come along and try and certify Wayland. So you mean that Weston and the other boxes here might become green in the future? Yeah, that's the question. Okay. And is there a need for it?
Yeah, so there's not really a need. Of course it would be good to have it in an ASIL-compliant way, because then we might not need so many DPU-based checks as I have presented right now. But as far as I know, Weston is not ASIL compliant, and also, if I have a look at the Wayland API, I have some doubts whether it's a good idea to certify it, because in my opinion it's way too complicated. For safety-critical stuff, APIs should be intuitive to use, so that the risk of using the APIs in the wrong way is pretty low. But in my opinion that's not the case for Wayland, so I think Weston will not be ASIL certified any time soon. And the other two boxes, the GPU and the corresponding drivers: they are ASIL certified for NVIDIA, as far as I know, but for other manufacturers that's not the case. I think other manufacturers will follow, so those boxes will become green in the near future as well. What will happen with Weston, or Wayland in general, I'm not sure. Anyway, we can build a bridge over the unsafe components, so we do not rely on them. Even if they are not safe, we can still make a safe application. Thank you. Thanks for the talk, a great overview and approaches. First, one note about the air conditioning: what if it's cold outside? If it's cold outside, then you will probably not lose consciousness; that's not a safety risk. Yes, but if it's minus 30 or 40, then it might be a pretty life-threatening risk. But the question itself is about the numbers you mentioned, the probabilities that we need for errors, right? One over 1024, and so on. How do you define this border, and who actually defines which probability is acceptable there?
There are some tables on the internet which define which failure rates are acceptable for a specific ASIL level. Okay, so that's defined by the ISO, right, by the standards or something like that? I'm not sure if it's a part of the ISO itself, but there are publicly available recommendations on which failure rates are acceptable. Okay, thank you. And the CRCs are passed back to the application in hardware, or is this part of the driver? So in the end, the application must communicate with the driver, yeah. So the parts of the driver need to be certified then. Yeah, exactly. Okay, and the telltale application: it's run by the scheduler, I mean, the operating system also needs to be certified. Exactly. So basically we have three options. One option is that the SoC has a special safety core; that is the case for many SoCs in the meantime. That means the telltale application can run partially or completely on the safety core. We can use virtualization, meaning we have a hypervisor, the main system runs on Linux, and the safety-critical part runs on a safe RTOS, for example. And the third option will become relevant in the near future: as far as I know, the Linux kernel itself is going to be ASIL certified by some companies, and then we can even run safety-critical things on the Linux kernel. Okay, if there are no more questions, I think we can go to lunch a few minutes earlier. Thank you.