Hello, everyone. I'm Igor Stoppa, I work for NVIDIA, and today I want to talk about safety for the Linux kernel. There will be a brief introduction, and then I want to talk about a wish for better isolation within the kernel. We will see how we could create different contexts within the kernel and how to use them, and I will try to draw some conclusions and then show what I'm planning to do next. So, this is an experiment I'm running. It's not finalized yet, but I thought I could share what I've found so far. It's about protecting the kernel from self-interference. What I hope you can take away from this talk are ideas for how to cope with interference, how to choose what to care about and what not, and possibly some ideas about how you could improve your hardware, or ask your hardware vendor to improve it. So, everyone wants Linux. Unfortunately, Linux is not the best choice at the moment for safety, primarily because it's a monolithic kernel. What can be done currently is to try to detect internal interference, but there is little that can be done at the moment to prevent it, or at least to limit its effect. What do you typically find in a safe system that includes Linux? You typically find full redundancy, meaning that you have multiple systems running simultaneously. That is of course expensive, so we would like to avoid it unless we really cannot do without it. You have a safe monitor, which is some sort of small system that we trust to be very reliable, and which is in charge of putting the whole, bigger Linux system back into a safe state should anything bad happen. That is something you probably cannot do without. Then you can find hardware message passing.
You have some safe applications, and one way to ensure that there is no undetected corruption is that these applications can talk to each other, or to something outside the system, using error detection or correction — a checksum or something similar that can pinpoint whether there has been damage of some sort to the information being exchanged. One technique that is used is to implement user-space device drivers. They are safer, for reasons we will see later, but that's not really Linux: Linux mostly has kernel device drivers, and it would be nice to use those. As I mentioned, we can also have some level of self-monitoring in the kernel, meaning that we can define, for certain subsystems, a sequence of states and how those subsystems must evolve from one state to the next, and we can try to verify that this sequence, this graph, is correctly executed. That is useful in some use cases; the problem, mostly, is that one has to incur the penalty of describing these states and then implementing them. So, can we somehow do better? What I'm thinking is, mostly: let's drop the full redundancy, at least in most cases — I will come back later to why I think this can be done. The safe monitor, I don't think we can do without, but that's okay. Hardware message passing is also part of the design, and I think it's an acceptable requirement for the applications. User-space device drivers: let's try not to use them, or at least to have a very good justification when we do. In-kernel safety monitoring is certainly useful; maybe we can focus on identifying which cases we want to support. And then, what I'm trying to propose is to introduce in-kernel hardware barriers, so that we can have multiple contexts within the kernel, and the idea is that this can help with the points I just listed. Well, why do we want barriers? First, let's have a look at terminology.
Usually, in safety, you talk about the safe part of the system because it has gone through an evaluation which goes beyond the implementation: there is a set of quality criteria and analyses that are applied to the component. QM, instead, is just "quality managed", and what it means is that the code is developed properly, but it's not supposed to be involved in any safe use case. So, when we have a Linux system, from a very high-level view we have applications, which can be safe or not, and then we have the kernel. Applications might talk to each other through a shared buffer, for example, and applications perform system calls. System calls are nothing else than the means an application has to ask the kernel to do something the application would not be able to do itself. There is the safe monitor I mentioned earlier, and you can think about it as a watchdog in its simplest implementation: either the safe application fails to notify the safe monitor that everything is well, or the safe application detects a problem and asks the safe monitor to do something, and the safe monitor will, for example, power off or reset — it really depends on what is considered the safe state for that use case. If we look inside the kernel, every application has a counterpart which is a kernel thread, and this kernel thread might tap into device drivers, some of which might be related to safe use cases, some not. You always have a kernel thread as a counterpart of the safe application. Then you have — I don't want to call it miscellaneous — the core of the kernel: the part which deals with memory management, the file system, basically the housekeeping that keeps the system up and running. And then you can have, or you should have, these internal monitors. Notice that an internal monitor also talks to the safe monitor outside, because the idea is that the internal monitor is like a counterpart of the safe application from within the kernel.
Now, we normally have a memory management unit, the MMU, which provides some level of isolation. The purpose is that, for example, if you have a bug or something wrong in one application, it cannot really leave its sandbox. The problem with the same event inside the kernel is that it can go and cause trouble all over the place, because everything is within the same sandbox. We just expect that one of the safety mechanisms we discussed will either detect the error, or will fail to send the keep-alive to the watchdog, and therefore the monitor will put the system into a safe state — provided, of course, that the system has been designed correctly. But again, can we do better? It would be nice if we could have some asynchronous way of detecting that something is wrong, and even better if we could prevent it from happening, so that we wouldn't even have to rely on the safe monitor to put the system into a safe state. The gain there is, first of all, better resilience in the system; and because we can define events which are considered "don't care" from a safety perspective, we get a simpler system to analyze from a safety standpoint — and that's really a big thing, I think. So, this is just my opinion: there are two possible solutions. Option A is the use of more advanced hardware functionality, like the Memory Tagging Extension (MTE) from Arm, or CHERI/Morello. These options are still fairly new: they require special toolchains, they're not exactly cheap — or at least not as cheap as not having them — and they also need tagging logic, because tagging is useless unless I know how to apply the tags. Option B is to use the MMU. The MMU is present in basically every Linux system; it's already there. It might be slower, but for this exercise I'm not treating speed as a paramount parameter — I'd rather evaluate the effects later on. It also requires some coloring and, guess what, it's more or less the same logic, I think.
So, what I'm referring to when I say coloring is a way to classify memory allocations based on their intended content, and the idea is that I can then apply different properties so that, for example, QM code cannot go and alter or compromise safe data. We want to come up with a strategy for tagging — for coloring — our system, try to understand the trade-offs between the granularity of this coloring and the performance, and possibly come up with an idea of which hardware features might improve the situation. For this I went for the MMU, simply because it was already available: I could try everything with QEMU, and it was just easier. As I said, the isolation mechanism is more or less irrelevant from the coloring perspective; MMU granularity can be seen as a sort of special case of MTE, which typically has a finer grain. Also, when it comes to partitioning, I'm trying to keep it simple and not overdo it. So, if we go back to the system I described earlier, what we would like to have — at least what I would like to have — is a safe context, which is what is associated with the safe application; a core context, which supports the overall function of the system; and a QM context. The idea is that QM can affect neither core nor safe, and that core is also limited in what it can do to the safe context. As I said, I'm trying to keep it simple, so three partitions, for those reasons — but nothing would prevent having more, other than it being more coding. And this is what we would like to have in an ideal situation: limited interaction between these contexts, where for example the safe context might affect the QM context, but certainly the QM context cannot affect the safe context. Now, someone might be wondering: is this a microkernel? Maybe not — or at least not in my mind.
What I mean is that this is not designed to be an always-on feature; it's something that most Linux users would probably not even care about unless they have a safe application, and it wouldn't even have to be a fully-fledged microkernel. It would just try to protect those portions of the system whose corruption cannot easily be detected otherwise, or which have particular relevance in specific use cases. Furthermore, it might look like a microkernel because I'm using the MMU in its most basic implementation; if you had something like hardware tagging, it would probably be more transparent, and I will get back to this later. So, just to clarify the scope of these partitions: core does not belong to the typical partitioning one might have around a safe application, but on the other hand the idea is to not change Linux too much, and Linux has so much functionality which belongs to core that I thought it might be preferable to start with this approach. The reason for core existing, and not being part of safe, is that safe might need additional protection even from core. One question that can arise — and I've seen it arise when I was doing security, not safety — is: well, can't we just fix all the code? At least for safety, I don't think that's a good enough answer, because the content of the core context is more or less well defined, but what might constitute safe or QM really depends on the application. In one application a certain device driver might be QM; in another it might be safe. So it's not really possible to hard-code it. And the expectation is that one will invest resources in the safe context; therefore we would really like to have the maximum return on investment in what we do, especially if you think about having to revalidate a code base that changed because of a kernel update.
So, I'm sorry for this, but just to put everyone on the same page — I will try to make it as painless as possible. I'm using the MMU, and the MMU is a hardware component which, from this perspective, provides a few key functionalities: process isolation, meaning that one process cannot interfere with another; kernel-to-process isolation, meaning a process cannot interfere with the kernel other than by, for example, raising a system call; virtual-to-physical mapping and scatter-gather functionality, which I will explain better later; and enforcement of attributes like write protection, execution properties, and access in general. The MMU works at a granularity measured in pages. A page is nothing more than a chunk of memory of a certain size — partly configurable, partly implementation-dependent — aligned to that size. All the memory locations within the same page have the same properties. Here is an example of what the MMU does: on the left-hand side of the screen you have the virtual memory, where a memory region looks contiguous but is in practice mapped to non-contiguous memory on the physical side; or you might have a physically contiguous chunk of memory which is mapped to different virtual addresses. The MMU can do this for you. Looking at it from a slightly different angle: since I mentioned this translation process, the translation happens through the page tables, which are a way of implementing translation rules. A page table is nothing more than a sparse tree, where each node of the tree is a page in itself, and it also encodes properties like: can I execute it, can I write it, can I read it. Linux supports up to five-level page tables at the moment; not all hardware does, and Linux has means to collapse them. Going back to the MMU: since it performs this translation, it contains a cache, which is the translation lookaside buffer, or TLB.
In the perfect case we have a cache hit and everything goes fine. If we are not so lucky, we get a page table walk, meaning that the MMU does not find the translation rule inside its own cache and needs to perform several accesses to the page table to figure out how to map the virtual address to the physical address. This means, for example, that if I can map a group of physically contiguous pages with a single entry, it occupies fewer slots in the TLB, which is nice; if I am doing scatter-gather, I will use more slots. This is called TLB pressure, and there isn't much that can be done about it, because a while after the system has booted, memory gets fragmented pretty badly. Some implementations of the MMU allow, for example, locking down certain entries so that they cannot be overwritten: if I know that a translation is going to be used very often, I can prevent it from being evicted from the cache. In a full implementation of a page table you have five levels — this is how they are named, and the right-hand column shows the addressable address space from within each node of the tree. In practice, if you look at how a translation from virtual to physical happens — what is called the page table walk — you start from the reference to a physical page in memory, and then you have the virtual address: you use a combination of the physical address, for knowing the page, and a piece of the virtual address, which gives the offset within the page. There you find another entry which points to another page, and so forth, until you get to the final location you wanted to reach. That's the page table walk. The Linux kernel already uses mapping properties to prevent overwriting of, for example, code and constant data. What is left to protect — what we should care about — is writable data, the data which is not already configured to be read-only.
If we look at this from yet another angle, we can see the page table, and if we consider the contexts I mentioned earlier — safe, QM, core — their pages mostly end up interleaved, which is not very nice. We would like something where they are grouped into subtrees, because at that point we could, for example, completely remove one subtree from a context that should not access it: if I remove the subtree which maps the safe memory from the QM context, the QM context will never be able to reach those pages. And the same can be applied to a portion of a subtree. That slide is just a representation of what I said, and these are the three maps that ideally we would use; as you can see, I have marked with white arrows the fact that there are three different roots, PGD1, 2 and 3, because that's how you can represent the different page tables. Unfortunately, the kernel already defines its own memory map, with regions for init code, initialized data, uninitialized data, address ranges reserved for runtime allocations, and so on. So we cannot do it exactly in the way I showed, but what we can do is apply that partitioning to each of these sections. As I mentioned, you can see the allocations as link-time and runtime: link-time allocations are those whose size is known at the moment of linking, when the binary is created, and runtime allocations are what happens while the system is running. If we look at link-time allocations — for example the BSS, which is uninitialized data, and then the data section — we can partition those as well, and if we define the partitions as aligned to a multiple of one PMD, we can create subtrees inside them. The kernel uses a linker script, and this is how to introduce additional sections in it. That's the kind-of-easy part; the not-so-easy part is to populate a section once it has been defined. There are two methods. One is the obvious one: I go and tag all the data that I want to put in a certain section accordingly. But as I said, the very same data might be safe in one case and QM in another, so I don't want to go and patch it. The alternative method I came up with is that, while processing the object files, I can rename, prior to linking, the sections used inside each object file: for example, I can have a list of object files which need to be mapped into safe, and I can rename their BSS and data sections as shown, and I can do the same for the QM sections. I'm not sure the method always works — I don't know whether I will find some counterexample in the code base where it's not enough; we'll see. For runtime allocations, I haven't done kmalloc yet; I think we could probably try to allocate from different slabs, but again, I have to try it. For vmalloc, finally, the good news: all the vmalloc allocations take a range, with a start and an end — usually this range is vmalloc start to vmalloc end — but if we define subranges like core vmalloc start and core vmalloc end, we can obtain for vmalloc this sort of nice partition, which then allows us, for the metadata — I painted it as core — to do something really similar to what I showed earlier, where you can just prune a subtree. Now, the question is: wasn't this enough?
No — sadly not, as I found out. For example, vfree is most of the time performed exactly when it's invoked, but in some cases — for example if you are inside interrupt context — vfree will just queue the memory which needs to be freed, and at that point it tries to create a linked list out of memory which had been allocated for something else. I could have started special-casing all of this, but I preferred to introduce a full memory map, meaning that at some point I accept the fact that all the memory is mapped; what I try to do is keep this map active for a very short period of time, and that's how it looks. One of the advantages of this approach is that I always have a system which can be used, can run, and can be verified, so I can always validate that what I'm doing is not crashing. It can also help with performance. In the end it's a choice, and I think each one who is creating a safe system needs to decide whether it's worth converting something to the full map or not. Of course, nothing is free, and all this playing with the page tables has a cost: the TLB needs to be flushed, and it will be flushed very often. Performance might not suffer much, but it depends on how much you rely on fast context switches: if your application mostly lives in user space, you would probably not even notice it. It depends. That's where hardware tagging might help, because in that case the backing implementation would not require multiple page tables — only one — and there would not be all this TLB thrashing. But now that we have all these contexts, what do we do with them? This is how the boot normally looks, and this is how I'm planning to change it: instead of having just one boot sequence, we would have what you can think of as three separate boot sequences, one for each context, which cannot interfere with each other. And once you have three shells, one for each context, in practice it's like having three different user spaces. Of course, all of this needs to be implemented in a way that can be backward compatible; as I said, I don't want to permanently turn Linux into a microkernel. One problem is the stack. Every thread has a stack, and the stack would be a really nice candidate for all this hiding, but — at least on arm64 — while I was trying to make it work, I realized that the scheduler needs to have access to the stacks of both the previous thread that was running and the next one, and sometimes it just happens that they do not belong to the same context. So how else can we protect them? Or rather: the problem with the stack, and the reason why I'm proposing to protect it, is that it's very difficult to detect when it becomes corrupted — and that can happen to your safe context too. A first step is to switch to vmap stacks. This is already supported by the kernel — I had to increase the stack size a bit, but overall it's supported. It means that you go from a normal pool of stacks, used as threads are created and destroyed, to stacks allocated in the vmalloc area, separated by guard pages. As you can see on the right-hand side, corruption of a stack will then not go beyond the one being corrupted, simply because there will be an exception when the corruption tries to go past the page and access an address which is not mapped. Another option is to also enable stack canaries for all functions. The stack canary is currently enabled, in a normal kernel, for about 20% of the functions, and it's nothing else than a specific additional local variable which is initialized when a function starts; when the function ends, there is a test, inserted by the compiler, that the canary's value has not changed. This is basically how it works: the idea is that if there is a corruption large enough to hit the canary, the check at the end of the
function will detect it. Of course, you might be unlucky and have a small corruption which doesn't overlap the canary — well, that's life. Another thing that can be done is to increase inlining and optimization at the compiler level. Why? Because all this stack churning also adds overhead, proportional to the number of function calls we make; so if we can make fewer function calls, we have less churning, and we also need smaller stacks to allocate. An extra strategy is to use pointer authentication. Pointer authentication, again, is something fairly new — you cannot assume that your CPU will have it — and in practice it adds a checksum-like validation to pointers. For example, without pointer authentication you cannot notice that your return address on the stack has been corrupted; with pointer authentication you should. I'm not sure whether every implementation can prevent the damage, but at least you can detect it, so you can think of it as an alternative canary which is hardware-accelerated. Having all these stacks mapped in vmalloc does not mean, in practice, that we need to expose all the vmalloc allocations all the time, because we can partition the vmalloc address space so that, for each context, there is a subrange used only to allocate stacks. For example, in this case I have a map where I have fully represented the core allocations, and I can also see the stacks for the QM and the safe contexts, which are anyway already protected in the way I just described. Conclusions, finally. So far I haven't hit any showstopper — in a sense I was hoping I would find one, because this wasn't exactly pleasant, but no luck. There are very many places where it's possible to add some hardening, again starting from the situation of a purely upstream kernel and doing something to improve its robustness. Improving the situation doesn't have to be perfect; it's up to you to assess whether it's sufficient, and what doing it brings to the table. But you can think about it. This also really gives an idea of how having proper coloring supported by the MMU would help, because at that point the problems I mentioned would go away — at least from my personal perspective, and also in terms of having to rearrange data here and there. Will it work for you? Well, that's your judgment; I hope at least the philosophy is something you can take home. Again, I think we need to start asking hardware vendors for better solutions — that would help us a lot — but I really think that even the pure MMU approach might be enough in some cases; it's up to you to evaluate whether it's good enough for you. I'm a bit shy at the moment about publishing the code, simply because it's really hackish — I'm just in a mode where I'm trying to get it to work. In some cases, for example, I have completely ignored the abstraction that the kernel provides for managing page tables: in my case I'm using arm64, and I went straight for the arm64 way of doing page tables. I should improve that, but I hope to be able to release it by the end of the year, publish it to the upstream mailing list, and announce it on the ELISA forum. There's a lot of stuff to do next, but again, I have been thinking a little bit about all of this, and there doesn't seem to be anything completely impossible. The question is mostly how big the patch set will be, and what the reaction from upstream will be — I don't know. I'm also presenting this here because I would really like to understand the level of interest in this approach, which is more about changing the kernel than about assessing whether it's good enough or not. And thank you for staying with me so far; that's my email address if you want to write, if you have questions. I don't know if there is any time left. OK, yes — I think you can go back to the simple diagrams, the one with the three ensembles... this doesn't help, so anyway. So it seems that what you've
done prevents any possible interference between the safe context and the QM context — when you are in a certain context, it sees only its own mappings? Yes: if, say, I am in the QM context and, for some reason, malicious or not, I got hold of an address which leaked to me from safe, and I try to access it, the branch of the page table is simply missing, so I will get an exception. Yes, yes. So the solution you proposed reduces the possibility of interference between QM and safe, but it doesn't remove all of it, because they can still interfere through the core context? Yes — let me repeat, just for the sake of repeating: the question was, interference is prevented between safe and QM, but is there a possibility of second-level interference through core? Well, it depends. My approach is to try to partition so that, for example, if you have to do STPA or other types of analysis, at least you have boundaries you can rely on, because they are hardware-enforced, and you can analyze those and decide where to start and what you have to analyze. The idea is mostly: if I can push as much as I can into QM, then I can spend my energy on safe — and also on core. Core is there mostly because, if I have, say, a safe application which allocates a buffer purely for exchanging data, core has no business with that buffer, so it shouldn't see it in its own address space. But for something like the scheduler — do you want to treat it as safe? It depends, because you could assume that if the error is macroscopic enough, the external hardware monitor will notice: "hey, you're not scheduling the application I care about, now I'm going to reset you." It's really up to you how much you want to invest, for example, in having higher reliability. What I marked as core is the stuff where I wasn't sure how to answer, so I just marked it as core: if you ask me whether the scheduler is safe or QM, I would say half and half — again, it depends on the use case. So I'd rather start with the possibility of putting stuff there and defining the rules of interaction, than not having it. Actually, initially I didn't have core, and it was a huge headache trying to define where I should put things; core kind of solved my problem, because I could have this additional classification. Core is the default. Why don't you use hardware partitioning — a hypervisor? Well, first of all, it's not always granted that you have a hypervisor; that's an additional constraint. My primary goal in all of this is to define the points inside the kernel where I want to change context. At that point, whatever mechanism supports it — whether it's a hypervisor, the MMU, or some tagging — I don't care, because that's implementation-specific. If you think about it, you could have a sort of generic version which uses the MMU, as I'm doing now, and then every vendor — Arm, Intel — could have its own implementation and could have a way to accelerate it, or not. I'm still interested in at which point I change the color of the memory, and that is not, I think, irrelevant to how it's backed. Let's say that in some use cases you either do not have hypervisor support — for example, on Arm you might not have EL3, sorry, EL2 — or you might have something else. And I have to stop, but I'm happy to continue the discussion later in the corridor. Thank you.