Hello everyone. I'm Igor Stoppa. I work for Huawei at the Cyber Security and Privacy Protection Lab in Helsinki, Finland. What I want to discuss today is, in a sense, a continuation of a presentation I gave last year at this same conference. I will try to keep references to the previous one to a minimum, but I will mention when there is some connection. The goal is to try to improve the protection of critical kernel data. What I want to present are, first of all, the problems that are involved in this endeavor, and then the solution that I came up with. At least the first part will be a high-level description, and then we will see various possibilities for the back-end implementation. So why critical kernel data? Actually, I was lucky enough that some of the previous presentations have already broken the ice on this topic. But my impression is that kernel data is really not at the same level as code, for example, when it comes to protection. So it seems obvious that when we take away the easier path for an attacker to succeed, the attacker is forced down the harder path, simply because it is the only one left. The typical way to do this protection is to use the system MMU. However, that is where several problems arise. As I said, if this sounds familiar, it's because it is. Various entities have been doing this already for quite some time. Some of them are actually in the business of doing security; others do it as a sort of side effect. And the Linux kernel already implements some of this, at a fairly coarse grain. And as I said, I've been hacking at it for more than one year, so you might have stumbled upon some of my emails. As I said, not everyone is actually interested in getting this merged upstream, because it's their core business. And others might just not be willing to put the effort into it. Plus, some of these implementations do not work as a standalone solution inside the Linux kernel.
So that is also a possible obstacle, because then one has to work the solution into multiple projects. For example, KVM could be an option. And last but not least, the performance impact of some of these solutions can be really overwhelming. I suppose not everyone would be willing to buy a phone which boots in 10 or more minutes. I have been there. If we look at the problems, there are several classes, actually, not just individual problems. If you look at how we classify data, how data is used inside the kernel, there are multiple ways, or multiple angles, to look at it. You can look at whether it is going to become immutable after a certain point in time, so you can say: there may be a transient, but after this transient the data will not change anymore. Or it might change indefinitely, maybe not with very high frequency, but you can never consider it constant. Plus, the way the memory is allocated also plays an important role in this. You usually have parameters whose size is known at compile time, so you can, for example, put them in a read-only-after-init section; that is what happens with many kernel parameters. But in other cases, like IMA measurements, they keep on coming, so you do not have a fixed allocation for that. The way the memory is modified might also make a difference in the way it needs to be protected. One example: if you load an SELinux policy DB, you just have a burst when the data is loaded from file, processed, and stored in memory, and that's it. After that point, you can consider it constant. And also the granularity at which data is allocated and treated can play a role in the classification and in the way of protecting it. Coming to that, as I said, the MMU is the primary mechanism for enforcing the protection. And the MMU, no matter how good it is, has some interesting constraints. First and foremost, the fact that it works at page level.
Now, we can have different page sizes, but that still doesn't change the fact that MMUs have a fairly coarse granularity. So that's something that we have to consider in our solution. Plus, not all the properties that we would like to express, as we have seen in the previous slides, are represented by the current generation of MMUs, as far as I know. And MMUs, again as far as I know, also have settings which are always reversible. So you can never say that you have protected something for good. If you assume that your kernel is getting compromised, then the attacker gains, or can gain, the same level of capability that you had, and he can unprotect the memory that you were hoping to preserve. That's why, for example, there are strategies where the last stage of protection is delegated to a hypervisor instead of being done by the kernel: the hypervisor has a separate context which is not directly accessible to an attacker who has compromised the kernel. Then one more dimension of the problem is the fact that the kernel, for very good reasons, makes large use of ready-made, generic data structures. And these are really pervasive. What we would like to do is somehow preserve the spirit of these data structures, but differentiate between those which are meant to be used for protected data and those which are meant to be used for non-protected data. And as I said, there can be very interesting outcomes if you don't do it right, like very sensitive data which suddenly becomes too slow to be accessed in a reasonable time. Now, one problem of dealing with a real-life system, like a mobile phone in my case, but not necessarily just that, is that you can have different types of concurrency which can be exploited by an attacker.
For example, you might think that you have a core which is performing, let's say, a secure operation where it's modifying some sensitive data, and somehow this core is interrupted and the context it was working with is left available to the attacker. Or another case could be that one core is performing this operation, but the context is not accessible only to that core, and the attacker can take over a separate core to perform a sort of race-based attack, where he just tries to modify the interesting data as fast as he can, hoping to hit the window when the data is writable. And then we also have aliasing, where the same physical memory can be accessible through different virtual addresses with different properties. Some of those aliases are intended: the fixmap, for example, is really meant for doing that. So what you want is to be able to have some sort of filtering where you decide upfront what can be mapped differently and what cannot. And finally, but I will talk about it more later, there can even be attacks on the page table which try to modify the properties of the memory being accessed. The next problem is that you might be successfully protecting some data, but you are just moving the target. If the data cannot be altered directly, the attacker might try to modify the metadata which is tracking the data you are protecting, so that suddenly it behaves differently: it is seen differently by the kernel, and certain types of operations become possible. For example, you might have decided that some memory was read-only. You have marked it as read-only; you have protected it as read-only. But if the attacker can successfully re-qualify it as writable, then a class of write operations will succeed on it.
Or another approach would be that the memory is forcibly freed, so that there is a wealth of dangling pointers left, and the attacker can try to reallocate that page at that address to implicitly modify all the interactions going through those pointers. So what I came up with so far is several different ways of declaring how memory should be treated, and that depends also on the way it is allocated. On the right-hand, sorry, on the left-hand side, you can see what is currently available in the kernel, and it is basically two things: you can declare some data as constant, or you can declare it as read-only after init. But this is something which is allocated at link time. What is still not available is the possibility, at link time, to declare some data which you want to be protected to some degree but which can still be modified. And on the right-hand side, there is a completely new set of declarations which are meant to provide support for dynamic allocation. I will say more about them later. So where is this coming from? The maybe trivial, but I think fundamental, observation is that whenever you have some memory which is classified through some metadata, the metadata can be attacked. But if you start classifying the memory based on its address, in a way which is defined at build time, then it becomes much harder for the attack to succeed, because the property of the memory is intrinsic in its address. And that is how data is usually accessed: through pointers. Because of the constraints I explained earlier about the MMU, there are also limitations on granularity: data must be aligned to the page size, and it must be a multiple of the page size, because that's how the MMU works. But if you satisfy those constraints, then you can exploit some interesting fast verification, like what I put in the example at the bottom of the slide, where basically I can always perform a quick on-the-fly verification before accessing memory in a certain way.
So link-time allocations are easy, because as long as you prepare your linker script to meet these requirements of alignment, the linker will do all the allocation for you. What is more complex is the handling of runtime allocations, because those are less codified in the way Linux manages memory. But what would really be nice is to be able to quickly compare a variable against the beginning and the end of the section which contains it, to infer at runtime what its properties are. And this is where the linker script really comes to help. Now, what is difficult about runtime allocations? In some cases, for example on arm64, you cannot just change the properties of a memory page on the fly without first tearing down whatever mapping was pre-existing. And there are also size constraints: you cannot use kmalloc, because kmalloc has size limitations. And similarly, vmalloc is not very suitable, because vmalloc does its best to optimize the way it allocates memory, and the only thing vmalloc knows about is the upper and lower bound of the address range it can use. So this is the high-level solution I was talking about. If we carve out a sub-region from the vmalloc address space and reserve it for performing these protected allocations, then we can let regular vmalloc work on the remaining part of the address range and do ad hoc partitioning of what we have carved out. It was surprisingly easy, actually. vmalloc does not support being told directly what address range it is allowed to use for an allocation, but there is an underlying function, I think it is called __vmalloc_node_range, which does support these parameters. So the idea is exactly that: to create sub-areas within the portion which was carved out, and to encode in each of these address ranges the properties that we want to associate with that data.
Other constraints are that, if we want to consider the possibility of releasing this memory, we need to watch what gets allocated inside each page, so that, for example, one specific use case is not interleaved with another. If you have a driver A and a driver B, you do not want them to share the same page, because at some point driver A would like to release this memory, and it cannot if it is locked down by driver B still using it. One reason for not just using plain vmalloc is that, depending on your allocation pattern, you might have allocations of very small chunks of memory performed frequently, and vmalloc will always allocate a whole new page, even if you are using only a few bytes of it. And the problem there is that this is likely to really thrash the TLB. So you will incur performance problems if you do not somehow compact the smaller allocations. To preemptively protect the allocations against a class of attacks which might try to trigger use-after-free patterns, instead of having fine-grained allocations and releases, the best strategy is to just keep the memory allocated until the whole use case is completed. So, for example, a driver, a module, might start allocating memory, use it, and then release it once it is unloaded, but not before, so that it is even harder to exploit use-after-free bugs, because that memory will never be available to the rest of the system for allocation. This leads us to having memory pools. The properties of a memory pool are inferred from the region where it is allocated. These properties are, for example: is the data associated with this memory pool read-only, or is it write-rare? I call it write-rare because a write incurs a performance hit, so it cannot be performed very frequently. Can the memory be destroyed, or will it stay allocated as long as the system is executing? And what is its minimum alignment? In this case, the VMA is the basic allocation unit.
It might in some cases be one page in size, but in other cases it might be larger. And the problem we have now created, a new problem, is that this metadata is a potential new target for an attack. For example, an attacker might be tempted to turn a read-only pool into a writable one. So, it would be nice if we could use the same concept we have used so far to protect the metadata itself. And if it worked once, why not use it again? So the next step is that we can also encode those properties for the pool structures. You can see that there are some properties which are encoded in the pool's address, and some properties which are inside the structure, as fields. Those fields are the ones we would like to write-protect. Now, pools cannot be considered constant, but they can be considered write-rare. So they fit nicely in the area which we have already reserved for statically allocated write-rare memory. And we can further subdivide that area, so that each range has some additional properties associated with it. As you can see, this is basically a replica of the previous slide I showed. The only difference is that the previous one was referring to addresses used for dynamic allocations, while this one refers to addresses used for allocating pool structures. One constraint of this design is that you cannot dynamically create or delete pools. But typically, a pool is meant to support a larger use case. A device driver might need one, two, three pools, and it might need a variable amount of memory from each of these pools, but the number of pools it needs should be quite well defined. So I don't think this is a big problem, especially not if you are dealing with a system which has been designed for a very specific use case. In my case that is a mobile phone; it could be an embedded device; but even if you think of the cloud, it is normal to have different types of VMs for specific use cases.
So even those can be treated as really custom implementations. Another problem is that some memory is wasted, but probably not much. Usually the amount of data that one wants to protect is not huge, and anyway the memory allocator is already packing data together, so there is probably some padding left, but not much. The maybe biggest problem is that we are now constraining certain types of allocations to fit within a memory address range. For the vast majority of use cases this should work, but there is at least one which I have identified that eventually will not fit, which is IMA, because IMA keeps allocating more and more memory for storing measurements. One possible solution there would be to set a low watermark and plan a reboot of the system. In general, a system which is deployed for a real use case needs to be able to survive a power cycle anyway, so that shouldn't be a huge issue, unless one requires really high uptime. And now we come to maybe the most interesting type of metadata-based attack: the page table. As I said at the beginning, every time you protect something, you are basically moving the target somewhere else. So if we have managed to secure the data and secure the metadata, what is left is the mechanism which supports the mapping of physical pages to virtual addresses, and that is the page table. The simpler case: you have a physical page mapped somewhere with certain properties, and the attacker manages to go into the page table and change the page table entry to values which are more convenient for him. The mitigation there is to apply the same strategy to the page table as well: write-protect the page table too. That will have a certain speed impact. But if we think about protecting the pages of the page table which are used to map data that doesn't change much, changes to those PTEs are going to be very infrequent, so it should not be a huge impact.
The nastier case is when a physical page which was mapped at a certain address with certain properties gets remapped somewhere else with different properties. For example, you have a read-only page and it gets mapped as writable at some different address. That is how the fixmap works, but nothing prevents the attacker from creating his own fixmap. The problem here is that this can happen at any address, so if we start to protect against it, then we are basically adding overhead to every modification of every page table entry. And that is probably not acceptable. One mitigation is to apply, again, the same strategy: for a system which has a specific purpose, it is possible to perform profiling and figure out roughly the maximum number of physical pages that will be used for the page table, and allocate them from a fixed physical address range, which would allow a quick verification and consistency check between the mapping in the virtual address space and the physical pages. Another option would be to keep at least most of the page table unmapped most of the time. The MMU does not need the page table to be mapped; the MMU will access it anyway. The problem is that some existing Linux code relies on accessing properties through the page table, so not every part of the page table might be easily unmappable. But that would at least reduce the attack surface, by making it harder for an attacker to just write a new mapping into some random place in the page table. And of course we could also try to perform a combination of those. Now, this was the high-level thinking behind the solution. The back-end can be implemented in different ways, which I will try to describe. The first one is to just use the kernel as it is, on regular hardware, and basically implement what I have described so far.
The solution that was actually recommended to me on the mailing list was that physical pages would be mapped as read-only in the primary address space, and then there would be a secondary address space where these pages would be mapped as writable. The secondary address space would be activated only when a modification is needed. This is good in the sense that it doesn't have additional dependencies. Once you have switched address space, you do not have to worry about modifying the code, because you have just changed the properties of the mapping, and the code will happily run on the new mapping, as long as it can write what it wants to write. But the big problem is that all of this is happening inside the kernel, and as we said, the hardware can always be reprogrammed to behave differently, so all of this can be undone. If you are pessimistic enough to assume that the attacker has gained read-write capability on kernel data, and has enough time to look around and figure out where he should write, he can basically disable all of this. Another possibility would be to use better hardware, which I hope is coming soon from different vendors. For example, we could use memory tagging, which intrinsically tries to color data based on which code can access it, so that even if an attacker manages to gain read-write capability, the moment he uses a certain piece of code to modify data which does not belong to that code, the modified data will be tagged differently. So the next time the legitimate user of that data tries to access it, that will cause an exception. But again, we are kind of moving the target: if the kernel has a tagging mechanism supported by hardware, the attacker will probably try to target the tagging mechanism to bypass the tagging protection. The context can still be subverted, because everything is still happening within the kernel context. Another option, which is what we are using in Huawei, is to use a hypervisor as a back-end.
So at that point, there are two contexts. I'm not claiming the solution is perfect or bulletproof, but at least it makes it harder for the attacker to modify a certain part of the context, because that part of the context is accessible only to the hypervisor, and the hypervisor is accessible only through specific calls. And when the hypervisor is invoked, it can also try to detect whether the kernel is still sane or whether it has been compromised. So there is a possibility to perform additional vetting. The problems with this solution are that switching context between kernel and hypervisor is not free, there is a delay, and also that the hypervisor does not have the capability of running the same code that the kernel has, which might also bring licensing problems if your hypervisor does not have the same license as the kernel. One possible solution, which is what we are aiming for, is to have a hybrid hypervisor/kernel mechanism, where the kernel still tries to perform most of the vetting and detect attacks, and the hypervisor does a sort of double check. That means that if the kernel detects that something is wrong, for example a user space process has somehow, from its context, managed to generate some illegal write, it can try to kill that user space process. The hypervisor would be left as a sort of extreme measure: if the hypervisor detects some error which the kernel did not, at that point it can even reset the device. But that is something that is not supposed to happen very frequently. Finally, there is a new set of interesting possibilities which is becoming more and more available in recent times, which is to start designing custom hardware as well, and see if we can simplify the code. In many cases, at least I have the feeling that we are playing with bits here and there just to try to cope with the limitations imposed by the current generation of hardware; if we could implement different hardware, we could probably simplify the code.
And this is where FPGAs may come to the rescue. For example, as I mentioned, for trying to implement irreversible protection. The big problem is that you must eventually be able to create your own new hardware, and not everyone can do that. A final consideration: no matter how hardened your solution is, you might still have to make compromises. If you have a big data center, maybe you can just scale it up, replicate it; it's not going to be cheap, but you can. If you have a battery-powered device, you might not be able to just throw more hardware at it. You have a different sort of constraints, where eventually you cannot add many more cores, and you cannot crank up the speed. So I hope that, if I have not proposed a complete solution, I have at least tried to explain what the problems in this field are and where the direction might take us. It really depends on what your use case is and what compromises you are willing to accept. And thank you for listening. We are a bit over time, but maybe one quick question that we can still take before moving to the next talk. If there are no questions, then let's thank the speaker again.