Hello, I'm Igor Stoppa. I work for Huawei at the research centre in Helsinki. I've been doing this for almost two years. Most of my work has been about memory protection, and that's what I want to talk to you about. This presentation is a sort of diary of a journey, because I'm still working on this. Some of the material in the slides is already kind of outdated. I wake up every morning to find someone saying, you're wrong here. I will fix it. If you think that I'm wrong about something I haven't found out yet, please let me know.

So let's see where it all started. I work for Huawei, and Huawei, amongst other things, makes mobile phones. Not surprisingly, I care about mobile phones. In certain regions, you do not have access to the Google Play Store, so if you are a user, you have to find the application from somewhere else and install it. That opens the door to all sorts of malicious attacks that users of the Google Play Store might not see. So it's kind of taken for granted that there will be malicious software on the phone at some point. What we would like to do is at least prevent the worst from happening: user information being leaked. We have seen a set of attacks. They mostly go against SELinux, because it's one of the security features used by Android, and it seems to be the common stepping stone of all of these attacks. So the idea is that if we can somehow prevent that, we have already hampered a whole class of attacks, no matter which vulnerability might be exploited in a specific product. But then the next question is, obviously, can we use this technology also for protecting something else? There's also what we could call a side effect: write protection does not only work against attacks, it also works against bugs. The difference is that for bugs, everything is fine. You do not have to aim for total coverage. Whatever you get is good.
But if you have an attacker, then you have to look at the whole chain of trust, basically. You might have 95% coverage, but that 5% is what the attacker will go after. Working with upstream is rewarding, but it's also tough. I have been looking for an example, and I have to say that it's good, because looking for an example also forces you to question your own choices. The initial example was SELinux and the LSM hooks. But the sad reality is that SELinux has very complex data structures, and the LSM hooks are a constantly moving target. I think every other release or two there is a huge patch set trying to alter them. So it wasn't good for me to keep it as an example, because I couldn't spend my life rewriting patches for LSM hooks. And thanks, Mimi, for proposing the IMA list of measurements. It's so much simpler compared to the SELinux data structures, and it's kind of more stable. It was also useful for pointing out that the API I had chosen at that point was not good enough; at least, I couldn't make it work. So that also provided a bit of variety in my analysis of what can happen. The SELinux policy DB, which is what I was protecting initially, has a sort of initial transient during which it loads a bunch of data, creates a lot of structures, allocates stuff, but after that it's quiet. You can mark it read-only and nobody will notice. First of all, that is not the typical case for kernel code. IMA is a much better representation of kernel code, where you might have something which happens seldom, but keeps happening over the lifetime of the system. For example, with IMA, if you are looking at measurements for files, every now and then you might want to add a new measurement. It might not be performance-critical in terms of that specific addition, but it means that the memory cannot become completely read-only, because the measurement list is implemented as a linked list. So every time I add a new measurement, I need to modify a pointer in that list.
And this takes us to the write-rare functionality. Initially, it was only for dynamically allocated memory, because that's what IMA was doing: it allocates a bit of memory and then adds it to the list. So, what's the meaning of write-rare? I think it's highly subjective, because what is rare for me might not be rare for someone else, and what is acceptable for me might not be acceptable for someone else. So, in that sense, the idea is that it is provided as a mechanism, and then every potential user has to make their own analysis of whether it's suitable or not for that specific use case. More learnings from IMA: not only dynamically allocated memory needs to change every now and then, because, for example, if I have a list of dynamically allocated nodes, there is, in reality, a statically allocated head, which I need to alter at some point to append something. So write-rare needs to work there too. And since we are talking about lists, what I initially started with was a really bare-bones version of write-rare, where I was dealing directly with pointers, but it turns out that this is extremely painful. What you want to have is some sort of higher-level abstraction, and, guess what, you want it to look a lot like the abstraction that you would use for non-protected memory, which in practice means having an alternate version of the functions that the kernel already uses. So let's have a quick look at what is available right now. This should be a surprise to nobody, but I just wanted to list it. You have a few ways of protecting statically allocated memory, but the sad thing is that when you move to dynamically allocated memory, there's nothing. That's where pmalloc, which was my initial idea, came into the picture, because SELinux uses dynamic allocation for most of the memory it requires, since SELinux doesn't know in advance how big the set of policies it needs to accommodate will be.
And as I explained earlier, statically allocated memory also needs write-rare support. By the way, I found out this morning that prmem is not a good name choice, because it can be confused with pmem, and it could even be used in the same context. So if you have a proposal, you're welcome. The difference between read-only and write-rare seems obvious, but I don't think it is. At least, it wasn't obvious to me until I had done some soul-searching. Read-only is final. When you say make this read-only, that's it, there's no way back. You are saying: I am giving up any future possibility of modifying this. But when you do write-rare, periodically you are touching something that should not be touched. So the question is, how can we be sure that the one trying to modify it is legitimate? However, from a hardening perspective, I think it's still usable, or useful. Quick overview of the protection mechanism: typically, x86 and ARM-based systems have an MMU, and the MMU is what can be used to protect memory. The MMU works at the page level. So what doesn't work is if you allocate some memory from a page and want to make it read-only or write-rare, and from the same page you get some writable memory. It can be made to work, if you really, really want it, with some hypervisor trick, but it's going to cost a lot. It's much more convenient to split writable memory from protected memory at the beginning, so that whenever somebody tries to modify protected memory, this will trigger an exception. So the MMU gets in the way only when something bad is happening; in the normal case, the MMU will not generate an exception. And we have two ways to do this. One is with the kernel only, and the other one is the kernel plus something else, which can be a hypervisor, a TEE, you name it. It doesn't matter; it's just something which is not the kernel. In the first case, with the kernel only, the really bad thing is that the protection can be undone.
Current hardware does not have any one-way option to say: I do this once, and I will never take it back. So if the attacker manages to run code or modify data at the kernel level, the protection can be undone. However, even in this case, from a hardening perspective, it still reduces the attack surface, which I think is better, because we are moving from having basically every bit of memory as a potential target to focusing on, for example, the page tables. In the case with a hypervisor, it's kind of easier if you happen to already have a hypervisor handy, because the kernel is much more limited in what it can do, and the hypervisor is what the kernel can use to permanently relinquish its capability to undo the protection. The downside is, of course, that you need to have a hypervisor, or hardware capable of running one, and if you think about IoT-class devices, that might not always be the case. But there are cases where the hypervisor is there already, for example in data centers, at cloud providers, on some mobile devices. And even your regular laptop or PC usually supports running a hypervisor. So why not? It could be used even in regular distros as an additional form of hardening. I'm borrowing some naming from git. The base of this protection is what I call prmem. The requirements are: reads must be fast, or at least there shouldn't be any noticeable decrease in read performance, and writes, for the write-rare case, should be acceptable. We already discussed what acceptable means; it's probably subjective. And since Linux doesn't run only on systems with an MMU, it also needs to work when there's no MMU available. For the dynamic allocation case, I'm using vmalloc as the backend. This makes it possible to be sure that, as long as there is system memory, the allocation will not fail. And it uses logical pools, from the perspective of having properties that are assigned to the pool; every vmalloc area which is allocated for that pool will then have the same properties.
It's a kind of trade-off between kmalloc and vmalloc, because vmalloc basically allocates one or more pages every time it's invoked and hands those out. That means that every page will use a TLB entry. So if you allocate ten bytes five times, that's going to cost you a lot in terms of TLB thrashing. On the other hand, kmalloc typically uses a huge-page mapping, so that is basically free. This solution is kind of intermediate, because it reuses the free space that was left over from the previous allocation. So if I allocate ten bytes from a page, the next allocation will use whatever is left of that page. The implementation of write-rare for statically allocated memory is not so different from read-only after init. The major difference is that read-only after init can piggyback more on constant data, because they kind of go hand in hand; actually, at least on some architectures, they are even protected simultaneously. Write-rare intrinsically has a different write history, so it cannot be treated the same way, especially because of ARM64. ARM64 uses huge mappings, while for write-rare I would like to use mappings as small as possible, to minimize the area which becomes writable when I have to modify something. There's a version, which is what I have posted to the kernel mailing list, without a hypervisor, and as I was saying, it's supposed to use small pages. To avoid getting the system completely stuck, it does not disable interrupts on the whole system, but only locally, on the CPU which is executing the write. What it does is: it gets a new random address, remaps the write-protected page to this random address, which should be harder for an attacker to identify, performs the change, and then destroys the mapping. All of the functions are implemented inline, with the hope that this also reduces the attack surface, meaning that it should be possible for the linker and the compiler to do better optimization if all of this is inlined.
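The remap-write-unmap sequence can be illustrated in userspace roughly as follows. This is a sketch under the assumption that memfd_create() (Linux-specific) stands in for the kernel's ability to map the same physical pages a second time; the wr_create/wr_memcpy names are invented, and the alias address here is simply kernel-chosen rather than explicitly randomised as in the posted patches.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WR_SIZE 4096UL

static int wr_fd = -1;

/* Create the protected region: backing memory plus a read-only view,
 * which is the only mapping that exists in the normal case. */
static void *wr_create(void)
{
    wr_fd = memfd_create("wr_demo", 0);
    if (wr_fd < 0 || ftruncate(wr_fd, WR_SIZE) < 0)
        return NULL;
    return mmap(NULL, WR_SIZE, PROT_READ, MAP_SHARED, wr_fd, 0);
}

/* A rare write: map a short-lived writable alias of the same backing
 * memory at a fresh address, modify the data through the alias, and
 * tear the alias down immediately afterwards. */
static int wr_memcpy(size_t offset, const void *src, size_t len)
{
    char *alias = mmap(NULL, WR_SIZE, PROT_WRITE, MAP_SHARED, wr_fd, 0);

    if (alias == MAP_FAILED)
        return -1;
    memcpy(alias + offset, src, len);
    return munmap(alias, WR_SIZE);  /* the write window is gone again */
}
```

Between calls to wr_memcpy() there is no writable mapping of the data anywhere, so a stray write through the normal view faults, which is the property the talk is after.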
The hypervisor case is easier in a sense, because I don't have to worry about the mapping: the hypervisor can have its own mapping, and it doesn't care whether the kernel is handling an interrupt or not. It's still good to inline the functions up to the point where the hypervisor is invoked. This is what I have implemented so far in terms of plumbing. There is some discussion going on about whether it should be like this or not. I do not have an answer; I guess that's the point of asking for review. Anyway, my takeaway is that I need an implementation of the basic functionality: memcpy, memset, assignment of a pointer, handling of atomic operations. There are some nice side effects. For example, now that certain types of data which we know to be more valuable (that's why we want to protect them) live in certain areas, we can use this knowledge to enhance hardened usercopy coverage, because most likely that data, well, it's already write-protected, but we probably don't want to let user space even read it. Currently, I'm not really releasing any memory, which means that use-after-free attacks are kind of impossible here, because even if there is some bug which tries to use memory which was freed, it will still be there: once memory is given out, it's never taken back. There's a different use profile of the TLB. I haven't measured it, so I cannot say how good or bad it is. I suspect it's going to be a bit worse, but I do not know by how much. This is another thing that has received crossfire from various directions, but I will talk about it anyway for now. What I wanted to achieve was compatibility between, let's say, the normal read-write version of a certain data structure and the write-rare version of it. I wanted to recycle the reading code, because the fact that something is write-rare doesn't change anything about reading it.
The key part is that most of this needs to be atomic, in the sense that I want to make it so that one page remapping is sufficient. I do not want to have some data, some simple type of data, crossing a page boundary. That was a lot of words; this probably explains it better. On the left side, there's the typical version of the data structure. On the right side, there's the pr version. In reality, it's just a lot of code to access the same data in different ways, but at least it should give the compiler the notion of what is write-protected and what is not. The alignment is such that none of these pointers crosses a page boundary. I have already converted various types of lists, and they seem to work. What I might be doing next, if my proposal is not completely rejected, is the object cache. Why? Because, for example, if I want to apply write-rare to the SELinux AVC cache, that one allocates and releases a lot of nodes, so I cannot just eat up memory forever. That said, there are two ways of using write-rare. One I call chained, which is probably the most obvious one, and the other one is looped, which is something that I found in Samsung code. Chained basically means that you start with a statically allocated pointer or data structure, and then you have one or more dynamic allocations which follow. And this is really a chain of trust, and you want to protect it all, because if you leave even one link out, that will be the target of the attack. The looped one is more complex, but, for example, it is used by Knox for protecting the LSM. Let's say that you have a data structure with some fields which are writable and some fields which you want to protect. What you do is split it: you put the part which you want to write-protect in a specific area, where only structures of that type will live, and keep the writable part elsewhere.
The writable part has a pointer to its write-protected counterpart, and the write-protected part has a pointer back to the writable one. The cost of this is that, before dereferencing what I labeled here as P1, it needs to be validated, because an attacker could write anything there. So this is a case where write-rare also has read overhead. Basically, I need to verify that P1 points into the area where those structures are allowed to be, and that it's also aligned to where the field that points back, or the parent structure, should be. But it's a way of having a sort of floating protection that doesn't require the whole chain that I showed in the previous slide. So, once more, choosing one or the other really depends on the use case, and on how complex it is to provide a full chain versus taking the overhead of validating before reading the pointer. This is an example of protection code. As I said, the validity of this is volatile and might change soon, but for now, if you look at it, the green part of the code is what I have added, and I think it doesn't look too alien. There's an initial modifier to specify that the variable is write-rare after init. There's the declaration of a pool; the pool is a way to specify properties for the memory that will be allocated. And then it shows that, instead of writing size = 5, I use wr_int to set that value. The reason is that this does the magic of writing through the secondary mapping. pmalloc is just allocating some memory, and then the last instruction writes the address of the allocated memory to the actual pointer. The next one is a bit more complex, and I'm not sure I should linger too much on it, also because it's part of what is currently being challenged.
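Based on that description, the converted code presumably looked something along these lines. This is my reconstruction, not the actual slide: the wr_*/pmalloc names follow the posted patch set only loosely, and the stand-in definitions at the top exist purely so the sketch compiles and runs in userspace.

```c
#include <stdlib.h>
#include <string.h>

/* ---- userspace stand-ins, so the example below is self-contained ---- */
struct pmalloc_pool { int unused; };

static struct pmalloc_pool *pmalloc_create_pool(void)
{
    return malloc(sizeof(struct pmalloc_pool));
}

static void *pmalloc(struct pmalloc_pool *pool, size_t size)
{
    (void)pool;        /* a real pool would carve from protected pages */
    return malloc(size);
}

/* In the kernel, these would write through a temporary secondary mapping. */
static void wr_int(int *dst, int val)   { memcpy(dst, &val, sizeof(val)); }
static void wr_ptr(void **dst, void *p) { memcpy(dst, &p, sizeof(p)); }
/* --------------------------------------------------------------------- */

/* The protected variables; in the real code they would carry a
 * write-rare-after-init section attribute. */
static int   size;
static void *data;

static struct pmalloc_pool *pool;

static void setup(void)
{
    pool = pmalloc_create_pool();        /* declare/create the pool */
    wr_int(&size, 5);                    /* instead of:  size = 5;  */
    wr_ptr(&data, pmalloc(pool, size));  /* instead of:  data = ... */
}
```

The reading side stays untouched: anything that only reads size or data keeps using them directly, which is the compatibility goal described above.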
Mostly, what I would like to say is that I am trying to reach a point where it is really explicit whether I'm doing something with protected memory or not, while the comments I got from various people upstream are that I should make it happen behind the scenes. I don't know. Maybe it's me trying to get someone to convince me that I should write less code, which is not bad. This is a similar case. In practice, the takeaway is, again, that where I have to read, I can just look at the same data structure from a different angle and reuse the existing functionality, so I'm still using hlist_unhashed. For writing, or altering the list, I have a new function, pr_hlist_del, in place of the normal hlist one. And the assignment of the null pointer also happens through a specific function. Current limitations: I still have to get it to work on ARM64. I have it working, but only through the hypervisor; I do not have it working without, because I probably need to create a separate section with a different mapping, but I hope I can tackle that. I need to write a lot of fallback code for the no-MMU case, and then I cowardly avoided writing any test cases for RCU and atomics, because they're really complex. At this point, I was just trying to get something which would kind of work end to end, so that I could use it as a base for a conversation, and I think I got that far. Now comes the more grueling part of getting it to work for real. As a disclaimer, I'm not claiming that it's all good. There are things that I know do not work, or can be attacked. As I said, right now, even with the hypervisor, you do not really know for sure who is asking you to modify something. So if the attacker is smart enough, they can use the system against itself. I've actually seen this sort of attack described in some slides, against a Knox phone, so I'm not saying anything new. The metadata used could also be the target of an attack.
So it might be possible to protect that, but I'm not sure how yet. The MMU page tables are also something that could be attacked. But again, the goal is: if we can reduce the attack surface to, for example, the MMU page tables, that might be an incentive for hardware designers to think about something that would make it harder to attack those. And there's the usual problem with randomness: if someone manages to drain your randomness pool, then it becomes easier to guess where the next write will be mapped to. There are performance limitations. If, for example, I want to append or insert a node in a list, I have to deal with four pointers. Right now I'm doing them one by one, also because first I wanted to get to the point where there's good confidence that the write-rare base mechanism works. But the point is that I am going four times through the remapping, or possibly through hypervisor calls. This has a performance cost. Ideally, if I have something which is used a lot, and I think lists are a very good example, then it might be worth promoting them, in a sense, to become intrinsic write-rare functions, instead of depending on some more basic function like assigning a pointer. What's next? Yes, I have to fix static write-rare after init for ARM64, plus the fallbacks and the test cases. Then the problems I just described: vetting the caller is probably doable. The metadata is questionable, because it's not only about the metadata that I'm producing with the pmalloc pools, but also the metadata that comes with the vmalloc areas, and that seems to be a completely different class of difficulty. This is the optimization I mentioned earlier. The drawback is that whatever becomes a sort of intrinsic function then needs to be implemented also on the hypervisor side. So it would be nice if there was some sort of library that could be shared in both cases, but I don't know how much that is possible.
And assuming I have any spare time, this is what I would like to do. SELinux and the LSM hooks go kind of hand in hand. In reality, I think protecting any security module requires also protecting the LSM hooks, because, as I mentioned when I was talking about chains, the LSM hooks are again something that is always in the way, and it's useless to have a very hardened module if it can just be unplugged. For SELinux, the policy DB is kind of easy. I've already done it, but it's not something that I have submitted upstream. The AVC, I have tried without write-rare, and I managed to make it ten times slower, so that's also not going to be submitted. Containers might be a nice use case. Unfortunately, I don't remember the name of the person, but someone was trying to submit patches for the LSM hooks to improve the support for containers. So you could have read-only hooks for the main execution environment, but write-rare for everything that is in a container, because that's something that you can destroy. Does it work? It seems to, although, again, this morning it was pointed out to me that I'm doing something nasty with interrupts disabled that I shouldn't. I have to look into that. But there are also two meanings of "does it work". I think so far it's also reasonably usable, in the sense that it can be used to convert existing code. I think that's the major point: if I come up with something that works, but nobody wants to use it because it's insanely complicated, then I haven't achieved much. For sure, I think we can agree that it reduces the attack surface and, maybe even more, it makes the system less prone to undetected memory corruption. The hypervisor, yes, is the magic. I think it's coming, one way or the other, in reality. A lot of Linux use cases already rely on a hypervisor, for very different reasons, but it's there, so I don't think we can ignore it; we should take advantage of it.
There is a possibility that control-flow attacks can sidestep all of this. I don't have an answer for that, other than that I don't think it's possible to make it completely bulletproof. One thing that I think is good is the fact that it's opt-in: if you feel that your system can use it, good; if not, well, maybe next time. Thank you. As a reference, okay, I suggest that you do not look at my patch set, because this one is really ugly, but what I sent to the mailing list is a bit less ugly. For your convenience, since these were just one huge tarball, I have downloaded the publicly available source code from Huawei and from Samsung and put it on GitHub. You can see there some of the things I mentioned. If you really want to see how we are protecting SELinux now, you can see it there, and you can also see the LSM protection from Samsung. And that's all. Questions?

Thanks for this research. You mentioned the problem that you can't tell whether the caller of the write function is valid or not. Maybe control-flow integrity can give a partial answer to this: it enforces that only the call sites present in the original source code are able to call your function, and it restricts the number of places from which the writing function can be called.

Yes, control-flow integrity is one thing that definitely doesn't hurt. One problem is that I've seen various implementations, and each of them can be sidestepped in some cases. So I think it's the same problem as with security in general: you might have something that is nice to have, but you cannot trust that it will fix all the cases. But yeah, for sure, anything which hampers the alteration of how the flow of the program should go is good. Thank you.

Just a small reminder for those who ask: please introduce yourself, so that people can find you afterwards if they are interested.

David Helms. I'd like to know, what are the vectors by which this overwriting is actually happening?
I've heard stories about Android having secret files which provide access to memory. Are these being used?

I'm sorry, it's true that I work for a company which makes phones, but what I work on is what you've seen. So there might be something, but I cannot answer this question.

So, historically, there was /dev/exynos-mem on certain devices, which allowed you to do that. There have been other things that vendors have put together where they've not thought very hard. Sorry, Mark Rutland. One question...

Hi, Tycho Andersen. I'm actually one of the people who complained, but I like the work in general. So I was just wondering: you could use CFI, or you could also, even just as a first attempt, do a sort of set_fs()-style thing, where you say, okay, I'm about to enter some code which is going to cause a fault, we should allow this to be mapped writable, and then unset it later. Just like: I know that this specific little bit of code is going to write to non-writable memory, and I know that this other specific little bit is too, but in general, I will now protect against all of the other write-anywhere primitives that exist outside of these specific little bits of code. So you get a lot of this without having to wait for CFI or anything else.

You are proposing a sort of whitelisting of functions which can...

Yeah, I mean, I guess I was thinking of it as just a global bit that you enable, saying: I'm about to do some activity that's going to cause a fault on write-protected memory, but it's okay. Take the fault, remap it as writable, do the access, then go back to the regular code and unset that bit. So then, in normal execution, if there is a write to this place, it faults and things panic, or whatever.

So what will protect that bit?

Something else, I don't know. It's just another level. I mean, again, if the bit itself is write-protected, then you have to be able to call a function and write...

Yeah, well, there are...
I mean, it's kind of becoming a philosophical problem.

Sure, yeah, totally.

But the problem, or the reason why I'm not so fond of something global, is that you are basically giving a really clear point of attack that is easier to identify.

Yeah, but you also don't have to duplicate every data structure in the kernel, so...

No, well, I'm wondering, is there a better way? So please take my proposal as a conversation starter. I'm not saying this is the truth and that I know better. I'm just saying: this is a way to do it; how can we do it better?

The pain is, as soon as you start protecting larger and larger structures, you end up going: oh wait, I need all of the list-handling functions dealt with, and now I need all the atomics... Finding the middle ground, I think, is where it's going to be tricky.

I think we'd better stop here, so maybe we can continue the discussion later. Let's thank the speaker.