Hi everyone, thanks for joining this talk. I'm Sebastian Österlund, and I'm here to tell you about MPK/PKS kernel compartmentalization, or rather how we can use memory protection keys to compartmentalize the Linux kernel and make it more secure.

Let me start with a bit about my background. At the moment I work as an offensive security researcher at Intel STORM, a team focused on offensive security research and developing mitigations. Before joining Intel I was a PhD student at VUSec at the Vrije Universiteit Amsterdam, where my research topics included operating system defenses. I also worked on side channels and transient execution attacks, so I was involved with RIDL/MDS, and besides that I've worked on fuzzing and compilers in general. Nowadays my day-to-day duties include static analysis of microcode, looking at operating system defenses such as this one, and in general looking at new hardware features and how we can use them for security.

As you might have guessed from the title, this talk is all about compartmentalization. So why do we need compartmentalization in the Linux kernel? The first observation is that nowadays a lot of third-party code runs in ring 0 that is not necessarily upstream in the main kernel tree. Think of things like drivers or eBPF packet filters, all running in ring 0. Suppose there is a bug in any of these programs, say a memory error: an attacker could potentially disclose all the memory available on the machine from a single bug, which is not ideal. The observation is that many of these components don't need to access all memory when they run. eBPF programs, for example, have no real need to access the direct mapping and thereby all memory. And then there's the elephant in the room: transient execution attacks.
With these, even when there are no actual memory errors or bugs in the program, an attacker can still abuse third-party code that runs in ring 0 and use gadgets in it to disclose memory through transient execution. So what I want to propose today is a form of kernel compartmentalization that uses new hardware features to make it lightweight.

Let me step back to transient execution attacks. Spectre, Meltdown, MDS and friends have been quite a headache for the kernel community to properly mitigate. There are so many of them that just wrapping your head around what they all do, and checking whether something is properly mitigated, is a pain. It doesn't help that some of the more heavyweight blanket mitigations that have been proposed are really hard to implement. One example is core scheduling, which is partially upstream in the kernel. Basically, it pins a security domain to a core, so all processes that run on the SMT threads (hyper-threads) of that core need to be in the same security domain. A big part of core scheduling would be protecting the kernel, because otherwise you can have two co-located threads, one running in user space and one in kernel space, and leak data across threads. As far as I know this part of core scheduling hasn't been upstreamed yet, simply because the performance cost is really high and it's tricky to get right. As a result, different vendors have simply opted to disable functionality: some BSDs disabled hyper-threading altogether, and most Linux distros disabled unprivileged BPF after Spectre-BHB was made public.
I don't want to go into too much detail on how Spectre works, but essentially the attacker mistrains the branch predictor so that an out-of-bounds array access happens during speculative execution, and then leaks the secret obtained that way through a microarchitectural side effect, typically in the cache. So you speculatively perform an arbitrary memory access, the secret that is read leaves a trace in the cache, and then you do a cache attack to recover the actual secret.

In general these attacks have the following format. You prepare the system by flushing some buffer. Then you run the attack, which accesses a secret and leaves a microarchitectural trace, for example in the cache. Finally you do a timing attack to see, for example, which cache entries have been accessed: because you flushed the buffers and put the system into a known state, you can time any change in that state and infer which secret was leaked. This is a very high-level, hand-wavy overview, just to get us on the same page. If you want more details, there are plenty of other presentations that go in depth on how transient execution attacks work.

What I also want to note is that compartmentalization is already present in the Linux kernel. There have been a few efforts to do some form of compartmentalization in the context of security. The first one that comes to mind is KPTI, kernel page table isolation, which was the original mitigation proposed for Meltdown. More recently Google has taken this idea and made it more generic with address space isolation (ASI), so that it can be used to mitigate many of these similar transient execution attacks.
Simply put, it works by having different address spaces. KPTI has separate address spaces for user space and the kernel, because it wants to prevent kernel leaks through Meltdown. ASI, address space isolation, can do this inside the kernel, especially for paths entered from the hypervisor, so that when someone running in a virtual machine, at a cloud provider for example, finds some new speculative execution gadget, they can't access arbitrary memory.

Another example of compartmentalization is kernel lockdown together with secure boot, which locks down the system so that even the root user can't modify it. So if the root user is compromised, the attacker still doesn't get arbitrary access to everything. The third kind of compartmentalization I can think of is confidential computing. Think of hardware features like SGX, which allows confidential processes to run such that the host machine can't tamper with whatever runs inside the SGX enclave. More recently this idea has been generalized to TDX or SEV, where you can run a whole virtual machine with a hardware-guaranteed walled garden around it, so that a malicious hypervisor or cloud provider can't read what's going on inside your virtual machine. These are just examples of how compartmentalization is actually used today.

Another thing I want to note is that many of these mechanisms are quite heavyweight and typically reserved for virtualization, where you enter and exit the virtual machine, which is already quite expensive and doesn't happen that often. You typically need to switch address spaces or something similar, which is expensive, but if it doesn't happen too often the overhead is acceptable. But suppose you want to do this inside the kernel for much smaller domains; then,
the overhead of context switching, that is, switching address spaces, becomes quite significant. So in this talk I want to propose leveraging some new hardware features to achieve similar compartmentalization, in particular similar to KPTI and ASI, but much more lightweight.

That brings us to memory protection keys. There are two implementations of protection keys on Intel: PKU, protection keys for user space, and PKS, protection keys for supervisor, which is what we'll focus on for kernel compartmentalization. In essence, this hardware feature allows you to override the permissions in the page table entry. Suppose you have a page that is mapped as writable and accessible. For a given domain, identified by a domain key, you can override the permissions of all pages that fall under that domain and, for example, disable write access, simply by writing to an MSR. The benefit is that you can change the effective page table permissions, and achieve something similar to KPTI or ASI, without having to change address spaces. There's no need to invalidate the TLB or change CR3, which makes it a lot faster than fully switching address spaces. Protection keys for user mode have been available for a few generations, and protection keys for supervisor were recently released on 4th generation Intel Xeon Scalable processors.

Let's delve into this hardware feature. I'll give you an overview of how PKS works. As I mentioned, you have these protection key domains. For each page table entry in your page tables, you insert the protection key into some of the bits that were previously reserved. Normally, if you think of a page table entry, you
have a few bits with some information: execute-disable, read/write, user/supervisor, and then the physical address used for address translation in the rest of the bits, plus some other things. What we do is use a few bits in there to define the protection domain of a page. It's four bits in the page table entry that define which domain a page belongs to. Then we can override the permissions defined in the page table entry through an MSR: you can set access-disable or write-disable for a given protection key. As I mentioned, four bits associate a page with a domain, which gives us at most 16 keys, and the MSR has 32 bits because there is a write-disable and an access-disable bit for every domain. So essentially you assign a domain to a page, and you can override the page table entry permissions through the bits at the corresponding index in the MSR. For a more detailed explanation of how this all works, have a look at Ira and Rick's Linux Plumbers talk from 2021, where they explain how they implemented this for the Linux kernel.

So how can we use protection keys for supervisor for compartmentalization in the kernel? First, what do we even want to compartmentalize? Previously, PKS has been used to protect some sensitive areas in the kernel from the kernel itself. If there's a memory error in the kernel, you don't want it to be able to leak certain super sensitive information, such as cryptographic keys. So PKS has been used to deny access to certain small parts of the kernel, or to disable writes to things like page table
entries, that is the page tables, or persistent memory. For compartmentalization we want to turn this around. We don't want to protect small parts of the kernel from the kernel itself; we want to protect the whole kernel while executing a selected small piece of code. To visualize it: normally, when you want to protect some secret, you map the secret area with a certain protection key and disable access to it whenever you're not in a critical section. What we want instead is that in the general case we have access to all kernel memory, but when we're executing a section that doesn't need access to all kernel memory, such as eBPF or some other small part, we disable access to all the normal kernel memory and keep only the minimal set of pages available that the section really needs, such as the stack and its data. Another way of putting it is that we move from a denylist approach to an allowlist approach. I think that's the easiest way to see it.

I identified a few targets where something like this is possible, where we can make most of the kernel memory inaccessible and keep only a limited set that the code actually needs in that context. The first one I implemented this for is eBPF. eBPF is basically a virtual machine that runs inside the kernel in ring 0. It can hook into functions, you can have it run in response to an incoming network packet, and a bunch of other things, but it's a limited virtual machine that should only access a limited set of memory. So it's a nice, obvious candidate for this kind of compartmentalization, where we don't allow it to read arbitrary kernel memory. Suppose there's a bug in there: if an attacker can somehow control a
pointer to read arbitrary memory, we get a fault as soon as it reads arbitrary kernel memory. Another thing I looked at is using PKS as a drop-in replacement for address space isolation. I'll talk about this more later in the presentation, but PKS essentially gives you similar guarantees as address space isolation without having to switch between address spaces and do a full context switch. I've also been thinking about other targets for this kind of compartmentalization, so if you have any suggestions for where we could try this out, I'd be super happy to chat about that later.

So let's get into how this actually works: how can we do compartmentalization with PKS? As already mentioned, you basically have two domains. For clarity, let's define them: a privileged domain and a non-privileged one. Key 0 is the privileged key and key 1 is the non-privileged key. In the privileged domain you can access all kernel memory, so everything that's yellow and green here should be accessible. The non-privileged domain is the more restrictive one, where you're inside eBPF or have entered an address space isolation section. There you're only supposed to be able to access the things marked in green, which are mapped under key 1, so really the minimal set of memory. If you're in the non-privileged domain and try to access anything else, for example the kernel text here, you should get a fault.

As already mentioned, in the privileged domain you should be able to access everything. That part is fairly simple: we keep the normal mappings as you have them in the kernel. That's normal kernel operation. Say we're supposed
to be in some memory management code, where we obviously need to be able to access all the memory. Then we're in the privileged domain and don't need to do anything special: we keep all the access-disable and write-disable bits at zero, because we don't want to override any page table permissions while in the privileged kernel domain. Simple enough.

But suppose we enter an eBPF program or something else and want to switch to the non-privileged domain, which is more restrictive and where we can't access arbitrary memory anymore. Then, for PK0, the key that maps basically all the kernel memory that is not strictly essential for the critical section we're in, we simply set the write-disable and access-disable override bits to one. This means that anything yellow in this picture becomes inaccessible, while the green memory is still accessible. So for PK1 we don't override the permissions, and for PK0 we override the page table permissions so that everything is disabled. That gives us access to just the minimal set of memory that we need.

It's very important to keep in mind that when you're executing code there's a minimal set of things that you always need to have mapped. Think of the stack: to execute normal code you need stack access. Another thing to keep in mind is the interrupt stack: if you get an interrupt while inside the non-privileged domain, you still need to be able to handle it. So there are a few pages like these that really need to be mapped and accessible.

To visualize how to switch between privileged mode and non-privileged mode: you have these write-disable and access-disable bits in the register.
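
The register layout and the two domains described here can be modeled in a few lines. This is only a toy sketch in Python, not kernel code: it assumes that key k uses bit 2k for access-disable and bit 2k+1 for write-disable (the same layout the user-space key register uses), it ignores details such as CR0.WP, and all names in it are made up for illustration.

```python
# Toy model of the protection-key rights register described in the talk.
# Assumed layout: for key k, bit 2*k is access-disable (AD) and
# bit 2*k + 1 is write-disable (WD); 16 keys fill a 32-bit register.

NUM_KEYS = 16
PK_PRIVILEGED = 0      # default key for ordinary kernel memory ("yellow")
PK_NONPRIVILEGED = 1   # key for the minimal set a compartment may touch ("green")


def ad_bit(key: int) -> int:
    """Mask of the access-disable bit for a key."""
    return 1 << (2 * key)


def wd_bit(key: int) -> int:
    """Mask of the write-disable bit for a key."""
    return 1 << (2 * key + 1)


# Privileged domain: no page table permission is overridden.
PKRS_PRIVILEGED = 0
# Non-privileged domain: key 0 memory is neither readable nor writable,
# key 1 memory keeps whatever the page table entry allows.
PKRS_NONPRIVILEGED = ad_bit(PK_PRIVILEGED) | wd_bit(PK_PRIVILEGED)


def access_allowed(pkrs: int, page_key: int, pte_writable: bool, is_write: bool) -> bool:
    """Return True if a supervisor data access would pass the key check."""
    if not 0 <= page_key < NUM_KEYS:
        raise ValueError("protection key must fit in four bits")
    if pkrs & ad_bit(page_key):
        return False              # AD blocks both reads and writes
    if is_write:
        if not pte_writable:
            return False          # keys can only restrict, never grant
        if pkrs & wd_bit(page_key):
            return False          # WD blocks writes only
    return True
```

With these definitions, `access_allowed(PKRS_NONPRIVILEGED, PK_PRIVILEGED, True, False)` is false while the same read against a `PK_NONPRIVILEGED` page is allowed, which is the yellow versus green distinction from the slide.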
If you're in privileged mode they're all zeros. Once you move into non-privileged mode, you set the bits for PK0, the default key that all kernel memory is associated with, which makes that memory inaccessible. We never override the permissions for PK1, because that memory should stay accessible in the non-privileged domain. The effect is that once we move into non-privileged mode, we disable access to all PK0 memory but keep access to the PK1 memory: the green stuff is still accessible, the yellow stuff is not. Easy enough, right?

Getting this compartmentalization to work comes with a few challenges. Ideally we would like to do this kind of compartmentalization all over the kernel, for the drivers I mentioned before, for example. It turns out this is not easy, because in a lot of places the memory accesses are not really localized. You need to know beforehand which pages should be accessible inside your non-privileged domain. Suppose you have a driver that accesses memory all over the kernel: you need to make sure that all allocations it touches are actually mapped under the right domain key. That's tough in the general case. There is a workaround: you can temporarily disable the defense and allow temporary access to arbitrary memory. If people are familiar with SMAP, this is essentially how that works: when you're in a section that really needs to access user-mode memory, you simply disable the feature for that section. But overall, memory accesses in the kernel are not always localized, so you can't do this compartmentalization everywhere. The other question is how we can
determine the areas that need to be accessible from within a domain, which is also not easy. You need to know beforehand what memory is accessed once you enter your non-privileged domain. For memory that's allocated while you're inside the non-privileged domain it's quite easy: you make sure that everything you allocate is mapped under the right domain key. But suppose the code takes pointers from somewhere else and tries to access some struct with more pointers in it; then you need to make sure that all of that is accessible too. So in the general case, implementing compartmentalization for something like drivers is not really feasible. That's why I started with more self-contained things, like eBPF or a drop-in replacement for address space isolation. That's basically the gist of it: ideally we'd be able to do this for everything, but in practice I think it's very hard to do in a general way.

So let's have a look at the first use case that I implemented, PKS-based BPF isolation. As already mentioned, BPF is a kind of virtual machine that runs inside the Linux kernel in ring 0. It runs BPF bytecode that is typically used for packet filters and all kinds of other things, for example profiling and instrumenting function calls in the kernel. The nice thing about BPF is that it's a very restrictive environment: the bytecode can't access arbitrary memory by itself. I'll come back to that in a bit, because there are some caveats. The bytecode goes through an in-kernel verifier that checks that it actually follows all these strict restrictions, so that it is valid and doesn't do anything weird. But what I want
to note here is that there have been a few known CVEs in the verifier itself. It has actually been possible to create out-of-bounds accesses in BPF despite the verifier, through a few edge cases that it didn't check properly. I did a quick search for CVEs, and there have been around 80 BPF-related CVEs in the past few years. So some of these things slip through in the BPF implementation. It would be pretty nice to have a blanket mitigation that says: even if there are vulnerabilities in the BPF verifier, we still want to make sure that you can't access arbitrary kernel memory from BPF.

And then, as I mentioned earlier, transient execution attacks are the big elephant in the room here. Since you can load user-provided bytecode that actually runs in ring 0, BPF was a perfect target for running transient execution gadgets: you define your own gadget that runs in ring 0, speculatively accesses whatever memory you want, and then leaks it. So eBPF had to mitigate this. When bytecode is loaded, it inserts mitigations such as retpolines, depending a bit on the mode it's running in. There are a bunch of mitigations for transient execution attacks in BPF, and obviously these have a performance overhead. The nice thing is that if we have this blanket mitigation, we could perhaps get rid of the piecemeal mitigations for the different Spectre variants. Just a thought.

As I already mentioned, eBPF, or BPF, is a small self-contained VM, so it's a perfect target for isolation and compartmentalization. As a note, BPF can run in two different modes. You can also set different capabilities, but in general there's privileged mode and unprivileged mode. The difference is that in privileged mode the filters are trusted by
root, while in unprivileged mode they can be loaded by an arbitrary user. The latter is nowadays disabled by default in most distros, because I think there was simply no good way to mitigate certain Spectre variants for it.

In general, BPF works as follows. You have a BPF bytecode program that you load into the kernel with the bpf syscall. In the kernel, a verifier checks that this bytecode follows all the requirements: that there are no weird loops and so on, and that the memory accesses are all fine. Then there's a JIT compiler, so the bytecode is JIT-compiled to make it very efficient. Then your BPF program runs, and it can access so-called BPF maps to store data. You can see them as a kind of heap for BPF programs. These maps can in turn be accessed from user space to retrieve information that your BPF program recorded. For example, the program can keep counters of how many times a certain function is executed, and you can then read the map from user space and print some statistics.

A quick detour on eBPF mitigations, just to hammer home the point that there are a lot of transient execution attacks that need to be mitigated in eBPF. First of all, Spectre v1: this was mitigated using array index masking and by having the verifier also go through paths that normally can't be executed, so verifying speculative paths, I guess. For Spectre v2, it had to do some things for indirect calls, basically adding the retpoline mitigation and some kind of fencing around indirect calls, which introduces a bit of overhead. Spectre v4 is the same thing: it does some fencing for
this variant of Spectre. And as a result of Spectre-BHB being disclosed, unprivileged eBPF was disabled by default in most distros, as already mentioned. There's a whole talk on BPF and Spectre mitigations, so if you're interested in this topic have a look at the presentation "BPF and Spectre: Mitigating transient execution attacks". I just wanted to list all these different things so you see that the JIT actually needs to modify the generated code to mitigate transient execution attacks.

OK, on to how we can isolate eBPF with the PKS compartmentalization we've been talking about. On the right you see the same picture as before: the BPF program is loaded, passed through the verifier and JIT-compiled, and then the program can run. We also have the compartmentalization with protection keys that I showed you before. Basically, we need to switch into the non-privileged domain when we enter a BPF program. That's fairly simple: when we call the function bpf trampoline enter, or one of the other entry points (there are a few different ways of entering BPF, so it depends on what the best intercept point is), we switch into our non-privileged domain by writing to the MSR. When we exit the BPF program we switch back to our privileged domain, the normal kernel, so that we have access to everything again. So each time you run a BPF program you have two extra MSR writes.

The thing to keep in mind is that eBPF has grown quite radically over the years, so it can do a lot of stuff. It can access some kernel memory directly, because that's something you really want to use. For example, it can read information from the task struct of the thread that is currently executing. There are also a bunch of helper functions in BPF for accessing different kinds of commonly used kernel data.
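
The enter and exit switch around a BPF program can be sketched as a small simulation. This is a hedged illustration in Python rather than kernel code: `SimulatedCpu`, `bpf_compartment` and `run_bpf_program` are hypothetical names, and the register values assume key 0 occupies the two lowest bits of the key register.

```python
from contextlib import contextmanager

PKRS_PRIVILEGED = 0b00       # no overrides: all kernel memory reachable
PKRS_NONPRIVILEGED = 0b11    # access-disable + write-disable for key 0


class SimulatedCpu:
    """Stand-in for the per-CPU key register; counts simulated MSR writes."""

    def __init__(self) -> None:
        self.pkrs = PKRS_PRIVILEGED
        self.msr_writes = 0

    def write_pkrs(self, value: int) -> None:
        self.pkrs = value
        self.msr_writes += 1


@contextmanager
def bpf_compartment(cpu: SimulatedCpu):
    """Hypothetical wrapper around one BPF program invocation."""
    cpu.write_pkrs(PKRS_NONPRIVILEGED)     # entry hook: only key 1 stays reachable
    try:
        yield cpu
    finally:
        cpu.write_pkrs(PKRS_PRIVILEGED)    # exit hook: restore full access


def run_bpf_program(cpu: SimulatedCpu, program):
    """Run a callable standing in for a JIT-compiled BPF program."""
    with bpf_compartment(cpu):
        return program(cpu)
```

Calling `run_bpf_program(cpu, some_program)` performs exactly two simulated register writes per invocation, one on entry and one on exit, and the `finally` clause restores the privileged value even if the program raises.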
Basically there are two approaches for still allowing this normal BPF functionality, which accesses some otherwise restricted kernel data. The first is to map all the pages that need to be accessed under the correct pkey. So you can map the task struct and similar data so that it's always mapped under protection key 1 and BPF can access it there. The other approach is to dynamically disable the protection whenever you go through one of these eBPF helpers that read kernel data structures. Obviously the first approach performs much better than the second one, where we have to keep rewriting the MSR contents, which is not ideal. There are also helper functions in eBPF that read arbitrary kernel memory. For those you simply need to dynamically disable the protection, but they are not that commonly used.

The second kind of memory that BPF programs can access, as already mentioned, is the BPF maps. These obviously need to be mapped with the right permissions, under key 1. Then there are things known as performance (perf) buffers and per-task storage. They're all very similar: memory arrays that need to be mapped under the right permissions.

As I mentioned, some of the BPF helpers can access arbitrary memory. These shouldn't be called by most BPF programs, so I think it's fine to dynamically disable the mitigation for them, similar to SMAP. It's just something to keep in mind for compatibility. Ideally, I think the way forward is to define a special, more restrictive BPF mode for use with this PKS mitigation, in which a program can't
use these at all.

In general, the memory that BPF programs can access is restricted. There are a few types of pointers that can be held in registers during BPF execution: a pointer to the BPF context, the maps I was talking about and their map values, the stack, and network packets. You need to make sure that these are mapped under the right protection key. Luckily they are all allocated with specific allocation functions, so it was pretty easy to modify those so that the memory is accessible from a BPF context. The same goes for sockets. So that was the first use case I implemented, and it was fairly doable to get it up and running for most of the BPF programs I was testing with.

Some observations from implementing and running it. The overhead of this mitigation for eBPF is obviously the MSR writes: when we enter BPF we do a write MSR, and when we exit we do another write MSR. However, writing an MSR is not super expensive. It's a matter of the ratio between how big the BPF program is and the cost of an MSR write, and for most programs that do a bit of work the MSR write is really negligible. I did some initial benchmarking. I wrote a BPF filter that traces all syscalls, and then I ran LMbench, a well-known benchmarking suite for the kernel. The overhead of doing these MSR writes on entering and exiting the BPF filter was about one percent, so barely noticeable, which is promising.

Another thing we could do, once we have this compartmentalization, is disable the BPF mitigations for Spectre and friends. Since BPF can't access arbitrary memory during speculative execution anymore, it's something
that we could think about: disabling those mitigations to get the performance speedup.

So what we've seen here is that PKS is a nice way of achieving this kind of compartmentalization. A nice property is that it's very compatible with existing compartmentalization techniques such as ASI. I'll get back to that in a bit, but you can basically get the same kind of security primitive as when switching address spaces altogether. Since I've been referencing it a lot, I want to point people to Google's address space isolation presentation from the Linux Plumbers Conference in Dublin in 2022. ASI is a generalization of KPTI where you switch address spaces when you go into an address space isolation critical section. It's very similar to what I described for PKS compartmentalization: you keep a minimal set of memory mapped that you actually need to be able to access. In contrast to switching address spaces, PKS is pretty lightweight: it's just an MSR write rather than a full address space switch, a full context switch. As a side effect it's also pretty flexible. You can use it in a lot more places, because performance is less of a concern when the switch is not super expensive.

Besides eBPF I did another implementation: a drop-in replacement for address space isolation using PKS. The idea of ASI is pretty simple. You split kernel memory into a privileged and an unprivileged domain, and you have two sets of page tables. The privileged one has the normal kernel mappings, just as I described for the PKS eBPF implementation, and the unprivileged one has a minimal set of pages that you need to be able to access. So essentially you get something that looks like this.
On the left you have the privileged address space that can access all the memory, and on the right you only access the parts that are really necessary. Once you enter this restricted domain, you do an asi_enter call, which switches the address space, does a context switch, and asi_exit switches back to the normal privileged one. And of course any memory that you need to be able to access inside the unprivileged domain needs to be mapped there; you need to keep track that it's actually mapped in the unprivileged page table. So you basically add an extra flag to kmalloc and vmalloc.
If you look at how it works, you have this asi_enter: if you're in privileged kernel work, for example memory management or the scheduler, then once you go into some other workload you enter the unprivileged domain. So you switch CR3 and then you do your workload, for example some hypervisor work. And then once you need to go into privileged work again, you do an asi_exit. It's very similar to what I mentioned with PKS, but here you switch the address space, the CR3 register, instead of writing an MSR.
So basically what I did is implement an ASI drop-in replacement with PKS. It's very similar to ASI, but instead of doing a CR3 switch I switch the MSR, as I already discussed before. There's no need to keep two sets of page tables; we can do everything in one. We just need to make sure that the domain key, the protection key, is the right one. So instead of mapping something in the other page table, if you do a mapping with this flag we need to make sure that the protection key is the right one. So if the GFP non-sensitive flag is provided to kmalloc, then we need to make sure that this maps with pkey 1. So basically I need to
modify kmalloc to handle this logic and use specific slabs, or arenas basically, for anything that should be this non-sensitive data. One downside I need to mention is that PKS cannot override execution permissions. You might already have noticed this: you can disable access or write, but not execution. So the security guarantees are slightly different from address space isolation, slightly lower, but in practice, for data attacks, it's the same guarantees. So basically, if you look at the previous one where you switched CR3, the implementation I did is a drop-in replacement, but instead of doing the CR3 switch you switch the MSR, which is a lot cheaper than a full CR3 switch.
So in conclusion, I've talked about how we can compartmentalize the Linux kernel with Protection Keys for Supervisor. To start off, I noted there's a lot of third-party untrusted kernel code that runs in ring zero, so memory errors or even transient execution gadgets can allow memory disclosure. What I presented here is a kind of framework that uses Protection Keys for Supervisor to compartmentalize the kernel and make data non-accessible. As first use cases I've implemented prototypes for eBPF and address space isolation drop-in replacements. The take-home message here is basically that it's a lightweight alternative to switching address spaces: it's only writing an MSR rather than doing a full CR3 switch, so it's a lot cheaper and there's no need for any TLB management and so on. So it's kind of a nifty way, I think, of doing it. Another benefit is that we can probably get rid of some mitigations in some cases if we want to speed things up. I haven't looked yet at what the actual security implications are there, but it's something to keep in mind: that we can actually get a performance speedup with this.
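A minimal sketch of the domain switch described above, written as a toy Python model rather than kernel code. It assumes the usual protection-key rights layout of two bits per key (access-disable and write-disable). The key numbers, the names ToyCpu, enter_restricted and exit_restricted, and the choice to hide key 0 inside the restricted domain are all hypothetical and only illustrate the idea that entering and leaving the domain rewrites a single register while the page tables stay the same.

```python
# Toy model of a PKS-style domain switch. This is NOT kernel code: it only
# mimics the 2-bits-per-key layout of the protection-key rights register.

NUM_KEYS = 16   # a 4-bit protection key per page-table entry gives 16 keys
AD = 0b01       # access-disable bit for one key
WD = 0b10       # write-disable bit for one key

# Hypothetical key assignment, chosen for illustration only.
PKEY_SENSITIVE = 0      # default key: ordinary (sensitive) kernel data
PKEY_NONSENSITIVE = 1   # data the restricted domain is allowed to touch


def pack_rights(rights):
    """Pack a {pkey: AD|WD bits} mapping into one 32-bit register value."""
    value = 0
    for pkey, bits in rights.items():
        if not 0 <= pkey < NUM_KEYS:
            raise ValueError(f"protection key out of range: {pkey}")
        value |= (bits & 0b11) << (2 * pkey)
    return value


def can_read(reg, pkey):
    """Reads are allowed unless the key's access-disable bit is set."""
    return ((reg >> (2 * pkey)) & AD) == 0


def can_write(reg, pkey):
    """Writes need both the access-disable and write-disable bits clear."""
    return ((reg >> (2 * pkey)) & (AD | WD)) == 0


REG_PRIVILEGED = pack_rights({})                          # every key accessible
REG_RESTRICTED = pack_rights({PKEY_SENSITIVE: AD | WD})   # sensitive data hidden


class ToyCpu:
    """Holds the single register that a domain switch rewrites."""

    def __init__(self):
        self.reg = REG_PRIVILEGED

    def enter_restricted(self):
        # Stands in for the MSR write on entry; the page tables are untouched.
        self.reg = REG_RESTRICTED

    def exit_restricted(self):
        # Stands in for the MSR write that restores full access.
        self.reg = REG_PRIVILEGED


cpu = ToyCpu()
cpu.enter_restricted()
print(can_read(cpu.reg, PKEY_SENSITIVE), can_read(cpu.reg, PKEY_NONSENSITIVE))
cpu.exit_restricted()
print(can_read(cpu.reg, PKEY_SENSITIVE))
```

In this model the two domains differ only in the register value, which is why no second set of page tables and no TLB management appear anywhere in the sketch. It says nothing about execute permissions, matching the limitation mentioned in the talk.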
That was basically all I wanted to talk about during this presentation. Feel free to reach out to me on Twitter or by email, and I'll be here to take some questions now after the presentation. Thanks.
Hey Sebastian, great talk. Quick question for you. On newer kernels, I know that in BPF tracing programs you can just dereference memory directly from the program, and I believe that'll compile down to something that does not indirect that reference through a bpf_probe_read. I'm pretty sure that's only possible in tracing programs currently. I was wondering if you looked at that during your research and if you had any mitigations for that.
Yeah, thanks, great question. As far as I can tell it goes through this specific wrapper, so what I've currently implemented is to just disable the mitigation at that point, because it's an arbitrary memory reference. I don't see a way to actually keep it supported with this mitigation in place. But as you said, it should only be for tracing programs, which is already a subset of possible BPF programs. So I think it's a limitation, but the only solution I see is to simply temporarily disable the defense.
In the case of the eBPF isolation prototype, do you think it would be possible to essentially trigger the JIT to generate a WRMSR gadget, which would then be used to put you in the pkey zero domain?
Yeah, that's a great question, and I think it's related to the question that was asked in chat here, if you can see that. From the BPF JIT, I actually don't know if it would be possible to generate that. But let's assume that you can somehow get one; then yes, it would for sure be possible. As I already mentioned, I think, it's not really a mitigation that can prevent control-flow hijacking or anything else
like that. So for that specific case, I haven't looked into whether it's possible. I would think it's improbable to be able to generate that from the JIT, but there could always be bugs that let you generate this stuff, so that's one way to circumvent this for sure.
Thanks.
As far as I understand, WRPKRU is not protected, so how can you prevent an unauthorized process from executing the instruction to change the register value? And also, do you have any plan to apply this to different architectures like AMD or Arm?
Okay, those are two good questions. For the first one, as I already said, the threat model here is that you have an attacker that can find information disclosure gadgets. If you have an attacker that can already hijack the control flow and execute arbitrary instructions, I think it's kind of a lost cause if you're in ring zero, so it's not something that I think you can realistically tackle with this. For the second part of the question, whether I want to extend it to AMD or others: if other vendors support similar hardware features like MPK, it should be pretty trivial to port it. But I haven't looked into it, since obviously I work at Intel, so the focus is on getting it to work on Intel products. But it's for sure something that could be extended to other vendors.
Question: there is a limited number of bits and a limited number of keys. If you have multiple eBPF programs, will every program allocate a separate key, or do you want to use the same key for everything? And if the same mechanism is applied not only for eBPF but, let's say, also for ASI, will they also share the keys, or will there be one hard-coded key for ASI and another hard-coded key for eBPF?
Yeah, so that's a great question, I
think. Currently, as I mentioned, I have two domains, so I have two different keys: the privileged and the non-privileged one. So basically all eBPF programs would share the same domain; they would be able to access each other's memory. That's by design, because we have a limited number of domain keys. If you had more, you would be able to do something like that, but for this purpose it was not really necessary, because I just want to split things up into a privileged and a non-privileged domain. But I think that's a good point, and it's something that could actually, I guess, be used for some kind of more hierarchical compartmentalization.
Thank you. I think that's it for questions. Thank you.
Okay, thanks everyone for listening, and sorry for not being able to tell this in person. It would have been great to be there to meet everyone, but it is what it is. But yeah, thanks for the great questions, everyone.