All right. Hello! This is about a use-after-free mitigation that Mark Brand and I have been working on. I'll start by talking about some fancy ways in which you could potentially extend RCU, as kind of a mental stepping stone, and then I'll go into the actual mitigation stuff: the motivation, why I think that a use-after-free mitigation is important, the design we came up with, some of the pitfalls and limitations of the current design, and then in the end some performance numbers and aspirational ideas for long-term development. So this first part is just a mental stepping stone. It's not necessarily something you should actually be implementing; some of this is maybe not a great idea. When you're working with RCU, you often have this scenario: you're inside an RCU read-side critical section, and you're holding an RCU reference to some object, and now you want to call something like kmalloc(), but you can't do that because kmalloc() can block. There are some classic options for dealing with this kind of scenario. For example, you can make a retry loop around rcu_dereference() and refcount_inc_not_zero(), or you can do an optimistic GFP_NOWAIT fast path and then have a separate slow path, but this is kind of ugly because you need to write extra slow paths for this stuff. So it would be kind of nice if we could instead just unconditionally increment the refcount on the object when we know that we're about to block, then terminate the RCU read-side critical section, and then after the blocking operation, start a new RCU read-side critical section, decrement the refcount again, and carry on in the original context. For this, we need some refcounting API that allows us to increment the refcount after it has already dropped to zero, and the refcounting API must basically guarantee that an object only gets freed if its refcount has stayed at zero for an entire RCU grace period. We can provide something like that on top of rcu_head relatively easily.
So basically what we want is: we have this refcount which initially starts with some non-zero value, then at some point the refcount drops to zero, at which point we schedule the rcu_head. If the object stays at refcount zero, we eventually get the RCU callback and then we can free the object; but if the refcount is incremented back up from zero at some point in the meantime, we need to cancel the rcu_head, basically, to get back to our initial state. Except rcu_heads can't be cancelled, because they're on a singly-linked list and such, and you can't just remove things from the middle of a singly-linked list; so instead we need to keep track of whether this has happened. We have an extra state bit in the object that records whether the object has been resurrected, meaning that it has come back up from refcount zero; and if so, we wait for the next RCU callback to occur and discard that callback. Then, if we are at refcount zero again, we schedule the rcu_head again, and when we get a callback after that, we know that we can actually free the object. Now, when you're looking at the code we have now, I think it looks kind of nicer, but there's still a small potential issue, which is that if you have a kmalloc() invocation like this, it's usually not going to block. It may block, and we have to write code that can tolerate that, but usually it won't. And if we unconditionally do this refcount increment and decrement dance around every kmalloc(), that's potentially going to cause cache line contention and such, so it would be kind of nice if we could avoid that without writing an extra GFP_NOWAIT fast path. And I think that is theoretically possible. I'm not saying that this is a great idea, but just as a mental stepping stone, we could basically reuse the idea of preemption notifiers. For this, we would have to change the RCU core to support this use case.
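The resurrect-able refcount state machine described above can be sketched as a small userspace model. This is purely illustrative: the class and method names are invented, and a real kernel implementation would use atomic refcounts and call_rcu() instead of these plain fields.

```python
# Hypothetical model of a refcount that may be re-incremented from zero.
# An increment from zero sets a "resurrected" bit; the next RCU callback
# is then discarded (and re-armed if needed) instead of freeing.

class ResurrectableRef:
    def __init__(self):
        self.refcount = 1
        self.resurrected = False
        self.callback_pending = False
        self.freed = False

    def get(self):
        # Caller holds an RCU-like read lock, so the object cannot be
        # freed out from under us even if refcount is currently zero.
        assert not self.freed
        if self.refcount == 0:
            self.resurrected = True     # logically cancel the rcu_head
        self.refcount += 1

    def put(self):
        assert self.refcount > 0
        self.refcount -= 1
        if self.refcount == 0 and not self.callback_pending:
            self.callback_pending = True  # call_rcu() in real code

    def rcu_callback(self):
        # Invoked after a grace period has elapsed.
        self.callback_pending = False
        if self.resurrected:
            self.resurrected = False
            if self.refcount == 0:
                self.callback_pending = True  # re-arm the rcu_head
            return                            # discard this callback
        self.freed = True
```

A resurrection between put() and the callback leaves the object alive; only a callback that arrives while the object has stayed at zero actually frees it.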
And basically the idea would be that instead of lifting the refcount up before the kmalloc(), you just have a little RCU pin data structure on your stack in which you register that you are currently using this object foo. Then you put this structure on a linked list that hangs off the task_struct, something like that. And then you tell the RCU subsystem: okay, from now on, it's okay to get preempted. If you then actually get preempted, the RCU code that runs inside the scheduler could walk through the singly-linked list that hangs off the task and lift up the refcounts of all the objects before terminating the RCU read-side critical section. And then after you're done with the blocking operation, you could restore all the state. So yeah, this was just a mental stepping stone; keep it in mind for the rest of the talk. And now let's switch to something completely different and start with the motivation of why I think that a use-after-free mitigation makes sense. If you're thinking about bugs in a security context, you can kind of categorize them into bugs that have locally scoped impact, which security people tend to call logic bugs, and bugs that have more of a global impact, like for example memory corruption. We've had local bugs, for example, in the VFS code: there were some missing checks in the path traversal code where you could have a bad interaction with renaming that would allow someone to path-traverse out of a container. Or in ptrace, we've had bugs in the PTRACE_TRACEME checks, for example, that would allow you to get access to a privileged process. Now, these are pretty severe bugs, but they have an impact that is fundamentally limited by the importance of the subsystem they're in, and that means that you can potentially find these issues. But on the other hand, we have bugs with global impact that are just memory corruption.
Like for example, in the futex code, there was some code that used iget() to take a reference on an inode, except you're not allowed to do that, and that leads to a use-after-free. We've had missing locking in the core dumping path that could then race with VMA operations. These have an impact that affects the entire kernel, independent of how important the subsystem actually is. And if we're comparing performance issues with security issues, I think that performance issues are pretty great. Performance issues, you tend to notice if you have them. If you notice that you have a performance issue, you can run profiling to try to figure out where it is. And if a code base has not been optimized a lot yet, you might be able to get some pretty large performance wins using relatively small optimizations. On the other hand, security issues tend to be invisible in many cases, and they can hide almost anywhere in your code base, as we've seen on the last slide. So I think that maybe it would be a good idea if we could turn security issues into performance issues, as long as we preserve these nice properties of performance issues: that you can actually fix the performance impact without crippling functionality. So if we're talking about use-after-frees, I guess we need a simple example of what we're actually trying to mitigate. Here's a simple pattern of what a use-after-free kernel exploit might look like. The scenario is: we have an object A with a member that we can access after the object has already been freed, and we can write an arbitrary value into this member. So as the attacker, we need to first allocate this object A, then get this object A to be freed again, and then we'll want to allocate a new object B at the old address of A, and we'll want to choose this object B such that it has an interesting member that overlaps with the member we can write through.
So for example, we could choose an object B that has a function pointer at the same offset. Then we choose some gadget in kernel code, write a pointer to it through the dangling pointer A, and thereby corrupt the function pointer in B; then we trigger a call through this function pointer in B, and we get kernel instruction pointer control. Now, we have a bunch of mitigations in such a kernel that can kind of make an attack like this harder. For example, we have attack surface reduction, like seccomp or SELinux, which might prevent you from being able to allocate this object A in the first place. If that can block such an attack, that's pretty great. After that, we have the step where the attacker needs to allocate an object B at the old address of A, and we have mitigations like, for example, memory tagging on newer ARM64 hardware, which kind of makes it hard to get the object B to be allocated at exactly the same address as A. Then we have the step where we have to choose the type of object B such that we have overlapping members, and struct randomization makes this really painful. Then we have KASLR, which kind of tries to protect against us being able to choose a pointer gadget in kernel code. Then at the end, we have the step where we are triggering a call through this controlled function pointer, and there we have CFI as a mitigation that tries to stop us from doing that.
Now, CFI is not something an attacker actually has to deal with for an attack like this, necessarily, because instead of targeting a function pointer in structure B and a gadget in kernel code, the attacker could choose a different structure B that has a pointer to some data at the same place, and then, instead of triggering a function call through this member, trigger reads and writes through that data pointer, or something like that, and that would mostly be similarly bad. If we're looking at all of the remaining mitigations that we have here, apart from the attack surface reduction stuff, all of this is more or less probabilistic. Now, there are different degrees of probabilistic protection here. For example, memory tagging makes it relatively hard to figure out the information that you need in order to break the mitigation, much harder than, for example, KASLR; and with struct randomization, probably even if you can leak the randomization, it's still going to be a pain to actually exploit. But yeah, all of this is basically probabilistic. So I think when trying to mitigate security bugs, it would be a good idea to have the mitigation as close to the actual bug as possible. Partly because, as we've seen with CFI, if the mitigation is too far removed from the actual bug, an attacker can potentially just choose a different path of exploitation that bypasses the mitigation; and partly because, if we have the mitigation sufficiently close to the bug, maybe that makes it easier to do performance optimizations at a later point that allow us to get rid of the mitigation overhead once we have fixed specific localized issues in code. So it would be great if we could just mitigate the actual bugs — like reference counting issues, locking issues, and such — that lead to use-after-frees.
But that's really hard, maybe even infeasible, to do in normal C code if you don't have a lot of annotations to tell you what's actually going on. So instead, we have to mitigate the immediate symptom, which is that we have a memory access through a dangling pointer to memory that has been reused in the meantime. Now, I'm not saying use-after-free — as in, using memory after it has been freed — because from a security perspective, memory before it has been freed and after it has been freed isn't really all that different, as long as the implementation doesn't put internal information into the freed memory. Only when the allocation has been reused do we get this effect where the use-after-free access causes data to be corrupted or interpreted incorrectly. My design goal here is to try to provide deterministic protection in software against use-after-reallocation, with the target environment kind of being desktop x86 systems. Now, the basic design that many use-after-free mitigations — like, for example, HWASAN and memory tagging — use is fat pointers. The basic idea here is that instead of a pointer just being a linear address, we put some extra information into the pointer that can be used to detect use-after-frees. The simple version that, for example, HWASAN and memory tagging do is: we put an additional little cookie in the pointer, and we put cookies on chunks of memory, and whenever code tries to access memory through the pointer, we check whether the cookie in the pointer matches the cookie on the memory, and if not, we crash. A difference is that HWASAN and memory tagging use these cookies for probabilistic protection; I would like something that can actually also provide deterministic protection.
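The cookie-matching idea can be shown with a tiny model. This is a hedged sketch, not the talk's implementation: the bit widths and function names are invented, and real schemes like HWASAN use only a few tag bits checked in hardware or instrumented loads.

```python
# Toy model of tagged pointers: a cookie in the unused upper bits of a
# 64-bit pointer must match the cookie currently assigned to the memory.

ADDR_BITS = 48            # linear address in the low bits (x86-64 style)
ADDR_MASK = (1 << ADDR_BITS) - 1

def make_ptr(addr, cookie):
    """Pack a cookie into the upper bits of the pointer."""
    return (cookie << ADDR_BITS) | (addr & ADDR_MASK)

def check_access(ptr, memory_cookie):
    """Compare the pointer's cookie with the memory's cookie; on a
    mismatch the access is a (detected) use-after-free."""
    cookie = ptr >> ADDR_BITS
    if cookie != memory_cookie:
        raise RuntimeError("use-after-free detected")
    return ptr & ADDR_MASK    # decoded raw address
```

On free, the memory's cookie would be changed, so all previously handed-out pointers with the old cookie fail the check.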
Fat pointers sound as if the pointers get bigger, and in some designs that's the case, but I think we really should strive to have a design where pointers stay the same size, because otherwise lockless pointer updates get much harder and we risk turning existing data races into pointer-tearing issues, and that gets complicated. And also we'd just be using much more memory if our pointers were bigger. So just like HWASAN and memory tagging and so on, our fat pointers still fit into 64 bits. Now, when you're looking at a bit of code like this and think about how we would have to instrument it: we have this pointer argument here, and then inside this function, we have three accesses through the pointer. The trivial implementation of such a mitigation would be that every time we have a memory access like this, we perform some access check; that would look like this. The issue with this is that every time we do a check, we have some overhead for the check, so it would be kind of nice if we could avoid that. It would be nice if we could just add a check at the start of the function that verifies that the pointer is still live, and then inside the function just keep using the decoded raw pointer that this operation provides us. Unfortunately, this kind of introduces race conditions. Like, for example, if this other function that we're calling here decides to block, or even decides to free the pointer that we're using, then the access below that in the loop is going to access freed memory, and the check that we did before doesn't help us. So what we can do is go back to this idea that I introduced with RCU earlier: we keep track of all of the objects that any given task is currently using on its stack, by having these pin data structures in the stack frames and letting them form a linked list that hangs off the task_struct, or something like that.
So then, if we use RCU-like delayed freeing, we can make sure that nobody is actually referencing the object anymore. And we can optimize this a little bit: instead of having a single pin structure on the stack for every object we're accessing, we can have one pin structure per frame that contains an array of these object pointers. And instead of using a variable inside current, we can use a per-CPU variable and then switch it on task switch, just like the stack protector does it, for example. I was initially considering using stack unwinding instead of this linked-list scheme — so, just using the normal exception unwinding infrastructure, and then having some extra information in the unwinding metadata that tells us where these pins are located. But that's kind of difficult, because you get problems any time the unwinding is unreliable, and it gets more complex because of all the infrastructure around exception unwinding, and you need to do stuff like ORC unwinding and the noinline keyword, which is really not pretty. So with this design, we're doing object-level checks: instead of checking every single access, we're just doing one check per object that we're referencing in a function. And we also need some storage for the reference counts that we need for this RCU refcounting scheme. So it would really be a good fit if we could have some per-object metadata structure, instead of tagging fixed-size chunks of memory. Now, the most straightforward way you could design this would be to make the fat pointers look like this: instead of having a linear address in the bottom half of the pointer, you'd have a base pointer that points to the head of an allocation, and then you'd have an offset inside the pointer that tells you where inside that allocation the pointer is actually pointing. Then you can find the metadata using just the base pointer, and find the actual data as base pointer plus offset.
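The per-frame pin structures chained off the task can be modeled like this. All of the names here (FramePins, pin_list_head, and so on) are invented for illustration; the real scheme would live in compiler-generated stack frames and a per-CPU variable, as described above.

```python
# Model of per-stack-frame "pin" records forming a linked list off the
# task, so the freeing machinery can see which objects a task still uses.

PINS_PER_FRAME = 4          # assumed array size per frame

class Task:
    def __init__(self):
        self.pin_list_head = None

class FramePins:
    """One pin record per stack frame, pushed on function entry."""
    def __init__(self, task):
        self.objects = [None] * PINS_PER_FRAME
        self.next = task.pin_list_head      # link into the task's chain
        task.pin_list_head = self
        self.task = task

    def pin(self, slot, obj):
        self.objects[slot] = obj

    def unlink(self):                       # popped on function return
        self.task.pin_list_head = self.next

def live_objects(task):
    """Walked e.g. when the task goes off-CPU, to find (and in the real
    scheme, refcount) every object the task is still referencing."""
    out, frame = [], task.pin_list_head
    while frame is not None:
        out += [o for o in frame.objects if o is not None]
        frame = frame.next
    return out
```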
But this comes with some issues. For example, if you turn a linear address into base pointer plus offset, you need more space for that, so the bits in the pointer get kind of scarce. And you can also get into difficulty if you want to reuse physical memory for other things. For example, if you want to take a page that used to be a slab page and now use it as an anonymous page for user space, suddenly the memory that used to store metadata is user space memory, and if you get a use-after-free access, the metadata checks would read from user-space-controlled memory. That's bad. So instead, I decided to go with a design where, instead of storing base addresses, we store object indices that then index into a metadata table. This has the advantage that we have a much denser identifier space, so we have more bits to spare in our pointers, and it means that we can much more easily reuse physical memory for other purposes. And it means that if you run out of possible cookies — we have a 16-bit cookie field in the pointer, and if we've used all of the 2^16 possible cookies and we don't want to reuse them because we want deterministic protection — then we can just use a different metadata table entry, with a different object identifier, to refer to the same physical memory again, and we just waste a little bit of metadata memory and not much else. The biggest disadvantage of this is that it comes with an extra memory indirection; especially if you have something like a pointer chase, this might double the latency caused by memory accesses. To integrate this scheme with the slab allocator, we can make use of the fact that in struct page we still have 32 bits free for pages that belong to the slab allocator. For every page that's used by the slab allocator, we can reserve a corresponding contiguous chunk of entries in the metadata table, and then those 32 bits in struct page can be used to refer to the starting index in the metadata table.
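A sketch of the index-based fat pointer encoding: the pointer carries a cookie, a metadata-table index, and an offset into the object. The exact bit widths here (17-bit offset, 31-bit index, 16-bit cookie) are an assumed layout for illustration, not necessarily the one from the talk.

```python
# Fat pointer = [cookie:16 | table index:31 | offset:17], with a global
# metadata table mapping index -> (raw base address, current cookie).

OFFSET_BITS, INDEX_BITS, COOKIE_BITS = 17, 31, 16

metadata_table = {}   # index -> {"raw_base": ..., "cookie": ...}

def encode(index, offset, cookie):
    assert index < (1 << INDEX_BITS) and offset < (1 << OFFSET_BITS)
    return (cookie << (INDEX_BITS + OFFSET_BITS)) | (index << OFFSET_BITS) | offset

def decode(fatptr):
    """Return the raw address, or None if the cookie is stale (i.e. the
    object was freed and the table entry's cookie was bumped)."""
    offset = fatptr & ((1 << OFFSET_BITS) - 1)
    index = (fatptr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    cookie = fatptr >> (INDEX_BITS + OFFSET_BITS)
    meta = metadata_table[index]
    if meta["cookie"] != cookie:
        return None
    return meta["raw_base"] + offset
```

Note how freeing only touches the table entry: the raw memory can be reused (even by user space) without the check ever reading it.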
So then we can go from the struct page to the metadata table using this field; we can go from the metadata table to the actual data it refers to using the raw pointer slot inside the table; and we can go back from the raw address to the struct page using virt_to_page() as normal. And then, if we run out of possible cookies for one of the entries in the metadata table — if the cookies are depleted — we can point the metadata table entry to a different place in the metadata table, a fallback entry, and then just use the index of the fallback entry in our pointers instead. With this scheme, we can kind of split the identifier space into 2^30 normal entries and 2^30 entries just for this fallback stuff. That gives us enough normal entries to handle either about 8 gigabytes of kmalloc-8 allocations — you usually don't have a lot of those anyway — or something like 440 gigabytes of larger allocations, which are much more frequent. So I think this is fine. The tricky part is the fallback entries, because every time you've done allocation and freeing 2^16 times, you have to throw away one of the fallback identifiers, and you waste the 16 bytes of metadata memory it used. And if you repeat this 2^30 times, which is the number of fallback entries you have, then you completely exhaust the fallback identifier space. So that's up to 2^46 allocate/free cycles. Even in a very pessimistic example, if you're allocating every 100 cycles on a 2 GHz CPU, that would still give you enough identifier space for 40 days. So I think that's completely sufficient. The slightly bigger problem is that you're actually leaking memory with this: every allocation basically leaks 2^-12 bytes. In this pessimistic example, you'd be leaking something like 400 megabytes per day. I think that's not really a problem on desktop systems, because in practice they perform allocations at orders-of-magnitude lower rates.
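The exhaustion and leak numbers above check out as simple arithmetic, under the stated pessimistic assumption of one allocation every 100 cycles on a 2 GHz CPU:

```python
# Back-of-the-envelope check of the fallback-entry numbers.

COOKIES_PER_ENTRY = 2 ** 16         # 16-bit cookie field
FALLBACK_ENTRIES = 2 ** 30          # half of the identifier space
allocs_until_exhaustion = COOKIES_PER_ENTRY * FALLBACK_ENTRIES  # 2**46

allocs_per_second = 2_000_000_000 // 100    # one alloc per 100 cycles
seconds = allocs_until_exhaustion / allocs_per_second
days = seconds / 86400                      # ~40 days of headroom

# One 16-byte metadata entry is burned per 2**16 alloc/free cycles,
# i.e. 16 / 2**16 = 2**-12 bytes leaked per allocation on average.
leak_mb_per_day = allocs_per_second * (16 / COOKIES_PER_ENTRY) * 86400 / 1e6
```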
But if you think that this actually is a problem, there are some bonus slides that I won't be able to fit into this time slot that describe how you can work around it. This is a terrible slide — I would have replaced it if I could have replaced the slides. But basically, as I've said: because we're using this RCU-like scheme for tracking which things we're accessing, we also need an RCU-like scheme for freeing allocations. And RCU always involves this global synchronization where all of the other CPUs have to check in and say, okay, I'm not using anything anymore, I'm not in a read-side critical section, stuff like that. So that introduces overhead that we don't want. There's an optimization we can do, which is that when we allocate an object, we can store in the object's metadata on which CPU we did the allocation. Then, every time we access the object, we check whether the current CPU matches the CPU on which the object was allocated; if not, we wipe this number from the metadata. And then, when we're freeing an object and we see that it still has a CPU number associated with it, and that it is the number of the current CPU, we can accelerate the freeing, because we know that only the current CPU can be using the object — no other CPU can currently be using it. So here's a state diagram of what this delayed-free machinery looks like. Initially, we have an allocated object on the left side. Then the user invokes kfree(), and at that point we go into the rest of the machinery. If the object has a refcount that's non-zero, we just put the object into floating state, which means it's not on any freeing queues; it's still being referenced by an inactive task. When the refcount of the object drops down to zero, we put the object onto the per-CPU queue.
And once an object is on a queue, even if the refcount goes back up from zero, we don't remove it from the queue, because just like in the RCU case, we just have a singly-linked list; we can't remove things from a queue without processing the entire queue. So when we're then doing our local free list processing, we go through our per-CPU queue, and for anything that has a refcount that's bigger than zero, we put it back into floating state. If the refcount is zero and the object has only ever been accessed from the local CPU, we can free it directly, which hopefully happens in most cases. In all other cases, we need to move the object over onto the global queue. Then we keep track of how many things are on the global queue, and once we've accumulated a bunch of objects, we kick off a global synchronization. At the start of this, we move all of the things that are in "new" state on the global queue over into "old" state, and then we ask all of the CPUs to please check in and tell us that they have at least once turned all of their live references into refcounted references and back. Whenever an object has its refcount elevated back from zero, this old flag gets turned back into a new flag. Once all of the CPUs have checked in, we can kick off the global queue processing, where anything that is still in old state with refcount zero can actually be freed. Right. One thing that's kind of tricky is if we have code that looks like this, where we have a pointer argument, but the pointer argument is not always accessed by the function — only if we actually enter this loop, so only if we have a count parameter that is non-zero. If we put the access check at the start of the function, then if someone supplies us with a bogus pointer but a count of zero, so the pointer is never actually accessed, we would still be performing an access check.
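The delayed-free states just described can be written down as a compact model. The state names and function names are invented for illustration; the real machinery would run over per-CPU and global singly-linked lists.

```python
# Model of the delayed-free states: FLOATING (freed by the user but
# still referenced), on the per-CPU queue, on the global queue in NEW
# or OLD state, and finally actually FREED.

FLOATING, PERCPU_Q, GLOBAL_NEW, GLOBAL_OLD, FREED = range(5)

class Obj:
    def __init__(self, cpu):
        self.refcount = 0
        self.alloc_cpu = cpu        # wiped (None) on cross-CPU access
        self.state = None

def kfree(obj):
    obj.state = FLOATING if obj.refcount > 0 else PERCPU_Q

def process_percpu_queue(obj, current_cpu):
    if obj.refcount > 0:
        obj.state = FLOATING        # still referenced: back to floating
    elif obj.alloc_cpu == current_cpu:
        obj.state = FREED           # fast path: only this CPU saw it
    else:
        obj.state = GLOBAL_NEW      # needs global synchronization

def global_sync_start(obj):
    if obj.state == GLOBAL_NEW:
        obj.state = GLOBAL_OLD      # demote NEW -> OLD, then wait for
                                    # all CPUs to check in

def global_queue_process(obj):
    if obj.state == GLOBAL_OLD and obj.refcount == 0:
        obj.state = FREED
```

A resurrection (refcount back up from zero) would flip an OLD object back to NEW, delaying its freeing by one more synchronization round.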
And then, if the access check fails, we'd be crashing the kernel, even though the pointer is not actually being accessed. That would be kind of bad, and we shouldn't do that. On the other hand, if we put the access check inside the loop, we have the performance problem again, because we'd be doing an access check on every loop iteration. So the way I've solved this is that we steal an idea from pointer authentication schemes and say: when you're verifying that a pointer is still live, instead of throwing an exception — panicking — if the check fails, we just return a non-canonical pointer from the check. Then, if anything inside the function actually accesses the non-canonical pointer, we get a crash; and if it's not accessed, nothing happens. Oh, and there's a small caveat with this, which is that if you have a pointer that is loaded from memory before the pointer actually becomes valid, then you have issues. This would be a particularly big problem if pointers could be reused: if you could read a pointer from memory, then the pointer becomes invalid, it becomes valid again, then you do some comparison to check whether it is valid and then you access it — then you'd get a bogus use-after-free warning. But because with our scheme we're never reusing these fat pointers, I think this is probably fine. The current implementation has a bunch of limitations. In terms of coverage, it's currently not watching anything in idle tasks at all, including interrupts that arrive in an idle context; that should be relatively easily fixable, but the prototype doesn't do it. It's disabled for task_struct, and it's also disabled for all constructor slabs and SLAB_TYPESAFE_BY_RCU slabs, because those are a special feature of the slab allocator where you can basically keep an object partially initialized across freeing and reallocation, and sort of perform use-after-free accesses to objects after they've been reallocated, by design.
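The "return a poisoned pointer instead of panicking" trick can be sketched as follows. The poison value and helper names are invented; the key point is that the failure is only fatal if the pointer is actually dereferenced later.

```python
# Liveness check that returns a non-canonical address on failure,
# so the crash is deferred until (and unless) an actual dereference.

POISON = 0xDEAD_0000_0000_0000   # non-canonical on x86-64 (>= 2**48)

def decode_raw(fatptr):
    return fatptr & 0xFFFF_FFFF_FFFF     # strip the metadata bits

def check_live(fatptr, is_live):
    # In the real scheme, is_live would come from the cookie/metadata
    # comparison; here it is passed in to keep the model small.
    return decode_raw(fatptr) if is_live else POISON

def deref(addr, memory):
    if addr >= (1 << 48):                # non-canonical address: fault
        raise RuntimeError("page fault: poisoned pointer dereferenced")
    return memory.get(addr, 0)
```

So a function with `count == 0` never touches the poisoned address and runs to completion, while any real access through it faults.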
So both for this mitigation to work for those slabs, and also for things like ASAN and memory tagging to be able to work on those slabs, we should probably provide a different implementation of these mechanisms that works better with memory-safety instrumentation. Also, the current prototype does not cover anything other than slab allocations. So it doesn't cover stack use-after-frees, and it doesn't cover struct page or pages in the linear map — like anonymous pages and file pages — or vmalloc allocations. At the moment, it doesn't even cover kmalloc_large(), although that should be relatively easy to fix. Also, it does not cover references that are coming through things like IOMMU mappings, or page tables, or other things in the hardware like that. It might be interesting to think about, if you wanted full 100% coverage, what the infrastructure for tracking these kinds of references would have to look like. And a very big limitation of a partial mitigation with partial coverage like this is: if you have a use-after-free on an object that is covered by the mitigation, but this object has references to other things that are not covered — for example, a pointer to a struct page — then, if you have a racing use-after-free-style issue, you could have a scenario where the object is currently being torn down: the object is still considered live by the slab allocator, but you've already dropped all of the references on the things the object is pointing to. At that point, something like a struct page that you're pointing to could synchronously be freed, and then an attacker who is racing to dereference this page pointer could turn this mitigated use-after-free into a real use-after-free on the struct page it points to.
I think that for a mitigation like this, it would make sense to provide the programmer with some way to remove the performance impact of the mitigation if they invest sufficient time into writing high-quality annotated code. So basically, we'd tell the programmer that if they want their code to run faster, they should prove to the compiler that certain aspects of the locking and such are correct — like lock balancing, and that certain members of structures are protected by locks, and so on — and that would then allow us to not check certain things in our use-after-free mitigation. One way you can kind of think about this is: we need mitigations like this to make C code safe. On the other hand, we don't need mitigations like this to make Rust code, for example, safe. But if we don't want to invest the time to rewrite all of our code in something like Rust, maybe it would be cool to figure out whether we can have a sliding spectrum between C and Rust, where the more Rust-like you make your code by putting annotations on it and such, the less impact the memory-safety mitigation needs to have. But I don't have any concrete plans for this; it's all very hand-wavy. Okay, so let's get to the terrible, terrible performance numbers. And for this, I'm going to have to switch over to a different presentation, because I didn't have those ready in time for the slides. So let's look at memory overhead first. I tested this on a machine with eight gigabytes of RAM, and I mostly filled memory with filesystem cache stuff; you can see at the bottom of the slide that most of the allocations were things like dentries, buffer heads, inodes and so on. The total amount of slab memory usage here was around 380 megabytes, and the mitigation was using something like 17 megabytes for its metadata, so the overhead relative to the memory used by slab objects was something like 4.4%.
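The memory-overhead percentages quoted here follow directly from the measured numbers:

```python
# Quick check of the memory-overhead figures: ~17 MB of metadata versus
# ~380 MB of slab memory on an 8 GB machine.

metadata_mb = 17
slab_mb = 380
total_mb = 8 * 1024

overhead_vs_slab = metadata_mb / slab_mb    # ~4.4-4.5% of slab memory
overhead_vs_total = metadata_mb / total_mb  # ~0.2% of total RAM
```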
The overhead relative to the total memory of the system was only something like 0.2%, but you could argue that that number is kind of cheating, because it counts physical memory that's not actually covered by the mitigation. So yeah, maybe 4% is more like the real memory overhead number here. Anyway, it's not too terrible. Now, let's get into CPU overhead. CPU overhead really depends on what you measure, of course. So let's start off with a truly terrible benchmark, which is building the kernel. I did this with a tinyconfig, make with as much parallelism as the machine has cores, and hot VFS caches; and with the mitigation fully enabled, I got something like 8% CPU overhead. So maybe that's not too terrible — but of course, this is a benchmark that is very heavy on user space CPU execution. A maybe more interesting benchmark is, for example, git status, because that does a lot of kernel-heavy stuff; but it still doesn't have a lot of IPC and doesn't do many kernel allocations and such. There I got something like 40% overhead when testing with all of the infrastructure enabled, but without actually letting the slab allocator hand memory over to the mitigation — so all of the pointers were still unencoded, and none of the delayed freeing was happening. With the mitigation fully enabled, I got something like 60% CPU overhead, which is not great. But we can do worse. If we have a producer/consumer pattern — a microbenchmark with a Unix domain socket, where one CPU is sending single-byte messages and the other CPU is consuming them — then we're very much exercising this global freeing machinery, and we get terrible cache locality: partly because the global freeing stuff is ruining cache locality by itself, and partly because our metadata packs the metadata for four objects into a single cache line.
So we're going to get cache line contention on this metadata. With this, I got something like 160% CPU overhead, which is really abysmal.

So I think that, in conclusion, the memory overhead for this kind of thing is not really a big problem, but the CPU overhead is really problematic unless you're running something that is very userspace-heavy. And I think that lowering the CPU overhead to something more reasonable would probably require a lot more lifetime annotations and such on common code.

All right. So with that, here are some links to the code that I just uploaded. And now I guess it's time for questions. Okay, let's see, we do have a lot of questions.

Someone asked what the kernel config option for struct randomization is. I have no idea offhand, but I can look it up... GCC_PLUGIN_RANDSTRUCT does that.

Someone asked whether I'm planning to upstream this. That's a very good question. I think it would be great if we could have something like this upstream, but at the same time, I'm aware that at the moment, the performance impact of this is probably not really something many people would want to deal with. So I think that before upstreaming something like this, it would be necessary to at least make some small steps in the direction of providing annotations that let you reduce the performance impact of this stuff to some degree, and then apply those annotations in some of the more performance-critical parts of the kernel, like maybe some parts of the VFS, to make it more likely that people will actually try something like this out. So far I've tested it in QEMU, and I've tested it on a physical AMD workstation. But I wouldn't recommend running this on a server that might actually come under attack at this point.
Someone asked how this plays with subsystems that use extra bits in pointers, like pointer tagging. That should work fine, because other mitigations, like memory tagging, already use the upper bits of pointers. And anything like XArrays that uses pointer bits for other purposes normally puts the extra bits in the lower part of the pointer, and all of the fat pointers that I'm returning keep their offset bits out of the low bits, so they're actually more aligned than native pointers. So that should be fine.

Someone's asking whether I would want to discuss this at the kernel summit to come up with solutions. Maybe. I don't want to commit to anything at this point. I've been working on this for some time now, so I think in the foreseeable future I'll probably be working on some different stuff.

Someone's asking whether this interacts with slab debugging. I think it shouldn't interact with it too badly. Basically, the mitigation is glued in as a layer between the normal slab functionality and the API that kernel code uses to interact with slab. So when you're making an allocation, you go through the normal slab machinery, and then at the end, just before the pointer gets returned, we turn it into a fat pointer and register with the mitigation that this pointer is now allocated. And when you free a pointer, we basically redirect from kfree, or whatever you're calling, over into the mitigation machinery; only after the mitigation machinery is done and says, okay, this pointer is unused now, do we actually put it back into the slab machinery. So it should mostly work fine with slab debugging, except that one thing that probably doesn't work is that slab has a feature where it will save the stack traces when you're allocating and freeing things.
And with this mitigation, the allocation stack trace would still look the right way, but the freeing stack trace would just point into the mitigation machinery and not the place where the actual freeing happened.

Someone's asking whether I'm talking about sparse annotations or something different here. I think you'd need annotations that are slightly more complicated than what sparse provides at the moment. Sparse provides some lock balancing annotations and does some level of verification on those. So you'd need annotations like that, but you'd also need annotations, for example, on structure members that say which locks protect those structure members. I think Clang already has some annotations like this for some basic lock verification, so it might make sense to port something like that over to the kernel. But that's a somewhat basic level of verification that I think isn't really designed for things where you can, for example, access a pointer either through RCU or under a mutex, and where you have more complicated locking scenarios like that.

Someone's asking whether I've measured how much of the CPU impact actually came from extra CPU operations versus cache misses and TLB misses and such. I don't have precise numbers on that, but in the performance slides I showed earlier, you could see that when I didn't fully turn on the mitigation, such that all of the pointers flying around the kernel were still raw pointers, I already got something like 40% performance impact, compared to 60% with the mitigation fully enabled. So clearly this first 40% is not attributable to TLB misses or cache misses or anything like that.
Someone's asking what I think about using these cleanup attributes that GCC and Clang provide, which let you do things that look kind of like C++ destructors, where you get an automatic function invocation when a function returns or when something goes out of scope. I think in userspace application code that's certainly very nice to have, because it lets you avoid coding extra error handling paths and such. I'm not entirely sure how I feel about doing something like this in the kernel, though. It's a good fit if you just have some cleanup operations, like freeing some stuff, that can happen at any point; but if you have cleanup operations that require you to still be holding some lock or something like that, it gets more complicated. Basically, if any of the cleanup steps you're doing have interdependencies and can't just be reordered arbitrarily, I'm not sure whether you risk the compiler at some point just deciding to put them in a different order and maybe breaking things that way. So I'd be kind of cautious about doing that.

Someone asked what other examples of annotations would be useful. Not really an annotation about memory safety, but an annotation that would be helpful for a mitigation might, for example, be an annotation on a pointer that says: this pointer is only rarely written, it doesn't matter how much storage we use to store this pointer, and we're only accessing this pointer under a lock. Then the compiler is allowed to duplicate the pointer and have the fat pointer stored alongside the already-decoded raw pointer, and the compiler could maybe use that to avoid doing the extra decoding step when it reads the pointer and knows that it's just going to use it for direct memory access, without handing the pointer off anywhere else.
So that might be useful if you're not able to provide annotations that remove the instrumentation overhead completely. All right, I think we can wrap it up. Thanks everyone for listening. Bye!