Howdy everybody. I'm just getting started up here. All right. My name is Kees Cook. This presentation is on the state of CFI — Linux kernel control flow integrity. I've got my slides going in the screen share, which I think is a button on your screens, and if you can't see that for whatever reason, you can look at the schedule and get the PDF of these slides from there and follow along. Anyway, I'm going to go ahead and get started. So this is basically the agenda for the talk: what CFI is, what we want out of it, the implementations that Clang provides, a little about what's happening with the Pixel phones and the Android ecosystem generally, the gotchas we ran into in the upstreaming process, and a bit of an example of how you can do it yourself. So the main question is why anyone should care about this at all. A lot of memory corruption attacks ultimately end up running code — attacker-chosen code — and CFI is one of the ways we make sure that the order in which things are expected to run doesn't get violated. Flaws come in a lot of different forms, and there are a lot of different situations where an attacker ends up able to manipulate control flow, so being able to restrict what's runnable has quite a bit of value. Looking back long ago, there was actually writing directly to kernel code: as an attacker, if you have memory corruption and control over where you're writing, you could just write directly to kernel code and change how it operates. In that situation, the target has to be executable and writable. That's ancient history — protections have long existed for this. Of course, from user space, what's writable and executable? Well, user space is writable and executable — it's under the control of the user — but not the kernel, because of memory protections.
But if you end up with an exploit where you can start writing to kernel memory, suddenly — in the long-ago days — you could write to everything. And this is obviously a simplified view: there's a lot of kernel data, and there's the text and modules, which hold the executable code. So forever ago there were no permissions at all, and then finally we got the non-executable changes, so you couldn't run data — you couldn't dump stuff into the heap or stack and then just run it. With the addition of the NX bit and similar mitigations, that entire section went away as far as what you can target as an attacker. And finally we got read-only as well, which says, okay, it's not writable-and-executable anymore: the kernel code is not writable, so even if you can target it, you can't change it. But that left user space — you could dump your code into user space and run it from the kernel. And with that separation closed off too, suddenly you couldn't write and execute from anywhere anymore. But this opens the door to the main current attack method, which is calling into existing kernel code. The kernel has all the functions you want as an attacker, so those are the ones you're going to go after: you write to stored function pointers on the heap or on the stack, manipulate what's going to be called in what order, and hijack indirect function calls. Now, this raises the question: what do I mean by indirect function calls? Direct function calls are the ones that, in the assembly, call one thing — they have a hard-coded target. The example here is some function that calls do_simple, but it is always calling it; that can't change without rewriting the actual memory of the code. And if you look at the Intel assembly of that, you can see it is calling a specific hard-coded address. An indirect function call is where you've loaded a value into a register and you're calling whatever the register holds.
On the left here, you can see an example of a function table — do_something_simple, do_something_fancy — and then where you save which one you want to call into. If you're able to manipulate that saved actions list, or otherwise corrupt memory there, you can call basically anywhere in the entire kernel. For purposes of categorization this is considered the forward edge: the control flow's forward edge, heading toward the new function. And just to show there are a couple of different places where you can manipulate this with function pointer overwrites: you can be overwriting heap memory, or you can be overwriting stack memory. Moving on, the backward edge is a stored function pointer on the stack — we're returning from this leaf function, do_simple — and that's the backward-edge control flow, returning to where we came from. That can also be manipulated on the stack; this is the core of the well-known return-oriented programming, or ROP, attacks. So the question is: all right, we've got all these stored function pointers scattered throughout heap and stack — what are we looking at? From user space, again, obviously we can't write to the kernel, so no problem there. But then what have we got? The heap and the stack in the kernel are writable — that's no good. And we could also get to user space if we're calling from the kernel, but SMAP and PAN turn that off. So we're left with only the heap and stack in the kernel as targetable areas. To break that down: we've got writable memory in heap and stack, we have some kind of memory corruption that overwrites a stored function pointer, our forward edge looks sort of like that in assembly, and our backward edge is just a return. And we can execute basically any executable range in the entire kernel.
Any byte we want — and with variable-length instruction encodings, we can even call into misaligned, mid-instruction addresses. So the goal of CFI is to ensure that the forward and backward edges are actually going to expected places — that they are effectively a subset of all the possible entry points into kernel code. But that's kind of a lot of entry points, which we'll get into, and the backward edge as well. So we validate indirect function pointers at call time, and one of the ways to chop down what can be called is to look at classes of functions — to look at the prototype of the function and say, okay, this one returns int, it takes an unsigned long and a struct file. You can generate a uniqueness, a hash, from those prototypes, and therefore different prototypes can't be mixed: in the foo/bar example here, one returns void, one returns int, so at all the places expecting to call a function that looks like foo, you cannot instead call bar. The help from hardware here — because we're mostly going to be talking about software implementations — is not great. BTI and the other landing-pad instructions coming to the CPU instruction sets, like ENDBR, just mark whether something is an entry point at all. So instead of being able to call any byte in the kernel, you can call any function in the kernel, which for most purposes doesn't really change an attacker's situation: they're not going to do gadget attacks, they're just going to call directly to a full function, and probably they were going to do that to begin with. So "any function" as a limit is not much of a limit, but being able to only call matching function prototypes narrows the attack surface of what you can call pretty strongly. And what do I mean by that?
So again, compared to that prior slide, we have all of executable memory, but from this one position — this one bit of code — no matter what you can write, you can only call into the subset of functions that the compiler identified as being callable from that position. It hugely narrows what is reachable from a given location. Now, implementing this with Clang's CFI is interesting: it requires link-time optimization. The idea is that the CFI pass needs to be able to see the entire program at once to figure out all the prototypes and all the destinations, and that's actually a pretty large bit of work, getting LTO working. With that in place, what Clang does is collect all the functions that have matching prototypes into jump tables, and then verify at indirect call sites, through the jump table, that it's okay to call that function. And I want to get into that, because the piece I had wanted to see a while back was: how is this actually implemented? So we're going to take a little bit of time to look at the assembly, first without Clang's forward-edge CFI. From my prior trivial example, there are these two functions, the simple one and the fancy one, that we're calling from the action launcher. Normally we take our arguments, calculate what we need, call directly into those functions, and off we go. Now, if we add Clang CFI — those are the command-line options for enabling it — what happens is a jump table is created for these two functions because they share the same prototype. And you can see that it's literally a jump to each of the targeted functions, with trapping instructions in between to catch anything that jumps into the wrong place. I'll walk through what the added assembly does. The first thing is we load the start of the jump table, so we can begin the verification.
This is before we've made our call. We subtract the address we're trying to jump to — we subtract the table start from our copy of the function pointer — and then rotate it right by three. Note that this is not a shift, this is a rotate: if there's anything in the low bits, it ends up in the top bits, so the address needs to be exactly aligned and in range for this to work. The hex 3 is there because we're looking at eight-byte chunks — that's why the rotate-by-three exists. So our index into this table must be zero or one; if we're above that — including from any of those bits that got rotated around to the top — we fail and go to the ud2, which will trap. Otherwise we continue on, call into the table, and jump to the correct function. As a bit of an aside, the type names in here actually embed the function prototype — it's type mangling, like C++ — so you can see which function prototype each of these tables represents. So the question is: are there better implementations than this? There are, I think, two worth mentioning. For improved speed, instead of jump tables and this chain of checks, you can put hash bytes before the function start and check them at call and return destinations. This is what the PaX Team's RAP does. But this isn't ultimately compatible with execute-only memory, because that code sequence is actually reading the hashes out of the text segment. Once we get to the point where we've got CPUs that can do execute-only memory — where attackers can't read back the kernel to find gadgets and things like that — this approach doesn't really work, so we don't get as strong an overlap of mitigations. Once we get execute-only memory, attackers won't be able to read gadgets, and if things are KASLR'd, they can't see where they need to jump.
Anyway, combining these mitigations makes them quite a bit more powerful. Another change that would be nice to have is chopping up the existing prototype analysis further. There was kCFI, work that looked at call sites to try to further narrow the scope of what counts as a valid call. Since the kernel already has a lot of void-return, void-argument functions, there's a huge chunk that can be called from any place that calls a void(void). To look at this more closely, Android did an analysis of the targets at indirect call sites. You see about 55% of the indirect calls made in the kernel allow five or fewer different target functions, which is a huge reduction in the attack surface — the function-call surface — an attacker can use. And this spreads out, but then there's another small peak at the end where a fairly large chunk, 7%, have more than 100 allowed targets, which is kind of huge. If there's a vulnerability near those, you haven't limited what functions can be called particularly well. So the idea is to bucketize that and say: just because I call a void(void) function here, I'm only expecting to call one of these five specific ones — there's a stronger bucketizing that could go on there. So this has mostly been about the forward edge. But we also have backward-edge protection, which covers returns: we want to make sure the saved function pointer on the stack didn't change, so that we return to where we're expecting. We can handle this with a separate stack, called a shadow call stack. This piece is actually best done in hardware — we're looking at Intel CET and ARM's pointer authentication for this, which provide a much faster way to do it — but there is a software-emulated version for Clang.
And looking at this — the first note I should make is that these are all ARM examples, because the x86 implementation of this in Clang turned out to be pretty slow and to have race conditions. The idea here is that the compiler splits things up: the local variables and register spills you need go onto the traditional stack, and your return address goes onto what should be a secret stack. Now, this is obviously a problem if an attacker is able to expose the location where this secret stack is held — then they can start overwriting return addresses again. The idea is to reserve a single register to be a second stack pointer, the shadow stack pointer. I'll note also that the return address is still saved to the traditional stack, and that's so backtraces and unwinders can still do their job without knowing anything about the shadow stack. So what does this look like? If you dedicate a single register and turn on the shadow call stack, you can see the store of the return address: x30 gets stored to the location held in x18, the shadow stack pointer. So we end up with a copy — store it on entry, load it back out on return. We've still left the old write to the traditional stack, and on return we just overwrite: whatever got corrupted on the traditional stack simply gets replaced from the shadow stack. Now, what's kind of amusing here is that this means if you end up with a return pointer overwrite on the traditional stack, there is currently no check that it got changed. You could add a check before the return — hey, did we get the wrong value into x30 on our second load? — but that adds a bit of overhead, so we haven't done it. In hardware this looks very similar, except you instead have signed pointers on ARM: we sign what we store, and when we pull it back out, we validate the bits that are there.
And we fault on that: if the validation fails, what got written ends up landing in the non-canonical zone, so when we return, we jump into the void. So what is happening right now on Pixel phones, which use Clang for everything in their builds? Sami Tolvanen and a lot of other people have done a giant amount of work to enable LTO, CFI, and SCS in Android and the upstream kernels. Pixel 3 and later have all of this enabled, both forward- and backward-edge protection. And there is now documentation — a recommendation, I should say — in the Android Compatibility Definition that says CFI is strongly recommended in both directions for the kernel, as well as, I think, for user space, but I'm quoting the kernel bits here on this page. The idea is that as this progresses, it will switch from strongly recommended to required, so this will become a standard mitigation for the entire Android ecosystem. So what problems have we encountered in getting this implementation ported and upstream into the kernel? Probably the biggest one was LTO: default full LTO in Clang was massively slow. It added something like 20 minutes to the link stage — or possibly hours when you're doing a really large kernel. Luckily, ThinLTO in Clang does a much, much better job of this. It had a lot of corner cases that needed to be fixed, things it missed, and the kernel had a lot of weird linker aliases and other strange behaviors that any sane linker would be reasonable not to expect. Another piece the kernel has is a ton of assembly code, obviously — most of the kernel is in C, but a lot of it is in assembly, especially things like crypto and low-level architecture functions — and at the time Clang really had no visibility into that. We asked around, and Peter Collingbourne was kind enough to fix this up for us, so jump-table entries are now generated for all externally visible functions regardless of whether they're defined in C or assembly.
Another gotcha was relative addresses: we'd calculate the actual destination function, as opposed to the jump table entry, and then try to recalculate it later, and it would fail. This came up in the exception tables. The nice thing here was that the exception tables are already hard-coded data in the kernel, so they are in effect their own CFI jump tables, and we could just turn off the CFI checks for those. Linker aliases I mentioned earlier — there are very strange tricks happening there. KPTI was interesting: to deal with Meltdown, it switches out the page tables and keeps only a very minimal set of kernel pages visible from user space. On syscall entry into the kernel, you swap the page tables around and then suddenly the entire kernel is visible. However, there were enough indirect calls in the syscall entry stub that the stub also needed to be able to see the jump tables, which was awkward. Right now the best solution is to map those jump tables into the minimal user-visible mapping. This results in some Meltdown-based KASLR leaks, but it's not like we didn't have those already. What I'd like to do is either eliminate the indirect jumps on syscall entry or limit the mapped jump tables to only the ones needed for syscall entry. Then there's upstreaming status — where are we? On Clang: as this moves along, we always think we're done and then we hit some new bug. So I'm going to say Clang is done as of the unreleased LLVM 11, but we keep tripping on things every once in a while. Usually it's in corner cases or really specialized configs; for general-purpose stuff this works now, and it works even in Clang 10 as well. On the kernel side, there's been a lot of consistent progress. The shadow call stack support for ARM is in Linus's tree now, so unless it gets ripped out suddenly, it should be in 5.8.
There's ongoing work to fix function pointer prototypes, because you can do casting in the kernel in a number of ways to say, oh yeah, I'm totally calling this function here — and then when you actually look at the function being targeted, it has a completely different prototype. C sort of doesn't care about that too much; you can trick it in a lot of different ways, and the kernel has found many novel ways to trick C into being happy with that. But CFI is way, way more careful, so we keep encountering those. There are a bunch of different ways we can look for them, and I expect that to continue for some time as CFI gets used more widely, because right now most of the focus has been on Android and ARM, so what's getting exercised by CFI is relatively narrow — but it does work for a desktop system as well. The LTO patch series is pretty large; it's mostly what I'd call tricky mechanical build-script changes. The issue with LTO is that normally, when you build, you take your C file and generate an object file out of it — the actual machine code that's going to run — and then the linker just squishes it all together. With LTO, you instead get an intermediate language: the C gets converted into LLVM's IR, the intermediate representation. So what was an object file is now an IR file, and in a lot of places the kernel expects to find a real object file — to run checks on it, to merge it, to make it part of an archive, and so on.
So we need to either delay those checks until later, once the actual machine code has been produced, or build some things without the IR — there are a bunch of little test modules and other things that expect to be plain object files, and those can just be built normally. Then there were some questions about memory ordering and visibility, and those seem to be settled. The concern was that the compiler might do new optimizations now that it can see the entire kernel at once — weird inlining or other things that cause a problem. But the belief is that if there are bugs like that in the compiler, they exist even in a non-LTO build, so we're just going to move forward. And then there's CFI proper. This is pretty big, but hopefully it should be relatively uncontroversial and land quickly after LTO. Most of the ongoing work on that is the prototype fixes, but I'm predicting — perhaps naively — that it will go smoothly. So if you want to try any of this stuff out yourself: back in November I put up a blog post on actually doing this with the upstream kernel. The main chunk of that I updated recently for getting the latest version of Clang built. In that blog post I had talked about a specific tree that I had published — basically Sami's tree plus more x86 fixes I had been working on — but all my fixes are now in Sami's tree, so you can just pull directly from Sami's tree. And when you're building, you want to make sure you've redirected the build to Clang and Clang's linker — sorry, the LLVM linker. Then you turn off LTO none, turn on ThinLTO and LTO Clang, and turn on Clang CFI.
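The build steps just described, sketched as a fragment — the option spellings below are as used in Sami's tree around this time and may differ in current upstream; the last line assumes CONFIG_LKDTM=y and a mounted debugfs:

```
# Build with Clang and the LLVM linker:
make CC=clang LD=ld.lld defconfig

# Kernel configuration:
# CONFIG_LTO_NONE is not set
CONFIG_LTO_CLANG=y
CONFIG_THINLTO=y
CONFIG_CFI_CLANG=y
CONFIG_CFI_PERMISSIVE=y       # warn instead of panic on a violation
CONFIG_SHADOW_CALL_STACK=y    # arm64 only

# Once booted, trigger a prototype-mismatched indirect call via LKDTM:
echo CFI_FORWARD_PROTO > /sys/kernel/debug/provoke-crash/DIRECT
```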
And then if you want the CFI violations to be non-fatal — in other words, to not kill that kernel thread — you can set CFI to permissive, and you'll just get a warning instead of a BUG; I'll show that here in one second. And if you're on ARM, you can turn on the shadow call stack as well to get the backward-edge protection. Then build your kernel like normal and enjoy. It's a little tricky to do a live demo of this, so I tried to capture what you can expect to see when you fail. The LKDTM module has a test for doing a prototype-mismatched call, so you can test using LKDTM if you have it handy. Basically, if permissive is not set, you end up panicking the kernel, you get the backtrace, and you see the CFI check that happened and failed — that's the ud2 I showed earlier. If you've got permissive set, you'll just get a warning that tells you precisely where it happened, plus the standard call trace. And as I mentioned earlier, shadow call stack failures are currently totally unreported: if you corrupt the return address on the stack, it is just silently ignored. It would perhaps be nice to add reporting there, but perhaps we'll end up with hardware support for these things and can get rid of software SCS entirely. Anyway, I went through that pretty quick, but if you have thoughts, please reach out to me. There's another link for the slides, and you can look at kernsec.org for links to the Kernel Self Protection Project — where we try to herd cats and get mitigations into upstream — and our mailing list. Anyway, that's it for me. I'm going to go look at questions here for a second, and I'll just do them in order. Okay. The first question is: how does CFI deal with cases where generic function pointers are used — you know, the void pointer?
So if I'm understanding the question correctly, it's mostly about casting and passing functions via a void pointer. And the answer is: that does not work with CFI. The idea is to make sure that type safety is being strictly adhered to in those kinds of situations — this is what a lot of the prototype fixes are about. Probably one of the most egregious versions of this: about a year or so ago, I did a conversion of struct timer_list in the kernel. It passed arbitrary data as an argument to a callback as an unsigned long, and then would cast it back to whatever it needed at the call site. And while that is about argument casting, the same approach applies: normally our callbacks are going to be typed at any given location, and the question is how we verify the type all the way back to where we need to make the call. Hopefully that makes sense and I understood the question correctly. Let's see. The next question: do you think the information on possible call targets will be sufficiently good that we could statically determine maximum kernel stack usage? I don't think we can always get it — I think we could probably get pretty close. I'm not sure CFI in particular gets us closer to that; LTO might, just because of the visibility. There's already quite a bit of research being done on stack depth with things like smatch and other tools doing static analysis of the code base to look at call graphs, so LTO might be able to assist, but I think it's a separate issue. Are there any plans to port some or all of these options to GCC as well? I would love to see it. GCC does have LTO support, and I know there were some patches a while back working on using it for the kernel, but I have not gone to look at what's happening with GCC's LTO.
I think that if LTO were well supported for the kernel by GCC, adding CFI wouldn't be too much work on top of that. At the last couple of Plumbers I've waved my arms at the GCC folks saying, hey, here's some stuff I'd love to see that's in Clang — but I don't see it yet in GCC. So right now it isn't there, but that's okay, because we can build this stuff in Clang. Next question: how would CFI affect dynamic tracing, instrumenting, debugging? Right now it's an interesting problem, but it should be entirely solvable. The way dynamic tracing tends to work right now is that it directly updates kernel text — it uses a separate mapping to change the prologue of a function — so those end up being direct function calls out into the tracing code and other things, and a lot of that should be unaffected. I have not spent a lot of time playing with those things in combination, but I do not expect there to be too much trouble with it, and I think that's all I can reasonably say about that. There have been issues just getting things to work together through the history of the CFI work — I know we've turned off ftrace and then turned it back on as bugs got solved — but fundamentally I don't think there's any conflict, because of how tracing works in the kernel right now. The next question: how does CFI work with kernel modules? I skipped that in the interest of time, but effectively every module exports a CFI check function, so when attempting to execute code in a module's memory area, the check gets offloaded to that module. So there's a bit of a performance cost in dealing with CFI across modules, and the assembly is a little more complex than the example I gave. The other part of the question was: is it possible for the kernel to call a function in an arbitrary kernel module?
And it is, as long as the kernel module is built with knowledge of CFI — you're building it with Clang, with CFI, and with the same kernel configuration — so that arbitrary module has all the same instrumentation. It's not really any different from anything else; it's an API like anything else in the kernel, just a very internal one. And I think the last question was the same one about GCC and CFI — I would love to see more CFI work done for GCC; I think that'd be very exciting. But that's it for the questions I've got here... oh wait, maybe there's page two of two? Oh no, that's not it. Hey look, more questions. Okay. Would it be possible to implement similar CFI without LTO, by emitting direct calls to a per-type check helper that is generated after the rest of the build is done? Probably — I think it would basically be treating the entire kernel the way modules are handled now. I think there would probably be a performance issue. That question was from Jann, and I know he likes to lower security problems into performance problems, so perhaps that would be a way to go. Next question: apart from compiler options and build options, did CFI need any other code modifications in the kernel? The answer to that is yes, very much. All the prototype changes that have been landing are basically about places where kernel code had effectively tricked C into being happy building something and then just called into mismatched functions. So a lot of prototype changes have been needed, but most of that is mechanical — it's refactoring code to do the right thing. Do I think the Linux kernel can move to using Clang in general, given the CFI benefits already there? I'll interpret that question as: is it a good idea to move to Clang in general? I think LTO gives some performance benefit already, and Clang already works for building the kernel.
The ClangBuiltLinux project has been doing a ton of work to make that happen, so I mean, it's worth trying even without CFI. There's a question about static key entries. There's basically no problem with static keys, because how that happens in the kernel is through writing to kernel text, and that's a completely separate function that swaps the permissions and writes directly to the kernel text itself. So there's not really any problem with static keys and those kinds of things. Next question: what is the status, and how do you see the introduction of CET in the kernel? The most recent patch series I saw was for CET user-space support. That's ongoing, and I think after that will come CET in the kernel — that's the Intel hardware support for the return, the backward-edge CFI protection, in hardware. I would need to double-check the status, but I know the user-space side is written, and I don't think there will be too much trouble with this in the kernel, although we do end up butting heads a little with some of the other mitigations, like Spectre and Meltdown. There's a clarification question about static key entries changing offsets — I'm trying to interpret what's being asked there. Right now, the CFI pass is basically replacing all the function pointers with jump-table entry points, so in most cases this is invisible to the implementation in C: all function pointers end up being pointers into the table, and so for static key entries and things like that, there's no issue. There are exceptions where things are a little weird, and those we need to take care of — a lot of that is in the CFI series that Sami's got. And I have a minute and a half left, and I do not have a third page of questions, so I think I'm done. Thanks everyone for coming and asking questions, and that's it for me. Thanks!