Hi, my name is Kees Cook. This is a little bit about a specific area of the Kernel Self-Protection Project, looking at the C language generally, why it causes us so many problems, and what sorts of things we can do to improve that. If you want to follow along any of the links or read some of the very small text I have in here, you can download the slides there, or, once I get it linked, from the Linux Foundation website as well. So this is specifically about the Linux kernel, obviously. And the agenda here is: I want to give a quick background on KSPP, talk about C as a language and how it's really just a fancy assembler, and then look at some specific issues that we can hopefully solve, or at least minimize. So the Kernel Self-Protection Project was started a couple of years ago to focus on bringing kernel protections into the kernel. We've had a lot of protections over the years that the kernel supports for defending user space from user space, but there hadn't been as much focus in the upstream kernel on protecting the kernel from user space. And this is a pretty wide project. We've got about 12 organizations with maybe 10 individuals working on a bunch of stuff. This is an upstream project, not a fork or anything, so it follows the upstream development model, and slow and steady is the way I like to think about it. So this brings us to one of the main problems we've had: C gets treated mostly like machine code. It's trying to be an abstracted version of machine code. The kernel does this because it's trying to be as fast and as small as possible, and there are a lot of things the kernel does that there is no C API for: setting up page tables, switching to 64-bit mode. Those are machine-specific issues; they're not about the C language at a higher level. So C is used because it's as close as we can get to machine code without all the pain.
But this comes with some really fun things in the language itself: a lot of undefined behaviors, which come from its history, and some problems associated with having a weak standard library that carries its own old problems. So, some quick examples that I'll get into in more detail in a bit. There's the idea of an uninitialized variable. From the C language perspective, we just say, ah, we don't know what's happening, it's fine, we'll throw a warning maybe. But in actual machine code, it obviously does have a value: it's whatever was in memory before. And then in C, we start to forget that it's supposed to be a language, and we think about it as machine code again, and we can just call function pointers without any regard to what the actual type of the function is, because when it boils down to it, you're running machine code, and the machine code says, well, we're just jumping to a location in memory and running. But that's not actually what you were trying to say with the C language, so there isn't as tight a binding between those things as there should be. And then you also get things out of the API like memcpy, where you say, well, I have an address, and I'm just going to copy as much as I want to it. But that doesn't really help anyone using that library. Normally you see people who are trying to build up a series of copies: they'll have a size, and they're tracking how much they've copied, but they're not really paying attention to how much is left in the destination. So why don't we have better APIs in that regard? Of course, this is a tiny fraction of all the other undefined behaviors in C. There was a great blog post recently on this: with undefined behavior, anything is possible. And I bought the shirt, because you have to have the shirt. But this is a huge topic, and I'm trying to focus in on specific areas where we can improve the kernel itself, or at least deal with the problems that have been created.
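To make the uninitialized-variable case concrete, here's a small sketch (not from the talk; the function names are invented) of the pattern where a compiler will warn about a direct use of an uninitialized local, but passing the variable by reference into another function usually silences the analysis, even when nothing guarantees it gets written:

```c
#include <stdio.h>

/* 'fill' only writes to its output parameter on one path. */
void fill(int *out, int flag)
{
    if (flag)
        *out = 42;
}

int use_value(int flag)
{
    int b;              /* declared, never initialized here */

    fill(&b, flag);     /* most compilers now assume 'b' is initialized */
    return b;           /* with flag == 0 this reads stack garbage */
}
```

With `flag` set, this behaves fine; with `flag` clear, `use_value` returns whatever was on the stack, and typically no warning is emitted, which is exactly the gap the kernel's force-initialization plugins close.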
So one of those is variable-length arrays. When you define a local variable, it ends up on the stack, and in C you can just say, well, I want the size of this to be however large, based on an input variable to the function. This creates problems because the stack is a fixed size, and you can have a linear overflow that runs right past the end of the stack and writes over things next to it, but it's still a valid stack frame. So things like stack protection, the stack canary, the stack cookie, are actually not going to stop this, because C, when mapping this down into assembly, basically says, oh, this is fine, it's just a huge stack. But of course, we've gone way past it. And then there are cases where even if you had a guard page, which is now possible, you could potentially still jump past the guard page and create problems as well. And again, as far as the C language was concerned, it was perfectly happy with this. The nice thing is this is easy to find: turn on the VLA warning with -Wvla. So from a security perspective, the main thing I'm looking at is that they're bad. But it turns out they are also slow. When we went to remove these, one of the driver authors actually did a micro-benchmark of the code, because it was, I think, checksum code or something where you could actually do that, and he had all the instrumentation. And he saw that a fixed-size stack array actually gave him a 13% speed-up. So he's like, great, I can now justify the security improvement with improved speed. But I had to know why. Why is it so, so bad? So if you can read this, having a fixed-size array generates this tiny chunk of assembly, and having a variable-sized array did all of that. I didn't even bother to read all of it, but it seems impossibly bizarre that it's that bad. But apparently it is. So just don't use VLAs. Another case is the switch fall-through. So C specifies break to stop a switch case.
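Here's a minimal sketch of the two shapes being compared (not the driver's actual code; the names, the checksum logic, and the 64-byte cap are invented for illustration). The first declaration is what `-Wvla` flags; the second bounds stack usage and rejects oversized input instead:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAX_BLOCK 64  /* invented cap for the fixed-size version */

/* VLA version: stack usage depends on the caller-controlled 'len',
 * and -Wvla will warn on the 'buf[len]' declaration. */
uint8_t sum_vla(const uint8_t *src, size_t len)
{
    uint8_t buf[len];               /* variable-length array */
    uint8_t sum = 0;

    memcpy(buf, src, len);
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Fixed-size version: bounded stack frame, much simpler codegen. */
uint8_t sum_fixed(const uint8_t *src, size_t len)
{
    uint8_t buf[MAX_BLOCK];
    uint8_t sum = 0;

    if (len > sizeof(buf))
        return 0;                   /* refuse oversized input instead */
    memcpy(buf, src, len);
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
```

The fixed version trades a runtime length check for a compile-time-known frame size, which is where the codegen (and the 13% in that benchmark) came from.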
But there isn't anything to say, please move on to the next one. It is simply the absence of a statement that says move on to the next one. But an absence of a statement could also mean you forgot to put a break. So this weakness in C actually has its own Common Weakness Enumeration item: CWE-484, Omitted Break Statement in Switch. So is this actually a bug? We don't know; we have to look at every single case. Static analyzers have had this problem for a while, so they flag them, and to whitelist cases where you do want to fall through, they started accepting a comment that says "fall through". So the compilers, following the static analyzers, have added parsing of a comment as a statement to indicate, I do want to fall through here. That's sort of where we are. Adding -Wimplicit-fallthrough to your compiler flags says, warn on implicit fall-through, and if the compiler doesn't find the comment statement, it will yell at you. And so we've been going through the kernel adding these, looking at every place where it's missing and trying to decide, was this an accident? And we've had a lot of bugs found this way. Another one, back to the stack, is getting rid of the uninitialized variable case. Right now, with most compilers, if you try to use a variable that you declared locally but didn't initialize first, you get a warning about using an uninitialized variable. However, this warning gets silenced if you pass the variable into a function by reference, and suddenly the compiler just forgets, like, well, I assume that since you passed it into a function, it's now initialized. But of course there's no reason to believe that it actually got initialized. So there are some plugins in the kernel for doing various versions of this: one for force-initializing any structure that has __user pointers in it, which was then expanded to all things that are passed by reference.
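The fall-through marking described above looks roughly like this (a sketch, not kernel code; the function and flag values are invented). With `-Wimplicit-fallthrough`, the commented fall-through stays quiet while an unmarked one would warn:

```c
/* Invented example: 'a' implies the 'b' behavior too, so the first
 * case deliberately falls into the second. */
int flags_for(char c)
{
    int flags = 0;

    switch (c) {
    case 'a':
        flags |= 1;
        /* fall through */   /* the comment the compiler parses */
    case 'b':
        flags |= 2;
        break;               /* without this, 'b' would run the 'x' case */
    case 'x':
        flags |= 4;
        break;
    }
    return flags;
}
```

An input of `'a'` picks up both the 1 and 2 bits via the intentional fall-through; `'b'` stops at the break. The audit work in the kernel is deciding, for every unmarked case, which of these two the author meant.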
And then there are still some leftover cases, especially with structure padding, where you still want to initialize for sure. And in some discussions we actually encountered Linus praising the idea of always initializing all the variables, all the time. So that's what we're trying to work towards. There was a patch for GCC to do this; it's not upstream. There's a patch in Clang to do this; it is also not upstream. We're looking at building a plugin to do this as well. But this gets rid of the C problem of, well, what's in the memory? We just declare that everything is zero-initialized no matter what, and you can depend on that as a feature of the Linux kernel version of the C language. And that makes things easier to think about. One interesting side effect that I thought was adorable, as part of this mapping from C down to machine language, is that once you force-initialize your variables, I got a warning out of GCC that said, you have unreachable code. And I went looking, and it was because there were declarations with initializers before the first case statement in a switch, and those initializers never get executed, because control never actually goes there. A variable declaration, in assembly terms, is just making room for it on the stack, but initialization requires running something to write to that area of the stack. So for a declaration in that area of the switch statement, where I didn't even know you could put declarations, the initializer would never run, and the variable would never be initialized. So I went through and lifted out all of the places where this occurred. There weren't a lot, but this was yet another surprising side effect of C I just didn't know about. Another case is dealing with integer overflows. GCC has support for checking for signed integer overflows. This is one of the many things that gets enabled with CONFIG_UBSAN right now.
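The switch quirk described above looks like this (a minimal invented example, not one of the actual kernel sites). The declaration before the first `case` label is legal C and `tmp` is in scope for the whole switch body, but control jumps straight to a label, so the `= 5` initializer is unreachable; GCC reports this with `-Wswitch-unreachable` (enabled by default in recent versions, as far as I know):

```c
int demo(int x)
{
    switch (x) {
        int tmp = 5;    /* in scope below, but this initializer can
                         * never execute: control jumps past it to a
                         * case label, so 'tmp' starts uninitialized */
    case 0:
        tmp = 1;        /* must assign before use, despite the '= 5' */
        return tmp;
    default:
        return -1;
    }
}
```

If the `tmp = 1` assignment were removed and `tmp` returned directly, the function would read an uninitialized stack slot even though the source appears to initialize it, which is why these sites were lifted out of the switch.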
The good news is it's very, very fast, because it's just checking an existing hardware flag. I couldn't actually measure the difference; I need to do better micro-benchmarks to really figure out how many cycles of difference it is, but I think it's going to be very, very small. If you just want the kernel to abort immediately, it grows the kernel image by 0.1%, which is good. The downside is, if you want warnings about this, it grows the kernel image by 6%, because there are thousands and thousands of integer calculations being made, as you might imagine. In the meantime, we can do explicit single-operation tests, where we say, I want to know for sure in this code flow whether or not I overflowed. So we now have a set of arithmetic overflow detection helpers in the kernel. Clang can do unsigned integer overflow detection. Specifically, signed overflow is considered undefined behavior for a variety of reasons, but unsigned overflow is considered well-defined, except that it is usually unexpected. There are, however, a lot of cases in the kernel where we intentionally perform unsigned overflow, so we'd have to go through and mark those and deal with them. But this is one difference in implementation between GCC and Clang if we're doing that. Clang gives you quite a variety of ways to handle it; I've shown them in this slide here. You can have it abort, you can have it warn but continue, you can have it warn and give up, you can do a bunch of different things. So plumbing that into the kernel would be nice. And then, generally, bounds checking. This remains a big area of vulnerabilities in the kernel: just having strcpy or memcpy wander past the end of an allocation and keep writing into whatever memory is next. In the kernel we have hardened usercopy, which checks the places where we're explicitly copying to and from user space, in the copy_to_user/copy_from_user checking.
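The single-operation tests mentioned above are built on the GCC/Clang `__builtin_*_overflow` family; here's a sketch in that style (the helper name and the "count times size plus header" scenario are invented, but this is the shape of an allocation-size computation you'd want checked):

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if n * size + extra would wrap a size_t; on success,
 * stores the total in *out. The builtins report overflow via their
 * return value and write the (possibly wrapped) result through the
 * pointer argument. */
bool alloc_size_overflows(size_t n, size_t size, size_t extra, size_t *out)
{
    size_t bytes;

    if (__builtin_mul_overflow(n, size, &bytes))
        return true;                    /* n * size wrapped */
    if (__builtin_add_overflow(bytes, extra, out))
        return true;                    /* adding the header wrapped */
    return false;
}
```

Because the builtins compile down to a multiply or add plus a check of the carry/overflow flag, this is essentially free, which is the same reason the always-on hardware-flag checking is cheap.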
And this is under a 1% performance hit. I tried to extend this to the str* family and the mem* family of functions, and those are about a 2% performance hit each. I still need to look at this a little bit more. Pre-Meltdown, this was a totally unacceptable performance hit for security; post-Meltdown, it's under 5%, so maybe I have a chance to land this stuff too. We'll see. But it would be nice, because we keep getting vulnerabilities where the memcpy is just wrong and we could have easily detected it. We had everything we needed: we know how big the allocation is, we know how big everything around it is. Anyway, this moves on to: can we just get better APIs and get rid of the old bad APIs that came from the standard C library? This tends to be quite a political problem as well, because in trying to bring developers into the Linux kernel community, you don't want to have to teach them an entirely new C API. However, we're already doing that, because we said, well, strcpy was no good, let's use strncpy. Except that strncpy doesn't null-terminate if the source is too long, and if the source is too short, it null-pads the entire destination you specified, so that's not good. So we made strlcpy, but that reads the source string beyond the maximum length too. So how about strscpy? Seems okay so far. So maybe we can improve memcpy too; that would be great. So yes, the point was, this is slow, but there is hopefully some future world where we're going to have memory allocation tagging supported in hardware. The example here is that you tell your allocator, in this case kmalloc, I want 128 bytes. And the allocator says, okay, this blue area is 128 bytes, and I've given it tag five. That tag lives in the high byte of the pointer value that comes back from the allocator, and so you can say, great, I'm going to write at an offset from that pointer, and the hardware looks at that and says, okay, you have the right tag for your offset.
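For contrast with strcpy/strncpy/strlcpy, here's a userspace approximation of strscpy's semantics (a sketch; the real kernel function returns -E2BIG on truncation, where this one just returns -1): it always NUL-terminates, never reads the source past the bytes it examines, and reports truncation instead of hiding it:

```c
#include <stddef.h>

/* strscpy-like: copy at most count-1 characters, always terminate.
 * Returns the number of characters copied, or -1 on truncation. */
long strscpy_like(char *dst, const char *src, size_t count)
{
    size_t i;

    if (count == 0)
        return -1;              /* no room for even the terminator */

    for (i = 0; i < count - 1 && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';              /* unconditional termination */

    return (src[i] == '\0') ? (long)i : -1;   /* -1: src didn't fit */
}
```

A caller can check the return value once and know both that the destination is a valid string and whether data was lost, which is exactly what the strncpy and strlcpy interfaces make awkward.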
It's within the range of 128 bytes, so we're good. And then later on you say, well, I want an offset slightly beyond that, and it says, well, the memory region past that has a different tag, so you're going to fail, because you're outside of what you were expecting that pointer to actually point to. So stuff like this exists already in SPARC with their Application Data Integrity extension. In ARM, this is coming, and supposedly we might have this on Intel at some point. Which moves on to CFI, control flow integrity. With decent control over having memory not be writable and executable, attackers have moved on to trying to use the existing code that's in the kernel by taking advantage of indirect calls, where you have saved a function pointer somewhere and you eventually turn around and actually run it. In this case, for the forward edge, calling out, you've got a function pointer saved in the heap; you go fetch it and you just call it. And then on return, you return from somewhere back to where you came from, and that's effectively an indirect call off what was stored on the stack. But this is all implementation detail. In C, you specified, I want to make this function call and then come back from it. And without CFI, it's kind of like, well, I can change what I'm calling; I can just tell C, pay no attention to this, and we'll go ahead and call call-one versus call-two, which have completely different function prototypes, violating what we'd asked this function to be. But again, when mapped down into machine code, we're like, sure, it's a function pointer, whatever. Just go there and run whatever happens to be there, and we don't care. So, doing forward-edge checking with Clang's CFI: this will actually blow up, because when it tries to execute it says, I was expecting to call this type of function, but I arrived at a different type of function, so I'm going to freak out. Now this isn't perfect. It's based on the function prototype pattern.
So right now in the kernel, there are still plenty of functions that return an unsigned long and take as one argument an unsigned long. So that's not great. But for a lot of other routines, it does narrow the attack surface for indirect calls. Of course, this is the forward edge. For the backward edge, the return, there are things like split stacks, where you say, okay, we're going to push all of our weird locals, buffers, and by-reference variables into this unsafe area, because we don't know what's going to happen to them; bad stuff might happen. But things we can prove are safe to use, register spills and safe accesses and the return address, get split onto a different stack. This is one approach to solving it, because if the attacker doesn't know where the safe stack is, it's harder to deal with. Similar to this, but with less logic, is the shadow call stack, where the only thing you put on the other stack is the return address. And it's harder to get at this one, because you can keep a dedicated register for that entire stack, sort of how there's the regular stack register for the unsafe stack, and then another separate register, effectively, for the call stack. And this works in Clang right now. So there's hardware support for dealing with backward-edge CFI. Intel CET deals with one aspect of this, which is that doing it in software leaves that second stack writable, which means that if it can be found by an attacker and written to, they've taken over your return path. With CET, this is effectively a read-only area of memory that is writable only during the call and return instructions that do the implicit read and write to that area. And a different version is pointer authentication in ARMv8.3-A, which adds new instructions to attach, effectively, an encrypted tag to what you're writing out to the stack. And then when you pull it back, you can re-verify it.
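The forward-edge mismatch described above looks something like this sketch (invented names, not kernel code). Both casts compile, but under Clang's CFI (roughly `-flto -fsanitize=cfi-icall`, as I understand the flags) the indirect call site checks that the target's prototype matches the pointer's type, and a mismatched target traps instead of being jumped to:

```c
typedef int (*handler_t)(int);

int good_handler(int x)
{
    return x + 1;                   /* matches handler_t exactly */
}

long wrong_handler(long a, long b)
{
    return a + b;                   /* completely different prototype */
}

int dispatch(handler_t fn, int arg)
{
    /* Without CFI this is a blind jump; with CFI the call site
     * verifies 'fn' really points at an int(int) function. */
    return fn(arg);
}

/* dispatch((handler_t)wrong_handler, 1) would compile, but calling
 * through the mismatched pointer is undefined behavior, and it's the
 * call CFI is designed to kill; it is deliberately not executed here. */
```

The weakness mentioned in the talk follows directly from this scheme: every function sharing the `unsigned long (unsigned long)` shape lands in the same equivalence class, so CFI narrows the target set rather than reducing it to one.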
And the difference on that is pretty simple: as you enter a function, you sign where you're coming from, and then when you're about to leave, you double-check that what you have is what you wanted. So where are we now? With VLAs, it's been about four releases of the kernel. We went through a little over 100 of these, each a little bit different, so it's taken quite a bit of time to get rid of them. But we're down to only a handful remaining, in crypto. I'm hoping that will be completely finished by 4.20, or whatever is next after 4.19. The explicit switch-case fall-through: I knew that Gustavo had been sending patches slowly over quite a while, and I thought, well, how many has he sent? I saw that he had sent 745 patches, and I wondered how many we started with. It's over a thousand. So now we're down to about 700 of these remaining. But again, for each one of these, you have to look at it and decide, what did the author mean? Is there a comment here to describe whether or not the fall-through was intentional? But once we get through those, that entire class can go away as well. Always-initialized automatic variables: a lot of this is available through the plugins, but we don't have complete coverage. It's not quite the way we think we want it in the kernel yet. We'll see; it'll be nice to get more complete support from the compilers on this, so upstreaming those existing patches would be great. On overflow detection, it would be nice to have GCC grow the unsigned overflow detection, but the signed detection does work right now; we just need to specifically tear it out of CONFIG_UBSAN, and we should have this. Bounds checking: mainly it's worrying about performance and waiting for hardware. That's okay. And CFI: this actually works right now in Android. There's a talk later on this. It's pretty impressive.
And again, waiting for hardware. So, that's where we are; how do we get there? It's about trying to get people involved. We have a lot of cultural challenges in getting things into upstream. There's a lot of conservatism about not wanting to make changes to code, accepting the responsibility of the overhead, and sacrificing one's time to make that happen. Obviously there's the technical piece; there is a lot of complexity here, but we can solve that. And of course, just getting people to help with doing it, reviewing it, testing it, and, in cases where you're not running the latest kernel, actually backporting it to your releases. Traditionally the LTS kernels only take bug fixes; they haven't normally backported features. And the reason for that, as you can see with the hundreds of patches to fix VLAs and switch statements and other things, is that it's actually a huge number of patches, so backporting all of that is somewhat prohibitive. So that's it. You can reach me at these places. There's the link to the self-protection project and these slides again. I caught us back up on time. Any questions or other things? Okay, yes, here's a microphone for you.

[Audience] I've been doing C programming since 1977, and the comment was always "no break". Where did they come up with "fall through"?

It was the static analyzers.

[Audience] So why that comment? Why not the one that's been in use for, you know, 40 years?

Probably because the static analyzer folks hadn't been writing C since 1977.

[Audience] Well, okay, thank you.

Yeah, reading the feature request for the support for parsing a comment as a C statement, there was great anger at the fact that the compilers got painted into a corner, because the static analyzers said, well, this is what we're doing; this is what we're checking.
Here are all of the giant numbers of code bases that we've instrumented now; we've actually updated all the code to have "fall through" as a comment. The compiler people were just kind of like, but we could give you a statement. Too late now; it's a comment. Anything else back there? You want to get the microphone?

[Audience] So once upon a time, I saw an effort made to try to enforce things with the string APIs, for them to be more secure: say, no more strcpy, only strncpy. And it resulted in some of what I would call strncpy anti-patterns, where people were just doing things like calling strlen on fixed-size strings, or other things like that. So what's the plan to try to make sure we don't turn these supposedly more secure APIs into perhaps still insecure APIs?

I think it's mostly us designing it right, and actually getting people who have strong opinions about this to look at the past anti-patterns and say, what do we need to have? What is actually a helpful API for the author that provides the defensive characteristics we want without getting in their way? In the past, we've just kept doing tiny band-aid fixes: well, strncpy, we're good, just ship it. And I think the other problem we've had is with the evolution of APIs in the Linux kernel. We've had a long history of saying, here is a new API, I will use it in this one place, and it's everyone else's problem to fix all the old uses. I have tried, in some of the conversions we've made, to look at past APIs and remove them. So first, move an ancient API up to the old API, then move all of the old APIs to the bad API, then move the bad API to the good API, in the process wiping out the availability of all the others. And I think that's part of the cost associated with this: actually getting rid of the old APIs and not allowing them to exist anywhere and get misused in the future.
[Audience] Just a further bit to Laura's comment: when we've found anti-patterns in the past, we've added Coccinelle scripts and coccicheck scripts. Do we perhaps need to proactively figure out, we're adding this API, here is a way people might misuse it, and add checks for those kinds of things in advance, before we start seeing them?

That would be nice. We have some sets of Coccinelle scripts already in the kernel, but they are effectively disjoint from a regular compile. In some places where kernels get built for vendors, they will actually do two-stage compiles. They'll say, first we're going to do the static-checker compile, which includes Coccinelle and some other things, and, since those tend to be so noisy, if that does not produce a difference in the output from before and after, then continue and do the build for real. But that's actually been something that's bothered me for a while: we don't include that in the common build, so there's no warning that something bad has happened. Which is why I've pushed to just eliminate the API from the kernel, because if it's gone, it won't even build. But we're forced into some cases where we span multiple releases with APIs we have to continue to support, and then people get distracted by other things. So it's just a matter of doing it as completely as we can.

The question was, is there any way we can mark APIs as obsolete? So we did have __deprecated, but Linus deprecated it. Linus's argument was effectively the same: if you're removing an API, remove it; don't make it someone else's problem. Which is agonizing, right? Yeah, I have done this. So there isn't a particularly good solution here, short of having some form of development mandate where someone can say, I am removing this API; it is your problem to fix, or your code gets left out. I don't know. We could add things to checkpatch.
That's happened in the past. There's sort of a potpourri of various mechanisms that people have tried. So yeah, just killing the API appears to be best, but it is extremely time-consuming. However, I've kept this as a, hey, we'd like to get rid of an API; it's a bit tedious, but it's usually pretty mechanical, and that works as decent kernel-newbie type of work. So if there were a list of, here, we'd like to get rid of this timer interface, or this string interface, things like that, keeping that list in one place is another idea. And now I'm over time. Anything else? Oh, there's one more.

[Audience] So there was a mention of hardware support for bounds checking on x86, and I'm pretty sure there are already instructions for that; there's an instruction called BOUND, I think, in x86 assembly. So what's the problem with it? Because I've heard about it, I've read about it, but I'm pretty sure nobody's using it, not even compilers.

My understanding is BOUND is separate. I came across it a while back, but it doesn't provide the protections we want, because even if we had that, it requires an explicit check. You would say, am I in bounds? And then you do it, but that instruction still needs to understand what the bounds were, and that information may be totally separate from the execution path. So having the support in the MMU, where it's actually checking when it's actually trying to dereference pointers and do other things, attaching that at the hardware level, will actually get us what we want. Otherwise, we can just do it in software, and maybe we get those instructions used. So plumbing access to the allocation size is the slow way to do it in software, but in hardware, if we can just associate it with the memory region, then we get it fast for free. Anyway, I think that's it. Come ask me questions if you want to in person, or email me. Thanks.