Talk a bit about trying to make lazy binding by the dynamic linker a bit more secure. So, the plan: what's the problem I'm trying to solve here, and what can I do about it? Then some bits about how good the solution actually is, then, okay, what are the problems created by the solution, and then bits about what other options there are and the status of this all. So, a review of lazy binding. Lazy binding is how the dynamic linker lets us defer a bunch of work that it would otherwise have to do: when a function call is made into a shared library, the dynamic linker has to look up the actual symbol address. Lazy binding is a way of saying, let's not actually do that work unless we need it, by deferring it until the first time the function is called. The pro, of course, is that you don't pay for the symbol lookup for functions you don't use. The cons: the first is inconsistent call latency, where the first call is quite a bit slower than all the others. And the one I discovered later is that it violates the goal of W^X, write xor execute, on some of the architectures. The code or the function pointers involved need to be changed on the fly while the program is executing, so in some cases we're writing to actual executable code. That's bad; we don't want to do that. Some of you may recognize this slide from Theo's presentation from several years ago, describing how the OpenBSD project moved various things around. Here we moved the constructors and destructors over next to the GOT, to make sure they were not both writable and executable, and similarly set various permissions on the GOT and PLT. So it's like, hey, we did all this great work. And oh, by the way, there's a gap. Oops. So the goal: we want to make it possible for the dynamic linker to perform lazy binding, we don't want a window of vulnerability, and preferably we want to be a bit more efficient.
Actually, on a wall near where I used to live in Emeryville. So, a little review of dynamic linking here to set the context. Executables don't contain the library code that's in the shared library; they just reference it. So the executables are all smaller, and we can even nicely update the library without relinking. Note that the references are by name; the executable doesn't remember a symbol index or anything like that. Calling printf doesn't mean calling the 37th function inside libc. The executable just knows, I'm calling printf, and it's the responsibility of the dynamic linker to say, oh, printf is what you wanted? Okay, let me go look that up over here; okay, here's the address. Now, the information of exactly which addresses correspond to what is represented by these things called relocations. There are tables of these; the assembler creates them, and the linker consolidates and rewrites them. They describe how the code and the data depend on load locations and symbol values. So when you load an executable, it has to know, okay, these functions may be at a different relative address, and it has to convert those to absolute addresses. These are just a couple of example relocations. Every platform has its own set, and these two are, obviously, from AMD64. The first rewrites a specific location by adding the load address to a value: at a certain address, we take the load address and add a value to it. The other one says, okay, take a symbol's value and add something to it. Broadly, relocations come in two types. There are those that have to be done immediately, when you actually load a library or load a program.
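Before getting to the two kinds, the two example relocations above can be made concrete with a little C. This is not real linker code; the struct just mirrors ELF's Elf64_Rela layout, and the type values match R_X86_64_RELATIVE and R_X86_64_64:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the ELF Elf64_Rela layout: where to patch, what kind
 * of relocation, and a constant addend. */
struct rela {
    uint64_t r_offset;  /* location to patch, relative to load base */
    uint64_t r_type;    /* relocation type */
    int64_t  r_addend;  /* constant to add */
};

enum { R_RELATIVE = 8, R_64 = 1 };  /* values match R_X86_64_* */

/* Apply one relocation into a fake "image" loaded at load_base.
 * symval is the resolved symbol address (used only by R_64). */
static void apply_rela(uint64_t *image, uint64_t load_base,
                       const struct rela *r, uint64_t symval)
{
    uint64_t *where = &image[r->r_offset / sizeof(uint64_t)];
    switch (r->r_type) {
    case R_RELATIVE:  /* *where = load address + addend */
        *where = load_base + r->r_addend;
        break;
    case R_64:        /* *where = symbol value + addend */
        *where = symval + r->r_addend;
        break;
    }
}
```

The dynamic linker walks the table and applies each entry like this; the only difference between relocation types is which inputs feed the arithmetic.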
So as soon as it's loaded, whether at startup time when the executable is executed, or later if you loaded a shared library with dlopen, all the immediate relocations have to be done right then and there. These are places where the code refers to some data location or has a pointer to a function, and there's no way to capture the flow of control at that moment; the code just wants an address, that's all it does. On the other end, there are the lazy relocations, which are places where the code is actually calling a function. When you're calling a function, obviously, you're in the flow of control, so we can trick the flow of control into going over to the dynamic linker instead. Once in the dynamic linker, it can figure out the real bits, rewrite the relocation, stop being so lazy, and then go on to the actual code. With position-independent code, we do this to maximize sharing, and you can even do this with executables, position-independent executables, for those who were in Sean's talk where I talked about it. So even though the library or executable is loaded at a random location, we need to be able to get to the right place, and the relocations handle that. The generated code can't, well, you don't want to rewrite all these things in the text segment itself, so instead you do some indirection. The program, the executable or the shared library, has in its on-disk representation these two tables: the GOT and the PLT. The GOT, the global offset table, stores just addresses and values like that, one big table. The PLT is the executable version: if you do want lazy binding, the program actually calls the PLT entry, and that figures out where to go from there. And this gets all very, very architecture-specific.
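Before it gets architecture-specific, the lazy mechanism itself can be modeled in a few lines of plain C: a one-slot "GOT" of function pointers that starts out aimed at a resolver and gets rewritten on the first call. This is a toy model of what ld.so does, not its actual code:

```c
#include <assert.h>

static int resolve_count;

static int real_bar(int x) { return x * 2; }

/* Forward declaration so the GOT slot can start out pointing here. */
static int lazy_resolver(int x);

/* One-slot "GOT": initially points at the resolver, just as a real
 * GOT slot initially routes back through the PLT to the linker. */
static int (*got_bar)(int) = lazy_resolver;

static int lazy_resolver(int x)
{
    resolve_count++;          /* the symbol lookup would happen here */
    got_bar = real_bar;       /* rewrite the GOT slot... */
    return real_bar(x);       /* ...then complete the original call */
}

/* Callers always go through the GOT slot, like a PLT entry does. */
static int call_bar(int x) { return got_bar(x); }
```

The first call through call_bar pays for the resolution; every later call jumps straight to real_bar. That pointer rewrite is exactly the write into read-only memory that the rest of this talk is about.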
And to make some of this a little clearer, let's actually dig in a bit on a couple of architectures. On AMD64, for lazy binding, the PLT is never changed. It's a static chunk of code: the code that's in the executable is mapped into memory, and it doesn't have to change. What that code does is use the fact that we can do these instruction-pointer-relative lookups. The PLT entry for printf, for instance, knows: I look over into the GOT for the address of printf, and from there I can figure out where to jump to. i386 is similar, except it doesn't have the nice instruction-pointer-relative addressing. i386 really should have taken a note from the AMD64 designers; they should have gotten this right. The result is that the caller of a PLT entry knows it's going to be calling the PLT, and it has to set up the GOT pointer for the PLT so that it can all fit inside the PLT. To get even more specific, here's a small chunk of C. If we assume that foo and bar there are actually being pulled from a shared library, the generated assembly looks something like that. The first one has this annotation here on foo. That's a special bit of markup that says, okay, hey assembler, when you create this, turn that into a number, which is the GOT entry for the variable foo. So that loads foo's address in the GOT into the RAX register, and then we can dereference that in the next instruction. The second one demonstrates calling the function: it's just that annotation to call through the PLT. As for what the PLT ends up doing, and this is getting even deeper, the PLT entry here, we'll say, is the entry for bar. It has this nice little thing where it does its own look over into the GOT, and when the symbol is unresolved it ends up pushing the relocation index and jumping back to the first PLT entry.
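The slide's snippet was along these lines (my reconstruction, with made-up names, and with foo and bar defined locally so the file is self-contained; in the talk's example they would live in a shared library). The assembly shown in the comments is roughly what gcc -fPIC emits on AMD64:

```c
#include <assert.h>

/* Stand-ins for symbols that, in the talk's example, come from a
 * shared library; here they're local so the file is self-contained. */
int foo = 41;
void bar(void) { foo++; }

int use_them(void)
{
    /* With -fPIC and foo external, gcc on AMD64 emits roughly:
     *   movq  foo@GOTPCREL(%rip), %rax   ; load foo's address from GOT
     *   movl  (%rax), %eax               ; dereference it
     * and for the call:
     *   call  bar@PLT                    ; jump via bar's PLT entry
     */
    bar();
    return foo;
}
```

The @GOTPCREL and @PLT annotations are the "special bits of markup" being described: the assembler and linker turn them into a GOT offset and a PLT entry respectively.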
And that eventually goes over into the lazy binding code. The goal here is that this starts off as an address that points over to PLT0, and we're going to rewrite it to point to the real printf or the real bar function. So the lazy binding code says, okay, the bar symbol, for instance: let's resolve it to the correct address. It looks through symbol tables and goes off; that's all standard stuff. And then it's going to update that GOT entry. But we already said, and we saw on that slide of Theo's, that the GOT and the PLT don't normally need to be written by the application code, so on OpenBSD, after loading, they're both mprotected to read-only. So how are we going to update that GOT entry? Okay, here's what we really do: we resolve the symbol, then we mprotect the GOT to read-write, update the GOT entry, and then mprotect it back to read-only again. Well, what if a signal came in in the middle of that, and the signal handler called a function that needed resolving? Okay, so that's not going to work either. So let's try this again and block signals across the whole thing: we block all signals, mprotect to read-write, update the entry, mprotect back to read-only, and now unblock our signals and actually do the function call. And this works; that's what we're doing. Threads, of course, as usual, come in and mess things up even more, because threads are a pain. So in a threaded program, once more with feeling, we also grab a lock there, a spin lock, just to make sure another thread doesn't try to do this at the same time. Because the problem is, if one thread does the protection dance and another thread interleaves its own unprotect and reprotect, you end up in the wrong state: either you leave it unprotected, or you fault trying to do the update.
The spin lock is actually registered with a callback, so that the thread library tells the ld.so code, hey, you have to use this spin lock here. That way you don't pay for the spin lock in non-threaded code. As a result of all this, if you ktrace a dynamically linked program, you see lots of noise like this in the resulting ktrace: you see it block the signals, do an mprotect, you don't see a syscall for the actual change, then it re-protects and unblocks the signals. And then finally, hey, it turns out all of that was for resolving ioctl. Now, those mprotects aren't free, neither of them, for that matter. When you add permissions to a page, when it's marked read-write, we just set some bits and off we go; then when you actually try to do the write, we take a fault, fix it up, and say, oh, okay, yes, you are allowed to write to this page. On the other hand, when you remove a permission, that has to be instantaneous, or at least it has to take effect before the return from the mprotect. And more importantly, if you have threads, we need to make sure all the other threads in this process are denied write access as well. So whenever you do an mprotect in a threaded process to remove permissions, you have to send IPIs to all the other CPUs that are involved in running these threads. And that's not cheap; it gets expensive. And it's kind of annoying, because we don't even want those other threads to see that it was writable to begin with. So we make this change, and then we have to tell everyone, you didn't even see that change. It's a waste. So I came up with the idea of kbind. Some of you may remember me talking a little about this under the name mwrite a couple of years ago.
The name has changed, and a number of things changed about it. Basically, it's a system call for doing these updates of the PLT or GOT. You pass it the address and length of the memory to update, and a buffer of what you want it to stick into that protected memory. In the kernel, it has to do the same sort of permission checks: it still does basically the same checks that mprotect would have done and that the copy-on-write handling would have done. It has to, for instance, make sure that if the page was copy-on-write, it clones it and resolves that. It has to make sure that if you didn't actually have write access to the underlying page, because it was from a read-only file, it'll break your fingers, and things like that. Nicely, because it's a system call, it becomes uninterruptible, so you don't have to do that sigprocmask dance; that's implicit in being in the kernel. Nor do we have to do any of the spin locking, because we can just say, okay, hey, UVM and pmap, make sure you do the locking sufficiently that if two threads both call kbind, nothing goes wrong. So here's the kind of before and after. This is basically what ld.so's resolution path looks like on AMD64, and after, it's a bit smaller. The result is that it's even a bit nicer in the ktrace output: you just see it pass in the information, and off it goes. Well, that's the user-space side. In the kernel, we can try to be a bit more efficient about things. In one of my early implementations, I basically copied the mprotect code path and went from there, but after some help from our UVM hackers, I started just wiring the page in. It consists of just a few steps. We copy the data into the kernel.
We then force the UVM fault; uvm_fault_wire there takes care of the permission check and the copy-on-write resolution, and makes sure the page is actually in memory in case it got paged out. And then, because AMD64 has a direct map, an address range in the kernel virtual address space that maps all of the physical addresses (it's a simple subtraction operation for the hardware), we can figure out the right address in that range and poke that page directly. Though it's not actually a bcopy, because you want this to be word-wise atomic as seen from the user-space side. Once that's updated, we unwire the page so it can be paged again, clean everything up, and off we go, back to user space. That's it. We don't have to signal other CPUs. If there was a copy-on-write resolution, the UVM fault may have to do that, but we were going to have to do that anyway; no loss there. SPARC64, as a contrast, is a bit different. On AMD64 the PLT was static and we did everything in the GOT; on SPARC64 it's kind of the other way around. The GOT is never changed, and instead we actually update the executed code. In the dynamic linker, there's a number of different code sequences you can use: okay, if the place we're jumping to is within 2^21 bytes of the PLT entry, we can use this code sequence; if it's close to address zero, we can use this sequence; and stuff like that. There are literally eight different code paths in that function in ld.so, and I don't think anyone had ever gone through and exercised all of them, because they were all buggy.
I went through this and discovered that all of the ones doing relative calculations were failing to correct for the placement of the actual jump within the PLT entry, so they were all jumping into the middle of the function. Which honestly calls into question whether we should just delete a whole bunch of these code sequences, because they're obviously not being used, or else our SPARC64 programs would all have been crashing randomly, depending on how close the libraries ended up in the address space layout. So, something to be aware of when you're working on your ASLR stuff. And in some cases there were great places where the asm was described in a comment as being one way, but when you actually look at the hex values, that isn't what the instruction is. Ignore the comments; read the code. Now, the other problem, of course, is that we're changing running executable code. What happens if another thread calls this at the same time? Well, Sun wasn't dumb. They came up with a set of instruction sequences so that you can do this update and make sure that a concurrent caller consistently gets either the original or the corrected PLT result. But it does mean you have to change it in two steps. For instance, it's a similar setup to AMD64: the first couple of entries are the ones that do the actual jump into the dynamic linker, and PLTN, when it's unresolved, loads the offset of the entry and then branches, and this branch goes back up to the one that goes into the dynamic linker. Okay, that's nice and simple. When we update it, say the target address is within 2^31 bytes or whatever, we'd use this call sequence, where we save the return address, do the call, and slip the return address back into place.
Now, if we just wrote the instructions in the obvious order, the first step would break things: if we wrote that first word first, a thread calling in at the same time would of course barf. So what you do, to get this correct, is write them in the other order. You update all the other instructions except that one, and once those are in place, you update the first instruction, and that switches the semantics from jumping into the dynamic linker to jumping to the resolved address. Now, for kbind, that means the kernel has to know how to do that. There are a couple of ways we could do this. We could have the kernel just know: oh, I'm on SPARC64 and these are the bytes, therefore I will write them in this order into memory. No, I don't want to put that in the kernel. No, no, no, that's wrong. So instead, I'll pass the kernel two blocks: one block that says, okay, I want you to update these two instructions, and then after that, update this instruction. So, okay, the system call signature I gave before: I lied. It's actually something a bit more like this, where you pass a pointer to a parameter block which contains an address, a length, and then some data, and then another address, length, and more data. In the kernel, SPARC64 doesn't have the direct map, so instead we basically force that page to be mapped into the kernel map, poke it directly there, then unmap it back out and clean everything up. That's actually closer to what the other archs will want to be doing. And the result of all this: okay, so how good is this? What are the results? The answer: well, a make build of the system saves about 4% on execution time, and most of that is actually in system time.
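Backing up to the revised system call for a moment: the two-block parameter scheme just described can be sketched in plain C. The layout here is entirely hypothetical (the struct name and fields are my stand-ins, and the kernel side is simulated by memcpy so the sketch runs anywhere); the point is just that each block is an address/length header followed by its data, and the kernel applies the blocks in order, so the first instruction word lands last:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One update block: a header saying where and how much, followed
 * (conceptually) by kb_size bytes of data. */
struct kbind_block {
    void   *kb_addr;
    size_t  kb_size;
};

/* Stand-in for the kernel side: apply each block in the order given.
 * The caller exploits that ordering for the SPARC64 PLT trick:
 * block 0 rewrites instruction words 1..N, block 1 rewrites word 0,
 * so a concurrent caller sees either the old or the new sequence. */
static void fake_kbind2(const struct kbind_block *b0, const void *d0,
                        const struct kbind_block *b1, const void *d1)
{
    memcpy(b0->kb_addr, d0, b0->kb_size);
    memcpy(b1->kb_addr, d1, b1->kb_size);
}
```

The real kernel would do the permission checks, COW resolution, and word-wise atomic stores in between; here only the ordering contract is shown.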
Now, how much of that is because our pmaps could be better and our MP support could be better and other things could be better? But it at least removed 4% of the bottleneck. Note that it turns out almost all the savings actually come during the make install step, when we have lots of little short-lived processes where there's a lot of PLT resolution relative to the total computation time. I mean, GCC does have a lot of entries that it resolves, but it eats much more compute time relative to those. And one thing is, there's something not right with my UVM stuff: it's not always faster. There are test cases I've got where running the mprotect version is actually faster than the kbind version, and it should definitely be possible to fix that. One possibility is simply that kbind doesn't trigger any readahead, so it may be that I just need to get the UVM stuff to read ahead on the PLT and such. So, you know, maybe the system call has some problems. You have a system call that allows you to change read-only memory; what could possibly go wrong? To quote the wonderful Firesign Theatre, it's the power that can be used only for good or evil. So, is there some way we can lock this down? We don't want to just leave the system call hanging out there, allowing anyone to get around mprotect. I imagine there are some processes that would love to do that, but not all of them are on our side. So, can we fix it so that, in effect, it's only capable of being used by ld.so? We specifically don't want this to be a target for return-oriented programming. We don't want it to be possible to return into a call to this and then take advantage of the fact that it has changed some memory to effectively be executable, and go off and do other stuff. So we've had a number of ideas for how to lock it down. A bunch of checks here.
The first rule is that it should be possible for this to never fail for legitimate reasons. I mean, ld.so should never actually screw up, right? And if it does screw up, then heck, the process should probably die immediately at that point. So if any of these checks fails, we just kill the process, not even with a catchable signal; we call sigexit in the kernel directly. So, okay. We can lock it so kbind can only be called from one address in user space. We could pass a per-process cookie, maybe even a per-thread cookie. We could pass in the old data as well as the new data and have the kernel compare them. Or we could lock down exactly what kbind is allowed to touch by marking the pages in some way. For locking down the PC, we can just record in the struct process the address from which kbind was called. That's copied on fork and cleared on exec, and if it changes, if you try to call it from a new address, then we kill the process, of course. We can even make it possible for a statically linked program to guarantee it can never call this: it just makes one call that says, never let me do this again. The result is we can then do the syscall with inline asm in the dynamic linker. And there we get some leverage from one of our other security features, the stack protector: if you try to jump into the middle of the _dl_bind function in ld.so where this lives, sure, you might be able to get the function to hit the system call, but your stack won't have the right stack-protector cookie in it. So, cool, you made the call, and now you die. We can leverage that to make it a little harder for someone to cause problems. We could also pass a per-process cookie, kept in ld.so's own openbsd.randomdata segment.
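The caller-address lockdown is easy to model. This is a toy sketch: the names are made up, "killing the process" is modeled as an error return, and the real check lives in the kernel against the struct process:

```c
#include <assert.h>
#include <stdint.h>

/* Per-process state: the one code address kbind may be called from.
 * Zero means "not locked yet"; -1 models a statically linked binary
 * having said "never let me do this again". */
static uintptr_t kbind_caller;

#define KBIND_NEVER ((uintptr_t)-1)

/* Returns 0 on success; -1 models "kill the process". */
static int check_kbind_caller(uintptr_t pc)
{
    if (kbind_caller == 0)
        kbind_caller = pc;    /* the first call locks it in */
    if (kbind_caller != pc)
        return -1;            /* wrong caller: kill */
    return 0;
}

/* What a static binary would do at startup: block all future use. */
static void kbind_disable(void) { kbind_caller = KBIND_NEVER; }
```

The first caller wins; any later call from a different PC, including a ROP gadget, trips the check.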
So it automatically gets filled in, and on the first call the kernel remembers it, and after that we just check it. We can even order the logic in ld.so so that the _dl_bind code first says, okay, let me get the cookie; then does the actual logic of looking at the symbol and figuring out, oh, bar is at this address, printf is over here, or whatever; and then calls kbind. So if someone wants to jump into the middle of the function, they can't get the cookie-loading logic without also getting the symbol-resolution logic; they kind of have to take both of them together. But, as has been observed, an attacker could probably figure out where that cookie is in ld.so's section of memory, so it's a little unclear how good that one is. We could also pass a per-thread cookie and update it on each call. This gets really painful, because you have to be doing TCB management, the thread control block, correctly, which OpenBSD doesn't right now. This one is probably going to get dropped, just because it's not clear what the actual security benefit is. I've actually implemented it; it's running on my laptop. But in the end it's a pain in the butt, and if we can't even come up with a real attack vector that it protects against, then it's hand-waving. We could pass the old data, so the kernel can compare and make sure it's only changing something that looks like a PLT or GOT entry. But we're passing that data from ld.so itself, and there's actually a corner case where the binding can change mid-flight: if one thread started to resolve a PLT entry, and another thread dlopened another library, you could actually have a symbol that needed to change to a different value in the middle of the processing.
It would be one really hairy case to try to figure out that that's what happened, so it's not worth dealing with. More interesting is the idea of protected mappings. This one actually got code written to implement part of it: we can mark the PLT and GOT pages so that they can't be changed by anything else. This way we'd make sure that kbind is the only way those particular pages in the process can be changed, and that kbind can only change those pages. It limits the exact scope of this dangerous call. And then you also make sure that ld.so doesn't get unloaded, and stuff like that. So, the status of this: it's a work in progress, unfortunately. It works; I have it working on three of our architectures. But we're going to want to deal with these lockdown issues and make sure we're not creating too much of a headache for ourselves. And we don't want to commit something for just a couple of the architectures until we're pretty sure it's going to work for all of them. PowerPC, for instance: we need to switch its ABI over to the secure-PLT ABI. We need to do that anyway, I think, because the current one, I don't think, is even thread-safe in certain cases. So that's kind of a looming requirement. And I need to tweak some of the UVM stuff so it's consistently better. But it does close that W^X gap. Now, the other thing to note is that on OpenBSD we don't do some of the things some of you guys have done. I believe FreeBSD, and I think NetBSD as well, have both done a pretty good cleanup job on the symbols exported by libc and libpthread and so on, with the result that there aren't as many PLT entries in libc to start with.
I mean, you don't have to go through this whole song and dance for a PLT entry that doesn't exist, so best is to get rid of all of those. Right now, for instance, our AMD64 libc has 771 PLT entries, almost all of them for references to other pieces of libc, just on the off chance that someone wants to override printf from their executable. That sounds like more of a bad idea than anything else. So it would be better, we think, to do the cleanup job that you guys have already done and get it to the point where there are few or none. Any questions? Is the mic lit up? So, more of a comment than a question. I like that you pointed out that it does have security implications. One of the things you could do, which would cost a whole heck of a lot in performance, is sort of reimplement the runtime linker inside the kernel: also pass in the name of the function you're resolving, and have the kernel double-check that the new value passed in resolves to it; have the kernel crawl through those ELF headers and do the resolution. Yeah, that would basically amount to implementing most of the interesting part of the dynamic linker at that point. And walking that address space from the kernel would be terrifying to do sanely. That's a huge chunk of code which would be a horrible thing to pull into the kernel, and we'd rather try to figure out how to make it sane and avoid doing that. Maybe some of the bits about locking down the pages would keep us from having to go that far. The other note is that it would, of course, also make it much harder to make changes to the dynamic linking process. I mean, there are a number of enhancements to our dynamic linker, in functionality and performance, that we'd really like to do.
And putting it into the kernel... while doing this work, once I got a kbind system call into my kernel, I've been switching back and forth between ld.so versions, with only two or three cases where I've had to boot from a CD. It's been much easier to test various versions of the implementation and to flip back and do a performance comparison. I think we would rather throw the whole idea off a bridge than put the linker into the kernel. Let me know how you fix that, because I would be very much interested. For the performance testing, did you compare with running with LD_BIND_NOW? I did not compare with LD_BIND_NOW. What I did do is a three-way comparison that included running without any of the mprotects at all, that is, without the OpenBSD protecting of the GOT. And actually, there was very little difference between protecting the GOT and not protecting it. The performance costs I was seeing were a fraction of a percent, which actually told me that my tests are probably not exactly the right tests at this point. But no, doing some broader tests to say, okay, let's actually pay the cost of doing bind-now and see what happens, would be worthwhile. I have a hard time imagining that could ever be faster, but yes, I need to make a broader comparison of the performance without any of the protections, with the protections as they are now, and with kbind, as my performance metric. You had a question? I have a probably pretty simple question. When you mentioned that we need to remove exposed symbols in libc, does that mean marking more functions static? How does this work? Why are they exposed? So, there are a number of functions internal to parts of libc. For instance, there's __sfp or whatever in stdio's findfp.c, where you can't just mark it static because it's actually used by functions in other translation units.
So instead, at the ELF level, you can say, okay, mark this symbol as visibility hidden, and then the linker will eliminate it from the dynamic symbol table. How do you mark it? Yes, there are ways; Ulrich Drepper has a long paper about optimizing the hell out of all this stuff. It's an __attribute__((visibility("hidden"))) kind of thing. And if you do it right, the compiler can even say: oh, even though it's in a different translation unit, because it's marked as hidden, I can actually just do a relative call instead of going through the PLT. Any more questions? So, thank you.
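The marking being described looks like this in practice (GCC/Clang attribute syntax; the function names are invented for illustration):

```c
#include <assert.h>

/* Visible across translation units inside the library, but not
 * exported from the shared object: no dynamic-symbol-table entry,
 * so no PLT indirection for internal callers. */
__attribute__((visibility("hidden")))
int _internal_helper(int x)
{
    return x + 1;
}

int public_entry(int x)
{
    /* Because _internal_helper is hidden, the compiler can emit a
     * direct relative call here even in a -fPIC build, instead of
     * a call through the PLT. */
    return _internal_helper(x);
}
```

In a real library build these would sit in different translation units; hidden visibility keeps _internal_helper callable across files while removing it from the set of interposable, PLT-resolved symbols.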