Ladies and gentlemen, on this very last day of the conference, I have the honor and the pleasure of bringing to you two guys, Jeff Dileo and Andy Olsen, who are going to give the lecture called Kernel Tracing with eBPF. For those who don't know, that's the extended Berkeley Packet Filter, and what they're going to do is introduce you to this functionality of the Linux kernel. They will cover some practical uses of the technology beyond mere code profiling. And for me, the most interesting part is that they're going to play god a little bit here on stage. Did the power make you at least a little mad? A little bit. I don't know what kind of gods — maybe the Flying Spaghetti Monster will be around. Anyway, the stage is yours. Give them a warm welcome. Thank you. Welcome. I'm Jeff. This is Andy. We're security consultants at NCC Group. We hack things for a living, find bugs, and tell people how to fix them, and stuff. One of the things we've been playing around with recently is this eBPF stuff. eBPF is extended BPF — but what is BPF? BPF, for those of you who don't know (although maybe everyone does know), is the Berkeley Packet Filter, a bytecode and VM designed to run inside a kernel to process network packets really, really fast for filtering. eBPF is this sort of completely unrelated language that was designed in the Linux kernel as its own thing and basically just took the name. It's designed to be JITed, with a fairly direct mapping to x86-64 and other modern CPU architectures. It's built around transpiling a restricted, limited subset of C into this bytecode so it can run very fast. And it's being applied to anything and everything in the kernel that might benefit from having programmatic logic applied to it.
You interact with it through this bpf syscall: basically you send it some code, you send it some other metadata about what you want it to be doing, and if it likes it, it gives you a file descriptor. If it doesn't, it gives you an error. The real power of this stuff is what you can apply it to. So what does it look like? Well, this is one form of it. This is a horrific form of it — macro hell in C. Basically no one should ever be writing this. So what is eBPF? It's basically a BPF implementation that just threw all the rules aside, added a bunch of registers, and added a bunch of functionality to call arbitrary kernel functions that are registered into it. And it's being applied to all sorts of things. It has a bytecode verifier that we're going to henceforth refer to as "the validator," because it's not really good at verifying — nor validating, for that matter — but it tries. And the main functionality of eBPF really is the helper functions that are exposed to it, depending on the context of what you're doing. Your socket-filtering eBPF isn't going to be able to read arbitrary kernel memory, but your kprobe eBPF is. So why eBPF? High-performance in-kernel packet processing. Sorry — safe, high-performance in-kernel packet processing, with network tunneling and custom iptables rules. And safe, high-performance in-kernel programmatic operations, with syscall filtering, to reduce the need for buggy kernel modules — or really all kernel modules. Okay, wait, what? As eBPF has gained all of these features, the "why" keeps changing, and people keep finding new reasons to like it. So why, really? It depends on what you're doing. We like to hook all of the things, especially the Linux kernel. This gives us the ability to do that without risking crashing the kernel — which happens when you start writing kernel modules in C. The talk is about kernel tracing, after all.
So eBPF is interesting to us because it has the potential to give DTrace — the dynamic instrumentation framework for Solaris, and for BSD and macOS — a bit of a run for its money. Maybe not in the richness of the metadata it provides for its sources of events; Linux is probably not going to get to that level of unified design anytime soon. But it's more programmatic and focused on lower-level operations: you take C code, compile it to this, and then use the fact that you've basically written C code to operate on C code in the kernel. DTrace is focused on one-off, human-driven, command-line types of operations, but eBPF enables you to do all sorts of crazy, wacky things. So let's talk a little bit about tracing. Tracing is basically very fancy logging of program execution. That doesn't really mean much for us — we don't really care about the logging so much. We care about getting at things, hooking things, making them do things, and observing things. So we care about dynamic tracing. What is dynamic tracing? There are two main kinds. There's the kind where you enable and disable existing logging functionality, which we don't really care about. And then there's the kind where you add arbitrary functionality that wasn't there before. We care about the latter — but we also still don't care about the word "logging." We care about dynamic instrumentation. The two main kinds of dynamic instrumentation, depending on your perspective, are function hooking and instruction instrumentation. And depending on what you're doing and what you're targeting with your hooking, maybe your hooking is actually implemented with instruction instrumentation — and eBPF is basically like that. The alternate title of this talk is "Instrumenting Linux with eBPF for Fun and Profit," a title that would definitely not have gotten accepted in 2018 had we submitted it.
So let's go just a little bit into the history of tracing technologies in Linux. The most important thing is kprobes, which are over a decade old, but they've gotten faster recently. Nowadays they allow you to hook basically any instruction in the kernel, but if you try to hook a function at its entry, there's extra logic around seeing it through to completion on return. It'll try to analyze the function to determine whether or not it can take the fast path, and if for whatever reason it can't find all the exits, it will essentially just use a breakpoint and single-step through the function — which is slow, and you don't want that. Ftrace is basically a filesystem API for adding things like kprobes into the kernel and doing all sorts of other stuff. It's not super important from our perspective, but it is used as part of eBPF. Perf events is a whole bunch of crazy profiling stuff, but one of its key features is a very fast ring buffer for copying data from the kernel to user space very quickly. Tracepoints are essentially that former kind of dynamic tracing, where the functionality was already there. And uprobes are essentially kprobes for user-space memory. eBPF is this fancy combining robot that plugs all of the things together. It integrates with all of these different kernel technologies, but the core concept when using it for tracing is that you have your eBPF program and you plug it into some sort of data source using one of two APIs, currently. Then you use something like the perf events ring buffer, or a memory-mapped eBPF map, as your output to user space. The latter is actually a bidirectional mapping between user space and kernel space, so user space can update it and send input to the kernel. The sources that we're working with generally are your kprobes, your uprobes, your tracepoints, and raw tracepoints — which are basically tracepoints.
The way this works is that you make a whole bunch of crazy syscalls and chain them together. You start with your bpf syscall to actually make your program. Then you use the ftrace API in tracefs to make a kprobe and get an ID for it. Then you get a perf event by calling perf_event_open, passing in the value you got out — the ID of the kprobe. And once you've made that, you attach the BPF program to it, and then you enable it. That's the old format. In more recent kernels there's a slightly different form where you can skip tracefs entirely, but in practice it still ends up getting used, because the magic number on the slide here is generally still obtained through it — though you don't strictly need that. Everything else basically follows the same APIs, except for the raw tracepoints, which skip the perf_event_open entirely. How not to use eBPF: the way systemd is apparently using it, which is by writing the assembly by hand. It's really hard, it's basically impossible to do anything complicated or fancy, and it's also highly error-prone. I'd be very surprised if that code was correct. So what should you do? You should use this thing called BCC. It's basically a framework for compiling C to eBPF and then hooking it up to all of those fancy sources. This talk, I'd like to clarify, is not really about BCC, but we basically end up using it for everything, because it's essentially the only consumer of the kernel API. That's also the case because the people developing BCC are the same people writing the BPF kernel code — so essentially they're writing their own APIs for themselves, which are otherwise essentially undocumented. And where they are documented, the documentation is completely lacking; none of that multi-syscall chaining is really documented anywhere. I got that from reverse engineering how this stuff works. So BCC is the only real option for doing this hooking.
So how do you write a tracer with BCC? Generally you're going to use the Python API that it provides, and you're unfortunately going to have to write it as a single Python file. The bigger the code gets, the more complicated and harder it is to write. And your C code you're generally going to store in a Python string. The overall structure is: you have your string, you tell BCC to compile it to BPF, and you tell it to register it to events like kprobes, along with your maps for I/O and your event-handler callback functions. And everything just works. The single-Python-file thing is actually due to a weird limitation in BCC that will probably get fixed at some point, but right now it's sort of annoying to work around. So let's write some code. This is what a very simple eBPF kprobe looks like when using BCC. BCC has a lot of code generation going on behind the scenes, so this fancy kprobe__ syntax basically lets it know that you want a kprobe on sys_open, which is the kernel side of the open syscall. The name of this changed between kernel versions — at least recently there was a change, and it has a slightly different syntax now, but it works the same way. So we hook this thing, and then we just call bpf_trace_printk, which is sort of similar to your generic printk if you've ever done kernel module development, and then we run it. And we get nothing. Why do we get nothing? Oh, yeah — glibc. Piece of crap. glibc has, for the past while, basically made all of your attempts in C code to call open actually go to the openat syscall instead. So we need to hook openat instead, and once we do that, we start seeing events from all over the system. But let's generalize the code a bit — why should we have two kprobes when we really just want one piece of functionality? Because, as it turns out, both open and openat go to the same place:
do_sys_open, which is the underlying implementation in the kernel for both of them. So now we can hook them both at once, and in the printk we put a %s and try to print out the path name. And we start seeing all of these random procfs things from systemd — journald is doing all sorts of wacky stuff, and systemd too. Really scary stuff. But bpf_trace_printk is actually considered harmful. The reason is that it's a lot like ftrace in that there's one log buffer shared across the whole system, which means messages from different active tracers run into each other and no one knows what belongs to whom. Your eBPF programs also have a race condition with this, because the way eBPF lifetime works is that a program is anchored to the process that created the object or maintains control of the file descriptor. When the last remaining process with a handle on that FD dies, the eBPF program gets unloaded, and it detaches from whatever kprobes it may have been attached to. So there's a weird race: your process terminates, but if the kprobe is still registered and hits, the code still runs and tries to log; the kprobe only gets detached afterward, and there was no client left to receive the message. The messages just stick around until someone decides to read them. So the next process that tries to trace with trace_printk immediately sees messages from other people. That's a problem. So instead, you have to rewrite the whole thing to use your own sort of custom data-passing mechanism. The way we do this is with the perf events ring buffer, via a fancy macro that BCC provides to our C code.
We also set up a scratch space, because the eBPF stack is 512 bytes, and we're operating on paths that are being opened — and paths have a max length of 4096 on Linux. So that's not going to fit. We need some off-stack scratch space to put it in, so we use a per-CPU array, which is entirely safe as long as you stay within the one thread of execution. Then we have this annotation for how we access the field in it. We only have one element; it's always going to be there; this is never going to fail — but we have to put the check in to please the eBPF validator. After this, we just copy all of the data from the call into the scratch space, and then we perf_submit it, which copies it into the fast ring buffer for sharing. That's a lot of copies, unfortunately. There are ways to optimize this, but we're not getting into them right now. On the Python side — if you're familiar with Python's ctypes, it's an API that allows you to interface with native code — we define a structure that mimics our C structure, and then we register a handler function that will get called every time the perf_submit happens. We register it against the table named "output", which is the table we defined in the C code. Inside the handler, we do some casting, and then we print the data. At the end, we set up this kprobe_poll call on the BPF object that BCC provides us — the naming of this actually changed in one of the more recent versions of BCC, but it's still there — and it polls all of the perf events for us. So how does this all actually work? Let's go through a little bit of cleaned-up strace output. The main thing to focus on here is this BPF_PROG_LOAD at the top with a type of kprobe, because we want to make a kprobe. We get a file descriptor of 5 for that. Then all of the /sys/kernel/debug/tracing stuff is the ftrace API in tracefs.
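The Python-side casting described here can be sketched standalone with ctypes — a minimal model, where the `Event` layout is a hypothetical example struct for illustration, not the exact one from the talk:

```python
import ctypes

# Hypothetical C-side event layout, e.g.:
#   struct event { u32 pid; char path[256]; };
class Event(ctypes.Structure):
    _fields_ = [
        ("pid", ctypes.c_uint32),
        ("path", ctypes.c_char * 256),
    ]

def handle_event(raw: bytes) -> Event:
    # BCC's callback receives raw bytes/pointer for each perf_submit;
    # reinterpreting them as the mirrored struct is the "casting" step.
    return Event.from_buffer_copy(raw)

# Simulate a payload that perf_submit would have copied out of the kernel.
raw = bytes(Event(1234, b"/etc/passwd"))
ev = handle_event(raw)
print(ev.pid, ev.path.decode())  # 1234 /etc/passwd
```

The key point is that the Python structure must match the C struct's layout byte for byte, padding included, or every field after the first mismatch is garbage.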
The way you create a kprobe through it is that you write a probe definition into it, and then you read back the ID of the probe you named to get its actual ID value. Then you open a perf event on that ID, passing it in as the config. It's a very flexible call sequence — the config may be one of anything, depending on what type of thing you're actually doing; it's very much like an ioctl. Then we take that file descriptor and actually call an ioctl on it to attach the eBPF program to it. Then we enable it. That way our eBPF program is actually associated with the kprobe through the perf event. So what does this BPF perf-output macro actually do? Fancy macro. Actually, it doesn't do anything, if you look at it. It creates a dynamic struct type and then creates an instance of it — but it never actually fills in the fields. That's a bit weird, because basically it does nothing. It's fake. The whole thing is completely fake: it's mocked-out code that's going to be replaced with code generation anyway, so it just needs to pass the compiler's quick checks, and then it gets replaced. This is actually a pretty common way to do codegen-based APIs. So what exactly is going on when we call this perf_submit over here? It turns out this gets replaced with a call to the bpf_perf_event_output helper function. The important thing about it — other than that it passes all of the arguments in — is that it passes in the BPF_F_CURRENT_CPU flag. What this does is attempt to pull a perf event object on the kernel side out of a perf event array, using the current CPU as the index. The reason there is a perf event array there is that back in that strace output we had a couple of these BPF_MAP_UPDATE_ELEM bpf syscalls, which set indexes 0 and 1 with some perf event file descriptors.
So at the beginning of this whole thing, the first thing we actually do is create that perf event array, and we get a file descriptor of 3 back for it. Then, towards the end, we start opening up these perf events — in green are 0 and 1, which are actually the CPU indexes — and we get file descriptors back for them. Then we call this update_elem, using the file descriptor of the perf event array, and we pass in a key and a value that are these weird hex numbers. Those are actually pointers: the key is actually a pointer to the value 0 — CPU zero — and the value is actually a pointer to the value 8, the perf event file descriptor. And vice versa at the bottom, where the key points to 1 and the value points to 9. Then we just start polling on these file descriptors, and everything's great. So let's switch gears a little bit. Let's talk about eBPF validator hell. It's a very, very painful place to be. To make eBPF safe — because otherwise it's doing all sorts of wacky things in the kernel; it's real code calling functions that can read and write memory, taking arbitrary values — it needs to be checked, so the Linux kernel attempts to validate all of the code before actually loading it. There are the simple kinds of checks that everyone's aware of: it's not allowed to loop or jump backwards, to prevent infinite loops, because this stuff runs straight inline in the kernel. It doesn't get preempted or broken out of — if it manages to get into an infinite loop, the whole kernel hangs, at least on that one thread. But even if your code doesn't have any loops, the validator may reject it for a whole host of reasons. You could have calls that aren't static inline, and then you're technically jumping either forward to the function, and then back when it returns.
Or maybe the function is defined before you, and then you're actually jumping backwards to call it, and the validator doesn't like that. So then you start unrolling all of your loops to get around that. But then weird compiler optimizations kick in. Sometimes your code was actually wrong to begin with — but BCC compiles with an optimizing pass through Clang and LLVM, and if your wrong code uses a small enough constant as a max operation count, the compiler will just inline the whole thing and unroll it. But depending on what you do elsewhere in your code, as you add features slowly but surely, maybe something about the optimization changes and it's no longer able to do that. And then you get an error that doesn't make any sense, because the part of the code that you changed has nothing to do with where the error is coming from. The validator also tries to ensure that all of your calls to those helper functions are passed safe arguments, but the logic for this is kind of bad. There are all sorts of cases where you clearly have bounds that you're not breaking out of and everything is fine, but the validator just doesn't realize it — doesn't pick up on it. The reason is usually that the optimizer has actually cut the checks out. The compiler is smarter than the validator is, and this leads to a lot of problems. The more code you add to make your code obviously safe with bounds checks, the more aggressively the compiler optimizes them out — and then the validator is left with nothing. So you need to play really crazy tricks to prevent the optimizer from eliding the checks, so that the validator can see that you're doing them. And then, additionally, depending on updates to either BCC or the Linux kernel, you may have valid code that becomes invalid, or invalid code that becomes valid. Anything can happen. Some of these validator behaviors are really creepy.
Like, we've had code that was rejected or accepted based on whether a function returned a bool or a size_t that was either zero or one. It was being stored into a uint8_t. And depending on other parts of the code that we would change — like commenting out one line somewhere else entirely — either the bool version was accepted or the size_t version was. We'd comment out one line, and one of them would be accepted and the other rejected. It made no sense at all. So at a certain point I got really mad, and I wrote this kernel module that just really hackily hooks into the validator to bypass a lot of the checks, so I could just write code when I needed to, right? The damn validator doesn't know what bounds are, so I will tell it what the bounds are: they're all safe. It turns out the validator is actually super tightly coupled to the interpreter. In fact, you can't just turn it off, because as it processes through the code, it sets all of these fields on parts of the code that are necessary for it to actually load and run. So you have to put in these really careful hooks to skip certain checks, and then set up fake bounds on other parts after they've been populated with bad values, so you don't get overruled. So we have this hacky kernel-module PoC called YOLO eBPF. Of course. Of course. It implements its own custom function-hooking implementation, because that's useful in weird cases where we have eBPF but don't have the ftrace API available, because someone modified something in their kernel config. Caveats: it's super x86-64-only, like, very specifically — I literally have shellcode in there as strings in arrays. And it probably doesn't work with current kernel versions — almost certainly not 4.20, which dropped a couple of days ago. And, you know, if you have unsafe eBPF code, it could very well crash your kernel. We're making this code available more to prove a point than anything else. No one should actually use it. We have a link to our repo at the end of the slides.
We'll actually be making it public right after the talk. So, if you don't have the luxury of such a hacky kernel module, what can you do? Well, there are a couple of things you need to do to appease the validator. One is to initialize your stack memory. This is actually a generic sort of vulnerability in network and kernel services — anything that crosses a trust boundary. If you have a struct on your stack (or anywhere else), you haven't initialized the memory, and it has padding between the fields, then when you try to do something like memcpy it, or just write it over to somewhere else, you're going to copy whatever is sitting in those padding bytes — which is maybe not what you meant to do. eBPF really gets mad at this, for no great reason: you're super privileged when you're doing this, and you're already intentionally reading kernel memory, so it doesn't entirely make sense that it complains. But if you're copying something from the stack, you basically need to do everything you can to just not have those padding fields. Next, you want to eliminate your loops. So: a lot of #pragma unroll, through Clang/LLVM, to statically unroll any statically bounded loops, and then everything else is basically static inline function calls. Newer kernels relax some of this, but no one's on those kernels yet, so it doesn't matter. Then there's a lot of kernel code — even static inline functions in headers that are meant to help you parse various kernel data structures — that doesn't really work that well in BCC. That's because every time BCC sees a pointer dereference into memory that's not in the BPF memory region, it instruments it: it code-gens it into a call to a helper function called bpf_probe_read that actually pulls the data out.
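The padding issue can be seen directly from user space with ctypes. This is a minimal sketch — the struct names are made up — showing the hidden bytes that a byte-wise copy of an uninitialized C struct would leak:

```python
import ctypes

# With natural alignment, the 1-byte `flag` is followed by 3 invisible
# padding bytes before the 4-byte `value`. In C, those bytes hold
# whatever stale stack data was there, and a memcpy ships them along.
class Padded(ctypes.Structure):
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("value", ctypes.c_uint32),
    ]

# Forcing 1-byte packing removes the padding (one way to avoid the leak;
# the other is to memset the struct to zero before filling it in).
class Packed(ctypes.Structure):
    _pack_ = 1
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("value", ctypes.c_uint32),
    ]

print(ctypes.sizeof(Padded), ctypes.sizeof(Packed))  # 8 5
```

Three of `Padded`'s eight bytes correspond to no field at all — exactly the bytes the validator is worried about when it rejects copies of partially initialized stack structs.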
And so it's not great at converting chained calls, or anything within weird scoping contexts in C — some of that will just fail to convert, and it just won't work. So basically, there are a lot of kernel structures — things like the task_struct, or anything you want to get data out of — where you're going to have to call bpf_probe_read manually and pull it out piece by piece, which is really annoying. Then, depending on what you're doing, if you're trying to go for super-high performance, you may need to implement your own ring buffer to send data down to user space with as minimal an amount of copying as possible. If you try to do this, remember that you can't loop, so you need a switch that gets hit every time and attempts to update a synchronized counter variable, and at the end it needs to roll back over. This is actually very carefully written code that appeases the validator: in this particular case we're just using slots zero and one, and if the counter hits two, we're supposed to flip back over to zero. But if you try to actually put a case 2 in here, the eBPF validator just won't accept it, for whatever reason. The default case has to do the rollover — you can't have another explicit case do it, for whatever reason. And then a lot of the kernel data structures are actually very dynamic — not just in the sense that they're pointers to pointers to pointers, but in the sense that they're these crammed structs that are byte-copied into each other all ham-fistedly. If you want to pull things out of them, you need to essentially walk dynamic structures of pointers to things and extract data as you go, and eBPF doesn't like loops, so you need to do that statically, with both unrolled loops and static inline functions. You just need to be very careful — it's very tricky, but it'll usually work.
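As a rough illustration of that slot-advance logic, here is a Python simulation of a hypothetical two-slot ring, where the default branch — not an explicit case — performs the rollover, mirroring the shape the validator accepted:

```python
# Validator-friendly slot advance: no loop, no modulo on a map-read
# value -- just explicit branches, with the *default* branch rolling
# the counter back to slot 0. (In the eBPF C this is a switch statement;
# adding an explicit `case 2:` for the rollover got the program rejected.)
def next_slot(current: int) -> int:
    if current == 0:      # case 0: advance to slot 1
        return 1
    return 0              # default: anything else rolls over to slot 0

# Walk the ring a few times to show the cycle.
slots = []
s = 0
for _ in range(5):        # user space may loop freely; the kernel side can't
    slots.append(s)
    s = next_slot(s)
print(slots)  # [0, 1, 0, 1, 0]
```

The point is not the arithmetic — `(s + 1) % 2` would be the obvious form — but that the obvious form gives the validator bounds it cannot track, while the branch-only form does.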
The most important thing when doing this, though, is actually to know when you don't have enough unrolled loop iterations to fully walk all the data, and to bail gracefully when that happens — to know that you're truncating the data somehow. And then, lastly — or not quite lastly, but one of the more super-important things — when you're doing things in the kernel with eBPF, the reason you do it there is that you want it to be fast; otherwise you'd just use auditd for most things. So you want to actually be doing a little bit of preprocessing in the kernel, to do filtering. But to do that filtering, you need to do things like comparisons — and if you need to determine that a PID was within a set, how are you going to do that without recursion or loops? Well, what you can do is have something like a balanced binary search tree in user space, walk through it there, and code-generate what the comparison would be, statically, into the eBPF code. This actually works quite well. And then lastly, one of the things the eBPF validator really hates is dynamic-length byte copies, where the length comes from something that you read from somewhere else — mostly because the compiler will optimize things, and the validator is left in the lurch without enough information to know what's going on. The way we got around this was to basically use a weird static inline function that would kind of break up the logic, and then finally it started passing. This stuff is crazy. And then one of the most helpful things you can do, for BCC at least, is enable the full debug output, which dumps all of the eBPF instructions, but also dumps the preprocessed C code that is being implemented by those instructions. So when you get an error, you can actually look back at that and figure out where in the assembly something went wrong. Good luck. That's easy, right? I'll pass it over.
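One way to picture that code-generation approach — a small Python sketch (all names illustrative) that emits a loop-free, nested-ternary C expression from a sorted PID list, the kind of fragment you could splice into generated eBPF C:

```python
# Generate a branch-only membership test over a sorted PID list: a
# static binary search unrolled into nested C ternaries, so the eBPF
# side needs no loops and no recursion at runtime.
def gen_membership(pids, var="pid"):
    pids = sorted(pids)

    def rec(lo, hi):
        if lo > hi:
            return "0"                      # not in the set
        mid = (lo + hi) // 2
        p = pids[mid]
        # equal -> hit; less -> recurse left subtree; else right subtree
        return (f"({var} == {p} ? 1 : "
                f"({var} < {p} ? {rec(lo, mid - 1)} : {rec(mid + 1, hi)}))")

    return rec(0, len(pids) - 1)

expr = gen_membership([11, 3, 42, 7])
print(expr)
```

The generated string is an ordinary C expression of depth O(log n), so the comparison count stays small and statically bounded — exactly what the validator wants to see.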
So, since we're security consultants, we had to ask ourselves: can eBPF be used for defensive measures? We said, why not? I mean, eBPF is really fast, and if we're successful, we can improve the state of auditing. So yeah — let's do it. What does security monitoring software do? Basically, it just watches everything: program executions, file accesses, network traffic, administrative operations — and hey, eBPF kprobes can do all of this. So what would make this a good fit? Well, tracing eBPF programs can see everything: they can hook into any internal kernel function, and we can observe all kernel and user-space memory. To start off, let's implement some example monitoring tasks. We're going to begin by hooking the execve syscall, and just keep it nice and simple. A nice refresher: we've got the kprobe__ prefix and then sys_execve, and just a nice simple trace printk. All these examples are going to keep to that, so as not to be confusing. So now let's do something more interesting. We can compare the file path against known standard directories where binaries would be — for example /bin or /usr/local/bin — and we can do this just by checking the filename parameter of the syscall. So we have that set up, with our directory prefix defined. For this one we're just going to check /bin — it's just too complicated in eBPF to check all the other directories — and we have our unrolled loop there: we check byte by byte whether the path is in /bin or not, and if not, we just let the user know. So another thought that came out of this: if we have a web app, would we be able to know if it's executing stuff it's not supposed to? I mean, imagine we have a web app that's just a front end to ping — it takes an IP address from user input, the usual stuff — and we want to know if it's executing anything other than ping. So right here we have the C code, hooking sys_execve the same way.
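A rough user-space model of that unrolled byte-by-byte prefix check (in the real eBPF C, the loop has a compile-time-constant bound and is unrolled with `#pragma unroll`; the prefix here is the same `/bin/` example):

```python
# Compile-time-constant prefix: the loop's trip count is fixed, which is
# what makes the eBPF version unrollable into straight-line compares.
PREFIX = b"/bin/"

def in_bin(path: bytes) -> bool:
    # Each iteration corresponds to one unrolled compare in the eBPF code.
    for i in range(len(PREFIX)):
        if i >= len(path) or path[i] != PREFIX[i]:
            return False
    return True

print(in_bin(b"/bin/ls"), in_bin(b"/usr/bin/python"))  # True False
```

Note the check is purely textual, which is exactly the weakness discussed next: `./ls` run from inside /bin, or a path like `/bin/../tmp/evil`, sails right past a prefix compare.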
The PID in the #if is assigned from Python via a #define, using argparse, and if the current PID isn't equal to the PID of our web application, we just don't continue checking anything. So we copy the file path to a temporary buffer so we can get its length, and then we do a quick exit if the lengths aren't the same — but if the length does match, we make sure that it's not ping, and then let the user know. So now we're successfully monitoring file executions. Next, we can watch for file opens from certain directories. Okay, again, nice and simple, straightforward: we're going to hook do_sys_open, but to make this somewhat useful, we're going to see if something is accessed under the root directory. Same approach: we have the path set, unroll the loop, and check byte by byte whether it's actually under the root directory. But we just have to let you know something: all these examples are insecure. Dangerously insecure. Just because eBPF makes sure a program won't take down the kernel doesn't mean that the stuff it's doing is safe — and in fact, the limitations make it extremely difficult to write secure eBPF code. One of the things we noticed was a time-of-check-to-time-of-use issue, and we noticed this when you kprobe a syscall: the user-supplied data that you read from the hooked syscall can change between the time you copy it and the time the kernel copies it and actually performs the syscall. And this is relatively easy to test for. We start with a two-threaded program, where the first thread alternately copies two different file paths into one character array, and the second thread just opens that path. And we see some nice output here where everything is all mixed together. So how do we avoid this problem?
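The race described above can be illustrated deterministically. This sketch scripts the interleaving instead of actually racing two threads, so the hazard is visible on every run (the paths are made up):

```python
# TOCTOU against a syscall kprobe: the monitor copies the user-supplied
# path at syscall *entry*; a second thread then rewrites the buffer
# before the kernel copies it for real. Here the three steps are called
# in sequence to make the window explicit.
user_buffer = bytearray(b"/tmp/benign\0")   # memory shared between threads

def tracer_copy() -> bytes:
    # What the eBPF kprobe reads and logs at syscall entry.
    return bytes(user_buffer).split(b"\0")[0]

def attacker_flip() -> None:
    # The racing thread swaps the path inside the window.
    user_buffer[:12] = b"/etc/shadow\0"

def kernel_use() -> bytes:
    # What the kernel actually copies and acts on.
    return bytes(user_buffer).split(b"\0")[0]

seen_by_tracer = tracer_copy()   # monitor logs "/tmp/benign"
attacker_flip()                  # buffer rewritten mid-syscall
used_by_kernel = kernel_use()    # kernel opens "/etc/shadow"
print(seen_by_tracer, used_by_kernel)
```

The fix the talk recommends follows directly: hook a point where the kernel has already copied the argument into kernel memory, so there is only one read of the data.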
Well, our recommendation is to hook internal kernel functions instead of the syscalls, preferably at a point where the values have already been copied into kernel memory. There are some things we have to take into account when creating these, such as how file names work: what if the file isn't accessed by its absolute path, but from inside the directory via a relative path? Can we fix that? We could attempt to, by comparing the values or attempting to canonicalize the path ourselves, but Linux internal structs are complicated to parse, and it's just way more than eBPF was designed for. Or we can try to find an internal function later in the call path that already has the canonical path, but for this example it gets the same exact value the syscall gets. We also noticed that bcc has example code for network monitoring, and that it didn't properly calculate IP header offsets; specifically, it did not account for the fact that TCP options are variable length, so it was possible to spoof a TCP header. We sent them a PoC and, of course, a patch, and here we have the diff of that patch, where we now check the IP header length. In general, make sure you require strict validation, because eBPF doesn't have a copy_from_user, so you could basically be tricked into reading kernel memory through a user-space-supplied pointer; you have to verify those pointers manually. So we ask the question again: can eBPF be used for defense? Not directly; the limitations make it hard to use it in this fashion. Instead, it's much more useful to observe the data that flows through the system. So we have unixdump, which is basically tcpdump but for Unix domain sockets, and which demonstrates our successful fight against the validator; the
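Back to the bcc header-offset bug mentioned a moment ago: the fix comes down to honoring both length fields instead of assuming fixed 20-byte headers. A hedged user-space sketch of the corrected computation (field layout per the IPv4/TCP wire formats; this is not bcc's actual patch code):

```python
import struct

def tcp_payload_offset(pkt: bytes) -> int:
    """Return the byte offset of the TCP payload within an IPv4
    packet, honoring both variable-length headers instead of
    assuming 20 bytes each."""
    # IPv4: the low nibble of the first byte is IHL, counted in
    # 32-bit words, so options extend the IP header past 20 bytes.
    ihl = (pkt[0] & 0x0F) * 4
    if ihl < 20:
        raise ValueError("bad IP header length")
    # TCP: the high nibble of byte 12 of the TCP header is the
    # data offset, also in 32-bit words; this is exactly where
    # variable-length TCP options bite.
    doff = (pkt[ihl + 12] >> 4) * 4
    if doff < 20:
        raise ValueError("bad TCP data offset")
    return ihl + doff

# 20-byte IP header, then a 32-byte TCP header (12 bytes of options):
ip = bytes([0x45]) + bytes(19)
tcp = bytes(12) + bytes([0x80]) + bytes(19)  # doff = 8 words = 32 bytes
print(tcp_payload_offset(ip + tcp))  # expected: 52
```

A filter that hardcodes offset 40 here would read 12 bytes of attacker-chosen options as if they were the payload, which is what made the header spoofing possible.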
link for that will be at the end of the slides. This basically hooks the stream sendmsg and datagram sendmsg functions and retrieves the message header contents from them. We use Python, like Jeff mentioned before, to dynamically generate C code; this is used to tweak the eBPF program at "runtime", in quotes, and it also helps us get around the loop restrictions. Here is the binary search tree that Jeff mentioned before, and the per-CPU array that keeps track of the ring buffer slot. Because we can't loop, we have to generate the switch statement that Jeff also mentioned, and this is the code for that. And we have to make sure we're checking the values in both kernel space and user space, because there's a race condition where the C side might overwrite a value because it doesn't know that the Python side is finished with it; so we make sure the C side knows not to reuse a value that's still in use. If eBPF isn't that good for defense, what else can we use it for? Let's talk about offense. So, let's assume someone bad gets some privileges on a modern Linux system, maybe CAP_SYS_ADMIN in a container, because containers are safe, right? What could they do with eBPF? A lot, actually. All that tracing stuff can not only see everything, it can also write user-space memory. I repeat: it can also write user-space memory. bpf_probe_write_user is the magic
call we can make from all these tracers. It allows us to write only to writable user-space memory, so unfortunately we can't write to the text section, but basically everything else we can write. So, is there anything useful in those regions? Buffers for reading and writing data coming through the syscalls; maybe there's some sensitive stuff being read on a sensitive file descriptor; maybe code is being executed from it; maybe there's a privileged process doing that. Oh yeah: cron. So I wrote this thing called conjob, which spoofs cron jobs. It is the cron auto-pwner. It hooks all of the stat family of syscalls, and whenever cron is looking for information to know whether it needs to reload /etc/crontab, it stats the file and looks at the last-modified date; we just update that to be very recent, so it always reloads the file. Then we hook openat and close, so any time that file is opened we save the fd that's going to be used for it and store it somewhere for later, and if it's closed we stop keeping track of it. Then we hook read for that fd, and if we see a read there, then in the kretprobe we stomp over what the contents of the file were actually supposed to be with our own little root command, as one does. Brief demo: this is actually running in a container, slightly tweaked, because Docker happened to update their AppArmor profile a little bit after we made this and it killed one of the vectors this used; we could do a little bit more before, but we've basically just updated it to do the attack. So, as we see here, this root
command is injected right at the top of the crontab. So that's cool. What else is fun about this? We use some per-CPU maps to have all the pieces communicate safely with each other, then some hash maps to pass data between the pairs of those hooks, and we also use this nifty helper function that lets us keep incrementing the time forward without having to go back to user space. So what else can we do with eBPF? We're going to go for broke. If you recall, we can write to the stack, and the stack has return addresses. We can also read the stack and all of user-space memory, so we can scan for the text section and shared libraries. I think people see where this is going. So I wrote this thing called glibcpwn, which is the systemd auto-pwner. Basically, it scans PID 1's memory for libc, backs up the stack contents, then injects a ROP chain into the process right where it's about to return to. The ROP chain basically runs a dlopen on a path we provide, we load our malicious library into PID 1, and then we clean up after ourselves, all nice and clean. To talk about what's going on: we have this piece of it called "find your libc", which dumps out a return address and the base of libc, and then we plug those into the "pwn libc" function, which goes and runs the ROP chain, which I assembled by hand from glibc, which has lots of fancy gadgets, all very useful. So we run this, and eventually it returns, and it loads our shared library, and this C code is what was actually there.
So it's going to do a syslog, and then it's going to run id and output it to /tmp/evil; /tmp/evil is now filled with root, and we've also seen a log coming from PID 1, systemd: "hello 35C3". We don't have a lot of time, but to briefly go over how this whole thing works: essentially we hook on timerfd_settime, because all of systemd basically calls this thing once a minute, guaranteed; it's very reliable. We then scan forward from the stack-based struct that it passes in so we can find the return address, and we basically scan for return addresses back into the syscall stub, because we know what's in there: it's going to set rax to the right syscall number and then do a syscall instruction. So we keep scanning forward until we find it, and once we find where the syscall stub is, we use the offset we know from the beginning of libc, and we return both the return address pointer and the libc base. After we made this, we realized there was a hacky way to actually get all of the registers associated with the process, because eBPF in bcc doesn't normally provide them to you, so you have to scan memory to get them; this technique is still useful for other things, though. Then we hook timerfd again, and in the return probe we back up the stack for safekeeping, then write the ROP chain in, then return to user space. timerfd_settime then returns right to where the instruction after the syscall was, which returns into our ROP chain. Our ROP chain sets up all the arguments, and it then calls dlopen.
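The stack scan described above, walking forward one slot at a time until a saved return address lands inside libc's syscall stub, can be mocked up in user space like this (all addresses and offsets are made up; the real tool reads PID 1's stack from eBPF and knows the stub's offset from the start of libc):

```python
import struct

def find_return_into(stack: bytes, lo: int, hi: int):
    """Walk a little-endian stack image one 8-byte slot at a time
    and return the byte offset of the first slot whose value lands
    inside [lo, hi), e.g. inside libc's syscall stub."""
    for off in range(0, len(stack) - 7, 8):
        val = struct.unpack_from("<Q", stack, off)[0]
        if lo <= val < hi:
            return off
    return None

# Fake libc mapped at 0x7f0000000000, syscall stub 0x11000 bytes in.
libc_base = 0x7F0000000000
stub = libc_base + 0x11000
# Fake stack image: two local values, then the saved return address.
stack = struct.pack("<QQQ", 0xDEADBEEF, 0x1234, stub + 0x12)
print(find_return_into(stack, stub, stub + 0x40))  # expected: 16
```

Once the matching slot is found, subtracting the known stub offset from the recovered address yields the libc base, which is what the ROP chain is built against.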
After it's done with that, we don't want our module to have to clean up after this whole thing, so what we actually do is have the ROP chain clean up after itself. We call close with a magic negative value that otherwise isn't going to do anything; the kernel isn't going to care, but in our hook on close we use it as a signaling value. When that is hit, we write back most of the original stack, except for the last remaining gadget; then we write a new ROP chain past the end of where the stack originally was. The reason we do it in two parts is that the dlopen itself could potentially use up a large amount of stack space and would have clobbered anything we put there. Then we return to user space into the last part of our old ROP chain; the last gadget shifts RSP forward to the new ROP chain, and the new ROP chain just writes back over the last remaining parts of the old one with the original values, then clears rax so that the timerfd call we finally repair our way back to appears to have succeeded. And glibc is actually super stable: all of these gadgets work across multiple versions of libc across different Linux distros, which is all great and super important. You'll have to regenerate the specific offsets, but they're all basically the same gadgets. So what else can we do with eBPF? Use it as intended. Once you're running that kprobe, you shouldn't let anyone stop you, right? You can prevent processes from interacting with the kernel. For example, you can prevent processes from listing eBPF programs and kprobes, you can stop them from creating eBPF kprobes, you can stop them from loading kernel modules, and you can stop them from phoning home. This is important because bpf_probe_write_user, when the BPF program is compiled and loaded, triggers a bit of logic in the kernel that sends out a dmesg
warning that it's in use and may corrupt things. So that's a pretty good warning sign, but you have a weird race condition: anyone watching dmesg has to read from it, and if they then attempt to use any syscalls to phone home, you can stop those. They would basically need to already have something like DPDK doing non-syscall, direct memory-mapped packet I/O to a network device so that you can't catch them; but even then there are all sorts of wacky things you can do to stop them from phoning home. There's also this magic bpf_override_return helper, which we had problems getting to work, but it's supposed to be able to just stop a syscall from happening. The one downside of all this is that we need to keep our eBPF process alive, because once it dies, the whole thing goes with it. What if we could make our eBPF kprobes basically immortal? We've already taken over PID 1, so why not just run our other eBPF kprobes from PID 1 itself once we're inside? This means we stay alive until the system shuts down; or vice versa: if PID 1 dies, the system goes down with it. Which is great for us, because PID 1 is now systemd for the most part, and everyone knows that thing is just falling apart all over the place; they're not even going to think they're being attacked by space-age eBPF rootkits. So eBPF is useful for everyone, except people trying to build an IDS on top of it, which makes no sense; it needs to get a lot better at supporting that use case, to be honest, and right now it just isn't there. So, a plea to kernel developers who might be writing these things: you really need a copy_from_user so that not everyone has to check the pointers manually, to catch someone trying to open a path that is actually a kernel address. You also maybe want to provide helpers for all these trickier file structures and the like, so people can actually get full paths out of
them. Maybe give us some direct memory-comparison operations so we don't have to do these weird things, and also memset; please, memset. I don't like having to waste my limited instruction bytes performing a memset. So, some greets: the bcc developers have been super helpful and are building really cool tooling; all this stuff is just getting better every day. Julia Evans has great blog posts on how kprobes and all of this work, same with Brendan Gregg, and Jessie Frazelle has been building this thing called bpfd, which is almost certainly going to be the thing I'd want to inject into PID 1 as my rootkit manager, definitely, because it manages all the things. You can't hide from the future. Are there any questions? So instead of going into god mode, you went into Flying Spaghetti Monster mode, didn't you? Questions, ladies and gentlemen: number two here. Hi there, great talk. Can you just turn off compiler optimizations in BPF and save yourself a lot of trouble with the validator? If you want to go modify the piece of the code that does it: that whole part of the code is in C++, because it's all LLVM-based, so you'd have to rebuild bcc to do it, I think. There is the ability to set other flags on it, but I'm not sure whether it applies them before or after it does the injection, and whether you could then override the optimization. Thanks. Number three: which kernel versions have you tried this with? Because there is a lot of improvement in every release. A lot of this we've been basing on primarily two kernels, I think: 4.15 from Ubuntu 18.04 and 4.18 from Ubuntu 18.10, and I want to say 4.16 or 4.17 from Kali Linux. Okay, number two: seeing that it's easy to inject stuff with eBPF, how can I prevent being pwned in the first place if I need it anyway? I see that I need eBPF in the kernel for
base stuff like firewalls, but how do I stop unknown scripts I'm installing from injecting more eBPF into my kernel? Sorry, can you say that once more? How do I make it hard for unknown scripts to inject more eBPF into my kernel? Oh: you need CAP_SYS_ADMIN to do most of this stuff, and at least in containers there are a lot of operations covered by AppArmor profiles and SELinux that stop people from interacting with the sysfs interfaces used for a lot of this. The problem is that in newer kernels they've implemented another way around this that just involves using either the direct bpf syscall or perf_event_open, without going through sysfs, so that's probably going to completely bypass that protection. Sorry, number two: do you know if any cloud providers that offer containers as a service are vulnerable? We haven't looked. If they're giving you CAP_SYS_ADMIN for real, and it isn't user-namespaced, they've got other problems. Going from CAP_SYS_ADMIN to full root is more of a neat party trick; it's not really a full privilege escalation. But if someone is trying to apply all these other restrictions on your container and you happen to find a way out with this, that would be bad; then again, they probably shouldn't have given you CAP_SYS_ADMIN in the first place. Great. Someone out there on the internet with a question? Someone else? No? Well, Jeff and Andy, we're going to shut it down here. Give them a warm applause, please.