 Okay, so my name is Paul Chignon, and I'm going to talk to you about some work I did on BCC with the Clang rewriter, and I'm going to talk about what BCC is first. So the BCC project is a number of things. I'm not a maintainer of the BCC project. I contributed a few patches to that project, and I'm going to focus this talk on what I contributed. So it's mostly known for being a collection of BPF-based tracing tools for Linux. It's been demoed at a number of conferences, but it's also a library to ease the development of BPF programs. So you have a Python layer, and you can use that to develop programs. I'm going to dig a bit into what that is. But first, let's talk quickly about what BPF is. So from a high-level view, BPF is a way to extend the Linux kernel. It's a bytecode that you can load in the kernel. It's going to be attached to different hook points in the kernel. So for instance, you can attach it to drivers, to kprobes, to different tracepoints. That's it. And the main point about BPF is that it's statically verified in the kernel before running it. So when you load the program, it's going to be statically verified. And then at runtime, there are a few additional checks, but it's mostly static verification. So these verifications are there to prevent crashes, mostly. For most BPF programs, you have to be root to load them in the kernel and to run them. But in some cases, you can load BPF programs in the Linux kernel without root privileges. So in this case, the static and dynamic verifications are also there to make sure you can't escape the BPF VM, of course. So one thing you have to know about BPF for this talk is that from the bytecode in the BPF VM, you can call external functions. So these are functions implemented in the Linux kernel, outside the BPF VM. 
They're used to do a number of things you can't do from inside the VM. So for instance, here on the left-hand side, I have two examples of functions in purple. They're used to access data structures outside the BPF VM, so persistent data outside the BPF VM. Okay, so as far as tracing tools in BCC, there are a large number of them. They cover lots of different things in the Linux kernel. They go from tracing latency at the driver level to system calls. The main interest of using BPF for tracing is that you can aggregate data in the kernel before retrieving it in user space. So for instance, if you want to measure disk latencies, you're going to measure disk latencies in the kernel, you're going to compute something like a histogram in the kernel, and once you've finished, you're going to just send the aggregated data to user space. You don't have to go back to user space for each measurement. So at this point, I was supposed to do a demo, but since I'm not on my computer, I'm not going to be able to do it. The demo was just to try and show you the interest of BCC. So it was just demoing a tool to measure latencies and another one to measure packet drops. So this is one of the tools I was supposed to demo. You're not supposed to look at the details here. I'm going to go over them later. But the main thing is this is a BCC script. The part on the left-hand side is the BPF program. You can see that there is a huge string inside a Python script. So the huge string is actually a C program that's going to be compiled down to BPF bytecode. On the right-hand side, you have everything you need in user space to manage the program and to retrieve statistics and display them to the user. So this interface to load programs and manage them in user space is provided by BCC, and I'm going to talk about the rewriting of the C code on the left-hand side. 
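The in-kernel aggregation idea mentioned above can be illustrated without BPF at all. The following is only a plain-Python simulation, assuming made-up latency samples and hypothetical helper names (`log2_bucket`, `aggregate`, `render`); it is not BCC code. The point is that the "kernel side" folds every measurement into a few power-of-two buckets, and only those buckets would cross over to user space.

```python
from collections import Counter


def log2_bucket(value: int) -> int:
    """Return the index of the power-of-two bucket that holds `value`."""
    return max(value, 1).bit_length() - 1


def aggregate(latencies_us):
    """Stand-in for the kernel side: fold each measurement into a histogram."""
    hist = Counter()
    for latency in latencies_us:
        hist[log2_bucket(latency)] += 1
    return hist


def render(hist):
    """Stand-in for the user-space side: format the aggregated buckets."""
    lines = []
    for bucket in sorted(hist):
        low, high = 1 << bucket, (1 << (bucket + 1)) - 1
        lines.append(f"{low:>6} -> {high:<6} : {hist[bucket]}")
    return lines


# Made-up latency samples in microseconds (assumption, for illustration only).
samples = [3, 5, 6, 17, 40, 900, 1000, 1023]
histogram = aggregate(samples)
print("\n".join(render(histogram)))
```

Eight measurements collapse into five buckets here; with millions of events the buckets stay the same size, which is why aggregating before leaving the kernel is attractive.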
But first, let's go over how BPF works and, at a high level, how we manage and load programs in the kernel. So we have user space and kernel space. First we're going to write some BPF program in C. Then we can use the BPF backend in LLVM to compile it to bytecode in user space. We're going to load this bytecode in the kernel. For that we have to use the bpf syscall. At this point it's going to be verified by the kernel, and then it's going to be JIT-compiled if you've enabled JIT compiling in the kernel for BPF. Then you can load some data structures in the kernel to be used by the BPF program. So for instance here I have some representation of a hash map. There are a number of other data structures you can use in BPF programs. They're stored outside the BPF VM. So between different calls of the BPF VM you still have this data that persists across calls. Then once all of this is in place you have to attach this BPF program to some hook. So you could attach it to a kprobe or to a driver or to the traffic classifier and a number of others. So in BCC we use the Clang rewriter, which is a part of Clang that is able to do source-to-source transformations. So we use it to transform the C program at the beginning, to try and provide some syntactic sugar to users and to abstract some of the complexities of BPF away from the user. So we transform it into another C program that is then going to be compiled to BPF bytecode. So in BCC the Clang rewriter is used for a number of different things. So for instance it's used to parse the map definitions in our programs. I'm going to give an example of that later. These map definitions are parsed from the C program, and from that we create the syscalls to create the maps in the kernel. 
We also use the Clang rewriter to parse function names, because the function names actually have different patterns in BCC, and you can just write something like kprobe, underscore, underscore, and the name of the function you want to trace. So we're going to parse this and then use it to attach the program to the appropriate function. Then of course we rewrite function declarations to fit what the kernel is expecting. I'm going to talk a little about this later. We rewrite map accesses. Again I'm going to show one example of this later. And the last thing is we rewrite the dereferences of pointers to kernel memory. So I'm going to focus this talk on this last item of the list and explain first what I mean by this. So in the BPF VM, when you want to access memory from the kernel, which is something you're going to do in a lot of cases if you're doing tracing, if you want to read some data from the kernel to do your measurements, the verifier can't really verify this at load time. So it has to do it at runtime. For a simple reason: all of these accesses can have variable offsets. So it's very difficult to statically verify them in the kernel before executing the program. So what you have to do in this case is call an external function, so a function implemented outside the BPF VM, to do this kind of access to kernel memory. So for instance here I have one argument, prev. I'm going to put all external pointers, so pointers to kernel memory, in orange in the rest of the slides. So for instance here I have prev, which is an external pointer, so a pointer to kernel memory. And to be able to read the value, to dereference prev, I have to use bpf_probe_read, which is an external function. So one other thing on this slide that you have to notice is that BPF programs only take a single argument. That single argument is called the context argument. So it's the first argument of my function. 
And then the BCC library actually rewrites this in order to retrieve prev from the context argument. So the reason the context argument is not considered an external pointer, a pointer to kernel memory, is that it's already statically verified by the kernel, because the kernel knows its length and knows what it contains. But it doesn't know this for all of its members. So for the prev member, for instance, you have to do it at runtime. Okay, so this function is actually the one that does context switches in the kernel, so between the previous process and the next process. And here I'm retrieving the previous process. So with BCC what we offer is we try to make it so that the user can use dereferences as usual, without caring about external pointers and the fact that they are actually pointers to kernel memory. So what the user writes is, as usual, just prev, and then it's going to access the pid member. And in the background, what we're going to do is rewrite the source code in order to replace this with a call to bpf_probe_read. And what that means is that we have to track all external pointers at the source level, at the C level. And that's what we do with Clang. Okay, so before we continue, just a point about false positives and false negatives. So we don't aim to be perfect. As I will show in the remaining slides, it's very difficult in some cases to be perfect. In some cases, we don't even have all the information to know whether a pointer is a pointer to external memory, to memory outside the VM. So there are two cases where we may be wrong. So false positives means we are adding unnecessary calls to bpf_probe_read. So we're replacing some dereferences with calls to bpf_probe_read when we shouldn't be, because it's not needed. What that means is that we'll have some additional overhead from doing these calls to the external function instead of just doing the memory access in the VM. 
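To make the dereference rewrite described above concrete, here is a deliberately naive sketch in plain Python. It is not how BCC does it: BCC works on the Clang AST, whereas this toy uses a regular expression, and the function name `rewrite_derefs` and the exact shape of the emitted expression are assumptions for illustration only. It shows the effect the user sees: a member access on an external pointer becomes a call to bpf_probe_read, and accesses on other pointers are left alone.

```python
import re


def rewrite_derefs(source: str, external: set) -> str:
    """Replace `ptr->member` with a bpf_probe_read expression when ptr is external.

    Toy, text-based stand-in for an AST-based rewrite (hypothetical helper).
    """
    pattern = re.compile(r"\b(\w+)->(\w+)\b")

    def replace(match):
        base, member = match.group(1), match.group(2)
        if base not in external:
            return match.group(0)  # ordinary pointer: keep the access as written
        access = base + "->" + member
        # Illustrative output shape only; the real rewritten code may differ.
        return (
            "({ typeof(" + access + ") _val = 0; "
            "bpf_probe_read(&_val, sizeof(_val), &" + access + "); "
            "_val; })"
        )

    return pattern.sub(replace, source)


source = "u32 pid = prev->pid; int x = local->field;"
rewritten = rewrite_derefs(source, external={"prev"})
print(rewritten)
```

Marking `local` as external by mistake would still produce a working program, just with an unnecessary helper call, which is the false-positive case; missing `prev` would leave a direct access that the verifier rejects.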
It can also lead to some syntax errors, because sometimes it gets a bit messy when you try to rewrite something twice or things like that. I'm going to mention this in the conclusion. And false negatives are the main issue here. If we miss some bpf_probe_read, so if we forget to replace a dereference with a call to bpf_probe_read, it means that the program is going to be compiled to bytecode. It's going to be loaded in the kernel. And then once in the kernel, the verifier is going to reject the program, saying that it's trying to access memory outside the VM. Then it's going to send an error to the user. And since this is detected in the kernel, it's not going to have a lot of context about why this error is there. And so it's not going to provide something very user-friendly. So the error message in this second case is really difficult for the user to understand. So we really want to try to avoid this issue. Okay, so first, where do external pointers come from? So I've said that they come from the context argument. So they are, for instance, members of the context structure, like prev in my previous example. But they can also come from external functions. So for instance, here I have an external function called bpf_get_current_task. So I'm still in the function that is doing the context switch between different processes. I have the previous one. I'm trying to retrieve the current one. And this type of function can also return external pointers, so pointers to kernel memory. The next thing I have to decide is how do I identify variables? And in particular, how do I identify external pointers? Or how do I compare them? So a very naive way to do it would be to use the variable names. But that wouldn't work if I have different variables with the same name in different functions, for instance. So it really wouldn't work. So the way we do it is we use Clang's Decl, so the declaration of the variable, to identify it. 
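The difference between identifying variables by name and by declaration can be shown with a small Python toy. Everything here is hypothetical: the `Decl` class is a simplified stand-in keyed on (function, name), whereas Clang identifies a variable by its declaration node; the function names `trace_switch` and `helper` are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decl:
    """Simplified stand-in for a declaration: where a variable was declared."""
    function: str
    name: str


# `task` is an external pointer in trace_switch; an unrelated local variable
# in helper happens to share the same name (made-up example).
external_by_name = {"task"}
external_by_decl = {Decl("trace_switch", "task")}


def is_external_by_name(function: str, name: str) -> bool:
    """Naive lookup: ignores which function the variable lives in."""
    return name in external_by_name


def is_external_by_decl(function: str, name: str) -> bool:
    """Declaration-based lookup: the same name in another function is distinct."""
    return Decl(function, name) in external_by_decl


print(is_external_by_name("helper", "task"))        # name clash is reported as external
print(is_external_by_decl("helper", "task"))        # declaration keeps them apart
print(is_external_by_decl("trace_switch", "task"))
```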
So in this case, the task variable is identified by its declaration here in the function prototype. OK. So once we have this initial information, we need to traverse the AST to try and track all external pointers through the code. And to do so, we're going to follow all function calls. And as we go, we're going to update the state of external pointers. So for instance, if we have an assignment of an external pointer to some other pointer, obviously we have to add this new pointer to the set of external pointers. Then once we've done this first traversal to detect all external pointers, we're going to do a second traversal, which is much easier, just to replace all of the dereferences of the external pointers we detected. OK. So for instance, here I'm going to parse a function declaration. So I have different arguments. When I hit the function declaration, I'm going to go look at the second argument. If it's a pointer, I'm going to add it to the set of external pointers. Then I'm going to look at the body of the function, so the different statements. When I hit, for instance, a binary operator, if it's an assignment, I'm going to check if the right-hand side is an external pointer or returns an external pointer. And if it does, I'm going to say the left-hand side is also an external pointer. So I'm going to do this throughout the code. I also have to do it for arguments of functions and for the values returned by functions. OK. So it's not that easy. We also have to track the number of indirections of pointers. So for instance, in my code, I could have a pointer to an external pointer. That's the case here with the variable ptr. So ptr is a pointer to the external pointer sk. When I'm doing the first dereference of ptr, I don't want to replace it with a call to bpf_probe_read, because that value is on the stack. But I want to replace the second one, the member access on sk, the address. 
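The two traversals described above (first collect external pointers, then pick the dereferences to rewrite) can be sketched on a toy program representation. This is plain Python over tuples, not the Clang AST; the statement encoding, the `EXTERNAL_RETURNING` set and the function names are assumptions made for the sketch, and it ignores indirection levels and calls between functions.

```python
# Helpers assumed to return pointers to kernel memory (from the talk's example).
EXTERNAL_RETURNING = {"bpf_get_current_task"}


def is_external(expr, external):
    """An expression is external if it is a known external variable
    or a call to a helper that returns an external pointer."""
    if isinstance(expr, tuple) and expr[0] == "call":
        return expr[1] in EXTERNAL_RETURNING
    return expr in external


def track_external(pointer_params, statements):
    """First traversal: seed with the arguments after the context,
    then propagate through assignments in program order."""
    external = set(pointer_params[1:])  # index 0 is the context argument
    for kind, lhs, rhs in statements:
        if kind == "assign" and is_external(rhs, external):
            external.add(lhs)
    return external


def collect_rewrites(statements, external):
    """Second traversal: list the dereferences that need bpf_probe_read."""
    return [(base, member) for kind, base, member in statements
            if kind == "deref" and base in external]


params = ["ctx", "prev"]
body = [
    ("assign", "p", "prev"),
    ("assign", "cur", ("call", "bpf_get_current_task")),
    ("assign", "now", ("call", "bpf_ktime_get_ns")),
    ("deref", "p", "pid"),
    ("deref", "cur", "pid"),
    ("deref", "local", "x"),
]
external = track_external(params, body)
print(sorted(external))
print(collect_rewrites(body, external))
```

A single forward pass like this only works when definitions come before uses; following function calls and handling address-of and dereference operators is where the real implementation gets complicated.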
So we don't want to rewrite all dereferences of external pointers. We want to track when we should do the rewrite and at which point. So that means that we need to track all of the different address-of and dereference operators in the code, which again adds to the complexity. The next case is we have to track external pointers through maps. So at the beginning, I mentioned that BPF programs can use persistent data structures outside of the VM. So someone may very well store an external pointer in one of these data structures and then retrieve it in some other function later in the code. So in this example, at the top of the code, I have a declaration with BPF_HASH. It's a kind of map in BCC. It's actually going to be rewritten into something else, and from this code BCC is going to make the syscall to create the map. And then I have some different accesses to the map. So for instance, currsock.update. This again is going to be rewritten by the BCC library. I'm going to store an external pointer in currsock, and then I'm going to retrieve it in the trace exit function later on. At this point, I'm going to dereference it to retrieve some value from kernel memory. So I have to rewrite this last dereference. This is a very common example, because when you're tracing the entry and the return of a function, you do have all of the arguments of the function you're trying to trace at the entry, but you don't have any of them at the return point. So they're not in the registers. So when you want to read some of these arguments at the return point, you have to store them somewhere and then be able to retrieve them when you hit the return point of the function. So that's exactly what we're doing here. We're storing the sk argument and then retrieving it to perform some reads on it. 
So the way we handle this is we do several traversals of the AST. So the first traversal is the same as usual. We just try to track all of the external pointers through the code. Then the second one is going to track all of the maps that contain external pointers. So at this point, we know where the external pointers are in the code, so we can just look for calls to currsock.update, for instance, and say, well, if this argument is an external pointer, then I know that this map contains an external pointer. And then in the last traversal, we're doing the same as the first, except that instead of taking the context argument as the source, we're going to take the maps as sources. So for instance, we're going to look for all of the different lookups on maps. And we know that in this case, we have external pointers in these maps. And then again, we're going to propagate this through the code and do the rewrite in the last traversal. So to conclude, I think this is a bit of a complex approach, because we're trying to do it at the source level in C. There are a number of things that are more complicated than we would hope. So for instance, we have to do several traversals of the AST. The implementation, as I said, is more complex. We struggled a lot with rewrites of the code. So for instance, all of the different calls to the Clang Rewriter's ReplaceText. So we are trying to replace text as we go, and in some cases it's very difficult to manage the offsets of where we should replace. And in some cases, we try to replace text that has already been replaced, and it messes things up. And despite all of this complexity, it's still not complete, and it probably never will be complete, because in some cases we don't have all the information we need to identify all of the external pointers. 
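The three traversals for maps described above can be sketched in the same toy style. This is plain Python over tuples, not BCC's implementation: the statement encoding and the `propagate` and `analyze` names are hypothetical, the member name `dport` is made up, and a map lookup is simplified to yield the stored pointer directly, whereas a real lookup returns a pointer to the stored value (one more level of indirection).

```python
def propagate(body, seeds, external_maps=frozenset()):
    """Forward pass: grow the set of external pointers through
    assignments and through lookups on maps known to hold them."""
    external = set(seeds)
    for stmt in body:
        if stmt[0] == "assign" and stmt[2] in external:
            external.add(stmt[1])
        elif stmt[0] == "map_lookup" and stmt[2] in external_maps:
            external.add(stmt[1])
    return external


def analyze(functions):
    # Traversal 1: seed from every parameter after the context argument.
    first = {name: propagate(f["body"], f["params"][1:])
             for name, f in functions.items()}
    # Traversal 2: a map is marked when an external pointer is stored in it.
    external_maps = {
        stmt[1]
        for name, f in functions.items()
        for stmt in f["body"]
        if stmt[0] == "map_update" and stmt[2] in first[name]
    }
    # Traversal 3: same propagation, with marked maps as extra sources.
    final = {name: propagate(f["body"], first[name], external_maps)
             for name, f in functions.items()}
    # Last step: list the dereferences that would be rewritten.
    return {
        name: [(s[1], s[2]) for s in f["body"]
               if s[0] == "deref" and s[1] in final[name]]
        for name, f in functions.items()
    }


program = {
    "trace_entry": {"params": ["ctx", "sk"],
                    "body": [("map_update", "currsock", "sk")]},
    "trace_exit": {"params": ["ctx"],
                   "body": [("map_lookup", "skp", "currsock"),
                            ("deref", "skp", "dport")]},
}
result = analyze(program)
print(result)
```

With only the first traversal, `trace_exit` would report nothing to rewrite, because it has no external pointer among its own arguments; the pointer only becomes visible once the map is known to carry one.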
So for instance, if I have two different programs in different files, one of them might be updating a value in a map, putting an external pointer in a map, and in the second program I might be reading from that map. So in this case, in the second program, I have no way to identify it as an external pointer. Now the question is, are there better approaches to handle this issue? So we could do it at the bytecode level, for instance, but in this case we would have to write a parser for the BPF bytecode. Maybe that's the approach to take. We could take a very extreme option where we rewrite all of the dereferences of data structures from the kernel headers. So that's a bit of an extreme approach, because we would rewrite a lot of other pointers where it's not needed. We could ask the developers to label the different variables to say whether they are actually external pointers. But if we start doing this, why wouldn't we just tell them to use bpf_probe_read and not try to do this job for them? And there may be other solutions, if you have any. Thanks for listening. The code is on GitHub. Everything I've talked about is actually quite contained; it's kind of a big file, but it's mostly a single file. So it's on GitHub in the BCC project, and I think I have time for questions. If you go back a slide to the possible solutions. Sorry. It was one of the alternative solutions. It's one of the second-to-last slides maybe. It was the second-to-last slide. You're at the end? Oh, okay. Sorry. The last one before the solution would be complete. So really the second to last. This one. Oh, sorry. So I was curious about, security kind of requires labeling of pointers. So the kernel will use certain macro defines that expand to attribute section or attribute address_space. For BPF you mean? Just separate from BPF. The kernel already does this as a way of separating these things. 
So I'm curious about labeling external pointers just in the parameters. That should still give you enough information to be able to rewrite, to insert those probe calls. Yeah, it should. The user still has to be explicit about what's there or not. But then that still gives you enough information so that they don't have to spell out bpf_probe_read every time. The issue is we're trying to abstract all of this away from the user, so they don't have to care about whether this is an external pointer. So if we try to do this, probably we're going to succeed. But the thing is we're going to ask the user to care about this. That's what we're trying to avoid. Well, I'm just curious, if you were to take an approach where you have an always-explicit kind of thing, labeling the pointers explicitly, and then you get the machinery working well for always inserting those probe reads or probe writes wherever necessary. So I'm just going to say that when you do the rewriting, you know that every frame after the first, then you can add that to it. Yeah, clearly. The pointers at the IR level, for example from alloca instructions or load instructions. Did you use the IR level? No, we didn't try this, but the main thing is... So the question was, are we trying to identify these external pointers at the IR level? We haven't tried it. The main thing is it would add maybe a bit of complexity on the Clang side of things this time, so I'm not sure whether that would be acceptable upstream. If you try to do it in BCC, maybe it's a better approach, but I'm not sure. I haven't looked into it, so I'd have to look into it to be sure. Yeah. How does the machinery guard against the memory disappearing from under you? Like if you store a pointer in a map... Yeah. ...and the thing it points to disappears or gets reallocated? Yeah, so this is why you have the bpf_probe_reads. Yeah, sorry. 
So the question is, how does the BPF machinery guard against memory disappearing from under you? So for instance, if you have a pointer in a map and then the pointer is not valid anymore because the memory has been freed. How do you guard against this? So an access to uninitialized memory. So the thing is, it's why we use the bpf_probe_reads at runtime. These bpf_probe_reads are going to check the memory to be sure that it's safe, that we're not trying to read something that's not there, for instance. I don't think that's something it checks. So this only works for specific structures? No, it works for everything, but the thing is these programs need root privileges, so you're not trying to guard against escapes from the VM, you're only trying to guard against crashes of the BPF VM or the kernel. So maybe that changes things a bit. If you don't need root privileges to run a program, so for instance I think programs attached to sockets don't need root privileges in some cases. In these cases you can't read memory outside the VM. So you have different restrictions based on whether you have root privileges or not. You said that you might need to modify LLVM IR if you were to represent all of this at the IR level. Couldn't you make this work with LLVM IR address spaces, and just tag the different address spaces and have a pass that propagates through that? Yeah. Yeah, that's a good question. Why not use another language rather than doing C-to-C transformations? It's a good question. It's a bit over my head, because I guess it's a bit of a legacy choice. So for instance, lots of people are used to writing C programs for the kernel, so that might be one of the reasons. But surely it's another approach you could use. Some projects have used that approach. So for instance, the C front end is not the only one; in the BCC library you also have a Lua front end. Not sure it changes a lot, but maybe. 
And you have other high-level languages specific to tracing that have been written to then be compiled to BPF.