Okay, hi everyone. I'm Paul Chaignon, and today I'm going to talk about a new option of strace, seccomp-bpf. If you've been in this room for the last two talks, you've already heard about seccomp-bpf, but in this talk I'm going to go into the details of how it works under the hood. As an overview, I'm first going to explain how strace uses ptrace to stop at syscalls. Then I'm going to explain how it uses seccomp-bpf to stop only at syscalls of interest, which is what the option does. And along the way we're going to see the two cBPF algorithms that are used to decide at which syscalls to stop.
So, strace's default behavior. At the top of the slide you've got the thread that you're trying to trace, the tracee, and strace. When you start tracing this tracee process, strace first does some initialization with regard to ptrace, and then when it's ready it starts the tracee with a ptrace command known as PTRACE_SYSCALL. What this command does is stop the tracee at each syscall entry and exit. So for instance, my tracee starts and does some processing in user space, and then when it gets into kernel mode with a syscall, ptrace stops the tracee with a syscall-entry-stop event and gives control to strace in user space. Once strace is done processing this syscall entry, it restarts the tracee, again with PTRACE_SYSCALL, to stop at the syscall exit this time. And it keeps doing this. So every time you stop you have two context switches, to and from strace in user space, and then of course two stops per syscall.
So what's the issue here? The issue comes to light if you think about the trace qualifier. The trace qualifier is a way to select which syscalls you'd like to see.
So for instance, if you only want to see the seccomp syscall, you're going to use -e trace=seccomp. You can do the same with -e seccomp, and then you've got some aliases, for instance %network for all of the network-related syscalls. However, when you do this, strace is still going to stop twice per syscall, at all syscalls. So even if you don't want to see the read syscall, for instance, it's still going to stop at all of the read syscalls. As I said, each stop involves two context switches, so it's very, very expensive. We've seen some examples with dd in previous talks, and that's probably one of the worst cases. But if you're trying to, for instance, compile the Linux kernel: on my old computer it took about 12 minutes. If I do the same under strace, even if I only want to see a single syscall, connect in this case, it takes 24 minutes. So double that.
So we need a way to tell the kernel at which syscalls we want to stop. And we need to do this in the kernel, because obviously it's the kernel that decides when to stop. If we do this in user space it's too late, we've already stopped.
From the name of the option you've probably guessed that we're going to use seccomp-bpf. Seccomp is a way to filter syscalls in the kernel. It's meant for sandboxing. One of the first users of seccomp-bpf in particular is the Chrome sandbox. Seccomp-bpf allows you to choose which syscalls you want to filter: which syscalls you want to allow, and what you want to do otherwise. As a side note, seccomp-bpf is the second user of BPF in the Linux kernel, after the socket filters but before all of the other eBPF stuff you've probably heard about. And it's cBPF in seccomp-bpf, not eBPF, so it's very much limited compared to what eBPF can do.
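Going back to the cost of the default behavior, here is a small Python model of it. It is only a counting sketch, not real ptrace code: the function name `default_ptrace_stops` and the workload are made up for illustration, and the only figures taken from the talk are the two stops per syscall and the two context switches per stop.

```python
def default_ptrace_stops(syscalls, traced):
    """Model the default loop: PTRACE_SYSCALL stops at every entry and exit."""
    events = []
    for name in syscalls:
        events.append((name, "entry-stop"))  # reported whether or not name is traced
        events.append((name, "exit-stop"))
    stops = len(events)
    return {
        "stops": stops,
        "context_switches": 2 * stops,  # to strace and back, for each stop
        "printed": sum(1 for name in syscalls if name in traced),
    }


# Hypothetical workload: lots of I/O, two connect calls we actually care about.
workload = ["read"] * 1000 + ["write"] * 1000 + ["connect"] * 2
result = default_ptrace_stops(workload, traced={"connect"})
print(result)
```

The point of the model is simply that the number of stops depends on the total number of syscalls, not on how many of them the trace qualifier selects.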
Okay, so one example: you want to allow a process to do the open and openat syscalls, but you want to kill it if it tries anything else. You're going to load this small cBPF program in the kernel. The third line is actually loading the syscall number, so the nr field of seccomp_data. Then you compare that with 257, which is openat, and then with 2, which is open. And that's only true on the x86-64 architecture. Once you've done that, you jump either to bad or to good. Bad means we're killing this thread, and good means we're simply allowing this syscall, so we just do the usual processing of the syscall in the kernel.
Now, say you want to do pretty much the same thing, but this time you want to allow processes to open only specific files. You're going to need help from user space, because you need to go and look at the file path, for instance, to make your decision. In order to do this you change the program slightly. This time, instead of returning RET_ALLOW to continue processing the syscall, you return RET_TRACE. In this case seccomp is going to call into ptrace to stop your traced process and give control to a ptracer in user space. In our case that might be strace.
Okay, so strace seccomp-bpf. The behavior changes a little. If we take the same diagram as before, we start in user space. Strace does some initialization, and when it's done it starts the tracee, this time with the PTRACE_CONT command. What this says is simply that the tracee is supposed to behave as usual. It's not going to stop at any syscalls. It should just process the syscalls and do whatever it does. So the tracee can do syscalls, can do some processing in user space. It can do syscalls.
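The open/openat filter described above can be sketched in Python as a list of pseudo-cBPF instructions plus a tiny interpreter. This is a simplified model and not the program from the slide: the tuple encoding, `build_filter`, and `run` are assumptions made for illustration, while the return values are the real constants from `<linux/seccomp.h>` and the syscall numbers are the x86-64 ones quoted in the talk. A real filter would normally also check the arch field of seccomp_data first, which is left out here.

```python
# Real seccomp return values from <linux/seccomp.h>.
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_TRACE = 0x7FF00000
SECCOMP_RET_ALLOW = 0x7FFF0000

# x86-64 syscall numbers, as quoted in the talk.
NR_OPEN, NR_OPENAT = 2, 257


def build_filter(on_match):
    """Pseudo-cBPF as (op, k, jt, jf); jumps are relative to the next instruction."""
    return [
        ("ld_nr", 0, 0, 0),                      # A = seccomp_data.nr
        ("jeq", NR_OPENAT, 2, 0),                # openat? jump to "good"
        ("jeq", NR_OPEN, 1, 0),                  # open?   jump to "good"
        ("ret", SECCOMP_RET_KILL_THREAD, 0, 0),  # bad: kill the thread
        ("ret", on_match, 0, 0),                 # good: allow, or hand over to a tracer
    ]


def run(prog, nr):
    """Tiny interpreter for the three pseudo-instructions used above."""
    pc, acc = 0, 0
    while True:
        op, k, jt, jf = prog[pc]
        if op == "ld_nr":
            acc, pc = nr, pc + 1
        elif op == "jeq":
            pc += 1 + (jt if acc == k else jf)
        else:  # ret
            return k


sandbox = build_filter(SECCOMP_RET_ALLOW)     # first variant: plain allow-list
delegating = build_filter(SECCOMP_RET_TRACE)  # second variant: ask user space
print(hex(run(sandbox, NR_OPENAT)), hex(run(sandbox, 0)), hex(run(delegating, NR_OPEN)))
```

The only difference between the two variants is the value returned on the "good" path, which mirrors the change from RET_ALLOW to RET_TRACE described above.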
If we're not interested in a syscall, we're still going to have the BPF program run to determine whether the syscall is of interest or not. If it's not of interest, the program simply allows the syscall and lets it go. So the tracee can do some processing like this, but once we get into the kernel with a syscall of interest, the cBPF program returns RET_TRACE, and in this case we have a seccomp stop. It's a different event from the previous stops we had, and it gives control to strace in user space, with a context switch. Once strace is done with the processing for this syscall entry, it restarts the process with PTRACE_SYSCALL. The reason we can't use PTRACE_CONT to get to the exit of the syscall is simply that seccomp-bpf does not run on syscall exits. Seccomp-bpf is meant for sandboxing, so you usually want to limit which syscall entries you can do, not which syscall exits you can do. And it keeps doing this: once we exit the syscall, strace restarts the tracee with PTRACE_CONT again, because we know that the seccomp-bpf program will stop us at the next entry to a syscall of interest.
There's one caveat to this, however. In Linux before 4.8, the seccomp stop happened before the syscall-entry stop. What that means is that we can do the same as before, so we can restart the tracee with PTRACE_CONT at first, but then once we reach the seccomp stop we have to restart it with PTRACE_SYSCALL to get to the entry, and then again to get to the exit. Because of that, in Linux before 4.8 we have two stops on the way into each syscall instead of one when the seccomp-bpf option is enabled.
Okay, so what about the cBPF programs? I talked about how we change the way we stop the process, but I haven't talked about the cBPF program itself. A first, naive way to do it would be a linear search through all of the different syscall numbers we're interested in.
So for instance here, if I'm interested in read, write, open, close, stat and fstat, I go over all of the different numbers, and if the syscall number, so the nr field of seccomp_data, matches one of these, I jump to trace and return the return code that we need.
So is this optimal? Obviously not. This is O(n). If we want to improve it a little, there's one obvious optimization: we can simply optimize a contiguous set of syscalls. So for instance, here I was going from 0 to 5, so I could simply check that my syscall number is between 0 and 5, and if that's the case I can just jump to the trace return. What we're trying to optimize here is the size of the program, because in cBPF, as opposed to eBPF, we're limited in how many instructions we can have in our BPF program. We're limited to, I think, 4k instructions, and therefore we have to ensure that our programs are as small as possible: first because they're going to be faster to execute in most cases, but mostly because we want to ensure that we can load the BPF program in the kernel.
Okay, is that the best we can do? In some cases it's still not. So what is the worst case for this? If we have some user who is trying to trace all odd-numbered syscalls, we are not going to be able to use this optimization, and we're going to have a lot of different instructions to compare the syscall numbers. What we can do instead, since in cBPF we have 32-bit bitwise operations, is encode the syscall numbers that we're interested in into 32-bit bit arrays. Then we go over all of these bit arrays and compare our syscall number with the appropriate offset in the bit array.
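Before going further with bit arrays, the linear search and its contiguous-range optimization described above can be sketched as follows. This is an illustrative Python model, not strace's generator: the tuple encoding and the helpers `merge_ranges`, `linear_program`, and `run` are hypothetical, and the sketch ignores the fact that real cBPF jump offsets are only 8 bits wide.

```python
ALLOW, TRACE = "allow", "trace"


def merge_ranges(numbers):
    """Collapse syscall numbers into sorted, inclusive (lo, hi) ranges."""
    ranges = []
    for n in sorted(set(numbers)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)
        else:
            ranges.append((n, n))
    return ranges


def linear_program(numbers):
    """Emit (op, k, jt, jf): one compare per lone number, two per range."""
    checks = []
    for lo, hi in merge_ranges(numbers):
        checks += [("jeq", lo)] if lo == hi else [("jge", lo), ("jgt", hi)]
    trace_pc = len(checks) + 2  # layout: ld, checks..., ret ALLOW, ret TRACE
    prog = [("ld_nr", 0, 0, 0)]
    for op, k in checks:
        to_trace = trace_pc - (len(prog) + 1)
        if op == "jeq":
            prog.append((op, k, to_trace, 0))
        elif op == "jge":
            prog.append((op, k, 0, 1))         # below the range: skip the upper-bound test
        else:  # jgt
            prog.append((op, k, 0, to_trace))  # not above hi: inside the range
    return prog + [("ret", ALLOW, 0, 0), ("ret", TRACE, 0, 0)]


def run(prog, nr):
    """Interpret the pseudo-instructions; jumps are relative to the next one."""
    pc, acc = 0, 0
    while True:
        op, k, jt, jf = prog[pc]
        if op == "ld_nr":
            acc, pc = nr, pc + 1
        elif op == "ret":
            return k
        else:
            taken = {"jeq": acc == k, "jge": acc >= k, "jgt": acc > k}[op]
            pc += 1 + (jt if taken else jf)


contiguous = linear_program(range(0, 6))   # read..fstat on x86-64: one range
odd_only = linear_program(range(1, 64, 2)) # worst case: nothing to merge
print(len(contiguous), len(odd_only))
```

In this model the six contiguous syscalls collapse to a single range check, while the odd-numbered set needs one comparison per syscall, which is the worst case the talk describes.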
So basically here, if I want to trace select and ioctl, I set the corresponding bits in the given bit array, which is the first bit array in this case because select and ioctl have small numbers. Then we go over all of the different bit arrays with our BPF program, select the appropriate offset once we reach the appropriate bit array, and check whether the bit is set to 1 or 0. The reason we can't jump directly to the bit array we're interested in is that in cBPF you do not have indirect jumps, so you have to implement your switch-case as an if-else, going over all of the different cases.
Okay, so we compared the two algorithms with different sets of filtered syscalls. The first ones are just ptrace and not ptrace, so basically everything except ptrace, and then we've got some cases with the aliases I mentioned earlier. The last one is just a combination of different aliases to get a larger number of syscalls. What we can see is that in most cases the linear algorithm, with the optimization I mentioned, generates much smaller programs than the binary match. In some cases, however, in particular when we have a large number of syscalls, the binary match gives better results. The reason for this is that with the binary match we have to do some pre-processing on the syscall number to get the appropriate offset, and then we have to encode all of the bit arrays. So these are more or less constant-size programs, but there is still a lot of processing to do even if you filter only a single syscall. So what we did in strace is generate both programs when strace starts, and then load whichever is smaller into the kernel, in order to get the best of both approaches.
Okay, some limitations of this option. The first limitation, which Dmitry already mentioned, is that seccomp-bpf implies -f.
So -f, or strace -f, means that you're going to trace all of the children of your traced process when they fork or clone. The reason for that is that in the kernel, children inherit the chain of seccomp filter programs from their parents, and the way this is done in the kernel is simply by giving the children a reference to the beginning of the chain. So each child in the kernel has a reference to the seccomp filter chain of its parent. However, say we have a chain of seccomp filters, for instance one, two, three, four, but we only want children to inherit one, two and four, because the third one is the strace BPF program and we don't want children to inherit it. To do this we would have to reconstruct the chain as one, two and four, skipping the third one, and we can't do this with references in the kernel. So currently there's no good way to do this except by making copies, but then there's a lot of overhead in copying the whole chain of seccomp filters.
Okay, the second limitation of this option is -p. If you want to trace an existing process, you cannot use the seccomp-bpf option today. The reason for that is very simple: there's currently no way to attach a seccomp-bpf program to an already running process in the Linux kernel. There is, however, a way, when you attach a program to one thread of a group of threads, to synchronize the seccomp-bpf programs across all threads in the group. So maybe there is some hackish way to do this, but yeah, not sure.
Okay, to conclude and sum up. First, we've seen that strace stops at all syscalls by default, and that's very, very expensive because of context switches. In addition, we've seen that the seccomp-bpf option, when you're using filters on your syscalls, allows you to stop only at syscalls of interest. And we've seen the two different seccomp-bpf algorithms that we're using in strace to implement this match over syscalls.
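To make the bit-array strategy recapped above more concrete, here is a minimal Python sketch of the idea. It models the lookup rather than emitting real cBPF: `build_bit_arrays`, `bitmap_match`, and the `MAX_NR` value are hypothetical names and numbers chosen for illustration, and the loop stands in for the if-else chain that the lack of indirect jumps forces on the cBPF program. The syscall numbers 16 (ioctl) and 23 (select) are the x86-64 ones.

```python
def build_bit_arrays(numbers, max_nr):
    """Encode the traced syscall numbers into 32-bit words, one bit per syscall."""
    words = [0] * (max_nr // 32 + 1)
    for n in numbers:
        words[n >> 5] |= 1 << (n & 31)
    return words


def bitmap_match(words, nr):
    """Walk the words one by one, as an if-else chain would, then test the bit."""
    index, mask = nr >> 5, 1 << (nr & 31)  # pre-processing on the syscall number
    for i, word in enumerate(words):
        if index == i:  # one comparison per bit array, no indirect jump
            return "trace" if word & mask else "allow"
    return "allow"  # syscall number beyond the encoded range


MAX_NR = 447  # hypothetical highest syscall number, for illustration only

words = build_bit_arrays([16, 23], MAX_NR)       # ioctl and select
odd = build_bit_arrays(range(1, 64, 2), MAX_NR)  # odd-numbered syscalls below 64

print(hex(words[0]), len(words), len(odd))
print(bitmap_match(words, 16), bitmap_match(words, 17))
```

Note that the number of words depends only on the highest syscall number, not on how many syscalls are traced, which matches the remark that these programs are more or less constant in size.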
There are also some things that could be improved in the current implementation that are pretty straightforward. The first one: on some architectures you've got system calls like socketcall and ipc, which basically allow you to do other syscalls. The first argument of socketcall, for instance, tells you which syscall to actually do, so for instance do a connect, or something like this. Currently this is not supported in the cBPF program, because you would have to match on the first argument of the syscall, on the number that is the first argument of socketcall, for instance. The second thing that could be done is the strace -c option, which currently allows you to print a summary of statistics on your syscalls. This is a perfect use case for eBPF instead of cBPF, because eBPF allows you to aggregate data in the kernel, and therefore it could allow you to aggregate statistics for this option and only send them to the strace process at the end. So instead of sending everything to the strace process and stopping all the time, you could send only a summary of these statistics.
Okay, so I've been a bit fast, so we have plenty of time for questions. I hope you have some, and thanks for listening.
Sorry? Yeah, I did. So the question is whether I ran the Linux compilation benchmark with the seccomp-bpf option. I did. I don't have the numbers here, but I might have them online. Okay, so here, if you can see, the one before last, so the second one, is the number with seccomp-bpf. So it adds a few seconds, but yeah, nothing much.
Yeah, so the question is: I've talked about cBPF and eBPF, and is that a limitation of what seccomp allows in the kernel? So seccomp only allows cBPF programs in the kernel. There's been some discussion on the mailing list about allowing eBPF, but I don't think it's going to get there. The answer was very clearly that this is
not something they want. The main reason for this, I think, is the unprivileged eBPF programs that this would require, and they don't want any more unprivileged eBPF programs.
So the question is: should we try to upstream some work in the kernel to allow us to detach seccomp-bpf programs from processes? I guess we can always try. I don't know if there's any... yeah, that's probably the main concern they're going to raise, the security aspect. But I don't see how that could be misused, because if you're actually trying to detach a program from your own process, I mean, you're asking for it. So maybe you need privileges in order to do that.
Personally I don't think this is a security issue, because when you attach a program you can explicitly say that this program should not be inherited, and this way there wouldn't be any security issue. That's why I'm asking this question.
Yeah, but then we should try this. This would allow us to support seccomp-bpf both with and without following forks, not just with it. There might be some issues with performance there, because of the way it's implemented in the kernel: it's reference counted, so if we want to remove one program from the chain it's going to be kind of difficult.
Yeah, any other questions? So the next talk is about eBPF this time, so if you want to listen to some eBPF, you have to stay here.