Hello everyone and welcome to this talk. I am Federico Di Pierro, a maintainer of Falco Security, working for Sysdig. In the last few months, I have been quite busy improving the Falco BPF probe. Today, here with me, there is Andrea.

Hi all, nice to meet you. I'm Andrea Terzolo, a research fellow at Politecnico di Torino, and right now I'm working full-time on BPF technology. Fede, would you like to show where we are?

Yeah, sure. Let's go through our agenda. First of all, we will explain the idea behind the Falco project and the reason why we can call it a peculiar BPF use case. Then, we will have a look at how Falco uses the BPF technology under the hood, analyzing the most relevant aspects. Finally, we will focus on the main issues affecting the current instrumentation. More precisely, for each pitfall that we point out, we will share possible mitigations and some future enhancements.

So, if we are ready to start, let's see what the Falco project consists of. As you can see from this funny animation, Falco is an open source project that basically sniffs all your system, searching for malicious actions. To perform this inspection, it uses two main methods: a BPF probe and a kernel module.

At first, let's try to understand the benefits that the BPF technology brings to the table with respect to the kernel module. As you can see, there is no risk of kernel crashes due to a wrong instrumentation, because the BPF verifier asserts that the bytecode is safe before injecting it into the kernel. Moreover, the kernel becomes really easy to instrument, adding new features on the fly: we just have to inject a BPF probe, without caring about the kernel's internal state. Last but not least, we can solve all the painful portability problems, which we will also touch on in our discussion, with modern BPF features like the compile once, run everywhere approach, the so-called CO-RE approach.

Okay, but why is Falco a peculiar use case? Basically, because it wants to exploit all these advantages even in unconventional scenarios. On one side, it needs to capture almost all the system events, without degrading performance. This means capturing syscalls, context switches, page faults, and so on, generating up to billions of events per second to analyze. On the other hand, Falco requires high portability. The support matrix is pretty huge, going from kernel 4.14 and clang 5.0 onwards. Therefore, the BPF probe must be successfully compiled for all the elements of this matrix, taking into account the problems related to clang optimizations and BPF verifier requirements. These are the two main reasons why we consider Falco a peculiar BPF use case. We will investigate these two topics in more detail, but before doing that, we must dive into our BPF instrumentation and how it works. Andrea, over to you.

Thank you, Fede. I will try to explain how our BPF probe works without messing around with too many details. Let's start our journey. As previously said, Falco catches all the most important system events, but to do that, we need to instrument some kernel hooks. A kernel hook is a particular point in the kernel code where we can attach our BPF programs. In particular, to deeply investigate our system, we need different hooks. Let's see them one by one. As a first thing, we trace the entire flow of a syscall, starting with the sys_enter hook and terminating with the sys_exit one. After them, we have the page_fault_user and page_fault_kernel hooks, whose job is quite intuitive: they catch all page fault events.
In the row below, we have the sched_switch hook, which catches context switches. Next to it, there is the sched_process_exit hook, which collects all terminated processes. And last but not least, we capture all the system signals with the signal_deliver hook. These are the seven hooks involved in our instrumentation.

As a first thing, we will focus our attention on the syscall hooks, because their behavior is a little bit different from the others. More precisely, we will consider the sys_enter hook, but as you might expect, the exit one follows exactly the same behavior. To better understand the real logic, let's imagine that an open syscall is executed on the system. The BPF program attached to the sys_enter hook is called, and we usually refer to it as the dispatcher. You might ask the reason behind this name, but I think it is quite intuitive: in a nutshell, this program has to understand which syscall has been called, and which is the corresponding BPF program that we have to trigger in response. Yeah, you understood correctly: we have a specific BPF program for every syscall. This specific program has the task of collecting some security-relevant data regarding the syscall.

But before looking at a concrete example, we have to clarify how to perform this jump from the dispatcher to the specific program. We can surely use the tail call mechanism, but what is a tail call in the BPF context? A tail call can be seen as a mechanism that allows a BPF program to call another one, without returning back to the old program. Moreover, this call has a minimal overhead, because it is implemented as a long jump using the same stack frame.

So, keeping in mind all this information, let's see a complete flow. When the open arrives, the dispatcher recognizes the syscall ID, in this case 2, and it performs a tail call to the right BPF program. The specific program collects some information about the syscall, for example the pathname, mode, and flags of the file being opened. After collecting all the desired parameters, we are finally able to push data to user space. A minimal sketch of this dispatcher logic is shown a bit further below.

Now, if we consider a hook different from the syscall ones, for example page_fault_user, we can easily understand that the flow is simpler than before. We don't need the dispatching phase anymore, and we only have to manage the specific capture logic. Here, for example, we collect some data regarding the page fault, like the address and the error code, and we send them to user space. As you might expect, all the other hooks that are not syscall-related have exactly the same behavior.

To finally recap, we can see in this slide all our programs. More precisely, we can notice that the sys_enter and sys_exit hooks generate way more events than the others, because they are triggered every time a new syscall is executed. This ends the instrumentation part, but from the previous slides it should be clear that the number of events forwarded to user space can grow dramatically, and this could become a serious problem for Falco. Fede, would you like to provide us a little bit of context about it?

Yeah, sure. As you said, there are some situations where this excessive instrumentation can generate troubles, and we have to consider alternative solutions. The first two problems highlighted in this slide, the consumer issue and the instrumentation issue, are mainly related to our instrumentation, while the last one refers to the concept of high portability that we introduced at the beginning of this presentation.
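To make the dispatcher and tail call flow described above concrete, here is a minimal sketch of a sys_enter dispatcher built around bpf_tail_call. It is purely illustrative: the map name syscall_progs, its size, and the raw tracepoint attachment are assumptions for the example, not the actual Falco probe code.

```c
// Minimal illustrative sketch of a sys_enter dispatcher using tail
// calls. Map name, size, and attachment point are assumptions, not
// the actual Falco probe code.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 512);        /* one slot per syscall ID */
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} syscall_progs SEC(".maps");

SEC("raw_tracepoint/sys_enter")
int dispatcher(struct bpf_raw_tracepoint_args *ctx)
{
    __u32 id = ctx->args[1];         /* syscall ID, e.g. 2 for open */

    /* Long jump into the syscall-specific program, reusing the same
     * stack frame; on success this never returns to the dispatcher. */
    bpf_tail_call(ctx, &syscall_progs, id);

    /* Fallthrough: no specific program registered for this ID. */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

User space populates the program array once at startup, registering one syscall-specific program per slot; from then on the dispatcher jumps straight to the right handler with near-zero call overhead.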
But let's proceed in order and start with the first problem. We call it the consumer issue. This term means that the user space is unable to fully consume all the events generated by our BPF instrumentation. Right now, we have identified two main reasons behind this phenomenon.

The first and most intuitive one is that on big environments with lots of CPUs, the number of produced events is huge, up to tens of millions per second. Of course, our single-threaded user space consumer is unable to analyze them all. The user space process is single-threaded because we have to guarantee the chronological order of events to maintain a consistent system state. Moreover, a multi-threaded consumer would force us to adopt synchronization mechanisms, further producing syscalls that, in the end, would put additional pressure on the kernel. So this is the first reason: millions of events against a single-threaded user space.

The second one is less obvious. It depends on the CPU time granted to user space, and obviously on the overall system load. Given that this second point is harder to grasp, we ran a concrete stress test. As a test environment, we used a local machine with 12 cores, while as an event generator, we used the stress-ng tool. stress-ng is a command-line tool that produces an enormous quantity of syscalls, filling all the CPU time and degrading system performance. We let it run for 10 seconds and we obtained these results. The total number of events produced by the kernel is 34 million. Of these, only 4 million have been successfully captured by user space; all the others have been dropped. Basically, the whole capture was lost. This means that user space acquired the CPU for a very short time, consuming only a few million events. Of course, this is an extreme case tailored to our needs, but in the end, it fully highlights the issue.

Now, you could ask us, how could this kind of problem be mitigated? The answer is pretty simple indeed. Given that all we want is to provide more time to user space, let's reduce its work by producing fewer events. But the question now is, how can we reduce the number of events forwarded to user space? Here is where the so-called simple consumer mode comes into play. Running Falco in simple consumer mode means turning on a kernel-side capture pattern that denies events that are not so interesting for the system state from reaching user space. This filtering logic has two main benefits. On one end, fewer events are produced. On the other end, the instrumentation time is reduced, improving the overall system performance.

Now, let's check what really happens with simple consumer mode enabled. Considering again the same sys_enter hook, let's imagine that a syscall like getppid is executed. As usual, the dispatcher program starts the execution, but this time the flow gets stopped, since the simple consumer logic detects that the syscall is not interesting for our system state. The event is completely rejected and not sent to user space; a sketch of this filtering step is shown below.

As a last thing, we can observe the results we obtained running our stress test with this new filtering pattern. The simple consumer is as effective as we expected, cutting down the number of drops from 90% to 60%. In this specific test, the capture is still partial. However, applying the simple consumer to real cases from Falco release 0.31, we have seen notable improvements. The future is even brighter: in fact, we are thinking to further reduce the number of syscalls traced in simple consumer mode.
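As a rough idea of what this kernel-side filtering looks like, the dispatcher sketched earlier could consult a map of interesting syscall IDs before performing the tail call. Again, the map name interesting_syscalls and its layout are illustrative assumptions, not Falco's actual implementation.

```c
// Illustrative sketch of a "simple consumer"-style filter in the
// dispatcher: uninteresting syscalls are dropped before the tail
// call, so no event is ever pushed to user space for them.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 512);
    __type(key, __u32);
    __type(value, __u8);             /* 1 = interesting, 0 = drop */
} interesting_syscalls SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 512);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} syscall_progs SEC(".maps");

SEC("raw_tracepoint/sys_enter")
int dispatcher(struct bpf_raw_tracepoint_args *ctx)
{
    __u32 id = ctx->args[1];
    __u8 *flag = bpf_map_lookup_elem(&interesting_syscalls, &id);

    /* Syscalls like getppid are rejected here: no specific program
     * runs and nothing reaches the single-threaded consumer. */
    if (!flag || *flag == 0)
        return 0;

    bpf_tail_call(ctx, &syscall_progs, id);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```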
As you can see from this slide, we started with a pool of 332 syscalls. After our first analysis, we chose to keep only half of them, and this is the actual state of the simple consumer. In the last weeks, we have been thinking about further improving this logic, obtaining a final set of 84 syscalls, with what we could call an even more aggressive simple consumer. Anyway, you have to consider that the analysis behind these choices is quite complex, since every event can have unexpected impacts on the system. Therefore, it is always good to move carefully along these thorny paths. This ends the discussion on the consumer issue. Andrea, would you like to drive us to the next pitfall?

Yeah, sure. Let's talk about the instrumentation issue. As a first thing, I would like to thank Giorgio Sarton, a member of the Falco community, for the amazing research he carried out on this topic. The data collected by Giorgio raised a series of reflections that we will summarize in these slides. You can find both Giorgio's research and the discussion on it in the presentation we have uploaded on sched.com; you just have to right-click there.

Now, let's go back to our topic. To understand what the instrumentation issue means, we have to consider that in some use cases we only want to capture specific syscalls, and not all of them as seen before. But what is the cost of capturing just a bunch of syscalls with our instrumentation? If you remember well, every time a new syscall is executed on the system, the dispatcher is triggered, and it tail calls the specific BPF program. This approach performs well if you want to trace all syscalls, but how would it behave if you only want to capture a subset of them? For example, if you're only interested in 20 syscalls, what is the performance? Unfortunately, the dispatcher would call all the BPF programs, even those related to syscalls that we are not interested in, increasing the kernel execution time and slowing down the entire system. This is what we usually call the instrumentation issue.

Giorgio, in his paper, proposed an instrumentation alternative to the actual state of the art. Since BPF offers the possibility to attach programs directly to syscall-specific hooks, why don't we exploit those instead of using the generic sys_enter hook? If we want to trace just 20 syscalls, we only have to instrument the right hooks: for example, sys_enter_write, sys_enter_read, sys_enter_fork, and all the others (a sketch of such a hook is shown below). Conceptually, the advantage seems clear, but let's see it in terms of performance.

As we said before, we assume we are only interested in 20 syscalls. To measure how much the kernel is burdened by one BPF instrumentation rather than another, we consider as a reference the execution time of one of the syscalls: the more the system is overloaded, the longer the syscall takes to execute. In this case, we chose write as the reference syscall. The first column represents the write execution time without any kind of BPF instrumentation, so on a vanilla kernel. This is the ideal time that we want to get close to with our instrumentation. The second column shows the cost of using our actual instrumentation, so with the sys_enter and sys_exit hooks. Finally, the last column represents the alternative instrumentation seen in the previous slide, with only the 20 hooks necessary for capturing the required syscalls.
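Before looking at the numbers, here is what attaching directly to a syscall-specific hook can look like in practice. This is a hypothetical minimal example, not Giorgio's code: the context struct is hand-written to mirror the kernel's sys_enter_write tracepoint format, and the handler just prints the arguments.

```c
// Sketch of the alternative instrumentation: one program attached
// directly to a syscall-specific tracepoint, so no dispatcher runs
// for syscalls we don't care about. The context struct mirrors
// /sys/kernel/debug/tracing/events/syscalls/sys_enter_write/format.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct sys_enter_write_ctx {
    unsigned long long unused;   /* common tracepoint header */
    long syscall_nr;
    unsigned long fd;
    const char *buf;
    unsigned long count;
};

SEC("tracepoint/syscalls/sys_enter_write")
int trace_write(struct sys_enter_write_ctx *ctx)
{
    /* Only write() events ever reach this program: the other ~300
     * syscalls cost us nothing, since no hook fires for them. */
    bpf_printk("write(fd=%lu, count=%lu)", ctx->fd, ctx->count);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```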
As expected, trying to capture only a bunch of syscalls with our current instrumentation introduces a significant kernel overhead compared to the alternative instrumentation. However, if we think about all we have said so far, we can notice that the simple consumer approach could help us mitigate this situation. Let's see how. The simple consumer interrupts the dispatcher execution flow, stopping the instrumentation before calling the specific BPF programs. This seems to be the solution we are looking for. However, measuring the write execution time, we can notice that we don't get a great improvement. But why does this happen? With the simple consumer, we no longer call the specific BPF programs, so where is the problem? If we consider the filtering step again, we can notice that it doesn't happen immediately at the beginning of the dispatcher. In fact, before reaching the filtering logic, it is necessary to read some data inside the BPF maps through the well-known BPF helpers. Although these calls are not particularly expensive, they are repeated for every syscall, even for those that will be filtered out by the simple consumer approach. This is why we still have this overhead. So, in this case, we can state that the simple consumer is not the great mitigation we were looking for.

Before jumping to the next topic, we can see what we are planning for the near future. We have to consider two main alternatives. On one side, we have the solution put on the plate by Giorgio's team, the one with the syscall-specific hooks. The disadvantage of this kind of solution is that we would have to maintain two different BPF instrumentations and enable the right one according to the Falco use case. It's worth noting that this alternative instrumentation brings advantages only if we are interested in tracing a small number of syscalls, while the main Falco purpose is to trace the entire system. So, in the general use case, there is no particular benefit from this approach.

The alternative solution, which keeps a single instrumentation for all use cases, could be to adopt the simple consumer approach in a slightly different way. Recently, we have been thinking of moving the filtering logic right to the beginning of the dispatcher. Hopefully, this will shrink the number of BPF helper calls as much as possible, making the instrumentation cost for all filtered syscalls almost zero. One possible way to move this logic is to use modern BPF features, like global variables and BTF-enabled tracing programs; a sketch of this idea follows below. BTF stands for BPF Type Format, a sort of debugging information for BPF. Without going into too much detail, we can just say that, on one side, global variables allow us to read and write data inside a map without using BPF helpers, while, on the other, modern tracing programs allow us to directly access kernel memory without having to issue the well-known BPF probe-read helpers. So, exploiting these new features, we can surely cut down the number of BPF helpers that we call in our probe, solving the aforementioned problem. Moreover, by reducing the number of helpers, we could also obtain a substantial improvement in the overall instrumentation cost, since almost all our programs widely use them. Anyway, just to summarize, we can say that even if we don't have a working project yet, we hope to get performance similar to the solution on the left, since instrumenting an almost empty BPF program seems to have a pretty low overhead. This ends the discussion on the instrumentation issue. Now it's your turn, Fede. That was quite long.
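Here is the sketch mentioned above: a minimal illustration of early filtering through a global variable checked as the very first thing in the dispatcher. This is only an idea of the direction described in the talk, not an existing Falco implementation; the array name and the tp_btf attachment are assumptions.

```c
// Illustrative sketch of early filtering with modern BPF features:
// a global variable, read with zero helper calls, checked before
// anything else runs. Not an existing Falco implementation.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Global variables live in the object's .bss/.data maps and are
 * accessed as plain memory; user space can flip them through the
 * libbpf skeleton, with no bpf_map_lookup_elem on the hot path. */
volatile bool interesting_syscall[512];

SEC("tp_btf/sys_enter")
int BPF_PROG(dispatcher, struct pt_regs *regs, long id)
{
    /* For uninteresting syscalls the whole program costs just this
     * bounds check and one memory read: no BPF helpers at all. */
    if (id < 0 || id >= 512 || !interesting_syscall[id])
        return 0;

    /* ...tail call into the syscall-specific program here... */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```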
The last point that we have to address is slightly different from the others, as it does not concern performance, but only portability. As we already stated, Falco requires high portability. The support matrix is pretty huge, going from kernel 4.14 onwards and from clang 5.0 onwards. So, the same BPF program must be successfully compiled for all the elements of this matrix. In addition to the obvious issue that the internal kernel structures can change from one version to another, we have a much more serious problem. Each clang version enables different optimizations on the code, generating different bytecode. Moreover, this same bytecode could be accepted by some versions of the kernel verifier and not by others.

Here you can see an example of an issue that we faced some time ago, and it is not alone: we have plenty of them in our repository. You can find a link to this issue in the presentation uploaded on sched.com; you just have to right-click on this image. The code on the left was correctly accepted by the kernel verifier and compiled with all clang versions, except when built with clang 11 for kernel version 5.8. This was pretty strange, so, inspecting the generated bytecode, we noticed that the clang optimizations introduced the possibility of a back edge in the code. Back edges, of course, are not allowed by the BPF verifier. Therefore, even if this back edge was unreachable, the kernel rejected our probe. The fix looks pretty weird at first glance: by removing the offending construct, clang no longer introduces the unexpected back edge. Of course, once you solve such an issue, you must double-check on a lot of matrix cells that no new issues appeared because of the fix. In the end, maintaining such code can become a great mess, given all the hidden factors out of our control. Checking the whole support matrix every time a change is committed is time-costly and slows down the pace of development.

Unfortunately, we currently have no real mitigations for these problems, but the good thing is that we are already working on future improvements. In the last few years, a new BPF concept was born: the compile once, run everywhere approach. As its name suggests, the idea is to have a single BPF probe that is portable among different kernels. This means that the bytecode is actually relocated based on the kernel where the probe is being injected (see the sketch below). To achieve this, we need one key requirement above the others: using the libbpf library to load our probe. All the logic to perform the relocations between our code and the actual kernel version is implemented inside this library. Even if this requirement might seem easy to achieve, today Falco does not use libbpf as a loader. Instead, it directly uses the bpf syscall to perform the various actions. Remember that Falco was born before CO-RE was even a thing; therefore, its code architecture follows low-level patterns that are not easily portable to the libbpf logic without rewriting all the existing code. Anyway, the CO-RE approach will allow us to compile our probe a single time with just one kernel version, and this is exactly what we are looking for: we can forget about all the optimization issues related to the different kernel versions. Basically, we will remove the kernel axis from our support matrix. Yeah, with this finally good news, we have ended our last topic. I leave it to you, Andrea, for a final recap. Thank you, Fede.
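To give a feel for what CO-RE relocation means in code, here is a minimal sketch of a CO-RE field read. It is purely an example under the assumption of a libbpf-loaded, BTF-enabled probe, not Falco probe code.

```c
// Illustrative sketch of a CO-RE read. libbpf fixes up the field
// offsets at load time against the running kernel's BTF, so one
// compiled object can run on different kernel versions.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

SEC("tp_btf/sched_process_exit")
int BPF_PROG(proc_exit, struct task_struct *task)
{
    /* BPF_CORE_READ records a relocation for the field access: if a
     * kernel version reshuffles task_struct, the offset is patched
     * when the probe is loaded, not when it is compiled. */
    pid_t tgid = BPF_CORE_READ(task, tgid);

    bpf_printk("process %d exited", tgid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```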
Just to conclude, we can say that the main idea behind this talk was to provide you with an overview of how Falco uses the BPF technology to trace the entire system, and of the issues that we often face in production due to the huge quantity of data that it generates. We hope that you were able to easily follow us and that you understood all the different problems and the possible mitigations that we can put in place. If you have any doubt or any curiosity, please feel free to reach us on the Slack channel. Hoping to see you soon. Bye, and thank you for your attention.