DEF CON 30. Who's having a good time? Yeah! Whoa, you guys, awesome, awesome. That's great. So, this talk is by Rex and Junyuan. Let's give them a big DEF CON 30 welcome, come on. Yeah!

Okay, thank you everyone for coming to our talk about bypassing system call tracing. My name is Rex Guo and this is my co-speaker Junyuan Zhang. We have a lot to cover in this talk, so I will just jump over this slide — you can look us up online. Okay, so imagine a sophisticated attacker compromises your Linux production environment. He launches a Log4Shell exploit and then fires a reverse shell back to his machine. Then he discovers that the machine is running a vulnerable version of sudo, so he escalates privileges on the box. Then he reads the /etc/shadow file to see if there are any interesting hashes to crack. He also discovers that he can do SSH hijacking to move to a second machine by reading a process environment variable. So he moves to the second machine. And as he is celebrating, he discovers that his access is gone. He fires the reverse shell again — no luck. Quickly, he discovers his access has been completely blocked. Now, let's look at the other side of the story. While all of this is happening, your security engineer receives a sequence of alerts from cutting-edge security monitoring software. This software is able to monitor the system calls and process information of the application. So for example, when the attacker executes a reverse shell, there will be a connect system call, and maybe others depending on the reverse shell technique. When the attacker reads the /etc/shadow file, there will be an open or openat system call. Essentially, any non-trivial action the application performs will involve a system call. So, how do you use system call information to detect threats? Here we show a really, really simple example. Let me explain what this is.
Here's a rule that tries to detect an untrusted program reading the /etc/shadow file. Let me explain the rule. It says: if the system call is open or openat, and it has read permission, and the file name points to /etc/shadow, and the program is not in the allowlist of programs, then fire an alert. Now, you can probably see that you can build much more complex rules, and even machine learning models, on top of this data set — with one system call or even multiple system calls. But they all rely on the fact that the system call monitoring software is able to extract the data correctly. So, in this talk, we are going to dive very deep into these system call tracing technologies. Then we'll talk about the vulnerabilities we discovered that allow us to bypass the tracing. And we'll conclude with mitigations and takeaways. With that, I will hand over to Junyuan.

Thank you, Rex. So, as Rex mentioned, system call tracing is very important for detecting these threats. This diagram gives you an overview of system call tracing. Basically, system call tracing includes two parts: the hooks for system call interception, and the tracing programs. When the application issues a system call into the kernel, the system call code path is executed. If there are any hooks placed inside that code path, the attached tracing program will be triggered to collect system call data and send that information to the monitoring agent to detect threats. The tracing program can be implemented in kernel space, as shown in the left diagram, or implemented as a user space program as part of the monitoring agent, which is shown on the right side. The program used to collect system call data, like system call arguments, is called the tracing program. It can be attached to different hooks, like tracepoints, kprobes, or ptrace.
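The /etc/shadow rule described above can be sketched as a simple predicate. This is a rough illustration in Python of the logic such a rule encodes; the function name, field names, and allowlist contents are our own, not Falco's actual rule syntax:

```python
# Minimal sketch of the /etc/shadow detection rule described above.
# All names here are illustrative, not Falco's real rule language.

ALLOWED_PROGRAMS = {"sshd", "login", "unix_chkpwd"}  # hypothetical allowlist

def should_alert(syscall, flags, filename, program):
    """Alert when an untrusted program opens /etc/shadow for reading."""
    reads = bool(flags & {"O_RDONLY", "O_RDWR"})  # has read permission
    return (
        syscall in {"open", "openat"}
        and reads
        and filename == "/etc/shadow"
        and program not in ALLOWED_PROGRAMS
    )

print(should_alert("openat", {"O_RDONLY"}, "/etc/shadow", "netcat"))  # True
print(should_alert("openat", {"O_RDONLY"}, "/etc/shadow", "sshd"))    # False
```

The key point the talk makes next: every condition in this predicate is evaluated against argument data the tracer extracted from the syscall, so the rule is only as trustworthy as that extraction.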
So, we can directly leverage the Linux native mechanisms as tracing programs, or implement our own tracing programs as kernel modules, eBPF programs, or user space programs. The first kind of hook for system call interception is the tracepoint. Basically, it's a static kernel hook. For system call interception, the Linux kernel provides the sys_enter and sys_exit tracepoints. If we attach a tracing program to these tracepoints, the functions trace_sys_enter and trace_sys_exit will use the same parameters to trigger the tracing program. The first parameter is called regs, which saves the system call arguments, and the second is called id, which saves the system call number. Tracepoints provide low overhead, but they only provide static system call interception. You can also use a dynamic approach like kprobes to intercept system calls. Using kprobes, you can register a tracing program on almost any instruction in the kernel code path — here, the system call code path. When the instruction gets executed, the tracing program will be called. Compared to tracepoints, kprobes are a dynamic approach, but you need to know exactly how data is placed on the stack or in registers in order to get the useful information. Ptrace provides a user space solution for system call interception. Similar to tracepoints, it has static hooks at system call enter and exit. Using ptrace, you don't need to implement any kernel program as the tracing program — only a user space program is needed. Compared to the previous two approaches, the performance overhead of ptrace is high, but as an optimization you can combine it with seccomp system call filtering for better performance. There are other approaches for system call interception, like LD_PRELOAD. That approach is easy to bypass if you use the syscall instruction in assembly to trigger the system call directly. So, many of you have probably heard about cloud workload protection products.
These products usually provide advanced threat detection based on syscall tracing. There are different kinds of cloud workloads, such as virtual machines, containers on customer-managed VMs, serverless containers, and others. Serverless containers are usually allocated and maintained by the cloud providers on demand, so they usually have no access to the host. This table summarizes how different syscall tracing techniques can be applied to different cloud workloads. For virtual machines, we can use any kind of hook and tracing program. The tools we can leverage include Falco eBPF, the Falco kernel module, and Falco pdig. We'll talk about these tools later in the talk. For containers on customer-managed VMs, we have the same options as virtual machines, as long as we get enough capabilities. Serverless containers, as I mentioned, have no access to the host, so we can only use ptrace as the hooking point and implement the tracing program in user space. Instead of the Falco kernel module and eBPF programs, we can only use Falco pdig. Falco uses similar techniques to trace system calls. It's an open source project in the CNCF, and it's very popular. In kernel space, it supports a kernel module and eBPF programs using tracepoints. In user space, pdig is developed based on ptrace; Falco pdig is dedicated to syscall tracing of serverless workloads. We did not evaluate other security monitoring agents, but we believe the popularity of Falco represents an implementation that is widely accepted by the community. Unfortunately, this kind of implementation of syscall tracing is vulnerable to a TOCTOU issue — time of check versus time of use. Let's take the connect system call, for example. The second argument of the connect system call is called uservaddr. It is a user pointer pointing to a user space buffer containing the socket address. During the time of check, the tracing program will dereference this user space pointer to get the socket address.
And during the time of use, the kernel will also dereference the same user space pointer to get the socket address. But between the time of check and the time of use, the user memory pointed to by uservaddr can be changed by the attacker from user space. So, in this case, the socket address can be different between time of check and time of use, causing a TOCTOU issue. Let's dive into the connect system call, which can help you understand this TOCTOU issue. When the application issues a connect system call into the kernel, the system call handler checks whether any tracing programs are attached to the static hooks at system call enter — ptrace, seccomp, and the sys_enter tracepoint. If so, the tracing program is called. After that, the handler looks up the syscall table and jumps to the connect system call to create a connection on the socket. Before returning to user space, the handler again checks whether any tracing program is attached to the static hooks at syscall exit, like ptrace and the sys_exit tracepoint. Similarly, if so, the tracing program is called. As we mentioned earlier, the second argument of the connect system call is a user pointer pointing to the socket address in user space. This pointer is propagated down the connect code path and assigned to different kernel variables, which are highlighted in red. The kernel calls the move_addr_to_kernel function to make a copy of the socket address from user space into a kernel buffer called address, which is highlighted in green. After that, the kernel calls the internal function __sys_connect_file to create a connection on the socket based on the kernel buffer address. This is the time of use by the Linux kernel for the connect system call arguments. Before the memory copy function runs, the kernel buffer has not been created, so the user pointer is the only place we can dereference to get the socket address from user space.
In this case, during the time of check, if we attach the tracing program to the static hook at system call enter, or to any place before the memory copy function using a dynamic approach like kprobes, the tracing program has no option but to dereference the user pointer to get the socket address. After the memory copy function, the kernel buffer is created with a copy of the socket address from user space. However, the TOCTOU issue may still exist: think about attaching the tracing program to a static hook at system call exit, like the sys_exit tracepoint or ptrace. The tracing program may still dereference the user pointer to get the socket address. Falco pdig uses ptrace for system call enter and exit, but it only uses seccomp system call filtering at system call enter, since seccomp is not available at system call exit. For Falco versions older than 0.31.1, the kernel module and eBPF implementations only use the sys_exit tracepoint. So hopefully you now have some idea of the TOCTOU issue in system call tracing. Next I will hand over to Rex to talk about the vulnerability.

Okay, so the example that we saw is on kernel version 5.7 and the connect system call. But the observation really holds across all kernel versions, because we confirmed with the kernel developers that these TOCTOU windows have existed since the day tracepoints and ptrace were introduced — they were originally designed for performance analysis and debugging. To do secure tracing, what they recommend is that the software monitor the kernel memory instead. So we reported this issue to Falco. Basically, the issue is that they use the sys_exit tracepoint and they also use ptrace at syscall exit. Versions older than 0.31.1 are impacted, and if you happen to use the commercial version, you may want to check which versions are impacted. We reported this issue in December, and it was mitigated in March.
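The TOCTOU window described above can be simulated entirely in user space. The following Python sketch is an analogy, not kernel code: a "tracer" snapshots a shared buffer at the time of check, an override thread rewrites the same memory while the "syscall" is blocked, and the value at the time of use differs. Events are used in place of real blocking delays so the race is deterministic:

```python
import threading

# Stands in for the user space buffer a syscall argument points to.
buf = bytearray(b"/tmp/benign")

checked = threading.Event()      # "the tracer has read the buffer"
overwritten = threading.Event()  # "the override thread has rewritten it"

def override_thread():
    checked.wait()               # let the time-of-check happen first
    buf[:] = b"/etc/shadow"      # rewrite the very same memory
    overwritten.set()

t = threading.Thread(target=override_thread)
t.start()

time_of_check = bytes(buf)       # what the tracing program records
checked.set()
overwritten.wait()               # the "syscall" blocks here (delay window)
time_of_use = bytes(buf)         # what the "kernel" actually consumes
t.join()

print(time_of_check)  # b'/tmp/benign'
print(time_of_use)    # b'/etc/shadow'
```

The real exploit replaces the explicit events with a genuinely blocked system call, but the effect is the same: the tracer and the kernel dereference the same pointer at different times and see different values.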
The mitigation that was deployed, for the kernel module and eBPF versions, is to check the system call arguments at both syscall enter and syscall exit for a selected set of system calls. They do the same for Falco pdig. In terms of how many system calls are impacted: we analyzed the open source rules in the Falco repo, and the majority of them are impacted, with two exceptions. One is the execve system call. The reason is that in their implementation, when they trace execve, they actually read the kernel memory instead. The other exception is the sendto and sendmsg system calls — we haven't found a reliable way to block those system calls. We'll talk about what blocking means in the next few slides. But keep in mind that it's very heavyweight to monitor sendto and sendmsg, and this typically limits their usage. Okay, so hopefully everybody understands the vulnerability at this point. It's relatively simple, but let's talk about how to exploit it. Now, to exploit this tracing, we don't want to require any additional privileges or capabilities, right? Basically, an attacker should be able to evade detection as any user, with any privileges or capabilities. We want to have some level of control over when to inject the delay — I'll explain what I mean by that later. We also need to inject enough delay so that when we overwrite the memory, the machine has enough time to propagate the overwritten data across the whole machine. The last thing is that we want this exploit to be 100% reliable, because if the attacker is detected even once, the entire operation may be at risk. This leads us to two exploitation strategies. We'll quickly overview exploitation strategy number one, which is what we presented at DEF CON last year, and from there you will see how we arrived at exploitation strategy number two. So, last year we discovered that you can inject a relatively small amount of delay by using cross-core interrupts.
And because the amount of delay we inject is relatively small, we have to precisely orchestrate the whole process. What I mean by that is we have to inject at the precise time, overwrite precisely, and also synchronize using some special techniques. One of the primitives we used in that exploit is the userfaultfd system call. Now, although this technique is powerful, it has some limitations. For example, if you use a Docker container with the default Docker seccomp profile enabled, Docker will actually block that system call. The other thing is that most cloud workloads don't use this system call, so usage of the system call itself indicates some kind of anomaly. And this is actually the mitigation Falco deployed last year: they implemented a detection rule to detect usage of the system call. So we went back and thought: can we actually develop something that doesn't rely on that system call? We came up with a scenario: what if you can inject a delay that is really, really long? Then you don't have to orchestrate everything with precise timing, right? It sounds very simple, but the question is how to do it. We found two ways. One is to use blocking conditions in system calls — this can be used to attack sys_exit. The other is to insert additional seccomp rules — this can be used to attack sys_enter. So what do we mean by blocking the system call? If you think about the underlying mechanism of a system call, it's basically the kernel interacting with resources on behalf of the user space program. A lot of these resources are I/O devices, and the kernel needs to wait for the response from the device before it can proceed. So let's look at a concrete example: the connect system call. In this diagram there are two machines — a client and a server. Let's say the client is monitored by the tracing software.
The client wants to talk to the server, so it issues a connect system call. When the kernel executes it, it notifies the underlying networking stack and sends a SYN packet; the server returns a SYN-ACK, and the client returns an ACK. Then the system call returns, and the tracepoint or ptrace reads the arguments before the system call exits. Now you may be wondering — this is computer networking 101, what can go wrong here? Let's imagine this scenario: many times, when an attacker compromises a machine or a workload, they will try to talk to a command-and-control server. This is a very practical setting. So in this case, the attacker also controls the server side. Let's look at the example in detail. In the diagram there is a client and a server, and the client is monitored by the tracing software. The attacker creates a syscall thread. The syscall thread first creates an override thread, and then calls the connect system call with a memory page that holds the malicious IP it wants to talk to. The kernel sends the SYN packet over to the server. Now, remember the attacker controls the server, so the attacker can say: hey, I want to drop this SYN packet. Then what happens? The client resends another SYN packet, and the server can drop that packet again. Then the client tries again. Every time the client retries, there is a delay controlled by TCP's retransmission backoff — without going into too much detail, roughly you can think of it as an exponentially growing delay on every retry. While this retrying is happening, the override thread overwrites that memory page with a benign IP address, and then we have enough time for that data to propagate to all the memory copies on the system. Finally, the server says: okay, here's your SYN-ACK packet. Then the tracepoint and ptrace read out the benign IP address, and the system call exits.
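To get a feel for the size of the delay window this buys, here is a back-of-the-envelope model of SYN retransmission backoff. The exact behavior depends on the kernel version and the net.ipv4.tcp_syn_retries sysctl, and real timers include jitter, so treat this as an approximation — the function name and defaults are ours:

```python
def syn_retry_schedule(retries=6, initial_rto=1.0):
    """Approximate exponential backoff between TCP SYN retransmissions:
    the retransmission timeout starts around initial_rto seconds and
    roughly doubles on each successive retry."""
    delays = [initial_rto * (2 ** i) for i in range(retries)]
    return delays, sum(delays)

delays, total = syn_retry_schedule()
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(total)   # 63.0
```

Even this rough model shows the point: instead of the microsecond-scale window of a normal syscall, the attacker-controlled server stretches connect() into tens of seconds of blocking — ample time to overwrite the argument memory at leisure.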
To show this works, we will show you two demos. First, a demo at the VM/container level. In this demo, you're first looking at the server side. Notice the server IP address ending with 176. Then we fire up this XDP program, which looks for a specific SYN packet on a specific port. We also fire up a listening server on the server side. On the client side, we run the Falco software, and we also run Wireshark. Then we run the exploit — it's basically a TCP client that tries to connect to the server. You see the connection is successfully established, and it sends some chat messages. The server receives the chat messages and sends something back. Right, everything's working; the connection is there. But now, if you look at the IP address reported by Falco, it actually says 1.1.1.1. But if you look at Wireshark, it shows the first two SYN packets were dropped, and the IP address that is actually communicating is the real server IP address. This indicates a successful bypass. Okay, in the second demo, we're going to show you the bypass on Fargate. Similarly, first we show the server side — the server IP ends with 163. We run the XDP program that will drop the SYN packets, and we start the listening server. Here's the Fargate container; it runs pdig with our attack program. Notice that pdig reports the connect system call as connecting to 1.1.1.1, which again shows it's not able to identify the real IP address. You see the client and server connect and start chatting, but pdig reports: hey, we're connecting to 1.1.1.1. Okay, so at this point you're probably wondering: this is the connect system call — what about the other system calls? We found that the entire class of file system system calls is impacted.
Also, other system calls that rely on the file system to perform some task are impacted too. One example is the execve and execveat system calls, because when you execute a binary, the kernel first goes to the disk to fetch the binary. Next, Junyuan will talk about how we bypass the file system system calls.

So, before talking about how to bypass openat system call tracing, let me introduce FUSE. FUSE stands for Filesystem in Userspace — a user space file system framework. It usually includes a kernel module, a user space library, and a mount utility. In cloud scenarios, FUSE is used for remote storage. With such a FUSE file system, we can mount remote objects as a local file system and access remote files as if they were local files. Since it's a user space file system, it allows faster development, and it usually does not panic the kernel. Remote-storage FUSE file systems are widely used — here's a list of examples, and from it you can see that each major cloud provider has its own FUSE file system. This is the general architecture of a remote-storage FUSE file system. If the user space application or container wants to open a remote file through FUSE, what it does is basically the same as opening a local file: it issues an openat system call into the kernel. The request is routed from the VFS layer to the FUSE kernel driver, and then to the user space file system daemon, like gcsfuse. The daemon sends a request to the remote storage, like Google Cloud Storage or AWS S3. Once the response is back, it is sent back to the user space application along the original path. One thing I need to mention here: the delay between the client and the remote server is much longer than the syscall itself on the client. So basically, you can leverage this long delay to bypass openat tracing. Let's see how it works. We have a malicious client that is monitored by the tracing program. The syscall thread is going to open a remote file with a malicious file name.
Basically, it issues an openat system call into the kernel, with the pathname pointing to the malicious file name in user space. Since the file is stored remotely, the openat request is routed from kernel space to user space and then on to the remote server. Before the response comes back, the override thread can jump in and overwrite the memory pointed to by the pathname — changing it from the malicious file name to a benign file name. And again, because the delay is much longer than the syscall itself, the CPU has enough time to propagate the change to the different copies in memory and registers. After the response is back, and before returning to user space, the tracing program can use the sys_exit tracepoint to get the arguments of the system call. Unfortunately, the argument has by then been changed from the malicious file name to the benign file name. In this case, we can successfully bypass the openat system call tracing. Let me show you how it works in a demo. We have a console on the left for the Google Cloud Storage bucket. We deploy the Falco agent in a GKE cluster, and we log into one of the pods. We check the processes inside the container: we have the Falco agent running, and we have a gcsfuse daemon running, which mounts the Google Cloud Storage bucket onto the local folder /mnt. Then we check the Falco logs: one event was generated, because we just logged into the pod. Then we check the /mnt folder — it's empty, which means the Google Cloud Storage bucket is empty. Then we open a file called "malicious file" in the /mnt folder. We check the folder, and the malicious file has been created in the remote storage in Google Cloud. We check the Falco logs again: one event was generated, because we used an openat system call to open the malicious file under the /mnt folder. We remove the malicious file, which removes it from the remote server.
Then we run the attack, as I just described, and we check the /mnt folder. The malicious file has been created by the attack code. Then we check the Falco log: no new event was generated, which means our bypass has succeeded. Next, I'll hand back to Rex to talk more about the exploits and conclude.

Okay, so the previous few examples — networking and file system — demonstrate how you can trigger blocking conditions in system calls. This doesn't exhaust all the blocking conditions; there are many other ways to trigger blocking conditions in different system calls. But now we're going to talk about how to use seccomp to bypass syscall tracing at sys_enter. If you recall the earlier slide: when the system call executes, it first hits ptrace, then seccomp, then a bunch of other stuff, and then the real system call dereferences the user space memory, right? Now the question is: can we use seccomp to bypass the ptrace tracing at sys_enter? Before I dive into the exploit, a brief recap of what seccomp is. Seccomp is a kernel mechanism to filter system calls. You can build sandboxes on top of it, and it allows the developer to filter system calls, even based on their arguments. Now, if you have complex seccomp rules, they quickly become more expensive and take more time to evaluate. Also, one caveat to keep in mind, which will be useful later: the first inserted seccomp filter gets evaluated last — filters are evaluated in reverse order of insertion. Okay, so here's how to bypass ptrace at sys_enter. The idea is very simple. What you're seeing here is just a single machine. The attacker runs the syscall thread. First it inserts a bunch of seccomp rules, and then it creates the override thread. Then it calls a system call — in this diagram we use creat, but this really works for any system call.
So in the create system call, it passes in the path name that points to malicious file, sorry, benign file. And P-trees happily read the benign file argument. Then the SETCOM check starts and then the override thread at the same time to overriding the memory with the malicious file. Because the SETCOM take quite a bit of time to compute due to the number of rules that we inserted, then the system has enough time to propagate this benign file to all the memory copies. And then finally SETCOM compute finish. Kernel reads the argument and Kernel actually create a malicious file for us. Okay, so there's one interesting caveat related to P-trees is you can use P-trees in plain mode. You can also P-trees in the mode called P-trees plus SETCOM redirect. What that means is when the tracer, if the tracer start the application, then the tracer can say any system call that the application calls, please redirect back to the tracer. So tracer is going to insert a SETCOM rule first. And that rule says as long as you hit the SETCOM rule, you redirect execution back to the tracer. So if the tracer is able to insert the first SETCOM rule, then no matter how many SETCOM rule that the application insert, it's not going to impact the timing window between when the tracer reads the system call argument and when the kernel reads it, right? So in this case, it's not vulnerable to the attack. But if you are attaching to running application, then you have no control on how many SETCOM rule has been inserted before the trees are attached. Okay, so we have covered a lot of different issues and also how to exploit them. So just to summarize these issues and also how to mitigate different scenarios. So first thing is P-trees as a center. Now in order to mitigate this, you can use P-trees plus SETCOM redirect, but you need to start the application as well. 
Now, if you need to attach to a running application, you may want to inspect which seccomp filters have already been inserted, to see if there is any anomaly. Otherwise, you can also insert a seccomp redirect filter yourself, but that means all the previous seccomp filters inserted by the application are ignored, so it won't be useful. The second scenario is the tracepoint at sys_enter. Although the TOCTOU issue exists, we haven't found a reliable way to exploit it. The next one is the tracepoint at sys_exit. This is vulnerable to both the attack in this talk and our research from last year. To mitigate it, one option is to compare the sys_enter and sys_exit tracepoint data to see if the arguments have been tampered with — of course, this incurs some performance overhead. The other option is to use kprobes, which I will talk about in the last row. The next one is ptrace at sys_exit. It's vulnerable to the same attacks as the previous one. To mitigate it, one needs to deploy all the mitigations for ptrace at sys_enter, and then compare the ptrace sys_enter and sys_exit results. The last one is kprobes on other kernel internal functions. Whether this is vulnerable really depends on which kernel function is being probed. But in general, if you want to avoid the TOCTOU issues, specifically for system call monitoring, you can pick the Linux Security Module (LSM) interface. Some newer kernels also allow you to use BPF LSM. But this depends on how wide a range of kernels you need to support, because the LSM interface has different support across kernel versions. And if the specific kernel interface you happen to inspect is not covered by LSM, you quickly get into the complexity of figuring out whether it exists in all the kernel versions you want to support. Okay, so to conclude. First, zooming into this research issue: we showed that kernel tracing can be bypassed reliably in many different ways.
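The "compare enter and exit" mitigation from the table can be sketched as follows — a user space Python illustration with hypothetical callback names, not Falco's actual implementation:

```python
# Sketch of the enter/exit comparison mitigation: snapshot the argument
# bytes at sys_enter, re-read at sys_exit, and flag any mismatch as
# mid-syscall tampering. Callback names are illustrative.

events = []  # (phase, snapshot) records captured by a hypothetical tracer

def on_enter(arg: bytes):
    events.append(("enter", bytes(arg)))

def on_exit(arg: bytes):
    events.append(("exit", bytes(arg)))
    enter_snapshot = events[-2][1]
    if enter_snapshot != bytes(arg):
        return "ALERT: argument changed between enter and exit"
    return "ok"

arg = bytearray(b"/tmp/benign")
on_enter(arg)
arg[:] = b"/etc/shadow"   # the override thread strikes mid-"syscall"
verdict = on_exit(arg)
print(verdict)  # ALERT: argument changed between enter and exit
```

Note this only detects tampering; it cannot say which of the two values the kernel actually used, which is why the talk's stronger recommendation is to read the kernel's own copy (e.g. via LSM hooks) rather than user memory.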
So if you happen to use a tool that does similar things, you may want to check whether your tool is vulnerable to this, because we've only evaluated the open source solution — we didn't evaluate any proprietary software. The second thing to keep in mind is that the mitigation is complicated. It depends on what form factors you use and what kernel versions the software needs to support. So you want to double-check that the mitigation was actually implemented in the way you want it to be. Then, zooming out from this particular issue to the bigger picture: if your security team is able to correlate different data sources and build a comprehensive view of your environment, the evasion complexity suddenly becomes much higher. Last but not least — and this is something we do a lot at Lacework — you should understand your environment: what is normal and what is not. If you know the baseline of your environment, then even if the attacker is able to overwrite these arguments, they are basically constrained to values that fit the normal expectation. Okay, before we take the Q&A, I just want to give a quick shout-out to the following folks. Joe helped with the kernel and security discussions. A lot of my Lacework Labs colleagues gave great feedback on this work. I also want to thank the Falco open source team for a very collaborative disclosure process, and John Dixon for helping with the presentation. Finally, before we take the Q&A: if you have any other questions about cloud security in your environment, or if you want to advance your career in cloud security, I'm very happy to talk to you offline. With that, we're ready for Q&A. Anyone who wants to ask a question — there's a mic here, and Steve will help walk it over to you. Can you hear me? Yeah.
In some of your screen captures, your exploits from last year were labeled Phantom v1, and these are Phantom v3 — what was Phantom v2? Oh yeah, thank you, thank you for that. There are just too many attacks that we've published. Phantom v2, at the time, we called semantic confusion. For the folks who are not familiar with it, that's also a bypass of system call tracing. The idea is this: when you trace a file system object, for example /etc/shadow, if it is an openat, the rule basically detects whether the path being opened is that path. But what if you create a link that points to the file object, and then you open the link? The tracing software will not be able to interpret it, but the kernel will. So there's a confusion between the tracing software and the kernel. Hey, thank you everyone for coming. Yeah.
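[Editor's note: the link-based semantic confusion described in this answer can be demonstrated in a few lines of Python. The paths and the naive path-matching rule below are our own illustration, not the speakers' code; the hash string is a placeholder.]

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
secret = os.path.join(tmp, "shadow")          # stands in for /etc/shadow
link = os.path.join(tmp, "innocent_link")

with open(secret, "w") as f:
    f.write("root:PLACEHOLDER_HASH:19000:0:99999:7:::\n")
os.symlink(secret, link)                      # link -> protected file

# A naive rule that matches only the literal openat() path argument
# never sees the protected name when the link is opened instead...
traced_path = link
naive_rule_fires = (traced_path == secret)

# ...but the kernel resolves the symlink and opens the real target.
with open(link) as f:
    data = f.read()
resolved = os.path.realpath(link)

print(naive_rule_fires)                          # False: the tracer is fooled
print(resolved == os.path.realpath(secret))      # True: the kernel opened it
```

The tracer and the kernel disagree about the meaning of the same pathname argument — the tracer sees the link's name, the kernel sees the resolved target — which is exactly the "semantic confusion" of Phantom v2.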