Good morning, everyone. I'm very happy to be with you today. My name is Yueqi Chen; you can also call me Lewis. Today I'm going to present my recent work on protecting the memory of the Linux kernel, especially the memory managed by the slab/slub allocator. This protection can be enforced without recompiling the kernel or rebooting the system, which is why we call it on-demand and on-the-fly. At the very beginning of the talk, I'd like to introduce myself and my collaborators. I'm currently an assistant professor in computer science at the University of Colorado Boulder, and Zhenpeng, my former labmate, is right now a PhD student at Northwestern University. I want to give special thanks to Michael and Hani from the IBM Watson Research Center; this project began when I was an intern at IBM. I'm also very grateful to Dan Williams, who is now an assistant professor at Virginia Tech. Dan was previously a researcher at IBM and was my mentor when I interned there. I could never have made it without their help. Just like software, a vulnerability has its own life cycle. The first important time step for a vulnerability is when it is introduced. After several months or years, it is discovered by developers or attackers through techniques like static analysis, dynamic fuzzing, or maybe simply manual code auditing. Then developers start diagnosing the vulnerability and craft patches to fix it in the upstream kernel. Finally, the upstream patch is shipped to various Linux distros and thus deployed in the real world. Along this timeline, I want to highlight the time window between vulnerability discovery and patch development. This window is very sensitive because no patch is available yet, so attackers have full freedom to develop exploits to compromise the operating system kernel during it. According to a previous measurement study, this time window is very long: about 66 days on average.
Therefore, we propose to stop attacks within this time window, and we developed our protection, named HOTBPF, to provide in-time protection. Before digging into the details of HOTBPF's design, I want to briefly describe some common exploitation techniques in the kernel, which enlightened our design. It's well known that slab-based exploitation starts with memory corruption. Taking use-after-free as an example: at the beginning, a slab is divided into slots for holding objects, and the kernel visits the fields of an object via a dangling pointer. Without any corruption, the function pointer in the vulnerable object refers to a benign address. Typically, a use-after-free vulnerability can be exploited in three steps. In the first step, the vulnerable object is freed, and the slot holding it is recycled for following allocations. Because the dangling pointer is not nullified, the adversary can still visit the freed object in the slab, including its function pointer. This violates the legal usage of kernel objects and is thus considered memory corruption. In the second step, the adversary performs a heap spray to allocate an object into the same slot. The vulnerable object, which generates the corruption, now overlaps with the sprayed object, and as a consequence, the function pointer is overwritten with a malicious address. Finally, in the third step, the adversary dereferences the tampered function pointer via the dangling pointer and thus hijacks the kernel execution. Now, let's take a look at exploitation against a slab out-of-bounds write, which can also be completed in three steps. At the beginning, we have a vulnerable object on the slab. The attackers can trigger the out-of-bounds write to corrupt the memory region right after this vulnerable object. To exploit this slab out-of-bounds write vulnerability, the first step is to place a victim object right after the vulnerable object.
This victim object can include a function pointer if the attackers' goal is to hijack the control flow. In the second step, we trigger the vulnerability to overwrite the function pointer in the victim object and make it reference a malicious address. At this point, you can see that we have a corruption, and this corruption extends from the vulnerable object into the victim object. Finally, the attackers dereference the corrupted function pointer to hijack the control flow of the kernel. The corruption here is associated with a vulnerable object, similar to what we saw in the use-after-free exploitation on the previous slide. What if we were able to separate this vulnerable object? The corruption would be separated as well, and it would become harder for attackers to tamper with the victim objects. This idea enlightened the design of HOTBPF. Now, let's take a look at the kernel memory layout. On the slide is the layout of the kernel on the x86 architecture. It has two regions. One is called the direct mapping, also known as the physmap. This region includes the memory managed by the slab/slub allocator, so without HOTBPF, without our protection, both the vulnerable object and the victim object reside in this region. The other region on the slide is called the vmalloc region, which never overlaps with the direct mapping region. The key idea of HOTBPF is to separate the vulnerable object into the vmalloc region. As such, the memory corruption associated with the vulnerable object is also separated into the vmalloc region, and the attackers cannot tamper with the victim object in the direct mapping region by misusing this corruption. The idea is quite simple, but implementing it is very challenging. The first technical challenge is how to identify the vulnerable object. The Linux kernel has millions of lines of code: statically, the number of kernel structures is over 6,000, while dynamically, the number of kernel objects at runtime is over 100,000.
Among so many kernel objects, only one or two are vulnerable and can cause corruption. We need to develop techniques to identify them and separate them into the vmalloc region. It's like finding a needle in a haystack, so something like a magnet would be very helpful for determining whether an object is worth separating. You may wonder why we don't simply take a glance at the bug report to identify the vulnerable module and separate all objects in that module; that would be much easier. We don't adopt this coarse-grained separation for two reasons. First, from the perspective of security, if we separate too many objects, we increase the probability that a suitable victim object is also separated, which actually facilitates the attackers' exploitation. Second, from the perspective of performance, it makes no sense to separate too many objects, because accessing objects in the vmalloc region must first go through the page table. It takes much longer to access a vmalloc object than a physmap object, and longer latency on object accesses means higher performance overhead, which is what we want to avoid. The second technical challenge of implementing HOTBPF lies in how to enforce separation without recompiling the kernel and rebooting the whole system. Imagine that you have a set of servers running the Linux kernel in a data center. Without dynamic separation enforcement, the whole process is very cumbersome. First, you need to stop all machines that are vulnerable, and for each machine, the code for separation has to be added manually. If your machines are running different kernel versions, you need to craft separation code for each version. After this, you need to recompile the kernel and reboot the whole system before finally restarting the critical services. Yes, you need to resume the previous services, and this interruption will make you lose profit and customers. Fortunately, if we can dynamically enforce separation, all these steps can be skipped.
Besides, with this flexibility, we are able to protect the kernel immediately after the report of any vulnerability, leaving no time window for attackers to compromise the system. Now let's take an overview of the HOTBPF design. The input to HOTBPF is bug reports generated by dynamic fuzzing tools like syzkaller and by various static analysis tools. When these tools find a bug, they generate a bug report containing useful information about the root cause of the bug, for example, the call stack trace. HOTBPF runs an agent in user space and a BPF program in kernel space. In user space, the agent analyzes the bug report and the call stack to identify the vulnerable objects. As mentioned on the previous slide, if the bug is a buffer overflow, the vulnerable object is the object containing the overflowed buffer; if the bug is a use-after-free, the vulnerable object is the object referenced by the dangling pointer. After identifying the vulnerable objects, the agent analyzes the kernel image and calculates the allocation addresses of the vulnerable objects. For example, if the vulnerable object is of type tun_file, the agent identifies all the allocation sites of tun_file at the source code level and then leverages debugging information to obtain the binary addresses of these allocation sites. The agent passes the binary addresses to the BPF program in the kernel. As background, the BPF mechanism works as a virtual machine in the kernel: it can observe kernel behaviors and trace kernel activities. BPF programs were originally designed for packet filtering and monitoring, but here we use a BPF program for protection. More specifically, after receiving the allocation addresses from user space, the BPF program hooks the binary addresses to intercept the allocation of vulnerable objects and divert them to the vmalloc region instead of the slab. In this way, the vulnerable object is separated into the vmalloc region.
Besides, the BPF program guarantees one-time allocation, which means that once the memory is freed, it won't be reclaimed and reused. Therefore, common use-after-free exploitation techniques cannot do a heap spray to retake the freed memory and overlap the vulnerable object with other objects. Here are more details about how we identify vulnerable objects. As you can see on the slide, there's a call trace that records the calling relationships when the vulnerability is triggered. Typically, the last several functions in this call trace come from native debugging features in the Linux kernel or from utility functions instrumented by sanitizers. By analyzing the call trace, we can recognize the panic sites. In this example, the panic site is a WARN_ON macro in the function vhost_dev_cleanup; this one is a native debugging feature in the Linux kernel. And in this other example, the panic site is a kernel address sanitizer instrumentation. These panic sites suggest the program states that cause problems. If you take a close look at them, you can see that these panic sites include checking conditions. For example, if the dev work list is not empty, the checking condition fails, so the kernel panics; or if the sg offset references an illegal memory region, the kernel address sanitizer panics the kernel and reports. The critical variables in the checking conditions carry the root cause information of the vulnerability, so our analysis starts from the critical variables in the checking condition. Let me use an example to illustrate how we identify vulnerable objects from these critical variables. On the slide, the checking condition is a READ_ONCE, with timer->base as the critical variable. We first taint timer->base and propagate the taint backward along the call trace attached to the bug report. The taint is propagated to timer and then to tfile->napi, where tfile is of type struct tun_file.
The definition of struct tun_file is on the slide, and the napi struct is part of it. We consider tun_file, rather than the napi structure, to be the vulnerable object, because the napi structure is embedded in tun_file and cannot be allocated on the slab individually. With this tun_file in hand, we further analyze the kernel code to find its allocation sites and then use debugging information to obtain the binary addresses of those allocation sites. Due to space and time limits, I won't go into details about how we identify allocation sites at the source code level and how we use debugging information to obtain their binary addresses. Instead, I will focus more on how we dynamically enforce separation. Say that the vulnerable object and its allocation addresses have been identified; the agent then produces a BPF program and installs it into the kernel. When installing this BPF program, the kernel rewrites the kernel image by replacing the call instructions for allocation and free with int3 interrupts. When the kernel traps on an int3 interrupt, the corresponding handler is executed. More specifically, the eBPF program has a kmalloc handler, a kfree handler, and a map structure. The map structure records the addresses of vulnerable objects. When the kmalloc call is about to be executed, the kmalloc handler is triggered; it requests a piece of memory from the vmalloc region and records the address of the returned memory in the map. When kfree is about to be executed, the kfree handler is triggered. It queries the map, and if the freed address hits in the map, the freed object is a vulnerable object. The kfree handler then skips the real free operation, which realizes the one-time allocation, because the memory is never recycled if the true free operation is not executed. Let me describe this process from another perspective. In the eBPF map, we have keys and values.
The key is the object address, and the value is either 0 or 1, recording the status of objects in the vmalloc region. When object 3 is freed, the kfree handler hits object 3 in the map and changes its value to 0. However, the kfree handler won't actually free it. Therefore, the memory for object 3 will never be reclaimed, which defeats use-after-free exploitation techniques; this is how we achieve one-time allocation. Before presenting the evaluation results, I want to highlight some implementation details. Before version 5.13, the kernel doesn't support attaching hooks to arbitrary kernel addresses; kprobe and kretprobe attachment points for eBPF programs are limited to function entries and returns. Therefore, the implementation before version 5.13 is quite naive: the eBPF program is hooked to the function entry of the allocation functions, so every time an allocation function is invoked, the eBPF program is executed, no matter whether the allocated object is vulnerable or not. Within the eBPF program, we compare the return address with the identified allocation addresses to determine whether the kernel is about to allocate a vulnerable object, so that we can decide whether or not to divert the allocation. As you can imagine, this implementation wastes a lot of time on unrelated allocation sites. The situation is much better since version 5.13, because we can precisely hook the exact allocation site that we want. For example, we can hook only the allocation invoked at a particular offset, say 0xdd, of the containing function. The unrelated allocations will not invoke the eBPF program, which saves a lot of time and resources. Finally, we evaluated the effectiveness and the performance overhead of HOTBPF. We randomly selected 25 real-world bugs that we could reproduce from the syzbot dashboard, and these 25 bugs yielded 46 bug reports, because one bug can manifest different behaviors when the input varies.
The types of bug reports in our dataset are very diverse. We have bug reports generated by native debugging features like BUG, WARNING, GPF, and INFO, and also reports generated by sanitizers like the kernel address sanitizer, the kernel memory sanitizer, and the undefined behavior sanitizer. The amount of useful information contained in these reports differs; typically, the information in reports generated by sanitizers is much more precise. However, as you will see, this difference doesn't influence the effectiveness of our analysis tool. To evaluate the performance overhead, we use two benchmarks. The first is the standard LMbench, to measure the overall performance overhead. The second is a customized benchmark that stress-tests the worst-case performance. It is derived from PoC programs and keeps invoking system calls that allocate and free the vulnerable objects, so as to measure the time latency of the eBPF programs. The performance evaluation was conducted on a desktop. We used QEMU and VMware at the very beginning, but it turned out that the overhead measurements fluctuated significantly, so we had no choice but to use a bare-metal machine. I show part of our results on the slide due to the space limit. The title column is the bug report title; we have bugs reported by BUG, the kernel address sanitizer, the kernel memory sanitizer, and WARNING. It is worth mentioning that the last two WARNING bug reports belong to the same bug, and their results show no difference, which demonstrates the robustness of our identification tool. The ground truth column shows the actual vulnerable object, identified manually, and the analysis results are the output of our automatic identification tool. FP stands for false positive, which means how many identified objects are not actually vulnerable, and FN stands for false negative, which is where the object identification misses the ground truth.
The allocation sites column indicates the program sites that allocate the identified objects, including the false positives, and the allocation addresses are the corresponding binary addresses. Some allocation sites are associated with more than one binary address because of function inlining and compiler optimizations. In summary, none of the 46 bug reports have false negatives, because our analysis is conservative, and the false positives in the worst case are 7 out of 8. I show the performance evaluation of four cases on the slide. The average overhead is negligible with LMbench, and the reason is that LMbench measures the overall overhead, while our separation only kicks in when the vulnerable object is allocated and freed. So, we do stress testing to obtain the worst-case overhead. As you can see for the first bug report, the average time spent in the BPF program for an allocation is around 100 ns if the separation doesn't happen; otherwise, if the separation does happen, the average time spent in the BPF program is around 23,000 ns. It takes 1,500 ns to query the map and determine whether the object to be freed is vulnerable, and 6,000 ns to mark the vulnerable object as freed. You may not be able to tell the meaning of these numbers, so as a reference, the time for a read system call is 420,000 ns, which is much larger. The current implementation of HOTBPF is not perfect; it has limitations and lots of to-dos. The first limitation of the design is that some objects cannot be put into the vmalloc region; for example, DMA objects must reside in the direct mapping. Second, we are working on adding support for KASLR, because the allocation addresses change when KASLR is enabled. For a similar reason, we need to add support for allocation sites in kernel modules, because kernel modules are loaded dynamically, so the addresses of allocation sites in modules also change every time.
Finally, I want to say thank you very much. The code has been open-sourced on GitHub along with test cases. If you have questions, please email me. I'd now like to open the floor for questions, and I will answer them in the chat box.