Hi everyone, thank you for attending my talk, Fuzzing Linux with Xen. My name is Tamas K. Lengyel. A little bit of background about me: I'm a senior security researcher at Intel, and I'm the maintainer of Xen's introspection subsystem as well as of LibVMI. My background is mainly in malware analysis.

To get started, I assume everyone has a basic understanding of fuzzers, but what we'll be talking about today are feedback fuzzers. An important aspect of feedback fuzzers is that they are not just about feeding random input to your target code. Feedback fuzzers actually monitor the execution of the target to collect an execution log, also called coverage. The idea is to compare this coverage information from run to run to identify when new code is discovered during the fuzzing exercise, so that the fuzzer can keep mutating the inputs that discover new code. The more new code you discover, the higher the chance that you will find more bugs. This works really well in practice.

If you think about it, though, feedback fuzzers need determinism for this coverage information to be useful. If the target code behaves differently between executions for reasons other than the mutated input, the feedback is just noise. Garbage in, garbage out: the fuzzer will spend most of its cycles mutating inputs that don't actually lead to the discovery of new code.

Xen VM forking was developed to address this issue, particularly for kernel code. Kernel code is highly non-deterministic: you have interrupts firing and multiple threads running in the background. It's really difficult to achieve determinism unless you snapshot the system and restart from the very same execution context. VM forking allows us to do this on Xen very cheaply and very efficiently. The idea is to populate the VM fork's memory space lazily, while the fork is already running. In the page fault handler, we look at the type of memory access being performed: if it's a read or an execute, we can simply set up a shared page table entry pointing at the parent's page. Only when it's a write do we actually have to deduplicate the page for the VM fork. To achieve even better speed, between fuzzing cycles we can just reset the vCPU registers and throw away the duplicated pages instead of creating a fresh fork. If you look at the numbers: starting from scratch, a VM fork can be created about 1,300 times per second; with resets instead, we can go up to about 9,000 resets per second. These numbers are quite good for fuzzing.
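To make that copy-on-write decision concrete, here is a toy sketch in C. This is not Xen's actual memory-sharing code; every type and function name below is invented purely to illustrate the idea:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of the VM-fork copy-on-write logic described above.
 * Nothing here is Xen code; all names are invented for illustration. */
#define PAGE_SIZE 4096
#define NPAGES    16

typedef enum { ACC_READ, ACC_EXEC, ACC_WRITE } access_t;

struct vm {
    uint8_t *pte[NPAGES];    /* guest frame -> backing host page        */
    int      shared[NPAGES]; /* 1 if the entry still aliases the parent */
};

/* Lazily populate one of the fork's page-table entries on a fault. */
static void fork_page_fault(struct vm *fork, struct vm *parent,
                            int gfn, access_t acc)
{
    if (acc != ACC_WRITE) {
        /* Read or execute: a shared, read-only alias of the parent's
         * page is enough, so no copying is needed. */
        fork->pte[gfn] = parent->pte[gfn];
        fork->shared[gfn] = 1;
    } else {
        /* Write: the only case where the page must be deduplicated. */
        uint8_t *copy = malloc(PAGE_SIZE);
        memcpy(copy, parent->pte[gfn], PAGE_SIZE);
        fork->pte[gfn] = copy;
        fork->shared[gfn] = 0;
    }
}

/* A "reset" keeps the fork but throws away its private copies, which is
 * why resets are so much faster than creating fresh forks. */
static void fork_reset(struct vm *fork)
{
    for (int gfn = 0; gfn < NPAGES; gfn++) {
        if (fork->pte[gfn] && !fork->shared[gfn])
            free(fork->pte[gfn]);
        fork->pte[gfn] = NULL;
        fork->shared[gfn] = 0;
        /* (The vCPU registers would be restored here as well.) */
    }
}
```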
Another important bit of the Xen infrastructure I want to highlight is the introspection subsystem. Introspection allows us to poke inside a virtual machine's runtime: read, write, and translate guest virtual memory, but also pause the VM and intercept hardware events as it executes, such as a CPUID instruction being executed, software breakpoints, single-stepping, EPT faults, and a bunch of other events. These can be used very easily to intercept and control the execution of the VM without having to install any kind of agent inside it, which simplifies the setup a lot.

Xen VMTRACE is a new feature I also want to highlight; it was just added in Xen 4.15. It allows us to collect the execution log using the processor itself, via Intel Processor Trace, which records with very little overhead and can capture the execution context of the entire VM.

Now, if we look at how fuzzing on Xen works, the idea is to boot up the target kernel, which we call the parent VM, just like any other regular VM on Xen. In that VM, we boot the kernel with a magic CPUID compiled into the location where we want to start fuzzing. When that CPUID executes with the magic number, the hypervisor understands that this is the point where the VM needs to be paused, and this is where we can start creating forks.

Once that CPUID is caught, we create a VM called the sink VM, which is a fork of the parent. In this one, we look up the addresses of a bunch of kernel functions that the kernel itself uses to detect and report errors, such as panic when the kernel crashes; and if we enable subsystems like KASAN or UBSAN, we can use those error detection mechanisms during fuzzing as well. We look up these functions and inject a breakpoint at the start of each one, so that when they get called, we trap into the hypervisor.

Now we create a fork from that; this is the one we actually use for fuzzing. Into this fork we inject the input generated by the fuzzer. We use American Fuzzy Lop, or AFL, take the mutated input from AFL, and write it directly into the fork's memory. We unpause the VM, let it run, and see what happens. If we catch a breakpoint executing, that is one of the functions we breakpointed, signaling that something bad happened, so we report a crash to AFL. If instead we catch another magic CPUID, that is the end harness that we also compiled into the kernel, signaling that nothing bad happened during the execution. If neither happens before the timeout expires, the input may have triggered a hang inside the kernel. After the fork's execution finishes, we decode the processor trace log and use it to report coverage to AFL, which decides how to mutate the input for the next cycle. And we just keep repeating that as long as we want to.
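Here is a rough sketch of that loop in C. Every function in it is a hypothetical stand-in, invented for illustration; the real control flow lives in the hypervisor interfaces and fuzzer glue, not in code shaped like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the steps described above. */
enum outcome { HIT_SINK_BREAKPOINT, HIT_END_CPUID, HIT_TIMEOUT };

extern void *fork_from(void *sink_vm);
extern void  write_guest_memory(void *vm, uint64_t va,
                                const void *buf, size_t len);
extern enum outcome run_fork(void *vm, unsigned timeout_ms);
extern void  reset_fork(void *vm);
extern const void *afl_next_input(size_t *len);   /* mutated input from AFL */
extern const void *decode_pt_coverage(void *vm);  /* decoded PT log         */
extern void  afl_report(int crash, int hang, const void *coverage);

void fuzz_forever(void *sink_vm, uint64_t input_va)
{
    void *fork = fork_from(sink_vm);

    for (;;) {
        /* 1. Write AFL's mutated input directly into the fork's memory
         *    at the address the harness reported. */
        size_t len;
        const void *input = afl_next_input(&len);
        write_guest_memory(fork, input_va, input, len);

        /* 2. Unpause the fork and classify how the run ends. */
        enum outcome o = run_fork(fork, 1000);
        int crash = (o == HIT_SINK_BREAKPOINT); /* panic/KASAN/UBSAN sink */
        int hang  = (o == HIT_TIMEOUT);         /* no sink, no end harness */

        /* 3. Report coverage so AFL can pick the next mutation, then
         *    reset the fork instead of recreating it from scratch. */
        afl_report(crash, hang, decode_pt_coverage(fork));
        reset_fork(fork);
    }
}
```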
Now let's look at a demo. Here I have an Ubuntu 20.04 VM running on Xen whose kernel is already compiled with the CPUID instructions at the starting locations of a kernel module that I want to fuzz, and I boot with a couple of debug options enabled. On the left-hand side you can see this VM booting and printing the dmesg buffer. On the right-hand side you can see the configuration values for this VM: I have an emulated USB3 XHCI controller attached, with an emulated thumb drive plugged in. We want to fuzz the XHCI kernel module in that VM on the left.

I start the KF/x setup step. This starts listening for the magic CPUID leaf, 0x13371337. On the left-hand side, I log in and execute fdisk to create some stimulus that actually interacts with the XHCI device. We see that the magic CPUID was caught and the VM got paused. So what I can do now is read out the memory that would have been used as the input to a particular function inside that kernel module, and that input is what we want to start fuzzing.

I start afl-fuzz on the left-hand side, and we'll be using processor trace to collect the coverage. We have the address we want to fuzz: this is the address of a structure in memory that the kernel module will use to perform some action, and we start fuzzing it. You can see the speed is about 2,000 executions per second, and we almost immediately discover a crash.

At this point, you are probably wondering what we just fuzzed and what the bug is. And you're right: fuzzing itself is only part of a fuzzing operation. First you have to figure out what you want to fuzz, which is the analysis step. Then, once the fuzzer is running and you start finding bugs, you have to figure out what those bugs are. It's not enough to tell a maintainer, "hey, there are bugs in your system", without providing sufficient detail for them to go and fix the issue. So let's take a look at these next.

What we were fuzzing in the demo is DMA. DMA is memory that's made available to the device to facilitate fast I/O operations. While IOMMUs can generally restrict what memory a device can use for DMA, some pages have to be shared with the device to facilitate DMA at all, and what we were fuzzing is a DMA page that was explicitly made accessible to the device.

Now, actually figuring out where DMA accesses happen in the Linux kernel is far from trivial. The way we began this analysis was reading the source code, looking for any kind of hints, like the IOMMU cookie or variable types. What was most telling was when endianness conversion functions were involved. We then cross-referenced the source lines found during the review with ftrace output, so we could tell which of these functions actually execute at runtime, and figure out which would be good fuzzing targets. We also took a look at the XHCI specification, which was very useful for figuring out the names of the various rings that XHCI uses for DMA. Here, for example, you can see in the middle that the event ring, transfer ring, and command ring are the DMA pages and structures used to facilitate this fast I/O between the device and the host CPU.

Now, if you look at the code we actually had to add to the kernel: what you see here is the xhci_handle_event function, an interrupt handler in the XHCI subsystem. The interrupt comes in, and a structure called event is dequeued from the ring, which is DMA accessible. At this point, event is still just a pointer into that DMA-mapped memory, and that's where we want to start fuzzing. So the harness starts with the magic CPUID number, transferring the pointer and the size of that structure as part of the harness, and then we have a couple of harness-stop calls at the points where this interrupt handler returns. If you look at the harness start and harness stop functions, you can see that these are just CPUID instructions and nothing else.
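For illustration, a minimal harness of this shape could look like the sketch below. The 0x13371337 magic leaf and the register convention (buffer address in rcx, size in rdx) are assumptions made here purely for illustration; check the KF/x documentation for the exact protocol it expects:

```c
#include <stdint.h>

/* Hedged sketch of a CPUID-based harness. CPUID always traps to the
 * hypervisor under VT-x, so the magic leaf is visible to KF/x without
 * any agent running inside the guest. */
static inline void harness(void *buf, uint64_t size)
{
    uint64_t rax = 0x13371337, rbx = 0,       /* assumed magic leaf   */
             rcx = (uint64_t)buf,             /* fuzz target address  */
             rdx = size;                      /* fuzz target size     */

    __asm__ __volatile__ ("cpuid"
                          : "+a"(rax), "+b"(rbx), "+c"(rcx), "+d"(rdx)
                          :
                          : "memory");
}
```

In the xhci_handle_event example, such a call would sit right after the event pointer is dequeued, with another bare magic CPUID at each return path to mark the harness stop.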
Now, for triaging, to figure out what went wrong in that crash we found in the demo: the problem with VM forks is that they have no I/O at all. They have no network, no disk, no console; they are literally just running with CPU and memory and nothing else. Fortunately, the dmesg buffer that collects the output from panic, UBSAN, or KASAN is just in memory, so using introspection we can carve the dmesg buffer out without ever having to log into the VM.

To do that, we can use gdbsx, a minimal GDB bridge that has been shipping with Xen for a long time. If you compile the kernel with the right config options (debug info plus the kernel's GDB helper scripts), you can use plain GDB to perform various debug operations, and among those is dumping the dmesg buffer from the VM.

So let's take a look at this in practice. We are at the point where we just discovered that bug using the fuzzer. What we want to do is figure out which of the target functions we were monitoring actually tripped. I rerun KF/x with the debug option enabled so we can see that it's the UBSAN prologue that tripped the breakpoint. Now we rerun KF/x with the input that we know trips UBSAN, but stop at the UBSAN epilogue instead. That means that by the time we stop, the kernel has already finished printing into the dmesg buffer, so the memory already holds the debug information we're looking for; we just need to carve it out. We fire up gdbsx, attach it to the VM fork, then connect GDB to the bridge and print the dmesg buffer from the fork's memory. That tells us exactly what the bug is, what issue the fuzzer found. It's attaching to the bridge and, there we go, we see it's a UBSAN array-index-out-of-bounds error, and it even gives us the location of the bug.

For most bugs we found using the fuzzer, this type of triaging was perfectly adequate: look at the dmesg output, then look at the source code, and it is usually quite evident what the issue is. Unfortunately, not all the time. Sometimes the bug triggers in code that is far away from the driver we are fuzzing, and then it requires a little more inspection to figure out what went wrong.

For example, take the igb kernel module, which is used for a network device. This is the harness we used; very similar to before, it's an interrupt handler: when a packet arrives, the device places an Rx descriptor on the ring, it gets dequeued, and we start fuzzing it just as with XHCI. What you see when this is fuzzed is a null pointer dereference in a function called gro_pull_from_frag0. When we search for that function, it turns out it's not in igb at all: it's in net/core/dev.c. So this is a core networking function within the Linux kernel, not the target driver we were fuzzing. Evidently the driver was passing some information, in this case an sk_buff pointer, to the core networking subsystem, which performs a memcpy based on that information, and somehow that structure got corrupted while the fuzzer was running.

The easiest way to debug this is to do the same thing the fuzzer does: compare coverage. We can use the processor trace log, or we can simply single-step the execution of the VM forks from the start and compare a normal execution to the execution that triggers the KASAN error, by recording both execution logs and running diff on them. The very first line in that diff is the first instruction that executed only because the fuzz input was injected into the VM.
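As a sketch of that triage-by-diff idea, assuming a hypothetical single_step_trace() helper that collects the stream of guest instruction pointers (whether by single-stepping through introspection or by decoding a processor-trace buffer):

```c
#include <inttypes.h>
#include <stdio.h>

/* Record the instruction-pointer stream of a clean run and a crashing
 * run, then report the first point where they diverge. */
extern size_t single_step_trace(void *vm_fork, uint64_t *rips, size_t max);

void diff_traces(void *clean_fork, void *crashing_fork)
{
    static uint64_t clean[1 << 20], crashing[1 << 20];
    size_t nc = single_step_trace(clean_fork, clean, 1 << 20);
    size_t nx = single_step_trace(crashing_fork, crashing, 1 << 20);

    for (size_t i = 0; i < nc && i < nx; i++) {
        if (clean[i] != crashing[i]) {
            /* The first divergent entry is the first instruction that ran
             * only because of the fuzz input: start reading source here. */
            printf("diverges at step %zu: rip=0x%" PRIx64 "\n",
                   i, crashing[i]);
            return;
        }
    }
    printf("no divergence within the common prefix\n");
}
```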
So we can see that instruction 19,742 is the source line that executes only when the fuzz input is injected into the VM. If we look at what that code actually is: on the top you can see the code, and the body of that if statement is what executed only with the fuzz input. Evidently what happens here is that the fuzzer flipped a bit in the Rx descriptor that signals some state, and when that bit is set, the driver pulls a timestamp header out of the packet. But what happens if that bit is set and there is actually no timestamp in the packet? Then the sk_buff is going to be corrupted. The fixed code, if you look at the upstream kernel today, performs additional sanity checks before actually modifying the skb, so it avoids triggering null pointer dereferences in case the Rx descriptor is unreliable.

Now let's look at a couple of other bugs we found. In this one, can you spot the bug? The idea is very similar to what we just saw in igb. Some information is retrieved from DMA-sourced memory, in this case the slot ID; you can tell it's DMA because it goes through the endianness conversion functions. It is then passed to another function, the command-reset-device handler, where that slot ID is used as an array index. That's the bug we just saw trigger the UBSAN array-index-out-of-bounds error. And if you look at the code even further: even if that array index were within bounds, it might point to a slot containing a null pointer, and that vdev pointer is never checked for null. So this could actually be both a null pointer dereference and an index-out-of-bounds error.

Now, let's look at this one as well. Very similar idea: the slot ID is determined from DMA-sourced memory, and the switch that follows is also driven by DMA-sourced memory. So the device actually has quite a bit of control over what code executes within this kernel module, and that slot ID is again just passed down to that function and used without any further validation.

Using this approach, we found nine null pointer dereferences and three array-index-out-of-bounds errors, but we also found infinite loops in the interrupt handlers, and there were two instances of kernel code trying to access user memory ranges.

So, good job, right? We successfully fuzzed these kernel modules and we're done. Unfortunately, the way we discovered which parts of the kernel needed to be fuzzed was all manual. We never really got a good handle on quantifying how well we fuzzed the kernel: what if there was some DMA-sourced input that we missed during the manual source code reading? We had to do better.

The idea was to use a standalone EPT-fault monitoring tool that we developed for this, called DMA monitor. It hooks the kernel's DMA API to track which pages are used for DMA, and removes the EPT permissions from those pages on the fly. By doing that, every time kernel code fetches memory from a page that is DMA accessible, we can trap it and log the instruction pointer, so that we end up with a complete list of locations in the kernel that we can see performing DMA accesses.
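Conceptually, DMA monitor does something like the following sketch. All function names here are hypothetical stand-ins for the real introspection and EPT interfaces, shown only to make the mechanism concrete:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the introspection/EPT interfaces. */
extern void     hook_symbol(void *vm, const char *sym, void (*cb)(void *vm));
extern uint64_t read_alloc_gfn(void *vm);   /* frame just handed out       */
extern void     ept_set_access(void *vm, uint64_t gfn, int rwx);
extern void     on_ept_violation(void *vm,
                                 void (*cb)(void *vm, uint64_t rip,
                                            uint64_t gfn));
extern void     singlestep_once(void *vm);
extern void     log_rip(uint64_t rip);

static void dma_alloc_hit(void *vm)
{
    /* A new DMA buffer was allocated: revoke all EPT permissions on its
     * frame so any later access to it traps to the monitor. */
    ept_set_access(vm, read_alloc_gfn(vm), 0);
}

static void dma_touched(void *vm, uint64_t rip, uint64_t gfn)
{
    /* Kernel code touched DMA-visible memory: record where. Briefly
     * restore access, single-step over the instruction, revoke again. */
    log_rip(rip);
    ept_set_access(vm, gfn, 7 /* rwx */);
    singlestep_once(vm);
    ept_set_access(vm, gfn, 0);
}

void start_dma_monitor(void *vm)
{
    hook_symbol(vm, "dma_alloc_attrs", dma_alloc_hit);
    on_ept_violation(vm, dma_touched);
}
```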
Now let's look at this in practice. We boot up the same Ubuntu 20.04 VM, and on the right-hand side I fire up the DMA monitor tool. As the kernel starts booting, the dma_alloc_attrs function gets hooked, and as the kernel begins accessing DMA memory, the log starts printing all of those accesses. As you can see, a bunch of DMA accesses have already happened, and quite a few pages have been allocated for DMA. We can process this log as the VM is booting: extract the instruction pointers, sort them, take the unique ones, and feed them through addr2line to translate them into source lines in the kernel. That way we have an explicit list of all of the locations within the Linux kernel that perform DMA accesses. This makes it a lot easier to go through the source code review and determine which of these need to be fuzzed than just reading the source itself: it gives us empirical evidence that these are the input points the kernel is using for DMA.

There were still some problems that DMA monitor could not address. For example, sometimes the code that DMA monitor flagged as performing a DMA access would just copy that memory from the DMA-accessible page into some local buffer in private memory, without using it for anything immediately. That data is still tainted, because it comes from DMA; it could potentially be malformed, and whatever code locations eventually ingest it might not expect malformed data. So we would need to verify those locations as well. But how do you fuzz something if you don't know where it is?

For this, we had to develop another tool, created to perform full-VM taint analysis. The idea is to take data that we know comes from DMA and track its propagation as the kernel executes. We record the execution of the kernel using VMTRACE, that is, Intel Processor Trace, and then replay that instruction stream through a taint engine provided by the Triton dynamic binary analysis framework. As the taint engine processes those instructions, it tracks how the tainted data affects the execution of the system, and we can simply look at the instruction pointers that become tainted. Every tainted instruction pointer marks a location where the kernel's execution itself relies on data that is, or can potentially be, influenced by a malicious device. Those are the locations we want to review to determine whether they need to be fuzzed.

Now, here's another demo to show how this works. I know that this VM is paused right where it will perform a DMA access. If I create a VM fork and run DMA monitor on it, we see that a single DMA access is performed while that fork executes. So we know there is a DMA access, but this code just copies that data into some local buffer that will be used by other locations later. To actually perform taint analysis, we first save the state of the CPU we start from; this is required by the taint engine. Then we create a fork and start collecting the processor trace buffer using VMTRACE, which dumps the processor trace buffer into a file. We can unpause the VM, and now, as the VM executes, we're recording the instruction stream using the CPU itself.
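Here is a tiny model of what the taint engine does while replaying that stream. This is not Triton's API; operand tracking is reduced to small integer ids just to illustrate the propagation idea:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded, replayed instruction: which operands it reads and writes,
 * and whether it decides control flow. */
struct insn {
    uint64_t rip;
    int  src[4], dst[4];    /* source/destination operand ids  */
    int  nsrc, ndst;
    bool controls_flow;     /* conditional branch/indirect jump */
};

extern void report_tainted_rip(uint64_t rip);   /* hypothetical logger */

static bool tainted[1 << 16];   /* one taint bit per operand id */

void replay(const struct insn *trace, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const struct insn *in = &trace[i];

        /* Does any source operand derive from the DMA-sourced data? */
        bool t = false;
        for (int s = 0; s < in->nsrc; s++)
            t = t || tainted[in->src[s]];

        /* Propagate taint to all destinations. */
        for (int d = 0; d < in->ndst; d++)
            tainted[in->dst[d]] = t;

        /* If tainted data decides control flow, the instruction pointer
         * itself becomes tainted: that rip is a spot worth reviewing. */
        if (t && in->controls_flow)
            report_tainted_rip(in->rip);
    }
}
```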
We let it run for a couple of seconds and pause it. Then we can decode the buffer into a processor trace log and feed it through the taint engine, with the output going into a file called taint.log. While that is decoding and running through the taint engine, we can already start looking at what it found from the start. We see a memory access being performed, and we see which vCPU registers got tainted. Let's simplify that output a little and just look for the tainted registers. Here you can immediately see when the instruction pointers are tainted. This simplifies source code review, because the taint analysis engine tells us immediately which locations need to be looked at: these are the locations that are, or can be, influenced by DMA-sourced data. Then we can just go read the code, determine whether it needs to be fuzzed, put a harness around it if it does, and start the fuzzer easily.

So thank you, this was my presentation. If you have any questions or comments, please reach out. Thanks go to the many people who helped with this project. And if you want to try it out, the code is up on GitHub: the Kernel Fuzzer for Xen Project and the VM taint tool are both there. So thank you.