Hello everyone. My name is Bandan Das. I'm a virtualization engineer at Red Hat, and today we are going to be talking about the state of fuzzing in QEMU. With me I have Alex, who works at Red Hat Research and is also a PhD student at Boston University, and Alex works on fuzzing. Alex, would you want to give a quick introduction? Sure. I started working on QEMU in 2019 during the Google Summer of Code. Specifically, I was working on building out a framework for fuzzing virtual devices, and I've stuck around since then. I think it's a very exciting topic, and I hope you enjoy our talk. All right, thanks, Alex. So let's move on. Before we talk about fuzzing, this first slide gives a little background on QEMU and how we decided which interface would be a good one to fuzz. It's not that QEMU lacks interesting interfaces; QEMU is a vast emulator with a multitude of interfaces through which guests interact, and this slide looks at which of them would benefit most from fuzzing. As you all know, QEMU is an emulator. It can work with KVM, and it can work with TCG. A big chunk of QEMU is devoted to implementing virtual devices, which enable guest environments to do IO of some sort. These devices can be emulations of real hardware, or they can be para-virtualized devices. If you look at the figure on the right, it shows a very simplified picture: the hypervisor sits on top of the hardware and contains a virtual device layer, and guests run on top of the hypervisor with their own apps inside the sandbox the hypervisor provides. Because there is attack surface lurking somewhere in that virtual device layer, a malicious app that knows about it can try to take advantage of it.
Now, the app itself is malicious, but the guest might not necessarily be. The guest might be an unwilling participant, exposing interfaces that the app can use to escape the sandbox that QEMU has provided. So this shows why the virtual device layer that QEMU exposes is so critical: it needs to be hardened on a continuous basis, and we always need to stay one step ahead of any vulnerabilities that might get exposed in the wild. With that, let's move on to a brief discussion of basic code analysis techniques. There are two kinds of code inspection techniques you can think of. The first is static analysis, where you feed your test program into a special program called a static analyzer, which checks for inconsistencies in syntax or semantics based on a set of rules. The advantage of a static analyzer is that you need few or no changes to your test program, because you are essentially analyzing it offline. It's also a well-known fact that you get a lot of false positives with static analyzers. The way to get around that is to teach your static analyzer specialized rules for detecting certain patterns. For example, your code might be written in such a way that even if a variable seems uninitialized, it is impossible for the code logic to actually use it uninitialized; that is the kind of case where a static analyzer reports a false positive. The other technique, which is where fuzzing falls, is dynamic analysis.
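The uninitialized-variable false positive mentioned above can be sketched in a few lines. This is a toy example (not from the QEMU code base): `val` is only read when `ok` is set, and `ok` is only set after `val` is assigned, so `val` can never actually be used uninitialized, yet a rule-based analyzer that tracks each variable independently may still warn about it.

```c
#include <stdint.h>

/* Returns 1 and stores a result via *out on success, 0 otherwise. */
int lookup(int key, int *out)
{
    int val;        /* deliberately not initialized here */
    int ok = 0;

    if (key > 0) {
        val = key * 2;
        ok = 1;
    }

    if (ok) {       /* val is guaranteed to be assigned on this path */
        *out = val;
        return 1;
    }
    return 0;
}
```

A flow-insensitive analyzer sees a read of `val` that is not dominated by an unconditional assignment and flags it, even though the `ok` guard makes the uninitialized read unreachable.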
With dynamic analysis, you actually change your test program and integrate specialized input paths, which allow the program to be driven through a series of varied inputs; the goal is to cover all the paths the test program can take. As you can already see, this is a more intrusive approach than static analysis. There can be false positives with dynamic analysis too, but suffice it to say that when dynamic analysis flags something, there is usually a real problem somewhere; the real question is whether that code path is actually exploitable. Most of the time, fuzzing will detect a genuine flaw in a specific code path. Static analysis and dynamic analysis are complementary, and it's recommended as good security practice to run both. Some people even recommend feeding the findings from your dynamic analysis tools, like fuzzing, back into your static analyzer, so that you get the best of both worlds. With that, let's talk a little about the history of fuzzing in QEMU. QEMU is not new to fuzzing; there has been work in this area across different parts of the QEMU code base. Examples on this slide include the QCOW2 fuzzer, which was integrated in 2014. Then, around 2015, there were some patches for the MegaSAS virtual device. There is not a lot of information on how this worked, but it was probably a modified version of QEMU with AFL integrated, and it was able to detect some bugs in the PCI BAR space for MegaSAS. More recently, we had work in 2019.
Dima presented on IO device fuzzing using AFL at KVM Forum, where he talked about a setup of AFL, a proxy, and a QEMU test program using QTest to fuzz virtual device interfaces. Around the same time, a Google Summer of Code project was going on in parallel with a different approach, where Alex, who is presenting with me, was researching this problem using libFuzzer, the LLVM-based fuzzing engine. We have some slides later in the talk about those approaches. Based on this past exposure to fuzzing, we identified some missing pieces. One of the most important is that you don't want your fuzzing environment to be limited to a developer's personal workspace. The main benefit of fuzzing comes from making it more available and more generic, so that all developers can use it without major hiccups. You also want to run fuzzing on a continuous basis, more like CI, because QEMU is a very active community with lots of patches coming in every day; that brings up the topic of continuous integration, which is how you actually catch bugs as they arrive in new patches. Another important aspect, closely tied to continuous integration, is where exactly we are going to run these continuous tests. Those are some of the problems that influenced where this QEMU fuzzing work took us. But before we get to the solutions, let's also talk about the challenges, which have probably been mentioned at some point in other discussions as well.
The most obvious and important challenge with fuzzing QEMU is that QEMU is not a simple, general-purpose piece of software that does one specific thing. It is changing state all the time based on guest behavior, and it implements a large number of virtual devices, each with its own quirks and its own implementation. All of this adds to the complexity of the input you have to provide to QEMU, and the fuzzer has to take care of it. The next challenge is the framework. Although there aren't many options when it comes to prebuilt fuzzers, we still need to decide what a good fuzzing framework for QEMU would be. Should we build a custom framework, or should we pick up something that's already out there? The advantage of picking up an existing fuzzer is that you save a lot of time; the disadvantage, of course, is that you now face the challenge of molding that prebuilt fuzzer to work with QEMU. Equally important is the concept of state changes. As I said in the first bullet point, QEMU is constantly changing state, and a fuzzer knows nothing about states. To have a reliable fuzzing run, you have to feed in inputs at some consistent state of the virtual device, or of whatever subset of the virtual machine you are dealing with. These are big problems for fuzzing QEMU, or really any complicated test program, and we are going to cover these specific points in a little more detail. With respect to fuzzing frameworks, we have AFL, and we have libFuzzer, which we discussed briefly in the previous slides.
The advantage of libFuzzer is that it is integrated into LLVM, and Google's OSS-Fuzz runs nicely with it. OSS-Fuzz, as you might already know, is Google's infrastructure where you can submit your project and they run continuous fuzzing on it. The disadvantage, as I mentioned on the previous slide, is that prebuilt fuzzers need some molding before you can use them with your test program. The other option is to use a custom fuzzer: you start from scratch and get better integration, but that obviously has downsides as well. So that is the fuzzing framework problem: what kind of framework should we use? The next challenge is the interface we decided to fuzz: the virtual devices, which we identified in one of the previous slides as an important area to fuzz. For all practical purposes, because we are running general-purpose OSes as guests, virtual devices are more or less identical to real hardware out there. And all devices, as you already know, have different mechanisms for talking to them. The most common is port IO; almost all legacy devices implement some form of it. At the same time, you also have memory-mapped IO, where specific regions of the hardware are mapped into regular memory; you can write to these memory locations to enable some functionality in the hardware. And the way port IO and MMIO are configured is definitely not set in stone.
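As a toy illustration of the two mechanisms just described (the device name, base addresses, and register layout here are invented for the sketch, not taken from QEMU), a virtual device often exposes the same register file through both a port IO window and an MMIO window, with both front ends decoding an offset and routing to a common backend:

```c
#include <stdint.h>

#define TOY_PIO_BASE  0xc000u      /* hypothetical port IO base */
#define TOY_MMIO_BASE 0xfe000000u  /* hypothetical MMIO base    */
#define TOY_NUM_REGS  4

typedef struct {
    uint32_t regs[TOY_NUM_REGS];
} ToyDevice;

/* Common backend: both access paths end up here. */
static void toy_reg_write(ToyDevice *d, unsigned idx, uint32_t val)
{
    if (idx < TOY_NUM_REGS) {
        d->regs[idx] = val;
    }
}

/* Port IO front end: the offset from the IO base selects a register. */
void toy_pio_write(ToyDevice *d, uint16_t port, uint32_t val)
{
    toy_reg_write(d, (port - TOY_PIO_BASE) / 4, val);
}

/* MMIO front end: the offset into the mapped region selects the same register. */
void toy_mmio_write(ToyDevice *d, uint32_t addr, uint32_t val)
{
    toy_reg_write(d, (addr - TOY_MMIO_BASE) / 4, val);
}
```

A fuzzer has to discover (or be told) where these windows live before its writes reach the register decode logic at all, which is part of why the input space is so awkward.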
How they are configured varies from hardware to hardware, and it might also depend on the device driver developers. So there is a lot of non-standardization in how the port IO space and the MMIO space get enabled, and that is something you have to know to fuzz efficiently. On top of that, almost all devices today also have DMA, which lets the CPU save some cycles, use them for something more useful, or even drop into a low-power state while a data transfer is going on. As you might already know, DMA transfers deal with descriptors. These descriptors can be nested; they can be complicated chains of memory locations fed to the DMA controller. That also adds to the complexity of the input space we are dealing with. If you combine everything together, then even for a single virtual device we are talking about a really huge input space, which makes things very complicated. The next problem is the concept of state. As we discussed before, QEMU is changing state all the time based on the events that are happening and on how the guest is behaving. The problem is that fuzzing does not work that way: you always have to get to a specific state before you feed in an input and see what coverage it gives you. If your state changes underneath you, your fuzzing runs won't be reliable. That means a fuzzing run might not make any sense, or the bugs you find might not be repeatable. So maintaining state is a very important topic. How do we maintain state?
QEMU already gives you some existing mechanisms. For example, you could reboot the guest instance you are using for your fuzzing runs. QEMU also provides the concept of snapshots, used for migration; you could reuse those here and integrate your fuzzing infrastructure with the VMState functionality to save and restore snapshots. You could go one step further and try forking, where you fork off fuzzing processes that all start from a common point and then merge their fuzzing results back into the parent. So these are some of the challenges we identified with respect to fuzzing QEMU. With this, I would like to hand it over to Alex, who is going to talk about how we went ahead and looked at these problems in detail and what we did about them in terms of solutions. Alex, do you want to take over? Yes, thank you. It's useful to look at the ways we've been testing QEMU up to fuzzing, because fuzzing is essentially doing the same thing as these normal virtual device tests, but it adds a randomization component. In QEMU, we have some great APIs for unit testing virtual devices, starting with QTest. QTest provides instructions for performing input and output with devices, such as memory writes or port IO writes, and it also lets you control QEMU's clock. It's a great facility, and there are lots of tests written using libqtest. But for more complicated devices you quickly run into a problem, because complex devices require complex initialization and complex protocols.
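Going back to the forking idea mentioned a moment ago, here is a minimal sketch of it (toy code under my own assumptions, not the actual QEMU fuzzer): the parent holds the pristine state, each input runs in a forked child so whatever the input mutates is thrown away on exit, and the parent only learns whether the child crashed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for "apply one fuzz input to the device under test". */
typedef void (*fuzz_one_fn)(const uint8_t *data, size_t len);

/* Hypothetical target used for illustration: crashes on a magic byte. */
void toy_target(const uint8_t *data, size_t len)
{
    if (len > 0 && data[0] == 0x42) {
        abort();
    }
}

/* Returns 1 if the input crashed the child, 0 if it ran cleanly,
 * -1 if fork failed.  The parent's state is untouched either way. */
int run_input_in_child(fuzz_one_fn fuzz_one, const uint8_t *data, size_t len)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {            /* child: a pristine copy of parent state */
        fuzz_one(data, len);
        _exit(0);              /* clean exit: the input did not crash us */
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFSIGNALED(status) ? 1 : 0;  /* e.g. SIGSEGV or SIGABRT */
}
```

The trade-off is speed versus reliability: a fork per input is cheaper than a reboot or a full snapshot restore, but coverage information then has to be shared back to the parent, which is one of the complications a real implementation has to handle.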
So there is a library built on top of QTest called libqos, which gives you some common APIs; it is almost like a testing-specific driver for a bunch of devices. It lets you bypass things like implementing PCI enumeration and mapping PCI base address registers for every single device, because all of this is abstracted away behind a nice API. Through libqos you have access to higher-level APIs for bus access, for allocating space in guest RAM for data that will be transferred over DMA, and for things like controlling the PCI configuration space. So we have these really nice APIs, and it makes sense to leverage them for fuzzing as well. Moving on: the way we built the fuzzing framework within QEMU, we wanted to make it as simple as possible for somebody who is familiar with writing QTest and qos tests to build fuzzing tests as well. The basic API is very similar to a qtest: you provide a name for your test, you provide some method for getting the arguments that need to be passed to QEMU to set up the device you want to fuzz, and then you specify a function that performs the actual fuzzing. If you look at these fuzzing functions, they look very similar to something you would use to test a device. The only difference is that your function accepts a randomized data buffer and the size of that buffer, and the fuzzing target's job is to take that randomized buffer and convert it into IO actions on the device. In this case, we're fuzzing a PCI controller with two registers, the CFG and DATA registers. So you might take the first byte of your input and interpret it as: do I want to read or write a register? And the second byte might be: which of these two registers do I want to access?
And the last set of bytes might indicate what data I want to write to the register, if it's a write. So it's quite similar to a standard test you would write; you just have to guide the test according to some randomized data. The next example is doing basically the same thing; it's more of a demonstration test, but it leverages the qos framework. You can see here that instead of the simple qtest out or qtest in function calls we had in the prior example, you can leverage the libqos bus APIs to configure the device. This really comes in handy when you're testing more sophisticated devices like network controllers or disk controllers. You can go even further. One thing we've spent a lot of time on is building a generic fuzzer that leverages QTest, but instead of requiring the developer to write a function for every single device they wish to fuzz, they can just specify a couple of environment variables: first, the set of arguments to pass to QEMU to set up a virtual device, and second, a set of strings giving the rules used to match the names of the memory regions or objects to fuzz. We spent quite a bit of time building out that fuzzer, and I'm not going to go into too much detail about how it works under the hood, but it can generate some pretty interesting and complicated inputs. For example, the one on the left is a real crash that it found in a network device. You can see how, at the beginning, it performs all these outl instructions: it is automatically doing the PCI setup and mapping the base address registers for this device, and then interacting with the memory-mapped registers, shown in orange.
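The byte-by-byte decoding described a moment ago, where the first byte selects read versus write, the second selects the register, and the remaining bytes supply the value, can be sketched like this (toy code with invented names, not the actual QEMU fuzz target):

```c
#include <stdint.h>
#include <string.h>

enum { REG_CFG = 0, REG_DATA = 1, NUM_REGS = 2 };

typedef struct {
    uint32_t regs[NUM_REGS];
} ToyPciDev;

static void dev_write(ToyPciDev *d, unsigned reg, uint32_t val)
{
    d->regs[reg] = val;
}

static uint32_t dev_read(ToyPciDev *d, unsigned reg)
{
    return d->regs[reg];
}

/* The fuzz target's job: interpret a randomized buffer as one IO action.
 * Returns the value written (for a write) or read (for a read). */
uint32_t fuzz_one_action(ToyPciDev *d, const uint8_t *data, size_t len)
{
    if (len < 2) {
        return 0;                    /* not enough bytes to decode */
    }
    unsigned is_write = data[0] & 1; /* byte 0: read or write?      */
    unsigned reg = data[1] & 1;      /* byte 1: CFG or DATA?        */

    if (is_write) {
        uint32_t val = 0;            /* remaining bytes: the value  */
        memcpy(&val, data + 2, len - 2 > 4 ? 4 : len - 2);
        dev_write(d, reg, val);
        return val;
    }
    return dev_read(d, reg);
}
```

The fuzzing engine never needs to know what the bytes mean; it just mutates the buffer, and the target deterministically maps every buffer to some sequence of device accesses.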
Finally, in the purple part of the crash trace, it is writing to a DMA buffer that will be read by the network device. In this case, the bug was that the address in red is also a DMA address, but instead of being located in some free location in RAM, it happens to overlap the memory-mapped IO space of the device, which you can see if you flip the endianness of those bytes in your head. We built some scripts that convert the resulting crashes into normal qtest test cases, and we also minimize those crashes to remove bytes that are not needed to reproduce them. The crash reproducers are usually small enough that they can be included in an email report to the mailing list, or even inside a commit message. You can see that here in this short movie: as a developer receiving a bug report, all you have to do is copy and paste a command from the email to get our reproducer trace, which you can then look at in GDB or attach some tracing events to. Our project has already been accepted to OSS-Fuzz. This means that QEMU will be fuzzed basically all the time; any new code that gets upstreamed will be fuzzed, and hopefully we can catch new issues before the next release. Using the fuzzing approach and the frameworks we've developed, not only have we been able to find brand-new bugs, we've also built reproducers for bugs that were reported a long time ago but had no reliable reproducer. With a reproducer in hand, it is now a lot easier for the developers to work on fixing those bugs.
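The byte-level minimization just mentioned can be sketched as follows (toy code, not the actual scripts): repeatedly try deleting each byte of the input and keep the deletion whenever the shortened input still triggers the crash predicate.

```c
#include <stdint.h>
#include <string.h>

typedef int (*crashes_fn)(const uint8_t *data, size_t len);

/* Hypothetical crash predicate for illustration: the "device" crashes
 * whenever the bytes 0xde 0xad appear adjacently in the input. */
int toy_crashes(const uint8_t *data, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (data[i] == 0xde && data[i + 1] == 0xad) {
            return 1;
        }
    }
    return 0;
}

/* Minimizes `data` in place; returns the new length.  Loops until a
 * full pass makes no progress. */
size_t minimize(uint8_t *data, size_t len, crashes_fn crashes)
{
    int progress = 1;
    while (progress) {
        progress = 0;
        for (size_t i = 0; i < len; i++) {
            uint8_t removed = data[i];
            memmove(data + i, data + i + 1, len - i - 1);
            if (crashes(data, len - 1)) {
                len--;              /* still crashes: keep the deletion */
                progress = 1;
            } else {
                memmove(data + i + 1, data + i, len - i - 1);
                data[i] = removed;  /* restore the byte */
            }
        }
    }
    return len;
}
```

Each candidate deletion costs one re-run of the target, which is why minimized reproducers for real device crashes can take a while to produce but end up small enough to paste into an email.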
We've also found some bugs that uncovered deep problems in some of QEMU's architectural decisions, such as the way devices access data over DMA, and in the memory access API itself, with some problems around alignment and memory access sizes. So far, we've reported over 50 bugs on Launchpad, and six bugs have been assigned CVE IDs. We're still very much trying to work out a process for reporting these bugs in the future, because fuzzing on OSS-Fuzz is still quite a new concept for QEMU, and we're actively looking for input on the best ways to get reports to a developer who can fix them as smoothly as possible. Looking ahead, there is so much I'd love to talk about, and a lot of places this work can go. To mention a few: devices are often hooked up to complicated back ends, such as Spice, VNC, and SLIRP, and we want to be able to fuzz those as well. We also want to fuzz migration code, so savevm/loadvm, VMState descriptors, and reboots. We still don't have a great way to reproduce crashes that require thousands and thousands of interactions, because libFuzzer caps each input it tries at 4K or 64K bytes. We can also talk about fuzzing devices that rely on components in the kernel, such as vhost, and some devices are moving out of the QEMU process with things like multi-process QEMU and vhost-user. Overall, there is still a lot of work to be done, and if anybody is interested in any of these topics, I'd love to talk about it. So Alex, how would you suggest somebody who is not very familiar with the fuzzing infrastructure in QEMU get started, and possibly even contribute to these topics? Right.
Awesome question, Bandan. You can see our contact details here. First of all, we're always happy to talk about these bugs. If you want to take a look on your own, we have documentation for the fuzzer, and it should be pretty easy to find. Really, all you need to get started with building the fuzzer and writing a new fuzz target is a reasonably recent version of clang. There are some very simple examples, as I showed earlier in the presentation. And as I said, we're really happy to talk to anybody who is interested in this. I want to give a huge thanks to everybody who helps with reviewing the code and working on the bugs we've reported. And with that, if anybody has any questions, we'd be happy to take them.