I'm Chris Gardner, and this is "Chatting with Your Programs to Find Vulnerabilities," where we're going to learn how to become therapists for our programs. To start off, I'll introduce myself. I recently graduated from UMBC, which is where I did this research, and while I was there I took a couple of classes on machine learning, which in some organizations makes me a data science expert. I've also done a lot of reverse engineering over the years and competed in many capture-the-flag competitions, so I know a little bit about vulnerability research.

And speaking of vulnerability research, it's pretty expensive to do these days. Experts in the field are in high demand, cost a lot, and take a lot of time to do their jobs. I think it's high time we replace people like me with a computer program, and DARPA agrees with me. A couple of years ago DARPA ran the Cyber Grand Challenge, or CGC, here at DEF CON, which challenged seven teams from industry and academia to create fully autonomous cyber reasoning systems that find, exploit, and patch bugs in computer programs. As part of this, DARPA created DECREE, a special OS designed to be easy to model and analyze. In DECREE you don't have to worry about networking, file I/O, thread scheduling, or many of the other hard problems in computer science and software engineering. Since there are no environmental factors, each program written for DECREE is deterministic, which makes DECREE a good target for research projects.

So DARPA ran the Cyber Grand Challenge, but we're not out of a job yet. The winning cyber reasoning system went on to play against the humans in the DEF CON CTF, and while it didn't crash and burn horribly, it did get last place. Additionally, during the competition only a fraction of the programs actually had exploits written for them. So we can do a little bit better; there's space for new research in the field.

If you want to build an automated vulnerability-finding machine, you basically have one option in terms of overall architecture: use a tool called a fuzzer to find crashes in a program, then use a technique known as symbolic execution to solve for an exploit. Symbolic execution is somewhat of a solved problem from a research perspective, but there's a lot of room at the fuzzer level to apply cool AI tricks.

So let's talk about fuzzing. If you're not familiar, fuzzing takes an initial corpus of test cases, which are inputs that cause some program to do something, and randomly mutates them in the hope of finding a bug. Many fuzzers, called coverage-guided fuzzers, measure the code coverage of each mutation to make sure different parts of the code get tested, which takes a little bit of the variance out of the process. One popular fuzzer, American Fuzzy Lop (AFL), actually uses some AI to do this: it implements genetic algorithms to evolve the inputs, using code coverage as the fitness function. Fuzzers, and AFL in particular, work really, really well. In fact, they work so well that combining AFL with more advanced techniques, such as involving symbolic execution in the fuzzing process, only marginally increases the number of crashes found.

But unfortunately, fuzzers aren't quite perfect. Coverage-guided fuzzers only measure how much of the code is tested; they cannot reason at all about what the code actually is. They rely on random chance and heuristics to boost the amount of code tested. Take a case like the program below, where I ask you to enter 123456 to trigger the buggy code.
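The real CGC challenges aren't this simple (and they're DECREE binaries, not Python), but a minimal sketch of the kind of check I mean looks something like this:

```python
import sys

# Toy example: the "correct" value is sitting right there in the program,
# and is even printed to the user, but a coverage-guided fuzzer still has
# to guess all six bytes at once to reach the buggy path.
SECRET = b"123456"

def trigger_bug():
    # Stand-in for the vulnerable code we want the fuzzer to reach.
    raise RuntimeError("boom")

def main():
    print("Enter the code (hint: it's 123456):")
    guess = sys.stdin.buffer.readline().strip()
    if guess == SECRET:
        trigger_bug()
    else:
        print("Wrong code.")

if __name__ == "__main__":
    main()
```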
The fuzzer breaks down on a program like this. Even though the correct value is right there in the program, and is even printed out to whoever is using it, the fuzzer cannot make use of that information. To trigger the buggy path, the fuzzer basically has to guess 48 bits of information, which is not going to happen on any reasonable timescale.

This brings us to the limitations of fuzzers. Fuzzers are good at making small changes to find interesting execution paths, but they tend to be bad at making big, coherent changes. The way you get around this is by providing the initial test case corpus. If you're fuzzing a PDF reader, the fuzzer isn't going to be able to create a valid PDF file from nothing most of the time, so you should give it some valid PDF files to start off with. In a sense, the effectiveness of a fuzzer is a function of the quality of its test case corpus, and a good corpus includes inputs that exercise the different parts of the program. For an image processing library, good test cases would be a PNG file, a JPEG file, and maybe a TIFF or some other weird format. Generally, test cases are easy for unskilled humans to create but hard for machines to create automatically. But we can do something about that.

People have thought of a few different ways to generate test cases automatically; here are three of them.

The first method is to use symbolic execution, which I mentioned earlier, to aid the fuzzing process. If the fuzzer ends up at a junction in the program, with two paths, and the mutation engine can't get the program to go down one specific path, it can invoke a symbolic execution engine to solve for the input required to take that path. This has the advantage that it always works. However, it's very, very slow: we're talking execution times on the order of seconds, whereas AFL mutations run in microseconds, and this needs to happen thousands or millions of times during one fuzzing cycle. One example of this kind of fuzzer is Driller, by the Shellphish team, which competed in the CGC, but many of the other CGC teams also used this technique.

Another option is to cheat and just add humans to the equation. A bunch of the Shellphish folks developed a system where they had random people on Amazon Mechanical Turk interact with the programs, then used those interactions as the initial test case corpus. This actually worked really, really well; it worked almost better than using symbolic execution in the fuzzing process. But it is cheating, because we're using humans, even if it's a little more automatic since Mechanical Turk is pretty fast at finding workers. It's still not quite automatic, and the humans do take some time, so you're looking at runtimes on the order of seconds or minutes. And while the humans didn't find any bugs themselves in that paper, they were able to create great test cases, and AFL was able to take those and find bugs.

Lastly, we can try to apply AI to this problem. There have been a few applications of AI to fuzzing, namely the genetic algorithms in AFL that I mentioned, which have been very successful. There's also a paper called Deep Reinforcement Fuzzing, which outlines a method to use reinforcement learning instead of genetic algorithms in fuzzing.
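Before moving on to our idea, let me make that first option a little more concrete. When the mutation engine can't get past a branch, the idea is to hand the program to a symbolic execution engine and ask it to solve for an input that reaches the other side. A rough sketch of that step with an off-the-shelf engine like angr might look like the following; this is my own illustration, not Driller's actual code, and the binary path and target address are placeholders:

```python
import angr

# Placeholder binary and branch target; in a real hybrid fuzzer these come
# from the fuzzer's coverage map (a branch side it keeps failing to reach).
BINARY = "./challenge_binary"
TARGET_ADDR = 0x400b00  # hypothetical address of the un-reached code

proj = angr.Project(BINARY, auto_load_libs=False)
state = proj.factory.entry_state()
simgr = proj.factory.simulation_manager(state)

# Explore until some path reaches the target basic block.
simgr.explore(find=TARGET_ADDR)

if simgr.found:
    # Concretize the stdin bytes that drive execution to the target;
    # this becomes a new seed for the fuzzer to mutate further.
    new_seed = simgr.found[0].posix.dumps(0)
    print(new_seed)
```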
That reinforcement learning method is still pretty young, and I don't think they released any code from the paper, but early results seem to outperform AFL by about 10%, which is definitely worth the effort.

And lastly, there's our idea, which approaches the problem in a different way than these two methods. We have a bunch of conversations between unskilled humans and these programs; it sounds like we have data. Many of the programs from the Cyber Grand Challenge are console programs meant to be used by humans; that sounds like a good application for a chatbot. And that's what we did: we trained a chatbot on the results of the human-assisted CRS (HaCRS) paper that the Shellphish folks wrote, then ran it against new programs to produce test cases. A big shout-out to Zardus from Shellphish for giving us the data and making this research possible.

Despite the opinions of my machine learning professor, I'm not actually good at machine learning, so instead of writing my own chatbot implementation I pulled one off of GitHub. It's a pretty standard chatbot: recurrent neural network, sequence-to-sequence, buzzword, buzzword. I modified it to work at the byte level instead of the word level, discarding all the optimizations you can do with words, and let it train. And speaking of training, since no one likes to fund undergrad research projects, we only trained the network for about five hours, on a CPU, not even a GPU. For the data, we used about 3,000 test cases from the HaCRS paper, which corresponded to about 50,000 question-response pairs, so 50,000 samples. For the test set, there were 200 Cyber Grand Challenge programs that we had data for; we excluded six of those from the training set, and those six were our test set.

The way we tested it is that we measured the code coverage of the generated test cases and compared it with the code coverage of the HaCRS test cases. Code coverage is calculated from basic blocks: a basic block is one uninterrupted sequence of assembly instructions with no jumps. You take the number of basic blocks hit, measured using instrumentation, and divide it by the total number of basic blocks in the program. I decided that if a test case has higher code coverage, it's a higher quality test case. That's not the best way to measure these things, but it was a quick and dirty method that ran really fast, so I did it. Specifically, we generated 10 different test cases for each program, took the average code coverage over those test cases as well as the union code coverage of all 10, and compared those numbers against the HaCRS test cases. We had about 30 test cases per binary to train on.

As for the results: as expected, we didn't do quite as well as the humans, but we only underperformed them by a single percentage point in some cases, and the worst we ever did was only 10% under the humans, which is pretty great. And we way outperformed throwing random data at the programs, or using the literal word "fuzz" as your initial seed for everything, which some teams did during the CGC. Thankfully, this system is also really, really fast. Even though it's not quite as good as humans or symbolic execution, it runs in milliseconds, compared to seconds for symbolic execution or humans. And the big bottleneck there isn't even the neural network; it's just communicating with the target program.
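For reference, the coverage comparison itself is nothing fancy. Assuming you have some instrumentation that tells you which basic blocks a test case hits, it boils down to something like the sketch below; trace_fn here stands in for whatever tracing harness you use, it's not a specific tool.

```python
def evaluate_coverage(test_cases, trace_fn, total_blocks):
    """Average and union basic-block coverage for one program's test cases.

    trace_fn(test_case) runs the target under instrumentation and returns
    the set of basic-block addresses that were hit; it is a placeholder
    for whatever tracing harness you have.
    """
    per_case = [trace_fn(tc) for tc in test_cases]               # one set of blocks per test case
    avg_coverage = (sum(len(hits) for hits in per_case)
                    / len(per_case) / total_blocks)              # mean fraction of blocks hit
    union_coverage = len(set().union(*per_case)) / total_blocks  # blocks hit by any test case
    return avg_coverage, union_coverage

# Usage: generate 10 test cases per program with the chatbot, compute these
# two numbers, and compare them against the same numbers for the HaCRS seeds.
```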
And the network wasn't just throwing garbage at the programs and hitting error cases. It was actually navigating the menus of these applications and responding with sometimes invalid, but a lot of the time mostly valid, data. It wasn't perfect, so sometimes it only got halfway there. For example, here's one of the programs we tested. The program asks for a number between 100 and 1000, and the network responded with a number, just not one in that range: it answered 040, with a leading zero. But it's important to note that that's only one bit away from a correct response: flip a single bit and the leading zero turns into a one or a two, and the fuzzer should easily be able to make that change and take it from there. The network is then presented with a menu, and it actually picks a valid option from the choices. I'm showing this because it demonstrates that the approach can recognize and navigate the basic menus that appear in many of the CGC programs.

The approach does have some limitations, though, and the big one is that it only works on CGC programs, and most of the CGC programs are toys. It doesn't work as well in the real world because the CGC programs are a little contrived compared to real-world programs. However, it might be possible to apply this to other human-interactable programs, such as GUI programs, web forms, or other simple console programs. And this research isn't going to make any waves in file-format fuzzing, which is where a lot of the new fuzzing research is focused; after all, AFL is mainly a file-format fuzzer. Since this technique relies on output from the program, and in most file-format fuzzing situations the program doesn't produce output before it does any major parsing, it just doesn't apply there.

There are a few ways this research could be extended, though. Mainly, we could train for longer and add more data. Like I said, we trained for five hours on a CPU, which is barely anything in machine learning time, and we were only able to acquire a small subset of the data from the HaCRS paper, so it might help to get more. We could also try training on different data: it might be possible to train on exploits or crashes instead of just test cases, and then generate crashes directly. But the biggest way to move this research forward would be to combine this technique with a symbolic-execution-assisted fuzzer like Driller, giving it another fast option to try when the mutation engine gets stuck. Instead of immediately diving into a long symbolic execution run, the fuzzer could first try to approximate it with machine learning. That would let Driller focus on solving the hard constraints, like a hard-coded password, rather than wasting time navigating simple menus.
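To sketch what that combination might look like: when the mutator stalls, try the cheap neural guess first, and only fall back to symbolic execution if the guess doesn't buy you anything. The helpers here (query_chatbot, gains_new_coverage, symbolically_solve) are placeholders for the RNN chatbot, the fuzzer's coverage feedback, and a Driller-style symbolic execution call; this is a sketch of the idea, not Driller's actual interface.

```python
def get_unstuck(program_output, stuck_branch,
                query_chatbot, gains_new_coverage, symbolically_solve):
    """Cheap-first strategy for a stuck fuzzer.

    All three callables are placeholders: query_chatbot is the RNN chatbot,
    gains_new_coverage asks the fuzzer whether an input reaches new code,
    and symbolically_solve is a Driller-style symbolic execution fallback.
    """
    # Fast path (milliseconds): ask the network what a human might type
    # in response to what the program just printed.
    candidate = query_chatbot(program_output)
    if gains_new_coverage(candidate):
        return candidate

    # Slow path (seconds): fall back to symbolic execution for the hard
    # constraints, like a hard-coded password, that the network can't guess.
    return symbolically_solve(stuck_branch)
```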
So to wrap this up, here's a quick summary of what I told you. Vulnerability research is expensive, takes a really long time, and it's high time we work on automating it. Cyber reasoning systems are one approach to that: they automatically find, exploit, and patch bugs, generally using a combination of fuzzing and symbolic execution. Fuzzer effectiveness is often limited by the quality of the initial test case corpus, which machines are currently bad at producing automatically. We can use symbolic execution to make test cases slowly, or we can do it really fast with a recurrent neural network chatbot trained on human interactions, and the chatbot works pretty well, way better than just seeding with the word "fuzz." And finally, we did pretty well with just a few hours of training on a CPU; it would probably work a lot better with a GPU, more time, or more data.

Alright, thank you all so much for listening; that's the end of my talk. My contact info is listed, and if we don't have time for questions now, please feel free to reach out to me on either of those; I'm very responsive. Questions, comments, complaints, private party invites, you know what to do. Alright, do I have time for questions? Any questions? No? Okay. Alright, thank you all so much for listening.