First, we're not going to let you deal with a hangover very well right now. We're going to start with Joe, who is going to talk about fuzzing. Good stuff. Let's give our first speaker a big round of applause.

Thank you. My name is Joe Rozner, and this is retargetable, grammar-based test case generation. I want to start off with a story about how we got to where we are today, and ultimately the lesson of the story is that testing parsers is really hard. About four years ago, at my current company, I started implementing a SQL runtime and parser for a handful of different dialects of SQL. The goal of this was to evaluate SQL queries and determine whether or not they were dangerous. In doing this, we built out mostly black-box implementations of the vast majority of the SQL dialects out there, and all told it was something like 35,000 lines of ANTLR definition. This is grammar definition: you define what makes a language, you pass it into a program, and it generates a parser that can recognize that language. Most of this was done by looking at poorly written documentation. It's not documentation about how the system itself works, but documentation for a developer trying to learn how to write SQL queries and what's valid. Typically it's incomplete; it doesn't have the information you need, and you're fumbling around trying to figure out just what is correct and what makes it correct. Things will be contradictory, and you never really know what the answer is. And in a lot of cases these are closed-source systems (DB2, Oracle, MS SQL), so you can't just go to the source code for them. Now, we used ANTLR, but with most of these tools, like Bison or Yacc, you will get an LR-style parser or a recursive descent parser, which is vastly different from what ANTLR uses. ANTLR uses an ALL(*) parser, which is a top-down parser.
And so the way the actual algorithm works, the way you recognize a language from the grammar, is going to be different. That leads to all sorts of weird cases where you might have two grammars that are identical in terms of definition, but they're going to be recognized differently because of the algorithm being used. And most of the time these lack public test cases: because they're closed-source software, the vendors don't care whether you can verify that what they did is correct. They just provide their software and wait for you to complain, and then they fix it, maybe. So this issue we ran into is really what gave birth to this idea and where it came from, but the security use cases for it are what vastly expanded the scope and further drove the usefulness that I've found. So there are a few different problem areas in parsing when you're trying to verify whether your parser is correct, and we're going to talk about three of them here. The first is overfit and underfit implementations. The goal of a parser is to recognize a language, and the question of overfit and underfit is: am I correctly accepting and correctly rejecting sentences that are part of that language? So if the reference implementation accepts something that mine doesn't, I'm underfit: I do not recognize the same language that the reference implementation does. And if it doesn't recognize something that I do, then I'm overfit, and I still don't recognize the same language that it does.
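To make those two failure directions concrete, here's a toy differential test in Rust between a "reference" recognizer and a deliberately buggy re-implementation. The toy language and all names here are invented for illustration; real harnesses do the same classification with full parsers on both sides.

```rust
// Reference language: one or more ASCII digits.
fn reference(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())
}

// Buggy re-implementation: accepts an optional leading '+' (overfit: extra
// syntax the reference rejects) and caps length at 3 digits (underfit: it
// rejects sentences the reference accepts).
fn mine(s: &str) -> bool {
    let t = s.strip_prefix('+').unwrap_or(s);
    !t.is_empty() && t.len() <= 3 && t.chars().all(|c| c.is_ascii_digit())
}

// Differential classification of a single generated sentence.
fn classify(s: &str) -> &'static str {
    match (reference(s), mine(s)) {
        (true, false) => "underfit: reference accepts, mine rejects",
        (false, true) => "overfit: mine accepts, reference rejects",
        _ => "agree",
    }
}

fn main() {
    assert_eq!(classify("1234"), "underfit: reference accepts, mine rejects");
    assert_eq!(classify("+12"), "overfit: mine accepts, reference rejects");
    assert_eq!(classify("42"), "agree");
    println!("differential check done");
}
```

Generating lots of sentences and running each through a classification like this is the core loop of the test harnesses discussed later in the talk.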
And this is a problem that is exacerbated by poor documentation and by not having a firm understanding of what is correct and of the internals of how the parser works, because without understanding exactly what algorithm it's going through and what it's using to determine whether or not a sentence is correct, it's really hard to replicate that same behavior and get to the point where you have something that represents the same language. So without public grammars, and without knowing about the algorithm, you're limited to reverse engineering and experimentation with the reference: trying things out, trying to discover what sort of behavior it's using, basically trial and error until you get to the point where you really understand, okay, this is how I need to define this section of the language, and go on from there. As I've already said, there are differences in parsing algorithms. If you've ever tried to reverse engineer one of these generated parsers, it's basically impossible. It's a giant state machine built out of go-tos, essentially, with state that you're constantly mutating, and it's a really bad time; you shouldn't do it. And these different parsing algorithms typically have different behavior for a handful of properties. Ambiguity: are two different production rules going to be interpreted differently? Left recursion: are you going to continually go and repeat a specific rule? Precedence: how are you expressing precedence for operators? These are all things that every single parser generator you use is going to handle differently, and so taking a grammar from one to another is going to behave very differently in all of these areas. And the worst part about all of this is that universally proving two context-free grammars are identical is undecidable.
This is a problem that, for the general case, is not decidable: you cannot write an algorithm that does this thing. Now, we can slim the problem down, take a subset of it, and maybe decide that, but in the general case this is not something we can do. The next problem you worry about is tree generation flaws. Once you've recognized a sentence in a language, typically your goal is to create a parse tree or an abstract syntax tree. Basically, what this does is show the syntactic relationship between the various components, or tokens, of that sentence. It has some kind of root node, which is the top point of the tree, and going down from there it provides context for what a specific token is and how it relates to every other token in that sentence. The whole process of building trees and working with trees is super complicated, and it varies as you move from generator to generator and runtime to runtime. Most parsers will require manual tree construction. This is something that has gotten better recently in newer tools like ANTLR, but typically you'll have a production rule that it will try to recognize, and when it does, you'll then run a block of code, and that code is responsible for building up a node and inserting it into the rest of the nodes at the right point to build that representation. This is typically the point where, if you recognize a number, you're going to parse it into a numeric representation for the language you're in. So the number 1234 gets read as decimal and converted into an integer in the language you're building this in. We'll come to issues with that specifically shortly. This is an example of something like this.
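As a plain-Rust sketch of the manual tree construction just described (independent of the slide example that follows; the node type and action names are made up for illustration), a matched rule runs an action that builds a node and converts matched text, and the conversion step is exactly where tree-generation flaws can creep in:

```rust
// A minimal sketch of the kind of tree-building action a parser generator
// runs when a production matches. Not tied to any particular tool.

#[derive(Debug, PartialEq)]
enum Node {
    Num(i64),                  // "1234" read as decimal and converted
    Add(Box<Node>, Box<Node>), // syntactic relationship between tokens
}

// The action a rule like `num: [0-9]+` might run once the token is matched.
// Note the conversion itself can fail or overflow for very long digit runs;
// that is exactly where tree-generation flaws creep in.
fn num_action(text: &str) -> Node {
    Node::Num(text.parse::<i64>().expect("matched digits should parse"))
}

// The action for a rule like `expr: num "+" expr`, inserting the children
// at the right point in the growing tree.
fn add_action(lhs: Node, rhs: Node) -> Node {
    Node::Add(Box::new(lhs), Box::new(rhs))
}

fn main() {
    let tree = add_action(num_action("1234"), num_action("5"));
    println!("{:?}", tree); // Add(Num(1234), Num(5))
}
```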
This is LALRPOP, which is a parser generator for Rust. Basically, on the left-hand side of the equals arrow you have the different production rules, so this is what it's trying to match and recognize, and on the right side you have the action that's being called. Sorry about that; I'm trying to juggle two laptops so I can get speaker notes, and it's not going so well. So these are all production rules, and each one of these is an alternate, so it's going to try to match one of them. Everything on the left of the arrow is a rule; everything on the right is an action. Once it recognizes the rule, it calls the action, and that is what builds up the tree as it goes, down or up depending on the algorithm being used and whether it's top down or bottom up. The third problem is unsafe and incorrect use of validated input. Once you've recognized a sentence and you think you understand what it's doing, what are you doing with it? Recognition typically isn't the only step; it's typically the first step, and there are many more steps you will take to use that information. Most parsers will say, yes, this is a sentence, but don't necessarily say, yes, this is safe, or the information here is okay to use. The only goal of a parser is to ensure that a sentence is a member of the language. So once you've recognized it, you have numbers, you have strings. What do you do with those strings later? Essentially, what this comes down to is: if you can get past the parser with something that is syntactically valid but still dangerous, then maybe down the line it gets copied to a buffer somewhere, and it's back to any other kind of typical vulnerability exploitation at that point: smashing the stack, logic flaws. It's that simple. Your goal is to get past that point; it's the gatekeeper.
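Here's a minimal sketch of that failure mode: a toy protocol whose parser happily accepts a syntactically valid sentence that is hostile to the code behind it. The "LOAD" language and all names are invented for illustration.

```rust
// The parser only answers "is this a sentence in the language?".
// Toy grammar: `LOAD <len> <payload>` where <len> is decimal digits.
fn is_valid_sentence(input: &str) -> bool {
    let mut parts = input.splitn(3, ' ');
    parts.next() == Some("LOAD")
        && parts
            .next()
            .map_or(false, |n| !n.is_empty() && n.chars().all(|c| c.is_ascii_digit()))
        && parts.next().is_some()
}

// Downstream code that trusts the parser: it wants to copy `len` bytes of
// payload into a buffer. The parser never promised len <= payload length.
// (Assumes input already passed is_valid_sentence, as gatekeeper code does.)
fn use_validated(input: &str) -> Result<Vec<u8>, &'static str> {
    let mut parts = input.splitn(3, ' ');
    parts.next(); // "LOAD"
    let len: usize = parts.next().unwrap().parse().map_err(|_| "bad len")?;
    let payload = parts.next().unwrap().as_bytes();
    if len > payload.len() {
        // In C this check is the one that gets forgotten, and the copy
        // reads or writes past the buffer.
        return Err("len exceeds payload");
    }
    Ok(payload[..len].to_vec())
}

fn main() {
    let s = "LOAD 4096 hi"; // syntactically valid, semantically hostile
    assert!(is_valid_sentence(s));
    println!("{:?}", use_validated(s)); // the parser was never the safety check
}
```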
One reason this is especially bad is that defining grammars is really hard when you have complex languages. You'll have places with opaque tokens, where you say, I need to read in a character set, for instance like in a regex, but I'm just going to read it in as a string, anything between two square brackets, and later on I'm going to figure out what it actually does. This is super common; you'll see it all over real parsers. So you might recognize something as a sentence in the language, but for this small subsection of it you're not going to understand what it's actually doing, or whether it has some bad value in it. These opaque tokens are kind of rough, and this ultimately comes down to parsers not really having a lot of semantic information about the language they're recognizing. There's just not really a good solution to this in many cases, because of the tooling and because of the complexity of the languages. And if you subscribe to the LangSec concept, you know we should be working to reduce the complexity and make this an easier problem so that some of these issues can go away. So how do we test this? It's great that we've identified the problem areas, but how do we do better?
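Before getting to testing, here's what an opaque token looks like concretely, sketched in Rust: the lexer slurps everything between square brackets as a raw string and defers any judgement about what it means. The function name and toy regex fragment are invented for illustration.

```rust
// An "opaque token" lexer rule: grab the raw body of a regex character
// class and leave interpretation for a later phase (or never).
// Returns (opaque class body, rest of input) on a match.
fn lex_char_class(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix('[')?; // must start with '['
    let end = rest.find(']')?;           // slurp up to the matching ']'
    Some((&rest[..end], &rest[end + 1..]))
}

fn main() {
    // The sentence is "recognized", but the parser has no idea whether the
    // class body is meaningful; e.g. the reversed range [z-a] passes fine.
    let (class, rest) = lex_char_class("[z-a]+").unwrap();
    assert_eq!(class, "z-a"); // accepted opaquely; validity decided later
    assert_eq!(rest, "+");
    println!("opaque body: {:?}", class);
}
```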
How do we ensure that the parser we're building is going to be a faithful representation of another reference implementation? Typically it comes down to just getting more test cases, getting to the point where you feel comfortable that your parser recognizes the same set of sentences as another. And to do this you need a shitload of sentences, a massive amount. Having 100, 1,000, 10,000 just really isn't enough, especially for large languages, so writing them by hand really isn't a viable option. There's no way you're going to have enough variety and enough targeted sentences to cover it; it's just not realistic unless you have people cranking these out all day for hours. Something we ended up doing was crawling the web, looking at Stack Overflow and pulling out really bad examples, going to open source projects and grabbing them out of there. This is great because you can get a lot, up until a certain point, but you have no guarantee that you're exercising different areas. You don't really know what your breadth is or how deep you're going, and there aren't really good code coverage tools for this, so you're kind of just assuming you have all this coverage, but at the end of the day you don't really know. And even within a specific tree, just variation in the individual values for tokens may have significant impact depending on what the application is doing with that value afterward. So the need to generate test cases at a much higher rate is ultimately what I arrived at, and the way to do this typically is using a fuzzer. There are a few different styles of fuzzing that are pretty common today, and all of them had issues with the style of testing we were trying to accomplish. So I'm going to walk through why that is, what these types of fuzzing are good for, and how we can get to a point where we have something that is more targeted for this and
maybe be used for these more traditional types of fuzzing, to augment them and make them more effective in what they do. One of the most common ones you'll see today is AFL; this is the hot fuzzer right now. Basically, what it does is focus on path exploration and code coverage to make sure you're targeting as much of the application as possible. It isn't aware of any semantics or syntax. It's implemented not in a dumb, bad way, but it's dumb in terms of how it tries to explore. It instruments the application and watches as it goes through it. It will take some kind of seed value and mutate it, and it determines whether or not that mutation allows it to explore new areas. If it does, it saves it and continues down that path; if it doesn't, it throws it away as something that is not interesting. Then it repeats this process billions of times, over and over, until you stop it, basically, and bugs fall out. This can take a lot of time, because it's using random mutation and it's not very targeted; its whole goal is just to explore, so you might spend a lot of time in uninteresting code paths. There are some ways to help guide it and cut off certain areas in some of these types of fuzzers, but for the most part it's a very random process. And in the case of the specific types of tests we want to evaluate, whether two parsers are implemented the same, it's not immediately clear how to build a really good test harness using something like AFL to do that. The way it determines whether or not something is interesting is by hooking the SIGABRT and SIGKILL signals and then waiting for the program to essentially crash or terminate. So for the semantic information we want, about whether something is underfit or overfit, or why this test case succeeded or failed and how it compares to this other thing, it's not really immediately clear how you would encode that
information into something like a test harness for this. There's a great blog post showing some of the cases where AFL actually can be used for tasks like this: the link at the bottom here walks through taking the string "hello" and passing mutations of it into an image viewer, basically getting from the string "hello" to valid JPEGs. I have a few quotes from this that I want to share, showing where this approach is and isn't helpful for this type of task. The first image hit after about six hours on an eight-core system, and this is actually pretty good and pretty surprising. Going from nothing but the string "hello" to something that is a syntactically valid JPEG is pretty good. I mean, this is not crazy hardware; it's something that probably anyone in this room has access to, and that any real adversary trying to find bugs will have no problem getting access to. But certain types of automatically executed checks with a large search space may pose an insurmountable obstacle to the fuzzer, and this is ultimately the real crux of the problem. If you have something like this in your application, where you're comparing some section of a sentence against a very specific value, a tool like AFL is going to do a really poor job of figuring out what it takes to get past that check, because it's using random mutation: it doesn't know what it needs to do to get to the string "hacked by pigs" here. Other types of fuzzers do a really good job of this, and we'll talk about those in a second, but this is a super common pattern you're going to see in cases like shotgun parsers and binary file formats. Things like checksums, magic numbers, and offsets into files are all things you don't want to spend a lot of time exercising, because you typically want to get past them to the point where something interesting will happen. I mean, there are cases where just the code here might
be interesting, but you really want to be able to get past that. In practical terms, this means that afl-fuzz won't have as much luck inventing PNG files or non-trivial HTML documents from scratch, and the reason is that these are very syntactic, complex file formats. There are aspects where things need to be in the right place, in the right order, and simply moving things around won't produce usable input. This is one of the areas I'm really hoping to improve by releasing this code today. The next type of fuzzing we're talking about is instrumentation and solving, and if you followed any of the CGC (Cyber Grand Challenge) stuff, a lot of the solutions built for that used this approach as part of their solution. Like AFL, these also focus on path exploration and code coverage, but they do so in a different way. Rather than trying random mutations, they convert the problem into something they can execute some amount of, and when they get to a point where there's some kind of input, something non-deterministic, they convert it over to a symbolic value. Then they use something like a satisfiability solver to identify what it will take to move past that branching point: generate me a value that will take each path and allow me to explore it. This is how you would solve the problem of very specific strings, or checksums, or anything like that where you need to know how to get past a step: the solver will solve that constraint for you and generate an input that satisfies it. This still doesn't care about syntax or semantics; it has no real concept of what those are, or what the language even is, for that matter. It's just trying to figure out how to get deeper into the program, or to a point that you want in the program. Even these are still not typically easy to build test harnesses for; a lot of this is still automated, but you could probably hack in a lot of more
interesting stuff into it if you wanted. Great examples of this are KLEE, which is open source, and the Mayhem paper, which is really great and you should go read it if you haven't. The last big type we'll be talking about today is grammar-based fuzzing. This isn't a new concept; it's something that has existed, and there are a handful of projects out there that do it, but there's a whole bunch of problems with the approaches that have been built, in terms of what we were looking for. What these do is take a grammar file and generate input that is syntactically correct based on that grammar. Some of the problems are that they typically provide their own grammar language, so when you define a grammar, you have to write it in their language. If you have ten or fifteen thousand lines of grammar for a specific language, do you really want to go translate all of that definition over to another language? You're doing it by hand, typically, which means you might make mistakes, you might define something that's not the same, and the semantics of the original language may not translate very well to the new language for this fuzzer. These are mostly targeted toward regular and context-free languages, and this is fine; a lot of languages do fall into those categories. But for more complex languages, things like really complex binary file formats and context-sensitive languages, these are just not applicable and not going to help you. They're typically not byte oriented, so you're kind of limited to text-based languages, which limits what you can do with them and the use cases you can have. Mozilla's published Dharma, which is their tool; I believe it's primarily used for testing recognition of things like CSS, HTML, and JavaScript and ensuring that things don't break as they test their browser. But these are the main types you'll see, and we're going to expand on the grammar-based one shortly. And last,
I've been talking a lot about traditional test case generation and how you will actually go about testing and using what you generate as your input. As I've already said, there's a lot of inflexibility in designing complex test cases and using them in a meaningful way to compare two different implementations, or to ask questions about how closely one matches another or what's being done with the input. You're kind of limited to "does the application crash" and identifying a few different things within it. You also don't get a lot of flexibility in defining a feedback loop. This falls into the prior discussion: you can't really feed in much more than "did this explore more" or "did this do something interesting," so having more flexibility around that is something that's really nice. There are many tools out there that do some parts of this, but we can do better. Ideally we want something more flexible, something where we can really program our own test harnesses and design a test that does whatever we want for any case we might have. We want to be able to use the grammars we've already written: why should we have to translate these, especially by hand, to something that isn't used by almost anyone and isn't going to provide any meaningful value to us beyond test generation? It should be expressive enough for everything most of the other fuzzers out there do right now, so covering context-free and regular languages, but we also want to support context-sensitive. We want to do text and binary; we want the ability to provide value for any kind of language we can think of and might want to test. We want it embeddable from any language we might want to use: we don't want to have to go out into C specifically and write this. If I'm building something in Python or Ruby, I want to be able to call this directly. If I have functionality around the
parser generator I'm using in that language, I want to be able to do it directly in there. I don't want to have to go and build a whole separate ecosystem to bring this over to somewhere else. And we should focus on as much code reuse as possible: if I have a whole bunch of components built, I don't want to reinvent the wheel every single time I want to design a new test or do something new with this. So the project I'm sharing today isn't the first, it's not the most feature complete, or even the most novel solution to this, but it is more flexible, it is more extensible, and it is something that can get to the point where it is most of those things. I want to share with you Synfuzz. I'm releasing it today, and basically this is a platform for building test generation, test harnesses, and fuzzers. It's organized like this: at the center you have an intermediate representation, which is a bunch of data structures and some helper functions that allow you to construct complex grammars. Outside of that you have your various front ends for different parser generators and combinator libraries, things like ANTLR, Ragel, EBNF, and Bison. Essentially, what these do is take a grammar file, parse it, and translate it into the intermediate representation. The test harnesses are reusable components that you design, and these can be as simple or as complex as you want; you have the full availability of any programming language you want to use, and you can design the tests around them. You consume the intermediate representation, give it a starting rule that is the root of your tree, and it will start generating new parse trees, and then from the parse trees it will generate you sentences. Ideally, the only piece that you as an end user will have to build is the test harness, the reason being that it's likely going to be something non-trivial. There probably could be very reusable
components that would make the job easier for you, but the harness is ideally the only piece that needs to be built by you. ANTLR is currently the only front end I've written, but I'm planning to expand pretty heavily from there. This is a list of a handful of the features you get: like most combinator libraries, you get a lot of logical operations as well as quantification, and specific values for generating specific types of values. Right now everything is byte oriented, so it has Unicode support, but it will output to a byte stream. Using the framework looks something like this. The core is written in Rust, but the plan is to release C bindings that can be brought into basically any language. Essentially, the way this works is you bring in a grammar file and read it, you pass it into a generate rules function, and it will parse that grammar into all of the rules it expresses. You then provide a root node and tell it to generate, and it spits out the generated string, and you do whatever you want with it. Any communication is available with this: pipes, files. It builds cross-platform and can likely target most embedded platforms, which is cool, so if you wanted to do anything with this pretty much anywhere, you can. And I have a quick demo; if you'd like a live demo you can catch me afterward and I can show you more. But essentially what this does is, I basically have a while loop that is really hard to read. I'm sorry, is any of that legible?
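Since the slide is hard to read, here's a self-contained Rust sketch of the core idea: a grammar expressed as an intermediate representation of plain data structures, plus a generator that expands a chosen root rule into random sentences. This is illustrative only; the real Synfuzz IR, names, and API differ.

```rust
use std::collections::HashMap;

// Intermediate representation: a grammar as plain data structures.
#[derive(Clone)]
enum Rule {
    Literal(&'static str), // terminal bytes
    Ref(&'static str),     // reference to another production
    Choice(Vec<Rule>),     // alternates: pick one at random
    Seq(Vec<Rule>),        // concatenation
}

// Tiny xorshift PRNG so the sketch needs no external crates.
struct Rng(u64);
impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

// Expand a rule into a sentence by walking the IR.
fn generate(rules: &HashMap<&str, Rule>, rule: &Rule, rng: &mut Rng, out: &mut String) {
    match rule {
        Rule::Literal(s) => out.push_str(s),
        Rule::Ref(name) => generate(rules, &rules[name], rng, out),
        Rule::Choice(alts) => {
            let pick = (rng.next() as usize) % alts.len();
            generate(rules, &alts[pick], rng, out);
        }
        Rule::Seq(parts) => {
            for p in parts {
                generate(rules, p, rng, out);
            }
        }
    }
}

// Build a toy recursive grammar (expr := num | num "+" expr) and generate one
// sentence from the root. Recursion like this is exactly where the cycle
// detection mentioned later in the talk matters.
fn sample(seed: u64) -> String {
    let mut rules = HashMap::new();
    rules.insert("num", Rule::Choice(vec![Rule::Literal("1"), Rule::Literal("2")]));
    rules.insert(
        "expr",
        Rule::Choice(vec![
            Rule::Ref("num"),
            Rule::Seq(vec![Rule::Ref("num"), Rule::Literal("+"), Rule::Ref("expr")]),
        ]),
    );
    let mut rng = Rng(seed);
    let mut out = String::new();
    generate(&rules, &rules["expr"], &mut rng, &mut out);
    out
}

fn main() {
    for seed in 1..=5 {
        println!("{}", sample(seed)); // each line: digits joined by '+'
    }
}
```

Run in a loop with fresh seeds, this is the shape of the demo's while loop, with each generated sentence piped into ANTLR's parse-tree viewer.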
Cool. I'm not allowed to plug my laptop in, but I will talk you through it, and if you'd like to see it I can show you after. Basically, the way it works is I have a while loop here, and it's calling the program. What it's doing is generating a string and then passing it into ANTLR's parsing tool, where you specify a grammar and a root node, and it will try to recognize the language and then visualize the parse tree it's building. In this case I'm generating DOT, so Graphviz files, and it's basically just running through, trying to generate new ones every single time. It gives me a nice little parse tree here; it'll happen again and again and again. It might be kind of slow. So you see, this is a totally different tree: it's bigger, it has different values in it, the structure is different. The generation is actually very fast; the slow part right here is the parsing, for ANTLR itself. But it is multi-threaded, it's thread safe, so you could spread this out as horizontally as you want and put as much power on it as you want to build test cases. And that's probably pointless to keep showing, because no one can read any of it. So, designing test harnesses: this is the part you probably will care about, so let's see a few different ways we might want to design test harnesses for something like this. The first, most common case is "does it crash?" This is what most fuzzers will do, and why you're probably using a fuzzer in the first place: I want to know when this crashes and identify whether it's something that's exploitable. Basically, the way this works is you start the process, you generate a string of some sort, and you feed it into the application. You listen for the SIGSEGV or SIGABRT signals and basically just watch for it to crash. When it does, you get core files and you investigate the crash, figure out the what and the where, and see if it's something that's meaningful to you. So this is the most
basic case, and something that you will get with any other fuzzer. It's pretty straightforward, pretty simple; there are tons of examples out there for it. You basically just register a sigaction handler on the UNIX side and wait for it to do one of those things. Now we get into the more interesting cases, and these are the cases that really gave birth to this project. First, identifying whether a re-implementation of a grammar is overfit relative to a reference implementation. To do this, you generate a test case using your own re-implementation grammar, and you find an oracle within the reference implementation that specifies whether you're getting something that is accepted and valid, whether there's a syntactic error, or whether there's some kind of runtime error happening, not because of a failure in parsing and understanding the sentence, but because of something happening in the runtime itself. You feed the test case in and then you categorize what comes out. If it succeeds, everything is good: you have a sentence that's valid, your parser recognizes it, and the reference recognizes it. If there is a failure, and the failure is a syntax error in the reference implementation, your parser is overfit: you are recognizing things that the reference does not, and that's an indication that you probably need to redefine your grammar and change it to be more like the reference. It should give you an idea of where those problems are, so it might guide you toward what specifically is causing the mismatch, and you can manually go change it to identify whether it's meaningful, or modify the sentence it spits out so you can identify why you're wrong. But there's not really good tooling around doing that yet, other than trying to get good error messages from the reference, and parser error messages are usually pretty terrible, so you're kind of going to be on your own on that one. If the failure is a runtime error, it may or may not be a sign of being overfit or
underfit. This is especially common because of the opaque tokens I spoke about: the parser might recognize something that is syntactically valid, but one of the tokens in there is just totally batshit insane, and it was recognized because there's an opaque token in that place. It also could just be an environmental thing. In the case of MySQL, if you've ever written queries for it, you're probably very used to these types of errors. The top one is a syntactic error: you see the top query here, the select * from a where id, a bunch of carets, 3, and it tells you specifically, this is a syntax error, this did not parse, and it tells you exactly where the problem is. So this is an example of the case where your parser, if it generated this, is just wrong; that should never happen. Below that, however, is a runtime error: select * from a where fake_column = 3. This is a runtime error because one of the columns referenced here doesn't exist, so this is something the parser passed, and it generated something that's likely a valid sentence in the language, but because of environmental issues with what the runtime is doing with that sentence, it wasn't able to execute. So this is probably a false positive in that case, and having an oracle that can tell you about these things allows you to really cut down on false positives. The next case is underfit; this is the other big problem you're likely to worry about. To do this, you generate a test case from the reference, and this is something you can only do if you have the reference implementation grammar. In the case of MySQL, Postgres, or SQLite, you have that grammar definition, because it's open source. For something like DB2, Oracle, or MS SQL, this isn't a viable option, because you don't have the source code for that grammar. But you would parse it with the re-implementation once it generated a test case from the reference and
see if your parser can read it, and then you categorize the result. If it fails, the re-implementation is underfit: you have generated a sentence that is valid in the reference implementation but not valid in yours. If it succeeds, the re-implementation is likely correct for that sentence and that section of the grammar. So, what's ready today? The code is up here; it will be made un-private after this. Crates are available for people who want to use Rust, and bindings will be released over the remainder of the year for as many languages as people are interested in and I have time to build, or other people have time to build. It's MIT licensed; you can submit pull requests and I will happily accept them. Right now you get an ANTLR front end; it's mostly functional. There are a few little things that aren't supported yet, but you can handle the vast majority of the common types of parsers you would generate with ANTLR. What's next? There's a lot. The biggest issue right now is dealing with cycle detection and forward progress. With recursive grammars, which are pretty common, it does a really bad job of ensuring it doesn't get stuck in a loop, because it's essentially building a graph and walking it. Getting cycle detection in there, to move forward and not get stuck, is at the top of the list; basically it will just run, get a stack overflow, and crash. But for a lot of simpler grammars you're not going to have a problem, and it's immediately useful now. Like I said, exposing the C API and language bindings: it's built in Rust, but this will be available. Early plans for languages right now are Java and probably Go, with Python and Ruby probably in the near future as well. The negation logic exists, but it's not great. The problem with negation is that it's not always clear exactly what the opposite of something like one of the combinators is, so pretty much everything has negation logic
right now, but it could be better. You don't get quite as much entropy as I'd like, so if you're using a lot of negation you might just not get as much variation. Then there's context-sensitive support, and being able to do introspective things on the data that's being generated, so if you need to generate checksums that are valid, things like that, having support to do that. As I said, it's all byte-oriented right now; I might be interested in building out bit-level support, because there are a lot of wire protocols that have some of that, and it would be useful there. Additional front ends: this is the big one. Bringing support for more platforms is a real goal to make this significantly more useful. Then grammar coverage information: how much of this grammar have I explored, and how much time have I spent in a specific area? And finally, being able to reuse a tree and change individual tokens, so I can change just the areas that are dynamic and really, really hammer sections of code if I care about that. Again, this is the link if you're interested in the code.

Questions? Sure. So, how do you implement an oracle? Basically, it comes down to the integration that you build in your test harness. In the case of MySQL, we're literally using the MySQL client library, and the error messages coming out of it provide us the oracle; we just do a comparison on what's being provided there. That's something you're going to have to find. If you have source code, it's possible. If there's something you can do to inspect the binary, or just get meaningful output out of it, it usually comes down to finding something in there. And if you can't, then you're kind of just fucked.

[Audience question, inaudible.] Yeah, I don't yet. The plan is to start doing some of that research and putting that together. This is something I've been working on for maybe the last eight months or so, and it's at the point now where it's usable in a meaningful way for us. Part of the problem for us was building test harnesses that gave us
meaningful feedback. But I do plan to start doing an evaluation of that, and I want to have something published, probably next year, to actually find out whether or not this is useful and provides more value than something like that. Anyone else? Cool. Thank you.
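As a rough illustration of the underfit check described in the talk, here is a minimal Python sketch. Everything in it is invented for the example: the toy grammar, the deliberately underfit re-implementation, and the depth budget, which is only a crude stand-in for the real cycle detection the speaker says is still on the to-do list. The actual tool consumes ANTLR grammar definitions and is written in Rust.

```python
import random
import re

# Hypothetical toy "reference" grammar: UPPERCASE keys are nonterminals,
# each alternative is a list of symbols; anything not in the dict is a literal.
GRAMMAR = {
    "QUERY":     [["select * from ", "TABLE", "WHERE_OPT"]],
    "TABLE":     [["a"], ["b"]],
    "WHERE_OPT": [[], [" where ", "COND"]],
    "COND":      [["id = 3"], ["COND", " and ", "COND"]],  # recursive rule
}

def generate(symbol="QUERY", depth=12):
    """Expand a nonterminal into a random sentence of the reference language.

    The depth budget stands in for cycle detection: once it is exhausted,
    directly recursive alternatives are skipped, so generation can't loop
    forever (or blow the stack) on rules like COND -> COND and COND."""
    if symbol not in GRAMMAR:
        return symbol  # literal: emit as-is
    alts = GRAMMAR[symbol]
    if depth <= 0:
        alts = [a for a in alts if symbol not in a] or alts[:1]
    return "".join(generate(s, depth - 1) for s in random.choice(alts))

def reimplementation_accepts(sentence):
    """A deliberately underfit re-implementation: it never learned that
    WHERE conditions can be joined with 'and'."""
    return re.fullmatch(r"select \* from [ab]( where id = 3)?", sentence) is not None

# Generate from the reference, parse with the re-implementation, categorize:
# every rejected sentence is a concrete counterexample proving underfit.
random.seed(0)
underfit = [s for s in (generate() for _ in range(200))
            if not reimplementation_accepts(s)]
print(f"{len(underfit)} sentences exposed the re-implementation as underfit")
```

Running the oracle in the other direction, generating from your own grammar, feeding the sentences to the real MySQL client, and classifying syntax errors versus runtime errors, gives you the overfit check from the first half of the section.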