Hello. Good afternoon all. How's everybody doing? Good to be at DEF CON. I know I saw a lot of people at Black Hat. Probably most of the people that saw it there aren't gonna see it again, and I don't blame them; it was worth seeing once, I think. Okay, well, we've got more people coming in. One of the things I like to do at the beginning of a talk is have everybody take their right hand, stretch it out, and shake the hand of the person next to you, because you're gonna see them next year if you come. I guarantee it. Every year you see the same people, well, some of the same people, but more. So let's get into this. The goal of the talk is to try and build a better fuzzer. That's the goal of my research; that's what I've been doing in my academic work for the last year. I'll briefly go through some background material: real quickly, what is fuzzing, software testing, evolutionary testing. Then I'll talk about the actual system I built, EFS, then some initial benchmarking, and then some initial results on real-world applications. Now, there's no definitive proof that EFS is actually better than, say, SPIKE or Sulley or any of the other fuzzers out there, mainly because it depends a lot on the application, and it also depends on what you're trying to do. I mention this at the very end of the talk, but for now, file fuzzing, as one of the iSEC guys said at Black Hat, is still depressingly easy: just take a file, flip a bunch of bits, feed it to the application, and the application crashes, right? So there's no need to set up an evolutionary fuzzing system when you can just flip some bits. It kind of depends on your target, what you're going after. Whereas network applications are generally quite a bit harder to fuzz nowadays; they've been hardened throughout the years, right?
You can't just flip a few bits in an HTTP session and crash Apache. That's not going to work. So it depends a lot on what you're doing. To my knowledge, this is the first evolutionary fuzzing system that's been released publicly. You can go to my website, www.vda-labs.com, and download everything I'm talking about: the slides, the actual code, and GPF is there as well, the General Purpose Fuzzer, something else I created. So, section one. Software testing. This slide is just to say, and I mentioned this last year at DEF CON, that software testing is hard. That's basically all this slide says: it's not an easy problem, it hasn't been solved, it's an open problem, and there are a lot of companies looking at a lot of different ways to do it. But you've got to test, right? It's the only way to get any sort of confidence that your software is good. And fuzzing, as I said last year as well, is not the only way to test. There are a lot of other things: functional testing, GUI testing, quality assurance, formal methods of development, all that kind of good stuff. Fuzzing is just one element. The basic premise of fuzzing is you get a test case delivered to the application, and if the application has a problem, you obviously record that; if it doesn't, you keep going to either the next randomly generated test case, or the next test case that comes from a list, or however you're getting them. You move on. And of course, the most interesting places to fuzz are where there's a change in privilege, right? Data that comes across a network interface is unprivileged data. If the service is running at some privilege, and you can get a buffer overflow, you can elevate your privileges from nothing to something. So that's part of the key, right?
In any type of fuzzing you're generally looking for some sort of elevation of privilege, or maybe you're just looking for robustness issues, whatever you're looking for. This is just to say that what you can fuzz is generally called the attack surface, and I think people understand this pretty well. I asked people at Black Hat, how many of the speakers have you heard mention either fuzzing or attack surface? And by and large, probably half the crowd raised their hand, which means I think people finally understand what these terms mean. The attack surface could be local parsing of a file, the network interface, inter-process communication, any bit that could be influenced by dirty data, user data. So, evolutionary testing uses GAs, genetic algorithms, and I'll talk real quick about how they work, where they've failed in the past, and where they're going. Basically the idea, if you haven't heard about this from last year, is a survival-of-the-fittest mentality. Mom and dad are so-so fit, and they mate and have kids, right? And the children, ideally, are a little more fit, so each generation gets better. From a high level, that's the basic idea of genetic algorithms. They're a search algorithm; they're searching toward something. So if you have the source code and you do evolutionary testing, you're searching toward slash-slash-target. See that up there? So here's a function. It's called example. It has four parameters, a, b, c, and d, and a is 10, b is 20, c is 30, d is 40. The first check, and this is either randomly generated data or data that's evolved this far, is: if a is greater than or equal to b. But it's not, right? a is less than b. So you stop right there, and there were two more branches to get to the target.
So the fitness is two branches, two, plus a normalization of 10, because 20 minus 10 is 10. That's how evolutionary testing works in general against software. And this slide is just to say that in the past, and even now, people actually are using evolutionary testing. In particular, it gets used in safety-critical situations. Somebody might tell you, oh, denial of service isn't too big of a deal, right? I want code execution. Well, I want code execution too, but if I'm on the iron lung and somebody DoSes my iron lung, that's bad, right? Denial of service can be really bad in safety-critical settings: automobiles, airplanes, those kinds of things. That code needs to be really robust, and so this is one of the places where this kind of testing gets done. Now, this is just to show you: evolutionary testing, how does it do in all cases? Well, I don't know about all cases, but here are a few cases where it doesn't do so great, generally. Can anybody look at the code example on the left? Look where it says slash-slash-target. That's where you're trying to get. It's called flag underscore example, and it takes two parameters, A and B. Does anybody see a problem with that? It'll never get there, as we said. Yeah, well, it could get there, but there's a notion in evolutionary testing called the fitness landscape, which is, you know, you're down here and you want to get up here. How easily can you climb the hill? Is it a continuous function to get there? In this case the check is "if flag, then target," but the problem is you don't control flag, right? You control A and B. As a human looking at it, it's not too hard to see that if you set A and B both to zero, you'll get to the target. But in the past, evolutionary testing didn't take that into account. It had no way of knowing what flag is.
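As a minimal sketch of the branch-distance idea from the example() function above: fitness is the number of branches still standing between you and the target, plus a normalized distance for the branch you're stuck at. Only the first check (a >= b) comes from the slide; the other two branch conditions here are made up to complete the sketch.

```python
def norm(dist):
    # Squash a raw branch distance into [0, 1) so it never
    # outweighs a whole branch in the fitness total.
    return dist / (dist + 1.0)

def branch_fitness(a, b, c, d):
    """Branch-distance fitness toward // target. Lower is better;
    0.0 means the target was reached. Branches 2 and 3 are
    hypothetical -- the talk only shows the first check."""
    if not (a >= b):
        return 2 + norm(b - a)   # stuck at branch 1, two branches remain
    if not (b >= c):             # hypothetical branch 2
        return 1 + norm(c - b)
    if not (c >= d):             # hypothetical branch 3
        return 0 + norm(d - c)
    return 0.0                   # reached // target

# a=10, b=20: fails a >= b by 10, so fitness = 2 + 10/11, about 2.91,
# matching the "two plus normalization of 10" from the slide.
print(branch_fitness(10, 20, 30, 40))
```

The GA then tries to minimize this number, climbing down the fitness landscape branch by branch.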
There's a way to deal with that, but that's what's called a flat spot in the fitness landscape. There are also what are called deceptive problems. If you look at the code snippet on the right, it's a function under test that takes a parameter, and you want to get to where the target is again. It says if the inverse of that equals zero, then we get there. Does anybody see the problem here? Say you give it a hundred; it goes into the inverse function and does one over a hundred. Then you give it a thousand, and it's one over a thousand. It's taking the inverse, so the value is getting smaller, so it thinks it's getting closer to zero, right? Your evolutionary testing algorithm believes it's getting closer and closer to zero, when actually, as it gives it larger numbers, it's getting farther away. If you just gave it zero, it would actually hit the target, but it doesn't know that. That's what's called a deceptive landscape. All of that is just background; that's section one. There are people working on this in academia: McMinn and Holcombe are coming up with what's called the extended chaining approach to do better with evolutionary testing. But the system I designed is, I don't want to say better, different, because it works on binaries, right? Oftentimes, if you're auditing, say, IIS FTP or something, you're not going to have Microsoft's source code, right? So there's no way to do the traditional sort of evolutionary testing. You have to use a debugger on the binary and all that kind of good stuff, and I'll talk about how that works right now. Section two: what is EFS? Well, it uses evolutionary algorithms, and it also uses fuzzing.
I leveraged a tool I'd written in the past called GPF, the General Purpose Fuzzer, and it also leverages Process Stalker, a heavily modified version of the Process Stalker from the PaiMei framework, which you may have heard of. And this is an overview, a system view, of how EFS works. You see GPF here, written in C, down in the left-hand corner; it's doing the evolutionary algorithms, and all the fuzzing heuristics and such come from there. The target monitoring is on the right-hand side: you've got the debugger monitoring the target, and they both talk to a SQL database so that each session from each generation can be stored. Okay, well, what is a session and what's a generation? We'll take a look at the data structure in a second. And I should say right up front that this tool was originally designed with network protocols in mind, but I am porting it to file fuzzing as well. It is network-centric, though; you'll see it feels that way. And it has some reporting you can do from PHP. The basic protocol is: the fuzzer asks the Process Stalker, are you ready? And he says, yeah, I'm ready. Then the fuzzer talks to the target, the Process Stalker reports the coverage information, how many functions or basic blocks were hit, and those get stored in the SQL database. It should be fairly clear what functions are, and I'll talk about this a little later too. As for basic blocks, just in case you don't know: you take the binary, you put it into IDA Pro, and you run pida_dump, which creates what's called a PIDA file. Basically you get a list of all the function starts and all the basic block starts, so that you can stalk on either functions or basic blocks. A basic block is just, in assembly, something like push, push, move, move, branch. That's a basic block; wherever there's a branch, that starts another basic block.
So you have a choice when you stalk for coverage: you can stalk on either functions or basic blocks, and we'll talk about which one's better. There are a lot more basic blocks than there are functions, so stalking them is more granular, but also a bit slower. Next, the fitness function. The key to a good genetic algorithm is having a good fitness function and a way to evaluate it. In this case, the fitness function has two pieces: code coverage, how many functions or basic blocks were hit during the session that just ran, and also a measure of diversity. Because what happens over time in a genetic algorithm is what's called convergence. You started out with all these good random pieces, and you were diverse up here, and your fitness started out real low, since it's all random data mixed with building blocks. As the sessions get more and more fit, they all essentially become alike, right? They all start to look the same; the strongest guy is winning. That also means the diversity is going down, because all the other guys that weren't as strong die out over time. And you don't want that to happen, because you want to cover the entire attack surface, not just the single session that hits the biggest number of functions. It is important to hit that guy too, the best session that goes all the way through, but there are all these other little sessions that might trigger corner-case bugs and things like that. So diversity is key, because diversity gives you more total coverage of the attack surface, and ideally you'll cover the entire attack surface. And that's a bit hard to actually measure, right?
Because you may not know. Again, go back to the IIS FTP example: say you put it in IDA Pro and get a list of all the functions, and there are, say, 1,000 functions. Well, how many of those 1,000 functions can you touch remotely through the network interface? You don't really know; there's no clear and easy way to tell. But if you could tell that it was 200 of those 1,000, then ideally you'd touch all 200 functions throughout the course of fuzzing. That's the goal, and I state it with the academic keyword: corollaries one and two. The first corollary is that code coverage doesn't equal security, but if you don't get code coverage, you're even less secure, right? If you haven't covered it, you haven't covered it; you can't find bugs in code you never touch. The second corollary basically says that if you cover the entire attack surface, and you do it with diverse sessions that carry fuzzing heuristics, long strings, dot-dot-slash, %n characters, and all the good fuzzing heuristics that should be in fuzzed data, then that's the best I know how to do. That's the goal. Okay, so how does the actual evolution work? We'll talk about that before we talk about results. Basically it works by recombining data and slamming it around; any bit can be reorganized. I'll show you how that works with session crossover, session mutation, pool crossover, and pool mutation, because data gets organized into pools of sessions. I'll show you that in a second. But one thing to mention is elitism, which means the session that performs best in each pool gets carried over to the next generation automatically. He doesn't have a chance of breeding and dying out; he just gets to go to the next generation. That's called elitism of one.
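To make the elitism idea concrete, here's a toy generation step: the fittest session is copied straight into the next generation, and everyone else is bred by fitness-weighted selection. The breeding operator here is a placeholder single-point crossover, which the next slides walk through; none of this is the actual GPF code, just the shape of the loop.

```python
import random

def single_point(a, b, rng):
    # Placeholder breeding operator: cut both parents at one random
    # point and splice (covered in more detail on the next slides).
    cut = rng.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def next_generation(pool, fitness, rng=None):
    """One generation step with elitism of one: the best session
    survives untouched, the rest of the new pool is bred from the
    old one by fitness-weighted parent selection."""
    rng = rng or random.Random()
    new_pool = [max(pool, key=fitness)]            # elitism of one
    weights = [fitness(s) for s in pool]
    while len(new_pool) < len(pool):
        mom, dad = rng.choices(pool, weights=weights, k=2)
        new_pool.append(single_point(mom, dad, rng))
    return new_pool
```

Here `fitness` would be the coverage-plus-diversity score described above; in this sketch any callable that scores a session works.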
So how is the data organized, then? At the lowest level you have a token, and a token can be of different types: binary, ASCII, space, delimiter, carriage return, line feed, whatever. So maybe you have a token whose string is the word "user" or something like that. Tokens group into legs, and each leg is a read or a write, in the case of a network protocol. So a session is a series of reads and writes made of tokens, obviously. And then sessions are organized into a pool, and the idea of the pool is to help maintain diversity: if you have different pools of sessions, and they don't necessarily interbreed, though they certainly can, pools do breed, but you can set how often that happens, you can help maintain diversity. We'll see that as well. So how does the actual session breeding take place? You have two fine-looking, beautiful sessions, A and B, and they wanna bada-bing, bada-boom. How does that work exactly? It's called single-point crossover. You basically draw a line, randomly picking a spot between two tokens, and you copy from A up to that line, then after the line you continue copying from B. Same thing for B prime: you copy from B up to the red line, and from A after the red line, and you have two new sessions. And it's sort of unintuitive that this would work, right? Because you don't really know how that data is organized. If, in order to log in, you need user, space, password, carriage return, line feed, it seems like this wouldn't work, because you're recombining bits and pieces of either what are called building blocks, which we'll see, or random data. It doesn't seem like this would actually work, but amazingly it really does. I was kind of amazed the first time I saw it work; I'll show you. So here's an example of a session mutation. You see the word "Jared" just got a bunch of %n's inserted into it.
That's a mutation. It could also be inserting, again, any sort of fuzzing heuristic. It doesn't always just insert random data or flip bits, though it can; one of the fuzzing heuristics is, if the token's type is binary, just flip some bits in that token. But it might insert long strings or whatever; there are quite a few different things, and you can look through the source code I've released to see all the fuzzing heuristics and how they actually work. Pool crossover is very much like session crossover, and like I said, you'd want it to happen less frequently to maintain diversity, maybe every nine or ten or thirteen generations, something like that, so pools have a chance to evolve a bit separately. It's very similar: draw a line, copy from one to the other. Pool mutation is basically just randomly adding or deleting a session. You wouldn't want to do that very often either, because if you have a lot of good sessions in there, you don't want to just lose that information. But every once in a while it's good to insert fresh blood, because you might be missing a basic building block. So let me show you a quick example, because this is all kind of abstract, right? This is a simple example. You see generation one starts out with a lot of randomness in it, and you see some building blocks in there, right? "user" and "pass". I'll talk in a second about where those come from: they come from what's called the seed file, from when you originally started. If you know what the protocol is, you know what the necessary strings are. If it's FTP, you know that somewhere along the line you need to see "user" and "pass" show up. Those are called the building blocks, right? And if you don't give it the building blocks, it could, theoretically, eventually find the string "user".
It's just going to take a long, long, long time, right? Because it would have to randomly combine bits to come up with that string. And you see, by generation 15... well, in the first generation it didn't do much, right? Maybe the best one was session one: he had "user", a space, and then some junk, so maybe he got halfway through logging in, but he's not real fit. By generation 15, it's got "user", a space, which is a delimiter, some junk, a line feed, which is good, and then "pass", then a delimiter, then some more junk. So that one is more fit, and when you stalk it, you'll see that it actually progresses along. That's how they work: over time you see them getting better and better. And if you dump the data out of the SQL database and actually look at what's in the sessions after 100 or 200 generations of evolution, sometimes you're amazed. You look at the most fit one and you're like, how is he the most fit? That is ugly. And then you look closer and you're like, no way: there's a carriage return, there's a line feed, there's a bunch of 0xFFs, a bunch of random binary. But text-based protocols are pretty forgiving, so as long as the strings it needs to see are in there, it can still log in and actually work, as long as it's got the basic things it's looking for. Binary protocols aren't quite as forgiving; we'll talk about that too. So what are the actual parameters you give GPF to start this off and make it happen? Well, obviously you've got to give it the IP address of the stalker and the IP address of the database and all that kind of good stuff, the username and password, the number of sessions you want in each pool, and how often you want the mutation to take place in each pool.
Session mutation and session crossover happen every generation; all the other rates are configurable: pool mutation rate, pool crossover rate, and all that good stuff. So there are some basic parameters to get you started. I'm not going to go through every parameter and its optimal setting, because that varies a lot based on the application you're testing. There's also the seed file, which I mentioned. Here's an example of the seed file I'm talking about: if you're going to fuzz SMTP with an evolutionary fuzzer, you might as well give it a seed file that looks something like that, with HELO and MAIL FROM and RCPT TO and stuff like that, because that helps the evolution happen much quicker; it gives it a boost in the right direction. But the key with seed files is to give it really small building blocks. If you're fuzzing HTTP and you give it one big building block like GET, space, slash, space, HTTP slash 1.1, that's a really big building block, and it'll use it and probably do okay, but it'll have a hard time finding weird stuff, and it probably won't find as many bugs. It's better to give it GET, and a space, and a slash, and a space, and HTTP, and 1.1, all as separate seeds. Smaller building blocks are better, not just in evolutionary fuzzing but in genetic algorithms in general. So, on the other side, not the evolutionary fuzzer side but the GUI side, the modified version of PaiMei, what do you need to do? Well, there are two routes you can go here. To track code coverage, you could sample EIP, or you could use what's called an MSR technique. Or you can create a PIDA file.
For the PIDA file, you take the binary, you stick it in IDA Pro, and you run what's called pida_dump, and it gives you that list of functions and basic blocks I talked about. Give that to PaiMei, along with the normal start-up and shut-down parameters you have to give any application. That's one of the key things when you're fuzzing: you've got to figure out, say, that a Windows service starts and stops differently than a user application. All those kinds of things are very important to figure out ahead of time. How is this going to work when it crashes? When it goes to restart the target, it needs to know how to do that well, or the restart won't happen. And you've got to have the SQL database set up, all that kind of good stuff. But here's a quick picture of how it works, and those are all the parameters. Again, I'm not going to go through all of them, but there are the settings for functions and basic blocks, and what to do to attach or detach after a crash, all these different settings. And one of the main things you see there, the little red scissors, you probably can't see it very well in the picture, is a notion of filtering. You want to filter because, if you're fuzzing FTP, there are things that happen every session: when you start it up, when you connect. If you just connect with netcat, don't send anything, and shut it down, those hits are all common breakpoints, and you really don't want to stalk on common breakpoints, because they just slow down the evolution, slow down the running of the fuzzer. So it's best to filter those out: manually start it up, connect, shut it down, and save those hits to the PaiMei database. That's really, really crucial in terms of speed. Okay, so that was section two.
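Before moving on, the session-mutation operator described a few slides back can be sketched like this. The heuristic list and the three strategies (replace, splice, bit-flip) are one plausible reading of what was described, not the exact GPF code; the real heuristics live in the GPF source.

```python
import random

# A few classic fuzzing heuristics of the sort GPF mixes in
# (format specifiers, long strings, traversal, high bytes).
# Illustrative list only.
HEURISTICS = ["%n" * 10, "A" * 1024, "../" * 8, "\xff" * 4]

def mutate_session(tokens, rate=0.25, rng=None):
    """Session-mutation sketch: each token may be replaced by a
    heuristic, have a heuristic spliced into the middle of it, or
    get one byte's high bit flipped."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        if tok and rng.random() < rate:
            op = rng.choice(("replace", "splice", "bitflip"))
            if op == "replace":
                tok = rng.choice(HEURISTICS)
            elif op == "splice":
                i = rng.randrange(len(tok) + 1)
                tok = tok[:i] + rng.choice(HEURISTICS) + tok[i:]
            else:
                i = rng.randrange(len(tok))
                tok = tok[:i] + chr(ord(tok[i]) ^ 0x80) + tok[i + 1:]
        out.append(tok)
    return out
```

Run against a token list like `["USER", " ", "jared", "\r\n"]`, this is the kind of operator that turns "Jared" into "Ja%n%n%n...red" as in the earlier slide.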
Section one was the background, section two was what the system is, and section three is, how do we evaluate it? The way to do that, in my mind, is to create a benchmarking toolkit, and this research isn't complete, but we'll talk about attack surface coverage and two servers, a text server and a binary server, to see how well it does against both. Then we'll figure out: is it better to stalk on functions or basic blocks? And did pools help maintain diversity? And if they didn't help as much as we'd like, is there some other measure, something else we can do to make diversity better? In fact, there is; I just recently added another diversity mechanism, and we'll see that. There should be another paper coming on that at some point. So, like I said, this toolkit can help you see how well it does against a clear-text server and how well it does against a binary server. And one other thing that would be really good, if you want to benchmark and say, hey, is this as good as SPIKE? Is this as good as Sulley? Is it better, is it worse? One other idea is you could insert bugs that are easy to find and bugs that are hard to find, and then fuzz and evaluate against those, right? Okay, so this is how the text server works. It's basically standard text, a very texty-feeling protocol. The server says welcome, and then you send it "command", space, a number, carriage return, line feed, and it says okay, that one's good. Then you send it another number, carriage return, line feed, and it says the sub-command's okay. Then you send "calculate", carriage return, line feed, and it gives you the answer. So it's just adding two numbers, a very simple thing, but the cool thing about it is that I wrote it so it has three paths.
You can set it on low, medium, or high, and that gives you the ability to ask: if the attack surface is really small, how does this thing do? Or if the attack surface is larger, if there are more paths, more opportunities for diversity, how well does it work? That's actually really helpful for evaluating something like this. Here's an aside I found interesting in my own research. With the text server set on medium, just starting it and shutting it down hit 137 basic blocks out of 597 basic blocks. So 23% of the code ran just from literally double-clicking it and clicking the X, right? That's kind of interesting; you wouldn't think there's a lot of code wrapped up in that, but there's quite a bit just in starting a service and shutting it down. The actual network code, which is just the TCP connect, was 15 basic blocks, 3% of the code. And the parsing, which is the only code that has any real opportunity to have bugs, this is the attack surface, right, the part you want to fuzz, was 94 basic blocks out of 597, 16% of the code. The other really interesting thing to me is that a large portion of the code is just unaccounted for: libraries that got linked in and never get used, things like that. You can actually dig through IDA Pro and figure out what all that code is that never gets used; it's really interesting to find that a lot of it is essentially unaccounted for. The seed file for this fake text server would basically just be the nine digits that are possible, and again, on low only the number one is acceptable, on medium one through four are acceptable, and on high one through nine are acceptable. And then a couple of other strings: carriage return, line feed, "calculate", "command".
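The block accounting just mentioned works out like this (numbers straight from the slide):

```python
# Basic-block accounting for the text server on medium.
total = 597
parts = {"start/stop": 137, "network (TCP connect)": 15, "parsing": 94}
for name, blocks in parts.items():
    print(f"{name}: {blocks}/{total} = {blocks / total:.0%}")
left = total - sum(parts.values())
print(f"unaccounted: {left}/{total} = {left / total:.0%}")
```

So start/stop is 23%, the network code 3%, the parsing (the attack surface) 16%, and the remaining 59% or so is the unaccounted-for code.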
You should be able to shuffle those all together, insert fuzzing heuristics, find bugs, and do all that kind of stuff. And it had no problem learning the language; it found the best session, and by best I mean the correct session, right, the one that completed all the legs of the protocol. That's what I call the base session, or the best session. But it didn't necessarily find all of them, and we'll see that: it didn't actually cover the entire attack surface. Does anybody know why it wouldn't cover the entire attack surface? Well, I think we see that. No, we don't see it on this slide; we'll see it in a few. [Audience member answers.] Yeah, right. So it didn't cover all the parsing code, which is the attack surface, that was the question. And the reason why, if we go back and look at the protocol, is that I'm not really showing you the whole protocol here, am I? Because what happens if you send "command" and then a thousand A's instead of the number it's expecting? What's it going to do? Probably it has some sort of error path, right? It's going to say invalid command, try again, or something like that. So I didn't show you the whole protocol, and that's an important notion, because this is, I think, where this kind of testing, dynamic learning testing, where you get actual feedback from the subject under test as you test, excels. If you create a fuzzer, like a SPIKE file or whatever, and it's RFC-compliant, it's going to be a good fuzzer, right? It's going to do well, because you designed the fuzzer based on the RFC, so it's going to get good code coverage; you designed it to. But what happens if there's some bit of the protocol where they didn't follow the RFC when they implemented it, right? And that's the whole idea.
You're looking for implementation errors: there's a hidden command, or they allow a command earlier in the protocol when it really shouldn't be allowed until later. That kind of thing, like you can do PASS first instead of USER, when really USER should have to come first, right? This approach is going to be particularly good at finding those. And it's also particularly good at finding places where you need weird or fuzzy stuff spread throughout the various legs of the protocol, because typically a SPIKE file is very sequential: it's going to fuzz the username first, then use a valid username and fuzz the password next, then fuzz whatever comes after that. It's a very organized, sequential way of fuzzing, where this is kind of craziness, right? It's slamming all over the place, coming up with weird sessions and all kinds of stuff like that. So that was the text server. The binary server is very binary-looking, right? The first four bytes are the total length of the packet, the next four bytes are the session ID, the next two bytes are the length of the command, and the final bytes are the actual command you want to send. The server response is framed very much the same way. So the question is, will a system like this be able to learn something like that? Because now you have a length that has to be correct and cover the whole packet, or else the packet gets rejected at the very first check, right? That's how binary protocols work. Sadly, I don't have much testing done on this bit yet, but it shouldn't be too hard. I talked to Charlie Miller, and he did a little bit of testing on one binary protocol, and it worked. So, more to come on that.
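The binary server's framing, as just described, can be sketched with `struct`. Byte order and whether the total length includes the length field itself are my assumptions; the talk only gives the field layout.

```python
import struct

def build_packet(session_id, command):
    """Frame a request for the benchmark binary server: 4-byte total
    length, 4-byte session ID, 2-byte command length, then the
    command bytes. Big-endian, length-includes-itself: assumptions."""
    body = struct.pack("!IH", session_id, len(command)) + command
    return struct.pack("!I", 4 + len(body)) + body

pkt = build_packet(0x1234, b"calculate")
# Any mutation that doesn't keep the two length fields consistent with
# the payload gets rejected at the server's first check -- that
# consistency is exactly what EFS has to learn for binary protocols.
```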
So which should I stalk on? Functions or basic blocks? Does it matter? Well, in my experience, it didn't seem to matter a whole lot as long as the application was sufficiently large. But here, this is an example where the text server was being run on low, and it did make a difference, because there were only six functions. The best session is six functions long, and it only found four of the six, because the fitness landscape is very flat, right? It makes a breakthrough, but then it's flat, and then it makes a breakthrough. It's not a nice smooth curve. Where you see the basic blocks did a little better, right? That's on the right-hand side, and the graph was a little smoother as the diversity kind of went down and the fitness kind of went up. And in the case of the basic blocks, it found 40 of 37, meaning that 37 is kind of that base case where it did everything right. And it found 40, which means that it must have found one of the error cases as well, right? Because it found more than just the base case, where the other one didn't. But if you run the text server on medium, you see that actually the functions did just as well and did it quicker, maybe slightly better in this case. So the graph isn't quite as pretty, because with basic blocks things tend to look a little smoother. So you see diversity coming down a little smoother and going up a little smoother. But either way, it found all six functions. So that's to say, again, just to prove the point, all I'm trying to say with this slide is that if the application is sufficiently large, it's okay to stalk on functions, which is kind of important because it speeds up the process enough that, in my case, I think I'd rather stalk on functions if I can. It's just that much quicker. Because the difference could be big if you have a particularly large application. What does it really mean to stalk on functions or basic blocks?
What it really means is each of those is like a breakpoint, right? So if you've got a list of all the functions in the application, when you start the application, you have to set a breakpoint on every one of those spots. In order to get code coverage, you have to hit it, remove the breakpoint, and keep going; hit it, remove the breakpoint. So if there's like a hundred functions, if you're hitting, say, 50 to 100 functions versus, you know, 600 to 800 basic blocks, it's going to run slower with basic blocks, right? Because there's a lot more breakpoints it's going to have to be removing, essentially. So in the next slides, I just want to show what the effects of the pools were. Did it actually help? And it turns out it did help; having pools helped keep diversity up, but it still didn't cover the entire attack surface, right? And I wanted to add another bit to help it do even better, because the pools help, but this next thing actually helped even more. This is called niching, or speciation: some way of rewarding those guys that look different, right? So all the sessions as they're growing, they all want to be like the doctor or something like that, but this guy, he's different. He wants to be the fireman or something. So we want to kind of reward him. Even if they're not as fit, they're not hitting as many functions, we want to give them some sort of reward. And that's the algorithm there. I'll show you real practically, because it's a little mathy looking, not terribly, but a little bit. So, real clear example: session one got 10 hits, session two got seven, session three got five hits. And I actually show you the hits, right? A, B, C, D, E, F, G, H, I, J. And what do those letters represent? The letters represent basically an address in memory. That's all they represent. So it might actually have been 0x08041234 or something like that.
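That hit-once-then-remove behavior can be sketched with a toy model (this is an illustration, not the actual EFS/PaiMei stalking code, and the addresses are made up):

```python
# Toy model of breakpoint-based coverage stalking: every function (or
# basic-block) address starts with a breakpoint armed; the first hit
# records coverage and removes the breakpoint, so each address costs
# at most one break no matter how often it executes.

class CoverageStalker:
    def __init__(self, addresses):
        self.breakpoints = set(addresses)  # one-shot breakpoints still armed
        self.hits = []                     # addresses covered, in hit order

    def on_execute(self, address):
        if address in self.breakpoints:
            self.breakpoints.remove(address)  # "remove it": fires only once
            self.hits.append(address)

stalker = CoverageStalker([0x401000, 0x401080, 0x401120])
for addr in [0x401000, 0x401080, 0x401000, 0x401200]:
    stalker.on_execute(addr)
# Two breakpoints fired; the repeat of 0x401000 cost nothing extra.
```

This is also why functions are cheaper than basic blocks: fewer armed breakpoints means fewer debugger traps per session.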
And so your initial thought might be that, well, session three doesn't really deserve to breed as frequently as session two, because he didn't do as well. He only had five hits versus seven. But the thing you see there is that session two looks like he's going to grow up to be just like his dad, right? He's kind of an immature version of the 10-hit session. He's got all the exact same functions. He's headed along that same line, where the other guy, he's hit five functions that the other two have never hit before. So you give him a reward. You take the session's unique hits, divided by the best session's total, times the best minus one. So at the end, because of the fitness boost that he receives, his total fitness is 9.5. It's actually better than the second one, which I think makes sense. He should actually be better in terms of overall fitness because of the fact that he's following a diverse path. So graphically, let's see how this works. So in the first case, this is the one on the upper left: we had only one pool and a whole bunch of sessions. And we see diversity going right down as fitness goes up. And that kind of confirmed my suspicions that that happens as everybody's becoming more and more alike. Question, go ahead. Yes, I would love to. The first one does, but he's the best. So everything's baselined off the best. He doesn't get any boost for being unique. He's the best just based on the count, on the number of hits. So he hit 10 breakpoints, whether it be functions or basic blocks. Essentially that's what it is. 10 equals 10. So you see he had 10 hits, but he didn't have any unique, because he's the best, so we don't count any uniqueness for him. So his overall fitness is still just 10. 15 uniques? Well, overall, yes. So what you would call the total diversity, and that's what we see in the next slide when I talk about the total diversity peak. You see how it's measured, like 80, 87, and 107.
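Working the numbers in that example backward, the boost comes out as the session's unique hits, divided by the best session's hit count, times the best minus one; that reconstruction is mine, so treat the exact formula as an assumption rather than the definitive one:

```python
def fitness_with_niching(hits, unique, best):
    """Raw hit count plus a diversity boost of
    (unique hits / best session's hits) * (best - 1).
    The best session is scored on raw hits alone."""
    if hits == best:
        return float(hits)              # the best gets no uniqueness boost
    return hits + (unique / best) * (best - 1)

best = 10  # session one hit 10 breakpoints
assert fitness_with_niching(10, 0, best) == 10.0  # session one: the baseline
assert fitness_with_niching(7, 0, best) == 7.0    # session two: nothing unique
assert fitness_with_niching(5, 5, best) == 9.5    # session three, boosted past two
```

Session three's five never-before-seen hits earn a boost of (5/10) × 9 = 4.5, lifting him from 5 to 9.5, which matches the worked example in the talk.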
The diversity peak is the total count. So, the count across the entire pool of diverse hits. And that's what you want to maximize, right? Because, say, if there's 150 total possibles, ideally the total diversity peak hits 150 at some point throughout the evolution. You've covered the entire attack surface. That's the goal. That's what we're heading toward. I didn't actually get that. But what I got is that when I added multiple pools, you see the trend of diversity is kind of up and down. Total diversity, what you're talking about, the count of all diverse hits throughout the pools, is the green line on top. And it did better than with just one pool. So multiple pools did seven hits better overall. And when we added that final metric of giving a boost to diverse sessions, then you see the total peak was 107. So it helps substantially to give a boost to people who are less fit hit-wise, but more fit diversity-wise. So that was the evaluation; that was sections two and three. Section four is the actual results. So, just some results. After fuzzing Golden FTP, it found lots of bugs. I don't know if you guys have heard of that before or use it, but I wouldn't recommend using it. IAS FTP and SMTP, it did not find bugs there, but it did find some instability that could probably mostly be attributed to the debugger. I wasn't able to tie those to any one session. So the kind of moral of the story here is that really more testing needs to be done. It needs to be tested against lots of different applications and lots of different scenarios before any real evaluation can be made of how well it does compared to other fuzzers. So this was just some pictures that I had generated a long time ago when I first started using it. I just liked this because I thought it was really cool when the little bubble popped up in the lower right-hand corner saying that, on GFTP, a user had logged in.
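The total diversity described there is just the size of the union of every session's hit set across all pools; a tiny sketch with toy data (the hit sets here are made up, not from the real runs):

```python
def total_diversity(sessions):
    """Count of distinct addresses hit anywhere across the pool(s);
    the 'peak' is this value's maximum over the whole run."""
    covered = set()
    for hits in sessions:
        covered |= set(hits)
    return len(covered)

pool_one = [{"A", "B", "C"}, {"A", "B"}]
pool_two = [{"A", "D", "E"}, {"F"}]
assert total_diversity(pool_one) == 3               # one pool alone: A, B, C
assert total_diversity(pool_one + pool_two) == 6    # both pools: A through F
```

Note how pool two's sessions add D, E, and F even though individually they're less fit, which is exactly the behavior the pools and the niching boost are trying to preserve.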
I didn't even know that happened when I just installed GFTP and started fuzzing it without even really using it. So when the bubble popped up and I saw that a user had logged in, I was like, wow, that's really cool. It worked. So I was very happy and impressed about that. And here's just a picture showing that it found a bug, and it's kind of dumping the registers to the screen and stuff, which it also dumps to a file and to the database. And you can query the database later. And that's how, overall, you can tell how many total crashes you got, right? Go ahead. Well, you mean does the debugger have to be on the same machine as the process you're debugging? Yeah, kind of. So the answer is yes. I mean, if you wanted to fuzz a Cisco router, then this wouldn't work, right? Because PyDbg isn't ported to Cisco routers. In fact, PyDbg only works right now on Win32. So the target needs to be a Win32 application. There's people porting it; I know Charlie's porting it to OS X, and there's other people porting it to Linux. But if you're on Windows and if your target is on another Windows box, you can actually debug remotely. You could have the debugger running on a different Windows box, as long as they can talk to each other. You can run PyDbg remotely. So it's kind of, yes, it's kind of an answer to both of those. Again, yeah, feel free to ask questions any time throughout. So this is just showing the coarseness of the landscape initially, how it's not particularly smooth. This is called Best Fitness. And you see where the big spike is, where it learns how to log in. And then shortly after it learns how to log in, it starts finding bugs. And this is just the same kind of graph smoothed out. So basically, the moral of those two graphs is that after it learned how to log in, it started finding bugs pretty quickly, shortly thereafter. Which is good to know.
This graph is just intended to show the same thing as the earlier graphs. The green line on the lower right-hand graph, you see how it kind of keeps creeping up. It has an overall upward trend, as opposed to the red line, which is one pool, where the diversity kind of went down and up. So it's just to show you again that same thing, that you can help maintain diversity with pools. And this one just kind of shows that 10 pools did better in terms of finding unique crashes. So in the first case, with only one pool, I didn't even count; it's like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 unique crashes that it found, bugs, if you will. But using 10 pools, it found like 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, I'd have to count them up, but it found more unique crashes. And that's really key, that's really important, and it sort of speaks to the diverse nature of the fuzzing. So let's see, that kind of went through that pretty fast. Good. Future work: porting it to fuzz files, now, that's a big one, because file fuzzing is kind of like the bomb right now in terms of fuzzing, right? Because fuzzing network applications is pretty hard, and fuzzing applications that receive files is still pretty doable. Go ahead. Right, so the question was, how long does it need to run, like minutes, hours, days; specifically, how many generations, is what you asked. It's not terribly long. It is long, maybe like a day or something; it's not like months or anything like that, but it takes a little while. It's a little bit slow, because each session takes, you know, say, I think my default setting is a session should take about 0.6 seconds, and if you have 100 sessions in a pool, then it takes 100 times 0.6 seconds, about a minute, just to do one generation, basically. There's a little overhead at the end of the generation when it does the breeding and stuff, but not terribly long. So, go ahead. Right. Yeah, that's an interesting point.
Good to see you again too, by the way. Yeah, so the question is, how do you know that it's unique bugs you're finding, right? Because that's what you want to find in this sort of thing. At the end of the day, you don't want to say I found 30,000 crashes and then have to go and figure out what all 30,000 of those crashes are. And, well, I guess I sort of half solve the issue, in that they're binned by exception address, basically. So as long as they have a unique exception address, and by the exception address, I mean the actual line of code, right? As long as EIP is different for each crash, then they get put into a different bin. You can see that here. I'm sure you can't read the numbers, but in the first case, see how the big portion of the graph is purple? The actual number above that is like 31,000. So I found 31,000 of that bug. That's just one bug, basically. So it only gets counted one time, right? And that's true of all those, right? Because it's going to keep hitting the same bug over and over and over again. So I actually thought about adding a sort of fitness penalty for sessions where you've already found a crash, to sort of penalize them if they keep hitting the same crash and not let them keep growing down that road. But it didn't really seem to be a problem, because eventually a session that doesn't crash will get further and become more fit. So it kind of ends up doing that by itself over time. You just get a lot of repeated crashes. But as long as you can bin them like this, it doesn't really matter. You only count it once. So, go ahead. Okay, I didn't quite hear that; I'm sorry. You're asking whether the basic building blocks you give it in a seed file are critical? Well, it does both, actually. So when you give it a seed file, it doesn't just pick building blocks from there, right? There's, I forget if it's a 30 or a 60, I don't remember what the percentage is.
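Binning by faulting address can be sketched like this (a toy illustration; the real tool also dumps full register state to a file and a database, and the addresses here are made up):

```python
from collections import Counter

def bin_crashes(crash_eips):
    """Group raw crashes by faulting address (EIP): 31,000 crashes at
    the same address count as one unique bug."""
    return Counter(crash_eips)

crashes = [0x08041234] * 31000 + [0x0804A0B0] * 3
bins = bin_crashes(crashes)
assert len(bins) == 2             # two unique bugs...
assert bins[0x08041234] == 31000  # ...one of them hit 31,000 times
```

EIP-only binning is a coarse heuristic: two distinct bugs that crash at the same instruction land in one bin, but it's good enough to keep 30,000 raw crashes from becoming 30,000 triage tasks.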
But there's a percentage, basically, whenever it goes to build a token. It's sort of like, okay, I'm going to build a token. Should I grab it from the seed file? Or should I build it randomly? And if it grabs it from the seed file, it's not just a piece of ASCII or binary data; there's a type attached to it as well, so that wasn't a perfect explanation of the seed file. If you actually look at, and you can download all this code, if you look at one of the seed files that's in there, you'll see that you give it a type too. So you'll say, like, the length is four, the type is ASCII, and here's the data. So it might grab that token and just use it, or it might just totally randomly create one, right? Like, say I want to create a binary token that's 2,000 bytes long. Okay, five minutes. So it kind of does both. Did I completely answer your question? Your question was, what if you don't give it a seed file at all, and you just truly, truly go random? It still will evolve. Like I mentioned, Charlie actually tried this, right? He didn't give it a seed file against a binary protocol, and it actually did get better over time, right? Because it was flipping bits and stuff like that, and eventually found that this sequence of bits did better than that sequence, right? It doesn't really know the difference anyway. So particularly if it's a small binary protocol, maybe you don't want to give it seeds, but for ASCII protocols, I think you really want to give it seeds, because it's going to take it a long time. Like, before it can really do much fuzzing, if it needs to find user, space, Jared, carriage return, line feed, pass, space, Frank, carriage return, line feed, it's going to take it a long time to find those strings, right?
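The token-building choice might look roughly like this; the 60% figure, the seed-file field names, and the size limit are my guesses for illustration, not the actual GPF/EFS defaults:

```python
import random

SEED_TOKENS = [  # seed tokens carry a type and length, not just raw data
    {"type": "ASCII", "len": 4, "data": b"USER"},
    {"type": "ASCII", "len": 1, "data": b" "},
    {"type": "ASCII", "len": 2, "data": b"\r\n"},
]
SEED_PROBABILITY = 0.6  # assumed; the talk says it's either 30 or 60 percent

def build_token(rng):
    """Either reuse a typed token from the seed file, or invent a
    random binary token of random length."""
    if rng.random() < SEED_PROBABILITY:
        return rng.choice(SEED_TOKENS)["data"]
    length = rng.randint(1, 2000)
    return bytes(rng.randrange(256) for _ in range(length))

rng = random.Random(1)
session = b"".join(build_token(rng) for _ in range(5))
```

With seeds, strings like "USER" appear in the gene pool immediately; without them, the evolution has to rediscover those exact byte sequences by chance, which is why ASCII protocols benefit so much from a seed file.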
Okay, I think I have time for maybe one or two more questions. So, let's see, I didn't really finish the challenges slide, normal fuzzing challenges. File fuzzing, I mentioned that early on, that depending on your target, it's going to make a big difference in how you go. So I think I mentioned most all of these anyway. So, any more questions? Go ahead. Micro-economies? Yeah, I think I've heard of that. Yeah. Right, so the question, I think, overall is sort of, have you tried other evolutionary techniques, other genetic algorithm techniques and stuff like that? The answer is I haven't; this is all I've tried. It probably wouldn't hurt to try, right? It just takes a long time to implement all that kind of stuff, so I haven't done that yet. Go ahead, is there another question over here? I think that's it. I will be going to the question room after this, so if people want to talk more, please follow me there. Thank you very much.