Hi, hello, and welcome back to the program analysis course. This is the lecture on random testing and fuzzing, and this is part 3, where we will look into a form of random testing called gray-box fuzzing. In particular, we will look into a tool called AFL, which implements many of these ideas in a practical tool that is widely used in industry and also in academia. Most of what I'm saying is based on a description of the AFL tool that you can find here, so if you want more details, please look it up there. Of course, you can also try out the tool yourself; it's very easy to get hands-on experience with gray-box fuzzing. Let's start by defining what gray-box fuzzing actually is. It's a form of fuzzing where you want to guide the generation of new inputs toward a specific goal, and if you have a specific goal in mind, you need some way to measure whether you are actually making progress toward that goal. In gray-box fuzzing, this is done through some lightweight program analysis, which tells you something about what is happening inside the program; this is then used as guidance to tell you whether you're getting closer to your goal. For example, the goal could be to increase the coverage of the program and try to cover more and more lines or basic blocks. It's a lightweight program analysis because it's not analyzing all the details of what's happening in the execution; it's not, for example, looking at the different conditions that made the program take a particular path, but only something like coverage, which can be measured with relatively low overhead and still gives some information about whether you're getting closer to the fuzzer's goal. Given this overall idea, gray-box fuzzing typically consists of three steps.
The first step is to randomly generate some inputs: some random input is created and then given to the program. While the program is executing, the fuzzer gets some feedback from the test execution; for example, if the goal is to increase coverage, then this feedback would be what code is actually covered, and whether we cover some new code, which tells us which of the generated inputs are actually useful in the sense of getting closer to the goal. The third step is to mutate these inputs in a more or less random way, but to focus on those inputs that have been good in terms of getting closer to the overall goal. For example, the fuzzer would mutate only those inputs that have covered new code, hoping that if you have covered some new code, then maybe there's more code to be covered on top of that if you mutate the inputs a little bit.

This overall idea of gray-box fuzzing could be implemented in many different ways, and indeed there are many different implementations; in this lecture we'll focus mostly on a tool called American Fuzzy Lop. American Fuzzy Lop is also the name of a rabbit breed, which you can see here. It's a very cute rabbit, I know, but rabbits are not what this lecture is about; after all, it's a lecture on program analysis. Instead, we're going to talk about American Fuzzy Lop as a simple yet effective fuzzing tool. It's abbreviated AFL, so if you just google for AFL and fuzzing, you should find the tool and can play around with it. It's a tool that targets C and C++ programs, and the inputs in this case are, for example, files that are read by the program, or more broadly, all the inputs that the program as a whole is taking. So it's typically used to fuzz entire programs. You could adapt it to fuzz, for example, individual functions, but the most common use case is to fuzz entire programs, and these programs are typically written in C and C++.
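The three steps described above can be sketched in a few lines of Python. This is a toy simulation, not AFL itself: the example program, the mutation, and the feedback format are all made up for illustration.

```python
import random

# Toy stand-in for the program under test: its "feedback" is simply the
# set of branch identifiers the execution reaches (hypothetical names).
def toy_program(data: bytes):
    covered = {"entry"}
    if data[:1] == b"F":
        covered.add("magic-byte")
        if data[1:2] == b"!":
            covered.add("deep")
    return covered

# Step 3: mutate a previously useful input by overwriting one random byte.
def mutate(data: bytes) -> bytes:
    i = random.randrange(len(data))
    return data[:i] + bytes([random.randrange(256)]) + data[i + 1:]

def fuzz(seeds, iterations=5000):
    queue = list(seeds)   # inputs considered worth mutating further
    seen = set()          # all branches covered so far
    for _ in range(iterations):
        candidate = mutate(random.choice(queue))  # step 1: generate an input
        covered = toy_program(candidate)          # step 2: feedback from run
        if not covered <= seen:                   # did it reach anything new?
            seen |= covered
            queue.append(candidate)               # keep only useful inputs
    return seen
```

Inputs that reveal nothing new are discarded immediately, while inputs that uncover new coverage go back into the pool for further mutation, so over time the fuzzer works its way past the guards in `toy_program`.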
AFL is widely used in industry: it has been supported by Google for a couple of years now, and Google and other companies are using it to find a lot of security-related bugs. In particular, they're using it to find bugs in open-source software such as OpenSSL, PHP, Firefox, and many others, and by now AFL has found hundreds of security-critical bugs in these programs, which have probably prevented a lot of attacks that otherwise could have happened.

I'll now give you an overview of the AFL tool, and then we'll look into the individual components of the tool separately. The core concept of the tool is a queue of test inputs, which holds a set of test inputs that have already been used and which will then be used to generate new inputs. To initialize this queue, what is often done is to provide a set of seed inputs, which you can think of as a small set of realistic inputs that a human provides, knowing that these are inputs that are typically used to exercise the program under test. These seed inputs help the fuzzer get into interesting parts of the program more quickly: if you start with completely random inputs, it'll take a while until the fuzzer finds out how to bypass simple sanity checks that may be at the very beginning of the program, whereas if you start with realistic seed inputs, you easily bypass these sanity checks and can then try to cover code deep inside the program under test. Given this queue of test inputs, there's a main loop that is executed over and over again until the fuzzer runs out of time. It starts by choosing one of the inputs in the queue to use next, and there are different ways this could be done. Let's assume we are taking some test input t. What happens then is that this input is mutated, meaning that a different variant of this input is generated, and this is done not only once but multiple times.
So we are generating multiple different test inputs t', t'', and so on from this one input, and each of these inputs is used to exercise the program. By doing this, we get some feedback from the execution of the program, which we can then use to determine whether an input is actually interesting. Now, what does interesting mean? We'll see exactly what it means; it's basically based on coverage, with some more tricks involved. If an input is not interesting, it is discarded, and we will never look at it again, whereas if it is interesting, we put it into our queue, so that the queue then has more inputs, and this loop continues over and over. So the fuzzer repeatedly mutates inputs that have been found to be interesting until it has either, theoretically, covered all code, which in practice rarely happens, or until there's a timeout and the user decides that this was enough fuzzing.

Before looking into some of these components in more detail, let me get back to one of the core concepts of program analysis that we have already looked at early in this lecture: the control flow graph. Just to refresh your memory, the control flow graph is a graph representation of the code where the nodes correspond to statements in the program and the edges tell you in which order these statements may be executed, one after another. In the control flow graphs that we have looked at so far, every statement was represented as one node. What I want to do now is show a variant of this idea where we have so-called basic blocks, which are groups of statements that are always executed together.
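This grouping rule can be sketched algorithmically: a new basic block must start at the entry point, at every branch target, and right after every branch. Below is a small Python sketch over a made-up instruction format; the encoding and the statement names are hypothetical, just for illustration.

```python
# Toy instruction formats (hypothetical):
#   ("op", name)        -- ordinary statement, falls through to the next one
#   ("branch", t, e)    -- conditional jump to index t or index e
#   ("jump", t)         -- unconditional jump to index t

def basic_blocks(instrs):
    # "Leaders" are the indices where a basic block must begin: the entry,
    # every jump/branch target, and the instruction right after any
    # jump/branch (since the jump or branch ends its own block).
    leaders = {0}
    for i, ins in enumerate(instrs):
        if ins[0] in ("branch", "jump"):
            leaders.update(ins[1:])
            if i + 1 < len(instrs):
                leaders.add(i + 1)
    starts = sorted(leaders)
    # Each block runs from one leader up to (but excluding) the next.
    return [instrs[s:e] for s, e in zip(starts, starts[1:] + [len(instrs)])]

# A shape like: a; if (x > 3) { b; c; } else { d; }  e; f;
program = [
    ("op", "a"),        # 0
    ("branch", 2, 5),   # 1: the check, e.g. x > 3
    ("op", "b"),        # 2
    ("op", "c"),        # 3
    ("jump", 6),        # 4: skip over the else branch
    ("op", "d"),        # 5
    ("op", "e"),        # 6
    ("op", "f"),        # 7
]
```

For this program, `basic_blocks(program)` yields four blocks: a together with the check, b with c, d alone, and e with f, which matches the merging rule defined above.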
As a simple example, let's consider a program where we have some statement a (this could be any single statement; I'm not writing down the detailed statements here), then some check, say whether x is larger than 3. In the then branch we have two statements b and c, and let's say we also have an else branch with a statement d. Once the if statement is over, there are two more statements, e and f, that are executed afterwards no matter which of the two branches we've taken. The way we have drawn control flow graphs so far would look as follows: we would have one node for statement a, then another node for the check that x is larger than 3, and this node would have two outgoing edges, one toward statement b followed by statement c, and one toward statement d. Then, no matter which of these two branches we take, we reach statement e, which is followed by statement f. This is what we've done so far. If we want to use basic blocks, the idea is that all nodes that are definitely executed together, because there's no branching statement in between, are merged into a single node. So let me write down this idea: a basic block is essentially a sequence of operations or statements that is always executed together, and the reason this happens is that there is no branch in between. Doing this for our simple example, we get a single basic block for statement a and the check that x is larger than 3, because this check is always executed after a, simply because there's no other branch in between. Then we would have two outgoing edges, similar to the other representation, but now b and c would be in the same node, simply because they are always executed together: there's nothing in between, and when b is executed, we know for sure that c will follow. On the other hand, we have the single statement d, which has a basic block
on its own. Then, after the if statement, we have the two statements e and f, and again, because they are definitely executed together (f always follows e), they are put into the same basic block. So our control flow graph, if we use basic blocks, will look like this: we have these four basic blocks, which still represent exactly the same flow of control, just with fewer nodes.

Now that you know what basic blocks are, let's get back to AFL and have a look at what kind of feedback AFL is actually using. The main feedback is coverage: the goal of AFL is to cover as much code as possible. Now, when I say coverage, I could mean different things, because there are a lot of different coverage metrics that people use. Sometimes when people say coverage, they mean line coverage; sometimes they mean statement coverage, which is not necessarily the same, because a statement may span multiple lines, and a single line may contain multiple statements. Sometimes people mean branch coverage, and sometimes they mean path coverage, which looks at entire paths through the program and tries to cover as many of those as possible. So one lesson here: whenever someone says coverage, you should ask what kind of coverage. What I actually mean here when I say coverage is branch coverage, and specifically branch coverage where the branches are the edges between the nodes in a control flow graph, that is, branches between basic blocks. Why is this the coverage metric that AFL uses? The reason is that just reaching a code location may not be enough to trigger a bug; sometimes the state that brings you to that particular code location also matters. If we just looked at, say, statement or line coverage, the fuzzer would basically try to reach every code location once, but it wouldn't reach it in different states, and by looking at
branch coverage, AFL tries to also incorporate, into the coverage metric, some of the state that the program has when reaching a particular location. A more precise way of measuring coverage would be, for example, path coverage, but there can be infinitely many paths in a program, and tracking all the different paths would be much more expensive than just tracking the edges in the control flow graph. So in a sense, branch coverage is a compromise between the two factors that always determine this kind of analysis: on the one hand, the effort spent on measuring the coverage (as we'll see, you can measure branch coverage in a relatively lightweight way), and on the other hand, the benefit you get from it, that is, the guidance this information provides to the fuzzer. It turns out branch coverage is a nice sweet spot between a more precise measurement like path coverage and a less precise measurement like line coverage.

As an example, let's consider a few executions of some program, where I use the letters a, b, c, d, and e to refer to basic blocks. We'll see, on the one hand, the sequence of basic blocks that are executed, and on the other side of this little table, the edges, or branches, that are covered; these branches correspond to edges in the control flow graph. Say we have the sequence a, b, c, d, e. Then multiple branches are covered here: ab, which is the edge from node a to node b, then bc, cd, and de. As another example, to show you that covering the same nodes may lead to different branches being executed, consider a sequence where a, b is the same as before, but now we first execute block d and then block c, followed by block e; this may actually happen if you have some kind of loop, and maybe an if, in the program. Now, if you just looked at, say, basic block coverage or statement coverage
then these two executions would look exactly the same, because they cover the same basic blocks and hence also the same statements and lines. But if you look at the branches that are covered, you see a difference: besides ab, we now have bd, which is something we haven't covered before, and also dc and ce are branches we haven't covered before. What these two simple examples show is that by looking at branch coverage, we get significantly more information than if we just looked at simple block, line, or statement coverage.

Now that you know conceptually what AFL wants to measure, let's have a look at how this coverage measurement is actually implemented. Here are three lines that essentially show how AFL measures coverage, and they are three pretty nice lines of code, because with just three lines they measure coverage in a surprisingly efficient way. These three lines are added to the program at every branching point: AFL instruments the program by adding them at every point where there is a branch. The code depends on three global variables, called cur_location, shared_mem, and prev_location, and let's now look at what they do. cur_location contains a unique, or at least probabilistically unique, identifier for the current source code location: when AFL instruments the program, it generates, at compile time, a new random constant that it uses as the identifier for each location where it adds instrumentation code. The advantage of doing this in a randomized way is that it works well for separate compilation: even if you compile two components of your program separately and combine them when linking, this is likely to still work out, because they all have just randomly
generated identifiers for all the branching locations, and therefore there will be a minimal number of collisions. The second global variable is the array called shared_mem, which is some globally reachable memory that stores how often every edge was covered. The index into this array is the identifier of a particular edge in the control flow graph, that is, of a particular branch, and we get this identifier by combining the current location and the previous location with a bitwise XOR, which is just a quick way of combining them. This edge identifier is then used as an index into the shared_mem array, and we increment the number of times the edge has been seen: shared_mem[cur_location ^ prev_location]++. Initially, this count is zero for all edges. As the program executes, we of course also need to somehow update the previous location, and this is done by making the current location the next previous location. As an additional little trick, it is bit-shifted by one when stored: prev_location = cur_location >> 1. This is simply to distinguish a basic block a followed by another basic block b from basic block b followed by basic block a: because the bitwise XOR is symmetric, we wouldn't otherwise be able to distinguish ab from ba, but by bit-shifting the previous location by one, ab looks different from ba, and as a result AFL can distinguish these two edges, which would otherwise be indistinguishable.

On the overview figure of AFL, I said that AFL keeps the inputs that are interesting, and now let's define what interesting really means. Interesting means that an input is triggering some new behavior, and the way AFL detects whether some new behavior has been triggered is by looking at the set of edges that are covered by the input, and whenever an input
triggers a new edge, AFL says: okay, this is new behavior, because it's something I haven't seen before. There are many alternative ways to do this; for example, AFL could also look at entire paths from the beginning of the program to the end and say that whenever a new path is triggered, this is new behavior. There are two problems with that. One is that it's significantly more expensive to track, because instead of just tracking the edges in the control flow graph, you would now have to keep track of all the different paths. And of course there's also the path explosion problem, which basically means that the number of possible paths in a program execution tends to explode when you have multiple branches in the program, which you typically have, and in the presence of loops there are even infinitely many different paths. To avoid all these problems, AFL instead considers an input to trigger new behavior whenever it triggers a new edge in the control flow graph.

Let's illustrate this idea with an example. Say our first execution looks like this: we start with basic block a, then execute basic block b, then basic block c, followed by d and e. Because this is the first execution, it is obviously new, so this input is considered something we should keep; for the first execution this is always the case, by definition. Now say AFL mutates the input, which hopefully leads to a different execution path, and say we now execute a, b, and c again, but after c, instead of going into d, we get back to basic block a and then into basic block e. If you look at the edges, you'll see that there are some new ones here; for example, ca is an edge that was not covered by the first execution, and therefore this is again considered something new, so again this
input is kept, and then one of the inputs in our queue will be modified again. Say this leads to a third execution, which now looks a little different from the others because it's much longer: here we have a, b, c, followed by another a, b, c, maybe because there's a loop around these statements, followed by yet another a, b, c, and then d and e. This is a very different path from the others, because we are apparently executing some loop that makes us go through the basic blocks a, b, c multiple times, but if you look at the edges covered by this execution, you'll see that nothing in this third execution is actually new with respect to the first two, and AFL will therefore decide that this is not new, because we haven't seen any new edge here.

Now you might say: isn't this actually a bad idea? If we execute a couple of statements multiple times, because we go repeatedly through a loop, this might actually lead to new behavior. And yes, you're right; AFL has a refinement of the previous definition of new behavior which takes care of the example I've just shown, and this refinement is based on the hit counts of the edges that are executed. The idea is that for each edge that is executed, the algorithm not only keeps track of whether it is executed but also counts how often it is taken. Doing this in a very precise way would be pretty expensive, because you would have to store a distinct count for every edge, so what AFL's instrumentation actually does is use buckets of increasing size: it approximates counts by just keeping track of which bucket a particular edge's count is in. These buckets are defined as you see here: counts of one, two, and three are kept separately, then all counts between four and seven are in one bucket, all counts between eight and 15 in another bucket, and these buckets are
getting larger and larger, so that the differences within a bucket matter less and less as the hit counts grow. The reason this increasing bucket size makes sense is that the fuzzer wants to focus only on relevant differences in the hit counts, and it usually is more relevant whether a statement is executed once or twice than whether it is executed, say, 23 or 24 times: if you have executed a loop very often anyway, it doesn't matter whether you execute it one more time, but whether you execute a loop only once, or twice, or maybe three times may be more important.

All right, now that you know how AFL measures coverage, let's have a look at how it uses this information to maintain and evolve the queue of test inputs. As you've seen on the overview figure, AFL maintains this queue of inputs, and initially the queue contains only the seed inputs provided by the user; if no seed inputs are provided, it starts by filling the queue with a few random inputs. Once an input has been used, it is only kept in the queue if it has covered some new edge; that means if we have generated a new input and it doesn't reveal anything new, it is immediately discarded, because there's no need to run this kind of input again. Then, of course, we also need to put something back into the queue, otherwise it would pretty quickly be empty, and this is done by mutating the existing inputs that are considered interesting because they have covered something new; new inputs are generated from those interesting inputs by mutating them automatically. We'll see on the next slide how this mutation typically looks. Just to give you a feeling for how large these queues get: if standalone programs of reasonable size are tested, then typical queue sizes are between 1,000 and 10,000, so there's a significant number of inputs
in this queue, but it tends to remain at a size that is still manageable.

Let's now have a look at the mutation operators that AFL uses to generate new inputs from those inputs that are considered interesting. This is done through a set of mutation operators, which are basically random transformations applied to the inputs. AFL considers the input of a program to be just a plain sequence of bytes; it doesn't know anything about the structure of these inputs or about what the bytes mean, and all mutations are done at the byte level, by mutating one or more of these bytes in some way. Here's a subset of the mutation operators that AFL uses, and this is actually an easy extension point: you can improve the fuzzer by just adding more mutation operators. One of these operators is bit flips: it flips bits, not just once but with varying lengths and stepovers, so it may, for example, flip three bits, then step over the next five bits, leaving them as they are, then flip another three bits, and maybe do this repeatedly, hoping to produce some interesting patterns. Another mutation operator is the addition and subtraction of small integers: it interprets some of the bytes as integer numbers and then adds or subtracts small values. Another operator is the insertion of interesting integers: it focuses on integers that are known to be interesting, for example zero, one, or the maximum integer, simply because these extreme values often trigger bugs, for example when there's an off-by-one error somewhere, or a boundary condition is not correctly checked by the program. And yet another mutation operator is splicing, which essentially means that two inputs are combined by taking the beginning of one input and the end of another and putting the two pieces together, in order to get a
new input, which sometimes may lead to something interesting. Of course, if you think about structured input formats, splicing could also produce garbage input that the program does not consider legal, but at least sometimes it will work, and because the whole idea of fuzzing is based on trying out many different inputs, it's still worth trying even if it does not always work.

We've covered the basic idea of AFL now, and before concluding this video, I want to mention a few more tricks that people have worked on for improving the efficiency and effectiveness of an AFL-like fuzzer. There are many, many papers and many variants of the AFL tool that people have implemented by now, so this is just a small subset of the possible tricks you can play to fuzz faster, more efficiently, and more effectively, but these are three I'd like to mention here. One of them is to use time and memory limits for the executions. The idea is that if you happen to generate an input that, for whatever reason, takes a lot longer than most inputs for this program, then this input will slow down your entire fuzzing process without providing much benefit, so an input is discarded whenever its execution is too expensive: when it runs for more than some predefined time, or when it takes more than some predefined amount of memory, the program execution is simply interrupted and the input is discarded immediately. Another trick is to periodically prune the input queue by selecting a subset of the inputs currently in the queue that still covers every edge we have seen so far. As the queue of test inputs evolves, different inputs in the queue may end up covering the same edges, simply because of the way the queue has evolved, and by periodically removing these redundant inputs, the queue stays
not only smaller but also more focused on new, interesting behaviors. A third line of work is about prioritizing how many mutants to generate from a specific input taken from the queue. In the algorithm as I've explained it so far, AFL always just takes one of the inputs from the queue and applies some fixed number of mutations to it, but what you can do instead is somehow estimate how interesting an input is and how many mutants should be generated for it. By doing this, we can control which inputs AFL puts more effort into, and in this way have AFL reach interesting behavior faster. For example, one way to do this is to focus on unusual paths triggered by an input and to fuzz that input more, that is, to generate more mutants from an input that has executed an unusual path, hoping that behind that path there may be more code you would like to cover, and that you could possibly cover by mutating this specific input more than you mutate others.

Finally, let me close by quickly mentioning some of the real-world impact that AFL, and gray-box fuzzing in general, have already had in practice. AFL was initially developed by a single person as an open-source tool, and because it was surprisingly effective at revealing security-critical bugs despite being relatively simple, a whole team at Google is now working on the tool. It's still open source, and various improvements have been proposed by different companies, including Google but also others, and by various people in academia who have actively worked on fuzzing over the past few years. The resulting fuzzers, which are all basically AFL-style fuzzers, but in many different variants by now, are regularly used to check various security-critical components, and by now they have not only burned through thousands of compute hours but also discovered hundreds of security vulnerabilities, which
otherwise would have enabled attackers to exploit some really important systems, but fortunately have been found through these fuzzers before anyone could take advantage of them. All right, this is already the end of the third video in this lecture on random testing and fuzz testing. I hope you now have a better idea of what random testing can actually achieve: it's not just about passing purely random inputs into a program and hoping that something interesting happens; typically, there's some kind of feedback mechanism that helps the random test generator generate better inputs. This feedback can come in different forms, and in this last video of the lecture we've seen one form, where coverage feedback is used to generate new inputs that hopefully trigger new behaviors quickly. Thank you very much for listening, and see you next time.