Hi, and welcome to program analysis. In this lecture, we will look at symbolic and concolic execution, which are techniques for generating inputs to execute a program without needing a human to provide these inputs. In contrast to the techniques that we've seen in the last lecture, which was on random testing and fuzzing, we'll now look into a white-box test generation technique, which actually looks into the program and analyzes it in some detail in order to find inputs that will execute some new behavior in the program.
Here's an overview of what we're going to do in this lecture. The lecture consists of four videos, and we're now in the first one. In this first video, I will give an introduction to what symbolic execution actually is and give an overview of what I call classical symbolic execution, which is the pure form of symbolic execution. Then we will talk a little bit about the challenges that this pure form of symbolic execution is facing, and look at one approach called concolic testing that addresses some of these challenges in an effective way. And finally, in the last video, we will have a quick look at some of the larger-scale applications of these techniques in practice.
Everything I'm saying here is based on research papers, and in particular I can recommend the three papers that you see down here. So if you're interested in more details on the techniques that I'm describing, you should have a look at these papers, which describe them in more detail and also provide many more examples and empirical results than I do here.
Let's start by having a look at what symbolic execution actually is. Symbolic execution is a technique to reason about the behavior of a program by executing it with so-called symbolic values.
These symbolic values can be thought of as placeholders for the actual values. They may be associated with some information about what the actual values must look like, but they are not as concrete as the normal input values that you would feed into a program. So instead of feeding maybe three and five and 26 into a program as an input, we reason about these inputs symbolically and just give them a symbol. And then, while we reason about different executions of the program, we impose more and more constraints on these values.
This idea of symbolic execution is pretty old, at least in terms of what old means in computer science, because it was proposed by King and then refined a little later by Lori Clarke in the 1970s. It has then been around for quite a while without really being that popular, but around 2005 it became popular and practical again because of advances in a different subfield of computer science, namely constraint solving. Specifically, there have been very impressive advances in SMT solving, and we'll see that this is a critical component of the whole idea of symbolic execution. So whenever SMT solvers get much faster, symbolic execution also works much better. Around 2005, the point was reached where SMT solvers were fast enough to make symbolic execution practical. A lot of people then got interested in this pretty old idea again, and nowadays it's again a topic that draws a lot of attention.
Before looking into the details of how symbolic execution works, let's illustrate the idea of this technique using an example. The example is what you see here. It's a function that happens to be written in JavaScript, but the language doesn't really matter. Any imperative language would work. This function takes three arguments, A, B, and C, which are the inputs to this program. So program here means a function.
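The slide with the function is not part of this transcript. Based on the spoken description that follows, the function presumably looks roughly like the sketch below. The name `f`, the helper `assertHolds`, and the exact layout are assumptions, not the original slide code.

```javascript
// Hypothetical helper standing in for the assertion on the slide.
function assertHolds(condition) {
  if (!condition) throw new Error("assertion violated");
}

// Reconstruction of the example from the spoken description:
// three inputs, three locals, two top-level branches, one nested branch.
function f(a, b, c) {
  let x = 0, y = 0, z = 0;
  if (a) {
    x = -2;
  }
  if (b < 5) {
    if (!a && c) {
      y = 1;
    }
    z = 2;
  }
  // The question of the lecture: can x + y + z ever be 3?
  assertHolds(x + y + z !== 3);
}
```

Calling `f(1, 1, 1)` corresponds to the concrete execution walked through next.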
And then in the function itself, we have three local variables X, Y, Z, which are initialized to zero. Then we have a couple of conditions that are checked on the way, which are always conditions on the inputs, so on A, B and C. And depending on these conditions, we may, for example, take this path here or this branch here where we are assigning minus two to X, or we may go into this branch here where inside we are checking another condition which may lead us into this assignment and that assignment. And at the end, there's this assertion which is checking whether the sum of X, Y and Z is not equal to three. And the question is, well, is it actually possible that this assertion is violated? So is there some path through this code where X, Y and Z happens to be, yeah, if you sum them up, happens to be three. Before looking into the symbolic execution of this function, let's at first have a look at a concrete execution of this piece of code, which is nothing really surprising. It's exactly what you would expect to happen when you concretely execute this piece of code. And in this concrete execution, because it's concrete, we need to have concrete values for all the inputs. So for this example, let's assume that A and B and C are all equal to one. So these are our concrete inputs. And then the program is executing and starts by assigning values to these local variables X, Y and Z. So, and these are all initialized to zero. Then we are reaching the first if, which checks whether A is true. A is one, one will be converted into the Boolean true. So the first condition is true and therefore we are taking this first branch where we are now assigning minus two to X. Then we are reaching the second branch where we check whether B is smaller than five. B is one, which is smaller than five. So this is also true, which means we are entering this branch and we'll check the nested if, which is checking that not A and C is true, which happens to be false. 
Yeah, because A is one, so A is true, which means not A is false. So this whole condition is also false. So we are not assigning anything to Y, but we still have this other statement in the outer branch where we are assigning two to Z. And then we are reaching our assertion where we can now sum up our values X, Y and Z where we find that we have minus two plus zero plus two. And this happens to be zero, which is not equal to three. And that means here our assertion check is successful and we have not found a violation of this assertion. So the concrete execution wasn't really surprising because you've concretely executed programs a lot of times. Let's now have a look at how the symbolic execution of this piece of code will look like. So in the symbolic execution, we again start with the input values, but this time we are not assigning concrete values to them, but we are assigning so-called symbolic values to them. So we start by saying that A is A zero where A zero is the symbolic value that represents the initial value of A. So basically the input that is given to this function and we do the same for B and C by also assigning B zero and C zero to them. So these values here are called symbolic values. Given these symbolic inputs, we are now starting the execution of the program and as before, we will now assign these initial values of zero to X, Y and Z. Next, we are reaching this first if and now in contrast to the concrete execution where we are taking one of the branches, but not the other one, we will now consider both possible behaviors because we wanna symbolically execute this program and while doing this, we will consider all possible behaviors of the program. So we reach this branching point where we are checking a condition and this condition is A in the source code. A currently has the value A zero, so the initial symbolic input that is given to the program. So what we're checking here is whether A zero is true or false. 
This check can have two outcomes, either it's true or it's false and then depending on whether the condition is true or false, we will or will not execute this assignment of minus two to X. So if it is true, then on this edge here, we are executing this assignment, whereas on the false branch, we do not execute it. Now, no matter whether we have executed this assignment or not, the next statement that we will reach in this program is the second if where we are checking whether B is smaller than five. If we look at the value that B currently has, we see that it has the symbolic value B zero. So actually this check means that we are checking whether B zero is smaller than five and this happens both in the case where we have taken the first branch and in the case where we have not taken this first branch. And again, because this is a Boolean check, there are two possible outcomes, true and false. In the case where this check is false, we are at the end of the program because apart from the assertion statement, which I'm omitting here now, nothing will happen. So basically we will not write anything else here and here because we are at the end of the program if this second if is false. If this check returns true, then we are going into the nested if, which means we will have to do another check and now expressing the condition of this other check. Again, in terms of the symbolic values means that we are checking that not A zero and C zero. And please note that I'm now switching to this logic notation instead of the programming language notation and we'll see why I do this. The reason is basically that we wanna feed these logical formulas into a constraint solver later on. So in the case that this check is true, so again we have these two possible branches here, true and false. And in case this check is true, we will execute this assignment of one to Y and then afterwards we'll have this second assignment of two to Z. 
And in case the check returns false, we do not perform the first assignment, but we do perform the second one. So in that case, this is the statement that follows. The same happens essentially in the other execution path, which we have here on the right. If this check here returns true, then we also check whether not A zero and C zero, which again may be true or false. And similar to the other side, if it's true, we perform these two assignments, and if that check returns false, we only do this one assignment here.
Now this tree that I've drawn here is called an execution tree, because it represents all possible executions of this piece of code. One other thing we can do is to look at the conditions that must hold in order for us to take a particular path through this program. We can do this by starting at the root, going down to one of the leaves, and combining all the conditions that we see on the way with a logical AND.
Let's do this for this first path, which ends here. The condition for taking this path is that A zero is true, and B zero smaller than five is true, and not A zero and C zero is true. If you look at this condition closely, you'll see that there's a contradiction in it, because A zero and not A zero cannot both be true. So just by looking at these conditions, we know that this path is infeasible: it's not possible to execute this path with any input that the program could get.
Now we can do the same for all the other paths and write down, for each of them, the conditions that must be true to take that particular path. For the second path, we need A zero and B zero smaller than five, and now I need to negate this condition up here. So this will be not (not A zero and C zero).
Now we can do the same for this path, for example. We take that path if A zero is true and, now negating this condition here, B zero is greater than or equal to five. The same can be done for the paths on the right-hand side of the tree. For example, here we have not A zero, because we are taking the false branch of this conditional here, and B zero smaller than five, because the second check returns true, and then, because the last check also returns true, not A zero and C zero. For this one down here, it's a little different: here we have not A zero, and B zero smaller than five, and not (not A zero and C zero). And finally, the rightmost path is executed if we have not A zero and B zero greater than or equal to five, because that's the negation of this second check.
In addition to the conditions for taking a particular path, we can also look, for each of these paths, at the values that our local variables will have, and compare these values to what the assertion requires. If we do this, we see that there actually is a path where we can violate the assertion, namely the one that leads us down here. If we take this path, Z will be two, Y will be one, and X will be zero, which means our sum is zero plus one plus two. This is equal to three, which means that our assertion is violated. And because we also see that the path condition can be fulfilled, so there are values for A zero, B zero, and C zero that satisfy it, we know that there is an input to this function that violates our assertion.
All right, now that you've seen this idea of an execution tree on an example, let me define what this execution tree actually is. As the name says, it's a tree, and more specifically it's a binary tree. So basically every node has exactly two children.
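Returning to the three-input example for a moment: the path-by-path reasoning above can be sketched in code. This is only an illustration under assumptions. The branch structure is hard-coded from the spoken description, and `findInput` is a hypothetical brute-force search over a tiny input domain that stands in for a real constraint solver, so a `null` result here only means that no witness exists in that domain.

```javascript
// Each path fixes the outcome of the branch conditions:
//   c1: a0      c2: b0 < 5      c3: !a0 && c0  (only checked when c2 is true)
const paths = [];
for (const c1 of [true, false]) {
  for (const c2 of [true, false]) {
    for (const c3 of (c2 ? [true, false] : [null])) {
      // Final values of the locals along this path.
      const x = c1 ? -2 : 0;
      const y = c3 === true ? 1 : 0;
      const z = c2 ? 2 : 0;
      // Path condition as a predicate over the symbolic inputs.
      const holds = (a0, b0, c0) =>
        Boolean(a0) === c1 &&
        (b0 < 5) === c2 &&
        (c3 === null || Boolean(!a0 && c0) === c3);
      paths.push({ decisions: [c1, c2, c3], sum: x + y + z, holds });
    }
  }
}

// Stand-in for a solver: try a handful of representative inputs.
function findInput(holds) {
  for (const a0 of [0, 1])
    for (const b0 of [0, 5])
      for (const c0 of [0, 1])
        if (holds(a0, b0, c0)) return { a0, b0, c0 };
  return null; // no witness in this small domain
}

const report = paths.map(p => ({
  decisions: p.decisions,
  sum: p.sum,
  witness: findInput(p.holds),
}));

// Paths that are feasible and end with x + y + z === 3.
const violating = report.filter(r => r.witness !== null && r.sum === 3);
console.log(report, violating);
```

The sketch yields six paths, one of which has no witness (the contradictory one), and one feasible path whose sum is three.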
And what this execution tree represents is the set of all possible execution paths through a given piece of code. In this tree, we have nodes and edges. The nodes represent conditional statements, so for every branching point in the program, we have one node that represents the conditional that is checked at this branching point. The edges of the tree represent the execution of a sequence of non-conditional statements, so basically everything that happens in between these branching points. And if you look at a path in this tree that starts at the root and goes down to one of the leaves, it represents an equivalence class of inputs that lead the program along exactly the same sequence of statements.
As a little quiz, or maybe homework, for you, I have another piece of code here, and I would like to invite you to stop the video at this point and draw the execution tree for this function yourself, because that's the best way to check whether you've understood this idea of an execution tree. And as a kind of checksum, the question for you is how many nodes and edges this execution tree has. So please stop the video here and don't continue until you've done it. Once you have done it, I will just let you know that the solution for this checksum is that there are three nodes and seven edges in this graph. I will not give you the concrete execution tree, but you're welcome to share yours, for example through Ilias, to double-check whether your solution is correct.
Apart from the execution tree, we've seen another concept in the example that I've shown: the concept of symbolic values and of a symbolic state of the program. The idea behind these symbolic values and the symbolic state is that all values that are unknown, so basically all inputs that are given to a piece of code, are kept symbolically.
So everything that comes, for example, from the user or is read from a file, so where we do not really know by just looking at the source code what the value will be is kept symbolically and is represented as a symbolic value. And then while the program is executing on the symbolic values, we have this idea of a symbolic state which maps the variables in the program to some combination of the symbolic values which represents the state that these variables will have when the program is executing. So looking at this example down here where we have just a simple function that takes two arguments X and Y and then does some computation on these arguments, we would consider the input values X and Y as something that needs to be handled as a symbolic value. So we would call the initial values that X and Y are having, X0 and Y0. And then we would, for example, do this computation here which adds these two symbolic values and writes them to this local variable Z. And we would represent the value of this local variable Z as a symbolic state which in this case would be the sum of X0 and Y0. Just note that X and X0 are not the same. So if somewhere in here, for example, we would now assign some other value to X which we could, of course, do, let's say X plus one, then the symbolic state of X would be X0 plus one. So X and X0 do not have to be the same but X0 is just the initial value of a symbolic input. Given this symbolic representation of the inputs and of the state of the program you can now reason about the conditions that must hold if a particular path in the program is executed and this is done using so-called path conditions. So that's again a concept you have already seen in the example. So what I'm going to do here is just explain it and define it in a more general way. 
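The symbolic state just described can be sketched as a map from program variables to small expression trees. The representation below (`sym`, `num`, `add`, `show`) is a hypothetical encoding chosen for illustration, and the two statements mirror the example from the slide as described: `z = x + y`, followed by a later `x = x + 1`.

```javascript
// Symbolic expressions as tiny ASTs (hypothetical representation).
const sym = name => ({ kind: "sym", name });
const num = value => ({ kind: "num", value });
const add = (left, right) => ({ kind: "add", left, right });

// Render an expression as text.
function show(e) {
  if (e.kind === "sym") return e.name;
  if (e.kind === "num") return String(e.value);
  return `(${show(e.left)} + ${show(e.right)})`;
}

// Initial symbolic state: every input maps to its own symbol.
const symbolicState = { x: sym("x0"), y: sym("y0") };

// z = x + y
symbolicState.z = add(symbolicState.x, symbolicState.y);

// x = x + 1  (x now differs from its initial value x0)
symbolicState.x = add(symbolicState.x, num(1));

const shownState = Object.fromEntries(
  Object.entries(symbolicState).map(([name, e]) => [name, show(e)])
);
console.log(shownState);
// x is "(x0 + 1)", y is "y0", z is "(x0 + y0)"
```

Note that `z` keeps referring to the old value of `x`, because the assignment to `z` happened before `x` was updated.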
So a path condition is essentially a quantifier-free formula over the symbolic inputs that represents all the branch decisions that have been taken up to the point in the program where we currently are. For example, if you look again at this piece of code, there is this condition that is checked here, and the question is what condition must hold if we execute these dot, dot, dot statements here. The answer is that X0 plus Y0 must be larger than zero, simply because the check is on Z, and the symbolic state of Z, as we've seen on the previous slide, is X0 plus Y0. Here we are checking whether Z is larger than zero, so we get this path condition. So basically this is a formula that tells us what must hold in order to reach this branch down here.
The nice property of these path conditions is that they are logical formulas, and we can check whether a logical formula is satisfiable or not. This means that we can check whether a path is feasible, so whether it can be executed or not, by just checking whether the path condition is satisfiable. This could in principle be done by hand, but of course you want to automate this whole process of testing a program through symbolic execution. So we need another piece of software that does this satisfiability check for us, and this piece of software is called an SMT or a SAT solver. SAT here simply stands for satisfiability, because this is essentially what these solvers are checking. But in practice you often want to reason not just about purely logical formulas but also about integers or strings in a program. Therefore most people use so-called SMT solvers, which stands for satisfiability modulo theories, where theory just means that the solver also knows something beyond pure logic. For example, it knows something about integers or strings.
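The way the path condition for the branch on Z comes about, by substituting the symbolic state into the branch condition, can be sketched as follows. The string encoding, the single-letter variable convention, and the helper names `substitute` and `extend` are assumptions made for this illustration only.

```javascript
// Symbolic state before the branch, encoded as strings (a deliberately
// simple, hypothetical encoding; variables are single lowercase letters).
const stateBeforeBranch = { x: "x0", y: "y0", z: "(x0 + y0)" };

// Replace every program variable in a condition by its symbolic value.
function substitute(condition, state) {
  return condition.replace(/\b[a-z]\b/g, v =>
    Object.prototype.hasOwnProperty.call(state, v) ? state[v] : v
  );
}

// Extend a path condition (a list of conjuncts) with one branch decision:
// the condition itself if the branch is taken, its negation otherwise.
function extend(pathCondition, condition, state, taken) {
  const symbolic = substitute(condition, state);
  return [...pathCondition, taken ? symbolic : `!(${symbolic})`];
}

const intoBranch = extend([], "z > 0", stateBeforeBranch, true);
const pastBranch = extend([], "z > 0", stateBeforeBranch, false);
console.log(intoBranch, pastBranch);
// intoBranch: ["(x0 + y0) > 0"]
// pastBranch: ["!((x0 + y0) > 0)"]
```

A real symbolic execution engine would build solver terms rather than strings, but the substitution step is the same idea.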
There are many of these SMT or SAT solvers out there. Z3, Yices, and STP are just a few that are popular, and if you want to try some of them out, they are relatively easy to set up: you can feed some logical formulas into them and see whether these formulas are satisfiable. In addition to telling you whether a formula is satisfiable, these solvers also provide a concrete solution in case the formula is satisfiable. A concrete solution means that for every symbolic value you get a concrete value that, if you plug it in for the symbolic value, makes the entire formula evaluate to true.
Let's illustrate these ideas with two examples. Let's say we give this formula here to a solver, where we say that A0 plus B0 must be larger than one. Then it will tell you that this is of course satisfiable. There is a way to make this formula true, for example the solution where A0 is equal to one and B0 is also equal to one. Of course, there's more than one solution, but this is one solution that the solver might give you.
The second example is this one down here, where we have a slightly more complex logical formula: A0 plus B0 must be smaller than zero, and at the same time we want A0 minus one to be larger than five, and B0 should be larger than zero. If you look closely at this example, you can figure out that there's no way of choosing A0 and B0 such that the entire formula is true: A0 must be larger than six and B0 larger than zero, so their sum cannot be negative. The solver will also find this out for you and will tell you that this formula is unsatisfiable, so there is no concrete solution that makes the formula true.
So now you've seen all these ideas of an execution tree, the symbolic representation of inputs and of the state of a program, and this solver-based way of reasoning about executions. You may wonder, well, why do we need all of this? What are the applications of symbolic execution?
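Before turning to the applications, the two solver queries above can be mimicked with a bounded search. This is not how an SMT solver works, and `solve` is a hypothetical helper: unlike a real solver, a `null` result only means that no solution exists within the searched range, and the solution it returns for the first formula need not be the one mentioned in the lecture.

```javascript
// Bounded search standing in for an SMT solver over two integer inputs.
// Returns some satisfying assignment, or null if none is found in range.
function solve(formula, bound = 20) {
  for (let a0 = -bound; a0 <= bound; a0++) {
    for (let b0 = -bound; b0 <= bound; b0++) {
      if (formula(a0, b0)) return { a0, b0 };
    }
  }
  return null;
}

// First example: a0 + b0 > 1 (satisfiable).
const firstResult = solve((a0, b0) => a0 + b0 > 1);

// Second example: a0 + b0 < 0 and a0 - 1 > 5 and b0 > 0 (unsatisfiable).
const secondResult = solve((a0, b0) => a0 + b0 < 0 && a0 - 1 > 5 && b0 > 0);

console.log(firstResult, secondResult);
```

For the second formula the search finds nothing, which matches the argument that the constraints contradict each other; only a real solver, or the manual argument, establishes this for all integers.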
And there are many, many applications, and the general goal of all of them is to reason about the behavior of a program not just by executing it, but by simultaneously reasoning about different paths that might be executed in the program.
This has a couple of basic applications, which is what we'll mostly focus on in this lecture. For example, you can use it to detect infeasible paths, so you can rule out that a particular path in the program can be executed. Even if an assertion would be violated on this path, if you see from the path condition that it's impossible to execute the path, then it doesn't really matter. And as we'll see, you can also use this to generate new test inputs. Because the solver can give you a concrete solution for a particular path condition, you can ask the solver for concrete inputs that trigger this path, and the solver will tell you. This way you're generating new test inputs that are guaranteed to take a particular path. By doing all of this, you can use symbolic execution to find bugs and also vulnerabilities: you encode what would represent a bug, for example as an assertion, and then use symbolic execution and constraint solving to find inputs that violate the assertion. And this then of course means that you've found a bug.
Beyond these basic applications, which is what most people use symbolic execution for, you can use this general idea of reasoning about the behavior of a program for other applications as well. For example, there's work on generating program invariants based on symbolic execution, where you want to find properties that must hold at particular points in a program. You can also try to prove that two pieces of code are equivalent. This is an inherently hard problem, but with symbolic execution you can do this at least in some cases.
It's also pretty useful for debugging. If you have a bug and you want to find out when and why this bug is triggered, you can try to encode the conditions that trigger the bug through symbolic execution. This can then also guide, for example, something called automated program repair, where an automated technique tries to find a way to change the code in order to fix the bug. Here you would know that if this condition holds, the program is in a buggy state. So let's change the code so that it's impossible to reach this particular condition, while still maintaining the behavior that is triggered when other paths of the program are executed.
All right, and this is already the end of video number one in this lecture on symbolic and concolic execution. We've now looked at classical symbolic execution, and you've seen some examples of how this works. You've also learned about the underlying concepts: the execution tree, the path conditions, and the idea of using a constraint solver to find inputs that make us take a particular path in the program. In the next videos, we will look into some problems of this overall idea, and then we'll also have a look at how these problems can be addressed. Thank you very much for listening and see you next time.