Hi and welcome to the program analysis course. In this lecture, I will introduce data flow analysis, which is one way of formulating static analysis problems. It's probably the most popular way, and many static analyses are formulated as data flow analyses. In this lecture, I'll introduce the concept of data flow analysis, show a number of different examples of data flow analyses, and explain the conceptual framework that ties all these examples together. The lecture consists of six videos, and this is the first of them.

Let's get started by having a look at the big picture again. In the introduction of this course, I mentioned that there are two kinds of analysis: static analysis, where the source code is analyzed without executing it, and dynamic analysis, where you execute the code and, while the program is running, the analysis extracts additional information about this execution. Data flow analysis, the focus of this lecture, is a static analysis. So we are not executing the code; instead, the analysis reasons about the source code and about the different behaviors the code may have if it gets executed. There are different ways of formulating a static analysis, and data flow analysis is just one of them, but it's probably the most popular one because it's a very powerful way of expressing how to reason about the behavior of a program.

Here's a one-slide summary of what data flow analysis is about. The basic idea is that we look at the program through its control flow graph, that is, the graph that has nodes for the different statements in the code and edges that tell us that one statement may be executed after another. What a data flow analysis does is propagate some kind of analysis information through this control flow graph in order to compute an analysis state at each program point.
So at every statement, we want to know what the state of the program could be at that point, and there are different ways to define this state, depending on what you want your analysis to compute. To enable the analysis to propagate this state through the entire program, we need to define for every kind of statement how it affects the analysis state. So there will be some way of expressing that if you reach this kind of statement, then the state we're interested in changes like this or like that.

One interesting aspect of programs is that they have loops, so some things may occur repeatedly, and often you do not know how often, at least not statically. To handle this challenge, a data flow analysis iterates until a so-called fixed point is reached, which essentially means that the analysis state does not change anymore even if you consider more iterations of the loop. You'll see examples of all of this in the rest of this lecture; this was just a very brief summary so that you know what to expect.

Here's a brief outline of this lecture and its six videos. In the first video, the one we're currently in, I'll start by giving a first example of a data flow analysis, the so-called available expressions analysis. You'll see in a second what this really is. After this example, I'll explain some of the basic principles behind what you've seen in the example. Then I'll give more examples, which are essentially more data flow analyses that are all defined in pretty much the same way, but for different analysis problems. Then we'll look into one interesting aspect of computing the actual analysis state at every program point, namely how to solve the so-called data flow equations and the data flow problem as a whole.
Then we'll have a brief look at going beyond analyzing individual functions, which is what most of the lecture is about, by turning from an intraprocedural to an interprocedural analysis. And at the very end, I'll talk a bit about so-called sensitivities, which tell us how sensitive the analysis is to particular aspects of the code and give us a feeling for how precisely an analysis models the actual behavior that may happen at runtime.

All right, let's get started with one example of a data flow analysis: the so-called available expressions analysis. The goal of this analysis is to compute for every point in the program, so for every statement, which expressions must already have been computed and whose variables have not been modified since. Why would you want to do this? It can be very useful, for example, to avoid recomputing an expression. If you know for sure that when the execution reaches a particular point, an expression has already been computed and its value has not changed since, then you do not have to compute it again, even if the code asks you to; you can simply reuse the previously computed result. Compiler optimizations do exactly that: when compiling the code, they try to avoid recomputing expressions, and sometimes this is done based on the kind of analysis we are looking at here, the available expressions analysis.

Let's make this more concrete by looking at a little bit of code. Here we have an example in JavaScript, but the language doesn't really matter. The code computes two expressions, a + b and a * b, and stores the results in the variables x and y. Then we have a while loop that checks some condition and, in the loop body, overwrites the values of a and x.
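The slide with the code is not part of this transcript. The following is a reconstruction from the spoken description in this video, transliterated to Python rather than JavaScript. The function name `example`, the parameters, and the return value are assumptions added only to make the snippet self-contained; the numbers in the comments follow the statement numbering used later in the video.

```python
def example(a, b):
    """Reconstruction of the lecture's running example (not the original slide)."""
    x = a + b          # statement 1
    y = a * b          # statement 2
    while y > a + b:   # statement 3: the loop condition
        a = a - 1      # statement 4
        x = a + b      # statement 5
    return a, x, y


# Inputs chosen so that the loop is not entered.
print(example(1, 1))
```

Note that once the loop is entered, y stays fixed while a + b only shrinks, so the condition stays true and the loop does not terminate. That does not matter for a static analysis, which reasons about the code without running it.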
Now in this code, there is at least one expression that is available every time the execution reaches a particular point in the program, and that is the expression a + b. Every time we reach the condition of the while loop, a + b has already been computed, and the values required for this computation, namely a and b, have not changed since the last time the expression was computed. One way to get to this condition is to come down through the first statements, so when we reach the while loop for the first time. In this case, a + b has already been computed in the first statement, and a and b do not change before we reach the while loop. The other way to reach the condition is by going back from the end of the loop body. In this case, a + b has been computed in the last statement of the loop body, and afterwards a and b do not change before we reach the condition. So this shows that every time the execution reaches this place, the expression a + b has already been computed or, to use the terminology of the analysis, is an available expression. This is the kind of information we would like to compute, of course also for more complex examples.

So how can a data flow analysis compute which expressions must be available at a particular location in the code? It does so by using so-called transfer functions. These functions take a statement and tell you how executing the statement affects the analysis state. In the case of the available expressions analysis, the analysis state is the set of available expressions. So what the transfer function tells us is: if you execute this kind of statement, how does this affect the set of available expressions? To do this, two functions are defined, namely the gen function and the kill function, which tell us about the things that are generated and the things that are killed.
For our available expressions analysis, the gen function tells us which available expressions are generated by executing a statement. Conversely, the kill function tells us which available expressions are killed by executing a statement, meaning that they are not available anymore afterwards. You can think of these two functions as two sides of the same coin: one adds things to our analysis state, and the other one removes things from it. In this case, "things" means available expressions.

So let's define these two functions, gen and kill, starting with gen. The gen function, just like the kill function, takes a statement and, for the available expressions analysis, returns a set of expressions. More precisely, we say that a statement generates an available expression e, so e will be part of what gen returns for that statement, if two conditions hold. First, the statement evaluates e, so it computes the value of this expression. Second, the statement does not later write any variable that is used in e, so it does not overwrite anything that could change the result of computing e. If a statement does this for some expression e, then e is in the set returned by the gen function for that statement. Otherwise, if there is no such expression, the function returns the empty set for the statement.

As a concrete and very simple example, consider a statement like this one down here, where the variable x is written with the result of multiplying a and b. Because the code computes the expression a * b in order to execute the statement, gen of this statement contains the expression a * b.
The counterpart of the gen function is the kill function, which tells us which available expressions are not available anymore after executing a statement. Again, kill takes a statement and returns a set of expressions. We say that a statement kills an available expression e if the statement modifies any of the variables used in e. The reason is that if you change any of the variables used to compute an expression, then the result of evaluating the expression may change, and therefore you cannot consider this expression to be available, meaning pre-computed, anymore. So if a statement modifies any variable used in some expression, then this statement kills that expression. Otherwise, the function again returns the empty set. For example, if a statement does not modify any variable, then it also doesn't kill anything.

As a concrete example, let's say we have a statement like this, where 23 is assigned to a. This statement would kill, for example, the expression a * b, because it writes to a and by doing so may invalidate the previously computed result of a * b. So kill of that statement includes this expression.

Let's illustrate these ideas using our example again. What you see here is the same example you've seen earlier, the while loop surrounded by some statements that write to x, y, and a. The first thing we always do in order to perform a data flow analysis is to write down the control flow graph of the code, simply because this is the structure the data flow analysis works on. So let's do this for this example. Every control flow graph has an entry node and an exit node, which tell us where the execution starts and where it eventually ends. After the entry node, the first thing that happens is the assignment of a + b to x.
So we have an edge here, because this always happens immediately after we enter this piece of code. Then we have another assignment, which writes the result of a * b to y. So far so good. The next thing that happens when you execute this program is that we reach the while loop, which means we reach the condition of the while loop. So the program evaluates y > a + b. Then there are two options. One is that we immediately go to the exit node, because if the condition is false, we are at the end of the code and that's it. The other is that we enter the loop and execute the two statements in the loop body: one that decrements a, and another that computes a + b and writes the result to x. Once we are done with the second statement in the loop body, we go back to the loop condition to check whether we should execute the loop another time.

Now, the goal of an available expressions analysis is to reason about which expressions are available at a given program point, and it does that for so-called non-trivial expressions. That does not include literals like 1, 2, or 3, and it also does not include single variables like a or b, but it includes everything that is more complex, for example a + b, a * b, and so on. For our example program, let's have a look at what the non-trivial expressions are. Only expressions that really occur in the code are relevant, because the others are not computed anyway. In this case there are three of them: a + b, which occurs in three different places; a * b, which is computed once; and a - 1, which is computed when a is decremented. These are the three expressions we care about in this available expressions analysis.
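The control flow graph just described can be written down as plain data. This is only a sketch: the names `STATEMENTS`, `SUCCESSORS`, `predecessors`, and `NONTRIVIAL` are illustrative assumptions, and the statements are numbered 1 to 5 in order of appearance.

```python
# Statements of the running example, numbered in order of appearance.
STATEMENTS = {
    1: "x = a + b",
    2: "y = a * b",
    3: "y > a + b",   # loop condition
    4: "a = a - 1",
    5: "x = a + b",
}

# Control flow edges: each node maps to the nodes that may execute next.
SUCCESSORS = {
    "entry": [1],
    1: [2],
    2: [3],
    3: [4, "exit"],   # enter the loop body, or leave the loop
    4: [5],
    5: [3],           # back edge to the loop condition
    "exit": [],
}


def predecessors(successors):
    """Invert the successor relation: node -> nodes that may execute just before it."""
    preds = {node: [] for node in successors}
    for node, succs in successors.items():
        for succ in succs:
            preds[succ].append(node)
    return preds


PREDECESSORS = predecessors(SUCCESSORS)

# The non-trivial expressions that occur in the code.
NONTRIVIAL = {"a + b", "a * b", "a - 1"}

print(PREDECESSORS[3])  # [2, 5]
```

The loop condition is the only node with two incoming edges, which is where the analysis later has to merge information.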
Next, we can look at the transfer functions for each of the statements in the code. To do this, let me give the statements some numbers so that I can refer to them more easily. Let's say this is statement 1, this is statement 2, this is statement 3, this is statement 4, and this is statement 5. Then I'm going to write down a table with a column for the statement s, a column for gen of s, so the expressions that are generated by this statement, and a column for kill of s, which tells us which available expressions this statement kills. We're going to do this for all five statements.

To do this, let me show you the definitions of gen and kill again, and let's start with gen. Gen tells us which available expressions a statement creates because it evaluates the expression and does not later, within the same statement, write to any variable used in the expression. Let's have a look at the first statement. It evaluates a + b and does not write to a or b, which means it fulfills this condition. So gen of statement 1 contains the expression a + b. The result of gen of s, and also of kill of s, is always a set, because there may be more than one expression that is generated or killed, but in this case there's only one, namely the expression a + b.

Let's also have a look at what is killed by this statement. A statement kills an expression if it modifies any of the variables used in the expression. Statement 1 modifies the variable x. If we look at the non-trivial expressions, x does not occur in any of them, which means that this statement does not kill anything that is relevant to the analysis. So kill of this statement is the empty set.
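The two definitions can be sketched in code, applied here to statement 1. This is a minimal illustration rather than the lecture's own implementation: it assumes statements are simple Python-syntax assignments or conditions, uses Python's standard `ast` module (with `ast.unparse`, available from Python 3.9) to find expressions and variables, and counts only arithmetic binary operations as non-trivial expressions. All function names are assumptions.

```python
import ast


def nontrivial(stmt):
    """Non-trivial (arithmetic binary) expressions occurring in a statement, as text."""
    return {ast.unparse(n) for n in ast.walk(ast.parse(stmt)) if isinstance(n, ast.BinOp)}


def variables(expr):
    """Names of the variables used in an expression."""
    return {n.id for n in ast.walk(ast.parse(expr)) if isinstance(n, ast.Name)}


def written(stmt):
    """Variables the statement writes to (empty for a pure condition)."""
    node = ast.parse(stmt).body[0]
    if isinstance(node, ast.Assign):
        return {t.id for t in node.targets if isinstance(t, ast.Name)}
    return set()


def gen(stmt):
    """Expressions the statement evaluates without afterwards writing one of their variables."""
    return {e for e in nontrivial(stmt) if not (variables(e) & written(stmt))}


def kill(stmt, all_exprs):
    """Expressions that use a variable the statement writes to."""
    return {e for e in all_exprs if variables(e) & written(stmt)}


STATEMENTS = ["x = a + b", "y = a * b", "y > a + b", "a = a - 1", "x = a + b"]
ALL_EXPRS = set().union(*(nontrivial(s) for s in STATEMENTS))

print(gen(STATEMENTS[0]), kill(STATEMENTS[0], ALL_EXPRS))  # {'a + b'} set()
```

Restricting kill to `ALL_EXPRS` mirrors the lecture's point that only expressions occurring in the code are relevant.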
Now let's do the same for the other statements, starting with statement 2. Statement 2, very similar to statement 1, computes an expression, namely a * b, so gen of statement 2 contains a * b. It writes to a variable, y, but this variable does not occur in any of the non-trivial expressions, so the kill set is again the empty set. Statement 3 also computes one of our non-trivial expressions, a + b, so this goes into its gen set. It does not write to or modify any variable, so kill of statement 3 is again the empty set.

Statement 4 is slightly different, because it does not generate any of our non-trivial expressions. You may wonder why, because after all it computes a - 1. But if you look carefully at the definition of the gen function, you see that it's not only about evaluating an expression but also about not later writing to any variable used in this expression. Statement 4 also writes to a, so the fact that it has computed a - 1 before doesn't matter anymore, because a - 1 may have a different value after statement 4 has executed. Therefore the gen set of statement 4 is the empty set: there's no expression that is computed here and not later modified. Let's also have a look at the kill set of that statement. Because the statement writes to a, it kills all three of our non-trivial expressions, since each of them contains a. Once we have executed statement 4, whatever we have pre-computed for these expressions may not be valid anymore. So a + b, a * b, and also a - 1 are in the kill set of statement 4. Finally, let's have a look at statement 5.
Syntactically, this looks exactly the same as statement 1, which means that its gen and kill sets also have to be exactly the same as for statement 1. That is, statement 5 generates the available expression a + b, but it does not modify any of the variables that occur in the non-trivial expressions, and therefore its kill set is the empty set. So this table gives us the transfer functions for all the statements in this example, and using these transfer functions we can perform the actual data flow analysis.

Now that we have defined the transfer functions for the individual statements, we need to propagate this information through the program in order to compute the state at every statement. In this case, that means we want to propagate the available expressions. For this example, we start with the assumption that initially there are no available expressions. That makes sense, because when you start executing this piece of code, nothing has been computed yet, so there cannot be any available expression. The analysis we're going to use here is a so-called forward analysis. We'll see in the next video what alternatives exist, but it basically means that we propagate the available expressions in the direction of the control flow, so forward, as the program executes.

So now what we're going to do is compute the available expressions at the exit of each of our statements, so after executing each statement, depending on three things. One is the set of available expressions that are incoming to the statement, that is, the expressions that are available just before we enter the statement.
From this set, we remove everything that is in kill of s, so everything this statement kills. Then we add all the available expressions in gen of s. The result is the set of expressions that are available at the end of statement s.

There are still two questions left to be answered, namely what happens when control flow splits or merges. When control flow splits, for example because we have a branch, or a loop that we may or may not enter, we propagate the information along both ways. So we take the set of available expressions, and no matter which way we go, we know that these expressions will be available on either of the two paths. When control flow merges, for example because two branches join or because we are at the end of a loop, we intersect the incoming sets of available expressions. The reason is that if control flow may come from here or from there, and we only know that an expression is available on one of these two paths but not on both, then we cannot be sure that the expression is really available after the control flow has merged. So we only keep what is in the intersection, that is, the available expressions that are in both incoming sets.

Let's do this a little more formally by writing down the so-called data flow equations, which tell us what the analysis state is at the entry and at the exit of every statement. In the case of the available expressions analysis, these equations tell us which expressions are available at the entry of every statement and at the exit of every statement.
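Written compactly, the propagation rules described in words above take the following general form. The notation is an assumption for illustration: pred(s) denotes the statements with a control flow edge into s, and the subscripts mark the entry and exit of a statement.

```latex
\begin{align*}
AE_{entry}(s) &=
  \begin{cases}
    \emptyset & \text{if } s \text{ is the first statement} \\
    \bigcap_{p \,\in\, pred(s)} AE_{exit}(p) & \text{otherwise}
  \end{cases} \\
AE_{exit}(s) &= \bigl(AE_{entry}(s) \setminus kill(s)\bigr) \cup gen(s)
\end{align*}
```

The intersection in the first equation is what makes this a "must" analysis: an expression survives a merge only if it is available along every incoming edge.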
I'll denote this analysis state with AE, for available expressions, and we compute two sets for every statement s: AE entry of s, which is the set of available expressions at the entry of s, and AE exit of s, which is the set of available expressions at the exit of s. Let's write down these equations for all the statements in our code, starting with statement 1. AE entry of statement 1 is the empty set. This is by definition, because we have said that when we start executing this code, no expression is available yet. All the other entry sets are defined in terms of the exit sets of the statements that execute just before them. For example, if you go back to our control flow graph, statement 2 always executes right after statement 1, and therefore AE entry of 2 must be the same as AE exit of statement 1.
Let's have a look at AE entry of statement 3. Going back to the control flow graph, you'll see that statement 3 has two incoming control flow edges, one from statement 2 and one from statement 5. That means AE entry of 3 is the intersection of AE exit of 2 and AE exit of 5. So we take whatever comes in along either edge, and only if an available expression is in the intersection of these two sets do we keep it for AE entry of 3. For AE entry of 4 and 5, we have to consider only one predecessor again, because the only thing that happens before statement 4 is statement 3, and the only thing that happens before statement 5 is statement 4. In terms of data flow equations, AE entry of 4 is AE exit of 3, and AE entry of 5 is AE exit of 4.

Now that we have defined the entry sets for all these statements, let's also look at which available expressions exist at the exit of each statement. So we have to write down AE exit for each of our five statements. Looking back at what I said on the previous slide: for every statement s, the outgoing available expressions, so AE exit, are the incoming available expressions, minus everything that gets killed by the statement, plus everything that gets generated. Let's use this for our example. For statement 1, AE exit of 1 is equal to what comes in, so AE entry of 1. Then we would remove everything that gets killed by statement 1, but looking back at the table, statement 1 doesn't kill anything. Afterwards we add everything that gets generated by statement 1, in this case a + b. So we take the union of AE entry of 1 and the set that contains a + b. Next we can do the same for all the other statements, so let's
continue with AE exit of 2. This is AE entry of 2, minus everything that gets killed by statement 2, which according to the table is nothing, plus everything that gets generated by statement 2, so plus the expression a * b. Statement 3 looks similar: AE exit of 3 is AE entry of 3, minus everything that gets killed, which in this case is nothing, plus everything that gets generated, so we take the union with a + b.

The next one is perhaps the most interesting one, because it's the statement that writes to a. AE exit of statement 4, so the set of available expressions just after executing this statement, is whatever is available before we execute it, so AE entry of 4, minus everything that gets killed, plus everything that gets generated, which in this case is nothing. So here we remove from AE entry of 4 everything that is killed: a + b, a * b, and a - 1. That was statement 4, so let's do the final one. AE exit of statement 5 is AE entry of statement 5, minus everything that gets killed, which is nothing because statement 5 doesn't kill anything, plus a + b, because this is generated by statement 5.

With these equations, we have defined what needs to be satisfied by the final state of our data flow analysis. Now, these equations depend on each other, so we cannot immediately see from them what the concrete available expressions are at the entry or exit of a statement. To get the concrete sets of expressions at the entry and exit of each statement, we need to compute a solution of these equations. We'll see in one of the later videos how this is done. What I'll show here is only the
result of computing the solution. For every statement, we'll write down AE entry and AE exit of that statement, based on the equations you see on the left. For statement 1, the entry set is the empty set and the exit set contains a + b. So far, so easy. For statement 2, the entry set equals the exit set of statement 1, so it contains the expression a + b. If you look at the equation that defines AE exit of 2, you'll see that it should also contain the expression a * b, in addition to whatever was there at the entry of statement 2, which means that AE exit of statement 2 contains a + b and also a * b.

Now let's move on to statement 3. Its entry set contains only a + b, which may seem strange at first, because the exit set of statement 2 contains both a + b and a * b. But statement 3 also has another incoming edge, from statement 5, and as we'll see in a second, statement 5 has only a + b in its exit set. We take the intersection of the incoming sets, and that results in a + b but not a * b. Since statement 3 kills nothing and generates only a + b, its exit set is also just a + b.

Next, let's look at statement 4. Its entry set is a + b, because that's what comes out of AE exit of 3. Because statement 4 kills all of our non-trivial expressions, including a + b, the exit set of statement 4 is the empty set. Statement 5 gets this empty set as its entry set, but it also adds something, namely a + b, which is why a + b is in the exit set of statement 5.

Let's go back to the beginning of this video, where I showed you this example and said that every time the execution reaches this point, we know that a + b has already been computed. If you now look back at our table that shows the solution of the data flow equations, then
you'll see exactly the same piece of information, namely that AE entry of statement 3 contains a + b, which means exactly what is written here in yellow: at the time we reach this statement, we know that the expression a + b has definitely been computed.

All right, to test whether you have really understood this idea of data flow analysis, and in particular the available expressions analysis, here's another piece of code. As a little quiz, my question for you is whether the expression x - y is available when entering the very last statement of the example. I encourage you to stop the video at this point and think about it yourself. If you want, you can even write down all the equations that I have written down before, because then you can really check whether you've understood these ideas.

Now that you've hopefully done this yourself, let me show you the solution. The answer is no: x - y is not among the available expressions when reaching the last statement. The reason is that there is one control flow path that comes from the body of the while loop and eventually reaches this last line, and in the body of the while loop, the code writes to the variable x. This means that this statement here kills the expression x - y. As a result, because we take the intersection whenever control flow merges, x - y is not guaranteed to be available when reaching this last line.

All right, that's already the end of this first video on data flow analysis. What I've done in this first video is to show one example of a data flow analysis, namely the available expressions analysis. What we'll do in the next video is look at the principles underlying this whole idea of data flow analysis, which, now that you've seen an example, you can hopefully understand easily. Thank you very much for listening, and see you next time.
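The solution table for the running example can be reproduced with a small fixed-point iteration. The lecture defers the solving algorithm to a later video, so this is only a sketch of one standard approach, not necessarily the method presented there: every set except the entry of statement 1 starts as the full set of expressions, and the equations are re-evaluated until nothing changes. The names `GEN`, `KILL`, `PREDS`, and `solve` are illustrative assumptions; the gen and kill sets are copied from the table built in this video.

```python
ALL = {"a + b", "a * b", "a - 1"}

# gen and kill sets per statement, as derived in the table above.
GEN = {1: {"a + b"}, 2: {"a * b"}, 3: {"a + b"}, 4: set(), 5: {"a + b"}}
KILL = {1: set(), 2: set(), 3: set(), 4: set(ALL), 5: set()}

# Predecessor statements; statement 1 is preceded only by the entry node.
PREDS = {1: [], 2: [1], 3: [2, 5], 4: [3], 5: [4]}


def solve():
    """Re-evaluate the data flow equations until no entry or exit set changes."""
    entry = {s: set() if not PREDS[s] else set(ALL) for s in GEN}
    exit_ = {s: set(ALL) for s in GEN}
    changed = True
    while changed:
        changed = False
        for s in sorted(GEN):
            if PREDS[s]:
                # Merge: keep only what is available along every incoming edge.
                new_entry = set.intersection(*(exit_[p] for p in PREDS[s]))
            else:
                # Nothing is available when the code starts executing.
                new_entry = set()
            # Transfer: remove what is killed, then add what is generated.
            new_exit = (new_entry - KILL[s]) | GEN[s]
            if new_entry != entry[s] or new_exit != exit_[s]:
                entry[s], exit_[s] = new_entry, new_exit
                changed = True
    return entry, exit_


entry, exit_ = solve()
for s in sorted(GEN):
    print(s, sorted(entry[s]), sorted(exit_[s]))
```

For this example the iteration settles after a few passes, and the resulting sets agree with the table discussed in the video: a + b is in the entry set of statement 3, and the exit set of statement 4 is empty.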