Hi, and welcome back to program analysis. This is video number four of this lecture on call graph analysis, where we look into a fifth and final way of constructing call graphs, namely the Spark framework.
The Spark framework brings in two interesting ideas. One of them is that you can see many different algorithms for constructing call graphs as instances of a single unifying framework. In particular, the RTA, DTA and VTA algorithms that we've seen in previous videos of this lecture can actually be seen as instances of this Spark framework. For any of these algorithms, the general recipe to construct a call graph is the following. First, the analysis builds a graph representation of the code, which is called the pointer assignment graph, and we'll see in a second what exactly that is. Then it propagates information through this graph, which eventually tells the analysis what types particular variables may have, and as a result also where the calls on these objects may actually go.
The second idea that Spark brings in is that it's not just doing call graph construction, but at the same time is performing a points-to analysis, which is an analysis that reasons about the objects that a variable may refer to. And we'll see in a second why this is a good idea: constructing a call graph and reasoning about points-to information at the same time allows the analysis to get a more precise call graph at the end.
So let's start by looking at this pointer assignment graph, which is at the basis of the Spark framework. Like every graph, it consists of nodes and edges, and here you see a summary of what these nodes and edges are about. There are nodes to represent allocations, variables, and field references, and the edges correspond to allocations that are assigned to something, assignments between variables, and also field stores and field loads.
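To make this summary a bit more tangible, here is a minimal Python sketch of how the three node kinds and four edge kinds could be represented. All class and field names here (`AllocNode`, `VarNode`, `FieldRefNode`, `EdgeKind`, `PointerAssignmentGraph`) are made up for illustration and are not Spark's actual data structures.

```python
from dataclasses import dataclass, field
from enum import Enum


class EdgeKind(Enum):
    ALLOC = "allocation"   # allocation node -> variable node
    ASSIGN = "assignment"  # variable node -> variable node
    STORE = "field store"  # variable node -> field reference node
    LOAD = "field load"    # field reference node -> variable node


@dataclass(frozen=True)
class AllocNode:
    site: str        # hypothetical label of the allocation site, e.g. "alloc1"
    type_name: str   # type created at this site


@dataclass(frozen=True)
class VarNode:
    name: str        # local variable, parameter, or static field


@dataclass(frozen=True)
class FieldRefNode:
    base: VarNode    # the variable the field is accessed on
    field_name: str


@dataclass
class PointerAssignmentGraph:
    edges: set = field(default_factory=set)

    def add_edge(self, kind, src, dst):
        self.edges.add((kind, src, dst))

    def nodes(self):
        # Every node that appears as source or target of some edge.
        return {n for _, src, dst in self.edges for n in (src, dst)}


# A tiny, made-up program: p = new HashMap(); q = p;
pag = PointerAssignmentGraph()
p, q = VarNode("p"), VarNode("q")
pag.add_edge(EdgeKind.ALLOC, AllocNode("alloc1", "HashMap"), p)
pag.add_edge(EdgeKind.ASSIGN, p, q)
```

The graph is stored as a plain set of edges here only to keep the sketch short; a real implementation would index edges by source node.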
So let's go through those in some more detail, and let's start with the allocation node. There's one such node for every location in the program where some constructor is called. These constructor calls are called allocation sites, and therefore these nodes are called allocation nodes. What this node represents is the set of objects that are created at this code location. So even if this code location may be reached multiple times in the program, for example, because it is in a loop, all the objects that may be created at this location are represented by this one node. This node always has an associated type, because you know what constructor is called here. So you know that this type, in this case A, is the type that the objects represented by this node must have. In the graph representation, we will see nodes that look like this, and that are marked with some allocation site, so some location in the code where you have a call to new followed by some constructor name.
The second kind of node are variable nodes, which, as the name suggests, can represent local variables, but they can also represent a couple of other things. For example, parameters, which are very similar to local variables, and static fields, which you can see as a kind of global variable. They are also used to represent thrown exceptions, which is basically to handle this special case of the Java language. What these variable nodes represent is a memory location that is holding a pointer, or maybe multiple pointers, to objects during the execution of a program. So this node down here that represents the variable p will represent all the pointers to objects that this variable may ever hold during the execution of the program. And depending on what setting is used when using the Spark framework, these nodes may be typed or not.
The third and final kind of node are nodes for field references.
So for every place in the program where a field is used, so where we write something like p.f, there is such a node. This node then represents the pointer dereference that happens when this field reference is actually executed. Because every field is in some object, and there always is a base object when you have a field reference, these nodes always have a variable node as their base. So for the example of p.f, this would be the node that represents p. These field reference nodes are also used to model arrays in the Java language, where there's basically an imaginary field called elements, which represents all the elements that an array can have.
All right, let's now look at the edges that we can have in this graph. The first kind of edge is an allocation edge, which represents the fact that some newly allocated object is assigned to a variable. So if we have a piece of code that looks like this, where we say new HashMap and create a new object, and then assign it to a variable p, then we will get a graph representation that looks like this, where alloc1 is the allocation site where the HashMap constructor is called, and p is the variable node that represents the variable p. We have this edge in between, because this newly allocated object is assigned to p. Similarly, the same happens if we do not explicitly call a constructor, but create an object, for example, like this, where a string literal is used, which also creates a string object.
The assignment edges are similar in the sense that they also represent a data flow, here from one field or variable to another. So whenever there's an assignment from a variable to a variable, from a field to a variable, from a variable to a field, or maybe from a field to a field, one of these edges is used. If, for example, we have an assignment between two variables like this, then we will have a corresponding edge that says that there is a flow from p to q.
If we have fields involved, like in this one, then the corresponding nodes are just field reference nodes. It's similar here, where we are assigning the value of a field into a variable (actually, this should be q).
So let's illustrate these ideas using a concrete example, which is this piece of code here where we have two static methods. In this example, we have three classes, A, B, and C, where A and B are both subclasses of C. We have two allocation sites where new instances of A and B are created, and then we have a couple more variables here: q, which holds the value of p at some point, and also t, which gets the return value of this bar method down here. We also have some fields, so we know that p has a field f, which will get the value that is in r, and the argument that is passed into bar also has a field f, which is then returned and eventually assigned here to t. Now, in order to find out where this call of t.m will actually go, we need to reason about the different variables and their types, and then at the end also about this call.
So let's do this by looking at the pointer assignment graph that Spark will create for this little program. We will have nodes for the different allocations here. There will be one node for allocation site one, which is this call of new A, and there will be another node for allocation site two, which is the call of new B. Then we will have nodes for all the different variables, so one for p, one for q, one for s, another one for r, and also one for t. And then we will also have nodes for all the field references that we have in this program, where one of them is p.f and the other one is s.f. Now that we have the nodes, we next add edges between these nodes, where we'll have different kinds of edges. First, these allocation edges here, which tell us that the newly allocated objects are stored into p and r.
Then we have some assignment edges, for example one here that tells us that p is assigned to q, and another one here that tells us that q is assigned to s. This second one is based on the information that the bar method is called here, where the q object is passed to the parameter s. Then we also have some edges on the other side here: one for this assignment of r to the field p.f, and then, based on the value that is returned by bar, an assignment edge that tells us that s.f will be assigned to t.
So now given this graph, we unfortunately still don't really know anything about this call of t.m down here. The reason is that we do not have any connection in the graph from one of the allocation sites, where we know something about the types, to this node that represents t down here. This is what we need the points-to analysis for. So let's first have a look at points-to sets and points-to analysis in general, and then we'll get back to this example to see how it helps to find out where this call of t.m is actually going.
So what does a points-to analysis do? Well, it's essentially computing points-to sets, which are sets of objects that a variable may refer to. Every variable has a points-to set, which tells us about all the objects that this variable may point to. The objects here are represented as allocation nodes. So it's not really the runtime objects, but a static abstraction of these objects, which is based on the allocation nodes. As a simple example, let's assume we have two allocation nodes like this, where we instantiate classes X and Y in different locations of the code, and both times write the result into the same variable a.
Then what the points-to analysis will tell us is that the points-to set of this variable a contains these two allocation nodes. And because we know the type of each of these allocations, since we know what constructor is called, we can easily see from this that the variable a may have types X and Y.
One important question when computing these points-to sets is how to actually reason about allocation and assignment edges. The answer that we give here in the context of this lecture is that we look at a so-called subset-based analysis. What this means is that every allocation or assignment edge induces a subset constraint instead of an equality constraint, which would be the other option. Concretely, if in our graph we have an allocation edge that looks like this, so here we say that the object allocated at allocation site one is assigned to some variable p, then this induces a constraint that tells us that alloc1 must be in the points-to set of p. But it does not say that the points-to set of p is equal to the set that contains only alloc1. The reason why we use a subset-based analysis here is very simple: just because we know that some allocation site gets assigned to a variable p, that does not mean that later we could not see another allocation site. So by using a subset-based analysis, we can add more allocation sites to the points-to set of a variable without having to know all the assignments at once, and also without drawing wrong conclusions about the actual assignments that happen in the program.
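To see the difference between the two kinds of constraints, here is a small Python sketch that solves the same made-up program once with subset constraints and once with equality constraints. The program (p = new X(); r = new Y(); q = p; q = r) and the function names `solve_subset` and `solve_equality` are assumptions for illustration only. The equality variant is implemented with a simple union-find structure, which is one common way to do it and not necessarily what any particular tool does.

```python
def solve_subset(allocs, assigns):
    """Each edge src -> dst only requires pts(src) to be a subset of pts(dst)."""
    pts = {}
    for site, var in allocs:
        pts.setdefault(var, set()).add(site)
    changed = True
    while changed:
        changed = False
        for src, dst in assigns:
            new = pts.get(src, set()) - pts.get(dst, set())
            if new:
                pts.setdefault(dst, set()).update(new)
                changed = True
    return pts


def solve_equality(allocs, assigns):
    """Each edge src -> dst forces pts(src) == pts(dst) by merging both variables."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for src, dst in assigns:
        parent[find(src)] = find(dst)

    merged = {}
    for site, var in allocs:
        merged.setdefault(find(var), set()).add(site)
    return {v: merged.get(find(v), set()) for v in list(parent)}


# Hypothetical program: p = new X(); r = new Y(); q = p; q = r;
allocs = [("alloc_x", "p"), ("alloc_y", "r")]
assigns = [("p", "q"), ("r", "q")]

subset = solve_subset(allocs, assigns)
equality = solve_equality(allocs, assigns)
```

With subset constraints, only q ends up with both allocation sites, whereas with equality constraints p, q and r all share the same merged set. That is the precision that is traded away for speed in the equality-based variant.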
Note that all the analyses that we consider here are flow-insensitive, which means that we do not consider the order in which statements are executed. That means that even if you know, for example, that one assignment to p is overwritten by another one, the analysis doesn't know this, because it only sees that there are two assignments to this variable p, and therefore it assumes that both of the values that are assigned to p may be what p actually refers to. So there is nothing like overwriting a value, because the analysis is flow-insensitive.
Having said that, let's now have a look at how to actually compute the points-to sets. In order to do this, we will introduce one more node into our graph, and this is a sort of helper node that is only needed to compute the points-to sets. This node is called the concrete field node, and it represents all the objects that are pointed to by a particular field f. In particular, we look at all the objects pointed to by field f of all objects that are created at a particular allocation site. So for example, such a node could look like this, where we say there is some allocation site, say alloc1, and if the objects created at this allocation site have a field f, then everything that may be in this field f is represented by this node called alloc1.f.
So now given this helper node and the pointer assignment graph, we can compute the points-to set for every variable and field using this algorithm that you see here. The algorithm consists of two steps. One is to initialize some of the points-to sets in the graph by just looking at the allocation edges. And then there's this main loop here, which repeatedly updates the points-to sets based on the different edges in the graph until nothing changes anymore. So this is done until all the points-to sets have stabilized. Let's now illustrate this algorithm using the example that we've already seen before.
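As a rough sketch of these two steps, the following Python code initializes the points-to sets from the allocation edges and then loops over assignment, load, and store edges until nothing changes. The function name `solve_points_to`, the tuple encoding of the edges, and the use of `(site, field)` pairs as concrete field nodes are assumptions made for this illustration. The edge lists at the bottom encode the pointer assignment graph of the running example as it is described in this video.

```python
from collections import defaultdict


def _add_all(pts, target, new_objects):
    """Add objects to pts[target]; report whether the set grew."""
    before = len(pts[target])
    pts[target] |= set(new_objects)
    return len(pts[target]) > before


def solve_points_to(alloc_edges, assign_edges, store_edges, load_edges):
    """
    alloc_edges:  (site, var)             var = new ...    at allocation site `site`
    assign_edges: (src, dst)              dst = src
    store_edges:  (src, base, fld)        base.fld = src
    load_edges:   (base, fld, dst)        dst = base.fld
    Keys of the result are variable names or (site, fld) concrete field nodes.
    """
    pts = defaultdict(set)

    # Initialization: allocation edges.
    for site, var in alloc_edges:
        pts[var].add(site)

    # Main loop: repeat until all points-to sets have stabilized.
    changed = True
    while changed:
        changed = False
        # Step 1: propagate along assignment edges.
        for src, dst in assign_edges:
            changed |= _add_all(pts, dst, pts[src])
        # Step 2: load edges read from the concrete field nodes of the base.
        for base, fld, dst in load_edges:
            for site in list(pts[base]):
                changed |= _add_all(pts, dst, pts[(site, fld)])
        # Step 3: store edges write into the concrete field nodes of the base.
        for src, base, fld in store_edges:
            for site in list(pts[base]):
                changed |= _add_all(pts, (site, fld), pts[src])

    return {node: objs for node, objs in pts.items() if objs}


# Pointer assignment graph of the running example.
result = solve_points_to(
    alloc_edges=[("alloc1", "p"), ("alloc2", "r")],
    assign_edges=[("p", "q"), ("q", "s")],
    store_edges=[("r", "p", "f")],
    load_edges=[("s", "f", "t")],
)
print(result["t"])  # {'alloc2'}
```

Because the analysis is flow-insensitive, the order of the edges in these lists does not affect the final result; it only affects how many iterations the loop needs.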
So on the top right, you again see the code that you've seen before. In the handwritten notes, you see the pointer assignment graph that we have constructed so far. And now we're applying this algorithm for computing the points-to sets of the different variables and fields to this example.
The first step in the algorithm is to initialize the points-to sets of variables that are involved in an allocation edge. In this example, we have two allocation edges, this and this. What the first edge tells us is that the variable p gets assigned the object created at allocation site alloc1, which means that this variable may point to all of the objects that are represented by alloc1. This blue dashed edge that I'm using here represents the points-to edges of variables and fields. Looking at the other allocation edge, we can do the same. This second allocation tells us that r may refer to the objects allocated at alloc2.
After having performed this first step, we now enter the main loop of the algorithm, where we start with the first step, which is to propagate the sets along the assignment edges that we see in our graph. We have a couple of assignment edges here, but we can only propagate something if we already know something about the source of one of these assignment edges. This is the case, for example, here, because we already know something about the points-to set of p and can now propagate this information along that assignment edge to the points-to set of q, where we now also know that q may refer to objects created at alloc1. After having done this, we can also propagate this information along this other assignment edge down here, from q down to s, by saying that s is now also known to refer to the objects created at alloc1. This is all we can do for the first step in the main loop.
So let's move on to step number two in this main loop, where we look into the load edges in our graph. There's one such load edge, which is this one down here. Now, if we already knew something about the points-to set of s.f, then we would propagate this into the points-to set of t, but we do not yet know anything about it. Therefore, there's nothing we have to do here.
Instead, the algorithm moves on to step number three in the main loop, where we look at store edges. In this example, we have exactly one store edge, which is this one. What we do here is the following. We look at the points-to information of the variable that is stored into the field, so the points-to information of r, and we know something about r, namely that it may refer to alloc2. Then we look into the base objects that p may refer to. So we look at this edge up here, where we see that p may refer to alloc1. And now the helper node that I talked about earlier comes into play, because we are creating this helper node for alloc1.f, which represents all the values that the field f of an object created at alloc1 may point to. Based on this helper node, we now propagate the information that r may point to alloc2 to the helper node, by saying that the fields represented by alloc1.f may also refer to alloc2.
With this, we are done with the first iteration of this main loop, and then go back to step number one in the loop. There's nothing more we can propagate along assignment edges. But now, when reaching the second step of the main loop, we can look again at the load edge that we had already visited earlier, and now there's actually something we can do about it. The reason is the following. The algorithm checks if it knows anything about s.f. To do this, it looks into the base object, and it sees that for this base object s, we know that it may point to alloc1.
Therefore, it looks into this helper node alloc1.f, because f is the field that we're accessing on whatever s may point to. It sees that alloc1.f may point to alloc2. And because we know this about s.f, we can now propagate this information down to t by saying that, well, if s.f can refer to alloc2, then t can now also refer to alloc2. So we add this points-to edge from t to alloc2 into our graph. Then the algorithm moves on and will see that nothing changes anymore. There are no more points-to edges that we can add to our graph, and therefore the algorithm terminates and has computed the points-to information for all the fields and variables in this example.
So now the big question is, of course, what can we do with this information? After all, our goal was to compute a call graph. To illustrate this, let's have a look at this call of t.m at the very end of our foo method. As I said earlier, the class hierarchy that we assume here is that we have two classes A and B, which are both subclasses of a class C. Let's just assume that each of these three classes A, B and C offers a method m. Without knowing anything else, the algorithm would have to assume that t.m can call A.m, B.m, or C.m. But now, based on the points-to information that we have computed here, we know that t can only ever refer to whatever is created at alloc2, and alloc2 is known to create an object of type B. So what we know here is that the call goes to B.m, and not to A.m and also not to C.m. We have ruled out two call edges that we would otherwise have, because we have computed points-to information here. So I hope to have convinced you that this idea of Spark of combining points-to analysis with call graph construction is a good idea, because it can give you a more precise call graph.
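This last step can also be sketched in a few lines of Python. The tables `SUPERCLASS`, `DECLARED_METHODS` and `ALLOC_TYPE` encode the class hierarchy and allocation sites of the running example; the function names are hypothetical, and the assumption that t has declared type C is mine, since the video does not state the declared type of t.

```python
# Class hierarchy of the running example: A and B extend C, and each declares m.
SUPERCLASS = {"A": "C", "B": "C", "C": None}
DECLARED_METHODS = {"A": {"m"}, "B": {"m"}, "C": {"m"}}
# Type created at each allocation site.
ALLOC_TYPE = {"alloc1": "A", "alloc2": "B"}


def dispatch(runtime_type, method):
    """Walk up the hierarchy to the first class that declares the method."""
    cls = runtime_type
    while cls is not None:
        if method in DECLARED_METHODS.get(cls, set()):
            return f"{cls}.{method}"
        cls = SUPERCLASS.get(cls)
    raise LookupError(f"{runtime_type} has no method {method}")


def subtypes(cls):
    """The class itself plus all of its direct and indirect subclasses."""
    result = {cls}
    for child, parent in SUPERCLASS.items():
        if parent == cls:
            result |= subtypes(child)
    return result


def hierarchy_targets(declared_type, method):
    """Targets when only the class hierarchy is known."""
    return {dispatch(t, method) for t in subtypes(declared_type)}


def points_to_targets(receiver_points_to, method):
    """Targets when the points-to set of the receiver is known."""
    return {dispatch(ALLOC_TYPE[site], method) for site in receiver_points_to}


# Assumption: t is declared with type C; its computed points-to set is {alloc2}.
without_points_to = hierarchy_targets("C", "m")
with_points_to = points_to_targets({"alloc2"}, "m")
```

Here `without_points_to` contains A.m, B.m and C.m, while `with_points_to` contains only B.m, which matches the two call edges that were ruled out above.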
The other idea that Spark introduced was to have this generic framework that allows you to express different kinds of call graph analysis algorithms in more or less the same framework, and I just want to give you a glimpse of how this could work. If you take this algorithm and the pointer assignment graph that we've seen earlier, then you can change some details and you'll get some of the other algorithms that we have already discussed, and potentially also other variants that we are not discussing here. One change, for example, is that instead of having one allocation node per allocation site, you can also have just one allocation node per type, basically conflating all the different places where the same constructor is called. Another variant is that instead of representing fields precisely, by looking at exactly what variable a field is referenced on, you can also represent fields just by their signature, meaning by the class name dot the field name. And another variant would be, instead of looking at assignments in the subset manner that we've seen here, to let them impose equality constraints, which will be a little faster, but gives a less precise call graph.
So in summary, let's have a look at the pros and cons of this Spark framework. On the benefit side, one benefit clearly is that it's a generic algorithm where you can tune precision and efficiency according to your wishes. If you want to analyze a very large program, maybe you want to use a less precise call graph construction algorithm, but still be able to do the analysis, whereas if you have a smaller program, maybe you can afford to do a more precise analysis. And to get a precise call graph, one of the key ideas in Spark is that it does the points-to set computation along with the call graph construction, which, as we've seen, can increase precision. On the downside, all of this is still flow-insensitive.
So the analysis still does not look into the order in which statements are executed, which is something one could do, but which of course would make the analysis more expensive again. And even the way it is right now, if you use all the features that Spark provides, it can get quite expensive and take a while to compute for large programs.
All right, and this is already the end of this lecture on call graph analysis. I hope you now have a better idea of what a call graph actually is and how it can be computed in an object-oriented language. We haven't really talked much about the applications of call graphs, but they can be used for many different things and are at the basis of many other static analyses. For example, if you want to know whether changing the code in one method may impact the code in some other method, then one way to do this is to look at the call graph and see if there is an edge, or maybe a transitive sequence of edges, from the first method to the other method. If you want to try any of these things in practice, I recommend having a look at the Soot framework, which implements these algorithms for Java. So if you want to just play around with it a little bit and see what kind of call graph you get, then you should have a look at Soot. Thank you very much for listening and see you next time.