OK. Thank you for coming here on this rainy day, for getting out of bed and showing up. The first announcement: the midterm should be graded sometime tonight, so hopefully by tomorrow morning you will see the grades in your glookup. You can then pick up the exams in section next week. If you really can't wait, you can come to office hours, talk to the TAs, and get your midterm there.

I want to set a little bit of context for this lecture. Essentially, the first half of the semester was about how to build abstractions and use languages to make it easier to program. Then we started moving towards doing things statically, meaning at compile time: proving properties of programs so that we know they don't contain bugs and don't have security vulnerabilities. We looked at Java-style and ML-style type systems, the object-oriented, classical ones, and then those based on inference. And last time we looked at a type system in which we took a program and translated it into Prolog rules. Those Prolog rules encoded a set of constraints about what must be true about the types in the program. We solved them, and the result was a type for every expression in the program, including, of course, the type of every variable. If those types were consistent, we found a solution, and we knew the program was type safe: it never did things like reading a field that doesn't exist or adding an integer to a string.

Today we'll use the same approach to do something else. We won't quite build a type system; our interest is not to type check the program, but to find other properties of it, related to optimization, to security, and so on. So we'll talk about what static program analysis is. It's somewhat different from type systems: with a type system, if the program didn't type check, if those constraints could not be satisfied, we had to reject it, and the programmer had to fix it. Program analysis is different. It tries to discover deeper properties than the type system, and when it cannot prove them, it just says: well, I did my best, here is what I know, and more than that I don't know about this program. The knowledge that the static analysis derives can be used to optimize the program when that knowledge is useful. And it can be used to actually prove that the program has some properties, meaning that if at every point where some bad vulnerability could happen the analysis says, yes, I can prove that this vulnerability cannot happen, then the result of the analysis is essentially a proof that the program is safe from hackers.

We'll look at an advanced analysis called points-to analysis, also called flow analysis; you'll see soon why it's called that. And we'll look at one particular algorithm for computing it, called Andersen's algorithm, because it was invented by a guy called Andersen. It's very simple. We'll write it in Prolog; it will be perhaps four lines of code. But before we get there, we need to understand a little bit why four lines are enough. And if we have time, we'll talk about why this algorithm is essentially parsing: not parsing of a string, as we have seen so far, but parsing of a graph. So we'll come full circle, back to parsing, Prolog, and CYK.

So what is static program analysis? It answers questions about the program. It's related to type inference, because it looks at the program and tries to infer certain things. What could those be? Well, let me first tell you.
So static analysis means, as you know, at compile time. You look at the program before you see any input; you don't know what the input is. You are essentially running inside the compiler. And that means you need to prove properties before you see the input, which means you have to prove that whatever you're claiming, such as "x has value 7", is true for all inputs to the program. So here are some sample properties. You may want to ask: is this variable here a constant for all inputs, meaning for all possible ways the program can get here, for all possible values of the variables? Or you can ask whether, say, a function returns a table, as opposed to, say, an integer. And the list of properties goes on and on.

So let me show you four motivations for why this kind of analysis is becoming more and more interesting in everyday practice. The first one is just performance; you always want faster programs. You may want to replace the expression x[i] with, say, x[4]. Why is that faster? Because you do not need to take the value of i, add it to the base address of x, and then compute where the value is; you have a much simpler address computation when i is a constant. So in this program, here you know that i is 2; here you know that i is 4. If this if statement does not touch the value of i, then you know that i is 4 here as well, and you can optimize by replacing this one with 4 and this one with 3. Of course, the story is different when this branch here contains an assignment like this one. What would the analysis do then? So imagine that the if statement, which is conditionally executed, may modify i. The analysis could branch, and then what would the result of the analysis be? Yes, this is one possible way to design the analysis: you could say i could be 4 or 3, and then you would know all possible outcomes. That's one way to do it. Whether you do that or something else depends on the client of the analysis. The client of the analysis is whoever runs the analysis for the purpose of getting some knowledge about the program. Here, for the purpose of optimization, we probably couldn't do much with knowing that i is a constant but not always the same constant, because you couldn't just replace i in this array indexing with those values. So you may want to design an analysis which, rather than branching and keeping all the facts, says instead that i is an unknown value, because it is not a single constant. That would also be a legal way to build the analysis, except this one learns a little bit less.

Another motivation is security vulnerabilities. Many of you have heard of injection attacks. Injection attacks work like this: imagine you have a program running on the server, say your web server. Usually, what you enter on the client side in the browser is translated into some SQL command on the server side, which is then sent to the database. If the web service is not designed well, you could enter, instead of your social security number, for example, a string that is a SQL command, such as one that dumps the entire database back to the browser or removes the entire content of the database. These attacks are common, and people design so-called taint analyses. What do these taint analyses do? They check whether a value might flow from a POST command, that is, an input to the program into which the attacker can enter any value.
So this is a flow from an untrusted source to the SQL interpreter, where the input could be interpreted not as the value of a person's name or a social security number, but instead as a command, a dangerous command. So the taint analysis tries to determine whether every value from an untrusted source reaches the trusted sink (the dangerous sink, the SQL command) only by going through some sanitizer. And it's called taint analysis because of what it does: when you do it at runtime, you take the value that comes from the untrusted input and taint it with a bit, and the value propagates carrying that bit. When you reach a SQL command and the value is still marked with the bit, you know that it has not been sanitized, that there is a bug in the program which makes it possible to reach that SQL command without running the value through a sanitizer. Such libraries have been written for Python, and they essentially give you a dynamic way of protecting yourself against these attacks. It effectively places an assertion in front of the SQL command which stops the server program before it executes some untrusted value. That may or may not be desirable, right? The value may be completely safe, yet you refuse to serve the request. So it may be much better to do it statically. And the static analysis is essentially a direct counterpart of this dynamic analysis: rather than propagating the value with the attached bit at runtime, you analyze the program and pretend that you are running it on all possible inputs, essentially asking, does there exist a path in the program through which a value could flow from the input to the trusted sink without passing through the sanitizer? But you are doing it without actually running the program; you are doing it by analyzing all paths in the program. And the analysis that we look at today is exactly one such static analysis: it asks whether values could flow from here to here without being intercepted somewhere.

Motivation number three is optimization of virtual calls in Java. Why are virtual calls in Java more expensive than regular calls? Because you need to take the address of the object, go to its header, find there the pointer to the virtual table, which contains the methods of the class of which the object is an instance, find the method, and then call it. So there are typically two extra load operations, which is a significant overhead compared to a normal function call. Additionally, a normal function call can be inlined, which means you take the body of the function and put it right where the call happens, essentially eliminating the call entirely. That you cannot easily do with virtual calls, because you need to check dynamically, at runtime, which function to call. So how would you optimize it? The idea is to determine statically, again at compile time, the target function of a call. If you see a call like p.f(), and you can determine at compile time what p.f is, then you can pick that function and hard-code a call to it. That is, of course, assuming there is a single target. So if the analysis can say that there is only one function that could be called here, then the compiler can replace the virtual call, including the lookup, with a direct call to that function. So how can we do it?
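One family of answers can be phrased directly as Prolog rules. Here is a rough sketch of the simplest one, reasoning purely from declared types and the class hierarchy, which the next paragraph develops. All the predicate names and the toy two-class hierarchy are mine, not from the slides, and the lookup rule glosses over the finer points of Java's inheritance:

```prolog
% Hypothetical encoding of a two-class hierarchy: bar extends foo,
% and both classes declare a method f.
extends(bar, foo).
defines(foo, f).
defines(bar, f).

% subtype(C, A): C is A itself or a transitive subclass of A.
subtype(C, C).
subtype(C, A) :- extends(C, B), subtype(B, A).

% target(T, M, C): a call to M on a receiver of declared type T
% may dispatch to the M defined in class C.
target(T, M, C) :- subtype(C, T), defines(C, M).

% Devirtualizing a.f(), with a declared as foo, is legal only if
% the call has a single possible target:
%   ?- findall(C, target(foo, f, C), Cs).
%   Cs = [foo, bar].     % two possible targets here, so not legal
```

If bar did not override f, the query would return a single class and the virtual call could be replaced by a direct call, which is exactly the declared-type reasoning described next.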
So the first possible way of analyzing it is not through flow analysis, but by just looking at the declared type of the variable. If you have a call here, you ask: where was a declared? It's declared with type foo. And sometimes knowing that a is of type foo is sufficient, because if there exists only one function f in the entire hierarchy, whether in foo or bar (that is, if f is not overridden in bar), then you can legally make the optimization and replace the virtual call with a normal call. However, sometimes the declared type is not precise enough, because a could be a foo or its subclass bar, and if they have different f's, then you cannot do the optimization. So flow analysis does better. How? It tries to actually understand, by analyzing how values flow through the program, what the dynamic type of the variable a is. It sort of simulates, you could say, the execution of the program without actually running it on a particular input. And if you can then determine that the value here is of a subclass of foo, say bar, then you know a little bit more, and that may be enough to constrain the set of targets of f to a single target and perform the optimization.

So here is an example. Look at this call here; look at the program and tell me whether it is possible to optimize that call into a single-target call, replacing the chain of lookups (go to the header, go to the virtual table). You propagate the actual dynamic type of b, right? So, what is the static type of myB? It's B, right? And the dynamic type is also B. Now the static type of myA is A, but by understanding what values are assigned into it, you can see that myA will actually contain at runtime (no matter what input you give to this program; that's important) a value of dynamic type B. In other words, myA will point to an object that is an instance of B, okay? And therefore, if you look at the call to bar, the argument a is declared with static type A, but it will actually have dynamic type B. And therefore you know that this call to foo is going to go to this function here, and you can transform the program to call that function directly rather than go through those lookups.

Somewhat unrelated to the static analysis, but related to the optimization, is a little problem with Java semantics. Why can't you always do this optimization? Casting is not really a problem, because you can propagate the dynamic types through the cast, and you would have exactly the same information as here. So casting does not cause a problem, but there is a feature of Java that would actually make what I'm describing here unsafe. I think you're on the right track; let's try to make it more detailed, okay? It is essentially delayed loading, the fact that classes in Java are loaded during execution, okay? If you look at this program here, it has only two classes, A and B, where B is a subclass of A, and you see the entire universe, the entire set of classes that exist in this program at this point. But it could be that later the execution continues somewhere else, and all of a sudden class C is loaded, which is a subclass of B. Now we have A, its subclass B, and its subclass C, and C also contains a foo. And now it's no longer true that instances declared as B could only call this function, right?
Because if something has static type B, couldn't it also be a C, an instance of the subclass C, so that another function could be called? In this specific case, yes. Now that's true, but let me hide this. Imagine that, instead of this, I hide this class from the analyzer, okay? The analyzer could actually still be smart enough to optimize this program. Why is that? Now, assume that at the point when we do the analysis (say we started a Java program and we are in a just-in-time compiler, which analyzes the program at runtime), we have run the program for a little while. Then we look at this piece of code, see that it is frequently executed, which means we want to optimize it in the just-in-time compiler. So we interrupt the execution, analyze the program, and realize: this call here always goes into this function foo, so we are going to inline it. And even if the analyzer doesn't see this expression, or it is a call into some further code that is, I don't know, too big to be analyzable, it can still make a deduction; it can still prove that this optimization is safe. Why is that? Essentially, the knowledge we have is that the static type of myB is B. What could the dynamic type of myB be? If your static type is B, your dynamic type can be B or anything lower, right? Essentially, the declared type of the variable myB constrains the set of dynamic types that the variable can have at runtime. So the static type is B, and the dynamic type can only be B in this case, because we have only seen classes A and B. Therefore the dynamic type of myA is going to be B. So myA has static type A and dynamic type B; we propagate this information here and we see that the dynamic type of this variable is B, just by propagation from here to here to here, and the optimization is legal. So the declared type still helped us do the optimization.

And now we continue running the program after this optimization. Class C is loaded. All of a sudden the class hierarchy is deeper. A new foo may have appeared; it may have come with class C. And now this piece of code here could actually create an instance of class C, and this call here may need to go into C's foo. So the fact that we are doing analysis on a program that is essentially incomplete, because more classes can be loaded, makes certain optimizations difficult. So what do you think just-in-time compilers do to deal with such situations? Well, what would you do? Yes, an even simpler solution is not to optimize, except that these virtual calls are expensive. And if you always had to assume that more classes could come, you could only optimize classes that are final: if you know that class B is final, that it cannot be further extended, then this optimization is safe. But I think you want to be more aggressive and essentially re-optimize: each time a new class is loaded, you look at all the optimizations you've done and check whether they are still safe, and if not, you undo those optimizations. Okay, excellent. So there are various ways to check that those optimizations remain legal even after a new class is loaded. One of them is to re-analyze everything, or perhaps everything that could be affected by the new class; if the optimization is still safe, meaning the call still cannot reach a foo from C, you are good. Another one, which you suggested, is to insert here a check which tests whether the dynamic type of a is indeed B.
And that you could do, but it slows down the optimized code a little bit, because now you have this dynamic guard that's executed every time; that's probably an extra load from the header of the object, and you'd rather avoid it. So there is a trade-off between checking eagerly, once, and then leaving the fast code really fast, versus putting in a runtime check. Both are legal. Are you good?

Yes, and indeed this profile-guided approach is what the first just-in-time compilers did. It was done for the language called Self, which is sort of a version of Smalltalk, which is sort of an object-oriented version of Scheme. Well, that would not be fair, actually; Smalltalk was one of the first languages built around object orientation. And it was done by a student here from Berkeley, who realized that if a virtual call is expensive, let's just do profiling: runtime observation of the dynamic types that you see at a particular call site. So here, if you see that the dynamic type of a is usually B, then you inline B's method: you take its body and put it right there, so there is no call, no passing of arguments. And you keep just one guard, which says: if it is not a B, then do a normal call. Usually this works really well. But you need expensive profiling (you need to figure out at runtime what these types are), and you still have that one guard. But yes, this is what they did, and in fact this is what all virtual machines today do: V8, the JavaScript engine in Chrome, does it, and the same goes for Mozilla's engines.

Okay, another motivation is verification of casts. I'm not sure I want to say much here, but imagine you are looking at this. You know what you have in Java, right? In Java, a dynamic check is inserted each time an expression is cast. So Java is statically typed, but not fully; it has dynamic checks. And maybe you want to be extra sure that the program is type safe, so you look at your program and make sure that no cast will ever fail, because you are shipping it to the moon and you don't want the program to reboot just before landing. So you could again do an analysis, and analyze what the dynamic types of this expression could be. And if they are always such that the dynamic check would pass, you have verified that the program will never fail any of those dynamic cast checks. This is indeed what high-assurance software does; it proves such things. And this is again an instance of flow analysis: you are asking what objects can flow here, and what their dynamic types are. If they all agree with the dynamic check, you have proven that those checks cannot fail at runtime. Let's see, do I want to do this one or not? Let's skip this one for now.

All right, one more. And this is probably the most interesting one, since we spent half of the semester looking at dynamically typed languages, and they are becoming more and more popular. You know how we fake objects: we build them on top of dictionaries, of tables, right? So an access to a field looks like this; it's a lookup of a key in a dictionary. And we would like to optimize it by compiling it into code that looks like that: it doesn't compute the hash value of "f2" and go into the dictionary, a hash table, which is very expensive, but instead represents the object as a struct, and then the load is just one instruction that takes the value of p and adds four, which is the offset of the f2 field in that struct, and that's it.
So when is it possible to translate an object that is truly a dictionary (because that's how the language defines it) into a struct and perform this really efficient access to it? Can we think of conditions under which that's safe? Again, this is something the analysis would have to determine at compile time. Okay, so let's start. If we know that the table always has an f2 field, we could make it into a struct, and one field in that struct would be f2. Can we relax that a little bit? Do we actually need to know that the table must always contain the f2 field? In fact, one naive way of compiling objects in our CS164 programs, or in Lua or JavaScript, is: find all the field names that exist in the program. Yes, they are in different objects, in different classes, but if you can find them all, you just build one huge struct that contains the fields from all the classes, and you give each of them an offset. And each time you create an object, you create another instance of that struct. It would not be exactly memory efficient, but at least it would be fast in terms of time. And yes, there would be objects that don't use all the fields, but that's fine. So we don't quite need to know that the table always contains f2; we are happy to put more fields there, and at least for correctness's sake, it doesn't hurt.

What do we really need to know? Whether the size of the object is static, okay? If somebody dynamically adds more fields, that would be a problem, okay? Now imagine we deal with that with the hybrid approach that was suggested: you have a part that is a struct, with the fields you know about, like f2, and the things that are added later go into a dictionary (you know how Lua does arrays; it will be like that). Indeed, the Lua interpreter does have this hybrid approach, and it makes sense. So yes, you're right, that condition requires a special runtime data structure, but it's doable. You also need to know the sizes of those fields. If we are in a language like Python or Lua or JavaScript, all the fields are essentially the same: they are pointers to boxed objects, like integers, or they could be arrays, so all of them are, say, 64 bits. If you wanted to be more efficient and actually store the int values unboxed, where they are int values, then you would need to prove that they are always int values, which is an extra level of analysis.

So we still haven't gotten to the key condition for when this is safe. What could really break if we do it? Okay, all right. So that's it; this is what we cannot afford. If we are accessing the table through the .id operation, which is sugar for that, then we are safe. But when we are accessing it this way, which means the name of the field is a string, potentially computed at runtime or potentially read from the input, then it is not safe to represent the object that way, because something like p.f could be accessed in two ways: one through p.f, and one through this other way, p[x], where x comes from who knows where, okay? So let's look at this example here. Here is why it is dangerous when you don't know what is going to be read from the input: if the object that you want to represent as a struct has such an access performed on it, then it is not safe to do this really important optimization. So let's look at this, and have a look at this line. Here is where we are creating objects of type Foo, represented as a dictionary, of course; it's JavaScript.
And we would like to know whether it is safe to use a struct here, and, if it is safe, what fields the struct contains. So can you look at the program and determine that? I'm really asking you to look at the program, analyze it in your head, and see: does the program use those objects created in line one (there is more than one object, of course; assume this is all sitting in a loop somewhere) in such a way that it is safe to use a struct rather than a dictionary? Now, the browser battles that you have seen: this is essentially the battlefield of those browsers. These JavaScript interpreters are trying to figure this out and find really efficient implementations. So we are looking at what the engineers building those interpreters and just-in-time compilers would like to determine. Is it safe to use a struct here or not? Clearly, if the interpreter makes a mistake, if you are using the browser and the interpreter uses a struct when it was not safe, then I can hijack the browser and steal your credit card numbers. You don't want your 100 million users across the world to be exposed to that, so it's important to determine this correctly.

So is it safe in this case or not? If it's not safe, then you should show me a scenario under which you could hijack the program with a struct: write an int into it and read it back as a string, or vice versa, or write into an address that you should not be writing to. Okay, so let's come up with an example in Foo. What could happen in Foo that would prevent using a struct? Let's for now simplify things and assume that Foo is nice and friendly: all the constructor of Foo does is initialize a field, this.f. Imagine you can read it, you can analyze it, and you know that's what happens. So in the nominal semantics of the program, the result of new Foo is a new object with one key, f, with one value, 7. Now, under that assumption, can we use a struct rather than a dictionary? We have r.f here... oh, I see, we do, all right. So let's change something here; we'll make it an object, good point. Yeah, so the point was: if that value was indeed 7, then this line here would have a problem, okay? So bar presumably now has the field f. So is it safe? Not safe? See, this is the analysis that we would like to build, so that we don't need to stare at the program for minutes; instead, ideally within a few milliseconds, the compiler determines whether it's safe. So what do you think?

That's correct reasoning, but let's try to make the reasoning simpler. How about this: which objects are we accessing with the dangerous notation? We said that we can translate objects into structs if we access them through the dot-field notation, because then the name of the field is always known from the program source. But if somebody accesses an object through the bracket notation, the key could be anything, and you don't know what key you are actually reading. So these are the only two instances over here where that can happen, okay? And they work on what object? On whatever is stored in variable s, right? Now, what can s point to? Could s point to an object created here in line one? It couldn't, right? s always holds an object allocated afresh right there.
So we have obtained a proof that in this program, the only objects accessed through the bracket notation (which is the dangerous one, because it lets you access an arbitrary key whose name you know only at runtime) are those created here, and the set of those objects doesn't overlap with the objects created on the Foo line, right? In other words, the Foo objects are safe: they are not going to be accessed here, and they are not going to be accessed here. So you can do your optimization and run perhaps 100 times faster. Well, maybe 10 times, maybe 50 percent; still a big deal, okay?

So how do we build an analysis that can actually do all of this? First of all, what do we expect from a static analysis? As I alluded to at the beginning, it's not like the type system, which says: I type checked the program and everything is hunky-dory, or I didn't, and you need to rewrite it. The type checker says: I cannot even compile the program if you don't fix the types. The analysis, when unsure, says: I don't know, but this is what I learned, and the rest I cannot guarantee. So the answer from the analysis must be conservative, in the sense that the client of the analysis (the optimizer, the verifier, the security analyzer) will not do anything dangerous based on it, okay? It must err on the side of caution. And if this is the specification for the analysis, it seems really easy to satisfy, right? How would you build an analysis that satisfies it? Well, you do nothing, and then you say: I don't know. That's exactly what you could do, but ideally you would like to do better than that; you would like to answer at least, say, half the queries that the client asks you. There are several ways to be unsure. Perhaps you will realize: oh, this value is a constant along this path, but along the other path I don't know the value of x, because it's incremented by some unknown value; in that case you have to say, I don't know what the constant is. That should be clear by now, but let's spell out what it would mean to mislead the client. In the taint analysis, you would like to verify that in the program there is no way for a value from the dangerous input, which the attacker controls, to flow into a SQL command. You would like to make sure that no such path exists. So saying "all paths are fine, no such path exists" when you are not sure would be misleading.

Okay, so what analysis does all of this? It's a flow analysis, right? If you look back at what we were asking, we were really asking: how does a value flow through the program? What can it reach from here? Now, it may not seem like all four clients we looked at were of the same kind, right? The first one was constant analysis; we asked, what is the constant at this point here. Is that really about flow? Essentially yes: we ask, do all the values that can flow here have the same value, okay? For taint analysis, we ask: do all the values that can flow to the SQL command have their taint bit cleared by a sanitizer? For the verification of casts, we ask: of all the values that can flow here, what are their dynamic types? So it's all of the same kind; it's just a question of whether it's constants or pointers that are flowing. So essentially, we are talking about flow from points where things are created, like constants or tainted values, to uses, which could be the SQL command or virtual calls.
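To make that source-to-sink picture concrete before we build anything, here is a minimal sketch in Prolog of how a client like taint analysis sits on top of a flow relation. Every predicate here is a hypothetical stand-in: flowEdge/2 is one direct value-flow step (the relation the rest of the lecture shows how to infer), source/1 marks untrusted inputs, and sink/1 marks dangerous operations like SQL commands:

```prolog
% reaches(A, B): a value can flow from A to B through zero or more
% direct flow edges. (On cyclic flow graphs this needs tabling to
% terminate; the full program later uses it.)
reaches(A, A).
reaches(A, C) :- flowEdge(A, B), reaches(B, C).

% The client's question: can a tainted value reach a dangerous sink?
tainted_flow(S, K) :- source(S), sink(K), reaches(S, K).
```

Note the shape of the answer: a solution to tainted_flow is only a "may flow"; it is the absence of any solution that constitutes the proof that the program is safe.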
Okay, so let's look at how to build this analysis. We assume initially that we are dealing with Java. Java is easier because, thanks to classes, we know all the fields in the program, okay? So this is what we are going to do. We take the Java program, and we know that these are the sources of our information; this is from where the values will flow. Imagine we are doing the optimization of virtual calls: the objects are created at new sites, so this is where those values flow from. What we'll do is rewrite the new expressions into special values, O1 and O2. Think of them as constants; we don't really know what those constants are, and they are really not constants, so we are sort of abusing the notion of a constant. This is the first transformation we'll do, but note that I'm making an assumption here for now: I'm handling only assignments, no .f field references. Imagine all you can feed me at this point is a program made of assignments, okay? So I change the news into constants, and now I have turned the program into a constant propagation problem, okay? And what do we do? We propagate these constants O1 and O2. We think of them as abstract objects: they are not the concrete, dynamically created objects; they are abstract objects that stand for all the objects that could be allocated at that site. And we propagate those, okay? That's essentially all we need to do.

Now we'll consider statements that write into a field, which we'll call putfields, and statements that read from a field, which are called getfields in Java. We have added those into our programs, and now we want to do analysis on programs that contain not only creations of objects and assignments, but also putfields and getfields. And there is one more question that we need to ask. The question is: could this and that ever point to the same object? And if they do, what have we learned? Why is that the important question? Okay, so that's exactly right. If y and w could be so-called aliased, meaning they are two names for the same object (y and w really point to the same object, so you could say they are aliases), then the value that is assigned into the f field of y could be read out from the f field of w, and a new flow is established as a result. In this program we now see another kind of flow. We see that this object flows to x, this one flows to z; this one can flow to x, because it's the same value, and therefore it can flow to w; it can also flow to y, right? I'm showing you how the value flows. If y and w are aliased, then we have just inferred another flow. Between which points of the program did we just get a flow? Yes: z goes into y.f for sure; it goes into the field. Then, possibly significantly later in the program, it is read out of the same object under the name w.f, because y and w are aliases, and it flows to v. So now we have inferred another flow, and this flow here is a consequence of the fact that the base pointers y and w are aliased and we are writing into and reading from the same field f.

So that's essentially the logic, except that now we somehow need to turn it into an algorithm, an algorithm that will analyze the program. I showed you something intuitive: you know how values propagate, and you know what it means for two expressions to refer to the same object. What do we do to make it a scalable algorithm? Or maybe we should first go through several concerns that one runs into when trying to analyze programs.
So what would be a few questions we need to answer? That's essentially right: we will set up a bunch of constraints, and we would like to set them up and then solve them, effectively computing a sort of transitive closure of where the values can propagate, right? How far can the values reach? We compute the furthest reach of the values, until we cannot propagate them any further; then we know that those values cannot go any further, and we have proved that they stop where they stop. So the tainted values will not propagate to our SQL commands. But the key question is: which constraints do we set up? There are a million ways to do it. So let me point out just one concern. If you look at this y here and this w, these pointer variables are referenced at many points in the program, and at different points of the program they could refer to different objects, okay? How do we deal with the fact that y and w could be aliased here, but maybe are not aliased somewhere else? Are we going to keep track of those constraints separately at each program point? You see the issue: we potentially need to keep track of information that may be true here but not true somewhere else, because we overwrite y and w, okay? So what would be some suggestions?

Okay, so what you are essentially proposing is the transformation called static single assignment. It doesn't matter what it's called, but you are essentially saying: if I use y here and I use y there, let's rename them into distinct variables. I call one y0 and the other y1; now I have made a distinction between them, and I'm all set, okay? And I'll do it in such a way that each of these variables is assigned at only one point in the program, and therefore each variable has the same value everywhere. I could do that. But it doesn't solve the problem yet, because it could be that the object O1 has different contents at each program point. Maybe O1.f at this point points to O2; maybe later in the program O1.f changes and points to O3. And now you need to track that for all the objects, and it becomes hell; in fact, the problem is really hard, right? Think about it: the abstract objects that you allocate contain fields, and it seems you would like to know what those fields point to, what abstract objects they point to. Effectively, you somehow need to capture the state of the heap, and if the heap changes, it turns out that such a transformation is not so easy to do.

So let me show you what we just discussed. Rather than keeping track of each program point separately and saying, oh, this is true about the program here and that is true about the program there, what do we do? What is, again, a cheap thing to do that's guaranteed to be safe, maybe at the cost of some precision? Yes, essentially that's how we do it. But there is a simple transformation that explains why this sort of makes sense. Rather than keeping each of these points separate, what do we do? Can we transform the program so that we get this somehow for free? Or we'll do something even dumber than that. If you look at this program here, it has several program points: here is one program point, here is another, here is another; before every statement you have one program point, right? These are distinct points at which different facts may hold. What do you say?
Hey, I don't care; I'll collapse them all into one. And essentially, this is what we are saying. We take these points and collapse them into one, and we create a program that looks like this: you have a while (true), and inside it a switch statement, okay? And here we have statement one, statement two, statement three, statement four. In each iteration of the loop, the switch statement may execute any of the statements. Now, this program clearly is no longer the same program, right? I collapsed those points and essentially threw all the statements into one bag. And in fact, I'm putting a devil into the switch statement: I'm saying, really consider all possible ways the statements can interleave. Why is that a safe, conservative thing to do? Exactly: because if this program is safe, if this program cannot do anything bad, then the original couldn't either, because all the execution paths of the original program are contained in here. I have created more paths, on which I may discover that some properties don't hold, and then I will say: sorry, I cannot prove that x is a constant, because of those paths that I created. But when I do prove something, it definitely was true about the program before I smashed it together. An analysis on a program transformed like this is called flow-insensitive, because we are no longer sensitive to how the values flow through the program. We said: well, they can flow in whichever order they want. So we lose the sensitivity to how the execution progresses through the original program. Okay, now why is that interesting? Because now we have one state; we do not need to keep track of the program state point by point. We don't need to know what is in this field of this abstract object here versus there. This is what will allow us to build the four-line program that will analyze the program. All right, so we talked about why it is sound. And yes, it is of course potentially imprecise, because we have introduced paths that didn't exist in the original program, and on those we may discover that the tainted value could flow to the SQL command even though in the real program it couldn't. This is the over-approximation of the analysis.

All right, so let's develop the analysis. We want an analysis that is four lines of code, which means we cannot handle all possible Java statements from the specification; there are just way too many. We need to find a set of canonical statements that can represent all Java programs in some way. Think of it as your core language, into which we'll desugar everything. So can you think of four statements that can represent all Java programs? Now, that sounds like a fun challenge, right? Java has function calls, arrays, objects, arbitrary expressions: p.f.g, calls like p.f(i).g, where the call goes into another function and the result is dereferenced again. Legal Java, right? We need to analyze that. So, four statements; can somebody tell me the four statements that we'll need? All right, the first one: we need put and get, okay. What else do we need? What was the first suggestion? Declarations, yes, the introductions of variables, okay, good. Let's put declarations there; I'll finesse that a bit later, because I don't want it to count as number five, all right? So what else, okay? No, I won't let you have function calls. We don't want to deal with functions.
I'll show you how we'll desugar away the functions. You could handle them directly, and you would get a more precise analysis than the one we'll have, but in this one, let's say no functions. Methods? I would say these puts and gets are essentially accesses to fields, whether those fields hold methods or not, okay? Assignments: we need assignments. And now, about the declarations of variables: when you have something like that, I'll make it an assignment. Imagine you assign the variable right where it is declared. Essentially, we'll ignore the declarations and only take the assignments. One more? I see, okay. Right, so essentially you want to model the input, as a way of saying we don't know anything about this value. Yes, that would be a sort of special right-hand side of an assignment. So I'm not treating it as a special statement, but yes, you would need a special marker, a could-be-anything kind of marker. That's an excellent point. It turns out that we don't really need it if we only care about the flow of pointer values, because a pointer is either a legal object in Java, allocated at some new site, or null. You cannot, in Java, have an input value flowing into a pointer; such a program would be rejected by the type checker. But we do need the new sites, the news, the allocations. So here they are, here are the four: the allocation, which is assigned into some variable; the assignment; the get field; and the put field. Everything else will be desugared into these, or ignored. Why ignored? Because if the program contains arithmetic like x + y, and we are only analyzing pointers, then we can safely ignore x + y. It just doesn't matter; it doesn't influence how the pointer values flow. Unless you look at arrays, but we'll get to arrays. Okay, so here is, first of all, how you desugar something complex like that into the canonical statements. What you see here is essentially a translation into bytecode, right? You're familiar with this.

How about method calls? Let's look at method calls first, and then I'll return to arrays. If we have a method call like this, okay, what do we do? We introduce a variable for the return value, okay? And whatever the function returns (this one returns x.f) is just assigned into that return-value variable. How about arguments? Well, this function has an argument x, right? So how can we model it? We can just have a variable called x; if you need to distinguish it from other x's, you just give it a special name. So function calls and returns are just modeled as assignments into parameters and back from return values. Are we approximating something by doing this? Are we losing some precision again? Okay, so that's a different thing; that is issue number one, how we pass arguments and return values. The other issue is that if you have a virtual call, you don't really know which functions can be called, okay? Can you suggest two ways of handling it? Exactly, so we'll do that. Essentially, we do the collapsing again and say: I could call this function, or that function, or that function, and we make assignments between the actual arguments and the parameters for all of them. But we would like to be precise. It turns out that this matters a lot for the precision of the analysis. So we would like to reduce the set of targets and not be extremely conservative. How would we reduce the set of targets that the call might invoke? What would be a simple technique that you could use?
Well, the simplest one is to just call all of them, okay? A better one: could we look at the type hierarchy and figure out which functions could be called, given the static type of the receiver of the p.f() call? Okay, we could do that: look at the static type, look at the type hierarchy, all right? What could we do better than that? Now, if the static type of p tells you that you could call one f, or another f, or another f, how can we do better than looking at the static type? We can compute the dynamic type, with this very analysis. The analysis itself can sort of interleave the two: it computes the dynamic types of p and says, oh, I could call this function, and throws that into the graph; and when it later discovers that the other function could be called too, the graph grows again. So the analysis does two things simultaneously: it propagates the set of facts, as well as the information about which functions could be called, and it keeps throwing those in. The graph itself grows, and the things that are propagated through the graph grow as well.

Arrays. So I said that if the program does x + 7, we'll ignore it, because we care about the flow of pointers, right? But clearly x + 7 could be an index into an array. What we'll do in this analysis is take all the elements of the array and squish them together, again. The array is going to be represented with a single field, called arr, which represents all the values that could end up in any of the elements of the array. And you're starting to see the key ideas behind static analysis of programs, which is becoming more and more important because the world is becoming distributed. People who send programs to each other are now adversaries, right? You are happily downloading programs into your browser. So these analyses make more and more sense, and they will only become more important.

Now, you know from other courses (CS 70, perhaps) what an undecidable problem is, right? A problem that no algorithm can answer. It turns out that most properties we want to ask about programs, even one as simple as "does the variable x have the value 7 on every possible input?", even the simple constant propagation property, if we want to compute it precisely, are undecidable. Meaning there is no algorithm that can answer them precisely. And the simple intuition is that if you really want to do it exactly, you need to consider all possible executions. And how many possible executions are there? Infinitely many, unless you can bound the space of inputs. If you tell me you can only pass four-bit integers into the program, then you have bounded the set of possible inputs, and then of course you can answer the question. But if the set of inputs is unbounded, because you can feed the program arbitrarily big documents or strings, then almost none of the questions you would like to answer can actually be answered precisely, because the problems are undecidable. And that's because there are unboundedly many executions. So what do we do? We collapse them into some bounded number of executions. And what you see here is another collapsing: the array could be arbitrarily large, but what do we do? We say: well, we're going to smush all these elements into one; all the values will be kept in one location. Now the array is bounded.
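In the fact encoding we are about to use (the predicate names below are my guess at the slide's notation, introduced properly in a moment), this array collapsing is just more desugaring, into a put field and a get field on one distinguished field:

```prolog
% g[i] = x;   a write to any element becomes a write to the single
%             collapsed field arr; the index i is simply dropped.
putfield(g, arr, x).

% y = g[j];   likewise, a read of any element is a read of arr.
getfield(y, g, arr).
```

Dropping the indices is exactly the bounding step: whatever pointer goes into any element may come back out of any element, which is sound, just diluted.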
A lot of approximation happens here, because if one element is always 7 and another one is always 3, we now only know that each could be either 3 or 7. That's the result of diluting the information. All right, so let's write down the algorithm. It's only four lines, so it should be simple. We are going to compute two binary relations; these are really Prolog relations, over variables X and abstract objects O. I'm going to write them this way because it turns out to be somewhat easier to read. "X points to O" will hold when X may, during the execution, point to some object allocated at the abstract location O, which is exactly when a value of O can flow to X. These are inverses of each other, right? These are the properties we are going to compute: if "X points to O" holds, then O flows to X; if O flows to X, then X points to O, all right?

Now, does this support our clients? Because if you want to figure out the targets of virtual calls, what do you do? You ask whether X can only point to abstract objects of a particular dynamic type; the dynamic type is the class behind the new. You ask: what are all the objects that can flow to my virtual call? If all of them have a particular dynamic type, then you can do the optimization, right? So if we have p.f(), you ask: what are the values that can flow here? And if those that can flow here are O1, O2, O3, and all of them are new bar (this is a new bar, and this is a new bar), and this direction is flows-to, or in the other direction points-to, then the optimization is safe, because p contains a bar for sure.

Okay, so what is the first rule? Look, we have four statements, so we essentially just need four rules. When we see this in the program (this is a new statement in the program), we are going to translate it into this fact. Think of it as a Prolog fact; once we write it in Prolog, it will be written as something like new(p, o1), but for now I'll write it as this binary relation. If I see this fact, meaning I have such a new statement in the program, what can I infer about flows-to, or points-to? What inference can I make? So this is exactly the rule we have: if we see this, then we know that O1 flows to p. We could also infer the inverse, but we don't have to, because we can just compute the flows-to relation and then read the inverse out at the end by inverting the Prolog query.

Now, how about this one, the assignment? How will an assignment lead to more inferences? Exactly. Imagine that O1 flows to r, and we see an assignment from r to p; then we can infer that O1 flows to p as well. Or you can draw it like this: if we have already inferred this edge (you can represent the relation as an edge from O to r, a flows-to edge), and you see an assignment statement from r to p, then you infer this edge over here. So you see now the relationship to parsing, right? We see two facts, and we make an inference that we place as another edge on the graph. That's how we want to visualize it. How about a put field? Now you have to think about what happens here. So we see this statement here in the program.
It's a put field. We'll represent it as this sort of fact, essentially saying that we see an assignment into p.f from a. We'll represent it in Prolog as something like putfield(p, f, a) (sorry, this should be a small f), if that's easier for you to think about. This will be our Prolog fact that the parser derives from the existence of the statement: the parser reads the statement and generates this Prolog fact, everything including the dot at the end. And we'll probably use strings here to represent the field name f.

So what would the inference rule be here? Now we are doing the interesting stuff, because going through assignments is a toy, right? You're just propagating the flow. But what do we do with the heap? Okay, so let me write it this way: for all objects O such that O flows to p (so we are really saying, take all the objects that flow to p), and for all objects O' such that O' flows to a, what do we do with them? Essentially what we are saying is: p points to some set of objects; call it the set O, capital O. And a points to some other objects, called O'. These sets could overlap, of course. Now, what do we do in this inference? We can actually compute those sets, but what happens here? Okay, so now you're saying that we need to somehow represent O.f, what O.f points to, and that's a little bit of a mess. So what we'll do instead (and this leads to a really compact inference, and here you are sort of learning how to write Prolog) is not write a rule for the put field in isolation. We'll write a rule for essentially all pairs of put field and get field statements on the same field, okay? If you remember the earlier example: at some point we had a write p.f here, and a read r.f there, and if those two base pointers were aliased, then we said, oh, the value may flow from here to here. And this is essentially what we are encoding with this rule. When you see some put field to f and some get field from f, this and that, then if p and r are aliased, the value will flow from a to b, okay? So let's read it: O1 is whatever flows to a; this corresponds to that statement, and that corresponds to that statement; those are just the facts derived from the statements. And we ask whether the two base pointers are aliased; if they are, then O1 will also flow to b, right?

So it remains to define this alias relation, and how do we do that? Well, that's very simple: if there exists an object whose value flows to X and also flows to Y, then these two are aliased, right? That's essentially what the rule says. And if you want to see it in Prolog: here we have a program, and here it is translated (by the parser, think of it that way, even though I did it by hand) into those facts. These facts represent the program, and here are our inferences: the rule for the new statement, the flow across assignments, the flow due to put field and get field with aliasing, and then the derived flows-to facts. This is just Prolog, rewritten from those more formal-looking statements, and here is the alias rule: X and Y are aliased if there exists an O that flows to X and flows to Y. So, four lines of code.
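Since the slide itself is not reproduced in this transcript, here is my reconstruction of those four lines as a runnable Prolog program. The fact and predicate names are one plausible spelling, not necessarily the lecture's; the table directive is needed because plain Prolog's depth-first search would loop on these recursive rules (read as Datalog, they are fine as-is):

```prolog
:- table flowsTo/2, alias/2.   % tabling (SWI-Prolog, XSB) makes the
                               % mutually recursive rules terminate

% Base facts produced by the parser from the four canonical statements:
%   new(X, O)            x = new C()   (O names the allocation site)
%   assign(X, Y)         x = y
%   putfield(P, F, A)    p.f = a
%   getfield(B, R, F)    b = r.f

% 1. An allocation flows to the variable it is assigned to.
flowsTo(O, X) :- new(X, O).

% 2. Whatever flows to the right-hand side flows to the left-hand side.
flowsTo(O, X) :- assign(X, Y), flowsTo(O, Y).

% 3. A put field and a get field on the same field F transfer a value
%    when their base pointers may be aliased.
flowsTo(O, B) :-
    putfield(P, F, A),
    getfield(B, R, F),
    alias(P, R),
    flowsTo(O, A).

% 4. Two variables are aliased if some object may flow to both.
alias(X, Y) :- flowsTo(O, X), flowsTo(O, Y).
```

Notice how rule 3 never materializes what O.f points to: the heap is bypassed entirely by pairing writes with reads, which is exactly the trick discussed above.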
Of course, it was all made possible by all the smart collapsing that we did, but the analysis is actually not so bad. Before we look at an example, it's important to realize how we actually use the result of the analysis. Imagine we run the inference, and I discover that, oh, this object may flow to this variable. Can I rely on that fact? Is it guaranteed to flow there? Remember what we've done: we collapsed all these paths, and then we discovered that some object from some new expression can flow to a variable. Does that mean it is guaranteed to flow? No, it only means that there exists a path along which it may flow, and that path may not even exist in the real program; it could be a path that we created in this over-approximation by collapsing everything, right? So it looks like we've done something useless: we computed that things may flow, but we don't know for sure that they will flow. So how do we use the result of the analysis? It's like asking about the weather, and the weatherman tells you: well, it may rain tomorrow. If you are not sure, you cannot exploit the fact that it will rain. So what do we do? Right: the interesting information is actually when we discover that something cannot flow, because then we have proved that this cannot flow there. It's the absence of the inference that is the interesting part. If we discover that the value 7 can flow to an integer variable x, and no other constant can flow to it, then it's got to be 7. If we discover that no tainted value can flow here, then we have proven that the program is safe from these tainting attacks, right?

So here is an example, the one that we had in the Prolog program. Here it's translated into these six facts: six statements translated into six facts. And you could now show how the inference happens (because of the new, O1 flows to x) and carry the inferences further; essentially, I'm just running the Prolog program here. The more interesting way to look at it is to represent the program this way, as a graph, and view the inference as a visualization on that graph. What we have here are just the statements, okay? The statements of the program are sort of the base facts, and now we are going to do the inference. And what do we see? Because of that, we have flows-to; or you could say points-to. I think we are working with points-to, right? No, we are doing flows-to, okay? So let me erase this. Now, what would be the next inference that we make? Right, this edge simply means that O1 flows to x. What is the next one? Well, okay, oh, that's good. So let's look at the rules. Could it be that we determine this one? No, that's not in our rules, but we will determine that it flows here, and it flows here. As a result, because there is a path from O1 to w and from O1 to y, these two are now aliases, okay? And because they are aliases, we now see that a value can flow not only here, but also there. So now we can read out the result of the analysis and see that v could indeed point to O2, but it cannot point to O1. And so we have learned something interesting: we have proved that v cannot hold an object from the O1 site, and that's useful for optimization and so on. So that's essentially it.
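For reference, here is one plausible reconstruction of that six-fact example (the variable and object names are inferred from the discussion, not copied from the slide), together with the queries that read out the result, using the rules above:

```prolog
% The six statements as base facts:
new(x, o1).          % x = new ...   (site O1)
new(z, o2).          % z = new ...   (site O2)
assign(w, x).        % w = x
assign(y, x).        % y = x
putfield(y, f, z).   % y.f = z
getfield(v, w, f).   % v = w.f

% o1 flows to x, w, and y; hence alias(y, w) holds; hence whatever
% flows to z (namely o2) flows through y.f / w.f into v.
%   ?- flowsTo(O, v).
%   O = o2.          % v may point to an O2 object ...
%   ?- flowsTo(o1, v).
%   false.           % ... and provably never to an O1 object.
```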
So what is interesting is that this is a way of visualizing Prolog inferences, except that it works only for binary relations, with two arguments. And you saw that we had to cheat: the f, which was the third argument, we had to fold into the name of the predicate so that the relation becomes binary, and then we can visualize everything as edges. So that's it. Thank you, and I will see the next few teams in 15 or so minutes in Soda, in 606; by the way, 606 Soda.