Hi and welcome back to the program analysis course. This is part 3 of the lecture on information flow analysis, and what we'll see in this third part is how to actually analyze information flows. So given an information flow policy, which, as we've seen in the previous video, tells the analysis what it should check, we will now see how the analysis actually checks the flow of information through the program.
At a high level, what an information flow analysis is supposed to do is to check for violations of a given information flow policy. This kind of check has various applications. Probably the most common one is to detect vulnerable code. For example, you may want to check whether some code is vulnerable to, let's say, an SQL injection, because some input that comes from an untrusted user, so a potential attacker, could end up in something that is then interpreted as part of an SQL query. By tracking whether this kind of propagation is possible in the code, you're essentially checking whether the code is vulnerable to SQL injection attacks. Another application is to check if some code is malicious. For example, you may want to check if code violates some privacy policy, say by sending data that you know to be private out into the network. Or, more broadly, you can use this analysis to check if the program behaves as expected, for example by checking whether some secret data is written to the console. If you have specified a policy that would detect this, then the information flow analysis will help you check that the program does not show this kind of bad behavior.
So now, when an analysis tracks the flow of information through a program, there are two fundamentally different kinds of flows that it can track. These are called explicit flows and implicit flows.
So an explicit flow is a flow of information that is caused by a data flow dependency. We have some data that directly flows to some other memory location through a data flow, and this is an explicit flow. In contrast, an implicit flow is caused by a control flow dependency. We have some secret data that influences whether the flow of control goes this way or that way, and by having this influence it actually creates a flow of information, which is called an implicit flow.
For example, let's again look at the credit card example that we've already seen a couple of times in this lecture, where we had this secret information stored here, which is then assigned to this variable, and this variable is then used in this check here to eventually, if the condition is true, assign something to the variable visible. In this example we have both explicit flows and implicit flows. We have an explicit flow here at this assignment, because there's a data flow from credit card number to the variable x, which clearly propagates some information. We also have this implicit flow down here, where this conditional, which depends on the secret value, determines whether we execute this statement here, and because this is a control flow dependency, this is an implicit flow.
There are some analyses that only check for explicit flows, and the terminology differs a bit. Sometimes these kinds of analyses are called taint analysis, sometimes they are also called information flow analysis. So I would recommend not relying on the terminology alone, but, whenever you hear about some analysis like this, actually checking whether it tracks explicit flows, implicit flows, or both.
As with practically every program analysis problem, information flow analysis can be addressed in two fundamentally different ways, namely through static analysis and through dynamic analysis.
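To make the explicit and implicit flows of the credit card example concrete, here is a small sketch of what that code might look like. The initial value `4242` is a made-up placeholder; the lecture only says that this variable holds the secret, and the variable names are spelled the way they are spoken.

```python
# Hypothetical value; the lecture only says this variable holds the secret.
credit_card_number = 4242

# Explicit flow: the secret is copied through a data flow dependency.
x = credit_card_number

visible = False

# Implicit flow: whether the assignment below runs depends on the secret,
# so observing `visible` reveals something about `credit_card_number`.
if x > 1000:
    visible = True

print(visible)
```

Note that no secret data is ever assigned to `visible` directly; only the literals `False` and `True` are. What `visible` reveals about the secret comes purely from the control flow dependency.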
A static information flow analysis typically overapproximates all possible flows that could happen, so it overapproximates all possible data flows and all possible control flows, and as a result it tells you whether some information may flow from a secret source to an untrusted sink. In the course project you're actually implementing one of these static information flow analyses, one that only looks at explicit flows, because you only look at data flows in this analysis. As you've probably figured out by now, in the project you do not just report this overapproximation of may-flow, but you are also supposed to report whether some information must flow from a secret source to an untrusted sink. In general, though, if people talk about static information flow analysis, they usually mean an analysis that overapproximates and that reports may-flow information.
The other kind of analysis, which is the focus of the rest of this lecture, is dynamic information flow analysis. Here we associate security labels, sometimes called taint markings, with particular memory locations while the program is running. Then, while the program executes particular operations, we propagate these labels through the program, so that at runtime we can check whether some information propagates from a secret source to an untrusted sink.
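As a rough illustration of what attaching labels to values and propagating them at runtime can look like, here is a minimal sketch. The names `Label`, `Tracked`, and `add` are hypothetical, and the sketch assumes just two security classes where the more restrictive label wins when two labels are combined.

```python
from dataclasses import dataclass
from enum import IntEnum


class Label(IntEnum):
    """Two security classes; a larger number means more restrictive."""
    PUBLIC = 0
    SECRET = 1


@dataclass(frozen=True)
class Tracked:
    """A runtime value together with its security label (taint marking)."""
    value: int
    label: Label


def add(a: Tracked, b: Tracked) -> Tracked:
    # The normal computation happens on the values; alongside it, the
    # labels of both inputs are combined into the label of the output.
    return Tracked(a.value + b.value, max(a.label, b.label))


secret = Tracked(41, Label.SECRET)
one = Tracked(1, Label.PUBLIC)
result = add(secret, one)
print(result.value, result.label.name)
```

A real dynamic analysis would do this transparently for every operation through instrumentation rather than through wrapper functions like `add`.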
So before a dynamic analysis can check whether there is a flow of information, we need to define the taint sources and the taint sinks, so basically from where to where we want to track information flows. Here's a non-exhaustive list of possible sources that a program analysis could consider. First of all, there are of course variables; that's what we've already seen in the simple examples so far. But you could also say that the return values of a particular function are considered a source, so that everything returned by that function should be tracked. Or you could say that all data read from a particular I/O stream should be considered tainted, for example everything you read from a particular file.
You also need to define possible sinks. Again, these could be variables, but they could also be parameters given to a particular function. One very typical example is the eval function in JavaScript, where you want to track whether something may be propagated to eval, because eval will evaluate whatever you give to it as a piece of code and execute it. So if you want to check for code injection vulnerabilities, you would typically say that eval is a sink. Or you could say that instructions of a particular type are sinks. For example, you may not want any tainted value to influence the flow of control in your program at all, and in this case you would say that every jump instruction is a sink.
Now what the analysis does, given these taint sources and taint sinks, is to report any illegal flow of a taint marking from a source to a sink.
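Here is a small sketch of the source and sink idea, assuming a single taint bit per value. It uses Python's built-in `eval` as a stand-in for the JavaScript `eval` mentioned above; `Tainted`, `read_request_parameter`, `literal`, `checked_eval`, and `PolicyViolation` are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    """A string value plus one taint bit."""
    value: str
    tainted: bool


class PolicyViolation(Exception):
    """Raised when a tainted value reaches a sink."""


def read_request_parameter(raw: str) -> Tainted:
    # Source (hypothetical): every return value of this function is tainted.
    return Tainted(raw, tainted=True)


def literal(text: str) -> Tainted:
    # Literals written in the program itself are not tainted.
    return Tainted(text, tainted=False)


def checked_eval(code: Tainted) -> object:
    # Sink: the argument of eval must not carry a taint marking.
    if code.tainted:
        raise PolicyViolation("tainted value reached eval")
    return eval(code.value)


print(checked_eval(literal("1 + 2")))
try:
    checked_eval(read_request_parameter("1 + 2"))
except PolicyViolation as err:
    print("violation:", err)
```

The same string is accepted when it comes from a literal and rejected when it comes from the source, which shows that the check depends on where the value came from and not on its content.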
Let's now have a look at how the analysis propagates the taints through the program, so that it can eventually determine whether there is a violation of this no-flow-from-source-to-sink policy. Let's start with the first kind of flow, namely explicit flows, which is the easier of the two. The basic idea is that for every operation in the program that produces a new value, the analysis looks at the labels, so the security classes, of the inputs of that operation and then propagates these labels to the output of the operation. There may be just one input, there may be two inputs, there may be more; no matter what the number is, the analysis looks at the labels of these inputs, combines them using the least upper bound operator that we have seen when we looked at universally bounded lattices, and takes the result as the label of the value that the operation produces. So on top of the normal execution, which computes the result of some operation, the analysis also computes the result of this label-level operation to know what security label the result has. This is what happens for explicit flows.
Let's now have a look at what happens for implicit flows, so how does the analysis propagate information to track implicit flows? Here the idea is that the analysis maintains a so-called security stack S, which is essentially a stack of the labels of all those values that have influenced the current flow of control. So if we are at a particular location in the program because of some decisions, then the labels of the values that determined these decisions are on this stack. Precisely, this happens as follows. Whenever some value x influences a branch decision in our program at some location loc, the analysis pushes the label of x onto the security stack. Then, when the control flow reaches the immediate post-dominator of this location loc, basically the place
where the flow of control merges again and where we would go anyway, no matter what the branch decision was, this label of x is popped from S again, so that we do not have it on the stack anymore. And whenever any operation is executed while the stack S is non-empty, so basically whenever we are under the control of some secret value, all the labels that are on S are also considered as inputs to the operation, because essentially we are only performing this operation because of something that had a label that is now on S.
So let's illustrate this tracking of security labels using two examples. The first one I'll explain, and the second one is a little quiz for you. In example one we again look at this credit card code, and now we also need to define the policy that is supposed to be checked. In this example we have two security classes, namely public and secret. Our only source is this variable called credit card number, or more precisely the initial value it gets in the first line. And as the sink, as before when we looked at this example, we care about whether this secret information can flow into our variable called visible.
Let's now go step by step through every operation in this program and have a look at how these operations influence the security labels that the analysis is tracking. After the first assignment, so just after this line, the label of what is stored in credit card number is secret, simply because this is the source that we have defined. Then comes the second assignment. After the second line we have observed an explicit flow, because there's a data flow from credit card number to x, so we take the label of credit card number and make it the label of x, which means that after this explicit flow the label of x is also secret. In the third line we are assigning this value
false to visible, so after this line the label of visible is the combination of the labels of all the inputs. The only input here is this literal false, which by definition is public, so the label of visible is public.
Then comes the interesting part, because now we're looking at this condition of x being larger than 1000. When this condition is checked, the program produces some intermediate value, which is a boolean, so let's just call this intermediate value b. Of course this value, because it is a value, also has a label, and the label of b is the combination of the labels of all the inputs of this operation. Specifically, that's the label of x combined with the label of 1000, which means we take secret and combine it, using the least upper bound operator, with public. Public is the label of 1000 because, by definition, it's just a literal in the program. The result of this is secret. And now, because this is a control flow decision that depends on b, we push the label of b, so we push secret, onto our security stack S.
Next we reach this line here in the if block, where all the labels on S are part of the input of whatever happens in this block. The underlying reason simply is that we are only here at this line because of this conditional, which in this case was secret. So when we compute the label of visible, just after this last assignment here, we do not only take the label of true into consideration, which is public, but also the labels that are on our stack S, in this case secret. Because secret combined with public using the least upper bound operator gives secret, at the end we have the label secret attached to visible. And because visible is our sink, this is a violation of the policy, which is the outcome that you've of course already seen in this lecture, but now you also know in detail why this is
actually what the information flow analysis reports.
Now the second example is one that I'll just show you without telling you the solution. The idea is that you solve it by yourself, and if you're not sure about the outcome, feel free to share your answer in the ILIAS forum with the other students so that you can double-check what the solution is. But I really want you to try this on your own instead of just watching me do it. So here we again have a piece of code, and we again define our security classes, which again are just two, public and secret. The source in this case is this get x method, or specifically the return value of this method, and the sink is this foo method, or specifically the arguments that are given to foo. In the execution that you are analyzing here, you should suppose that get x returns the value five. Then you should do what the analysis is doing and write down the labels of all the relevant variables after each operation, and at the end you should be able to figure out whether there is a policy violation or not.
So now you've hopefully stopped the video and done the quiz, and we can move on to another interesting aspect of tracking implicit flows, which are so-called hidden implicit flows. So far we've only talked about explicit flows and implicit flows, but there's actually a subcategory of implicit flows that are sometimes called hidden implicit flows. These implicit flows are hidden in the sense that they only happen because a branch is not executed, which makes it really difficult for a dynamic analysis to track them, because a dynamic analysis by definition only tracks what happens and what is executed. So the approach that I've explained so far actually misses these hidden flows. As a very simple example to illustrate why this happens, let's have a look at this example here, where we have three lines of code. We have one variable called x, which initially
is false, and the label of x is public, so this is just a public variable. Then we have this other variable called secret, which is labeled as private. So whenever something from secret is assigned somewhere, then this should also be private, and if secret determines a control flow decision, then whatever happens under that decision should also be labeled as private. Now what this piece of code essentially does is to copy the value of secret into x: initially x is false, and if secret is true, then we also make x true. So after executing these three lines, the value of secret has been copied to x, even though there is no direct copying happening; it's just happening indirectly through the flow of control. But now if we see an execution where secret happens to be false, then this statement down here isn't executed, which means that according to the dynamic information flow analysis that I've explained so far, nothing is propagated, and we do not see that secret was copied into x even though it actually has happened.
Given that we have spent so much effort on defining what information flow analysis is doing, it would be a pity if we missed these hidden implicit flows. Fortunately, there's a way to make the analysis find them. The basic idea is the following. If we have a conditional with two branches b1 and b2, then the analysis conservatively overapproximates all values that may be defined or written in branch b1 and adds spurious definitions for them into the other branch b2, and the same the other way around. For the example that we've just seen, what the analysis would do is to essentially add this second branch here, which is executed if secret is false, and in it add this spurious definition, which just writes the old value of x into x again. So when you execute the program, no matter which of the two branches you take, there always is a write to x, so basically there
will always be a propagation from what is on the security stack, namely the label private, into x. By adding these spurious definitions, the analysis is able to reveal these hidden implicit flows, because it will always see some data flow under the control of the label of secret.
Everything we've seen so far is basically true for all kinds of dynamic information flow analyses. I finally want to have a very brief discussion of one implementation of a dynamic information flow analysis, namely an analysis called Dytan. There's a lot more information about it in the paper that I mentioned at the beginning of the lecture, so if you're interested in how this is actually implemented in practice, just have a look at that paper. As a very brief summary, what Dytan does is dynamic information flow analysis for x86 binaries. You give it an x86 binary, it instruments it, and when you execute this binary you can check for violations of an information flow policy. The way Dytan does this is by storing the taint markings, so the security labels, in bit vectors, and there is one such bit vector for every byte of memory in the entire program. So it's relatively expensive to do this, and there are a couple of tricks to make it less expensive. But by having these bit vectors that store the taint markings for every byte of memory, the analysis can propagate the taint markings while the program is executing. To do this it uses instrumentation, so it adds instructions to the existing program that look at these bit vectors and propagate the information in them whenever some relevant operation happens.
One final question is of course: when it tracks these implicit flows, how does the analysis know when it needs to pop something from the stack? Because if you just execute a program, you do not know when you've reached the end of a branch. The answer is that the analysis
before doing the instrumentation, also performs a static analysis: it computes a static control flow graph and then looks at the immediate post-dominators of conditionals in order to find out where the program execution merges again. At this point it pops the security label from the stack again, so that the label only remains on the stack as long as it's relevant for tracking implicit flows.
All right, so let me summarize this whole lecture on information flow analysis. This kind of analysis is supposed to track the secrecy of information that is handled by the program, so that you can check some information flow policy. You define the policy by essentially defining three things, namely the security classes, by defining one of these lattices, the sources that create values that need to be tracked, and the sinks into which you do not want these tracked values to flow. If they actually do flow into one of these sinks, then you want to report a violation of the policy. This kind of analysis has various applications; it's particularly used for security-related applications like malware detection or checking for vulnerabilities.
Having said all this, always keep in mind that even a perfect information flow analysis may not be able to track all kinds of flows of information through a program, because there are some channels that go beyond the data flows and control flows that an information flow analysis is tracking. For example, the program may also leak information through its power consumption, so that one process reads how much power another process is consuming, and that way they can communicate with each other. Or sometimes just the timing of operations can propagate some information. So just keep this in mind: even if you have a perfect information flow analysis, there may still be some flow of information that the analysis is missing.
All right, and that's already the end of this third part of the lecture
and also the end of the information flow lecture. I hope you've enjoyed it, and see you next time.
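As a recap of the security stack mechanism from this lecture, here is a minimal sketch that replays the credit card walkthrough by hand. It is only an illustration under stated assumptions: `Label`, `lub`, and `TaintTracker` are hypothetical names, the lattice has just the two classes public and secret, and the calls to `branch` and `merge` are placed manually where a real analysis would use instrumentation and immediate post-dominators.

```python
from enum import IntEnum


class Label(IntEnum):
    PUBLIC = 0
    SECRET = 1


def lub(*labels):
    """Least upper bound on the two-class lattice PUBLIC < SECRET."""
    return max(labels, default=Label.PUBLIC)


class TaintTracker:
    def __init__(self):
        self.labels = {}  # variable name -> current label
        self.stack = []   # security stack S

    def assign(self, target, *input_labels):
        # Labels on the security stack count as additional inputs.
        self.labels[target] = lub(*input_labels, *self.stack)

    def branch(self, condition_label):
        # A value with this label influences a branch decision.
        self.stack.append(condition_label)

    def merge(self):
        # Control flow reached the immediate post-dominator of the branch.
        self.stack.pop()


t = TaintTracker()
t.labels["credit_card_number"] = Label.SECRET  # the source
t.assign("x", t.labels["credit_card_number"])  # x = credit_card_number
t.assign("visible", Label.PUBLIC)              # visible = false
b = lub(t.labels["x"], Label.PUBLIC)           # label of the value x > 1000
t.branch(b)                                    # if (x > 1000)
t.assign("visible", Label.PUBLIC)              # visible = true
t.merge()                                      # end of the if block

violation = t.labels["visible"] == Label.SECRET  # visible is the sink
print(t.labels["visible"].name, violation)
```

The last assignment only has the public literal as a data input, yet `visible` ends up secret because the label pushed at the branch is still on the stack at that point.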