Hi, hello and welcome to the program analysis course. In this lecture of the course, we will look at information flow analysis, which is a kind of program analysis that is typically used to check the security properties of a program. What we'll do in this lecture is to first look at what an information flow analysis is actually supposed to do, and then we look at how it can be implemented. Here's an overview of the lecture, which will consist of three parts. We are right now in the first part of the lecture, where I will give an introduction to the topic, explain what information flow analysis is, what kinds of properties of a program it can check, and also which properties it cannot check. Then we look into some more details by first looking at information flow policies, which are essentially a way to specify exactly what properties the analysis is supposed to check. And in the third part of the lecture, we will look into the actual analysis, and in particular at the dynamic analysis of information flows. Most of what I'm saying here is based on these two papers, in particular the first paper, which is recommended reading. So if you're interested in more details, or maybe haven't fully understood something, or just want to know a little more than what I'm going to say here, then these two papers, and in particular the first one, are very good sources. Information flow analysis is a kind of program analysis that is typically used in the context of security, where the overall goal is to secure your computing system. Specifically, the goal of information flow analysis is to do this by securing the data that is manipulated by a computing system, and by enforcing a particular security policy on the way the program computes with and manipulates that data. There are two kinds of policies that an information flow analysis can check, and these two kinds of policies are related to confidentiality and to integrity. 
Confidentiality means that we have some kind of secret data, for example, a password, that should not leak to some non-secret place, for example, some output that someone who's not supposed to see the password could see. The inverse kind of property that you may want to check is integrity, where you have some high integrity data that should not be influenced by some low integrity data. So for example, you may have some information that you do not want to be changed, but there may be a way for an attacker to change it, or someone may change it without really wanting to, and you want to check using an information flow analysis whether this kind of violation of your security policy is possible. The way an information flow analysis checks whether there could be a violation of a confidentiality or an integrity policy is by checking whether information may propagate from one place in the program to a different place in the program. I'm putting place here in quotes on the slide, because place, of course, is not a very technical term. When we talk about a program analysis, then typically place means something like either a code location or a variable. So essentially, an information flow analysis is checking whether information that is stored in one variable may later be propagated to another variable. By doing this, an information flow analysis complements other kinds of techniques that also impose some kind of limit on how a program may handle data. In particular, for example, it complements other techniques that impose limits on releasing information. So information flow analysis is only about how data and information propagate within a program, whereas there are other security mechanisms, like access control lists or cryptography, that check other kinds of properties. In particular, those are responsible for making sure that information is only released in a controlled way. 
This is not what information flow analysis is doing. Information flow analysis is really about how the information is propagated within the program. So let me illustrate this idea of what an information flow analysis is supposed to check using a simple picture. In this simple picture, I'll use these circles here to designate places in our program. And when I say place, I basically mean everything that may hold some data, for example, a variable. Now, if you have a program, there are different such places. And some of these places may contain a particular kind of information. So for example, you may have one place that contains some secret information, which could be a password, a credit card number, or any other kind of information. It could also just be an email that is supposed to be secret and that should not reach all other places in the program. In particular, you may have some places in the program that you know to be untrusted. For example, this could be information that you send out through the network to some other computer, or that is written to a file that you know could be read by arbitrary people. And now, one thing that information flow analysis is going to check is whether it's possible that information may flow from this place that holds secret information to this other place that you know to be untrusted. So essentially, the analysis here is checking whether this kind of flow is possible. And if it does that, then it's actually checking a confidentiality policy. The inverse to confidentiality policies are integrity policies, where we also have some untrusted place. And in addition, we have some place in which we are going to store trusted information. So this is information that we will trust, because we know that it's computed in a way where we have some control over what data ends up in this place. 
And now the question that information flow analysis can also check is whether it's possible that there is a flow of information from this untrusted place to this place that stores trusted information. Again, here the question is, is this kind of flow possible? And if the analysis checks that kind of policy, then this is an integrity policy. Now, after this abstract image, let me illustrate these two ideas with examples. And let's start with an example that shows a violation of a confidentiality policy. So in this little piece of code, we are dealing with a credit card number that is stored in this variable called credit card number. And we do not want this credit card number to leak to this other variable called visible. And now the question is, is there actually a flow of information where some information about the credit card number flows to this variable visible? And in this example, there actually is such a flow. And to see this, we need to basically track the data that is stored in credit card number through the program. The first thing we see here is that the secret information, the credit card number, is stored into the other variable X. And then we have this check here, which checks whether X is larger than 1000. And if yes, it's changing the value of visible from false to true. So some information about the credit card number, namely whether it's larger than 1000 or smaller than or equal to 1000, will be propagated to this variable visible. So at least some information about our confidential data ends up in a place that is untrusted. As an example of the inverse kind of property, an integrity property, let's have a look at these three lines of code, where we have a policy that says that information that comes from this user input function should not influence who becomes president. So let's say there was an election, and let's say the designated president, which is stored in that variable here, is Michael. 
And now there's a call to this user input function, and we're storing the result of that call into designated president. So if our policy is that designated president is the trusted piece of information but user input is untrusted, then we do not want this kind of flow, where the user input essentially ends up in this variable designated president. So what we would have here, because of this assignment, is that low integrity information propagates to a high integrity variable. And this would be a violation of the information flow policy, because it's a violation of our integrity policy. Now, the previous example was relatively trivial, because the untrusted information was directly assigned to the variable that was supposed to contain only trusted information. Here's a variant of this example where it's a little bit more subtle to see why the untrusted input may influence the integrity of our designated president variable. So what happens here is that we again call our user input function, and then, instead of directly assigning the result to designated president, we're checking the length of the user input, and if this happens to be equal to five, and only then, we're changing the value of designated president to some other name, for example, Paul. So in this case, we also have a violation of our integrity policy, because some low integrity information indirectly propagates to a high integrity variable, namely to designated president, and the user input can essentially influence who becomes president. So as you can hopefully see from these examples, these two kinds of problems, confidentiality and integrity, are dual problems, because if you have an information flow analysis that can check whether some information propagates from one place in the program to another place in the program, you can use the same kind of analysis both for confidentiality and for integrity problems. 
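Since the slides aren't reproduced here, let me sketch what the two examples walked through above might look like in Python; the exact variable and function names are my assumptions:

```python
# Confidentiality example: one bit about the secret leaks into `visible`.
credit_card_number = 1234        # secret data
x = credit_card_number           # explicit flow: the secret is copied into x
visible = False                  # public data
if x > 1000:                     # the branch condition depends on the secret
    visible = True               # implicit flow: visible now reveals whether
                                 # the credit card number is larger than 1000

# Integrity example: untrusted input influences a high-integrity variable.
def user_input():
    # stands in for an untrusted source, e.g. data read from the network
    return "abcde"

designated_president = "Michael"     # high-integrity data
s = user_input()                     # low-integrity data
if len(s) == 5:                      # branch condition depends on untrusted input
    designated_president = "Paul"    # implicit flow into high-integrity data
```

In both of the flagged assignments, no secret or untrusted value is copied directly; the flow happens through the branch condition, which is exactly the kind of indirect propagation the analysis has to catch.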
In this lecture, I'm focusing on confidentiality problems from now on, but keep in mind that everything I'm saying applies to both confidentiality and integrity, because once you have an information flow analysis, you can use it for both kinds of problems. So now that you know what an information flow analysis is supposed to do, the question is how to actually analyze the flow of information through a program. And the key idea here is that the analysis tracks so-called security labels. The basic idea of these security labels is that we assign to each value in the program some kind of meta information that tracks the secrecy of that value. And then, whenever some program operation happens, so whenever we are computing a new value from some old values, we propagate the meta information, so that at the end we know for every value in the program what level of secrecy it has. So let me illustrate this idea using the simple example that we've already seen earlier, where we have this piece of code that handles a credit card number and that is not supposed to leak any information about the credit card number into the variable visible. So the secret value that we want to track is initially here: it's this value 1234. And now what the analysis is doing is to associate a piece of meta information with this value, namely that it is secret, and then propagate it through the execution of this program. In particular, it will associate this "is secret" information also with the credit card number variable, because the secret value is assigned to this variable. So everything that I'm underlining with this dotted line contains a secret value. In the second line of the program, we are assigning something that contains a secret value to a new variable, and because this is an operation that propagates the secret value, this new variable X is then also marked as containing a secret value. 
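To make the label-propagation idea a bit more concrete, here is a tiny sketch of how it could be implemented dynamically. This is a toy illustration I'm adding, not the actual analysis from the papers: values are wrapped together with a secrecy label, and every operation on them propagates the label to its result.

```python
class Labeled:
    """A value paired with a secrecy label (toy sketch of label tracking)."""

    def __init__(self, value, secret=False):
        self.value = value
        self.secret = secret

    def __gt__(self, other):
        other_value = other.value if isinstance(other, Labeled) else other
        other_secret = other.secret if isinstance(other, Labeled) else False
        # The result of an operation is secret if any operand was secret.
        return Labeled(self.value > other_value, self.secret or other_secret)


credit_card_number = Labeled(1234, secret=True)
x = credit_card_number        # assignment propagates the label with the value
cond = x > 1000               # the comparison result inherits the secret label
```

A real dynamic analysis would of course instrument all operations, not just one comparison, and would additionally track the label of the current branch condition so that writes inside a secret-guarded conditional are labeled secret as well.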
In the third line, the analysis doesn't really have to propagate anything, because false is just a literal which is assigned to a fresh variable, so there's nothing secret involved here. But then here in the fourth line, where we are computing this Boolean conditional that checks whether X is larger than 1000, the result of this Boolean expression will actually be a secret value, simply because one of its ingredients, namely X, has been secret. And then, because the fact that we execute this line here depends on a secret value, everything we write in this conditional piece of code will also be labeled as a secret value. So in particular, this visible variable will also be marked as containing something secret, and as a result we have a violation of our policy. Now, instead of defining what the analysis is supposed to do by talking about the propagation of individual pieces of data from one place in the program to another, we can also try to formulate this as a property of the entire program, and this is what the notion of non-interference that I'm talking about now is doing. Essentially, non-interference means that confidential data does not interfere with public data, and that's the property that an information flow analysis, at least if it's used for a confidentiality policy, is supposed to ensure. More precisely, it means that if there is a variation of the confidential input, then this does not cause any variation of the public output of the program. And that means that if an attacker can only observe the public output of the program, then this attacker cannot observe any difference between two executions that differ only in their confidential input. So this idea of non-interference basically says, at the level of the entire program and at the level of an attacker who can observe something about this program but not everything, what the information flow analysis is supposed to achieve. 
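Non-interference can also be phrased operationally: two runs of the program that differ only in their confidential input must produce the same public output. Here is a minimal illustration I'm adding, using the earlier credit card example packaged as a function; it deliberately fails the check.

```python
def program(secret):
    """The earlier example, as a function from secret input to public output."""
    x = secret
    visible = False
    if x > 1000:
        visible = True
    return visible  # the publicly observable output


# Vary only the confidential input and compare the public outputs.
out_high = program(1234)
out_low = program(999)
# The outputs differ, so an observer of `visible` learns something about
# the secret: this program violates non-interference.
```

Note that such a two-run comparison only demonstrates a violation for the inputs tried; establishing non-interference in general requires reasoning over all pairs of executions.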
All right, and this is already the end of the first part of this lecture on information flow analysis. I hope you now know what information flow analysis is supposed to do and have an idea of the kind of property that it's going to check. What we'll do in part two of this lecture is to formulate a bit more precisely what an information flow policy looks like, which then defines what exactly the analysis is checking. Thank you very much for listening, and see you next time.