Hi and welcome to part three of this first lecture in the program analysis course. In this third part we'll look at some of the foundations of this course. This is material that you've probably seen in some other course during your studies, so this is just to make sure that everybody is on the same page and we start from the same level when we talk about the more advanced topics. We'll look into grammars and then some representations of programs, for example ASTs, abstract syntax trees, and CFGs, control flow graphs. If you have seen all this material somewhere else, feel free to skip this part of the course.

Because this course is about program analysis, it is of course based on programming languages: you first need a language in which you can write a program so that you can then analyze it. So let's start by having a look at what these programming languages actually are. A programming language, or PL for short, can be thought of as consisting of three parts. One is the syntax, another is the semantics of the language, and then, if it's a real language that you can actually use, there also needs to be an implementation.

What do these three parts really mean? The syntax is about the form of programs written in the language: what do they look like, and what kinds of characters can you put together in order to get a program written in this language? The semantics are about the meaning of programs written in the language. The semantics of a programming language basically say what will happen if you execute a program written in that language. And finally there's the actual implementation, which is what you need to really execute these programs. There are some languages that only exist on pen and paper: they have a defined syntax and defined semantics, but are never implemented. This is mostly for research purposes, but in this lecture we focus on languages that are implemented, so the implementation is an essential part of these languages. We'll look into each of these three parts in some way or another during the lecture. I'll say a little bit about the implementation and the syntax here, and then in the next lecture we'll look in more detail at how to describe the semantics of languages.

Let's start with the third of these three ingredients, namely the implementation, because this is probably what you are most familiar with when you're actually using programming languages. There are basically three different ways a language can be implemented: compilation, interpretation, and some hybrid form of these two. Let's start by looking at how a typical compiler works, so how compilation usually works. Compilation means we have some source code of a program, and at the end we would like to execute this program on some machine, so we need to have the same program in a machine language. For example, the source code could be written in C, and the machine language could be x86, if you want to execute this on a machine that understands x86 instructions. To do this, a typical compiler consists of four steps. The first one is a lexical analyzer, or lexer for short. What this lexical analysis does is split the source code into so-called tokens, which you can think of as the individual words of the programming language.
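To make this tokenization step a bit more tangible, here is a minimal sketch of what a lexer for a tiny expression language might look like. This is not how a production lexer is usually built (real lexers are often generated from token specifications), and the function name and token shapes here are just assumptions for illustration.

```javascript
// A minimal lexer sketch for a tiny expression language with numbers
// and the operators + and -. Purely illustrative.
function tokenize(source) {
  const tokens = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    if (ch === " ") {           // skip whitespace
      i++;
    } else if (ch === "+" || ch === "-") {
      tokens.push({ kind: "operator", value: ch });
      i++;
    } else if (ch >= "0" && ch <= "9") {
      let num = "";
      while (i < source.length && source[i] >= "0" && source[i] <= "9") {
        num += source[i++];     // consume consecutive digits into one token
      }
      tokens.push({ kind: "number", value: num });
    } else {
      throw new Error(`Unexpected character: ${ch}`);
    }
  }
  return tokens;
}

console.log(tokenize("3 + 45"));
// [ { kind: 'number', value: '3' },
//   { kind: 'operator', value: '+' },
//   { kind: 'number', value: '45' } ]
```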
The next part of a compiler is a syntax analyzer, also called a parser. What this part does is take the tokens and organize them into a tree, which is then called the syntax tree. Next we have a semantic analysis, which basically checks that certain semantic properties of the program hold. For example, if you have a typed programming language like C, then it's type checking the program. After the semantic analysis, and assuming the program has successfully passed it, the final part of this tool chain is a machine code generator, which takes the program in some intermediate representation and emits the instructions in the machine language, so that at the end we have, for example, x86 instructions that can then be executed on the actual machine.

Another way to implement a programming language, which is not based on compilation, is interpretation. Without going into much detail here, let me just say that this is easier in a way, but not as efficient as a typical compiler. What happens here is that the code is parsed, so the first two boxes are more or less the same as in the compiler, but then there's no generation of machine code; instead the interpreter goes through the parsed code one statement after the other and executes each statement whenever it sees it.

And then finally there are hybrid approaches that combine this idea of interpretation and compilation. For example, this is done for languages like Java or also JavaScript, where you have a virtual machine that starts by interpreting the code, so that you can start running the code very quickly, but at some point compiles some of the code, usually code that is executed very often, so that it runs faster, because the machine code is specialized and optimized; this is typically much faster than interpretation. These hybrid implementations and virtual machines basically try to get the best of both compilation and interpretation.
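Going back to plain interpretation for a moment, here is a minimal sketch of the idea: instead of emitting machine code, we walk over the parsed statements and execute each one directly. The statement shapes and the function name are made up for illustration and don't correspond to any particular real interpreter.

```javascript
// A minimal interpreter sketch: execute parsed statements one after another.
// Each statement is assumed to already be parsed into a simple object form.
function interpret(statements) {
  const env = {}; // maps variable names to their current values
  for (const stmt of statements) {
    switch (stmt.kind) {
      case "assign":            // e.g. x = 5
        env[stmt.variable] = stmt.value;
        break;
      case "print":             // e.g. console.log(x)
        console.log(env[stmt.variable]);
        break;
      default:
        throw new Error(`Unknown statement kind: ${stmt.kind}`);
    }
  }
}

interpret([
  { kind: "assign", variable: "x", value: 5 },
  { kind: "print", variable: "x" },    // prints 5
]);
```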
All right, so after the implementation, let's have a look at another ingredient of a language, and that's its syntax. Here we look at two ways to describe the syntax, and the first of them is based on a grammar. Essentially, a grammar answers the question of which programs are syntactically correct, so which programs are actually part of the language. To do this, a grammar consists of four parts, which you have probably already seen somewhere, so this is hopefully just a summary. One of them is a set of terminals, which are the basic building blocks of the programming language. Then you have a set of non-terminals, which you can think of as helper symbols that describe how the terminals can be put together in order to get a legal program. Then you have a set of production rules, or just productions, which describe how to derive terminals from non-terminals. And finally you have an initial symbol, which is some non-terminal that describes the starting point for deriving a correct program.

To make this more concrete, let's have a look at a specific example: a very simple toy language that describes arithmetic expressions, things like 1 plus 2, and so on. To define this language, we need to define these four things, the terminals, the non-terminals, the productions, and the initial symbol. Let's start with the terminals. This is the set of symbols that your programs, or in this case your arithmetic expressions, are composed of, and it will be all the digits from 0 to 9, plus, and minus, because we want to focus on these two operators in our expressions. Then we need the set of non-terminals, and there are many possible ways to define arithmetic expressions, so this is just one possible grammar. In this grammar we'll have four non-terminals: one called exp for expressions, one called num for numbers, one called op for operators, and the last one, digit, for digits. We also need to define the start symbol: in our case it will be the non-terminal exp, because at the end we want to be able to describe an expression, and this non-terminal represents expressions. And then we have the set of productions, each of which is a rule:

  exp   → num | exp op exp
  op    → + | -
  num   → digit | digit num
  digit → 0 | 1 | ... | 9

The first rule describes what an expression looks like: an expression can be a number, or it can be an expression followed by an operator, followed by another expression. The second rule says what an operator is: an operator can be either plus or minus. We also need to define what a number is, so we say a number is either a single digit, or a single digit followed by a number. Finally, we say what a digit is, using the remaining terminals that we haven't used so far: a digit is either zero or one, and so on, up to nine. These four things, the terminals, the non-terminals, the productions, and the initial symbol, now describe what our language of arithmetic expressions looks like.

Now, to make sure you really understood this, let me have a little quiz for you, where I'm asking which of the following is really part of our language. I'm giving you four options here on the slide: the first one is this, the second one is this, the third one is this, and finally we also have this. At this point, if you do not yet know the answer right away, maybe stop the video for a second so that you can think about it, and then I'll tell you the solution. The first example is part of the language, because you can derive this arithmetic expression by starting at the start symbol and then applying the production rules. The second is not part of the language, and the reason is that the parentheses are not in our set of terminals, so there's no way to get this string of characters from the language that we have defined. Similarly, the third example is also not part of the language; it looks almost correct, except that we do not have an operator for multiplication, we have only defined the operators plus and minus. And finally, the last one is an example of this arithmetic expression language, because the first production rule says that an expression can just be a number, and a number can be a sequence of digits, so this example is a legal arithmetic expression.
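If you want to play with this, here is a minimal sketch of a recognizer for this grammar. One caveat: the production exp → num | exp op exp is left-recursive, which a naive recursive-descent recognizer cannot handle directly, so this sketch uses the equivalent iterative form exp → num (op num)*. The function names are, again, just made up for illustration.

```javascript
// A minimal recognizer for the arithmetic expression grammar.
// Input is an array of single-character terminals, e.g. [..."1+2"].
function isExpression(chars) {
  let pos = 0;
  const isDigit = (c) => c >= "0" && c <= "9";
  const isOp = (c) => c === "+" || c === "-";

  // num -> digit | digit num  (i.e. one or more digits)
  function parseNum() {
    if (pos >= chars.length || !isDigit(chars[pos])) return false;
    while (pos < chars.length && isDigit(chars[pos])) pos++;
    return true;
  }

  // exp -> num (op num)*  (equivalent to exp -> num | exp op exp)
  if (!parseNum()) return false;
  while (pos < chars.length && isOp(chars[pos])) {
    pos++;
    if (!parseNum()) return false;
  }
  return pos === chars.length; // the whole input must be consumed
}

console.log(isExpression([..."1+2"])); // true
console.log(isExpression([..."1*2"])); // false: * is not a terminal
console.log(isExpression([..."42"]));  // true: an expression can be a number
```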
Now, a grammar describes which programs are part of a language. But once a compiler, or any kind of program analysis tool, has determined that a program is actually a legal program in your language, a common way to look at that program, still on a syntactic level, is an abstract syntax tree. Since we'll see abstract syntax trees a couple more times in this lecture, let's have a look at what they actually are.

Abstract syntax trees are also defined through a grammar, but this grammar is called an abstract grammar, because it basically describes which trees are legal abstract syntax trees. To give you an example, which again happens to be for our arithmetic expressions, we could have an abstract grammar that says: well, there are expressions, and an expression can either be a number or an operator applied to two expressions, and an operator can be plus or minus. As you can see in this abstract grammar, the terminals used here correspond to the tokens of the language. To give you an example, let's say we have the arithmetic expression 3 plus 45. With our little abstract grammar, we would get an abstract syntax tree that says there's an operator, namely plus, and it has two children, namely 3 and 45. This tree can indeed be derived from the abstract grammar you see here, because the operator is the plus, and the two expressions are each a number, namely 3 and 45. We'll see more complex abstract syntax trees and also see how this works for a larger language in the next lecture.
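As a quick illustration, here is one way the abstract syntax tree for 3 plus 45 could be represented as nested objects, together with a tiny evaluator that walks the tree. The object shape is just one possible choice, not the representation of any particular tool.

```javascript
// One possible object representation of the AST for "3 + 45":
// an operator node with two number children.
const ast = {
  kind: "operator",
  op: "+",
  left: { kind: "number", value: 3 },
  right: { kind: "number", value: 45 },
};

// A tiny evaluator that walks the tree bottom-up.
function evaluate(node) {
  switch (node.kind) {
    case "number":
      return node.value;
    case "operator": {
      const l = evaluate(node.left);
      const r = evaluate(node.right);
      return node.op === "+" ? l + r : l - r;
    }
    default:
      throw new Error(`Unknown node kind: ${node.kind}`);
  }
}

console.log(evaluate(ast)); // 48
```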
So now, an abstract syntax tree mostly looks at the structure of the program, but it doesn't really reason about what happens if you execute the program. In the following, we look at two representations of programs that capture particular aspects of what happens when you execute a program, and the first one is control flow graphs. A control flow graph is basically a model of the flow of control through the program. What this means is that it models in what order the different parts, the different statements of the program, are actually executed. Obviously, this does not have to be the order in which these statements occur in the program, because there may be branches or loops or jumps, which make you go not just to what comes next, but to some other statement.

Such a control flow graph, like any graph, consists of two things: a set of nodes and a set of edges. Here the nodes are so-called basic blocks. What are basic blocks? Well, a basic block is a sequence of operations that are always executed together. You can think of this as a sequence of statements or operations that do not have any branching instruction in between, so you know that they will always be executed one after the other. The edges in our control flow graph describe possible transfers of control: they basically say, after this, that may happen next.

To make this more concrete, let's have a look at two examples. Example number one is a very simple piece of code in JavaScript; again, this could be any language, I'm just picking one that many of you may already know. Here we have an if statement with some condition called c. If this condition is true, then we assign five to the variable x, and otherwise we assign seven to the variable x. After one of these two things has happened, we write the value of x to the console. Now, if you want to create a control flow graph for this program, you start at the first instruction that is executed, which is the if where we check the condition. This will be one of our nodes. Then there are two basic blocks that could be executed next. One is the assignment of five to x, so we'll have another node here, connected through an edge with the conditional. And then we also have the other basic block, which again consists of one assignment, namely assigning seven to x. No matter which of these is executed, at the end we'll reach the console.log statement, which is another node, and it has two incoming edges, because you can reach it from either of the two branches.

In this example we have a branch, but we do not yet have a loop, and loops make programs more interesting. So let's have a look at a second example, where we now do have a loop. In this example, let's say we have a while loop, so we have this while with some condition c, and in the loop body there are two things: we increment a variable x, and then we assign the variable x to another variable y. After the loop, let's say we again have a call of console.log, where we pass the variable x. Now, in order to really see whether you've understood this idea of a control flow graph, I'll ask you to stop the video at this point and try to draw the control flow graph yourself, and only then resume the video to double check that you've really understood it.

So let me show you this control flow graph. In this case, we again have one node for the check of the condition in the while loop. There are two possible outcomes of this check: either we go into the loop, or we do not go into the loop. Let's start with the simpler one, where we skip the loop body: we then go directly to the console.log statement, so we have this as another node, connected through an edge. Or we could actually go into the loop, in which case we have the two statements of the loop body. Because they are always executed together, they form a basic block, so they will be together in the same node of our graph: we have a node that has both the x++ instruction and the assignment of x to y. This forms a single basic block, which may be executed right after the check of the condition. And what happens when you're done with this execution of the loop body? Well, the semantics of a while loop say that you then go back to the condition and check again whether you should enter the loop body one more time, so we have an edge that goes back to the first node.
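To connect this to code, here is the second example written out, followed by one simple way to represent its control flow graph as data. The node names and the adjacency-list encoding are my own choices for illustration; program analysis tools use all kinds of internal representations.

```javascript
// The second example program, for reference:
//   while (c) { x++; y = x; }
//   console.log(x);

// One simple encoding of its control flow graph: each node is a basic
// block, and the edges map each block to its possible successors.
const cfg = {
  nodes: {
    B0: ["while (c)"],       // the loop condition check
    B1: ["x++", "y = x"],    // the loop body: one basic block, two statements
    B2: ["console.log(x)"],  // the code after the loop
  },
  edges: {
    B0: ["B1", "B2"],        // condition true -> body; condition false -> exit
    B1: ["B0"],              // after the body, go back to the condition
    B2: [],                  // end of the program
  },
};

// For example, we can list all possible transfers of control:
for (const [from, targets] of Object.entries(cfg.edges)) {
  for (const to of targets) {
    console.log(`${from} -> ${to}`);
  }
}
// prints B0 -> B1, B0 -> B2, B1 -> B0
```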
So the control flow graph focuses on the order in which operations or statements happen during the execution of the program. Another graph representation looks at a different kind of flow in the program: so-called data dependency graphs. What a data dependency graph does is model the flow of data from so-called definitions to so-called uses. Again, it's a graph, and a graph always consists of a set of nodes and a set of edges. But now these nodes and edges are of course defined differently than in the control flow graph: the nodes are operations, and the edges are possible definition-use relationships, and we'll see what this means in a second. An edge in this graph that goes from some node n1 to some node n2 means the following: at node n2, some data is used that is defined at node n1. An important word here is may, because a data dependency graph typically shows you every possible definition-use relation. So if node n2 may use the data defined at node n1, but perhaps doesn't have to, because there may be a branch, then we'll still have an edge, because this may be a def-use relation.

All right, let's again have a look at two examples to make this more concrete. Example one is again a simple one, and for example two, I'll ask you to think about it yourself. In example one, we have just two statements: one that assigns five to x, and another one that assigns the result of x plus one to y. Looking at the data dependency graph for this example, we'll have two nodes. One represents the operation in the first statement, where we actually define some value, namely the value of x. And then we have another operation, which is to take x, add one to it, and write the result into y. Because the second operation is actually using the value of x defined in the first line, there is an edge between these two that says: hey, there is a def-use relationship between these two operations.

As a second example, let's have a look at a piece of code that is slightly more interesting, because here we have a couple more statements and also a branch. We start with an assignment to x, followed by another assignment, but now to y. Then we have an if where we check whether x is greater than or equal to one, and if this is the case, then x is assigned to y. Then, in any case after that, we assign to another variable z, and what we assign is the result of adding x and y. Again, I'll invite you to stop the video here and draw this data dependency graph on your own, to double check that you've understood the idea, and then let me show you what the solution looks like.

Here we'll have one node for the assignment of three to x, and another node for the other assignment, which assigns five to y. Then there are a couple of places where these variables are used. One of the places where x is used is the check whether x is greater than or equal to one, so there will be a def-use relation between these two nodes. Then we have the assignment of x to y, which is also using x, because in order to write the value of x into the variable y, you first need to read x; and because x may be defined here, we have this def-use relationship. And finally there is the statement that adds x and y and writes the result into z. This one uses x, so there is an edge from the definition of x, and it may use different values of y, so there are two edges, one from the assignment in line two and one from the assignment in line four, because depending on whether the if branch is actually executed, the last statement will either read the y defined in line two or the y defined in line four. So we have these two edges to represent this in the data dependency graph.
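To wrap up this example in code form, here is the second example program and one possible encoding of its data dependency graph, in the same spirit as the control flow graph encoding above. The labels are again just illustrative choices.

```javascript
// The second example program, for reference (line numbers as in the text):
//   1: x = 3;
//   2: y = 5;
//   3: if (x >= 1)
//   4:   y = x;
//   5: z = x + y;

// One possible encoding: each node is an operation (labeled by its line),
// and each edge goes from a definition to a possible use of that value.
const ddg = {
  nodes: ["L1: x = 3", "L2: y = 5", "L3: x >= 1", "L4: y = x", "L5: z = x + y"],
  edges: [
    ["L1", "L3"], // x defined in line 1, used in the condition
    ["L1", "L4"], // x defined in line 1, used when assigning it to y
    ["L1", "L5"], // x defined in line 1, used in x + y
    ["L2", "L5"], // y from line 2 may be used, if the branch is not taken
    ["L4", "L5"], // y from line 4 may be used, if the branch is taken
  ],
};

for (const [def, use] of ddg.edges) {
  console.log(`${def} may flow to ${use}`);
}
```

All right, after this very quick walk through some of the basics that we'll need in this program analysis course, we are already at the end. I hope you now know what grammars are, what ASTs are, what control flow graphs are, and what data flow graphs are. Maybe you already knew all this before; in that case, this was just to make sure we all know about it now. In the next part of the lecture, we'll have a deeper look into the semantics of programming languages and how they can be defined. Thank you very much for listening and see you next time.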