Welcome to part 1 of the lecture series on machine-independent optimization. In these lectures we are going to discuss the various optimizations on intermediate code. We will begin with an introduction to what code optimization is, with some illustrations of the different optimizations that are carried out by most compilers. Then we will consider a technique called data flow analysis, which is necessary to perform code optimization. The fundamentals of control flow analysis are also essential for everybody to know, because they help us define what exactly a loop structure is in a control flow graph. We will then apply these principles to two of the important machine-independent optimizations, to understand how they are carried out and what the algorithms are. Finally, we will discuss in detail the static single assignment (SSA) form and the various optimizations on it. When we consider machine-independent code optimization, first of all we should understand why exactly this optimization becomes necessary. The most important reason is the inefficiency introduced by the intermediate code generation process. As you would have learnt in the lectures on intermediate code generation, every time we want to make an assignment we invariably end up generating a copy of the variable involved. The compiler takes the easy way out: it simply generates a new copy of a variable whenever necessary, knowing that the optimization phase is going to get rid of it. So there are extra copies of variables, and we also store constants in variables and then use the variables instead of using the constants over and over again, because that is easier for us.
Then there are many expressions which are evaluated again and again, sometimes because the programmer has not noticed the repetition, which is not the major reason, but mostly because the compiler has introduced extra intermediate evaluations, as will become clear very soon. Code optimization, as I said, removes such inefficiencies and improves the code. Whenever there are extra copies it gets rid of them, whenever there is repeated evaluation of an expression it gets rid of the repeated evaluations, and whenever it is possible to use a constant instead of a variable it does so. The improvement can be in the time, space or power domain. Whatever I have mentioned so far, removing extra copies and so on, basically improves time and space, but a new dimension is added to the problem if we want to save power, which is very important in embedded systems. Reducing the power consumption of code is not trivial; we need models of the power consumption of the device and so on, so it is not a topic for discussion in this lecture. Code optimization algorithms often change the structure of programs, sometimes beyond recognition. For example, they may inline functions: the function call is replaced by the body of the function, with appropriate replacements for the parameters, the temporary variables and so on. If this happens, the original call to the function gets deleted and is replaced by the body of the function itself. It is also possible to apply what is known as loop unrolling. When a loop is unrolled, say twice, thrice or four times, the number of iterations of the unrolled loop will be smaller than that of the original.
So, again, the loop will not iterate the same number of times as the original. Then there is induction variable elimination, which actually removes some of the programmer-defined variables. For example, if there is a loop controlled by a variable i, the induction variable elimination process may remove i and use a different variable already present in the program to control the loop. Such transformations make it very difficult to use a debugger along with optimized programs. If we want to insert a breakpoint at a function call, the call is no longer there, and if we want to look at the value of a variable which has already been eliminated, that does not work at all. Therefore compilers usually turn off most of the optimizations if the user requests debugging support, so the program that we debug is usually an unoptimized program. Code optimization really consists of a bunch of heuristics, and the percentage of improvement depends on the program; sometimes it may be zero as well. For example, if there are just a few assignment statements and no optimization can change any of them, then the improvement would be zero. That does not mean that the optimization phase is generally ineffective; it is just that no improvement can be made for that program. Here are some of the common machine-independent optimizations that are used in compilers; I am going to give you an example of each. There is what is known as global common subexpression elimination, which removes repeated evaluation of expressions. Then there is copy propagation: if we have many copies of the same variable, we can retain just one of them and eliminate the rest.
Constant propagation and constant folding promote the use of a constant instead of a variable and also simplify expressions involving only constants. Loop invariant code motion deals with code which is inside a loop but whose value does not change across the iterations of the loop. Such code is called loop invariant code, and sometimes it can be removed from the loop and placed outside it. Then there are induction variable elimination and strength reduction. Induction variable elimination typically removes one or two of the variables involved in the iteration and controls the loop using the remaining variables in the program. Strength reduction replaces expensive operations such as multiplication and division by cheaper ones such as addition or shift. Partial redundancy elimination is a bit difficult to explain without an example, so I will defer the explanation until we discuss the example. Loop unrolling I have already mentioned: we replicate the loop body several times. Function inlining has also been mentioned: we replace the function call by its body. Tail recursion removal means that a recursive call at the end of a function can possibly be replaced by a loop. Vectorization and concurrentization are transformations which are useful to make the program work on vector computers or multiprocessors. Loop interchange and loop blocking help in the process of vectorization and concurrentization. We are going to use this bubble sort program, which is quite simple, as a running example. It is the standard bubble sort program, and it sorts the array a with 100 elements.
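The slide with the program is not reproduced in this transcript. As a rough sketch only, a C version consistent with the description could look like the following: an array indexed from 0, an outer loop that starts at i = 100, an inner j loop, a comparison of a[j] with a[j + 1], and a three-statement swap through temp. The function name bubble_sort, the parameter n and the exact loop bounds are assumptions chosen to match the expressions i - 1, j + 1 and 4 * j that the lecture refers to.

```c
#include <assert.h>

/* Bubble sort with no early exit: every pass is executed even if the
   array is already sorted. The loop bounds are an assumption; they are
   chosen so that i - 1 and j + 1 appear as in the lecture. */
void bubble_sort(int a[], int n)
{
    int i, j, temp;

    assert(n >= 0);
    for (i = n; i > 1; i--) {
        for (j = 0; j < i - 1; j++) {
            if (a[j] > a[j + 1]) {
                temp = a[j];        /* the three-statement swap block */
                a[j] = a[j + 1];
                a[j + 1] = temp;
            }
        }
    }
}
```

For the lecture's case the call would be bubble_sort(a, 100) on an array declared as int a[100].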
The array a, as in C, runs from 0 to 99, and we assume that there is no special jump out of the loop if the array is already sorted, so we definitely go through all the iterations even if that is not strictly necessary. It is a standard program: i starts at 100, then there is a j loop, then there is a comparison, a swap and so on. Let us look at the intermediate code for this program. Here is the condition for the i loop, and here is the condition for the j loop. If the i loop has to terminate it comes out here; if the j loop has to terminate it comes out here and goes back to update i. If the i loop does not terminate, control goes into the j loop. Here is the comparison: you can see that we fetch one element of a and another element of a, and then compare them. If we need to swap, we come to this block, which is the swap operation. Even though there were only three statements in the swap block, the code generated is quite long, so let me explain why this happens. When we want to do a swap, we first do temp = a[j], then a[j] = a[j+1], and then a[j+1] = temp. Here is the sequence of three operations required for the first assignment statement, temp = a[j]. Assuming that each integer requires 4 bytes, the offset into the array, which is in terms of bytes, goes up in steps of 4. So we multiply the index j by 4, fetch the element from the array, and assign it to temp. That is the sequence of three operations for just temp = a[j]. Then we have the next one, a[j] = a[j+1]. Here again we compute 4 * j.
So you can now observe the repetition of the computation 4 * j, here and here as well. Then we take the address of the j-th element; t14 is j + 1, t15 is 4 * t14, and t16 fetches that element of a. So *t13 = t16 is the assignment a[j] = a[j+1]. The last one is the a[j+1] = temp operation here. If you look at this block and this block together, we have 4 * j here, and then 4 * j here and here as well. We have j + 1 computed here, and then again here, here and here. Similarly, we have i - 1 here and i - 1 here. So there are many places where the same computation is repeated, and in such a case we will be able to perform many of the optimizations. Let me explain the first optimization, global common subexpression elimination (GCSE). Consider a situation where there is a computation of some expression y + z in this block, and every path leading to that block also has a computation of y + z. Of course, y and z should not change along these paths. Then this y + z, this y + z and this y + z will have identical values, and there is no need to compute y + z all over again. We might as well store the result of the y + z computation in the same temporary u along all paths, and then just say x = u. In this fashion, no matter which path we take, the expression y + z is computed only once, not twice as in the original. This is the other part of the example, which shows the need for repeated application of GCSE. Here we have x + y and x + y.
We get rid of the repeated computation by introducing the temporary u and assigning c = u here. But now we can perform what is known as copy propagation. Observe that this statement is a = u, so a is really a copy of u. We can replace this a by u directly and get rid of this copy operation. The same is true here: we can replace the c in d = c * z by u and get rid of this statement. If we do that, we have u = x + y and b = u * z, and here we have d = u * z. So because of copy propagation we have discovered another common subexpression, with two instances, one here and one here, and we can apply GCSE again to get rid of the repeated computation. So we have v = u * z and d = v. The moral of this example is that you require many applications of common subexpression elimination and copy propagation in order to eliminate most of the common subexpressions, which would otherwise be hidden. In general, optimizers apply the optimizations many times in an iterative mode, until not much improvement is possible or until a fixed number of rounds have taken place. Let us see how this works on our running example. I have marked in color the various common subexpressions: i - 1, then 4 * j, and then j + 1 in many places. If we eliminate these, we get this program. We have t2 = i - 1, so here, instead of t21 = i - 1, we have t21 = t2. We have t4 = 4 * j, so wherever there was 4 * j we have replaced it by t4, and wherever we had j + 1 we have replaced it by t6; here is t6, and here is t6, and so on. Now, this has given rise to many copies. Here is t21 = t2 and i = t21; obviously, we can make this i = t2, and so on.
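The small x + y example above can be written out as straight-line C, one function per stage, to make the alternation of GCSE and copy propagation concrete. This is only a sketch: the slide shows the statements spread over a control flow graph, whereas here they are flattened into a single block, and the function names are made up for illustration.

```c
/* Original code: x + y is computed twice. */
void cse_original(int x, int y, int z, int *b, int *d)
{
    int a, c;

    a = x + y;
    *b = a * z;
    c = x + y;
    *d = c * z;
}

/* After one round of GCSE: x + y is computed once into u,
   but a and c are now plain copies of u. */
void cse_round1(int x, int y, int z, int *b, int *d)
{
    int u, a, c;

    u = x + y;
    a = u;
    *b = a * z;
    c = u;
    *d = c * z;
}

/* After copy propagation: uses of a and c are replaced by u and the
   copies are deleted. The repetition of u * z is now visible. */
void cse_copy_prop(int x, int y, int z, int *b, int *d)
{
    int u;

    u = x + y;
    *b = u * z;
    *d = u * z;
}

/* After a second round of GCSE: u * z is computed once into v. */
void cse_round2(int x, int y, int z, int *b, int *d)
{
    int u, v;

    u = x + y;
    v = u * z;
    *b = v;
    *d = v;
}
```

All four versions compute the same b and d; only the number of additions and multiplications changes from stage to stage.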
So we do that: here is i = t2. In the previous version, for example, t11 = a[t10] could have been made t11 = a[t4]. Similarly, this t12 = t4 could have been eliminated, and we could have made this t13 = a + t4. If we do such copy propagation, we get this code: we have a[t4], then a + t4, and similarly j = t6. So copy propagation, when applied, removes many of these copies, but then there is an opportunity for further optimization. For example, the expression 4 * t6 now becomes a common subexpression: here is 4 * t6, and again 4 * t6. So we can eliminate that as well and then perform another round of copy propagation. Here, where we had t15 = t7, we could replace the uses of t15 by t7, and similarly t18 could be replaced by t7. That is what is done here. After this round of GCSE and copy propagation we get this short piece of code, in which many instructions have been eliminated. The point is that even in a simple program such as bubble sort there is an opportunity to perform GCSE and copy propagation several times. If this is so in a simple program, there are certainly many chances to perform these optimizations in larger programs. Let us now consider an example to understand how constant propagation and folding take place. Here is a program: we have a = 10, b = 20, and then a test, if b == 20 goto B3, with a yes edge and a no edge. If the answer is yes, we assign a = 30, then d = a + 5, and stop. It is very clear that since b is the constant 20, the evaluation of 20 == 20 can be carried out by the compiler itself.
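A minimal C sketch of this example, before and after constant propagation and folding, might look as follows. The slide's control flow graph is not reproduced in the transcript, so what happens on the "no" edge is an assumption: here it simply skips the assignment a = 30. The function names are hypothetical.

```c
/* Before: the test on b and the expression a + 5 are evaluated at run time. */
int const_example_before(void)
{
    int a = 10;
    int b = 20;
    int d;

    if (b == 20)    /* b is known to be 20, so this is 20 == 20 */
        a = 30;     /* block reached on the "yes" edge */
    d = a + 5;      /* a can only be 30 once the "no" edge is removed */
    return d;
}

/* After: the test has been folded to true, the "no" edge is gone,
   and 30 + 5 has been folded to 35. */
int const_example_after(void)
{
    int a = 30;
    int d = 35;

    (void)a;        /* a is kept only to mirror the remaining statement */
    return d;
}
```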
Propagating the constant value of b to this particular use is constant propagation, and evaluating this equality at compile time amounts to constant folding. The condition evaluates to true, and therefore the edge for the no part can be removed from the control flow graph, so we will have only one edge here. So this becomes a = 30, and here we have only these two instructions, because the test has already been evaluated. Here it is very clear that the value of a can only be 30, because this edge does not exist anymore. So we can evaluate the constant expression 30 + 5 as 35, which is another example of constant folding. The program has become quite simple because of constant propagation and folding. Let us move on to the next example, loop invariant code motion. Here is a very simple loop. Look at the two statements in red: one says t3 = addr(a), and the other says t4 = t3 - 4. Consider just the first statement, t3 = addr(a). The address of a is a constant; it is nothing but the offset of the array a inside the activation record. So this is a statement whose value does not change during the iterations of this loop, and it is obvious that it can be moved outside the loop, like this. The next statement, t4 = t3 - 4, depends only on this statement, which is loop invariant. Therefore that statement in turn also becomes loop invariant: its value does not change during the iterations of the loop, and it too can be moved outside the loop. But remember, you must move these statements in the same order as they appear in the loop; they cannot be swapped, otherwise the program may be incorrect. This is what is known as loop invariant code motion. In this particular example there is only one basic block, one thread of control.
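The loop itself is only visible on the slide, so the following C sketch is a reconstruction under stated assumptions. The activation record is modelled as a byte array, addr(a) is the hypothetical constant A_OFFSET, the array a is indexed from 1 with 4-byte elements, and the loop body is assumed to be a[i] = 0. The temporaries t3, t4, t5 and t6 follow the names used in the lecture.

```c
#include <stdint.h>
#include <string.h>

enum { A_OFFSET = 8, FRAME_BYTES = A_OFFSET + 400 };

/* Stores a 4-byte zero at byte offset off of the modelled activation record. */
static void store_zero(unsigned char *frame, int off)
{
    int32_t zero = 0;
    memcpy(frame + off, &zero, sizeof zero);
}

/* Before: t3 and t4 are recomputed in every iteration. */
void clear_before(unsigned char frame[FRAME_BYTES])
{
    int i, t3, t4, t5, t6;

    i = 1;
    while (i <= 100) {
        t3 = A_OFFSET;          /* t3 = addr(a): same value every time */
        t4 = t3 - 4;            /* t4 = t3 - 4: depends only on t3 */
        t5 = 4 * i;
        t6 = t4 + t5;           /* byte offset of a[i] */
        store_zero(frame, t6);  /* assumed body: a[i] = 0 */
        i = i + 1;
    }
}

/* After: both invariant statements are hoisted, in their original order. */
void clear_after(unsigned char frame[FRAME_BYTES])
{
    int i, t3, t4, t5, t6;

    t3 = A_OFFSET;
    t4 = t3 - 4;
    i = 1;
    while (i <= 100) {
        t5 = 4 * i;
        t6 = t4 + t5;
        store_zero(frame, t6);
        i = i + 1;
    }
}
```

Hoisting t4 = t3 - 4 above t3 = addr(a) would read t3 before it is set, which is why the order has to be preserved.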
So moving the code outside was a very simple operation, but as you will see in the later parts of the lecture, there are many conditions that need to be satisfied in order to move loop invariant code outside the loop. The next example is strength reduction. Here is a multiplication, 4 * i. Suppose the processor is a very simple one, say in the embedded systems domain, and it does not even support multiplication of integers, let alone floating point. In such a case the software usually implements multiplication, if it is essential, by a subroutine which is very expensive to call. So we may want to replace this 4 * i by a repeated addition. With t5 = 4 * i, as i is incremented t5 takes the values 4, 8, 12 and so on. So we might as well add 4 to the previous value to get the new value of t5. That is precisely what we intend to do here, but we replace t5 by a new variable called t7. So we have t6 = t4 + t7, and we have t7 = t7 + 4 placed immediately after i = i + 1, so that we do not forget to compute the value of t7 required for the next iteration. Now t7 increments in fours, and it is supplied to t6 exactly the way t5 was supplied to this assignment. So the semantics of the program does not change. There are actually two steps in this replacement: we would first have set t5 = t7, and then applied copy propagation to remove t5 and use t7 in its place. We move on to the next optimization, called induction variable elimination. Usually the induction variables are the variables which are used to control a loop. Here, for example, we check i > 100 and, if that is not true, increment i. This is the loop control, and i is used for that purpose.
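The strength reduction step described above can be sketched in C as follows. As before, this is a reconstruction rather than the slide: the activation record is modelled as a byte array, A_OFFSET is a hypothetical stand-in for addr(a), and the loop body a[i] = 0 is an assumption. The temporaries t4, t5, t6 and t7 follow the lecture.

```c
#include <stdint.h>
#include <string.h>

enum { A_OFFSET = 8, FRAME_BYTES = A_OFFSET + 400 };

/* Stores a 4-byte zero at byte offset off of the modelled activation record. */
static void store_zero(unsigned char *frame, int off)
{
    int32_t zero = 0;
    memcpy(frame + off, &zero, sizeof zero);
}

/* Before: one multiplication per iteration. */
void clear_with_multiply(unsigned char frame[FRAME_BYTES])
{
    int i, t4, t5, t6;

    t4 = A_OFFSET - 4;
    i = 1;
    while (i <= 100) {
        t5 = 4 * i;             /* takes the values 4, 8, 12, ... */
        t6 = t4 + t5;
        store_zero(frame, t6);  /* assumed body: a[i] = 0 */
        i = i + 1;
    }
}

/* After: t7 carries the value of 4 * i and is updated by an addition. */
void clear_with_add(unsigned char frame[FRAME_BYTES])
{
    int i, t4, t6, t7;

    t4 = A_OFFSET - 4;
    i = 1;
    t7 = 4;                     /* value of 4 * i for i == 1 */
    while (i <= 100) {
        t6 = t4 + t7;
        store_zero(frame, t6);
        i = i + 1;
        t7 = t7 + 4;            /* placed right after i = i + 1 */
    }
}
```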
If you look at the program a little more closely, you find that there is another variable, t7, which is being incremented in tandem with i. This is the variable that was introduced in the previous example by the strength reduction process. As we increment i, t7 also increases monotonically, in steps of 4. So if we want to get rid of a variable, it is possible to get rid of i and use t7 in its place, with an appropriate change in the operands of the expression. We have used i here, and we have computed i here. We replace this test by t7 > 400: as i goes 1, 2, 3, 4 and so on, because its increment is 1, t7 starts with the value 4 and goes up by 4, so 4, 8, 12 and so on. Wherever we compared i > 100, we now need to compare t7 with 400. Once we replace the variable i with t7 and change the operands appropriately, there is no need to retain i and its associated statements. So we remove them, and the program becomes smaller. This is what is known as induction variable elimination. Of course, if you observe carefully, it is also possible to remove t7 itself and replace it with appropriate expressions in i, but that would defeat the purpose of the strength reduction that we performed; we would be undoing strength reduction if we replaced t7 with a use of i, so that is not intended. Now we move on to partial redundancy elimination. Global common subexpression elimination, GCSE as it is called, can be termed a total redundancy elimination transformation. If you recall the example, in order to remove this a + b we must have a computation of a + b along every path that reaches this basic block.
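Going back to induction variable elimination for a moment, the replacement of the test i > 100 by t7 > 400 can be sketched in C as below. The same assumptions as in the lecture's strength-reduced loop apply, plus some of my own: a byte-array model of the activation record, a hypothetical A_OFFSET for addr(a), and an assumed loop body a[i] = 0 that does not itself use i (otherwise i could not be removed).

```c
#include <stdint.h>
#include <string.h>

enum { A_OFFSET = 8, FRAME_BYTES = A_OFFSET + 400 };

/* Stores a 4-byte zero at byte offset off of the modelled activation record. */
static void store_zero(unsigned char *frame, int off)
{
    int32_t zero = 0;
    memcpy(frame + off, &zero, sizeof zero);
}

/* Strength-reduced loop, still controlled by i. */
void clear_controlled_by_i(unsigned char frame[FRAME_BYTES])
{
    int i, t4, t6, t7;

    t4 = A_OFFSET - 4;
    i = 1;
    t7 = 4;
    for (;;) {
        if (i > 100)            /* loop control uses i */
            break;
        t6 = t4 + t7;
        store_zero(frame, t6);
        i = i + 1;
        t7 = t7 + 4;
    }
}

/* After induction variable elimination: i and its statements are gone. */
void clear_controlled_by_t7(unsigned char frame[FRAME_BYTES])
{
    int t4, t6, t7;

    t4 = A_OFFSET - 4;
    t7 = 4;
    for (;;) {
        if (t7 > 400)           /* was: i > 100, since t7 == 4 * i */
            break;
        t6 = t4 + t7;
        store_zero(frame, t6);
        t7 = t7 + 4;
    }
}
```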
Here is a path with a computation of a + b that reaches this block, but unfortunately along this other path there is no computation of a + b that reaches basic block number four. So we cannot apply global common subexpression elimination here, because the expression a + b is not available along this path; it is available only along that one. In this example a + b is said to be partially redundant, not totally redundant: it is available along one path but not along the other. In such a case it is sometimes cheaper to insert a computation of a + b on this edge. So we have split this edge, introduced a new block, and put a computation h = a + b there. And this x = a + b we have replaced with h = a + b and x = h, so the semantics of the program remains the same. Now consider the expression a + b again. It is available at the entry of this block along this path, and it is also available at the entry of this block along this path. So we can perform ordinary common subexpression elimination and replace this y = a + b by y = h. This is the essence of partial redundancy elimination. There are many difficulties here. First of all, we need to make sure that a + b is partially redundant, and there are a couple of conditions to be checked for that. Then we need to find the arc which is the best one for introducing the extra computation that we have shown here. If we introduce it here instead, it is worse: we would have computed the expression twice, and although this is the computation we are going to remove, we have not gained anything. So we introduce it here, which is the best place, but as you can easily imagine, the program may actually grow in this direction and there may be many paths.
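To make the transformation concrete, here is a small C sketch of the a + b example, under the assumption that the control flow graph is a simple two-way branch followed by a join block. The counter evaluations and the helper add are hypothetical instrumentation, added only so that the number of times a + b is evaluated on each path can be observed.

```c
int evaluations;    /* how many times a + b has been evaluated */

/* Hypothetical instrumented stand-in for the expression a + b. */
int add(int a, int b)
{
    evaluations++;
    return a + b;
}

/* Before: on the left path a + b is evaluated twice, on the right path once. */
int pre_before(int take_left, int a, int b)
{
    int x = 0, y;

    if (take_left)
        x = add(a, b);      /* a + b is available only along this path */
    y = add(a, b);          /* partially redundant computation */
    return x + y;
}

/* After: a + b is evaluated exactly once on every path. */
int pre_after(int take_left, int a, int b)
{
    int x = 0, y, h;

    if (take_left) {
        h = add(a, b);      /* x = a + b became h = a + b; x = h */
        x = h;
    } else {
        h = add(a, b);      /* computation inserted on the other edge */
    }
    y = h;                  /* y = a + b became y = h */
    return x + y;
}
```

On the right-hand path nothing is gained, but nothing is lost either; the saving comes from the path on which a + b was already computed.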
Which path of the program should be taken, and which arc should be cut in order to introduce the computation of a + b, is a difficult question. So there are a couple of further conditions that we need to check in order to make sure that we introduce a + b in the best possible place. Next is unrolling a for loop. Here is the loop: for (i = 0; i < n; i++), with a body consisting of the statements S1 and S2. The notation S1(i) indicates the instance of S1 for the value i, that is, the instance relevant to iteration i, and S2(i) is the instance of S2 relevant to iteration i. The reason we write it like this is that if the code uses the value of i, then when we perform loop unrolling we have to replace it with the appropriate value, i + 1 or i + 2 and so on. So when we write S1(i + 1), we mean an instance of S1 with i replaced by i + 1, and in S1(i + 2) the i has been replaced by i + 2. Now we can unroll this loop. S1(i) and S2(i) correspond to loop iteration i; S1(i + 1) and S2(i + 1) correspond to iteration i + 1; and S1(i + 2) and S2(i + 2) correspond to iteration i + 2. So there are three instances of the body of the for loop here, and it is obvious that this loop must execute only one third the number of times that the original loop executes. Once we perform the unrolling, we make sure that the check i < n is changed to i + 3 < n, and that the increment of 1 is changed to 3, with three instances of the loop body inside. But it is also possible that the number of iterations is not exactly divisible by 3.
In such a case a few iterations remain. For those there is a small sequential loop, which executes once, twice or at most three times. For example, if the loop is supposed to run only three times, so that the condition is i < 3, then you cannot even enter the unrolled part: the test 0 + 3 < 3 fails, so the unrolled loop is not executed at all, and we have to execute all the iterations in the sequential loop here. Something similar happens if the number is 4 or 5; in that case we are left with 1 or 2 iterations which still need to be executed. The condition of this loop is k < n with k++, and k starts with the value of i with which we ended the unrolled loop. So why should we do loop unrolling? There are many reasons. The first is that instruction scheduling needs large basic blocks, so that the instruction scheduler can use the parallelism within a basic block to its advantage. If we have a very small basic block with 5 or 10 instructions, instruction schedulers do not work very well; they work well if there are at least 50 or 100 instructions. So unrolling a loop 10 or 20 times yields large basic blocks, and instruction scheduling becomes much more effective. The second is that by decreasing the number of iterations we reduce the overhead of the jumps. As you know, every jump instruction creates a problem for the pipeline, so we must reduce the number of jump instructions as far as possible. If we reduce the number of iterations of the loop, the number of jump instructions executed automatically comes down. That is another reason why we may want to unroll a loop.
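The unrolling scheme just described, with the i + 3 < n test and the sequential loop over k for the leftover iterations, can be sketched in C as follows. The loop body is hypothetical; it is chosen so that it uses i, which means that each unrolled copy has to substitute i, i + 1 or i + 2.

```c
#include <assert.h>

/* Original loop: the (hypothetical) body uses the loop index i. */
long weighted_sum(const int a[], int n)
{
    long s = 0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        s += (long)a[i] * (i + 1);
    return s;
}

/* Unrolled so that each trip executes three copies of the body. */
long weighted_sum_unrolled(const int a[], int n)
{
    long s = 0;
    int i, k;

    assert(n >= 0);
    for (i = 0; i + 3 < n; i += 3) {        /* test and step changed */
        s += (long)a[i] * (i + 1);          /* instance for i */
        s += (long)a[i + 1] * (i + 2);      /* instance for i + 1 */
        s += (long)a[i + 2] * (i + 3);      /* instance for i + 2 */
    }
    for (k = i; k < n; k++)                 /* one to three leftover iterations */
        s += (long)a[k] * (k + 1);
    return s;
}
```

With the test i + 3 < n, a call with n equal to 3 skips the unrolled loop entirely and runs all three iterations in the second loop, as described in the lecture.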
Here are two examples: unrolling a while loop and unrolling a repeat-until loop. The loop while C do { S1; S2 } can be unrolled as follows: while C, execute S1; S2, and once we have executed the body we need to check whether the condition still holds. So we write: if not C, break; then again S1; S2; if not C, break; and again S1; S2. This is the unrolling pattern for the while loop. The repeat-until is very similar. We write repeat S1; S2; if C then break, because we need to iterate until C is true; then S1; S2; if C then break; S1; S2; until C. We have unrolled twice here, so there are three instances of the loop body, and the number of times the loop iterates will be approximately one third of the original. The next optimization is function inlining. Take a simple function definition, int find_greater, which looks in the array a for an element greater than a given number n. Here is the parameter a, of size 10, and then the number n. Here is a loop which runs from 0 to 9; if the array contains an element which is greater than n, the function returns the index of that element, otherwise it moves on to the next element. So it does not find the greatest element of the array, but an element greater than n: it iterates until it finds one, otherwise the loop terminates and control comes out. Suppose there is a call x = find_greater(y, 250). To inline this function we need to introduce new variables for the local variables of the function. Let us call them new_i and new_a[10]; the latter stands for the parameter.
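As a sketch of what the slide probably shows, here is find_greater together with a caller that uses the call and a caller in which the call has been inlined by hand. Several details are assumptions: the value -1 returned when no element is greater than n (the transcript does not say what is returned), the caller function names, and the label name inline_exit standing for the lecture's exit. The copy into new_a follows the lecture's call-by-value treatment of the array parameter.

```c
#include <string.h>

/* Returns the index of the first element of a that is greater than n.
   The -1 result for "not found" is an assumption. */
int find_greater(const int a[10], int n)
{
    int i;

    for (i = 0; i < 10; i++) {
        if (a[i] > n)
            return i;
    }
    return -1;
}

/* Caller with an ordinary call: x = find_greater(y, 250). */
int caller_with_call(const int y[10])
{
    int x;

    x = find_greater(y, 250);
    return x;
}

/* The same caller after inlining the call by hand. */
int caller_inlined(const int y[10])
{
    int x;
    int new_i;          /* renamed local variable i */
    int new_a[10];      /* copy of the parameter a */

    memcpy(new_a, y, sizeof new_a);
    for (new_i = 0; new_i < 10; new_i++) {
        if (new_a[new_i] > 250) {
            x = new_i;          /* replaces "return i" */
            goto inline_exit;
        }
    }
    x = -1;                     /* replaces the assumed "return -1" */
inline_exit:
    return x;
}
```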
The variable new_a is assigned the value of y, which is the actual parameter; because this is a call by value, we have to make a copy. In the loop, instead of i we use new_i, with the same conditions: new_i = 0; new_i < 10; new_i++. We compare new_a[new_i], instead of a[i], because now this is a copy: if new_a[new_i] > 250 then x = new_i and goto exit. So the return is replaced by x = new_i, where x is the variable on the left-hand side of the call, which receives the return value; we could have used a break as well, and so we exit the loop. This is the inlining of functions. What do we gain by it? When we inline a function, the most important thing is that no subroutine call instruction is necessary. Subroutine calls are very expensive, because they imply creating an activation record, pushing parameters into that activation record, getting the result from it and finally destroying it, and whenever we want to access a variable in the activation record there is some cost attached to that as well. If we inline the function, the creation and destruction of the activation record and the pushing of parameters are not there at all. So it is much cheaper, and the code runs much faster than with the non-inlined call. Inlining thus makes the program more efficient by eliminating a number of subroutine calls. The next optimization is tail recursion removal. Let us understand what exactly tail recursion is. Here is a simple function called sum, which takes an integer array a as its first parameter, a number n as the second parameter, and a pointer to an integer, x, as the third parameter. If n == 0, it adds the 0-th element of a to *x, that is, *x = *x + a[0].
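A C sketch of this function, in its tail-recursive form and in the loop form that tail recursion removal produces, could look like the following. The name sum_loop for the transformed version is made up so that both versions can coexist in one file; in the lecture the function keeps its name and declaration.

```c
#include <assert.h>

/* Tail-recursive version: adds a[n], a[n - 1], ..., a[0] to *x. */
void sum(const int a[], int n, int *x)
{
    assert(n >= 0);
    if (n == 0) {
        *x = *x + a[0];
    } else {
        *x = *x + a[n];
        sum(a, n - 1, x);   /* tail call: nothing is done after it returns */
    }
}

/* After tail recursion removal: the call becomes an update of n
   followed by another trip around the loop. */
void sum_loop(const int a[], int n, int *x)
{
    assert(n >= 0);
    while (1) {
        if (n == 0) {
            *x = *x + a[0];
            break;          /* replaces the return from the function */
        } else {
            *x = *x + a[n];
            n = n - 1;      /* replaces the recursive call sum(a, n - 1, x) */
        }
    }
}
```

Both versions expect *x to be initialized by the caller, since the result is accumulated into it.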
So the sum is accumulated in *x. Otherwise it adds the n-th element and calls sum recursively with n reduced by 1, while x remains the same. This recursive call is the last statement in the function; in other words, it is at the tail of the function, which is why it is called a tail-recursive call. In such cases, with appropriate checks, it is possible to remove the tail recursion and replace it with a while loop. The function declaration remains the same, and instead of the recursion we have a while (true) loop, which by itself would run forever, but there is a break inside which makes sure that we get out of the loop. If n == 0, then *x = *x + a[0] remains as it is, and instead of returning from the function we have a break, which leaves the loop and then terminates the function. Otherwise, if n is not 0, where we had the recursive call we do the same addition, reduce the value of n by 1, and continue with the loop. So the number of iterations of the loop is determined by the value of n, and once n reaches 0 it breaks out. We have successfully replaced the tail recursion by a while loop, and the number of times the while loop iterates is the same as the number of times the recursion happens. But the while loop is very efficient compared to the recursive call. Recursion, as I said, is much more expensive, because we need to set up an activation record, push parameters, extract the result and finally destroy the activation record; all of that gets eliminated in this process. We move on to vectorization and concurrentization. Take a simple loop, x[i] = x[i] + y[i], where i runs from 1 to 100. Because we only read the old values of x and y, add them, and put the result back into the array, we can actually execute all these statements with the help of a vector processor.
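Plain C has no vector statement, so the following sketch only emulates the semantics of one: all operands are read before any result is written. It is meant to show why this particular loop can be vectorized, not how a compiler would emit vector instructions. The function names are hypothetical, and index 0 of the arrays is left unused so that the indices run from 1 to 100 as in the lecture.

```c
#include <string.h>

enum { N = 100 };

/* Sequential loop: x[i] = x[i] + y[i] for i = 1 .. 100. */
void add_sequential(int x[N + 1], const int y[N + 1])
{
    int i;

    for (i = 1; i <= N; i++)
        x[i] = x[i] + y[i];
}

/* Emulation of a vector add: the old values of x are captured first,
   then every element is computed from those old values. */
void add_vector_style(int x[N + 1], const int y[N + 1])
{
    int old_x[N + 1];
    int i;

    memcpy(old_x, x, sizeof old_x);
    for (i = 1; i <= N; i++)
        x[i] = old_x[i] + y[i];
}
```

Because iteration i reads and writes only element i, the two functions leave the same values in x, which is the property that makes the vector form legal here.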
So, here is a vector 1 to 100 whose value is extracted first, and another vector 1 to 100 whose value is extracted next; the corresponding elements are added and assigned to the vector x again. This is very easy to see because there are no dependences: there are no usages of a value which is computed within the loop. I will give you an example where such vectorization is not possible. Similarly, if the processor were a multi core processor, we could actually start a thread for each one of the iterations of this loop, and each thread would do this summation. That is indicated by for all i equal to 1 to 100, x i equal to x i plus y i. So, we start 100 threads, each thread doing x i equal to x i plus y i for a particular value of i. That is the way it would operate on a multi core processor, and these are called vectorization and concurrentization.
Now, if you look at this next example, the statement says x i plus 1 equal to x i plus y i. Look at the expanded version: x 2 equal to x 1 plus y 1, x 3 equal to x 2 plus y 2, x 4 equal to x 3 plus y 3 and so on. So, x 2 is computed here and used in the next iteration, x 3 is computed here and used in the next iteration, and so on and so forth. Because of this dependence of the value on the previous iteration, the code cannot be either vectorized or concurrentized. So, emitting such vector code, even though syntactically it looks correct, would be wrong: it would not use the newly computed values of x, it would just use the old values of x. So, this is incorrect.
In certain cases it is possible that the outer loop cannot be converted to a parallel loop and the inner loop can be converted to a parallel loop. The problem is that the inner loop iterates only a certain number of times, and if you parallelize the inner loop, then the work involved, just one statement, may be too small for each thread.
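Returning for a moment to the loop x i plus 1 equal to x i plus y i, the following C sketch illustrates why vector code for it would be wrong. It is only an illustration under stated assumptions: the array length N, the 0-based indexing and the function names are hypothetical, and the "vector" version merely simulates a vector statement by reading all the old values of x before writing any new ones.

```c
#define N 8   /* hypothetical small size; the lecture's loop runs from 1 to 100 */

/* Sequential semantics: x[i + 1] uses the x[i] computed in the previous iteration. */
void recurrence_sequential(int x[N + 1], const int y[N]) {
    for (int i = 0; i < N; i++)
        x[i + 1] = x[i] + y[i];
}

/* Simulated vector semantics: all old values of x are read first, then all
   results are written, so no iteration sees a value computed by another. */
void recurrence_as_vector(int x[N + 1], const int y[N]) {
    int old_x[N];
    for (int i = 0; i < N; i++)
        old_x[i] = x[i];
    for (int i = 0; i < N; i++)
        x[i + 1] = old_x[i] + y[i];
}
```

Starting from the same x and y, the two functions generally produce different arrays, which is exactly the loop-carried dependence at work. For the first loop, x i equal to x i plus y i, the same experiment would give identical results, because no iteration reads a value written by another iteration.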
So, in such cases sometimes we are allowed to do what is known as a loop interchange. The j loop goes outside and the i loop comes inside. If you are allowed to do this, then the j loop can be operated in parallel and the i loop, with this assignment statement, operates in a sequential mode. So, for each thread which is created for j, there is a whole loop which executes, which is sufficient work for the thread. So, the code generated would be something like this. This is loop interchange for parallelizability, which is beneficial in certain cases.
When we have a fixed amount of cache and we know the cache size, it is also possible to do what is known as loop blocking. Assuming that the block size is 64, we break the i loop into iterations with an increment of 64, the same is done for the j loop, and inside, the loops iterate over the 64 elements of each block. If we do this, then every time we go through one iteration of the outer loops we are going to get a block of 64 into the cache memory, and then the two inner loops work only on elements that are in the cache. If this is the case, then the program becomes much faster. So, loop blocking also is very beneficial when we have cache memory.
Now, we move on to the fundamentals of data flow analysis. Data flow analysis is basically a bunch of techniques that derive information about the flow of data along program execution paths. So, essentially we look at an execution path in a program as a sequence of points from p 1 to p n, where we consider the point just before a statement and the point just after the statement. So, for instance, if p 1 is a point just before a statement, then p 2 would be the point just after that statement. In general, p i is a point immediately preceding a statement and p i plus 1 is the point immediately following that statement, and of course, we could also reach the end of a block.
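Going back briefly to loop blocking, here is a minimal C sketch of the transformation. The loop body on the slide is not reproduced in this transcript, so the matrix transpose used here, the dimension N and the function names are all assumptions made only for illustration; the block size of 64 is the one mentioned in the lecture.

```c
#define N 128      /* hypothetical matrix dimension, chosen as a multiple of BLOCK */
#define BLOCK 64   /* block size used in the lecture */

/* Plain version: for every i, the j loop sweeps across the whole row. */
void transpose_plain(int a[N][N], int b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[j][i] = a[i][j];
}

/* Blocked version: the outer loops step in increments of BLOCK, and the
   inner loops touch only the BLOCK x BLOCK tile selected by (ii, jj). */
void transpose_blocked(int a[N][N], int b[N][N]) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    b[j][i] = a[i][j];
}
```

Both functions compute the same result; the blocked one only changes the order in which elements are visited, so that each tile can stay in the cache while the inner loops work on it. Whether this actually helps depends on the real cache parameters of the machine, which this sketch does not model.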
When we reach the end of a block, p i is the end of that block and p i plus 1 is the beginning of the successor block. So, we are essentially looking at the paths in the control flow graph. In general there is an infinite number of possible paths through a program, and there is no bound on the length of a path either. This is true because if we have a loop, then we do not know the number of times it iterates. So, there could be a very large number of paths.
Basically, data flow analysis, or program analysis, summarizes the program states that can occur at a point with a finite set of facts. So, even though there is a huge number of paths to a particular program point, we want a summary of the information coming along all these paths, and we attach that summary to that particular point. So, the analysis is certainly not a perfect representation, because we are summarizing the effect of many paths into that particular point. Data flow analysis aims to produce such summaries of information. What these are will become clear later, but the applications of such program analysis techniques are in program debugging, where we want to ask questions such as which are the definitions that reach a point, and they are also useful in optimizations such as constant folding, copy propagation, common subexpression elimination (CSE) and so on.
So, a data flow value for a program point represents an abstraction of the set of all possible program states that can be observed for that point. For reaching definitions, this could be the set of all definitions that reach the point. The set of all possible data flow values is the domain of that application. So, we are going to look at the reaching definitions problem, where the domain is the set of all subsets of the definitions in the program. For available expressions, we would consider expressions, with the set of all subsets of expressions as the domain, and so on and so forth. For reaching definitions, a particular data flow value is a set of definitions.
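To make the idea of a data flow value concrete, here is a small C sketch for reaching definitions. It is only an illustration: the program with four definitions d1 to d4, the bit-vector encoding and all the names are assumptions introduced here. It relies on the standard fact that a definition reaches a point if it reaches that point along at least one path, so facts arriving along different paths are summarized by set union.

```c
#include <stdint.h>

#define NUM_DEFS 4u        /* hypothetical program with definitions d1..d4 */

typedef uint32_t DefSet;   /* bit k-1 set means definition dk is in the set */

/* The singleton set { dk }, for k in 1..NUM_DEFS. */
DefSet def_bit(unsigned k) {
    return (DefSet)1u << (k - 1u);
}

/* Summary of two paths meeting at a point: a definition reaches the point
   if it reaches it along either path, so the sets are united. */
DefSet merge_paths(DefSet along_path1, DefSet along_path2) {
    return along_path1 | along_path2;
}

/* Size of the domain: the number of subsets of the set of definitions. */
unsigned domain_size(void) {
    return 1u << NUM_DEFS;
}
```

For example, if the set { d1, d2 } arrives along one path and { d2, d4 } along another, the summarized value at the join is { d1, d2, d4 }. The summary no longer records which path each definition came from, which is the loss of precision mentioned above.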
In general, all data flow problems involve equations with in of S and out of S, where S is a statement. So, we want to find a solution to the set of constraints that are imposed on in and out, and then say that, given these constraints, these are the values of in and out. We want to do this for all the statements, and that would be considered a solution to the data flow analysis problem. So, we will stop here and continue with this lecture in the next part. Thank you.