Welcome to part 2 of the lecture on automatic parallelization. In this lecture we will continue our discussion of data dependences and direction vectors, and look at a couple of examples of vectorization, concurrentization, and so on. To do a bit of recap, we know that there are three types of dependences. If S1 and S2 are two statements, and a definition of X in S1 is used in S2 without any intervening modification, it is a flow dependence. If the usage happens before the definition, it is an anti dependence, and if there are two definitions of the same location, it is an output dependence. The data dependence direction vector is additional information attached to the dependence itself. There is one direction vector component for each loop in a nest of loops; if there is a triply nested loop, we have one component for each of the three loops. The direction vector is thus a vector of length d, where d is the depth of nesting, and each component can be <, =, or >; the values <=, >=, != and * are derived from the principal components <, =, and >. The < (forward) direction means some quantity is computed in iteration i and used in a later iteration i + k; this is the most common type of direction vector component. The > (backward) direction means the dependence is from iteration i to iteration i - k: computed in iteration i and used in iteration i - k. If this does not look possible — and in single loops it is indeed not possible — note that in doubly nested or deeper loop nests it is; I am going to give you examples of this later. The = direction says the dependence is within the same iteration: computed in iteration i and used in iteration i. We saw this example last time: the first one is X(j) = X(j) + C.
So, the value of j is the same in both references, within the same iteration: we use X(j) first and then define it, so it is a dependence with the = direction. The next one is X(j+1) = X(j): we compute first and use later, in a different iteration, so this is S δ(<) S. Then X(j) = X(j+1) + C: we use first and then compute, so this is an anti dependence with the < direction. The next loop runs downwards, again with X(j) = X(j+1) + C: you can see that X(99) is used first, X(98) is used later, and so on, so this is still a δ(<) type of relation. The last one is X(j) = X(j-1) + C: again we compute first and use later, so this is again a δ(<) relation. Here is a different example, with two levels of nesting, loops i and j. We have S1: A(i,j) = B(i,j) + C(i,j), and S2: B(i,j+1) = A(i,j) + B(i,j). Consider the expanded version of the two loops. For i = 1 and j = 1, expanding S1 we get A(1,1) = B(1,1) + C(1,1), and expanding S2 we get B(1,2) = A(1,1) + B(1,1). So there is a flow dependence from the first to the second, and obviously the i iteration is the same and the j iteration is also the same: S1 δ S2 with both direction components =. Then, if you consider j = 1 and j = 2, B(1,2) is defined at j = 1 and used at j = 2, but the value of i is the same. This is again a flow dependence, computed in an earlier iteration and used in a later one: the i component is = and the j component is <, and this dependence runs from S2 to S1. Then we have B(1,3), defined in S2, and it is used at j = 3, again in S2.
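To make the single-loop cases above concrete, here is a small Python sketch (the array name X and the constant C are illustrative) of the three situations: a flow dependence carried forward across iterations, an anti dependence, and repeated definitions of a scalar giving an output dependence.

```python
# Illustrative sketches of the three dependence types from the recap.

def flow_example(X, C):
    # S: X(j+1) = X(j) + C -- X(j) is defined in iteration j and used in
    # iteration j+1: a flow dependence with the < direction.
    for j in range(len(X) - 1):
        X[j + 1] = X[j] + C
    return X

def anti_example(X, C):
    # S: X(j) = X(j+1) + C -- X(j+1) is used in iteration j and redefined
    # in iteration j+1: an anti dependence with the < direction.
    for j in range(len(X) - 1):
        X[j] = X[j + 1] + C
    return X

def output_example(X, C):
    # T is (re)defined in every iteration: an output dependence on T.
    T = 0
    for j in range(len(X)):
        T = X[j] + C
    return T
```

Running `flow_example([1, 0, 0], 1)` propagates the value forward, which is exactly why such a loop cannot be naively vectorized.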
So, you can look at the B(1,2) references here as well. This is again a flow dependence, similar in kind, only it is between S2 and S2: defined at an earlier j value of S2 and used at a later j value of S2. So it is an S2 to S2 dependence, δ, with the first component = and the second component <. That is the direction vector in this example, and the dependence diagram is also here. We place the same dependences in it: from S1 to S2 there is δ(=,=), that is one; then there is one from S2 to S2, which is δ(=,<); and the third is from S2 to S1, which is also (=,<). These are the three dependences that we have, along with their direction vectors. Here is a third example of direction vectors; again we have two loops in both cases, and this one is supposed to show the < and > directions together. As I said, the > direction says a value is computed in a later iteration but used in an earlier one. That does not seem to make sense, but it does when we consider doubly nested loops. We have S1: A(i+1, j) = ..., and S2: ... = A(i, j+1). Let us expand the loops: with i = 1 and j = 2, S1 defines A(2,2); with i = 2 and j = 1, S2 uses A(2,2). So clearly there is a dependence from S1 to S2: A(2,2) is being computed in one and used in the other, a flow dependence, S1 δ S2. What about the direction vector? The value of i has increased from the definition to the use: we compute at a lower i iteration number and use at a higher one.
So the direction component for i is <. For j, we compute at the higher iteration number, j = 2, and use at the lower one, j = 1, so the second component is >. There is no trick here; it is simply that the value of i differs between the two references. The j loop starts running afresh for every value of i: in the j iterations corresponding to i = 1 we define A(2,2) at j = 2, but once we move to the next value of i, j starts running again from the beginning, and that is where A(2,2) is used. So we really are using the value of A(2,2) at a lower j iteration number, but definitely in a different, later iteration of i; this is quite realistic. Then the second example, S2 δ(<,>) S1: we have the i loop and the j loop, with A(i, j+1) on the left-hand side in S2 and A(i+1, j) on the right-hand side in S1. Again we expand: with i = 1 and j = 2, S2 defines A(2,2); with i = 2 and j = 1, S1 uses A(2,2). So again there is a flow dependence; the i value increases, so the direction is < for i, and the j value reduces, so it is > for the second component. So all combinations of directions are possible, and I have given you examples of this. Let us look at one more example. Here we have a two-level nest of loops i and j, and inside j we have two independent loops, one for k and another for l. The only direct dependence between the two statements is that X is defined in one and used in the other, but because of i and j there are several instances of this dependence. Let us expand the loops with i = 1 and i = 2, and three values of j: j = 1, 2, and 3.
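The (<, >) direction in the doubly nested example can be checked mechanically. This small sketch (a hypothetical helper, not from the lecture) enumerates iteration pairs of the nest, matches the definition A(i+1, j) in S1 against the use A(i, j+1) in S2, and collects the direction vectors of the resulting flow dependences:

```python
# Enumerate flow dependences between S1: A[i+1][j] = ... (definition) and
# S2: ... = A[i][j+1] (use) in a doubly nested loop over 1..n-1, and report
# the direction vector of each dependence found.

def directions(n):
    sign = lambda a, b: "<" if a < b else ("=" if a == b else ">")
    dirs = set()
    for i1 in range(1, n):              # iteration (i1, j1) executes S1
        for j1 in range(1, n):
            for i2 in range(1, n):      # iteration (i2, j2) executes S2
                for j2 in range(1, n):
                    same_elem = (i1 + 1, j1) == (i2, j2 + 1)
                    def_first = (i1, j1) < (i2, j2)   # lexicographic order
                    if same_elem and def_first:
                        dirs.add((sign(i1, i2), sign(j1, j2)))
    return dirs
```

For any n the only direction vector produced is (<, >): i always increases from definition to use, while j decreases, exactly as worked out above.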
So, with i = 1 and j = 1 we have X(1,2,k); I have not expanded k, because k is an independent loop here, and l is another independent loop. We are only looking at the dependences corresponding to the direction vector components for i and j. A suitable value of k and l can always be chosen — I can just take k = 1 and l = 1 — there is no problem. So X(1,2,k) is defined here, and then with the same value of i but a different value of j, X(1,2,l) is used there; taking k = 1 and l = 1 makes the references identical, which establishes a concrete dependence between the two, as shown. Similarly, between X(1,3,k) defined here and X(1,3,l) used there, there is a dependence; again k and l can be set to suitable equal values. Now for the second statement: A(2,1,l) is defined here, and A(2,1,k) is used at a different value of i, so k and l can be equalized again; likewise A(2,2,l) is defined here and A(2,2,k) is used there, another instance of the same dependence. In the first case it is a flow dependence, so δ, and in the second case it is also a flow dependence, again δ. In the first case, the first direction component is marked = because the value of i is the same for both references — the same column — whereas in the second, the first component is < because the value of i differs from definition to use: it is 1 here and 2 there, and it has increased in that direction. The second component is < in the first case because j has increased, and = in the second case because the value of j is the same across the two references. So if we draw the dependence diagram, from S1 to S2 we have δ(=,<).
And from S2 to S1 we have the other one, δ(<,=). Observe that one edge runs from S2 to S1 whereas the other runs from S1 to S2; these are the two edges in this dependence diagram. We will see how this matters a little later. So far we saw examples of data dependence relations and direction vectors. Now it is time to understand how to use the dependences in order to do vectorization and concurrentization. Individual nodes of the data dependence graph (DDG) are statements of the program, and edges depict data dependences among the statements; we have already seen how this graph is created. If the DDG is acyclic, then vectorization of the program is possible and straightforward. Remember, the most important condition for vectorization is that the data dependence graph be acyclic; the direction vector by itself does not pose a problem here, although it will definitely pose a problem for concurrentization. Vector code generation can then be done using a very simple topological sort order on the data dependence graph. Suppose instead that the graph is cyclic; then it is a bit more complicated. We find all the strongly connected components (SCCs) of the data dependence graph and reduce the DDG to an acyclic graph by treating each strongly connected component as a single node. Once it has become acyclic we can generate vector code, but the SCCs themselves cannot be fully vectorized, so the final code will contain some sequential loops and possibly some vector code. This is a bit more complex, and it is not possible to provide a very simple example here, so we are not going to give an example for this case; we will concentrate our attention on the acyclic case.
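The SCC-based reduction just described can be sketched with Tarjan's algorithm (an illustrative implementation; the node names are arbitrary). Each strongly connected component found becomes a single node of the reduced, acyclic DDG:

```python
# Tarjan's algorithm: find the strongly connected components of a DDG given
# as a successor map {node: [successors]}.  Each SCC is returned as a set.

def sccs(nodes, succ):
    index, low, stack, on_stack, out = {}, {}, [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)      # discovery order
        stack.append(v)
        on_stack.add(v)
        for w in succ.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:             # back edge inside current SCC
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:              # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            out.append(frozenset(comp))
    for v in nodes:
        if v not in index:
            visit(v)
    return out
```

For a DDG where S1 and S2 depend on each other and S3 hangs off S2, this yields two components: the cycle {S1, S2}, which must stay a sequential loop, and {S3}, which can be vectorized.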
Then, in the case of concurrentization: if all the dependence relations in a loop nest have a direction vector component of = for a given loop, then the iterations of that loop can be executed in parallel with no synchronization between iterations. Remember what the = direction means for a loop: all the dependences are within the same iteration and do not flow across iterations, so the iterations of the loop can be executed in parallel. There are a couple of observations here which are very important. First, any dependence with a forward direction in an outer loop will be satisfied by the serial execution of that outer loop: if there is a < direction in the outer loop and you run it sequentially, the dependence is automatically satisfied for that loop. Second, if an outer loop L is run in sequential mode, then all the dependences with a forward direction at the level of L will be automatically satisfied, even those of the inner loops. This is very important: if we run the outer loop sequentially, we can run all the inner loops in parallel, provided the direction vectors permit. We do not have to worry too much; once the outer loop runs sequentially, everything is satisfied at the inner levels, so we can run them in parallel. The only requirement is that the outer loop have the < direction in all the edges: if some edges of the data dependence diagram have the = direction at the outer loop, then running the outer loop sequentially will not satisfy all the dependences, and that will be a problem. In other words, the observation does not hold for dependences with the = direction at the outer level; as already mentioned, those dependences of the inner loops will have to be satisfied by appropriate statement ordering and/or loop execution order.
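The rule above can be stated as a small predicate (a sketch; direction vectors are written as tuples of "<", "=", ">", one component per loop level):

```python
# A loop's iterations can run fully in parallel only if every dependence in
# the nest has "=" in that loop's direction-vector component, i.e. no
# dependence crosses that loop's iterations.

def parallelizable(loop, direction_vectors):
    return all(dv[loop] == "=" for dv in direction_vectors)
```

With the edges δ(=,<) and δ(=,=), the outer loop (component 0) is parallelizable because every dependence stays within one of its iterations, while the inner loop (component 1) is not, because one dependence crosses its iterations.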
We are going to see examples of this very soon. So, let us take an example of vectorization. Here is a loop on i with two statements S1 and S2, and here is another loop, also indexed by i, with S3 and S4; the index name of course does not matter, because the two loops are independent. Now we have S1, S2, S3, and S4, and the dependences among these statements are also shown. Between S1 and S2 there is nothing in common, so they have no dependence. X(i) is computed in the first loop and then used in the second, so between S1 and S3 there is a flow dependence δ; I have not bothered to indicate the direction vector, because vectorization does not care about direction vectors. Then, we use X(i) in S3, but we also compute X(i+1) in S4: we compute X(2) at iteration 1 and use it at iteration 2, X(3) is computed and then used, and so on, so from S4 we have a dependence δ to S3. This is very important. Then the value of B(i) is computed in S2 and used in S4, with the same value of i, so from S2 to S4 we again have a δ dependence; and between S1 and S4 there is also an output dependence, from S1 to S4, since both define X. These are the various dependences in our program. Now, this is obviously a directed acyclic graph: there are no cycles, so if we do a topological sort of the graph, the vector statements can be emitted. S1 has no incoming arcs, so we can emit its code directly: X(1:99) = (1:99), the vector constant. X(i) = i means X(1) = 1, X(2) = 2, and so on, so the right-hand side is the vector constant 1, 2, 3, ..., 99, and this assignment is the vector statement for that part of the loop. The second statement is S2.
So, again it is very similar: B(i) = 100 - i becomes B(1:99) = (99:1:-1). For i = 1 the value starts at 99, then goes to 98, 97, and so on; the vector with a stride of -1 automatically denotes the sequence 99, 98, 97, and so forth. So that is S2. These two statements have no incoming edges, so they can be processed right at the beginning. Now, S3 has an incoming edge from S4 and another from S1, so S3 can be processed only after S1 and S4 are complete; S4 itself can be processed once S1 and S2 are complete, and we have already finished code generation for those two. In the execution order, the vector statements are executed sequentially. So next is S4: X(i+1) = G(B(i)) becomes X(2:100) = G(B(1:99)), the vector statement for that part of the loop. And lastly S3: A(1:99) = F(X(1:99)). This is a very simple vectorizable set of loops: we just emit the code in topological sort order, and it gets done automatically. Now the second example. We have already seen this program before, and we also discussed its dependences: from S1 to S2 there is a dependence δ(=,<), and from S2 to S1 there is another dependence δ(<,=). So the graph is cyclic, and therefore the i and j loops cannot be vectorized; of course, it is always possible to vectorize the k and l loops separately, that is never an issue. So now we try to run the i loop in sequential mode: i = 1 to 100. With the i loop running sequentially, the dependences corresponding to i will all be satisfied, and we can take the i component out of the dependence diagram. Consider, for example, the edge δ(<,=).
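The four-statement example can be mimicked end to end in plain Python: a topological sort of the DDG gives the emission order, and the vector statements are then executed one after the other. Lists stand in for vectors, the arrays are treated as 1-based, and F and G are placeholder functions (the lecture does not define them):

```python
# Topological sort of the acyclic DDG, then emission of the four vector
# statements of the example in that order.

def topo_order(nodes, edges):
    indeg = {n: 0 for n in nodes}
    for _, d in edges:
        indeg[d] += 1
    order, ready = [], [n for n in nodes if indeg[n] == 0]
    while ready:
        n = ready.pop(0)
        order.append(n)
        for s, d in edges:
            if s == n:
                indeg[d] -= 1
                if indeg[d] == 0:
                    ready.append(d)
    return order

# DDG of the example: S1->S3 (flow), S4->S3 (flow), S2->S4 (flow), S1->S4 (output)
order = topo_order(["S1", "S2", "S3", "S4"],
                   [("S1", "S3"), ("S4", "S3"), ("S2", "S4"), ("S1", "S4")])
# order comes out as ["S1", "S2", "S4", "S3"], matching the emission order.

n = 99
f = lambda v: v + 1          # placeholder for F
g = lambda v: 2 * v          # placeholder for G
X, B, A = [0] * (n + 2), [0] * (n + 1), [0] * (n + 1)   # index 0 unused

X[1:n+1] = list(range(1, n + 1))                  # S1: X(1:99) = (1:99)
B[1:n+1] = list(range(n, 0, -1))                  # S2: B(1:99) = (99:1:-1)
X[2:n+2] = [g(B[i]) for i in range(1, n + 1)]     # S4: X(2:100) = G(B(1:99))
A[1:n+1] = [f(X[i]) for i in range(1, n + 1)]     # S3: A(1:99) = F(X(1:99))
```

Because S4 runs in full before S3, every X(i) that S3 reads is already the value the sequential loops would have produced.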
For the edge δ(<,=), the < component for i is automatically satisfied by the sequential execution of i, and the remaining = component is never a threat for vectorization, so we can remove this arc completely; that is what we have done here. The other edge, δ(=,<), arose because of the second (j) component; but when vectorization is performed, we read the entire right-hand-side vector first and then make the assignment, in both statements, and because of that the vectorization of the j loop is also possible. So the i loop dependences are satisfied, and the j loop dependences are handled as before: we first emit the vector code for S1 and then the vector code for S2. These become the two vector statements inside the sequential i loop, executed one after the other: X(i, 2:101, 1:100) = A(i, 1:100, 1:100), and so on. Let us go back to the dependence diagram: we are running the i loop sequentially, so the i = 1 instances all run first, then i = 2, and so on, and the j part is vectorized; that is precisely what we are doing. The dependence from S1 to S2 is taken care of because in the vector code all the S1 instances are executed first, and only then do the S2 instances begin; so within each i iteration the dependence between the two X references is automatically taken care of. There is also a statement here that the j loop cannot be parallelized, and that is true. The reason is that in this particular edge the direction component for j is <, while the component for the i loop is = rather than <.
So, what we really stated in the observations is that if the direction component for a loop is < in all of the edges, then running that loop sequentially satisfies all the dependences inward; but in this case we have = in one edge and < in the other, so even if we run the i loop in sequential mode, the j loop cannot be run in parallel mode. The k and l loops, however, can always be parallelized: assuming we run i and j in sequential mode, the k loop and the l loop can both be vectorized, and even run in parallel if necessary — though that is not advisable, and we will see why later. Here is the example with the code slightly changed. In the previous one the dependences ran in a slightly different fashion; here both the code and the dependences are slightly different. Before, the statements were X(i, j+1, k) = A(i, j, k) and A(i+1, j, l) = X(i, j, l); in this case they are X(i, j+1, k) = A(i, j, k) and A(i+1, j+1, l) = X(i, j, l) — it is not j anymore but j+1. What happens is that the second dependence is no longer to the same j but to the next one. The first dependence is as before, still δ(=,<); but this second dependence now runs from i = 1 to i = 2, which is < again, and from j = 1 to j = 2, which is also <. So we have δ(<,<) from S2 to S1 and δ(=,<) for the other edge. The dependences have changed, and now it is possible to interchange the i and j loops; there are test conditions for loop interchange, and in this case they are satisfied. In other words, the j loop runs first and then the i loop.
If that happens, then obviously the direction vector components also get swapped: δ(=,<) becomes δ(<,=), and δ(<,<) of course remains δ(<,<). Now both edges have < for the outer loop, which after the interchange is the j loop, with the i loop inside. Even though the dependences remain the same, it is now possible to run the j loop in sequential mode; if we do that, the dependences of all the loops nested inside will be satisfied. In other words, if we run the outer loop, which is now j, in sequential mode, then the i loop can be run in parallel mode. That is the advantage that we have, and that is one of our examples of concurrentization: we run the outer loop in sequential mode, and then we can run the inner i loop in parallel mode. This is advantageous because the inner loop, being bigger, increases the amount of work for each thread; if the inner loop had very little work, making it into a thread would be of not much use. Here are more examples of concurrentization. We have i = 2 to n and j = 2 to n, with statements S1 and S2. When we expand the loops for two consecutive values of i, with three values of j each, we have A(2,2) with A(1,1), A(2,3) with A(1,2), A(2,4) with A(1,3) on one side, and on the other side A(3,2) with A(2,1), and A(3,3) with A(2,2). So A(2,2) is used in one place and it is defined in another, and there are many dependences here. For example, from A(2,2) we get S1 δ(<,<) S2: from S1 to S2 there is a flow dependence, with direction vector (<,<).
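The component swap under interchange can be captured in a couple of lines (a sketch; the assertion encodes the legality requirement that no swapped vector may begin with ">"):

```python
# Loop interchange swaps the two components of every direction vector.
# A ">" in the new first (outer) component would be meaningless, so such
# an interchange is illegal and rejected here.

def interchange(direction_vectors):
    swapped = [(d2, d1) for (d1, d2) in direction_vectors]
    assert all(dv[0] != ">" for dv in swapped), "interchange not legal"
    return swapped
```

For the example above, interchanging turns {(=,<), (<,<)} into {(<,=), (<,<)}: both vectors now begin with <, so serializing the new outer (j) loop satisfies every dependence.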
Here one reference is at the lower iteration pair and the other at the higher: in both components the second iteration number is greater than the first, so this dependence corresponds to (<,<). Then we have another one, an anti dependence from S1 to S2 with direction (=,=), corresponding to B(i,j): I have not shown it explicitly, but it is easy to see — B(i,j) is used and then defined within the same iteration. The third one is also an anti dependence, from S2 to S2 with direction (=,=): B(i,j) appears on both sides of S2, a usage and then a definition, so there is an anti dependence from S2 to itself as well. These are the three dependences in this loop. The flow dependence is the true dependence; the other two are anti dependences and are not of much importance to us here. Now, if we run the outer loop — the i loop — in sequential mode, then the j loop can be run in parallel mode; that is the advantage here. We run i serially and j in parallel, and the flow dependence is obviously satisfied. In the second example of concurrentization we again have an i loop and a j loop, and S1 δ(=,<) S2: A(2,2) is defined and used with the same value of i but different values of j, which is why this direction vector is correct. Then we have an anti dependence from S1 to S2 with (=,=), corresponding to the B(i,j) references, and an anti dependence from S2 to S2 with (=,=), corresponding to the B references within S2. Now the j loop cannot be run in parallel mode; however, the i loop definitely can. In the first example we could run neither the i loop nor the j loop in parallel directly — that is why we resorted to serializing i — whereas here the i component is = in all three edges.
So the i loop can definitely be run in parallel mode, while the j loop cannot; but that is perfectly fine, because running the j loop in sequential mode gives a good amount of work to each parallel iteration. In the previous example we ran i sequentially and tried to run j in parallel, and there the amount of work per parallel iteration is a little less compared to this. If the j loop is big enough it can still be run in parallel with enough work per thread, but if it has little work, it is not a good idea to run it in parallel. Now let us look at a couple of transformations which can increase parallelism. There are many of these: recurrence breaking or cycle breaking, ignorable cycles, scalar expansion, scalar renaming, node splitting, threshold detection, index splitting, if conversion, and so on; then we have loop interchange, loop fusion, and loop fission. We are going to look at only a few of these, to understand what goes on. Scalar expansion, for example: we have a scalar T in the loop, defined in one statement and used in another. If you look at the dependence diagram of this particular program, we have S1, S2, and S3. From S1 to S2 there is an anti dependence, because of A(i); from S1 to S3 there is δ(=), between the two T references; between S2 and S3 there is again an anti dependence; from S3 to S1 there is an anti dependence with direction <, because T is used in S3 and then defined by S1 in a later iteration; and finally there is a self loop on S1, an output dependence with direction <, corresponding to T. We write into the same location again and again, so iteration i = 1 must write into T, and only then can iteration number 2 write into T.
So this is an output dependence on T for S1, and since the statement is the same, we get a self loop; it is < because the iteration numbers keep increasing. This graph is obviously cyclic. Suppose we make T into a vector — that is scalar expansion, the scalar is expanded into a vector. We get TX(i) = A(i), A(i) = B(i), and B(i) = TX(i). If we do that, the self loop goes away, and we are left with the dependence from S1 to S2, the one from S2 to S3, and the flow dependence from S1 to S3; the rest of the dependences vanish. Now this particular data dependence diagram is cycle free, so we can vectorize it using topological sort: first S1, then S2, then S3, and we can very simply emit the vector code. The other possibility, if we are running on multicore processors, is to make the temporary T a private variable, separate for each core. Assume each iteration runs on a different core; for each core a little bit of memory space is available, so we make T a private variable for each iteration. Then again the graph becomes cycle free — just as before, there is no cycle, and all the dependences are within the same iteration — so we can easily parallelize this particular loop as well. Scalar expansion may not always be profitable, however. Consider this program: T = T + A(i) + A(i+2) and A(i) = T. This gives a cyclic graph with many dependences: a dependence from S1 to S1 itself, dependences from S1 to S2 — one for T and one for the A references — and a dependence back from S2 to S1 because of A(i+2) and A(i).
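Going back to the profitable case, the expanded loop can be sketched as three whole-vector statements (a sketch; note that the net effect of the original loop is simply to swap A and B element by element):

```python
# Scalar expansion: the scalar T becomes a vector TX, so the output and anti
# dependences on T disappear and each statement becomes one vector operation,
# emitted in topological order S1, S2, S3.

def scalar_expanded(A, B):
    n = len(A)
    TX = [0] * n
    TX[:] = A[:]        # S1: TX(i) = A(i)   (was T = A(i))
    A[:] = B[:]         # S2: A(i) = B(i)
    B[:] = TX[:]        # S3: B(i) = TX(i)   (was B(i) = T)
    return A, B
```

Each assignment now operates on whole vectors with no value carried across iterations, which is exactly what the cycle-free DDG permits.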
So there are many of these dependences, including the self loop on S1 itself, and the DDG is cyclic; and making the temporary into a vector still leaves it a cyclic data dependence graph. It does not change things much: a couple of the edges are removed — the output self loop on T is gone — but the flow of T from one iteration to the next remains, so we still have a cycle through S1. We got rid of one or two edges, but the cycle still remains, and a cyclic data dependence graph cannot be vectorized: we will have to run that part sequentially and vectorize the rest, or else run it all sequentially. Scalar renaming tries to remove anti and output dependences. Here we have T = A(i) + D(i)*B(i), and then we use T again: T = D(i) + D(i)*B(i). If we introduce a separate variable for each of these, they become T1 and T2, and the output dependence between S1 and S3 goes away; now we can vectorize this code. By renaming the scalar T into separate variables T1 and T2, we have eliminated the output dependence, and the code can be vectorized. That is easy to see, because this T1 no longer has any dependence hanging on it. There is still A(i) here and A(i+2) there, but we are not executing this code concurrently: we execute S3, then S4, then S1, and then S2. So we compute in S4 and then use the value in S1 — that is automatically taken care of — and we compute in S3 and use in S4.
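A minimal sketch of scalar renaming, assuming illustrative expressions and hypothetical consumers C1 and C2 of the two temporaries (the point is only that the two groups no longer share the scalar T, so each group can be vectorized independently):

```python
# Scalar renaming: the scalar T, reused for two unrelated values, is split
# into T1 and T2; the output/anti dependences between the two groups vanish.

def renamed(A, B, D):
    n = len(D)
    T1 = [A[i] + D[i] * B[i] for i in range(n)]   # first group's T becomes T1
    C1 = [t + 1 for t in T1]                      # hypothetical consumer of T1
    T2 = [D[i] + D[i] * B[i] for i in range(n)]   # second group's T becomes T2
    C2 = [t - 1 for t in T2]                      # hypothetical consumer of T2
    return C1, C2
```

With the shared scalar gone, the four statements can be emitted as four vector operations in any order that respects the remaining flow dependences.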
So we compute in S3 and use in S4 — that is also taken care of — and we compute in S1 and use in S2, which is taken care of as well. In this manner we can make the code run in vector fashion. We had looked at if conversion a little earlier; if conversion is also a way to assist vectorization and, of course, concurrentization. Here we have a program: for i = 1 to 100, if A(i) <= 0 then continue, else A(i) = B(i) + 3. Since this is a conditional statement, there is no way to make it vector code directly. So what we do is introduce a vector of conditions for A(i) <= 0: BR(i) = A(i) <= 0. Then we change the program to say: if not BR(i) then A(i) = B(i) + 3 — instead of the continue, we have written it this way. Now assume that a masking operation is available on the vector machine; masking says: wherever the mask is true, execute the statement, and wherever it is false, do not execute it. We compute the mask as before — up to here this was still a sequential loop — and then, having made a vector of the conditions, we write: WHERE (NOT BR(1:n)) A(1:n) = B(1:n) + 3. We have introduced this masked statement, and it is vector code. This is the advantage of if conversion: all the control dependence due to the if-then-else condition is automatically converted into data dependence. Then we have another example, with S1, S2, S3, S4 inside a loop. There is an if-then, and within the condition there are two statements, C(i) = C(i) + A(i) and D(i+1) = D(i+1) + 1, followed by the end of the if.
So, there are two statements within the condition and one statement outside it. If you draw the dependence graph: S1 and S2 have no dependences on them; S3 depends on S2 through the condition, which is why that edge is marked with a 'c'; and there is a dependence from S1 to S3 as well, because a(i) is computed in S1 and used in S3. Then there is a dependence from S2 to S4, again conditional just like the one from S2 to S3, and a dependence from S4 to S1, because d(i) is computed in S4 and used in S1. Given these dependences, we can emit the corresponding vector code. (The 'for i = 1 to n' shown on the slide is a minor mistake — there is no loop here; this is vector code, so just omit it.) First, temp(1:n) = b(1:n) > 0 computes the condition as a vector. Then we have a masked execution: where temp(1:n), d(2:n+1) = d(2:n+1) + 1 — the second guarded statement executed as a masked statement. So what we have really done is execute the mask statement first, that is S2; then S4 as a masked statement, because S1 is dependent on it; then S1, which is unconditional; and then S3, which is again a masked statement. The order in which this executes is a topological sort of the dependence graph: first S2, then S4, then S1 and finally S3. That is the execution order for the vector code.
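The emitted vector code, executed in the topological order S2, S4, S1, S3, can be sketched as below. S1's right-hand side is not shown in the transcript, so `a[i] = d[i]` is a hypothetical stand-in chosen only to preserve the S4 → S1 dependence on d.

```python
# Topological-order execution of the masked vector statements.
def emit(a, b, c, d):
    n = len(b)
    temp = [b[i] > 0 for i in range(n)]   # S2: the condition as a vector
    for i in range(n):                    # S4 under mask: d(2:n+1) += 1
        if temp[i]:
            d[i + 1] += 1
    for i in range(n):                    # S1 (unconditional); assumed body
        a[i] = d[i]                       # uses d(i), hence runs after S4
    for i in range(n):                    # S3 under mask: c(i) += a(i)
        if temp[i]:
            c[i] += a[i]
    return a, c, d
```

Running S4 before S1, and S1 before S3, is exactly what the topological sort of the dependence graph demands.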
Loop interchange is another very important transformation. For machines with vector instructions, inner loops are preferable for vectorization, and loop interchange can enable this. For multicore and multiprocessor machines, parallel outer loops are preferred, and again loop interchange may be able to achieve this. The conditions for loop interchange to be possible are very simple. First, L1 and L2 must be tightly nested: no statements between the two loop headers. Second, the loop limits of L2 must be invariant in L1 — we cannot have i = 1 to n with the inner loop running j = i to something; that cannot be done. Third, and most important, there must be no statements Sv and Sw in L1 with a dependence Sv δ*(<, >) Sw; the δ* means it could be any type of dependence — flow, anti or output. If we had such a (<, >) dependence and interchanged the loops, the > would become the first component of the direction vector, which is meaningless; that is the reason why such loops cannot be interchanged. Here is a very simple case: two loops and one statement, with the dependence S δ(=, <) S. Interchanging is possible. The inner j loop could not be vectorized, but after interchanging the two loops, j runs outside and i runs inside, and the inner i loop is definitely vectorizable. The dependence direction vectors have also changed accordingly, as you can observe here. This is the vector code corresponding to that nest: the outer loop runs in sequential mode and the inner loop runs in vector mode. Again, in the next example the outer loop is not parallelizable because we have a < direction there.
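A minimal sketch of the interchange: the slide's statement is not visible, so the body `x[i][j+1] = x[i][j] + 1` is an assumed statement with dependence direction (=, <) — the i component is =, the j component is <.

```python
# After interchange, j becomes the sequential outer loop (it carries the
# < direction) and the inner i loop carries only the = direction, so it
# can be vectorized.
def interchanged(x):
    rows, cols = len(x), len(x[0])
    for j in range(cols - 1):     # sequential: carries the dependence
        for i in range(rows):     # vectorizable: no carried dependence
            x[i][j + 1] = x[i][j] + 1
    return x
```

The interchange is legal because (=, <) has no > component that would be promoted to the first position.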
So, we want to exchange these two loops; once exchanged, the outer direction becomes =, and there is more work per thread, because the i loop will now run in sequential mode and the j loop will run in parallel mode — each thread now runs a whole sequential loop. If you look at the dependence diagrams, say these are the instances of the statements for different values of i and j, with the dependences drawn in black. Originally we run the iterations in this order; after loop interchange we run them in the other order. You can see that with dependences of this forward type the new order violates none of them, and that is what is most important if the loop interchange is to be legal. Whereas if we have a dependence pointing the other way, running in the original order satisfies the dependence from instance (1,2) to instance (2,1), but in the interchanged order the target runs before the source, so the dependence is violated. Any such backward dependence is going to be violated by a loop interchange on loops of this kind. In some cases we have dependences in both directions. Loop interchange is then still legal, but obviously it gives no benefit, because we cannot run either loop in parallel when there are dependences in both directions. That is the example of loop interchange being possible but with no benefit. Now, loop fission says: the whole loop may not be vectorizable, so divide it into smaller loops; this loop becomes vectorizable while that one does not, but the program still speeds up a little because the vectorizable part now runs as vector code. This is loop fission — we divide the loop into parts. In this example there is a dependence from S3 to S2: we have c(i) here and c(i+1) there, so we compute here and use there.
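The legal side of loop fission can be sketched with hypothetical statement bodies whose only dependence is forward (S2 uses a(i) produced by S1 in the same iteration), so splitting the loop preserves the results:

```python
# Original fused loop: S1 then S2, forward dependence S1 -> S2 via a(i).
def fused(a, b):
    for i in range(len(a)):
        a[i] = b[i] + 1      # S1
        b[i] = a[i] * 2      # S2: uses a(i) from S1 (forward dependence)
    return a, b

# After fission: each statement gets its own loop.  Every a(i) is still
# fully computed before any S2 instance reads it, so this is legal.
def fissioned(a, b):
    for i in range(len(a)):  # first loop: S1 only
        a[i] = b[i] + 1
    for i in range(len(a)):  # second loop: S2 only
        b[i] = a[i] * 2
    return a, b
```

Had the dependence run backward across the split point (as in the S3 → S2 case on the slide), the second loop would consume values the first loop can no longer provide in time, and the fission would be illegal.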
So, if we break the loop there — making those two statements into one loop and S3 into another — this dependence will obviously be violated: we cannot compute something in S3 and then use it in S2 once they sit in separate loops. Whereas in this other case all the dependences are in the forward direction, so we can break the loop either here, or here, or at both places — in other words, we can make three loops out of these three statements and still violate none of the dependences. So, that brings us to the end of the lecture, and the end of the course as well. Thank you.