Welcome to part 3 of the lecture on instruction scheduling and software pipelining. Today we will continue our discussion of instruction scheduling, specifically super block and hyper block scheduling, and then move on to software pipelining.

To recap a little, global acyclic scheduling is the general name given to all scheduling algorithms that look at more than one basic block. Trace scheduling, super block scheduling, hyper block scheduling and software pipelining all belong to this family. Such algorithms are needed because basic blocks are quite small on average; with very small basic blocks, instruction scheduling confined to a single block becomes rather ineffective, and this is a serious concern for architectures with a lot of instruction-level parallelism, such as VLIW and superscalar machines.

Trace scheduling is one such method, and it has several variants: ordinary trace scheduling, super block scheduling and hyper block scheduling. A trace is simply a frequently executed acyclic sequence of basic blocks in a control flow graph, that is, part of a path. There is no very rigorous definition of a trace; you can make a trace as big as possible or as small as necessary, but the reasons for making it big or small should be kept in mind. To identify a trace, we pick the most frequently executed block and then extend the trace from this block forward and backward along the most frequently executed edges. When we perform trace scheduling, we combine the blocks of the trace and schedule them as if together they formed a single basic block. When we do this, the execution time of the trace usually reduces, but the execution time of the other paths will definitely increase. The overall performance generally improves; the concern is the compensation code that must be inserted for the off-trace blocks. Such compensation code can become quite large, and that is one reason we may not want extremely large traces.

The second variety is super block scheduling. A super block is again a trace, but one without side entrances; it was precisely the side entrances that forced us to introduce compensation code for ordinary traces. Control can enter a super block only from the top, although multiple exits are possible, and this eliminates several bookkeeping overheads, especially compensation code insertion. How do we form super blocks? Form a trace as before and then duplicate the tail to avoid side entrances into the super block; of course, this increases the code size. Here is a simple example with the same set of instructions and the control flow graph. We had these blocks, two of which are combined into a single block here. We formed the main trace from these blocks, and for the off-trace path we make a copy of the tail block, so that the side entrance from the off-trace path into the main trace is avoided.
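Since the slide with the control flow graph is not reproduced in this transcript, the following is a minimal source-level sketch of what tail duplication does, using a hypothetical if-else fragment; the function names and block labels B1 to B4 are inventions of this sketch, not the lecture's example.

```c
/* Before: B1 branches to B2 (on the main trace) or B3 (off-trace); both
 * fall into the shared tail B4, so B4 is a side entrance into the trace. */
int before(int x, int y) {
    if (x > 0)          /* B1 */
        y = y + 1;      /* B2: main trace      */
    else
        y = y - 1;      /* B3: off-trace path  */
    return y * 2;       /* B4: shared tail     */
}

/* After tail duplication: the main trace B1-B2-B4 becomes a super block
 * entered only from the top; the off-trace path gets its own copy B4',
 * at the cost of a larger program.                                       */
int after(int x, int y) {
    if (x > 0) {
        y = y + 1;      /* B2                  */
        return y * 2;   /* B4                  */
    } else {
        y = y - 1;      /* B3                  */
        return y * 2;   /* B4' (duplicated)    */
    }
}
```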
With this, scheduling continues as before: we combine these three blocks and schedule them, and then combine these two blocks and schedule them. Observe, however, that there is code duplication here; in general, with a large control flow graph there will be a lot of such duplication, and this increases the code size. With the super block we have done even better: only 5 cycles are required for the main trace and 6 cycles for the off-trace path. The reason it has reduced further is precisely the duplication of code: we no longer have to jump into or out of the middle of the main trace. The only jump is here. The main trace occupies cycles 0, 1, 2, 3, 4, so 5 cycles, and at its end the instruction says if r1 < r6 go to I1, which is the loop-back part. In the middle there is a condition, if r2 != 0 go to I7, which takes us to the off-trace code when the condition holds; otherwise we simply continue, and everything has been scheduled in a very packed fashion. For the off-trace path we execute cycles 0, 1, 2, then jump, and then 3, 4, 5, so 6 cycles, whereas the main trace takes 5 cycles. The advantage of the super block is that it is even better than the ordinary trace, but it requires extra code because of the duplication.

The next technique in this family is hyper block scheduling. Super block scheduling does not work well with control-intensive programs that have many control flow paths, because with many paths the code duplication becomes excessive. Hyper block scheduling was proposed to handle such programs. The basic idea is to introduce guarded, or rather predicated, commands: the control flow graph goes through a process called if-conversion, which I will illustrate with an example very soon, to eliminate the conditional branches. If-conversion replaces conditional branches with appropriate predicated instructions, so the control dependence is turned into a data dependence.

As an example of if-conversion, consider a loop: for i = 1 to 100 do, if a[i] <= 0 then continue, otherwise perform a[i] = b[i] + 3. That is, if a[i] > 0 we do the assignment; otherwise we go to the next iteration of the loop. Because of this condition we would normally have a branch and, in a trace-based scheme, code duplication. Instead, we compute a[i] <= 0 as a predicate p, and if the hardware supports predicated instructions, a[i] = b[i] + 3 is executed as a predicated instruction: if the predicate is false the assignment executes, and if the predicate is true the loop simply continues. I think there is a minor mistake on the slide in the predicate computation and the guard: the predicate is a[i] <= 0, and a[i] = b[i] + 3 must execute when it is false, so where the guard is indicated as p it should have been not p.
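To make the if-conversion concrete, here is a small C sketch of the example just described. C has no predicated instructions, so the guarded assignment is modelled with a branch-free conditional expression; on a predicated ISA the compiler would emit a[i] = b[i] + 3 guarded directly by not p. The function name and the 0-based loop bounds are choices made for this sketch only.

```c
/* If-converted form of: if (a[i] <= 0) continue; else a[i] = b[i] + 3; */
void if_convert_example(int n, int a[], const int b[]) {
    for (int i = 0; i < n; i++) {
        int p = (a[i] <= 0);            /* predicate p                            */
        a[i] = p ? a[i] : b[i] + 3;     /* executed "under" !p, with no branch    */
    }
}
```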
Now, the other example has four statements. There is a for loop, for i = 1 to n, and inside it we have a[i] = d[i] + 1; if b[i] > 0 then c[i] = c[i] + a[i], else d[i+1] = d[i+1] + 1. So there is a block of code executed in the then-part and another block executed in the else-part. As before, we compute the predicate p = (b[i] > 0). If the predicate is true we must execute the first assignment, so the predicated instruction is c[i] = c[i] + a[i] under p, and if the predicate is false we execute the second statement, which is therefore predicated with not p: under not p we have d[i+1] = d[i+1] + 1. The semantics are as before: the first instruction executes only when the predicate is true, and the second executes only when the predicate is false. Of course, this requires extra support from the hardware.

Coming back to the example, we have one piece of code executed in the then-part, another executed in the else-part, and then the join executes one more instruction, so we have four basic blocks as before; this is the diagram we have already seen. The hyper block here is actually the entire control flow graph except for the back arc, and the whole set of predicated instructions requires six cycles. If you observe the schedule, integer units 1 and 2 execute the loads of a[r1] and b[r1], and then the instruction I2' computes the predicate. The predicate is the condition on b[i]: at the machine level the original branch was if r2 != 0, and the comparison of r2 with 0 is what we compute here as the predicate p1. Then b[r1] = r4 if p1 is a predicated instruction that executes when p1 is true, and b[r1] = r2 if not p1 executes when p1 is false. We have the other instructions as before, but also a second instruction, r4 = r2 if not p1. So there are two instructions under the not-p1 guard and one instruction under the p1 guard, and of course other instructions that are not predicated. In this case basic block B3, the else-part, had two instructions, which is why two instructions appear here under not p1. That is the hyper block scheduling example: basically we do an if-conversion, which means we compute the predicates and emit predicated instructions, and those predicated instructions are then scheduled as if they were ordinary instructions. If you observe the schedule, except for the last instruction there is no branch at all; everything is a predicated instruction.
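The two-sided branch of this example can be sketched the same way, again only as a source-level model of what the predicated machine code does; both arms become guarded assignments under p and not p, so the loop body is straight-line code. The function name and array bounds are assumptions of this sketch.

```c
/* If-converted form of: a[i] = d[i]+1; if (b[i] > 0) c[i] += a[i]; else d[i+1] += 1; */
void if_convert_two_sided(int n, int a[], const int b[], int c[], int d[]) {
    /* d is assumed to have at least n+1 elements, since d[i+1] is touched */
    for (int i = 0; i < n; i++) {
        a[i] = d[i] + 1;
        int p = (b[i] > 0);                              /* predicate p         */
        c[i]     = p  ? c[i] + a[i]  : c[i];             /* (p)  c[i] += a[i]   */
        d[i + 1] = !p ? d[i + 1] + 1 : d[i + 1];         /* (!p) d[i+1] += 1    */
    }
}
```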
Now we move on to the next global scheduling technique, called software pipelining. Here I cannot call it acyclic scheduling, because it is essentially meant for loops; it is a cyclic scheduling strategy, but it is still a global scheduling strategy because it looks at more than one basic block. Its most important aspect is that it overlaps the execution of instructions from multiple iterations of a loop: whereas instruction scheduling looks at exactly one iteration, software pipelining overlaps instructions from several iterations. I will give many examples of this very soon. It executes instructions from different iterations in the same pipeline, so that the pipelines are kept busy without stalls. Instruction scheduling already helps in cutting down stalls, but if we take instructions from different iterations there are even more opportunities to execute instructions without stalling. The objective is to sustain a high initiation rate; the initiation rate says how soon we can start the next iteration, and the initiation of a subsequent iteration may begin even before the previous iteration is complete. One iteration that has been started may still be going through its phases when the next iteration begins; that is the objective of software pipelining.

The other apparent way of achieving this is to unroll the loop several times and perform global scheduling on the unrolled loop. That is certainly much better than scheduling the loop without unrolling, but there is usually no overlap across iterations of the unrolled body, and even where there is, it occurs only near the borders of the unrolled loop. Software pipelining has generally been observed to provide much more speedup than this unroll-and-schedule approach. The technique is obviously more complex than instruction scheduling, and just like instruction scheduling it is an NP-complete problem, so there is no option but to use heuristics.

The basic idea involves finding what is known as an initiation interval for successive iterations. The initiation interval is the interval at which we initiate the iterations of the loop: if it is 1, we can start a new iteration in every cycle; if it is, say, 3, then after starting iteration i we can start iteration i+1 only three cycles later. How do we find the initiation interval? There is no shortcut; it is a trial-and-error procedure. We start with the minimum initiation interval, which can be computed using techniques we will not discuss here, then schedule the body of the loop using one of the approaches below and check whether the schedule length is within bounds. If yes, we stop; otherwise we try the next value of the initiation interval. Basically this requires a modular reservation table: a reservation table with II columns, where II is the initiation interval, and R rows, one row per resource. Instead of the global reservation table having as many columns as the length of the schedule, here we have only II columns. Schedule lengths depend on the initiation interval, the dependence distances between instructions, and resource contention; it is not just the precedence and resource constraints.
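As a rough sketch of the data structure involved, the C fragment below models a modular reservation table: an instruction placed at cycle t occupies column t mod II. Everything here, including the resource count and the function names, is hypothetical; a real modulo scheduler wraps this check in a search over candidate cycles and retries with a larger II when no legal placement exists.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_II 64
#define N_RES   4                  /* hypothetical number of resource classes */

typedef struct {
    int  ii;                       /* candidate initiation interval           */
    bool busy[N_RES][MAX_II];      /* busy[r][t % ii]                         */
} ModRT;

static void mrt_init(ModRT *m, int ii) {
    m->ii = ii;
    memset(m->busy, 0, sizeof m->busy);
}

/* Try to reserve resource r at absolute cycle t.  The table wraps modulo II,
 * so a conflict here means the instruction cannot be placed at cycle t for
 * this II; the scheduler must try another cycle, or eventually a larger II. */
static bool mrt_reserve(ModRT *m, int r, int t) {
    int slot = t % m->ii;
    if (m->busy[r][slot])
        return false;
    m->busy[r][slot] = true;
    return true;
}
```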
All of these considerations form part of the same package: computing an initiation interval and then checking whether the resulting schedule is within bounds is the only way to find a proper initiation interval. Let us take a simple example. We have a for loop, for (i = 1; i <= n; i++), with the statements a[i+1] = a[i] + 1; b[i] = a[i+1] / 2; c[i] = b[i] + 3; a[i] = c[i]. The dependence diagram for the instructions in this loop is shown here; these are the four instructions, numbered 1 to 4. The value a[i+1] is computed in the first instruction and used in the second, so there is a dependence from 1 to 2, and the same is true for b[i] and c[i], giving dependences from 2 to 3 and from 3 to 4. Further, the computation a[i+1] = a[i] + 1 uses the i-th value of a while computing the (i+1)-th value, so there is a dependence of this instruction on itself: it computes a value and uses it in the next iteration.

Coming to the labels on these arcs, the first component of each label is what is known as the dependence distance, and the second component is the familiar delay, the time required to execute the instruction. The dependence distance is simply the number of iterations between the definition and the use. Consider b[i]: it is computed in the second instruction and used, in the same iteration, by the next instruction, so the dependence distance from 2 to 3 is 0. Similarly, from 3 to 4 the distance is 0, and from 1 to 2 both references are to a[i+1], so the distance is 0 there as well. But the use of a[i] against the computation of a[i+1] crosses iterations: the value comes from the previous iteration and is used in the current one, so that dependence distance is 1. A dependence distance of 1 indicates that a value computed one iteration earlier is being used in the current iteration. This is our dependence diagram, and any schedule we produce must satisfy the dependences and the delays in it.

Let us see how to schedule these instructions. This is the timeline, and these are the iterations; the instructions are S1, S2, S3 and S4. In time slot 1 we start S1, with no issues; in time slot 2, S2 executes; in time slot 3, S3; and in time slot 4, S4. All of these belong to the same iteration. Now the question is: given enough resources, is it possible to start iteration 2 in time slot 2 itself, that is, can the first instruction of iteration 2 start concurrently with the second instruction of iteration 1? The answer is yes: S2 here belongs to iteration 1 and S1 to iteration 2, and there is no dependence between them that prevents the two from executing together.
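Before continuing with the schedule, here are the first two statements of the loop written out in C with their dependence distances marked; the array sizes are assumptions of this sketch.

```c
/* a must have at least n+2 elements, b at least n+1 */
void dependence_distances(int n, int a[], int b[]) {
    for (int i = 1; i <= n; i++) {
        a[i + 1] = a[i] + 1;       /* S1: reads a[i], written by S1 one iteration earlier -> distance 1 (self arc) */
        b[i]     = a[i + 1] / 2;   /* S2: reads a[i+1], written by S1 in this same iteration -> distance 0         */
    }
}
```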
Coming back to the schedule: we could not have started S1 of the second iteration in time slot 1, and that is because S1 has a self-loop with a dependence distance of 1. In other words, the value computed by S1 in iteration 1 is required by S1 in the next iteration, and S1 takes one cycle to complete, so S1 of the second iteration can start only in the second time slot. That should be clear, and we are assuming there are no resource constraints. Iteration 1 continues on the hardware, and iteration 2 proceeds alongside it: in time step 3 we have S2 of iteration 2, in time step 4 its S3, and in time step 5 its S4. Now the same question can be asked again: in time step 3, can we start another iteration concurrently with S3 of iteration 1 and S2 of iteration 2? The answer, again, is yes, if there are enough resources. As I said, we could not have started that iteration's S1 any earlier because of the dependence distance of 1, and of course we could not execute S1, S2, S3 and S4 of one iteration in parallel because of the dependences among them; that chain is strictly sequential. We have to wait one cycle before starting each new S1, but once started, the iteration can proceed. The same holds for time step 4, where we start yet another S1; from then on we are in a stable, steady-state situation.

In time step 5, iteration 1 has completed, since its last instruction has executed; what remains is S4 of iteration 2, S3 of iteration 3, S2 of iteration 4 and S1 of iteration 5. This situation continues as each iteration completes in turn. If you observe the steady state, at any point in time there are exactly 4 instructions executing, from the most recently started iteration down to the oldest one still in flight, and each of these instructions belongs to a different iteration. This pattern S4, S3, S2, S1 is the software pipeline we are trying to understand. Here the initiation interval is 1, because we have been able to initiate a new iteration in every cycle. In fact the software pipeline consists of just these 4 instructions, and assuming there are enough resources for all of them, all 4 can execute in parallel. That is the concept of a software pipeline.

Let us go further and take another example. Again the example is quite simple, a[i] = s * a[i], and here is the machine code corresponding to it: we have a load, then a multiply, then a store instruction, and the remaining instructions correspond to the loop increment and so on. We check the loop condition and branch back to I0 if the loop is not yet complete; we keep iterating and then fall out. The dependence graph for this small program is shown here. As usual the dependences are shown by arcs, and the dots on the arcs are tokens: the number of tokens on an arc indicates the dependence distance. Here I3 supplies a value to I0, and the single token indicates that the value computed in this iteration is used by I0 in the next iteration.
So that is a dependence distance of 1. I3 also supplies a value to itself, again with a dependence distance of 1: I3 is t0 = t0 + 4, so the value from the previous iteration is used to compute the new value, giving a self-dependence with distance 1. Similarly, there is a dependence from I3 to I2 with a dependence distance of 1: the address t0 used here is computed in iteration i and used in iteration i+1. This is how the dependences are to be understood.

Let us see how to schedule these instructions. The number of tokens indicates the dependence distance, as already explained. Assume that the possible dependence from I2 to I0 can be disambiguated: I0 loads through t4 and I2 stores through t4, and we assume, just for the sake of the example, that the compiler can prove they do not conflict, which is why no dependence is shown between the two. Assume there are two integer units with a latency of one cycle, two floating-point units with a latency of two cycles, and one load/store unit, with the load taking two cycles and the store one cycle; branches can be executed by the integer units. The acyclic schedule then takes 5 cycles. Here is the picture: I0 issues first, then there is a no-op because of the load latency, then I1, I3 and I4 issue together, then another no-op, and finally I2 and I5 issue in the last slot, 5 cycles in all. With software pipelining, the steady state requires only two cycles; before it there are instructions that fill the pipeline, and after it instructions that empty the pipeline.

Let me show that on the earlier picture. Suppose the entire loop completes in 10 time units. The initial instructions are the ones required to fill the pipeline; the pipeline is then full and stays in that state until, near the end of the loop, it can no longer be sustained and the remaining iterations are finished off in the epilogue. So this part is the prologue, this is the epilogue, and in between is the kernel, the steady state of the pipeline.

Now let me show how the kernel requires only two cycles instead of the 5 cycles of the acyclic basic block schedule. This is iteration 0, iteration 1 and iteration 2, and these are the time steps of the pipelined execution. Iteration 0 has its usual schedule: a load, then a no-op, then the multiply, add and sub, then the store and the branch-greater-than-or-equal. At time step 2 we can already initiate I0 of iteration 1, so that iteration starts executing in the pipeline, and its instructions execute in parallel with those of iteration 0 wherever the slots line up. At time step 4 we can initiate iteration 2, which again has the same sequence of instructions following it, and if you observe, at this point we have reached the steady state.
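Software pipelining is performed on machine instructions, but the prologue / kernel / epilogue shape can be caricatured at the source level. The sketch below splits each iteration of a[i] = s * a[i] into a load stage and a multiply-store stage and overlaps adjacent iterations; it is only an illustration of the structure, not the schedule derived in the lecture, and the function name and float element type are assumptions of this sketch.

```c
void pipelined_scale(int n, float a[], float s) {
    if (n <= 0) return;
    float t = a[0];                  /* prologue: stage 1 (load) of iteration 0              */
    for (int i = 0; i < n - 1; i++) {
        float next = a[i + 1];       /* kernel: stage 1 (load) of iteration i+1 ...          */
        a[i] = s * t;                /* ... overlapped with stage 2 (multiply-store) of i    */
        t = next;
    }
    a[n - 1] = s * t;                /* epilogue: drain stage 2 of the final iteration       */
}
```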
Returning to the schedule on the slide: if iteration 3 were shown as well, its instructions would follow here with the same pattern, namely sub, mul and load in one cycle, and store, bge and add in the other. These two cycles form the kernel, the steady state of the pipeline. We iterate over this steady state for as long as the loop executes, and finally the epilogue executes the remaining parts of the last iterations. That is what was shown earlier: it takes cycles 0 through 3 to fill the pipeline, which is the prologue; from cycle 4 onwards the steady state executes many times; and then an epilogue of a couple of cycles flushes the pipeline. This is how a software-pipelined loop executes: in the steady state it is executing instructions from several different iterations, and these are the various pipeline stages, the first, the second and the third.

Let us take another example. Here we have six instructions, numbered 0 to 5, and this is their dependence diagram; there is a cycle in it as well. If we unroll and start scheduling the instructions according to the software pipeline, we can execute 0, 1 and 2 in one cycle, 3 and 4 in the next cycle, and 5 in the cycle after that, as dictated by the dependences, ignoring the back edge for the moment. Because of that back edge, we can begin the next iteration only concurrently with instruction 5, and that is exactly what we do: for i = 2 we have the same pattern of 0, 1, 2, then 3, 4, then 5, and for i = 3 again 0, 1, 2, 3, 4 and 5. This turns out to be the steady state of our software pipeline, in which 3 and 4 execute in one time step and 5, 0, 1 and 2 all execute in the next. This is driven by the resource constraints as well: we have 2 multipliers and 2 adders in a single cluster with single-cycle operations, which is what makes the instructions execute in this fashion. So this is the software-pipelined loop executing in its steady state, and of course some prologue and epilogue instructions are shown here as well. That brings us to the end of instruction scheduling and software pipelining.

Now we move on to the next topic, which is very important, called automatic parallelization. Why do we require automatic parallelization, and what is the process? Automatic parallelization is the automatic conversion of sequential programs to parallel programs by a compiler. In other words, the programmer does not write a parallel program; the programmer writes a sequential program, and an automatically parallelizing compiler converts it into a parallel program. That is the purpose of automatic parallelization. The target machine may be a vector processor, in which case the process is called vectorization; it could be a multicore processor, in which case it is called concurrentization; or it could be a cluster of loosely coupled distributed-memory processors, in which case it is called parallelization. We use parallelization and concurrentization with more or less the same meaning and do not differentiate much between them, but vectorization is definitely a different process. We are going to see examples of both vectorization and parallelization.
Why is vectorization relevant at all? Even single-core processors of the x86 variety have a small set of vector instructions for multimedia operations, and if we can perform some vectorization, even those can be exploited and the efficiency of the program will increase. The parallelism extraction process is normally a source-to-source transformation: if we take C or Fortran code, the output is also C or Fortran. It is not as if we go all the way through intermediate code generation and only then extract parallelism; in fact, some of the parallelism may not be easily visible at the lower levels, which is why we want to perform parallelism extraction at the source level itself. It requires a technique called dependence analysis to determine the dependences between statements. Informally, I have already shown you many dependence diagrams in the instruction scheduling and software pipelining parts, but we have still not learnt how to determine the dependences; this is a fairly complicated process, and I am going to give you only a flavour of it in this lecture. Implementing the available parallelism is also a challenge: if you have a multicore processor with, say, 8 cores, it is easy to see that 8 iterations can run in parallel on it, but suppose we have a doubly nested loop; is it possible to run both the outer loop and the inner loop in parallel? That is not so easy. We know how to run single loops in parallel, but running nested loops in parallel is a much harder task because of resource constraints.

Let us look at some examples. Here is a very simple loop: for i = 1 to 100, x[i] = x[i] + y[i]. If the machine has vector instructions, this code can be very easily converted to vector code, which reads: the vector x(1:100) is assigned x(1:100) plus y(1:100). Assuming vectors of length 100, we read x(1:100) into a set of vector registers, read y(1:100) similarly, and add them concurrently, so all the elements are added in parallel; in the next cycle the result can be stored back into x. If the instruction permits adding directly into the same register, there is no need to write the intermediate result back to memory separately. The important point is that the vectors x and y are fetched first: both are read first, and only then does the computation and the overwriting happen. There is overwriting, but the values from one iteration are not used in the next, so there is no dependence from one iteration to the next; that is what matters. If we want to run the same code on a multicore processor we can do that too: assume we can start 100 threads, then each iteration becomes an independent thread, and for each value of i from 1 to 100 a separate thread does the addition x[i] = x[i] + y[i]. None of these threads interfere with each other; each iteration is different and each thread does the work of just that iteration.
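As one concrete way of expressing what was just described, the two functions below show a vectorized form and a multithreaded form of the same loop using OpenMP pragmas. This is only a sketch: an automatically parallelizing compiler would derive equivalent code from the plain sequential loop, and the function names and element type are choices made here.

```c
void add_vectors_simd(int n, float x[], const float y[]) {
    #pragma omp simd                  /* vector form: x(1:n) = x(1:n) + y(1:n)               */
    for (int i = 0; i < n; i++)
        x[i] = x[i] + y[i];
}

void add_vectors_threads(int n, float x[], const float y[]) {
    #pragma omp parallel for          /* concurrent form: independent iterations on threads  */
    for (int i = 0; i < n; i++)
        x[i] = x[i] + y[i];
}
```

Without OpenMP enabled the pragmas are simply ignored and the loops run sequentially, which is consistent with the point that the sequential and parallel versions compute the same result here.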
So this can be run in parallel on a multicore processor as well. Now suppose there is a change in the code: x[i+1] is assigned x[i] + y[i]. This cannot be converted to the vector code x(2:101) = x(1:100) + y(1:100), even though the subscript has merely changed to i+1. In the first iteration we compute x[2] using x[1] and y[1], so the vector statement looks plausible, but it would be incorrect. The reason is that there is a dependence here, which becomes clear when we expand the loop: the first iteration is x[2] = x[1] + y[1], the second is x[3] = x[2] + y[2], the third is x[4] = x[3] + y[3], and so on. Observe that what is computed in the first iteration is used in the second, what is computed in the second is used in the third, and so on: values flow from each iteration to the next. The vector statement does not respect this dependence; it says read the whole vector x(1:100), read y(1:100), add them, and put the result into x(2:101). In other words, the value computed in a particular iteration is not used in the next one; the old values are used instead, and therefore this vector assignment is incorrect.

Let us do a brief recap of the data dependence relations. If we have an assignment to a scalar variable x in statement S1 and then a read of x in statement S2, with no other assignments to x in between, it is a flow dependence, or true dependence, from S1 to S2. If we have a read of x in S1 and then a write into x in S2, with no other writes to x in between, it is an anti dependence from S1 to S2. Output dependence is similar: two writes to x with no other write in between give an output dependence from S1 to S2.

We also have to understand the notion of a data dependence direction vector. We know what the dependence relations are; we augment this information with the direction of the dependence, which gives a vector called the direction vector. I will give examples to show what this is. There is one direction vector component for each loop in a nest of loops: if there are three nested loops, an outer loop, an inner loop, and a third loop inside the second, then the direction vector has three components, one for each loop. The direction vector is written Psi = (psi_1, psi_2, ..., psi_d), where d is the depth of nesting, so there is one component per level of nesting. Each psi_k can be one of <, =, >, <=, >=, != or *. Of these, the primary direction vector components are <, = and >: <= is a combination of < and =, >= is a combination of > and =, != means either < or >, and * may be any of the three. So the last four are basically combinations of the first three; we must understand <, = and > in detail, and the rest follow automatically.
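Before looking at the directions in detail, here are small scalar illustrations of the three dependence relations recalled above; the statement labels S1 to S6 and the function signature are added only for this sketch.

```c
void dependence_kinds(int a, int b, int z) {
    int x, y;

    /* flow (true) dependence: S1 writes x, S2 then reads it          */
    x = a + b;          /* S1 */
    y = x * 2;          /* S2 */

    /* anti dependence: S3 reads z, S4 then overwrites it             */
    y = y + z;          /* S3 */
    z = a - b;          /* S4 */

    /* output dependence: S5 and S6 both write x, no write in between */
    x = a;              /* S5 */
    x = b;              /* S6 */

    (void)x; (void)y; (void)z;   /* silence unused-value warnings in this sketch */
}
```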
There are three types of directions possible, and that is what is shown here: <, = and >. The < direction is called the forward direction, which means the dependence goes from iteration i to iteration i+k: we compute a value in iteration i and use it in iteration i+k, with k positive. Even if the loop index runs backwards, the iterations themselves always proceed forward; if we number the iterations 1, 2, 3 and so on, the dependence goes from i to i+k, with k always positive. The second is the backward direction, >, which means we compute a value in iteration i and use it in iteration i-k. That looks impossible, and indeed it cannot happen in single loops, but with two or more levels of nesting it definitely can, and I will give some examples later. The third is the = direction, which means the dependence is within the same iteration: the value is computed in iteration i and used in iteration i.

Let us understand < and = with single loops; the > direction can be understood only with doubly nested loops. Take a loop for j = 1 to 100 with the statement x[j] = x[j] + c. Expanding it twice gives x[1] = x[1] + c and x[2] = x[2] + c. Here x[1] is read and then written in the same statement, and it is not used again in any other iteration, so we are using and then computing: that is an anti dependence, written with delta-bar. Since the read and the write happen in the same iteration (iteration 1 here, iteration 2 there, and so on), the direction vector component is =, indicating that the value is used and then overwritten in the same iteration. This is a single loop, so there is only one direction vector component.

Next, for j = 1 to 99, x[j+1] = x[j] + c. Unrolling gives x[2] = x[1] + c and x[3] = x[2] + c. We have produced the value x[2] in one iteration and used it in the next, so there is a flow dependence between the two, and since we produce in iteration i and use in iteration i+1, the direction is <. We write it as S δ< S: the value computed by statement S in some iteration i is used by S in a later iteration, here i+1.

The third example: for j = 1 to 99 do x[j] = x[j+1] + c. Unrolling gives x[1] = x[2] + c and x[2] = x[3] + c. We have used x[2] in iteration 1 and then computed it in iteration 2, so there is an anti dependence between the two, with direction <, because the use happens first and the computation happens in a later iteration.
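The three single-loop cases just discussed can be collected into one C fragment, with the dependence and direction noted against each loop; the array bounds and 1-based indexing are assumptions of this sketch.

```c
/* x is assumed to have at least n+2 elements (1-based use, as in the lecture) */
void direction_examples(int n, float x[], float c) {
    for (int j = 1; j <= n; j++)       /* x[j] read and then written in the SAME iteration:         */
        x[j] = x[j] + c;               /*   anti dependence, direction (=)                          */

    for (int j = 1; j <= n - 1; j++)   /* x[j+1] written in iteration j, read in iteration j+1:     */
        x[j + 1] = x[j] + c;           /*   flow dependence, direction (<)                          */

    for (int j = 1; j <= n - 1; j++)   /* x[j+1] read in iteration j, overwritten in iteration j+1: */
        x[j] = x[j + 1] + c;           /*   anti dependence, direction (<)                          */
}
```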
To give an example of a loop that runs backwards: for j = 99 down to 1, x[j] = x[j+1] + c. Unrolling gives x[99] = x[100] + c, then x[98] = x[99] + c, and so on. The iterations still go forward even though the increment is negative: numbering the iterations 1, 2, 3 and so on, we compute a value in one iteration and use it in the next. So even with a negative increment, since the iteration numbers increase, we have a flow dependence with direction <, written S δ< S: compute in a particular iteration, use it in a later iteration. A final example: for j = 2 to 101, x[j] = x[j-1] + c. This unrolls to x[2] = x[1] + c and x[3] = x[2] + c, so again there is a flow dependence in the forward direction, S δ< S. The idea of all these examples is to make you familiar with the usual kinds of subscripts that appear in loops that can be automatically parallelized. We will stop here and continue with the rest of the parallelization discussion in the next part of the lecture. Thank you.