Welcome to part 2 of the lecture on instruction scheduling and software pipelining. Last time we discussed one part of simple basic block scheduling; we will continue with that today and then go on to trace, superblock, and hyperblock scheduling. To recap: a basic block consists of micro-operation sequences (MOS). These are the instructions of the machine, and each MOS is indivisible, meaning the micro-operations which constitute the MOS cannot be scheduled separately. Each MOS has several steps, and each of these steps requires one cycle for execution. Different instructions may have different numbers of micro-operations within their MOS, and therefore may require different numbers of cycles and different resources. So we have two constraints for basic block scheduling: the precedence constraint and the resource constraint. The precedence constraint relates to data dependences and execution delays, whereas the resource constraint relates to the availability of shared resources. Here is the formal description of the basic block scheduling problem. The basic block is modeled as a directed acyclic graph (DAG): the nodes are the MOS, that is, the instructions; the edges are the precedence constraints; and the label on each node gives the resource usage of that MOS, for every one of its micro-operations. We also have the length of the node, which is nothing but the length of its resource vector. The problem is to find the shortest schedule sigma, a mapping from the nodes to natural numbers (the time line), such that both the precedence constraints and the resource constraints are met. The precedence constraints can be shown diagrammatically as follows.
Here, u is the node which has been scheduled already, v is the node to be scheduled, and the delay on the edge (u, v) is d. The constraint simply says that v cannot be scheduled before d + sigma(u) steps are completed; this is quite clear, because instruction u takes d cycles to complete. Similarly, for the resource constraint, consider the time line on which the nodes are scheduled: v1 is scheduled at 0, v2 at 1, v3 at 2, v4 at 3, and so on. Each step of an MOS requires one cycle, so the resource constraint simply says: add up the resource requirements along the diagonal, and the sum should be less than or equal to the number of resources available. Here only one resource type is shown, with 5 units available, and there is a clear violation at one step: 3 + 3 + 2 = 8. That is because v1 is still active in its sub-step 2, v2 is active in its sub-step 1, and v3 is active in its sub-step 0, and the resource requirements of these sub-steps are 3, 3, and 2 respectively, which add up to 8. So this schedule is not feasible. At another step, 2 + 2 + 1 is fine and the resource constraints are satisfied, but since the constraints are violated at the earlier step, the schedule cannot be used. The algorithm for list scheduling is quite straightforward: it is a topological-sort-based algorithm. We pick the root nodes of the directed acyclic graph as the starting point and put them into a queue called the ready queue, and we keep going as long as the ready queue is not empty. We pick the highest-priority node in the queue, find the lowest time slot in which the precedence constraints are satisfied, and then, from that lower bound onwards, find the slot in which the resource constraints are satisfied as well, as I explained before.
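The diagonal resource check just described can be sketched in a few lines. This is a minimal illustration, not the lecture's exact data: the per-sub-step resource vectors below are assumptions chosen to reproduce the 3 + 3 + 2 = 8 violation.

```python
class Node:
    def __init__(self, name, resource):
        self.name = name
        self.resource = resource  # demand of one resource type, per sub-step

def resource_usage_at(schedule, cycle):
    """Sum, over all scheduled nodes, the demand of the sub-step
    each still-active node is executing at the given cycle."""
    total = 0
    for node, start in schedule.items():
        step = cycle - start                 # which sub-step of the MOS
        if 0 <= step < len(node.resource):   # node still active at `cycle`
            total += node.resource[step]
    return total

# v1, v2, v3 started at cycles 0, 1, 2; at cycle 2 their active sub-steps
# demand 3, 3, and 2 units respectively (illustrative vectors).
v1 = Node("v1", [2, 1, 3])
v2 = Node("v2", [1, 3])
v3 = Node("v3", [2, 2])
schedule = {v1: 0, v2: 1, v3: 2}
print(resource_usage_at(schedule, 2))  # -> 8, exceeding the 5 available
```

A feasibility check then simply compares this sum against the number of available units at every cycle.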
That slot becomes the schedule for node v. Now v goes into the set of scheduled nodes, and the ready list is updated by removing v and adding those successors u of scheduled nodes such that u is not already scheduled and every predecessor w of u is scheduled. These are the nodes ready to be placed into the ready queue. The ready-queue update is very simple: once v has been scheduled, u1, u2, and u3 are ready to be put into the ready queue, because all the predecessors of u1, u2, and u3 have already been scheduled, whereas x is not added because one of its predecessors is not yet scheduled. These are the two functions I explained, corresponding to the satisfaction of the two constraints, precedence and resource. Precedence constraint satisfaction simply finds a slot for v as the maximum of sigma(u1) + 2, sigma(u2) + 4, and sigma(u3) + 3, which in this case happens to be 29. That is the earliest time at which v can be scheduled, and it is the lower bound returned by the function. As far as resource constraint satisfaction is concerned, we check at every time slot from that lower bound onwards whether the requirements of resources of the various kinds, for the various sub-steps, stay within the limit. For example, if we schedule this node in the first candidate slot, as we have already seen, we exceed the number of resources available, so that slot is left free; the same is true for the next slot. If we place it in slot 3 we get 3, then 1 and 2, which is 6 and exceeds 5, so slot 3 is also kept vacant, and the node can finally be scheduled in slots 4 and 5. The vacant slots become NOP slots. The last issue we need to look at before the examples is how we order the ready queue by priority: what is our priority ordering function?
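The list scheduling loop described above can be sketched as follows. This is a simplified model, not the lecture's full algorithm: it assumes a single resource type, that each instruction occupies one resource unit for one cycle, and that the priority function is supplied by the caller; the tiny three-node DAG at the end is illustrative.

```python
def list_schedule(nodes, preds, delay, priority, num_resources):
    """Greedy list scheduling: repeatedly pick the highest-priority ready
    node and place it in the earliest slot satisfying both the precedence
    and the resource constraints."""
    sigma = {}                                  # node -> scheduled cycle
    usage = {}                                  # cycle -> resources in use
    ready = [n for n in nodes if not preds[n]]  # roots of the DAG
    while ready:
        v = max(ready, key=priority)            # highest-priority ready node
        ready.remove(v)
        # precedence: earliest cycle after all predecessors complete
        lb = max((sigma[u] + delay[(u, v)] for u in preds[v]), default=0)
        # resource: first cycle at or after lb with a free unit
        t = lb
        while usage.get(t, 0) >= num_resources:
            t += 1
        sigma[v] = t
        usage[t] = usage.get(t, 0) + 1
        # successors whose predecessors are now all scheduled become ready
        for w in nodes:
            if w not in sigma and w not in ready and preds[w] and \
               all(u in sigma for u in preds[w]):
                ready.append(w)
    return sigma

# Illustrative DAG: c depends on a (delay 2) and b (delay 1); one resource.
nodes = ["a", "b", "c"]
preds = {"a": [], "b": [], "c": ["a", "b"]}
delay = {("a", "c"): 2, ("b", "c"): 1}
height = {"a": 2, "b": 1, "c": 0}               # path-length priority
sigma = list_schedule(nodes, preds, delay, lambda n: height[n], 1)
print(sigma)   # -> {'a': 0, 'b': 1, 'c': 2}
```

With one resource, b is pushed to cycle 1, and c starts at cycle 2, the point where both the precedence lower bound and a free unit coincide.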
One possible priority ordering function is the height of the node in the directed acyclic graph, that is, the longest path from the node to a terminal node. Let me give an example to explain it. Consider this directed acyclic graph. The legend says: the blue number to the left is the path length we need to compute, the number inside the circle is the node number, the red number to the right of the circle is the execution time, and the label on the edge is the latency of the instruction. This is a very general model: it allows an execution time to be attached to the node and a latency to be attached to the edge. One example of why this may become important is the load instruction. If there are many load units available, then after one of the loads is started, which requires one cycle of execution time, we can schedule other loads on the other load units; but the load which was started may require several more cycles before its result becomes available. So the load execution time is one whereas the load latency is two: that is an example of an instruction which has both an execution time and a latency. The path length of a node n is its execution time if n is a leaf. In this example the two leaves have execution times two and one, so their path lengths are initialized to two and one respectively. Otherwise, the path length is the maximum, over the successors, of the edge latency added to the path length of the target node. For example, having computed the path lengths of the two leaves as one and two, we now consider a node whose two successors have already been processed; this is the node for which we want to compute the path length.
What we do is: we take the latency on the edge to the first successor, which is given as two, and add the path length of that successor, which is one; so two plus one is the path length as computed along that path. Along the other path we have a latency of zero and a path length of two, giving two plus zero equal to two. The maximum of these two is three, so that is the path length of node four. Continuing in the same way: for the next node the edge latency is zero and the successor's path length is three, so three plus zero equals three is noted as its path length; for another node the latency is two and the path length is three, so we note five; and for yet another node it is three plus one, that is, four. This is how we compute the path lengths; we will see another example a little later. The second possibility is to use the earliest start and latest start times as the priority ordering values. E-start is the earliest time at which a node can be scheduled, and L-start is the latest time at which it can be scheduled. Violating E-start and L-start may result in pipeline stalls, so we may have to introduce NOPs in that case. Every node can be scheduled between its E-start and its L-start: that is the idea. How do we compute E-start? E-start(v) is the maximum over the predecessors u_i of E-start(u_i) + d(u_i, v), where d is the delay and u_1 to u_k are the predecessors of v; the E-start of the source node is 0. Let me show an example. We want to compute the E-start of v, the earliest time at which it can be started. We know the E-start values of the predecessors, so we add the delays and take the maximum: 25 + 4, 45 + 7, and 16 + 2, and the maximum comes to 52.
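The path-length rule just stated can be sketched as a simple recursion over the DAG. The node numbers, latencies, and execution times below are an illustrative reconstruction of the lecture's small example, not its exact figure.

```python
# succs maps node -> list of (successor, edge latency);
# exec_time is used as the path length of a leaf.
succs = {4: [(5, 2), (6, 0)], 5: [], 6: []}
exec_time = {4: 1, 5: 1, 6: 2}

def path_length(n):
    """Longest latency-weighted path from n down to a terminal node."""
    if not succs[n]:                     # leaf: its own execution time
        return exec_time[n]
    return max(lat + path_length(s) for s, lat in succs[n])

print(path_length(4))   # max(2 + 1, 0 + 2) -> 3
```

For larger DAGs one would memoize the recursion or compute heights in reverse topological order, but the rule is the same.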
In a similar way, the latest start time is computed as L-start(v) = min over the successors v_i of L-start(v_i) − d(v, v_i), where v_1 to v_k are the successors; the L-start of the sink node is set to the E-start of the sink node itself. Again taking this example, we want to compute the L-start of v. We know the L-start values of the successors w_1, w_2, and w_3, so we compute 12 − 2, 36 − 1, and 21 − 3, and the minimum comes to 10, as written here. So basically we work backwards from the sink to compute L-start, whereas for E-start we start from the top and go towards the sink. E-start and L-start values can therefore be computed comfortably using a top-down pass and a bottom-up pass: we compute E-start from the top down to the sink node, initialize the L-start of the sink node to its E-start value, and work backwards to compute the remaining L-start values. This can be done either once before the scheduling begins, or dynamically during the scheduling itself. The difference between the two is quite important: if we do it before the scheduling begins, we cannot alter the priority of a node during the scheduling process, whereas if we do it during the scheduling process, the priority of a node can be altered as the schedule evolves. It is also possible to use a slack value as another priority measure. Before we look at slack: we could also use the E-start or L-start value directly as the priority, where the lower the E-start value, the higher the priority, and likewise the lower the L-start value, the higher the priority. In the case of slack, which is nothing but L-start minus E-start, the nodes with lower slack are given higher priority; instructions on the critical path have a slack value of 0, and therefore they automatically get the highest priority.
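The two passes can be sketched as follows, assuming the nodes are given in topological order. The four-node diamond DAG and its delays are illustrative, not taken from the lecture's figure.

```python
# Nodes in topological order; edges given as (u, v) -> delay.
nodes = ["s", "a", "b", "t"]
delay = {("s", "a"): 2, ("s", "b"): 3, ("a", "t"): 1, ("b", "t"): 2}

preds = {v: [u for (u, w) in delay if w == v] for v in nodes}
succs = {u: [w for (x, w) in delay if x == u] for u in nodes}

# Top-down pass: earliest start times (source gets 0).
estart = {}
for v in nodes:
    estart[v] = max((estart[u] + delay[(u, v)] for u in preds[v]), default=0)

# Bottom-up pass: latest start times; the sink's L-start is its E-start.
lstart = {}
for v in reversed(nodes):
    lstart[v] = min((lstart[w] - delay[(v, w)] for w in succs[v]),
                    default=estart[v])

slack = {v: lstart[v] - estart[v] for v in nodes}
print(estart)   # {'s': 0, 'a': 2, 'b': 3, 't': 5}
print(slack)    # {'s': 0, 'a': 2, 'b': 0, 't': 0}
```

Here the critical path s → b → t comes out with slack 0, so those nodes would be scheduled with the highest priority, while node a has two cycles of freedom.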
Let us look at an example of scheduling with the path length first, and then one using slack as the second example. We have already computed the path lengths for this example a few minutes ago; now let us try to schedule using them. To begin with, nodes 1 and 3 are the roots of the DAG, so they are put into the ready queue. Let us assume that the resource constraints are always satisfied, in other words that there are enough resources in the system, so we do not even have to check the resource constraint at this point; we will see that in the next example. Between 1 and 3 we need to pick one node to be scheduled in the first time slot. To do that we look at the path lengths: node 1 has 4 and node 3 has 5, and the instruction with the higher path length is picked first, so node 3 gets scheduled in the first time slot. Now the ready queue contains only node 1: it cannot contain node 4 at this time, even though 4 is a successor of 3, because 4 has a predecessor 2 which is not yet scheduled, and we can only put nodes whose predecessors are all scheduled into the ready list. After we schedule 3 we are forced to schedule 1, there being no other option, and after we schedule 1 we can schedule 2. The reason is that once node 3 is started, node 1 is independent of it, so it can be scheduled in the second time slot, and since node 1 executes with just 1 cycle of delay, node 2 can be scheduled in the third time slot. Node 2 requires no delay slots, whereas node 3 requires 2 slots to complete; since 3 was scheduled in slot 1, slot 4 is available for instruction 4, and both 3 and 2 will have completed by that time.
So we can schedule node 4 in slot 4, as we have done here. Now there are two possibilities for slot 5: we can schedule either 5 or 6. But node 5 requires 2 cycles and has a path length of 1, whereas node 6 requires 0 cycles and has a path length of 2. Since the path length indicates that node 6 should be scheduled first, we put it into slot 5, and by slot 6 node 4 will have completed, so node 5 can be scheduled there. The difference between the slots of 4 and 5 is two time slots, which is sufficient for instruction 4 to complete. We could not have placed 5 into slot 5, because the delay involved in completing 4 is 2 cycles; had we tried to place 5 first, we would have had to place a NOP in cycle 5 before placing 5, which would have been an inefficient schedule. Our heuristic of using the path length takes care of it and schedules node 6 in slot 5. So this is a packed schedule with no NOPs in between, assuming there are no resource constraints. This is how we produce the schedule, and this is how we use the path length to break ties; there was a tie here and another one here. Now we go on to the second example, which we had studied before. Here the latencies of the add, sub, and store instructions are 1 cycle each, the load instruction has a latency of 2 cycles, and the multiply instruction has a latency of 3 cycles. This is the DAG with all the extra anti- and output dependences already marked. Looking at the instruction sequence i1 to i9, we are not sure whether we can actually schedule everything without any NOP; the DAG alone does not tell us that. Now the E-start and L-start values can be computed in a top-down pass and a bottom-up pass quite easily.
We start with the E-start value of the root as 0; then, using that simple computation, we assign E-start values to its two successors. The next two nodes cannot be assigned E-start values immediately; only once the values of all their predecessors are assigned can we compute their E-start values as well. The E-start value is indicated as the first component of the parenthesis. Once we complete the computation of the E-start values down to the sink, we set the L-start value of the sink node to its E-start value and work backwards to produce the L-start values. This is a fairly straightforward pass, so I will not spend too much time explaining how it is computed. The number to the left of the parenthesis is the path length, which is again easy to compute: we start with the path length at the sink and work backwards. The number to the right of the parenthesis is the slack, which is nothing but L-start minus E-start. Now let me show you how the scheduling can happen. Let us assume that we have 2 integer units and 1 multiplication unit, and that all 3 units are capable of handling load and store instructions. Whichever heuristic is used, the height of the node or the slack value, the schedule comes out the same. To begin with, in this diagram both I1 and I2 are eligible to be scheduled. One of them can be scheduled on integer unit 1 and the other on integer unit 2, so there is no need to wait: both loads are scheduled in the same cycle, cycle 0. In cycle 1 we cannot schedule anything, the reason being that t1 is needed in I4 and also in I3, so both these instructions require the value of t1, which the load has not yet delivered.
t1 + 4 and t1 − 2 both need t1, so the cycle after the loads has to be left vacant: we have to introduce a NOP there, there is no other option. By the time we arrive at cycle 2, both loads have completed, so I3, I4, and I5 are all possibilities for scheduling, and they enter the ready queue. We choose to schedule I3 and I4 because they have higher priority; they are scheduled on integer units int1 and int2 in cycle 2, and we have still not used the multiply unit. There is no need to make them sequential. In cycle 3 we have I5 and I6 available to us. I5 was available in cycle 2 as well, but there was no resource available: the 2 integer units were already taken, and we would have required another integer unit to perform t5 = t2 + 3. Therefore we chose I3 and I4 by priority; since the priorities are the same here, the choice does not really change the outcome. Once that is done, I5 can be scheduled in the next slot, and I6 was also ready to be scheduled in this slot. So we schedule I5 and I6: I5 goes on an integer unit, and I6, being a multiply, goes on the multiply unit. I7 is not yet ready because it depends on the multiply. Until the multiplication is complete we really cannot do anything, so we have to wait; there is no other instruction we can schedule. Multiply takes 3 cycles, so we just have to wait for 2 more cycles, and then in cycle 6 we are ready to schedule I7 with no conflict.
So we schedule I7; in the next cycle I8 is ready, so we schedule I8, again with no conflict; and I9 is scheduled in cycle 8, again on one of the integer units. This is the way we work out the schedule: starting from cycle 0, we proceed on a cycle-by-cycle basis, look at the instructions in the ready queue, pick those with the highest priority, and schedule them. What you must observe here is that because the number of function units is more than 1, we have scheduled more than one instruction per cycle: we pick instructions one at a time based on priority and place them on the available function units, and we do not have to increment the cycle number as long as resources are not a constraint. Now, how do we provide the instruction resource requirements as an input to the scheduling algorithm? This is provided in the form of instruction reservation tables. This is a very simple table, one for each instruction. Let us assume that the number of resources in the machine is 5, that is, r0 through r4, and that instructions require a maximum of 4 cycles. In each slot of the table we mark the number of units of each resource required in that time slot. This instruction requires 4 time slots to complete: in the first time slot it requires one unit each of r0, r2, and r3; in the next cycle it requires r0, r1, and r4; in t2 it requires only r3 and r4; and in t3 it requires r1 and r4. These are the resource requirements of this particular instruction, and this table, which is called the instruction reservation table, records them.
There is going to be one such table for each instruction in the machine. Now, how do we keep track of the usage of the resources during the scheduling? If there are many resources, we need to keep some kind of table which shows their usage. This is called the global reservation table (GRT). It has as many columns as the number of resources in the machine, and the number of rows equals the length of the schedule; to begin with we do not know the number of rows, but once the schedule is complete the table will be T rows in size. Basically we start with slot t0; then, among the instructions which are ready to be scheduled, that is, those available in the ready queue, the one with the highest priority is picked. We superimpose that instruction's reservation table on the global reservation table and check whether the resource requirements of the instruction are met. How do we do that? The GRT is constructed as the schedule is built, cycle by cycle, and all its entries are initialized to 0. The GRT maintains the state of all the resources in the machine, and it can answer questions of the type: can an instruction of some class be scheduled in the current cycle, say t_k? The answer is obtained by ANDing the reservation table of the instruction with the GRT, starting from that particular row. If the resulting table contains only 0s, then yes; otherwise, no. That is what I meant by placing the reservation table over the GRT and ANDing the appropriate entries.
If the reservation table of the instruction requires a particular resource, say r0, and the GRT already has a 1 there, ANDing the two produces a 1, which means the resource is busy. What we require is a 0: if the GRT entry were 0 while the instruction reservation table had a 1, the ANDing operation would produce a 0, indicating that the resource is not busy. That is precisely what is said here: if the resulting table contains only 0s over all the rows and columns of the instruction's reservation table, then obviously all the resources required by the instruction are available and the instruction can be scheduled; otherwise, no. If the instruction can be scheduled, then after this check the GRT has to be updated: we place the reservation table of the instruction at the appropriate time slot and do an OR operation, thereby marking busy all the resources used by the instruction, that is, putting a 1 in all those places. That is precisely how the GRT is updated after scheduling the instruction. So much for instruction scheduling using the basic block scheduler. Such a simple list scheduling strategy has some disadvantages. The first is that checking the resource constraints is a very inefficient process, because it involves repeated ANDing and ORing of bit matrices for many instructions in each scheduling step. This is a source of inefficiency, though not a crippling one. The space overhead may also become considerable, but this is not a serious issue.
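The AND-check and OR-update can be sketched on Boolean matrices as follows. The table sizes are assumptions, and the sample reservation table mirrors the r0/r2/r3, r0/r1/r4, r3/r4, r1/r4 pattern from the earlier example.

```python
def can_schedule(grt, rt, t):
    """AND the instruction's reservation table against the GRT from row t:
    schedulable iff no required resource is already marked busy."""
    for i, row in enumerate(rt):
        for j, need in enumerate(row):
            if need and grt[t + i][j]:   # AND produced a 1 -> conflict
                return False
    return True

def commit(grt, rt, t):
    """OR the reservation table into the GRT, marking resources busy."""
    for i, row in enumerate(rt):
        for j, need in enumerate(row):
            if need:
                grt[t + i][j] = 1

# 5 resources (r0..r4), schedule length 6 (illustrative sizes).
grt = [[0] * 5 for _ in range(6)]
rt = [[1, 0, 1, 1, 0],   # t0: r0, r2, r3
      [1, 1, 0, 0, 1],   # t1: r0, r1, r4
      [0, 0, 0, 1, 1],   # t2: r3, r4
      [0, 1, 0, 0, 1]]   # t3: r1, r4
print(can_schedule(grt, rt, 0))   # -> True on an empty GRT
commit(grt, rt, 0)
print(can_schedule(grt, rt, 1))   # -> False: r0 in row 1 is now busy
```

In a real scheduler the GRT rows would grow on demand as the schedule lengthens, and the inner loops would be replaced by word-wide bitwise AND/OR, which is exactly the repeated bit-matrix work the lecture identifies as the efficiency concern.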
The checking of resource constraints is the slower operation compared to the space problem, but this is still a very simple and effective algorithm, and it allows many heuristics to be introduced into the priority computation, so it remains very widely used. Now let us move on to the next scheduling strategy: global acyclic scheduling. Let me explain why such a strategy is required. It so happens that the average size of a basic block is small in most applications, say between 5 and 20 instructions. Instruction scheduling is not very effective when we schedule such a small number of instructions: there are not enough choices for the various slots, so we may be forced to insert NOPs. This is a serious concern in architectures supporting greater instruction-level parallelism. For example, VLIW architectures have several function units, and superscalar architectures can issue multiple instructions per cycle. On such architectures, where we can initiate more than one instruction per cycle, or at least one instruction per cycle, and use the pipelining available in the machine to execute them in various phases, very small basic blocks bring down the efficiency of the machine and make the program run slowly. Global scheduling is in the same spirit as the value numbering that we did for extended basic blocks: we take a set of basic blocks and try to schedule the instructions of that set as if they were a single basic block.
This overlaps the execution of successive basic blocks, and there are several techniques for it: one is trace scheduling, the second is superblock scheduling, the third is hyperblock scheduling, and the fourth is software pipelining. There are many more, of course, but we will deal with only these four in our discussion. I hope that clarifies why we need to look at more than one basic block. Trace scheduling is a fairly widely applicable method. A trace is a frequently executed acyclic sequence of basic blocks in a control flow graph, that is, part of a path. How do we identify a trace? We identify the most frequently executed basic block, and then extend the trace from this block forward and backward along the most frequently executed edges. To show you an example: in this control flow graph, say this was the most frequently executed block; then we grow the trace backwards and forwards and include this entire path as the main trace. Once we identify traces using profiling and this simple algorithm, we can apply list scheduling on the trace, including the branch instructions of course. The execution time for the trace may reduce, but the execution time for other paths may increase; I will show you why this happens. The overall performance, however, will certainly improve. For example, with this as our main trace, what we really do is consider these three basic blocks as one unit, one big basic block, and try to schedule its instructions. We can move instructions between these basic blocks, and doing so introduces some compensation code that we are going to see a little later. Since we now have many instructions together, we will probably reduce the execution time of this set of basic blocks; most of the time execution goes along this path, but sometimes it also goes along the other path.
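The trace-growing step just described can be sketched as follows, assuming profile counts are available for blocks and edges. The diamond-shaped CFG and its counts below are illustrative assumptions, not the lecture's figure.

```python
def grow_trace(block_freq, edge_freq, succs, preds):
    """Grow a trace from the hottest block, extending it forward and
    backward along the most frequently executed edges, keeping it acyclic."""
    seed = max(block_freq, key=block_freq.get)
    trace = [seed]
    b = seed                       # extend forward along hottest out-edges
    while succs[b]:
        nxt = max(succs[b], key=lambda s: edge_freq[(b, s)])
        if nxt in trace:
            break                  # stop rather than form a cycle
        trace.append(nxt)
        b = nxt
    b = seed                       # extend backward along hottest in-edges
    while preds[b]:
        prv = max(preds[b], key=lambda p: edge_freq[(p, b)])
        if prv in trace:
            break
        trace.insert(0, prv)
        b = prv
    return trace

# Illustrative diamond CFG: B1 branches to B2 (hot) or B3 (cold), join at B4.
block_freq = {"B1": 90, "B2": 100, "B3": 10, "B4": 95}
edge_freq = {("B1", "B2"): 90, ("B1", "B3"): 10,
             ("B2", "B4"): 90, ("B3", "B4"): 10}
succs = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": []}
preds = {"B1": [], "B2": ["B1"], "B3": ["B1"], "B4": ["B2", "B3"]}
print(grow_trace(block_freq, edge_freq, succs, preds))  # -> ['B1', 'B2', 'B4']
```

The resulting main trace B1–B2–B4 is then scheduled as one unit, while the cold block B3 becomes the off-trace path.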
This other block is outside the trace. If we jump to it, then apart from some compensation code which needs to be executed, which we will see later, jumping to this block breaks the pipeline of the trace, so we may have to execute this path at a higher cost than before. That is what I was trying to explain: the execution time for the trace may reduce, but the execution time for the other paths may increase because of the compensation code and the extra jumps. Let us consider this example, which has an if-then-else condition. In the then part we execute one statement, in the else part another, and after the if-then-else is over we execute sum = sum + b[i]. So there are four basic blocks corresponding to these: the conditional block, the then part, the else part, and the join corresponding to sum = sum + b[i]; and here are the instructions corresponding to the four blocks. Suppose we take the trace and then apply our basic block scheduling algorithm, a very simple one, without yet separating the main trace and the side trace: we just take each block and schedule it using the basic block scheduling algorithm, that is all. If we do that, we are forced to introduce a NOP here and a NOP here, and the program takes nine cycles for the main trace and six cycles for the off trace. Remember, we have identified the main trace and the off-trace block, but we have not applied any scheduling algorithm that combines these basic blocks; we are still applying basic block scheduling separately to b1, b2, b4, and b3. That is the baseline for comparison.
How does this take nine cycles? For the first basic block we require three cycles; then if we go to i7, that is the off-trace path, whereas continuing here is the main trace. Cycles 0 through 6 correspond to the main trace, and after cycle 6 there is a goto instruction which brings us to i9, which corresponds to the merge block. So either we execute b1 and b2 and then jump to b4, or we jump right at the beginning to the else part, execute it, and fall through to b4: these are the two possibilities. If we fall through, we execute b1 and b2, then jump to b4 and execute it; after six cycles we have seven and eight, which means nine cycles for the main trace. For the off trace, we definitely execute b1, then go to i7, execute that instruction, and then fall through and execute the join; that requires three more cycles, the fourth, fifth, and sixth. So we require six cycles for the off trace and nine cycles for the main trace. That is our baseline schedule. We have two integer units available, so we can schedule instructions freely on either one of them based on the dependences: this is a two-way issue architecture with two integer units, so we can issue two instructions in the same cycle. It requires one cycle for add, sub, and store, two cycles for load, and goto has no stall. Now suppose we consider the three blocks of the trace as a single block and schedule them together: that is trace scheduling; what we did so far was not.
So, in effect we move some of the instructions from here to this part, and we also delete this branch, because the flow of control is now maintained by falling through — we do not have to jump from here to here. For the off trace it is different: the effect is to move these instructions here, then we have a store, and then these three instructions — that is our main trace. The off trace jumps to this point, executes these instructions, then jumps again and executes these, so it takes more time. On the main trace there are no jumps, whereas on the off-trace path there are jumps. That is how trace scheduling works. Let us look at the result: it requires six cycles for the main trace and seven cycles for the off trace, whereas ordinary basic block scheduling had required nine cycles for the main trace and six for the off trace. This is integer unit one and this is integer unit two; this is our main trace and this is the off-trace block. How does control flow? We have been able to schedule the instructions in a mixed manner. This is the condition: if the condition is false we go to I7 — that breaks out of the main trace; otherwise we fall through and continue. And this is actually a loop: if R1 < R6, go back to I1. As long as we are executing the main trace, we execute it very fast — the cycles required are 0, 1, 2, 3, 4 and 5, so six cycles for the main trace. Now, how does the off-trace part execute? As I told you, we have to jump to this part, execute it, then jump again into the middle of the main trace, execute those instructions, and get out. So we go to I7, execute R4 = R2, and then go to the middle of the main trace.
That is I9, here: we execute this instruction, this instruction, and also this jump instruction. So we require cycles 0, 1 and 2 — three cycles — then 3 and 4, making five cycles, and then 5 and 6, making seven cycles in all. That is what we see here: the main trace is very fast, but the off trace, because of its two jumps, requires more time to execute. So that is trace scheduling — but, as I told you, some extra code has to be introduced into the various blocks. The side exits and side entrances are ignored while a trace is being scheduled, and this requires compensation code to be inserted during the bookkeeping phase, after the trace has been scheduled. Basically, for the main trace we do the scheduling, then check whether instructions have been displaced from their original positions, see whether extra code has to be introduced, and introduce it on the off trace. There are also other possible side effects. One is the bookkeeping code, which I will show you very soon. Another is speculative code motion — a load instruction moved ahead of a conditional branch. In our example, the register R3 should not be live on the off-trace path. Let me show you. As I said, these instructions are all effectively moved here, which means the load instruction also moves here and register R3 is loaded. Now suppose after that we take this branch. The load has already happened — it was scheduled in parallel with this other load — so the load of R3 has completed by the time we jump to the off trace. So R3 is live at this point, because the load has already completed, whereas in the original control flow graph R3 was loaded here, in some other block.
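Whether trading a slower off trace (six cycles up to seven) for a faster main trace (nine cycles down to six) pays off depends on how often the main trace is actually taken. A quick sketch of the expected per-iteration cost, with a hypothetical branch probability (the lecture does not give one):

```python
# Expected execution time per pass, baseline vs. trace-scheduled, using
# the lecture's cycle counts: baseline 9 (main) / 6 (off), trace-scheduled
# 6 (main) / 7 (off). p is the probability the main trace is taken.
def expected(main, off, p):
    return p * main + (1 - p) * off

p = 0.9                         # hypothetical: main trace taken 90% of the time
baseline = expected(9, 6, p)    # baseline scheduling
traced   = expected(6, 7, p)    # after trace scheduling
print(baseline, traced)
```

With p = 0.9 the trace-scheduled version wins clearly (about 6.1 vs. 8.7 cycles), which is exactly the bet trace scheduling makes: optimize the frequent path even if the infrequent one gets slower.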
So, if we had exited at this point in the original program, R3 would not have been live. This is a side effect: the load was moved speculatively to this point on the assumption that the main trace would be taken, but the main trace was not taken, and now register R3 is live even in this block. The possible consequence is an unwanted exception. This is not easy to take care of; it requires additional hardware support to detect such exceptions and make sure that some repair is performed. So trace scheduling needs extra hardware support to handle these unwanted exceptions — they would not have occurred in the original program, but scheduling the main trace separately has introduced them, and they must not cause difficulty when the off trace is taken. Such unwanted exceptions should be caught and dealt with appropriately by the hardware. Now, the compensation code. This is the original sequence of instructions — 1, 2, 3, 4, 5 — which may span several blocks of the main trace. Suppose scheduling reorders the sequence to 2, 3, 4, then instruction 1, followed by instruction 5. If we exit after instruction 2, then in the original sequence we would have executed 1 and then 2 before leaving for the off trace, whereas now we do not execute instruction 1 at all — we execute instruction 2 and then exit. That is incorrect. What we need to do is insert instruction 1 along this exit edge. That is the extra compensation code, and there is nothing wrong in executing instruction 1 after instruction 2, because in the scheduled main trace too instruction 1 executes after instruction 2 — the dependences permit it. So it is safe to insert instruction 1 at this point.
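The bookkeeping for a side exit can be sketched as a toy function — this is not from the lecture, just an illustration. It compares the original and scheduled orders and reports which instructions must be copied onto the exit edge, ignoring dependence checking for simplicity.

```python
# Side-exit bookkeeping sketch: the trace reorders [1,2,3,4,5] into
# [2,3,4,1,5], with a side exit after instruction 2. Any instruction
# that originally ran at or before the exit but now sits below it
# must be copied onto the exit edge. (Simplified: ignores dependences.)
def exit_compensation(original, scheduled, exit_after):
    before = original[: original.index(exit_after) + 1]
    exit_pos = scheduled.index(exit_after)
    return [i for i in before if scheduled.index(i) > exit_pos]

# Instruction 1 was moved below the exit, so it must be copied there.
print(exit_compensation([1, 2, 3, 4, 5], [2, 3, 4, 1, 5], exit_after=2))
```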
So, that is one type of compensation code. Now suppose we had the same sequence 1, 2, 3, 4, 5, and there was a jump from outside into the middle of this code, to instruction 3 — a side entrance. Suppose scheduling changes the order of the main trace to 1, 5, 2, 3, 4. What compensation code is required when instruction 5 moves above the side entrance into the trace? The problem is that if we enter through this edge we execute 3 and 4, but we do not execute 5 at all — it has moved up. Obviously, instruction 5 has to be inserted as compensation code along this entrance edge, and there is nothing wrong in inserting it there and executing it before 3, because in the scheduled main trace too instruction 5 executes before instruction 3. So that is the compensation code to be inserted. Compensation code can become quite large in some cases, and this is one of the disadvantages of trace scheduling. We will stop here and continue with other types of scheduling in the next part of the lecture. Thank you.
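The side-entrance case can be sketched the same way — again a toy illustration, not the lecture's algorithm: any instruction that originally executed after the entrance point but was hoisted above it by scheduling must be copied onto the entrance edge.

```python
# Side-entrance bookkeeping sketch: the trace reorders [1,2,3,4,5] into
# [1,5,2,3,4], with a side entrance at instruction 3. Any instruction
# that originally ran at or after the entrance but was hoisted above it
# must be copied onto the entrance edge. (Simplified: ignores dependences.)
def entrance_compensation(original, scheduled, enter_at):
    after = original[original.index(enter_at):]
    enter_pos = scheduled.index(enter_at)
    return [i for i in after if scheduled.index(i) < enter_pos]

# Instruction 5 was hoisted above the entrance, so it must be copied there.
print(entrance_compensation([1, 2, 3, 4, 5], [1, 5, 2, 3, 4], enter_at=3))
```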