on data hazards. We have started our discussion on the various types of hazards that arise in a pipelined implementation of a processor. You may recall that data hazards are one of the three types of hazards; in the last lecture we discussed structural hazards and saw how they can be overcome with the help of additional hardware. Coming to data hazards, they can be classified into three types. The first is read after write (RAW), where an instruction j tries to read a value before an earlier instruction i writes it. The other two are write after write (WAW) and write after read (WAR). However, for the simple five-stage RISC pipeline we have been discussing, only read-after-write hazards can result in a pipeline stall; write-after-read and write-after-write hazards cannot occur, for the reasons we discussed in the last lecture. Before I go further, let us consider the outcome of such a data hazard. Here is a code sequence: ADD R1, R2, R3 followed by SUB R4, R1, R3, then AND R6, R1, R7, and so on. You find here that we have five instructions, and a data hazard arises because the second instruction is trying to read a value that has not yet been written into the register. Similarly, the third instruction also faces a hazard because it tries to read the data before the write into the register has taken place. The subsequent instructions, however, will not face any hazard, because they read the data from the register after the write has taken place. Now, how can we overcome these hazards? There are several techniques. The first technique is known as forwarding, or bypassing, and it is a hardware-based approach.
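The hazard window just described can be sketched in a few lines of code. This is my own illustration, not part of the lecture: it assumes a classic five-stage pipeline with split-phase register access, so only the two instructions immediately after a writer read its result too early.

```python
# Flag RAW hazards in a straight-line fragment for a five-stage pipeline.
# With split-phase register access, the third instruction after a writer
# can already read the new value, so the hazard window is 2 instructions.

HAZARD_WINDOW = 2  # instructions after the writer that would read too early

def raw_hazards(program):
    """program: list of (dest_reg, [source_regs]) tuples, in issue order."""
    hazards = []
    for i, (dest, _) in enumerate(program):
        for j in range(i + 1, min(i + 1 + HAZARD_WINDOW, len(program))):
            _, sources = program[j]
            if dest in sources:
                hazards.append((i, j, dest))
    return hazards

# The lecture's sequence: ADD R1,R2,R3 ; SUB R4,R1,R3 ; AND R6,R1,R7 ;
# followed by two more readers of R1 that are far enough away to be safe.
prog = [("R1", ["R2", "R3"]),
        ("R4", ["R1", "R3"]),
        ("R6", ["R1", "R7"]),
        ("R8", ["R1", "R9"]),
        ("R10", ["R1", "R11"])]

print(raw_hazards(prog))  # only instructions 1 and 2 hazard on R1
```

Running it confirms the lecture's observation: only the SUB and the AND are inside the window, while the later instructions read R1 after the write has completed.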
That means you have to add some additional hardware to overcome data hazards, and we shall see how it can be implemented. The other techniques are essentially software based. The first is basic compiler pipeline scheduling, and we shall see how the compiler can help in scheduling instructions such that a data hazard is eliminated or its impact is reduced. The scheduling done by the compiler is known as static scheduling. On the other hand, scheduling can also be done with the help of hardware, which is known as dynamic scheduling; I shall discuss that as well, and dynamic scheduling can be done without renaming or with renaming. Finally, we shall discuss how hardware speculation can also help in reducing data hazards. So, first let us focus on forwarding and bypassing, which is a hardware-based approach. If we look at this particular pipelined execution of the instructions, we find that the data hazard arises in spite of the fact that the results are already available in the pipeline registers. Although the data has not yet been written into the destination register, R1, the value that was computed in the third clock cycle is already available in the pipeline registers. As you can see, the pipeline registers hold different values after execution of the different stages. For example, after instruction fetch the instruction is stored in the instruction register; after instruction decode the data read from the registers is again stored in pipeline registers. So, the various values generated in a particular stage are stored in these pipeline registers. Similarly, after execution the results are available in the pipeline registers.
Similarly, whenever a memory access is done the result is again stored in a pipeline register, and finally, when write back takes place, the data is taken from the pipeline register and written to the destination register. So, the key idea of forwarding comes from the fact that the data which is already available in the pipeline registers is not being utilized; we are trying to read it from the final register, where it is stored only at the end of the fifth cycle. Can we not read the operands from the pipeline registers instead, before the value is written into the destination register, in our case R1? How this can be done is shown here: we shall require several multiplexers. As you can see, the ALU inputs now come from two multiplexers; earlier they came only from the first pipeline register, or from the immediate data. Now the outputs of the different pipeline registers are applied to the inputs of these multiplexers: the ALU output is applied to one multiplexer, the same ALU output is also applied to the other multiplexer, and similarly the data read from memory is applied to the multiplexer inputs, because we shall read from the pipeline register instead of from the register file. So, the basic idea is that with the help of multiplexers having more than two inputs, say five inputs each, the ALU inputs can be taken not only from the register file but also from the different pipeline stages, because the intermediate results held in the pipeline registers are available there. They can be taken from there and applied to the ALU, and that is how this hazard is overcome.
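The select logic of one such forwarding multiplexer can be sketched as follows. This is a minimal illustration of my own, using the conventional EX/MEM and MEM/WB stage-register names rather than anything shown in the lecture's diagram; it shows only the priority rule: forward the newest matching result, else fall back to the register file.

```python
# Forwarding-mux select logic for one ALU input: prefer the result still
# in the EX/MEM pipeline register (one instruction ahead), then the one
# in MEM/WB (two ahead), otherwise read the register file normally.

def forward_select(src_reg, ex_mem_dest, mem_wb_dest):
    """Return which mux input should feed the ALU for source register src_reg."""
    if src_reg is not None and src_reg == ex_mem_dest:
        return "EX/MEM"   # newest value wins if both stages match
    if src_reg is not None and src_reg == mem_wb_dest:
        return "MEM/WB"
    return "REGFILE"      # no hazard: the register file already holds it

# SUB R4,R1,R3 right after ADD R1,R2,R3: R1 must come from EX/MEM.
print(forward_select("R1", ex_mem_dest="R1", mem_wb_dest=None))  # EX/MEM
print(forward_select("R3", ex_mem_dest="R1", mem_wb_dest=None))  # REGFILE
```

The ordering of the two checks matters: if two in-flight instructions both write R1, the ALU must receive the more recent value, which is still in EX/MEM.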
So, to support data forwarding, additional hardware is required: first, the multiplexers that allow data to be routed back, and second, the control logic. We know that the control logic controls the different data-path hardware present in the processor; that means it controls the various registers, it controls the operation to be performed by the ALU, and so on. Here, whenever we add multiplexers with multiple inputs, for example a multiplexer with five inputs, you will require three control signals to select one of the five inputs to be applied to the ALU. Similarly, here you have another five-input multiplexer, with inputs coming from five different sources, and you have to select one of the five depending on the instruction being executed. That means, depending on the opcode, the controller will decide which input is selected, and that data will go to the ALU. So, the control logic for the multiplexers also has to be implemented. In other words, the controller without forwarding is simpler than the controller with forwarding: if forwarding is not done the controller is simpler, but when forwarding is implemented the controller becomes more complex, because it has to generate additional control signals. Those control signals are generated by analyzing the instructions; that means you have to determine in which situations the operands will come from which pipeline registers, and the logic of the controller has to be implemented accordingly. So, this implementation will be a little more complex. This is how forwarding is done.
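The count of three control signals for a five-input multiplexer follows directly from the number of select lines an n-input mux needs, which is the ceiling of log2(n). A one-line check, just to make the arithmetic concrete:

```python
# Select lines needed for an n-input multiplexer: ceil(log2(n)).
import math

def select_lines(n_inputs):
    return math.ceil(math.log2(n_inputs))

print(select_lines(5))  # 3 control signals, as stated for the forwarding mux
print(select_lines(2))  # 1, the original two-input case
```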
This is also called bypassing, because we bypass reading from the final register by reading from the pipeline registers where the intermediate values are temporarily stored. This is forwarding. Now, this particular diagram shows how forwarding helps in overcoming data hazards. As you can see, earlier we were trying to read this value from the register file. Instead, for the second instruction, that value now comes from this particular pipeline register rather than from the register file. Since the data moves in the forward direction, there is no hazard. Similarly, you can see that the third instruction also gets its operand from a pipeline register: R1 is read from the pipeline register and applied to one input of the ALU. The next instruction, however, reads the operand from the register file itself, and this is possible because we use what is known as split-phase access. You have a register file whose operation is controlled by a clock. The write operation is performed in the first half of the clock cycle, and the read operation is done in the second half; as a consequence, after a write takes place, you can read the new value during the same clock cycle. That is what happens in this particular case: the write of the data into the register takes place in the first half of the clock cycle, and in the second half, shown in green, the read operation takes place from the same register.
So, from the same register, in two phases, you are able to perform two operations, a write as well as a read; that is why it is called split-phase access. If split-phase access is not allowed, that is, if only a read or a write can take place in a single clock cycle, then there will again be a data hazard in this case, and you will have to read the value not from the register file but from the pipeline register. Now the question arises: can all possible data hazards be overcome by forwarding or bypassing? The answer is no; there are situations where, even using this complicated bypassing hardware, the hazard cannot be overcome. Consider this particular case, which involves a load operation. You are loading some value into register R1, so a memory access is required. As we have seen, the memory access is done in the fourth clock cycle; that means the value read from memory will be available only at the end of clock cycle 4. On the other hand, the next instruction, SUB R4, R1, R6, needs the value at the beginning of the fourth clock cycle. So, at the beginning of the fourth clock cycle it is trying to read the value, while the data becomes available only at the end of the fourth clock cycle. In this situation there will be a hazard, and this hazard cannot be overcome by bypassing; we have to introduce a stall to overcome it. The subsequent instructions, however, will read their operands from the register file. This leads to a stall which we call an unavoidable stall. Such stalls will be present whenever you read an operand from memory and use it in the very next instruction; when only ALU operations are involved, this problem does not arise.
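The load-use rule just stated can be condensed into a tiny sketch. This is my own illustration under the lecture's assumptions: with full forwarding, an ALU result always arrives in time, but a load's data arrives only at the end of the MEM stage, one cycle too late for an immediately following user.

```python
# Load-use interlock: even with full forwarding, a load followed
# immediately by a user of the loaded register costs one stall cycle.

def stalls_needed(producer, consumer_sources):
    """producer: (kind, dest_reg) with kind 'load' or 'alu'.
    Returns stall cycles required before the very next instruction."""
    kind, dest = producer
    if kind == "load" and dest in consumer_sources:
        return 1   # the one unavoidable stall the lecture describes
    return 0       # ALU results can always be forwarded in time

print(stalls_needed(("load", "R1"), ["R1", "R6"]))  # LW R1 ; SUB R4,R1,R6
print(stalls_needed(("alu", "R1"), ["R1", "R6"]))   # ADD R1 ; SUB R4,R1,R6
```

The first call prints 1, matching the unavoidable stall above; the second prints 0, since an ALU-to-ALU dependence is fully covered by forwarding.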
So much for handling data hazards in hardware. Now we shall discuss the use of the compiler for overcoming data hazards, and the basic idea is to find sequences of unrelated instructions that can be overlapped in the pipeline to exploit ILP. In our ideal implementation of the pipeline we assumed that all the instructions are independent, but subsequently we discovered that there is data dependence, and because of data dependence the instructions are not really independent: you have to maintain a sequence and cannot arbitrarily change their positions. However, there are instructions which are independent, and these can be identified by the compiler; the original sequence can then be modified by the compiler in such a way that the data hazard is overcome. That is the basic idea: a dependent instruction must be separated from the source instruction by a distance, in clock cycles, equal to the latency of the source instruction in order to avoid a stall. This is the key idea behind compiler-based scheduling of instructions. How much separation is needed to avoid the data hazard depends on the latency of the pipeline that has been implemented. This is how we can avoid stalls. Let me illustrate this with the help of a simple example; a clever compiler can often reorder instructions to avoid a stall. I have already mentioned the case where a data hazard involves reading data from memory. The sequence of instructions is LW R2, 0(R4), then an addition operation involving the same register R2 together with R3.
So, you are adding the content of R2 to the content of R3 and storing the result in R1, and then there is another instruction, LW R5, 4(R4). Here we can see that the first load delivers its data at the end of the fourth clock cycle, whereas the add needs to read it at the beginning of the fourth clock cycle; there is a latency of one clock cycle. If the add can be separated from the load by one instruction, our job is done: the data hazard will be overcome. Considering the three instructions, we find that the second load has no dependence on the previous two instructions, so these two instructions can be interchanged. If we interchange them, that is, if we write LW R2, 0(R4), then the second load LW R5, 4(R4), and then ADD R1, R2, R3, the reordering is done. In this case, by the time the add reads the data from register R2, the write has already been completed, because the write into the register by the load takes place in the first half of the fifth clock cycle, and the read takes place in the second half of the same cycle. The write happens first and the read happens afterwards, so the hazard does not arise, and the add is able to read the correct value from the register. This is how the hazard is overcome in this particular case. So, by reordering the instructions, this is the transformed code, and no stall is needed. Earlier, a stall had to be introduced after the load word if the add was to read correct data, and this is not required in the transformed code.
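The effect of this reordering can be checked with a small cost model. This sketch is my own, not the lecture's compiler: it charges one stall whenever an instruction uses a register loaded by the immediately preceding instruction, which is exactly the load-use rule from earlier.

```python
# Count load-use stalls in a straight-line schedule.

def schedule_cost(program):
    """program: list of (kind, dest, sources); one stall per load whose
    result is used by the very next instruction."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        kind, dest, _ = prev
        if kind == "load" and dest in curr[2]:
            stalls += 1
    return stalls

# LW R2,0(R4) ; ADD R1,R2,R3 ; LW R5,4(R4)
original = [("load", "R2", ["R4"]),
            ("alu",  "R1", ["R2", "R3"]),
            ("load", "R5", ["R4"])]

# The compiler's transformation: move the independent load in between.
reordered = [original[0], original[2], original[1]]

print(schedule_cost(original))   # 1 stall
print(schedule_cost(reordered))  # 0 stalls
```

The independent second load fills the delay slot of the first, so the transformed code runs without the stall, as described above.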
This type of compiler scheduling can be done with the help of an optimizing compiler, as we call it. Now, identifying independent instructions in this way is very difficult in a small straight-line fragment: usually the number of instructions in a straight-line fragment is small, and since the number is small, finding independent instructions among them is difficult. So, how can we improve the instruction-level parallelism? One very important technique that has been exploited is known as loop unrolling, and as the name suggests, it is related to loops. Let us consider an example. The original high-level code is for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s; it reads data from an array and performs the operation x[i] = x[i] + s, where s is a constant scalar value available in a particular register. This is the high-level language code, and the corresponding MIPS code is given here. First you load the array element into register F0; then you perform the addition, assuming that the constant scalar is present in register F2. The contents of F0 and F2 are added and the result is stored in F4, and then you store the data from register F4 back to memory, using register R1 as the pointer. The next two instructions are essentially loop-manipulation instructions, used for the housekeeping of the loop. R1 is decremented by 8, since these are double words requiring 8 bytes, that is 64 bits, and then the loop bound held in R2 is compared against R1 by the branch-not-equal instruction, which jumps back to the first instruction as long as they are not equal.
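Before looking at the pipeline behaviour, it may help to see the loop itself in runnable form. This is my own Python rendering of the computation the MIPS code performs; the variable i plays the role of the pointer R1, and the loop test plays the role of the BNE against R2.

```python
# for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s, in Python form.

def add_scalar(x, s):
    i = len(x) - 1        # pointer into the array, like R1
    while i > 0:          # BNE R1, R2, Loop
        x[i] = x[i] + s   # L.D F0 / ADD.D F4, F0, F2 / S.D F4
        i -= 1            # DADDUI R1, R1, #-8 (one 8-byte double word)
    return x

# A tiny array standing in for the 1000-element one in the lecture.
print(add_scalar([0, 1.0, 2.0, 3.0], 10.0))  # [0, 11.0, 12.0, 13.0]
```

Element 0 is left untouched here only because the loop bound is i > 0, mirroring the high-level code above.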
So, in this way it keeps on looping; the loop executes 1000 times. Among the first three instructions, as we have already seen, there is data dependence, and because of the dependence among them, if the instructions are present in this order you have to introduce stalls. How many stalls does executing the loop on the MIPS pipeline without scheduling require? This is based on the latency table shown here, which lists the source instruction and the using, or dependent, instruction. If the source is a floating-point ALU operation and the dependent instruction is also a floating-point ALU operation, the latency is 3; that means the dependent instruction has to be separated from the source by three cycles if you want to avoid stalls. The second entry is a floating-point ALU operation followed by a store double; in that case the latency is 2, meaning two stalls must be introduced to overcome the data hazard. The third entry is a load double followed by a floating-point ALU operation, for which the latency is 1, so one stall has to be introduced. In our loop, the first stall comes from this last entry: we have a load double followed by a floating-point ALU operation, so with latency 1, one stall is introduced. Then, coming to the second instruction, ADD.D F4, F0, F2, the value it writes into F4 is read by the third instruction, the store double.
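The stall count for one pass of the unscheduled loop can be reproduced mechanically from the latency table. This sketch is my own; the first four latency entries follow the table quoted in the lecture, and the integer-ALU-to-branch entry of 1 is an assumption inferred from the lecture's stall count for the DADDUI/BNE pair.

```python
# Stalls between producer/consumer classes, per the lecture's table
# (the int_alu->branch entry is inferred, not read from the table).
LATENCY = {("fp_alu", "fp_alu"): 3,
           ("fp_alu", "store"):  2,
           ("load",   "fp_alu"): 1,
           ("load",   "store"):  0,
           ("int_alu", "branch"): 1}

def count_stalls(program):
    """program: list of (class, dest, sources). Assumes each dependent
    instruction immediately follows its producer, as in the unscheduled
    loop, so the full latency shows up as stall cycles."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev[1] is not None and prev[1] in curr[2]:
            stalls += LATENCY.get((prev[0], curr[0]), 0)
    return stalls

# L.D F0 ; ADD.D F4,F0,F2 ; S.D F4 ; DADDUI R1 ; BNE R1,R2
loop = [("load",    "F0", ["R1"]),
        ("fp_alu",  "F4", ["F0", "F2"]),
        ("store",   None, ["F4", "R1"]),
        ("int_alu", "R1", ["R1"]),
        ("branch",  None, ["R1", "R2"])]

print(count_stalls(loop))              # 4 stalls
print(len(loop) + count_stalls(loop))  # 9 clock cycles per iteration
```

The totals, 4 stalls and 9 cycles per pass, match the figures worked out in the lecture.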
So, this is the store-double case: a floating-point ALU operation followed by a store double, the second category, so you will require two stalls, because the latency here is two if you want to avoid the hazard. Then you are decrementing a register, an integer ALU operation, and it is followed by the branch operation. In this case one more stall is involved, because the branch-not-equal depends on the result of that ALU operation; unless one stall is introduced you will not get the correct value in R1, and the branch will not jump based on the proper value. So, we find that 1, 2, 3, 4: four stalls are required, and the total number of clock cycles needed to execute each pass of the loop is 9, that is, five instructions plus four stalls. Now, let us see how, by rescheduling the instructions, the number of stalls can be reduced. What has been done here is that the loop-manipulation, or housekeeping, instruction, which was earlier present after the store, has been moved up. Since these two instructions are not dependent, there is no need to introduce a stall there, and the load and the add have now been separated by one instruction; as a consequence no stall is required between them, since the latency was one clock cycle. However, two stalls will still be present, because the add double is followed by the store double, and as per the table, two stalls have to be introduced between them.
However, whenever this type of rescheduling is done, you may have to change an instruction. In the previous code, the value of R1 was decremented by 8 after the store, to point to the next element of the array; in the rescheduled code the decrement is performed before the store. That is why, to store the result in the same memory location, the store's displacement is changed: since the decrement has already been done, an offset of 8 is added back in the store. The instruction is modified so that the result is stored at the proper memory location; the effective address generated by the modified store is the same as the one generated by the original store, so the value goes to the same memory location. In this situation we find that two stalls are still present. Is there any way these two stalls can be overcome? To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance equal to the pipeline latency of the source instruction, and whenever you have very few instructions in your straight-line code you cannot really remove these stalls. However, they can be overcome if you do loop unrolling. The loop is unrolled several times, maybe twice, thrice, or four times, and then you have enough instructions in your straight-line code; they can be reordered, and the stalls can be overcome. Let us see how the loop unrolling is done. Here the loop unrolling has been done and you have four copies.
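The displacement fix is easy to verify arithmetically. In this sketch of mine, the original code stores at 0(R1) and then decrements, while the rescheduled code decrements first and stores at 8(R1); both must address the same word.

```python
# Check that the modified store addresses the same location.

R1 = 8000                       # any starting pointer value

orig_addr = R1 + 0              # S.D F4, 0(R1), executed before the decrement
new_R1 = R1 - 8                 # DADDUI R1, R1, #-8 moved before the store
sched_addr = new_R1 + 8         # S.D F4, 8(R1) in the rescheduled code

print(orig_addr == sched_addr)  # True: identical effective address
```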
So, the same three instructions, load double, add double, and store double, are repeated four times; you can see the first copy, the second copy, the third copy, and the fourth copy. After the unrolling is done, the two housekeeping instructions are appropriately modified; this is the unrolled code where the unrolling has been done four times. In this case, how many times does the loop have to execute? Earlier the loop executed a thousand times, but now, since each pass performs the work of four passes of the original loop, it has to loop only 250 times. So, by unrolling, the number of loop iterations is reduced; however, the size of the code increases. Now, if the code is present in this form, the stalls will obviously still be there: one stall after each load double and two stalls after each add double, and the same for the next copies. In this way we find a total of three plus three plus three plus three, that is twelve, plus one, thirteen stalls, and without unrolling, as we saw earlier, each of those four iterations took nine cycles. Where are we saving? We are saving because the two housekeeping instructions are executed only once instead of four times: since the unrolling has been done, the number of iterations is reduced from one thousand to 250, and as a consequence the number of housekeeping instructions executed is reduced, leading to a gain of three cycles for each copy where that pair, together with its stall, is eliminated.
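The cycle accounting for the unrolled but unscheduled loop can be laid out explicitly. This is my own tally under the lecture's figures: each of the four load/add/store copies incurs 1 + 2 stalls, and the single remaining housekeeping pair incurs one stall before the branch.

```python
# Cycle count for the 4x-unrolled, unscheduled loop body.

COPIES = 4
BODY_INSTRS = 3      # L.D, ADD.D, S.D per copy
BODY_STALLS = 1 + 2  # load->fp_alu latency 1, fp_alu->store latency 2
HOUSEKEEPING = 2     # DADDUI, BNE, now appearing only once
HK_STALL = 1         # integer ALU result feeding the branch

instrs = COPIES * BODY_INSTRS + HOUSEKEEPING
stalls = COPIES * BODY_STALLS + HK_STALL
cycles = instrs + stalls

print(stalls)           # 13 stalls
print(cycles)           # 27 cycles for four elements
print(cycles / COPIES)  # 6.75 cycles per element, versus 9 before
```

The comparison against four passes of the original loop, 36 cycles, gives the nine-cycle gain discussed next.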
So, the decrement, its stall, and the branch accounted for three cycles in each pass earlier; now they appear only once, and since three of the four copies no longer need them, there is a gain of nine cycles for this computation. Four computations are being performed, and you are saving nine cycles. Now, you see that the size of the straight-line code is quite big. We have a long straight-line fragment; earlier only four or five instructions were present, and now you have a large number of instructions. So, the compiler has much more opportunity for reordering the instructions; the scope for instruction-level parallelism increases, and you can exploit it more when this unrolling is done. You can see how it is done in this particular case: this is loop unrolling with scheduling. All the load operations are performed first, because they are independent, and obviously the instructions change, because you have to read the array elements properly. After the first load, the next one uses the displacement -8, that is, the next array element is at R1 - 8, then the next is at -16, and the fourth at -24, because each element requires eight bytes; this adjustment of the displacement is required so that the effective address points to the proper array element. After the loads are done, you can perform the add operations. Now, observe this load: it writes into F0, and that F0 is used here, separated by three instructions. The latency is one and the separation is three instructions, so there is no question of any hazard when you execute the instruction ADD.D F4, F0, F2.
Similarly, this load and the instruction where F6 is read are again separated by one, two, three instructions; the latency is one and the separation is three, so that hazard is overcome. In this way no hazard is present here. Similarly, whenever you store the data: the store of F4, whose value was produced here, is separated by several instructions, one, two, three of them, and the second store likewise. The subtraction operation, the loop-manipulation instruction, has again been moved up; the reason is that the last add double writes into register F12, and that value is read by the last store, so placing the decrement in between increases the separation and avoids a hazard. In this way, by scheduling the instructions in this manner, you are able to overcome all the hazards; no hazard is present, and you achieve a gain of twenty-two cycles when you perform this type of unrolling and scheduling. So, with loop unrolling and scheduling we have been able to overcome all the hazards that were present in the execution. Now the question arises: what kind of problem may we face when we do this kind of loop unrolling and scheduling? It definitely overcomes all the hazards, but will it introduce any new problem which may degrade the performance? By overcoming hazards, stalls are removed, which obviously improves the performance; we have seen a gain of twenty-two cycles. But is there any possibility of a reduction in performance because of this loop unrolling and scheduling? Before I come to the disadvantages, the decisions and transformations required for loop unrolling and scheduling are explained here. What are the things you have to do?
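The final accounting can be summarized in a few lines. This is my own tally of the lecture's numbers: after scheduling, the 4x-unrolled loop runs its 14 instructions with no stalls.

```python
# Cycle count for the 4x-unrolled loop after scheduling.

COPIES = 4
instrs = COPIES * 3 + 2            # four L.D/ADD.D/S.D copies + DADDUI, BNE
scheduled_cycles = instrs          # all stalls removed by scheduling
original_cycles = COPIES * 9       # 9 cycles per pass of the original loop

print(scheduled_cycles)                    # 14 cycles for four elements
print(scheduled_cycles / COPIES)           # 3.5 cycles per element
print(original_cycles - scheduled_cycles)  # gain of 22 cycles
```

So the 22-cycle gain is measured against four passes of the original, unscheduled loop: 36 cycles down to 14.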
Rather, what the compiler has to do is explained here. First of all, determine whether the loop iterations are independent. It may so happen that different iterations are not independent: what is performed in iteration i and in iteration i + 1 may depend on each other. Loop unrolling is profitable only when the loop iterations are independent. In our example, the operations performed in two different iterations are completely independent, because they are performed on different elements of the array; since they act on different elements, they are independent, but this may not hold for all kinds of loops. So, the first thing you have to check is that the loop iterations are independent. Second, use different registers to avoid unnecessary conflicts. In our original program we used very few registers: the integer registers R1 and R2, and the floating-point registers F0, F2, and F4. But when we did loop unrolling, we used a large number of registers: in addition to those, we used F6, F8, F10, F12, F14, and F16, none of which were used earlier. That means, whenever you perform loop unrolling, it is essential to have a large number of registers available to avoid name dependences. If you reuse the same registers, this leads to a kind of dependence which, although not a true dependence, will lead to hazards in some kinds of pipelines. That is the reason we used a large number of registers; using different registers to avoid unnecessary conflicts is important whenever you do loop unrolling.
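The renaming step can be sketched as a small transformation. This is my own illustration, not the lecture's compiler: each copy of the loop body is written with placeholder registers, and every copy draws fresh floating-point registers from a pool, so the copies carry no name dependences; only F2, holding the scalar s, is deliberately shared.

```python
# Give each unrolled copy fresh registers to remove name dependences.

def rename_copies(body, n_copies, free_regs):
    """body: list of (op, dest, sources) using placeholders 'Fa', 'Fb'."""
    pool = iter(free_regs)
    unrolled = []
    for _ in range(n_copies):
        mapping = {"Fa": next(pool), "Fb": next(pool)}
        sub = lambda r, m=mapping: m.get(r, r)   # F2 and R1 pass through
        for op, dest, srcs in body:
            unrolled.append((op, sub(dest), [sub(s) for s in srcs]))
    return unrolled

body = [("L.D",   "Fa", ["R1"]),
        ("ADD.D", "Fb", ["Fa", "F2"]),
        ("S.D",   None, ["Fb", "R1"])]

out = rename_copies(body, 2, ["F0", "F4", "F6", "F8"])
print(out[1])  # ('ADD.D', 'F4', ['F0', 'F2'])
print(out[4])  # ('ADD.D', 'F8', ['F6', 'F2'])
```

Because the two adds now write F4 and F8 instead of both writing F4, the scheduler is free to interleave the copies, which is exactly what the scheduled unrolled loop did.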
Third, eliminate the extra test and branch instructions and adjust the loop-termination and iteration code. We have seen that we removed several copies of these two instructions, the decrement and the branch-not-equal, and modified the remaining ones so that the loop proceeds correctly to the next iteration. Fourth, determine whether the loads and stores can be interchanged in the unrolled loop. This was our original code immediately after unrolling; the scheduling was done afterwards, and we had to identify independent instructions which could be placed one after the other. Finally, schedule the code, preserving any data dependences needed to yield the same result as the original code. This is very important: ultimately your result must be the same. While maintaining the same result, you avoid the stalls by scheduling the instructions. So, the key requirement is to understand how instructions depend on one another and how they can be changed and reordered, and this depends on the processor architecture. What we are trying to say is that the compiler must understand the processor architecture: how many registers are available, for instance; fortunately, a RISC processor provides a large number of registers, in our case 32. Secondly, the latency between instructions depends on the pipeline implementation of the architecture, and those latencies must also be understood by the compiler, with the scheduling of instructions done accordingly. This is how a compiler can do the job. Now, this loop unrolling runs into three different kinds of limits. Number one is the decrease in the amount of overhead amortized with each unrolling.
So, as you do more unrolling, the overhead that is amortized with each additional unrolling decreases: we have seen that each eliminated copy of the housekeeping pair saves only a fixed few cycles, so the relative gain diminishes as the unroll factor grows. The second limit is the growth in code size due to loop unrolling. We have already seen that the size of the code increases; as you know, this code has to be brought into the cache memory before the program can be executed. Nowadays all processors are provided with cache memories, and if the code size is small, only a small number of instructions need to be stored in the cache. But if the size of the code is big, this will lead to what are known as cache misses. That means the growth in code size due to loop unrolling may increase the cache miss rate, and this will degrade performance; later on I shall discuss how cache misses occur when you have a cache memory. So, this problem also puts a limit on loop unrolling. The third limit is the shortfall of registers created by aggressive unrolling and scheduling. We have seen that we use additional registers for each unrolled copy to avoid various kinds of conflicts, and as we do so, a point is reached where no further register is available. This is known as register pressure: you have to work within the registers that are available, and as the unrolling increases there will be a shortfall of registers. This again puts a limit on loop unrolling. So, loop unrolling improves the performance by eliminating overhead instructions, as we have already seen.
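The first limit, diminishing returns, can be made concrete with a simple model. This is my own sketch under an idealized assumption that scheduling removes all stalls, so each element costs its three body instructions plus a share of the two-instruction housekeeping pair.

```python
# Cycles per array element as a function of the unroll factor,
# assuming a fully scheduled, stall-free body.

def cycles_per_element(unroll):
    return (3 * unroll + 2) / unroll

for u in (1, 2, 4, 8, 16):
    print(u, cycles_per_element(u))
# The amortized overhead shrinks from 2 toward 0 per element, so the
# gains flatten out while code size and register pressure keep growing.
```

Going from 1 to 4 copies saves 1.5 cycles per element, but going from 4 to 16 saves under 0.4 more, while the code becomes four times larger; this is why aggressive unrolling stops paying off.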
Loop unrolling is a simple but useful method to increase the size of straight-line code fragments, as we have already mentioned, and such sophisticated high-level transformations lead to a significant increase in the complexity of the compilers. These three statements are very important, because they tell us how loop unrolling improves performance and, at the same time, in what way the complexity of the compiler increases. That means the compiler writer has to be knowledgeable about the processor architecture; only then can the unrolling be done in an effective way to overcome the stalls. Let us stop here. We have discussed the static scheduling of instructions with the help of the compiler. In the next class we shall discuss another technique, known as software pipelining. We have seen hardware pipelining in detail, where different instructions are executed in an overlapped manner; in software pipelining we shall see instructions from different loop iterations being executed in an overlapped manner. This is also a compiler-based approach by which the instruction-level parallelism is improved and the stalls are avoided. I shall discuss that in the next lecture. Thank you.