Before I do that, let us have a quick recap of the various types of dependencies and hazards discussed in the last couple of lectures. We have seen that dependencies can be broadly divided into two categories: data dependencies and control dependencies. Data dependencies can again be divided into two broad categories. The first is true data dependencies, which lead to read-after-write (RAW) hazards. We have discussed various hardware and software techniques to overcome these: forwarding, static instruction scheduling by the compiler, and dynamic instruction scheduling by hardware. Similarly, we have discussed name dependencies, which come in two varieties: output dependencies and anti-dependencies. Output dependencies lead to write-after-write (WAW) hazards, and anti-dependencies lead to write-after-read (WAR) hazards. These two types of hazards can be overcome by register renaming, and we have seen how register renaming can be done by the compiler or by the hardware, as in Tomasulo's algorithm. So this is how data dependencies are tackled and the various hazards arising out of them are overcome. So far we have concentrated on data dependencies; now we shall focus on control dependencies, which we have seen lead to control hazards. In simple terms, control hazards occur due to instructions changing the program counter.
We have seen that the program counter keeps track of the instruction to be executed next; that is, it holds the address of the next instruction. When there are branches, this program counter value may not be known immediately. It has been found that control hazards cause a greater performance loss than data hazards do. Data hazards sometimes force us to introduce stalls, but control hazards lead to even greater performance loss, so we have to focus our attention on them and see how their impact can be minimized and the stalls reduced. We know that a branch can have two outcomes: taken and not taken. If the branch is taken, a new address has to be generated: the effective address is the program counter plus the immediate value available as part of the instruction, and this is the address where instruction execution should continue, so the PC has to be loaded with this effective address. If the branch is not taken, the address of the next instruction is simply PC + 4, because instructions are 4 bytes long. Until the new value of the PC is known, whether for the taken or the not-taken case, we cannot fetch the next instruction.
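The two possible next-PC computations described above can be sketched as follows. This is a minimal illustration, not the actual MIPS datapath; for simplicity the immediate is assumed to be a plain byte offset added directly to the PC, whereas real MIPS scales the offset by 4 and adds it to PC + 4.

```python
def next_pc(pc, branch_taken, immediate):
    """Choose the next program counter value after a conditional branch."""
    if branch_taken:
        # taken: effective address = PC + immediate (byte offset assumed)
        return pc + immediate
    # not taken: instructions are 4 bytes, so fall through to PC + 4
    return pc + 4
```

The whole difficulty of control hazards is that `branch_taken` (and, for the taken case, the target) is not known at fetch time.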
That means unless the new value of the PC is known, depending on whether the branch is taken or not, we cannot really proceed: we cannot fetch the next instruction and start executing it. Now let us see what the solutions are. The first and simplest solution is to stall the pipeline upon detecting a branch: as soon as a branch is detected, you stall the pipeline and wait until the branch address is known. The steps are as follows. The ID stage detects the branch; that is, after the instruction is decoded it is known whether it is a branch instruction, but we do not know whether the branch is taken until the execution stage. As we shall see in our pipeline, we have to go to the execution stage because that is where the branch condition is tested. Moreover, the PC is not changed until the end of the memory stage; that is, the new PC value for a taken branch is known only in the memory stage. So if the branch is taken, we need to repeat some stages and fetch new instructions: the instructions that were fetched from the consecutive addresses PC + 4, PC + 8, and so on have to be flushed out, and we have to fetch instructions from the new address. As is clear from the diagram, in the simple MIPS pipeline we have discussed, the condition check tests whether a particular register content is zero, and as I mentioned this is done in the execution stage, while the branch target address is known only in the memory access stage.
So in the memory access stage the address is known and can be loaded into the program counter. We therefore have to wait until the memory access stage to know both things: whether the condition is satisfied, and the branch target address if the branch is taken. What will be the delay? Obviously this leads to a delay of three cycles; you have to stall the pipeline by three cycles in this simple scheme. You can see that after the execution stage the branch condition is decided, and after the memory stage the new target address is known. So the three instructions fetched after BEQ R1, R3, 36 — the instructions following the branch — have to be flushed out. What do we really mean by that? Fortunately, none of these three instructions has made any permanent change in the state of the processor: a permanent change takes place only in the write-back stage, when the content of a register is modified, and before that happens both the condition and the new address are known. So all three instructions are nullified by converting them into no-operation instructions. There is no incorrect change of state, but we lose three cycles, after which instruction fetch resumes either from the next sequential instruction or, if the branch is taken, from the target address 36. The question naturally arises whether it is possible to reduce the number of such stalls.
That means whenever a branch instruction is encountered, if we do not use any sophisticated technique and simply introduce stalls, we lose three cycles per branch. So let us estimate the impact of branch stalls. Assume the ideal CPI is 1, and that 30 percent of the instructions are branches while the remaining 70 percent are ALU operations. Since there is a stall of three cycles, the new CPI is 1 + 0.3 × 3 = 1.9. Of course, we have not considered that not all branches are taken; here we assumed three stalls for every branch instruction. But that is not really necessary even for the simple pipeline, because whether the branch is taken or not is known in the execution stage. If it is not taken, there is no need to wait for the memory stage, so the loss is only two cycles. For example, if 50 percent of the branches are taken, then 15 percent of all instructions are taken branches, each losing three cycles because the target address is known only at the end of the memory stage, and another 15 percent are not-taken branches, each losing two cycles.
So if you add up, the new CPI is 1 + 0.15 × 3 + 0.15 × 2 = 1.75, not 1.9. This penalty would be worse for present-day pipelines. Our figure is for the simple five-stage pipeline, but in modern processors even the instruction fetch or instruction decode stage is divided into several sub-stages, so the number of stall cycles is larger; in our case it is restricted to three. So how do we reduce the impact of branch stalls? There is a two-part solution: first, determine sooner whether the branch is taken or not; second, compute the taken-branch target address earlier. If we can arrange for both of these earlier in the pipeline, we gain performance. There is a hardware solution for this. The zero detector, which checks whether a particular register is zero, sits in the execution stage; it can be moved to the instruction decode stage. Not only that, the multiplexer that was in the memory access stage can also be moved to the instruction decode stage. However, if we want to do that we require additional hardware, namely an adder, because the effective address PC + immediate has to be calculated earlier; previously this addition was done
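The two CPI estimates above can be checked with a short sketch, using the lecture's assumed figures (ideal CPI of 1, 30 percent branches, and in the second case a 50/50 taken split):

```python
def cpi_all_stalled(branch_frac=0.30, penalty=3):
    # pessimistic scheme: every branch stalls the pipeline `penalty` cycles
    return 1.0 + branch_frac * penalty

def cpi_taken_split(branch_frac=0.30, taken_frac=0.50):
    taken = branch_frac * taken_frac            # 15% of all instructions
    not_taken = branch_frac * (1 - taken_frac)  # another 15%
    # taken branches cost 3 cycles (target known only after MEM),
    # not-taken branches cost only 2 (outcome known after EX)
    return 1.0 + taken * 3 + not_taken * 2
```

The first function reproduces the CPI of 1.9, the second the improved 1.75.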
with the help of the ALU available in the processor. If we want to move it to an earlier stage, we require an additional adder that performs the effective address calculation in the instruction decode stage. So if we shift this hardware — the multiplexer, an additional adder, and the zero detector — to the instruction decode stage, then both the condition and the branch target address are known in the second stage itself; we do not have to go to the fourth stage. The penalty is thereby reduced to only one cycle: in the instruction decode stage both are known, and depending on the outcome the next instruction can be fetched after one cycle either from PC + 4 or from the branch target address. That is, the program counter is loaded with PC + 4 or with the effective address in the third cycle itself instead of the fifth. These two improvements are easily accomplished with additional hardware, and this is how the branch penalty is reduced to one cycle. From now on we shall assume that this change has been made in the simple pipeline we are discussing, and that our branch penalty is one cycle.
Now here are some statistics about control instructions, based on the SPEC benchmarks on the DLX processor, taken from the book Computer Architecture: A Quantitative Approach (second edition). It has been found that branches occur with a frequency of 14 to 16 percent in integer programs and 3 to 12 percent in floating-point programs. So this is the branch frequency, the rate at which branch instructions are encountered in a program, and it is higher in integer programs than in floating-point programs. Another statistic: about 75 percent of branches are forward branches. A branch can go in the forward direction, in which case the target address is higher, or in the backward direction, in which case it is lower; 75 percent are forward. Moreover, 60 percent of forward branches are taken and 80 percent of backward branches are taken. You may ask why the percentage is higher for backward branches. The reason is loops: at the end of a loop, execution branches back to a previous instruction, and because of that looping the percentage of backward branches taken is larger.
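From these quoted statistics one can estimate the overall fraction of branches that are taken. This is a quick sketch; the 75/60/80 figures are the lecture's quoted values, and the result is only as good as those assumptions:

```python
def overall_taken_fraction(fwd=0.75, fwd_taken=0.60, bwd_taken=0.80):
    # 75% of branches are forward and taken 60% of the time;
    # the remaining 25% are backward and taken 80% of the time
    return fwd * fwd_taken + (1 - fwd) * bwd_taken
```

This gives 0.65: under these figures, roughly two-thirds of all branches are taken.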
So much for the statistics; now let us consider the techniques for dealing with control hazards. We have already discussed a hardware solution that reduces the number of stalls; beyond that, there are four approaches. The first and simplest method is to redo the fetch of the instruction following the branch, essentially by introducing stalls: until the branch direction is known and the target address is calculated, you keep the pipeline stalled, so every branch causes a performance loss. In our simple pipeline the number of stalls has fortunately been reduced from three to one, so every branch leads to the loss of one instruction slot. This solution is very simple to implement, because you do not have to check anything: after the instruction is decoded, if it is a branch you introduce a stall and then proceed to the next cycle, by which time both the condition outcome and the target address — PC + 4 or PC plus the immediate value that is part of the instruction — are known. Obviously this approach is not acceptable if you are interested in improving performance. The second approach is to treat every branch as not taken; that is, it is always assumed that the branch is not taken, so the next instruction to be executed is at PC + 4, and the pipeline proceeds in that direction, executing successive
instructions in sequence as if there were no branch. This is simply an assumption; it does not mean that all branches will be not taken — some branches will be taken. What is to be done in such a case? When the branch is taken, we need to turn the fetched instruction into a nop and restart the fetch at the target address. That is what must be done whenever the assumption made by the compiler turns out to be wrong. It has been found that 47 percent of branches are not taken on average, so in 47 percent of cases no correction is needed and there is no performance loss; but for the remaining 53 percent of cases there is some loss, because you have to convert the already-fetched instruction into a nop and restart the fetch at the target address. So that is the situation when the assumption is that branches are not taken. The third approach is the alternative scheme: treat every branch as taken. Unfortunately, even in our simple pipeline the branch target address becomes known only in the stage where the branch outcome is also known, so this assumption yields no gain; the approach has no advantage for the five-stage pipeline we are discussing, whereas there is some performance gain when the branch is assumed not taken.
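A rough sketch of the cost of the predict-not-taken scheme described above, assuming (hypothetically) the 14 percent branch frequency quoted later in the lecture, the 53 percent taken fraction given here, and the one-cycle misprediction penalty of our improved pipeline:

```python
def cpi_predict_not_taken(branch_freq=0.14, taken_frac=0.53, penalty=1):
    # only taken branches (mispredictions) pay the penalty;
    # all three default figures are the lecture's quoted statistics
    return 1.0 + branch_freq * taken_frac * penalty
```

So under these assumptions the CPI is about 1.07, much better than stalling on every branch.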
Now there is another approach, known as the delayed branch, in which the instruction following the branch is converted into a useful instruction. Normally, as we have seen, if the prediction is wrong you lose one cycle, because the instruction that was fetched has to be converted into a nop. Can we overcome this — execute that instruction without ever needing to nullify it? That is the idea of the delayed branch. We shall discuss these four techniques one after the other. For the first technique there is nothing more to discuss: as I have already mentioned, the processor simply introduces a stall after detecting a branch instruction in the instruction decode stage. So let us now focus on approach two, predict not taken. Who is making the prediction? Here the prediction is made by the compiler, which assumes the branch is not taken. In that case you execute successor instructions in sequence: you keep fetching from PC + 4, PC + 8, and so on, and keep executing. PC + 4 is already calculated, so use it to get the next instruction; chances are the branch is not taken. If the branch is actually taken, you have to squash the instruction in the pipeline. Suppose the i-th instruction is a branch, our prediction was not taken, and it turns out to be false — the branch is taken. Then the (i+1)-th instruction has to be turned into a stall, so one cycle is lost, and since the branch is taken the next instruction is fetched from the branch target address
and execution continues from there. So this is approach two, predict not taken. This can be done easily because, as I have already explained, the CPU state is not updated until late in the pipeline: the state is updated only in the write-back stage, where a register is modified and a permanent change is made. Before that, if the outcome is known and the prediction is wrong, there is no problem — you convert the instruction into a nop. Of course there is some performance loss, but that has to be accepted. Now consider predict taken, the third approach. On average 53 percent of branches are taken, but the branch target address is not available after instruction fetch in MIPS, so MIPS still incurs a one-cycle branch penalty even with predict taken. As I have already mentioned, for this simple pipeline there is no benefit from this assumption. However, there are machines where the branch target is known before the branch outcome is computed; in such processors significant benefits can occur, but not in the simple pipeline we have discussed.
Now we shall focus on the fourth approach, the delayed branch. We have a branch instruction, followed by several successor instructions, and a target address to which control jumps if the branch is taken. The instructions between the branch and the point where the branch takes effect are known as the sequential successors of the branch, and they are said to be in the branch delay slot. In general there can be a branch delay of length n, with n instructions in the delay slot; however, for the simple five-stage pipeline we have already discussed, only one delay slot is required, so in our case the branch delay slot holds just one instruction. Now we are interested in filling that delay slot. Here is the branch instruction, here is the delay-slot instruction at position i + 1, and here is the post-branch instruction, the target. We have to fill the delay slot with some instruction that is useful, because instructions in the branch delay slot get executed whether or not the branch is taken. The point to understand is that the instruction following the branch is always executed, taken or not; with a wrong prediction we would have to nullify it, but a delay-slot instruction is simply allowed to complete. Based on this observation we can think of a solution that helps improve the performance of the processor.
So the simple idea is to put an instruction that would be executed anyway right after the branch: here is the branch instruction, here is the delay slot, and here is the branch target or the successor instruction. The question is: what instruction do we put in the delay slot? We have to choose it with an objective: one that can safely be executed no matter what the branch does. That is, whether the branch is taken or not, the instruction can be executed and never has to be converted into a nop — that is the basic requirement. The compiler decides which instruction to put in the delay slot, and there are several approaches. The first possibility is an instruction from before — an instruction taken from before the branch instruction. Consider DADD R1, R2, R3 appearing just before the branch: if R2 equals 0 the branch jumps to the target address, and immediately after the branch is the delay slot.
What are we doing? In the normal flow this DADD instruction is executed, after which the branch instruction is encountered, so this instruction is executed whether the branch is taken or not. Now we move it into the delay slot, so the sequence becomes: branch if R2 equals 0, with DADD R1, R2, R3 filling the slot. The delay slot is thus filled with an instruction from before — it was above the branch and has now moved below it. As we know, a delay-slot instruction is executed irrespective of whether the branch is taken, and this instruction was supposed to be executed before the branch anyway, so here too it gets executed. Whether the branch is taken or not, there is no loss: we have placed a useful instruction that never has to be converted into a nop even if the prediction is wrong. This is possibly the best solution — we get that instruction executed for free. By the time we move past the slot, we know whether the branch is taken and also the target address, so the next fetch can be from PC + 4 or, if the branch is taken, from PC plus the immediate value that is part of the instruction. This is the preferred approach for filling the branch delay slot. Now what is the second possibility?
The second possibility is an instruction from the target — the place the branch jumps to. Say the branch, if R1 equals 0, jumps to the instruction DSUB R4, R5, R6. That instruction can be replicated in the delay slot, and the branch target address is then changed to point to the instruction after it, because the copied instruction already gets executed in the slot. This is very advantageous whenever the branch is taken in most situations: you fill the delay slot with an instruction from the target, but the performance improvement occurs only when the branch is taken. Yet another possibility is an instruction from the not-taken fall-through path: such an instruction can be moved into the delay slot only if its execution does not disrupt the program when the branch goes the other way. So this is another approach you can follow.
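The legality check for the preferred "from before" strategy can be sketched as a toy helper (hypothetical, not a real compiler pass): an instruction just above the branch may move into the delay slot only if the branch does not read a register that the instruction writes.

```python
def can_move_from_before(instr_dest, branch_sources):
    """Toy dependence check for filling the delay slot from before.

    instr_dest: register written by the candidate instruction
    branch_sources: set of registers the branch condition reads
    """
    # e.g. DADD R1,R2,R3 before 'branch if R2 == 0': writes R1, branch
    # reads R2 -> movable; DSUBU R1,R1,R3 before 'BEQZ R1': writes R1,
    # branch reads R1 -> not movable
    return instr_dest not in branch_sources
```

A real compiler must also check memory dependences and exceptions, but this register check is the essence of why the example in the next paragraph fails.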
Now let us see an example with three blocks. Block B1 is: LD R1, 0(R2); DSUBU R1, R1, R3; BEQZ R1, L. Here L is the target address, and after the BEQZ there is the delay slot where we have to put a useful instruction. The first thing we might try in this code sequence is to move DSUBU R1, R1, R3 after the branch. Unfortunately this cannot be done, because BEQZ depends on it: the branch tests R1, which DSUBU produces. Because of this dependence we cannot move the instruction after the branch, so the first, preferred, approach cannot be followed here. What are the other alternatives? If we know that the branch is taken with high probability, then the DADDU at the target could be moved into block B1 — that is, placed immediately after BEQZ R1, L — since it does not have any dependence on block B2. So when the probability of the branch being taken is high, this approach is followed; the move creates no problem, but the solution is good only when the branch is usually taken. The third possibility: knowing the branch is mostly not taken, the OR from the fall-through could be moved into block B1, since it does not affect
anything in B3. So we have three possibilities for filling the delay slot depending on the situation, and this example illustrates the various alternatives. To summarize the scheduling of the branch delay slot: first, the delay slot is scheduled with an independent instruction from before the branch — this is the preferred choice, as I have already mentioned. Second, the delay slot is scheduled from the target of the branch; you have to copy an instruction, it is useful only if the branch is taken, and it is preferred when the branch is taken with high probability, such as a loop branch, where we have seen the probability of being taken is larger. Third, as I have already said, the delay slot is scheduled from the not-taken fall-through; this is useful if the branch is not taken, and even if the branch goes in the unexpected direction the program must still produce the correct result. These are the three ways of filling the branch delay slot, and we have seen how performance can be improved.
This diagram summarizes the three possibilities. The first is from before: the instruction preceding the branch is placed in the slot. The second is from the target: you copy the instruction at the target address into the branch delay slot and modify the branch target address accordingly. The third is from the fall-through: here the slot is filled with the instruction DSUB R4, R5, R6, and as we have seen this approach is suitable when there is a high probability that the branch is not taken. The basic objective — the job of the compiler — is to make the successor instruction valid and useful; that is the philosophy used to fill the branch delay slot. Now some statistics on compiler effectiveness for a single branch delay slot: the compiler fills about 60 percent of the branch delay slots, and about 80 percent of the instructions executed in the filled delay slots are useful in computation. In other words, roughly 50 percent of the slots are usefully filled, so there is about a 50 percent improvement compared with the first approach, where we simply introduce a stall.
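The arithmetic behind the "roughly 50 percent" figure is just the product of the two quoted percentages — a one-line sketch:

```python
def usefully_filled_fraction(filled=0.60, useful=0.80):
    # ~60% of delay slots get filled by the compiler,
    # and ~80% of those filled slots do useful work
    return filled * useful
```

This gives 0.48, i.e. about half of the delay slots end up doing useful computation.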
Now we have not considered a very important aspect: the delayed branch downside. What if multiple instructions are issued per cycle, that is, when it is a superscalar processor? We have considered the simple situation of a scalar pipeline, issuing one instruction at a time, so the delay slot has to be filled with one instruction. But in a superscalar processor it is necessary to issue 2, 3, or 4 instructions, depending on the degree of the superscalar processor, and correspondingly that many instructions have to be filled into the delay slot, so the task of the compiler becomes much more difficult. Now, the performance of the different alternatives is given here: pipeline speedup is equal to pipeline depth divided by (1 + branch frequency × branch penalty). You see, the performance depends on two things: the branch frequency and the branch penalty. The branch frequency varies from program to program, so the improvement depends not only on the branch frequency but also on the branch penalty, and the branch penalty depends on the approach you are following, that is, the technique you adopt to improve the performance. So the first part, the branch frequency, depends on the program, and the second part, the branch penalty, depends on the approach. Based on the assumptions that 14 percent of the instructions are branches (a branch frequency of 14 percent), that 30 percent of the branches are not taken, and that 50 percent of the delay slots can be filled with useful instructions, these are the results for the different situations. The first case is the slow, stall pipeline, where we assume that the branch
penalty is three cycles. So whenever you have a branch penalty of three cycles, you get a CPI of 1.42 (1 + 0.14 × 3), and the speedup with respect to the unpipelined processor is 3.5; ideally it should be 5, but you get 3.5. The number of stall cycles incurred is 3, and the speedup with respect to the stall approach is of course 1 in this case, since this is the baseline. With this, let us compare the other techniques. The first is the fast stall pipeline, meaning we use additional hardware to reduce the number of stalls; as we have discussed, the branch penalty can be reduced from 3 to 1. As you do that, the CPI improves from 1.42 to 1.14, a significant improvement, and we find that the speedup is 4.4, a significant increase from 3.5, while the speedup with respect to the stall approach is 1.26.
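These two rows follow directly from the speedup formula above; a minimal sketch, using the lecture's assumptions of depth 5 and branch frequency 14 percent:

```python
def pipeline_speedup(depth, branch_freq, branch_penalty):
    """Pipeline speedup = depth / (1 + branch_freq * branch_penalty).
    Returns (CPI, speedup over the unpipelined processor)."""
    cpi = 1 + branch_freq * branch_penalty
    return cpi, depth / cpi

cpi_stall, speedup_stall = pipeline_speedup(5, 0.14, 3)  # stall pipeline
cpi_fast, speedup_fast = pipeline_speedup(5, 0.14, 1)    # fast stall pipeline
print(cpi_stall, speedup_stall)  # CPI 1.42, speedup about 3.5
print(cpi_fast, speedup_fast)    # CPI 1.14, speedup about 4.4
```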
Then if we consider the third case, predict taken: we have seen that for our simple pipeline there is always a loss of one cycle, so in this case it is effectively the same as the second approach and there is no performance gain, which is quite obvious. The fast stall pipeline and predict taken give the same performance: the CPI remains 1.14, because in this case also there is always a loss of one cycle; the speedup with respect to the unpipelined processor remains the same, and the speedup with respect to the stall approach remains 1.26. For predict not taken, the branch penalty is 0.7, because 70 percent of the branches are taken and the one-cycle loss occurs only for those, so the CPI is 1.10. We find that the CPI improves compared with the previous case, with a consequent improvement in the speedup with respect to the unpipelined processor, and the speedup with respect to the stall approach becomes 1.29. Last but not least is the delayed branch approach, in which case the branch penalty is 0.5, since we have seen that in 50 percent of the cases the slot can be filled with a useful instruction; the CPI is 1.07, very close to 1, the speedup is a very good 4.7 with respect to the ideal of 5, and the speedup with respect to the stall approach is 1.34. All of this we call static branch prediction: the prediction is done with the help of the compiler, and the compiler can reorder instructions to further improve the speedup, as we have already discussed. Later on we shall consider another approach, which is particularly important because of the cost of stalls.
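The whole comparison can be reproduced from the one formula; here is a sketch computing all five rows under the lecture's assumptions (depth 5, branch frequency 14 percent, 70 percent of branches taken, half the delay slots usefully filled). Computing the ratios from unrounded CPIs gives values that may differ in the last digit from the figures quoted above, which appear to be derived from already-rounded speedups.

```python
# All five schemes from the lecture, under its stated assumptions.
depth, branch_freq = 5, 0.14
penalties = {
    "stall pipeline": 3.0,
    "fast stall pipeline": 1.0,
    "predict taken": 1.0,
    "predict not taken": 0.7,   # one-cycle loss only on the 70% taken branches
    "delayed branch": 0.5,      # half the delay slots usefully filled
}
base_cpi = 1 + branch_freq * penalties["stall pipeline"]
results = {}
for name, penalty in penalties.items():
    cpi = 1 + branch_freq * penalty
    results[name] = (cpi, depth / cpi, base_cpi / cpi)

for name, (cpi, vs_unpipelined, vs_stall) in results.items():
    print(f"{name:20s} CPI={cpi:.2f}  "
          f"vs unpipelined={vs_unpipelined:.2f}  vs stall={vs_stall:.2f}")
```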
Stall reduction is crucial in modern processors, which issue and execute multiple instructions every cycle: we need a steady stream of instructions to keep the hardware busy, and stalls due to control hazards dominate, so reducing them is very important. So far we have looked at static schemes for reducing branch penalties, where the same scheme applies to every branch instruction. What do we mean by static? Static means that if there are 100 branches, we adopt the same policy for all 100, because the decision is made by the compiler. However, there is potential for increased benefit from dynamic schemes. In a dynamic scheme, the prediction is made while instruction execution is in progress: for one branch the prediction can be not taken, for another it can be taken, and it keeps changing dynamically as execution proceeds. This is done at execution time with the help of hardware, so an appropriate scheme can be chosen separately for each branch. Branches, such as the branch to the top of a loop, have different behaviors, taken or not taken, and the hardware can learn the appropriate scheme from the observed behavior. Dynamic branch prediction schemes can be used both for direction prediction, taken or not taken, and for target prediction. In my next lecture I shall discuss these dynamic techniques in detail, and we shall see how the performance is improved by adopting them. Thank you.