This is the combinational implementation. What I mean by a combinational implementation is this: there is a clock, on a clock tick you get the new PC, then you compute this whole thing combinationally, and when the next clock edge comes you register the new program counter, and the process repeats. So, as you can see, the clock period is determined by the time taken to do this whole combinational computation, which is why it is called a single-cycle implementation: in one cycle you are doing everything.

Then we talked about the multi-cycle implementation, where essentially what you do is you put latches, not exactly pipeline latches, but latches, so that you can carry the data from one cycle to the next. In the first cycle you do whatever needs to be done in the instruction fetch stage, and the instruction register stores the result of that. In the next cycle only the decode stage is active: we read the register file and do the zero or sign extension, and whichever wires cross this stage boundary hold the values computed in this cycle. In the next cycle only the execute stage is active, then the memory stage, then the write-back stage. So essentially you now have shorter cycles, but each instruction takes 5 cycles.

And the question arises: which one is better, the single-cycle combinational implementation or the multi-cycle implementation? Here is an example that we talked about last time. Usually the stages are not balanced, because different things happen in each stage, so the stage times differ. Assume that instruction fetch takes 2 ns, decode and register read take 1 ns, execution takes 1 ns, data memory takes 3 ns, and write back takes 1 ns. And let us assume that the branch frequency is 20 percent. I mentioned last time that, if you look at this design, branch instructions take 4 cycles to complete: in cycles 1, 2, 3, 4 the branch resolves, and the new PC is known for the next cycle. Similarly, store instructions take 4 cycles, because by the memory stage the store has received both its value and its address, so we do not need 5 cycles for a store. Everything else needs something written to the register file and requires all 5 cycles.

With this data we can compute whatever we want. We have branch frequency 20 percent and store frequency 10 percent. Assume that the multiply/divide frequency is 5 percent and that these take longer, 30 ns, and that the total instruction count is 100. So given this configuration and this particular program, with 20 percent branches, 10 percent stores and so on, we have to compare the multi-cycle implementation against the single-cycle one.

Let us see how to do that. The first thing we calculate for the multi-cycle implementation is the cycle time: it is 3 ns. Why is that? Because the longest stage takes 3 ns, so every stage has to get 3 ns. That gives us a frequency of about 333 MHz, and we can calculate the CPI: branches take 4 cycles at 20 percent, stores take 4 cycles at 10 percent, multiply/divide take 10 cycles (30 ns at a 3 ns cycle) at 5 percent, and whatever is left takes 5 cycles. That gives us a CPI of 4.95.
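As an aside, here is a minimal Python sketch (not from the lecture) that reproduces the multi-cycle numbers just derived; the stage times and instruction mix are exactly the assumptions stated above.

```python
import math

stage_ns = {"IF": 2, "ID": 1, "EX": 1, "MEM": 3, "WB": 1}  # assumed stage times

cycle_ns = max(stage_ns.values())          # longest stage sets the clock: 3 ns
freq_mhz = 1000 / cycle_ns                 # ~333 MHz

mul_div_cycles = math.ceil(30 / cycle_ns)  # 30 ns of work -> 10 cycles

# Branches 20% (4 cycles), stores 10% (4 cycles), mul/div 5%, the rest 5 cycles.
cpi = 0.2 * 4 + 0.1 * 4 + 0.05 * mul_div_cycles + 0.65 * 5
exec_ns = 100 * cpi * cycle_ns             # 100 instructions in the program

print(freq_mhz, cpi, exec_ns)              # ~333.3 MHz, CPI ~4.95, ~1485 ns
```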
As I said, the CPI should be close to 5, slightly below because of the stores and branches. And as you can see, because the multiply/divide frequency is so low, it actually does not matter much. So now we can calculate the execution time of this program, which has 100 instructions: 100 times CPI times cycle time, which gives us 1485 ns. Any question on this?

So what about the single cycle? In the single-cycle design everything completes in one cycle, so you add all the stage times up and get 8 ns; that is your cycle time, and it gives a frequency of 125 MHz. The first thing to notice is that the multi-cycle frequency is not really 5 times higher: the ratio is between 2 and 3. The reason, as we said, is that the multi-cycle cycle time gets determined by the longest stage, and there you lose a lot. For example, in the decode/register-read stage you wait for 2 ns doing nothing, because the cycle time is 3 ns.

So what is the CPI of the single-cycle implementation? Pretty much everything takes one cycle. Even though the branches resolve early, you have nothing to do with that; you cannot really exploit it. They have to wait for the cycle boundary, because in a clocked system there is nothing you can do in the middle of a clock: you can compute, but events happen only at clock boundaries. So 95 percent of instructions take one cycle, and the 5 percent multiply/divide instructions take longer. How long? They require the ceiling of 30 over 8, that is, 4 cycles. That gives us a CPI of 1.15. Again, notice that the multi-cycle CPI is not really 5 times this. The ratio depends on a lot of things; you cannot blindly say that the multi-cycle CPI is going to be 5 times larger. So what is my execution time? 100 instructions times CPI times cycle time, which is 920 ns. As we mentioned last time, in most cases you should expect that the combinational design will be better in terms of throughput, so as such there is no reason to go to a multi-cycle implementation here. You do this only where it pays: we will see concrete examples in real processors where certain floating-point units are made multi-cycle, because there you actually save area. Here you save nothing; you are losing performance by doing a multi-cycle implementation.

Here is one more example, which assumes a balanced pipeline: all the stages take the same time. Let us see what happens then. Everything else is unchanged; the only thing I have changed is that every stage now takes 2 ns. So my multi-cycle cycle time is 2 ns and the frequency is 500 MHz. For the CPI everything remains unchanged except one number: 30 ns now means 15 cycles for multiply/divide, so the CPI becomes 5.2. As you can see, the multiply/divide now actually hurts you a little, pushing the CPI beyond 5. And you can calculate the execution time, which turns out to be 1040 ns. What is happening on the single-cycle side? My cycle time is 10 ns and the frequency is 100 MHz. Now you can see that the clock ratio is exactly 5, as you would expect: if you have 5 balanced stages, you get a 5 times slower clock in the single-cycle design. You can calculate the CPI: 95 percent of instructions take one cycle and the 5 percent multiply/divide take 3 cycles. That gives 1.1, and you can see that the CPI ratio is also almost 5.
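The same style of sketch, under the same assumed instruction mix, reproduces the single-cycle numbers and the balanced-pipeline comparison; the helper function is illustrative, not something from the lecture.

```python
import math

def cpi_and_time(mix, cycle_ns, n=100):
    """mix: list of (fraction, cycles) pairs; returns (CPI, total ns)."""
    cpi = sum(f * c for f, c in mix)
    return cpi, n * cpi * cycle_ns

# Single cycle, unbalanced stages: clock = 2+1+1+3+1 = 8 ns.
mul_div = math.ceil(30 / 8)                                   # 4 cycles
print(cpi_and_time([(0.95, 1), (0.05, mul_div)], 8))          # ~(1.15, 920)

# Balanced stages, 2 ns each:
print(cpi_and_time([(0.2, 4), (0.1, 4), (0.05, 15), (0.65, 5)], 2))  # ~(5.2, 1040)
print(cpi_and_time([(0.95, 1), (0.05, 3)], 10))               # ~(1.1, 1100)
```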
And when you calculate the execution time, 1100 ns, you run slower than the multi-cycle design. So where do you actually lose in the single cycle? Branch and store instructions. They run for less time, exactly. Here branch and store instructions need only 4 stages of work, and how much is that? 8 ns. They are finished by 8 ns, but we have to wait another 2 ns, because events occur only at cycle boundaries. That is where we lose those nanoseconds. You can actually check that the deficit comes exactly from that: 30 such instructions wasting 2 ns each is the 60 ns difference. Ok. So that is the theory of what happens in an unbalanced pipeline and what happens in a balanced one. No questions? Ok.

Now, multi-cycle designs are often used as an intermediate design before you go to a pipelined design. If you look at the multi-cycle design carefully, you will find that in the second cycle I already know whether it is a branch. If I go back: here I decode the instruction in the second stage, so I know that it is a branch instruction. In that case, if I know that it is not a branch, I can actually start fetching the next instruction, because I know the next instruction will be at the sequentially next address. That is one option. Also, when the ALU is doing an addition, let us say, the decoder is just sitting idle. In the multi-cycle design the first cycle is fetch, in the next cycle the decoder works on the fetched instruction, and then it sits idle while execution, memory access and write back happen, and so on. In summary, exactly one stage is active at any point in time, which wastes a lot of hardware.

So you can form a pipeline: you process multiple instructions in parallel, with each instruction in a different stage of processing called a pipe stage. And how do you synchronize between pipe stages? You put pipeline latches in between. This is what it looks like. Stage boundaries need pipeline registers, or latches: wherever you see a red wire crossing a blue dotted line, you know that you need a pipeline register there. For example, here is a pipeline register which holds your instruction. Here you have to hold the zero- or sign-extended operand. Here you have to hold the two operands coming out of the register file. Similarly, here you have to hold the comparison outcome, here the ALU outcome, here the store value coming out of the register file, and so on. These are the pipeline registers, and that is how you are going to synchronize: whenever the clock ticks, things move from each stage to the next. Notice that although I have drawn the register file here, the register file write actually happens in the write-back stage; keep that in mind. It is just the physical placement; the actual write happens in that last cycle, when we pick up whichever value goes to the register file. So this is the pipelined MIPS processor. Any question?

So what do we gain and what do we lose? We gain in terms of parallelism, because we are extracting parallelism from a sequential instruction stream. This is known as instruction-level parallelism, and pipelining is the simplest form of ILP: we essentially form a pipeline of instructions and try to execute five instructions in parallel, if we have a five-stage pipeline.
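To make that concrete, here is an idealized sketch, assuming no hazards, no stalls and perfectly balanced stages (none of which holds in practice, as we will see): n instructions on a k-stage pipeline take k + n - 1 cycles instead of n * k.

```python
def total_cycles(n, depth=5, pipelined=True):
    # Idealized model: no hazards, no stalls, perfectly balanced stages.
    if pipelined:
        return depth + (n - 1)    # fill the pipe once, then one instruction/cycle
    return n * depth              # multi-cycle: each instruction has the pipe alone

n = 100
print(total_cycles(n) / n)                    # CPI ~1.04, tends to 1 for large n
print(total_cycles(n, pipelined=False) / n)   # CPI = 5
```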
And ideally you should be able to complete one instruction every cycle: ideally one instruction enters the pipeline every cycle and one comes out every cycle. So ideally your CPI should be five times smaller compared to a single-cycle or multi-cycle design, assuming there are no multi-cycle operations in the pipeline, because if you observe the end of the pipe, one instruction comes out every cycle.

What do you lose? Each pipe stage may get lengthened a little due to control overhead. These latches are not really ideal: they have something called a skew time and a setup time. When the clock ticks, the value does not really get transferred from input to output instantly; it takes time. That limits the gain due to pipelining, because, going back to this example, your fetch latency may not be exactly 2 ns anymore; it will be slightly more than that. So each instruction takes slightly longer, and that affects the overall gain: it will really not be five times.

Will there be resource conflicts? That is another problem. Here you can see that my register file is needed in two stages: here I read from the register file, and here I write to it. And now both of these stages execute concurrently: one instruction will be reading from the register file while some other instruction is trying to write to it. So there may be a resource conflict. There is a similar problem with the memory here: instruction fetch requires accessing memory, and the data memory stage also requires it. And this is where the RISC versus CISC debate becomes quite relevant: if you look at a RISC instruction set architecture, it is very clear which resource an instruction will require in which stage. With CISC it is not at all clear: for a CISC instruction it can be very difficult to figure out what resources it will require to execute; it may require, say, 5 memory accesses to complete one instruction. You really do not know how many times it will need the memory resource, and that makes pipelining very difficult. In that way, RISC is much easier to pipeline. You also require bigger memory bandwidth now, because you will have to access memory twice every cycle, once for an instruction and once for data, since they happen concurrently. But overall, execution time goes down; that is the overall benefit, even though each stage gets slightly lengthened by the latch overhead.
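To put a number on that lengthening, here is a small sketch; the 0.2 ns skew-plus-setup figure is purely an illustrative assumption, not a number from the lecture.

```python
stage_ns = [2, 1, 1, 3, 1]       # the unbalanced stage times from the example
latch_overhead_ns = 0.2          # assumed latch skew + setup time per stage

ideal_cycle = max(stage_ns)                     # 3.0 ns
real_cycle = max(stage_ns) + latch_overhead_ns  # 3.2 ns

# The speedup of the pipelined clock over the single-cycle clock shrinks:
single_cycle = sum(stage_ns)                    # 8 ns
print(single_cycle / ideal_cycle)               # ~2.67
print(single_cycle / real_cycle)                # ~2.5
```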
Pipelining, however, has one major problem: pipeline hazards, and we have already looked at some simple examples. There can be two kinds of situations. One is that when you get an input, you do not yet know what is to be done with that input; we have seen one example of that. You might have forgotten, but you can review it in your slides. The second problem is that you want to do a computation where the data may not be available in time. So there are two different types of hazards that can happen. If you look at this slide, you can figure out what kind of hazards to expect here. One problem, as you can see, is of the first kind, that you do not know what to do, and it involves the program counter. Your program counter gets updated here, in the execute stage, if it is a branch instruction. But you have already fetched an instruction by then. The next instruction fetch should be happening here, but you really do not know what to fetch, because the program counter is not yet updated. That is called a control hazard.

The second problem arises when you have an instruction which produces a value, say it writes to register 20, and the very next instruction reads from register 20. The first instruction will naturally write to register 20 only here, in this particular cycle, but the next instruction will be reading the register file in this earlier cycle. So if you look at the timing of these instructions across the clock cycles: the first instruction, the one that produces register 20, writes it here, while the next instruction reads register 20 from the register file here, before it is written. So it is going to get a wrong value from register 20 for sure. Is that clear to everybody? That is a data hazard, unless we do something to avoid the problem. The first problem is easier to solve, in the sense that the solutions are easier, because branches are not every instruction, so I can afford to wait whenever I see a branch. But the second one is very frequent: back-to-back register use, where I produce a value and you consume it in the very next instruction. If you say here that you are going to delay this register read until the write happens, that is correct, but it will incur a very large performance penalty. Are these two types of hazards clear?

And there is a third type, called a structural hazard, which arises due to resource conflicts. It happens if the same resource is accessed in at least two stages of the pipeline, like the register file or the memory we already talked about. Control hazards are the problems with branches: a branch does not resolve immediately after it is fetched, so what do you fetch in the next cycle? You don't know. This defines an important parameter called the branch penalty, which we will define very soon: how many cycles you lose if you choose to do nothing about it. And the third one is data hazards: dependent instructions may not be able to execute back to back, because the dependence does not resolve in time.

So what you get is: the speedup of a pipeline is the pipeline depth divided by one plus the stall cycles per instruction. Because of these hazards you may have to insert stall cycles, so essentially you really do not get the full depth as speedup; it is divided down by these overhead effects.
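Here is a minimal numeric sketch of that speedup formula; the stall figures plugged in are illustrative assumptions.

```python
def pipeline_speedup(depth, stalls_per_instr):
    # Speedup over a non-pipelined machine of the same depth, assuming
    # balanced stages and ignoring latch overhead.
    return depth / (1 + stalls_per_instr)

print(pipeline_speedup(5, 0.0))   # 5.0: the ideal, hazard-free case
print(pipeline_speedup(5, 0.4))   # ~3.57: stalls eat into the gain
```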
So first let us look at structural hazards, because as such they are boring, in the sense that there is no smart solution. Usually structural hazards are resolved by throwing in more resources: you have structural hazards when you have fewer resources than you need. For example, if you have an un-pipelined functional unit that takes 20 ns to complete one operation, the next operation cannot be started on that functional unit before 20 ns have passed; essentially, you can start one operation every 20 ns. That is one type of structural hazard: even if I have an operation ready to go, I cannot issue it, because I do not have enough functional units. On the other hand, if I had more, I could have issued two such operations in parallel.

And we have already talked about the other two cases, the memory and the register file ports: as I mentioned, we need to read from and write to the register file, so unless you have separate read and write ports there is no way to resolve this structural hazard. Essentially, you throw in more resources to resolve it. So the question now is, well, it sounds funny, right, that we deliberately have fewer resources than we need, even though we know how much we need. Why should we have fewer resources? The primary reason is the reduction in complexity. You may say, well, I could have gotten rid of structural hazards completely if I had 100 ALUs. Sure, but would you build a processor with 100 ALUs? No, for the simple reason that maybe once in a while you require 100 ALUs, but not always. It unnecessarily increases your complexity and also increases your power and energy consumption. That is why you normally design a processor for the common case; that is what Amdahl's law says, right? And in the rare cases you resort to some other way of resolving these hazards. So, make the common case fast: pipelining the divider may only waste area and congest the layout, so dividers are seldom pipelined, even though you know that if you do not pipeline the divider and two division operations arrive back to back, the second one will have to wait a long time. You accept this because that is not the common case; you look at the frequencies of such occurrences and decide what to do. All types of hazards introduce stalls, or pipeline bubbles.

So here is an example. Suppose 40 percent of the accesses in the program are data memory accesses, and you have just one memory port, let us say. Adding a second memory port slows down the clock by a factor of k. Do not ask me why this happens; it has to do with how memory modules are actually built, with the electrical properties of the design, and that is why introducing a new port slows a memory module down. I am not going into the details of that, but it is true that when you add ports to a memory structure, it slows down. So here I am saying that adding a second memory port slows down the clock by a factor of k, and the question is: what is the maximum k so that the two-ported system is still a gain? I am going to pipeline a processor and I know these statistics, so the designer has two options: either go with the single-ported memory system and take the stalls because of it, or have a two-ported memory system but a processor with a slower clock. Which one is better? For that I need to know the break-even point, the maximum k at which the two-ported system is still the same or better. Then I can go back to the circuit designer and ask: can you build a memory module within this ratio k? If yes, I will take it; if not, I have to rethink and perhaps take the stalls. So how would you calculate this? If I have a single-ported memory, what is going to be my execution time for this program? You can assume some clock frequency; what matters is how you calculate it. If I have a single port, then whenever the data memory is in use, I should not be able to use that memory module for anything else, right?
Data and instructions share this memory, so whenever I access data, with a single port I cannot fetch an instruction at the same time. What is the implication of that? What gets affected? Here is my pipeline: I can fetch here, I can fetch here, but I cannot fetch here, right? This fetch has to wait; I can only fetch in the next cycle. So how do you quantify this? What is my execution time going to be? For every data memory access, what am I adding to the execution time? One extra cycle, right. So I can say that my single-port CPI is 0.6 times 1 plus 0.4 times 2. Can I do that for a pipelined implementation? It is a fair enough approximation. And how much is that? 1.4. So my single-port execution time, for n instructions and a cycle time of tau, is 1.4 times tau times n. Is that clear to everybody? I have made a gross approximation here, but it will be more or less right. What about the two-port CPI? It is 1. So what is the two-port execution time going to be? 1 times (k times tau) times n, because the cycle time is now k tau. I want the two-ported system to be at least as good, so I want k tau n to be less than 1.4 tau n, that is, k less than 1.4. So adding a port should not slow down my clock by more than 1.4 times; if it does, I cannot go to a two-ported system. (A short code sketch of this break-even calculation follows after this paragraph.) Any questions on this? Keep in mind that every time you throw in some resource to resolve a structural hazard, you are going to pay somewhere else. There is always a tradeoff in most of the things you come across in computer systems; it is usually not one-way, and you have to carry out a tradeoff analysis to decide which way to go.
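As promised, a sketch of the memory-port break-even calculation, under the 40 percent data-access assumption; tau and n are arbitrary, since they cancel out.

```python
def exec_time(n, cpi, cycle):
    return n * cpi * cycle

tau, n = 1.0, 100                  # arbitrary cycle time and instruction count

cpi_1port = 0.6 * 1 + 0.4 * 2      # each data access steals the fetch port: 1.4
cpi_2port = 1.0                    # no structural hazard, but a clock slower by k

# Two ports win only while 1.0 * (k * tau) * n < 1.4 * tau * n, i.e. k < 1.4.
print(cpi_1port / cpi_2port)       # the break-even k: 1.4
for k in (1.2, 1.6):
    print(k, exec_time(n, cpi_2port, k * tau) < exec_time(n, cpi_1port, tau))
# prints: 1.2 True, 1.6 False
```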
So let us take a look at control hazards a little more carefully. This is the problem depicted here: cycles go in this direction, and these are my pipeline stages. Here is a branch instruction, which could be anything, say a branch-if-equal-to-zero, and it executes here. What is shown in the diagram is that it actually resolves one stage later, but let us assume that we can squeeze that multiplexer, this particular one that selects the next PC, into this cycle, so that it does not spill over. Then I can resolve the next PC in that particular cycle. The problem is that I do not know what to fetch in these two cycles: here the branch is not yet computed, here it is actually being computed, and only here do I know what to fetch, provided I have a bypass path which tells the fetcher, this is your PC to fetch from. Is this clear to everybody? These are called pipeline bubbles, two of them, and they increase your average CPI if you do nothing: essentially, for every branch instruction you are adding two extra cycles of overhead. So now the question is, can we reduce it to one bubble? Let us go one step at a time: instead of removing both the bubbles together, let us first ask the simpler question, can I get rid of one bubble out of these two?

At this point we need a couple of definitions; we have already defined these two terms, target and fall-through. In this case, when the branch executes, whatever label appears here is your target. The fall-through is the next sequential instruction: if the branch does not go to the target, it will fall through.

Now, the MIPS R3000 has one bubble. That is a pipelined MIPS processor, very similar to the one we are looking at. So the question is, how do they manage one bubble instead of two? By the way, the one bubble they do have is called the branch delay slot: the instruction slot just after the branch, which has the bubble. So how do they get rid of one bubble? They exploit the clock cycle phases. On the positive half they compute the branch condition, and on the negative half they fetch the target. So in their design, instruction fetch always happens in the second half of the cycle, and branch execution completes within the first half. So now what do you have? You have the branch instruction here, a bubble here, and you actually know the fetch target here, right here, because the branch execution completes in the first half, communicates the PC to the fetcher, and the fetcher fetches in the second half. So can somebody tell me what I lose by doing this? Of course, I gain back one cycle, definitely; I have got rid of one bubble. But I must be losing something; what is that? The half cycle has to be long enough to... exactly. My half cycle should be long enough to accommodate the branch execution, and also long enough to complete the fetch from memory. So in that sense you may be sacrificing cycle time: you may be running at a slower frequency, but you are getting back one bubble per branch.
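To see what the half-cycle constraint can cost, here is an illustrative sketch reusing the stage times from the earlier example; the lecture itself does not attach concrete numbers to this trade-off.

```python
fetch_ns, branch_exec_ns = 2.0, 1.0   # IF and EX times from the earlier example

# Normal design: the longest stage sets the clock (MEM at 3 ns in that example).
normal_cycle = 3.0

# R3000-style design: branch resolution must fit in the first half cycle and
# the instruction fetch must fit in the second half.
half = max(fetch_ns, branch_exec_ns)
r3000_cycle = 2 * half                # 4 ns: a slower clock buys back one bubble
print(normal_cycle, r3000_cycle)
```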
A question: is it only the branch instructions that need to execute in the half cycle? Yes. So what does that mean? Let us go back to this diagram again; we keep referring to it. If you look at what the branches do, they do two things, right: they compute the target in the adder here, and they carry out a comparison here, and these go in parallel. In most cases you can expect this to be the critical path. So all I am asking is: can I finish this in half a cycle? That is what I am really demanding. It may or may not be possible, depending on how long the half cycle is.

Yes? For a branch instruction, you know the offset here, in decode. Yes, you know it here. So we could have an alternative: the second cycle fetches the fall-through instruction itself, the third cycle fetches the target, and in the fourth cycle, when the branch executes, we decide whether the right answer was the next instruction or the target. So essentially we have created only one bubble, and we do not need a half cycle; we do not need to stretch the clock. Well, in the scheme here you have two bubbles, right, the one after the branch and the target? No, I am fetching both the instructions. And then you are going to keep one, right? Yes, that is possible. You can do that, though you are complicating your PC sequencing. Essentially what you are saying is that after the branch I am going to fetch the fall-through all the time, then I will fetch the target, and then I will set my PC back to the right one. Yes, that is possible; you can do that. So you are saying that here I am going to fetch the fall-through, and here I am going to fetch the target, which I know from... yes, you almost know it. There is one small problem; can someone tell me what extra thing I need to be able to do this? We need the target at that stage, yes. If you remember, the branches carry the offset within the instruction, but you need one more thing to compute the target. What is that? The condition? No, the condition is different; we said we just need to know the target. There are two things that come with a branch, right: one is the fall-through, which is trivial, and the other is the target, part of which, the offset, is in the instruction. But I do not know the target itself yet here; I just know the offset. What do I need on top of that? An adder, exactly. You need an adder here, in the decode stage, to be able to compute the target. So yes, if you can afford an adder in the decoder, you can do that. MIPS does not have that.

So is this really what MIPS has done to solve the problem? I mean, it has not really solved the problem; they have reduced it from two bubbles to one bubble. The PC update hardware, that is, the selection between the target and the next PC, works on the falling edge: on the falling edge you have the multiplexer which selects the new PC and latches it. Any question on this? So how do they actually maintain this constraint that branches execute in the first half? Well, that is how they design the hardware. You are asking: the branches go through the same ALU, so will the other ALU instructions spill over into the second half? They have designed the adder to operate in a half cycle; the adder is optimized for that. Exactly. The ALU has other operations, right, other than additions, and those can be sped up; if you do this, you have probably made the ALU operate in half a cycle for most instructions, because the adder is often one of the longest paths, and the logical instructions are the simplest. Any other question on this?

So this bubble is your branch delay slot. All right. Now, clearly, the way the pipeline is given, it is impossible to get rid of this remaining bubble; no way. So what can I do that is still useful? The question becomes: how can I make good use of this branch delay slot? Instead of wasting the cycle all the time, can I use it? Can I put an instruction after the branch instruction, in the delay cycle? What type of instruction can I put there, or from where can I bring these instructions? Can I put any instruction here? Will that be correct? No, not any instruction. Instructions sufficiently far from the branch instruction, someone says. Let me draw an example; let us take an if-else control flow. This is my branch: the condition will be translated into a branch instruction. In most cases this is going to be the target, and this is the fall-through. Now, from where can I bring an instruction which is going to be useful?
I know it has to be correct; as you have suggested, I cannot fill in just any instruction. Can you tell me why? Why can't I put anything here? If I take something from the fall-through side: ok, if the branch is taken, then that won't be useful. Exactly, right, that's a very good example. If the branch is taken, I should not be executing anything from here, so clearly I cannot put something from the fall-through into the branch delay slot as it is. So what can I do? We can go over here, to this instruction here. Exactly. These are often called the convergence points: the branch diverges and then the paths converge here, and these are often used for filling up the branch delay slot, instructions from here. But you have to be careful. Why? Why do you have to be careful? Exactly: the value that is produced by this instruction. For example, suppose this instruction writes to register 20. Then register 20 should not be used on the path that the branch actually executes before the convergence point; otherwise that path will get a wrong value of register 20. So you have to be very, very careful when boosting something into the branch delay slot. Any other option? We can move something from above the branch. Exactly, we can move something from before the branch into the branch delay slot. Again a constraint applies: only those instructions can be moved whose execution can safely be delayed, for example, instructions the branch condition does not depend on. This is actually the somewhat easier option, bringing something from above into the delay slot.

So it is the job of the compiler, at the end of the day, to fill in the delay slot appropriately. And if the compiler cannot find anything to fill it with, it will put a no-op there, saying, well, I have nothing to fill in. That essentially amounts to losing one cycle, but it is still better than losing one cycle all the time; sometimes you will actually be doing something useful in the branch delay slot.

All right. Given the situation at that point in time, this was considered to be a very good solution, a smart one: give the compiler the flexibility to fill in this particular slot, as opposed to doing something in hardware. You could have done something in hardware, but instead you have given the compiler the flexibility to fill up this slot. Later, though, it became a big headache for the architects: you have to carry forward this branch delay slot forever. The problem is that once you have shipped a compiler which emits code assuming a branch delay slot, you cannot throw away those binaries down the line; they still have to run correctly on the machine. Later, when we see sophisticated techniques to get rid of all the bubbles, this becomes a big headache, because essentially, whatever you do, you still have to execute the instruction following the branch: your compiler emitted code that expects some instruction to execute there. You cannot just skip that particular instruction. So anyway, that is your branch delay slot.

So the question, as the compiler writer, is: can we utilize the delay slot? The delay slot is always executed, irrespective of the outcome of the branch. So, boost instructions common to the fall-through and target paths into the delay slot, or bring them from earlier than the branch; that is what we just discussed (a small sketch of the safety check involved follows below). It is not always possible to find such an instruction.
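That safety check can be sketched as a simple scan of the other path; the instruction representation below is a made-up toy (destination register plus source registers), not a real compiler pass.

```python
def safe_to_boost(instr, other_path):
    """instr: (dest, srcs) from one side of the branch. It may move into the
    delay slot only if the register it writes is not read on the other path
    before being overwritten there (the delay slot always executes)."""
    dest, _srcs = instr
    for d, srcs in other_path:
        if dest in srcs:
            return False          # the other path reads our result: unsafe
        if d == dest:
            return True           # the other path overwrites it first: safe
    return True

# Toy example: boosting "r20 = r1 + r2" from the fall-through path is unsafe
# if the taken path reads r20 first.
print(safe_to_boost(("r20", ("r1", "r2")), [("r5", ("r20", "r3"))]))  # False
print(safe_to_boost(("r20", ("r1", "r2")),
                    [("r20", ("r4",)), ("r5", ("r20",))]))            # True
```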
You have to be careful, as we said: you must boost something that does not alter the outcome of the fall-through or target basic blocks. If the branch delay slot is filled with a useful instruction, then we lose nothing in CPI; otherwise we pay a branch penalty of one cycle. That is the branch penalty: the number of cycles you lose if you have nothing to put there. For example, with 20 percent branches and half the slots filled, you would pay roughly 0.2 times 0.5, that is, 0.1 extra CPI.

Ok. So what else can we do? Prediction, right; we looked at that as one of the examples. You could actually try to predict the outcome of the branch. Essentially what we are asking is, as early as possible in the pipeline, answer me two questions: first, tell me if this is a branch, and once we know that it is a branch, tell me where the branch is going, before the branch is actually executed. That is where the importance of prediction comes in. One of the simplest techniques is to put a branch target cache in the fetch stage; this is called a branch target buffer. What it stores, for a branch instruction, is what happened to the branch when it executed last: that is what it remembers in this branch target buffer. So the next time you fetch the same branch, you can look up this buffer and know where it went last time. Of course, that could be wrong; it is not that the branch will go the same direction every time.

So what does this BTB look like? It is a cache, as the name suggests. It has multiple entries; each entry has a valid bit, just like your cache, each entry has a tag, and each entry has a branch target, and each entry corresponds to some branch in your program. In addition, if you have a set-associative branch target buffer, you will have some bits to carry out the replacement policy, certain bits for the replacement algorithm, like your LRU bits, which we will not discuss at this point. How many of you have not heard of the least recently used replacement policy? Raise your hands. Ok, that's wonderful. So if you want to do LRU replacement in the BTB, you need some bits to remember which entry is least recently used, which is most recently used, and so on and so forth. But I will not refer to this for now.

So the first question that arises is: fine, I have this table; how do I index into it? What am I doing? I am at a particular PC, I fetch the instruction, and I need to know two things: tell me if it's a branch, and if it's a branch, tell me where it went last time. I should get answers to both questions by looking up this table. So what I do is take the current PC that I have just fetched from, and suppose the table has, normally a power of 2, say 2 to the k entries. I shift out the last two bits, because instructions are 4 bytes long, so the last two bits are 0 anyway and carry no information, and then take the next k bits. That is, I take the least significant k bits after dropping the last two bits, and use them to index into the table. That gives me one entry in the table. Then I look at that entry's tag; the tag stores the remaining bits of the PC, the upper bits. If the upper bits of my PC match the tag, then I know: oh, this is the branch I just fetched that is currently sitting in the table. So I have actually got the answer to the first question, that this is a branch, because it is in the BTB, and the entry's target field answers the second question: it tells me where the branch went last time.
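Here is a minimal sketch of the direct-mapped BTB lookup just described, together with the insert-on-taken policy discussed next; the table size and the Python representation are illustrative assumptions.

```python
K = 10                        # 2**K entries: an assumed table size
NUM_ENTRIES = 1 << K

# Each entry: (valid, tag, target); everything invalid initially.
btb = [(False, 0, 0)] * NUM_ENTRIES

def btb_lookup(pc):
    word = pc >> 2                    # drop the two zero bits (4-byte instructions)
    index = word & (NUM_ENTRIES - 1)  # low k bits pick the entry
    tag = word >> K                   # remaining upper bits form the tag
    valid, stored_tag, target = btb[index]
    if valid and stored_tag == tag:
        return target                 # hit: it is a branch, and this is where it went
    return None                       # miss: just fetch the fall-through (pc + 4)

def btb_insert(pc, target):
    # Called when a taken branch executes; not-taken branches are not inserted.
    word = pc >> 2
    btb[word & (NUM_ENTRIES - 1)] = (True, word >> K, target)
```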
So I can change my PC and start fetching from that particular target in the next cycle, and if the prediction is correct, I have made that whole problem disappear. Is it clear how I look up the BTB?

Now, essentially, I have assumed an invariant here: only branches go into this particular table. Otherwise, from just a hit I could not say that it is a branch. The BTB is actually serving as a partial decoder: a hit is telling me that this instruction is a branch. That means I have to be a little careful about what I insert into the BTB. So what do I insert, and when do I insert? There are two questions, actually. What are the suggestions: what goes into this table, and when does it go into the BTB? Branch instructions, which we identify from the decoder. And when do we insert, at what point in the pipeline? The branch goes through the five stages of execution, and after execution you actually know the target. Right, exactly. When the branch finally executes, I know the actual target, and of course I know its PC, because it was fetched. So then I can insert it into the table, with the target and the tag. Now, to reduce the space overhead of the BTB, or rather to improve the utilization of the capacity of the BTB, the branches which are not taken are usually not inserted. Because if you miss in the BTB, that is, if you do not find the branch in the BTB, you will just fall through anyway. So you never insert a branch which is not taken; you only insert the taken branches in the BTB, because that is what matters.

All right. So, in case of a hit, the BTB tells you the target of the branch from when it executed last time. You can hope that this is correct and start fetching from the predicted target provided by the BTB. Later you get the real target, compare it with the predicted target, and throw away the fetched instructions in case of a misprediction. In case of a miss, you keep fetching the fall-through path. So that covers what exactly the BTB stores, when it stores it, and how you look it up.

Let me close with one small question. Can somebody tell me for what kind of control transfer instructions the BTB is going to be great, 100 percent accurate? What are they called? Unconditional jumps. Unconditional jumps will be 100 percent, well, not exactly 100 percent: the first time you will miss, of course. But for the remaining execution of that unconditional jump you are going to be correct; every time it goes to the same place. Anything else? Any other instruction, 100 percent? Switch case? No. Remember that the different targets of a switch come from the same jump, so those branches will have the same PC, and they will be overwriting the same entry over and over; that will actually be very bad. What else, something that is like an unconditional jump? Returns? No, you will not return to the same place every time, because you call the same procedure from different places, right? Sorry, loops, you think? With loops you will be fairly accurate, except the last time. What else is something like an unconditional jump? What about function calls? They will always go to the same place, right? You call a function. So these are called direct procedure calls; there are indirect calls also. So they will be