Okay, so just to remind you what we were discussing: we were talking about control hazards. Last time we saw the typical five-stage pipeline. So this is what happens, right? You fetch a branch instruction here, and you get to know the target at this point. So essentially you have a two-instruction delay before you know the target, but MIPS has phased execution, where they make sure the branch target is ready in the positive phase of the cycle so that the fetcher can fetch in the negative phase of the cycle. That's how they nullify one of the bubbles. But they still have one bubble left, and they rely on the compiler to fill up this particular slot. It's called a branch delay slot. We discussed last time from where the compiler can pick up an instruction to fill the slot; if it can fill it with something, that will not be a wasted cycle. Otherwise, it will probably be filled with a NOP, which is essentially a bubble. Is this clear to everybody? Any question on this? Okay, and then we also discussed something called a branch target buffer, where essentially the idea is that in the fetch stage you look up a particular cache. That's the branch target buffer: a cache holding your branch targets. You look it up in the fetch stage, and the BTB gives you a target, which you can use in the next cycle to fetch. So essentially what that means is: I look up my BTB right here. The problem was I didn't know what to fetch here, but now I can use the BTB's outcome to fetch an instruction here. With the BTB outcome, essentially I have nullified all the bubbles without relying on the compiler. So this slide summarizes nicely what a BTB does. The BTB is looked up with the program counter of every instruction, in parallel with fetching the instruction. So essentially you have the program counter; you send it to the memory for fetching the instruction, and you send it to the BTB also, for every instruction.
On a BTB hit, it provides two pieces of information. What is that? The first is that this instruction is a control transfer instruction, which is why it is hitting in the BTB. And the second is the target of this control transfer instruction as seen last time. The BTB always stores the last target this control transfer instruction went to. So this target will be used to fetch in the next cycle. On the other hand, if you miss in the BTB, the fetcher really has no option but to fetch from the fall-through path, that is, PC plus 4. And a control transfer instruction is inserted into the BTB in the execution stage, when its target is known. So you always insert into the BTB when you know for sure where a particular control transfer instruction is going; you cannot insert anything wrong into the BTB. So essentially what you do is, once the branch is resolved in the execution stage, you look up the BTB once more, and at this time the instruction may hit in the BTB, because the BTB might have seen this branch already. Now there are two options. If the branch is not taken in this particular execution, the BTB entry is invalidated. Why is that? Because we only want to store taken branches in the BTB. Remember that if you miss in the BTB, in any case you have to take the fall-through path. So it's better to save BTB space by not storing the not-taken branches. Otherwise, the entry is updated with the taken-branch target. If the lookup at this point misses in the BTB, as shown here, a new entry is allocated, provided the branch is taken. If the branch is not taken, then of course you don't allocate anything. Is this clear to everybody, this particular protocol? You look up the BTB in the fetch stage; you update the BTB after you have finished executing the branch. And of course you can optimize this part a little bit to save BTB bandwidth.
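The lookup and update protocol just described can be sketched as follows. This is a minimal functional model, not hardware: a plain dictionary keyed by PC stands in for the tagged cache structure, and all names (`BTB`, `lookup`, `update`) are illustrative.

```python
class BTB:
    """Functional sketch of the branch target buffer protocol above."""

    def __init__(self):
        self.entries = {}  # pc -> last taken target

    def lookup(self, pc):
        """Fetch-stage lookup: returns (hit, predicted next PC)."""
        if pc in self.entries:
            return True, self.entries[pc]
        return False, pc + 4  # miss: fetcher can only fall through

    def update(self, pc, taken, target):
        """Execute-stage update, once the real outcome is known."""
        if pc in self.entries:
            if taken:
                self.entries[pc] = target  # refresh with the taken target
            else:
                del self.entries[pc]       # invalidate a not-taken branch
        elif taken:
            self.entries[pc] = target      # allocate only taken branches
```

Note how the two rules from the slide show up directly: a not-taken branch never occupies an entry (a miss already predicts fall-through correctly), and nothing is ever inserted before the execute stage resolves the target.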
You can say that, well, we carry forward the BTB outcome down the pipeline, because remember that a branch instruction must have looked up the BTB in the fetch stage already. So you can carry forward its outcome to the execution stage, and then match and decide whether to look up the BTB and insert anything or not. Any question on this? Is it clear? So there was a branch delay slot, and we also have a BTB. If it's the compiler's responsibility to fill the branch delay slot, the processor has to obey that; it has to execute that instruction. In that case the BTB is of no use, which is why the MIPS R3000 did not have a BTB. So a BTB is going to be useful only when you say the compiler cannot fill up the delay slot. So what else can I do there? How do we solve the problem? In many cases your BTB will probably outperform a compiler-filled delay slot, because it can see the dynamic behavior of a branch; it can learn. Whereas the compiler sees the static piece of code and may not know what will happen at run time. But of course there is a penalty you pay here. We will see gradually how we actually change the pipeline hardware to include a BTB: there are extra pieces of hardware we have to add, and there is a danger of lengthening your cycle time. So let's take a simple example, just to evaluate the usefulness of the BTB. Let's assume that for a program, 90% of all control transfer instructions hit in the BTB; 90% of the outcomes provided by the BTB are correct; and 20% of the control transfer instructions that miss in the BTB result in taken branches. We want to know what fraction of bubbles are saved with these hypothetical statistics. So what is that? The fraction of bubbles saved is the same as the BTB prediction accuracy, which is what? 0.9 fraction hit, and 0.9 of those are correct. And of the 10% of branches that miss, the 20% that are taken branches are actually mispredicted, because a miss means we fall through. So 80% of the misses are correct: 0.1 multiplied by 0.8. So we save 0.9 times 0.9 plus 0.1 times 0.8, which is 89% of bubbles. But we lose 11%.
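Working through the hypothetical numbers of the example above, the saved-bubble fraction is just a weighted accuracy over hits and misses:

```python
# Hypothetical statistics from the worked example in the lecture.
hit_rate = 0.9       # fraction of control transfers that hit in the BTB
hit_accuracy = 0.9   # fraction of hit predictions that are correct
miss_taken = 0.2     # fraction of BTB misses that turn out to be taken

# On a miss the fetcher falls through, which is correct only for the
# not-taken branches among the misses.
miss_accuracy = 1.0 - miss_taken

saved = hit_rate * hit_accuracy + (1 - hit_rate) * miss_accuracy
print(round(saved, 2))  # prints 0.89 -> 89% of bubbles saved, 11% lost
```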
So 11% of branches suffer mispredictions, and we require some recovery mechanism for correct execution. Because in those 11% of cases, what will happen is that we take some BTB prediction and go along that path, but that's wrong, which we discover later. So we have to fix up something in the pipeline to make sure the wrong-path instructions are removed from the pipeline. Alright. So this one we have already discussed: the BTB will work great for loop branches. Essentially we say that we will mispredict the first time and the last time; otherwise it will be the same target all the time, because a loop branch will always go back. Only the last time will it actually fall through. And the first time there will be a miss in the BTB, because we haven't seen the branch. Subroutine calls are also great, because they always go to the same place: if you call a subroutine from somewhere, it's always going to the same subroutine entry point. And unconditional branches likewise. So in these cases your BTB will be highly accurate. Again, in this case the first time will be a misprediction, and in this case also the first time will be a misprediction. Alright. And subsequent predictions will be correct, provided that particular entry is not replaced from the BTB; so assume we have enough capacity in the BTB. Now, we talked about these indirect procedure calls, which are jump-and-link-register (JALR) instructions in MIPS. What do you think about these? How will they perform with the BTB? Indirect procedure calls: what are these? What kind of instruction are they? What does this instruction actually do? JALR. How is it different from a direct procedure call? Does anybody remember? What kind of program constructs would lead to a JALR? You will get these from function pointers. And a function pointer may resolve to pretty much any procedure in your program, which means your targets are going to vary for these instructions.
They are not constant. So the BTB is probably highly inaccurate in this case, depending on your program behavior. But of course, if you have a pattern of locality, where over a phase of a particular execution your function pointer resolves to the same function all the time, then the BTB will of course be good. So there is no guarantee that the BTB will do well in this case, unlike the cases here. So that is about these three categories of branches, and of course we have left out one major category: conditional branches. If-else kinds of constructs will lead to conditional branches, where depending on the condition outcome you take the branch or not. This is very different from these two. Loop branches are conditional branches, but they actually behave very regularly; it is not like if-else types of branches, which may go either way. So for conditional branches, processors usually use a separate type of predictor, called a direction predictor. Because here, all you want to know is which way am I going. I can get the target from the instruction itself: the offset is in the instruction, and I just need to add the offset to the PC to get the target. All I need to know is whether I should add that offset or not; that is, should I jump to the target, or should I just fall through? So it is a binary prediction. However, it is very dynamic in nature. The last outcome is not very helpful in general, because the way you went last time, you may not go the same way next time. Maybe this time you execute the if part of the code; next time you might execute the else part, depending on the condition. So you need a direction predictor. The prediction is taken or not-taken, a binary prediction. Once that prediction is available, we can compute the target. And the question is how this co-exists with the BTB. We will talk about that very soon.
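The point that a direction prediction is just one bit, with the target computed from the instruction itself, can be sketched like this. The `offset << 2` shift and the PC+4 base follow the MIPS convention for PC-relative branches (the 16-bit immediate counts words relative to the delay-slot PC); the function name is illustrative.

```python
def branch_fetch_target(pc, offset, predicted_taken):
    """Next fetch PC for a PC-relative conditional branch.

    offset: sign-extended 16-bit immediate from the instruction,
    counted in words (MIPS convention, assumed here).
    predicted_taken: the single bit the direction predictor supplies.
    """
    if predicted_taken:
        return pc + 4 + (offset << 2)  # add the offset: go to the target
    return pc + 4                      # don't add it: fall through
```

Everything except the taken/not-taken bit is available from the instruction and the PC, which is exactly why the direction predictor only needs to answer yes or no.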
One thing we will require to make good use of a direction predictor is an adder in the decode stage of the pipeline, to compute the target. This was discussed last time; somebody raised the issue of why you can't compute the target right here, since part of the target is inside the instruction: it comes from the offset. The point was that, well, I need an ALU here. We need an adder. So we will assume the existence of an adder, and we will see how that actually works very soon. The last thing that is left is the type of instruction that returns from a procedure. That's one type of control transfer instruction which is not covered in any of these types. You end the procedure, you take a control transfer which returns you back to the calling site. And here also your BTB won't be very useful. Why is that? Can somebody guess, or can somebody see, why a BTB is not useful for return instructions? [Student: each time you call a subroutine, what you return will be different.] No, no, this is not about the return value. See, jump instructions are concerned with the next instruction to execute. The question is: what is the next place that I return to? It's not the return value that we're talking about. You understand what I'm saying? I have a function f. I'm executing the function. I call g here. And this is my g. At some point g will return, and when g returns I'll have to start executing here. The question is: when I fetch this particular return instruction, do I know where to go next? That is the question. And that's not covered in any of these types; it's a very special type of instruction. And I'm saying that the BTB is not going to be very helpful in this case. Exactly. I might be calling g from many different places. So this time I call it within f; next time I might be calling g from h. So my return addresses will all be different, actually. Of course, it may not be so in some cases.
But anyway, it depends on your execution. So the BTB is not going to be very helpful for this case. What processors today include to tackle this is another hardware structure called a return address stack. It has nothing to do with your execution stack; it's a hardware structure with a push-pop interface, and you can guess what it does. Whenever the fetcher sees a JAL or JALR instruction, which are procedure calls, it pushes the return address, because it knows the return address at this time: when the JAL executes, the return address is just PC plus 4 or PC plus 8. And whenever you encounter a return instruction, it just pops the stack. The top of the stack should be your return address. So that's a prediction, actually; it gives us a prediction. In most cases this will be very accurate, provided you don't overflow the stack. If you have a very deep call stack you might overflow it, in which case the predictions will actually be incorrect. It's a hardware structure; it can overflow. So that can happen. What's the size of it? It depends. For example, the MIPS R10K had a 4-entry return address stack. We can make it bigger; bigger stacks would probably be better. It depends on how much budget you have for hardware and all that. But if you have too small a stack, then of course entries will get evicted and you will start giving out wrong predicted values. Okay, so with all these things, we now have the BTB, we have a direction predictor, and we have a return address stack. The question is: what does my hardware look like that decides the next PC? We have too many inputs to choose from now, right? First we make an observation: in a five-stage MIPS with phased execution, a conditional branch direction predictor is of no use, and the same is true of the RAS. Let's try to understand why that is so. So let's go back to the pipeline.
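The push-pop behavior and the overflow caveat just described can be sketched with a small bounded stack. This is a functional model, not hardware; the fixed depth mimics a small hardware RAS (the 4-entry example above), and on overflow the oldest entry is silently lost, which is exactly why deep call chains eventually mispredict.

```python
from collections import deque

class ReturnAddressStack:
    """Sketch of a bounded hardware return address stack (RAS)."""

    def __init__(self, depth=4):
        # deque with maxlen drops the oldest entry when a push overflows
        self.stack = deque(maxlen=depth)

    def push(self, return_address):
        """On fetching a procedure call (JAL/JALR): push PC+4 (or PC+8)."""
        self.stack.append(return_address)

    def pop(self):
        """On fetching a return: the predicted return address,
        or None if the stack has underflowed."""
        return self.stack.pop() if self.stack else None
```

With `depth=4`, a fifth nested call silently discards the outermost return address, so the matching return four levels up will be mispredicted; that is the overflow case mentioned above.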
So we assume that we have phased execution, just like the MIPS R3000. And we are saying that, well, of course the BTB is still useful, right? We have already talked about that, because we know that if we have a prediction from the BTB here, we can use it here. Now let's take up the direction predictor for conditional branches. The first question we have to answer is: in which stage can I make a prediction? In which stage can I ask the direction predictor? The decode stage, right? I cannot ask the direction predictor before decode. Why is that? [Student: because we have to decode it first to know the address.] Why is the address important? I just want to know yes or no. The address is not yet important; I just want to know whether I should go down the fall-through path or to the target. That's what the predictor tells me; it doesn't tell me the target. I'll compute the target separately by using the offset, which I can do anytime, because I know that my last 16 bits are going to be the offset. I can just assume that everything is a branch; even here I can take those 16 bits and add them to PC plus 4. It may be a garbage value if it's not a branch instruction, but I can do it here. So all I'm asking is this: the direction predictor will only tell me yes or no, not the target. Where can I consult that predictor? What is the earliest possible stage in the pipeline? Your answer is correct: I cannot do it before the decode stage. Why? [Student: condition evaluation will happen here.] If I evaluate the condition, why do I need a prediction then? Prediction is important only when I don't know what's going to happen. If I can evaluate the condition, why should I have a predictor? [Student: after fetching the instruction we will know that it's conditional.] You will know what? Whether it's a conditional branch or something else.
So I need to know whether it's a conditional branch or not. That's the important point, and I will get to know that only when I reach the decode stage. I decode the instruction and know that it's a conditional branch; only then does it make sense to look up the direction predictor. So there is no hope of using the direction predictor in the fetch stage; I have to wait till decode for sure. And if I have to wait till decode, then you can clearly see there is no point in using this predictor, because in the next cycle I get the correct target anyway. Is that clear? In this particular setting I don't need a direction predictor. What about the return address stack? Can I argue the same way? Yes, because I push onto or pop from the stack only after I know that I am dealing with a procedure call or a return. Otherwise, if I start popping the stack blindly, I will actually be popping useful things at the wrong places. So I should pop from the stack only if I know it's a return instruction, and push onto the stack only if I know it's a procedure call. Is it clear? So for this particular pipeline, I am only interested in dealing with a BTB. Let's see how to do that. If something is not clear you can ask questions; just remind me. So what does this mean? It means that every cycle the fetcher has to select from three options. One is PC plus 4 from the fetch-decode register. By the way, I am using this particular notation to denote the pipeline register sitting between the fetch and decode stages. Remember that every two consecutive stages are separated by a pipeline register. So this is my IF/ID register, this is my ID/EX register, this is my EX/MEM register, and so on. Okay. I have the PC plus 4, which comes from this register, because remember that I have the PC, I have incremented it, and I am going to latch it into the IF/ID register.
So from there I can take PC plus 4. I also have the BTB output from the IF/ID register, whatever output the BTB has given me. And I have the actual target, which is bypassed from the execution stage. This is available early for these three instruction types, because these three don't require any extra evaluation: I can compute the target from the instruction itself if I have an adder in the decode stage. Here the target is inside the instruction, here the target comes from a register value, and here the target is also in the instruction. So under what condition do I have an input from the execution stage for the PC? If a branch instruction is currently executing there, which means two cycles earlier I fetched a branch which is now resolved. So I have an input coming from the execution stage telling me that, you know, you might have to change the PC if you made a misprediction. Okay. Let's assume that the BTB lookup returns a tuple, where the first entry is a hit/miss indication and the second entry is the BTB contents. On a miss, the second entry is simply PC plus 4. That's not exactly coming from the BTB, but I'll assume that's how the tuple is generated by the BTB hardware. If the last-to-last instruction, the one fetched two cycles earlier, was a control flow instruction, you compare the BTB contents for that instruction with the actual target. Is that clear to everybody? On a mismatch, select the actual target and zero out the ID/EX register, because the problem is that the instruction currently in the decode stage is actually wrong. So you cannot put its contents into the ID/EX pipeline register. What MIPS does is zero out that entire register, and the good thing is that the opcode of NOP is all zeros.
So that will actually act as a NOP as it goes down the pipeline. And the ID/EX inputs are ANDed with the complement (tilde) of the kill signal; we will look at that. Okay. So this is my program counter, and this is actually my pipeline register at the front of the fetch stage; here I show it with a dotted line. It's actually the PC. At every clock I'll latch a new PC, which will go to the memory to start the fetch and also drive this additional logic; that's why I show the clock at the program counter. So what is happening? In a particular clock cycle I get a new PC; I feed it to the adder, which adds four to it, so I get PC plus 4 here. This PC is also sent to the memory for fetching the instruction, which will go into the instruction register in the fetch-decode pipeline register. This PC is also sent to the BTB, the branch target buffer, for the BTB lookup. What comes out of the BTB are two things: one is the BTB contents, and the other is a hit/miss indication. And the third thing I get is a target from the execution stage. This is for an instruction currently executing in the execute stage, and it comes accompanied by a kill signal. The kill signal is enabled only if the instruction currently executing in the EX pipeline stage is a control transfer instruction and its target does not match the BTB outcome, which means I did something wrong in the past, so I have to kill some of the instructions inside the pipeline. Are the inputs okay? Now look at this particular multiplexer, which makes the choice of the next PC. What are my selections? When do I pick the BTB contents? When kill is 0, which means everything is okay, and I have a hit in the BTB; so the select is 0 0. In that case I pick the BTB contents as the next PC. When do I pick PC plus 4? When kill is 0 and I miss in the BTB; the select is 0 1. And if kill is 1, that overrides everything else.
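The 3-to-1 next-PC selection just described can be written down as a small priority function: the kill signal from the execute stage wins, then the BTB hit/miss indication decides between predicted target and fall-through. The names are illustrative, not from any real design.

```python
def select_next_pc(kill, btb_hit, actual_target, btb_contents, pc_plus_4):
    """Sketch of the fetch-stage 3-to-1 next-PC mux."""
    if kill:             # select 1x: misprediction, take the resolved target
        return actual_target
    if btb_hit:          # select 00: follow the BTB's predicted target
        return btb_contents
    return pc_plus_4     # select 01: BTB miss, fall through
```

Note that `kill` acting as the high-order select bit is exactly the "overrides everything else" behavior above: when it is 1, the hit/miss bit is a don't-care.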
Kill being 1 means I made a mistake in the past and I have to fix it now. So if kill is 1, the other select bit is a don't-care, and I choose the target coming from the execution stage as the next PC. At the next clock, essentially, this PC will be latched here, and I will be on the right path from there on. Is this hardware clear, the PC selection? Essentially I need a 3-to-1 mux, and these two bits are actually the select. The only thing that may happen now is that I have too many things on my critical path: I have a BTB on my critical path, I have a multiplexer on my critical path, and I have the instruction fetch in between. [Student asks about the wrongly fetched instruction.] Yeah, right, coming to that. Yes. That's exactly what is mentioned here: the ID/EX inputs are ANDed with tilde kill, where tilde means NOT. So if kill is 1, you will be feeding 0 into this particular register. If you go back to this diagram, the wrong instruction is currently here, currently being decoded. Because I fetched an instruction, I made a prediction which came here, I used the prediction to fetch a new instruction, and that instruction went in here. And when the branch executes, the wrongly fetched instruction is here, in the decode stage. So this is how my timeline goes. Suppose this is my branch instruction. Alright, let's call the instruction fetched here I0; that was fetched using the BTB outcome coming from here, alright, and then I have some I1 which is fetched here.
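The kill condition and the tilde-kill squash being described here can be sketched together. `npc_id_ex` stands for the next-PC value carried forward into the ID/EX register; returning 0 for the register fields models the all-zeros NOP encoding mentioned above. All names are illustrative.

```python
def kill_signal(is_control_instruction, npc_id_ex, actual_target):
    """Kill when a resolved control transfer disagrees with the
    next-PC value that was carried forward (i.e. the BTB's guess)."""
    return is_control_instruction and (npc_id_ex != actual_target)

def id_ex_input(decoded_fields, kill):
    """ANDing the ID/EX inputs with ~kill: feeding all zeros turns the
    wrong-path instruction into a NOP (whose opcode is all zeros)."""
    return 0 if kill else decoded_fields
```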
So that fetch is going to be correct all the time, because I will get the target from here. What we are talking about is how you select the PC at this particular stage. I have three inputs: one is this target; one is coming from here, which is PC plus 4, the next sequential instruction; and one is from the BTB, because I used this PC to look up the BTB and the BTB told me something. Those are the three things. Now, the point is this: if I made a wrong prediction for this branch when I looked up the BTB, where is that wrong instruction currently? It is currently in the decode stage, right? When the branch is executing, the wrong instruction is here. So what I have to do is this: when the contents of this decode stage are about to be fed into this pipeline register, they have to be killed. It should not be feeding those contents in, because they are wrong. So what I am doing is ANDing all the ID/EX inputs with NOT kill. The ID/EX pipeline register is here. If kill is 0, then whatever you have computed will go into that register; otherwise only 0s will go in. Is the timeline clear to everybody, what is happening? [Student: is I1 also wrong because of I0?] No, I1 is always correct, because we have phased execution; I1 will get the correct outcome. In this diagram I am actually selecting I1's PC; that is exactly what is being done here. PC plus 4 is coming from here; this instruction also looked up the BTB, and the BTB outcome is coming from there; and I have a target coming from here. Which of these three should I pick? That is what I am asking. So I1 will actually proceed with the correct PC. And remember that this has to happen every cycle; this is a continuous process. Every cycle this hardware will run and pick one of these three things. Of course, in many cases your target will not even be a legitimate thing, because there may not be a
branch instruction executing in the execute stage at all, which is fine, because in those cases the kill will be 0 for sure; there is nothing to kill. In that case you will be selecting one of the other two, depending on whether you hit or not. So how do I generate the kill signal? This is the logic for kill. It takes "is control instruction", which is computed by the decoder and is currently sitting in the decode-execute pipeline register; it tells me whether the instruction currently in the execution stage is a branch or not. And I AND that with this particular clause, which says NPC in ID/EX is not equal to the actual target, where NPC gets the next PC, whatever I calculated. If there is a mismatch and the current instruction is a control instruction, I should kill. NPC is carried up to the ID/EX register; beyond that point it is not required anymore. And this "is control instruction" signal is generated by the decoder; this variable will be high whenever the decoder comes across any control transfer instruction. Is this clear? If it is not clear, you will have a hard time with the next slide, where there will be a fourth input to this mux. Okay. So now let's assume that either the branch target cannot be made ready in half a cycle, or an instruction fetch cannot be done in half a cycle. What does that mean? It means I genuinely have two bubbles that I have to nullify. So now the pipeline timeline looks like this: this is my branch instruction, and I will know the target only here. So you have to fix up something in these two cycles, and now a direction predictor makes very good sense, because I can make a prediction here which I can use in this cycle to overwrite the BTB prediction. Alright. And a return address stack is helpful provided instruction decoding followed by prediction can be completed in a
single cycle. Otherwise, of course, there is no use; you have to complete decoding and prediction in this one cycle. And they have to go serially, they cannot go concurrently: you first decode and then look up the predictor; you first decode, then you pop from or push onto the RAS. Okay. And we assume that the direction predictor or the RAS can offer a better prediction than the BTB, and hence we always overwrite the BTB prediction. In some cases, of course, it may be the opposite: the BTB may actually give you a better-quality prediction, that is possible. But in this particular discussion we are going to overwrite the BTB prediction with whatever the direction predictor or the RAS tells you. So now the fetcher has a fourth option, namely the predicted target from the decode stage: the decode stage is now going to give another input to the fetcher, which is the outcome coming from the RAS or the direction predictor. And I will generate one more signal, called BTB kill. So let's try to understand what is really happening. Here I use this PC to look up the BTB; the BTB tells me something, which I am going to use here. And in this cycle I figure out that I get a prediction from here, from the direction predictor, so I have to overwrite this one. That is this signal, which kills the instruction fetched according to the BTB's outcome. And finally, when the instruction executes, I may get another indication saying, oh, you made a mistake here also; then I will generate a kill signal. So I will have two kill signals now: BTB kill, and the traditional kill that we had last time. Let's see what it looks like. I will have a bigger mux for sure. I have my PC as usual; I compute PC plus 4; I look up the BTB, which gives me some contents. Now, what are the inputs to my multiplexer? I have the BTB contents, I have PC plus 4, I have the predicted PC coming from the decode stage, and I have a target coming from the execution stage. And
what I will assume here is that whatever comes in as the predicted PC can be either from the direction predictor or from the return address stack; I will not open up that particular box here. So now the logic you can easily figure out. When do I pick the BTB contents? When kill is 0, BTB kill is 0, and I hit in the BTB; in that case I pick the BTB contents. When do I pick PC plus 4? When kill is 0, BTB kill is 0, and I miss in the BTB. When do I pick the predicted PC? The predicted PC is picked when kill is 0 and the BTB kill signal is on; I do not care about the rest. And when do I pick the target? When kill is 1, everything else is ignored, because kill is the golden rule: it tells me, here is the correct target, you should go along this direction. Everything else here is a prediction, but this is always correct. So whenever kill is on, I ignore everything else, pick up this target, and use it as the next PC. That is it. So what is the predicted PC coming from the decode stage? Let us assume the existence of an address adder in the decode stage. We have four options here. What are they? If the instruction is a conditional branch, you look up the direction predictor and compute PRED PC based on that: the direction predictor tells you either fall through or go to the target, and whatever it says, I set the predicted PC accordingly. If the instruction is a return, that is JR, you pop from the return address stack and use that as PRED PC. If the instruction is a jump-and-link or a jump, these are direct procedure calls and unconditional jumps; compute PRED PC from the target field of the IR. Although I call it PRED PC, this one is actually correct; there is no prediction going on here. And in all other cases, PRED PC is NPC, whatever was carried forward from the fetch stage. Remember what NPC is: it is this one, carried forward from the fetch stage. So essentially what I am saying
is that in the decode stage, if it is not a control transfer instruction, my PRED PC is just the NPC that I carried forward. Which means that even in the decode stage, PRED PC will be selected by a multiplexer, which I am going to show you. You push onto the RAS for a procedure call instruction in the decode stage, after the instruction is decoded, and you may have to repair the RAS if kill is enabled. Can somebody explain what this means? Kill is enabled when I made a misprediction, which means I fetched two instructions down the wrong path, exactly two instructions, and then I got a kill signal telling me these two instructions are wrong and I should redirect myself along the right path. Why do I have to repair the RAS? [Student: we might need to pop something out of the RAS.] Why is that? And how can that happen? [Student: maybe there is some kind of function call and we have pushed it.] You are right; just tell me which one of these is a function call. I push something only when I see a function call, so when do I see a function call here? Suppose this branch is mispredicted, so I get a kill signal here at the time of execution, while I push onto the RAS in the decode stage. When I get the kill signal here, why do I have to repair the RAS? What might have corrupted the RAS? [Student: the new instructions which we have just fetched.] Yes. If one of these wrong-path instructions is a function call, you might have pushed something onto the RAS which you should not have pushed. And what if one of these is a return instruction? You have popped something out of the RAS. That is a bigger problem: there is no simple way to repair it; you have popped something out. So how do we fix this? Can somebody guess what might be a way? [Student: do not do the RAS operation in decode.] But then what is the point? I do have to push and pop, right? Is it clear? If I have a return instruction on the wrong path, it is a nightmare: now there
is no way to get the RAS back to the correct state just by looking at it. So what can you do? A secondary RAS: you can copy the RAS. One possibility, and you will encounter this particular idea over and over as we go forward, is called checkpointing. Before you speculatively pop the RAS, you checkpoint it: you copy the RAS entries somewhere, so that whenever you see a kill signal you copy the entries back, and you are in the right state again. Now, this gradually gets complicated, and we will see why. Right now we are dealing with maybe one or two wrong-path instructions, but as the pipeline gets longer you can have many wrong predictions in flight: too many wrong pushes may go into the RAS, too many pops may happen out of the RAS, and the question becomes how many checkpoints you can afford to keep. So there are many issues coming up; this is a difficult problem. A question from the class: can the second wrong-path instruction also corrupt the RAS? Yes, because the second instruction also gets a chance to reach the decode stage before the kill arrives; it gets killed afterwards, but the RAS update has already happened. Now, how do we compute BTB-kill and kill? BTB-kill is: (the instruction is a control transfer) AND (pred PC != the PC carried forward from the fetch stage). In other words, what the decode stage predicts does not match what the fetch stage actually fetched next, which means the BTB-guided fetch was wrong. And kill is the same as before: (the instruction is a control transfer) AND (pred PC != the actual target computed in the execute stage). Last time we compared against NPC; now NPC is replaced by pred PC, because pred PC is the value that overwrites NPC and is what gets carried forward. Note that pred PC gets latched into the ID/EX pipeline register in place of NPC. Is this hardware clear? It is very generic: we have now incorporated pretty much every kind of prediction related to control transfer instructions, and this is what you find in the front end of every modern processor. The only difference is that as your pipeline gets longer, you may have to kill more instructions on a misprediction.
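The checkpointing idea can be sketched as a toy return-address stack that snapshots its contents before a speculative decode-stage update and rolls back on a kill. Copying the whole stack, as done here, is only for clarity; real designs keep the checkpointed state much smaller:

```python
# Toy checkpointed return-address stack (RAS). This is a sketch of the
# repair idea from the lecture, not any real processor's mechanism.
class CheckpointedRAS:
    def __init__(self):
        self.stack = []         # predicted return addresses
        self.checkpoints = []   # saved snapshots, newest last

    def checkpoint(self):
        # Snapshot before decode speculatively pushes or pops.
        self.checkpoints.append(list(self.stack))

    def push(self, return_addr):
        # Decode stage sees a (possibly wrong-path) procedure call.
        self.stack.append(return_addr)

    def pop(self):
        # Decode stage sees a (possibly wrong-path) return.
        return self.stack.pop() if self.stack else None

    def repair(self):
        # Kill signal arrived: roll back to the most recent snapshot.
        self.stack = self.checkpoints.pop()
```

For example, if a wrong-path return pops an entry, `repair()` restores it; the open question raised in the lecture is how many such snapshots a deep pipeline can afford to hold at once.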
The next-PC selection hardware, meanwhile, stays more or less the same. And when do we put something into the BTB? We have already discussed this: you insert taken control instructions after the real target is known, and that includes all control transfer instructions, that is, unconditional jumps, procedure calls, procedure returns, and taken conditional branches. All control transfer instructions go into the BTB, even though you have a separate RAS for handling returns, because the BTB provides an early prediction for everything. On a miss you allocate a new entry, replacing an existing one with a policy like LRU or anything similar; on a hit you update the entry. We have talked about this. Now, possible optimizations. You can optimize the BTB to store one or more target instructions instead of the target PC. Is it clear to everybody what I am talking about here? Currently the BTB stores a program counter, and you take that program counter and fetch the instruction. Instead, I am saying, do not even fetch: put the target instruction itself in the BTB, so that you nullify one fetch operation. For unconditional branches this is called branch folding; essentially you are eliminating one instruction fetch. Another fetch optimization: you could explore both branch paths simultaneously for conditional branches. Do you see any problem with this? I am saying, well, I do not really have a predictor, so I will go along both paths, start fetching from both sides, and eventually I will figure out which one is correct and nullify the other. By the way, no real processor does exactly this. So what is the problem? One concern raised from the class is what happens to the results computed along the wrong path; let us suppose for now that nothing is stored permanently, that we have some hidden buffers where the results can be parked, so we will get to that point later, and that is not the problem here. Is there any other issue?
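Before moving on, the BTB update protocol just restated can be sketched as follows. A plain dictionary stands in for a real set-associative cache, so the capacity limit and the LRU replacement mentioned above are not modeled here:

```python
# Sketch of the BTB protocol from the lecture: look up by PC at fetch,
# and at execute time insert/refresh taken control transfers while
# invalidating entries for branches that turned out not taken.
class BTB:
    def __init__(self):
        self.entries = {}              # pc -> last taken target

    def lookup(self, pc):
        # Fetch stage: a hit returns a predicted target, a miss None
        # (on a miss the fetcher has no option but the fall-through path).
        return self.entries.get(pc)

    def update(self, pc, taken, target):
        # Execute stage: outcome and target are now known for certain.
        if taken:
            self.entries[pc] = target  # allocate on miss, refresh on hit
        else:
            self.entries.pop(pc, None) # keep only taken branches in the BTB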
What she has pointed out is very interesting: she is saying that while we are fetching down both paths, a wrong-path instruction might modify some register or other state. Up to this point that problem does not arise, because these instructions get killed before they ever execute; we will get to that point very soon. Okay, so what is the real problem with exploring both paths simultaneously? Yes, exactly: we are simply wasting resources, because half of our fetched instructions will be killed no matter what. By fetching both paths I am essentially settling for being correct with probability one half, since I know the other half will get killed; but a real direction predictor can be much smarter and give you accuracy far above 0.5. Next idea: you can pre-decode instructions for predictor access in the fetch stage. The problem so far has been that we cannot access the direction predictor until we decode the instruction; but you could store a single pre-decoded bit with each instruction that says, this is a conditional branch, and then the fetcher can look up that bit and access the direction predictor right in the fetch stage. That gives you a much better prediction to start with, and you do not have to rely only on the BTB, which merely tells you what happened last time. And then there is something called a trace cache, which we will talk about more later. Roughly speaking, it is a cache for the dynamic sequence of instructions you executed last time along a path: if you are executing a function f that calls g, instead of storing f somewhere and g somewhere else, you store the instructions in the dynamic order in which they executed; you store the trace. So next time, when you look up the trace cache, it will hit, but
again, the problem is that it tells you only what happened last time, and traces may change depending on your control paths. Of course there are ways to address that problem, and we will talk about them later. Okay, before I move on to the direction predictor, which is where we are going to start next, any problem or question with this? We have not yet opened up the direction predictor module, the one that gives you a yes-or-no answer: you ask it, tell me what to do with this conditional branch, and it either says take it or says fall through. There has to be something going on inside, and we will open that up now. But before that, any questions? Yes, the second one here. What I am saying is that earlier we could not access the direction predictor until we decoded the instruction; now suppose you pre-decode the instruction and store a single bit with it that tells me whether it is a conditional branch or not. The fetcher can look up that bit and access the predictor right there in the fetch stage. But what do we achieve if we still have to spend an entire cycle on the instruction fetch? You save one instruction slot. Let us see what happens. Take an unconditional jump instruction. Traditionally this instruction goes through the pipeline: it gets fetched in this stage, it computes its target in this stage, or maybe here, depending on where you have your address adder, and it has nothing to do in the remaining stages. With the BTB, instead, you take the PC, look up the BTB, and the BTB gives you the target; so the target instruction, which would otherwise have been fetched here, gets fetched a cycle earlier and enters the decoder sooner. You save a cycle. Any other question? So let me introduce the basic problem. Is it clear to everybody? I have a conditional branch, and I want to build a function the input to which is this particular branch
instruction, and the output is a binary decision: taken or not taken. To begin with, people looked at static prediction: you take a conditional branch and you predict always not taken, or always taken. How accurate is this? If you say always not taken, maybe you are mostly correct; it depends on the branch. The point is that it is very simple and can be decided at compile time. But consider a loop-ending conditional branch: if you predict it always not taken, you will be wrong most of the time, because loop-ending branches are taken on every iteration except the last one. Once people made this observation, that loop branches are a special type, they improved the static scheme to: forward not taken, backward taken. Whenever you have a forward branch, that is, a branch whose target lies ahead of it in the code, you predict not taken; if the target lies before the branch, you predict taken. Loop-ending branches then fall into the backward category and are predicted always taken, so now I am correct except the last time around. The forward-not-taken rule covers if-else type branches, where the target is ahead of the branch: for an if condition followed by an else, the branch target is the else part, which sits forward of the branch, so saying forward-not-taken amounts to saying that in if-else constructs the if part will be executed most of the time. That is interesting in itself; it may be because of the way we think, putting the true part before the false part. It turns out that forward-not-taken, backward-taken works very well in most cases.
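The forward-not-taken, backward-taken rule reduces to a single address comparison, since the branch's own PC and its target are both known statically. A sketch, with hypothetical addresses:

```python
# Static backward-taken / forward-not-taken (BTFN) prediction, as described:
# a branch whose target is at a lower address (a loop-closing branch) is
# predicted taken; a forward branch (an if/else skip) is predicted not taken.
def btfn_predict(branch_pc, target_pc):
    """Return True to predict taken, False to predict not taken."""
    return target_pc < branch_pc

# A loop-ending branch jumps backward, so it is predicted taken:
assert btfn_predict(branch_pc=0x40C, target_pc=0x400) is True
# An if-guard branch jumps forward past the "then" block: not taken.
assert btfn_predict(branch_pc=0x400, target_pc=0x41C) is False
```

As the lecture notes, this is still wrong once per loop, on the final iteration, which is what motivates dynamic direction prediction next.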
This scheme was pretty good, actually, but of course you can improve on it if you observe the dynamic behavior of the branch. To give you some idea of the problem, assume that you have a fairly deep pipeline, not just a five-stage one. We talked about branch penalty at some point: the branch penalty is essentially the number of cycles you lose because of a wrong prediction, which is at most the time from when the branch is fetched to when the branch executes. However many cycles sit in between is the maximum branch penalty you pay. As an example, suppose a processor takes 3 pipe stages after fetch to compute the target and evaluate the condition, as opposed to 1. Assume 4% unconditional jumps, 6% not-taken conditional branches, and 10% taken conditional branches. Evaluate the CPI increase for 3 schemes: unconditional flush, predict always taken, and predict always not taken. I will talk about this problem next time; we will see what exactly the values are, and we will start from here.
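Without preempting next lecture's worked answer, the general shape of the computation is: CPI increase = sum over instruction classes of (class frequency × bubble cycles that scheme pays for that class). Here is a sketch where the per-class bubble counts are left as inputs you fill in for each scheme; the one example shown, 3 bubbles for every control transfer, corresponds to my reading of the "unconditional flush" scheme, and the accounting for the other two schemes is exactly the exercise deferred to next time:

```python
# Generic setup for the CPI exercise above. Frequencies are from the
# lecture; the per-scheme bubble counts are deliberately left as inputs.
FREQ = {"jump": 0.04, "not_taken_branch": 0.06, "taken_branch": 0.10}

def cpi_increase(bubbles):
    """bubbles: dict mapping each class to stall cycles under one scheme."""
    return sum(FREQ[cls] * bubbles[cls] for cls in FREQ)

# Example: flushing 3 cycles for every control transfer (my reading of the
# "unconditional flush" scheme with a 3-stage resolution delay).
flush_all = {"jump": 3, "not_taken_branch": 3, "taken_branch": 3}
print(round(cpi_increase(flush_all), 2))  # prints 0.6
```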