Welcome to today's lecture, which continues our discussion on branch prediction. In the last lecture I introduced the basic concepts of branch prediction and three schemes: the one-bit branch prediction buffer, the two-bit branch prediction buffer, and the correlating branch predictor. Today I shall introduce further techniques related to branch prediction: the tournament branch predictor, the branch target buffer, integrated instruction fetch units, and return address predictors. These are the topics I shall cover today, but let us first have a quick recap of the schemes discussed in the last lecture. The simple one-bit predictor is accessed early in the pipeline, in the instruction fetch cycle, using the branch instruction's PC: the lower-order bits of the PC index a buffer in which one bit is stored, and that bit is the prediction, taken or not taken. If the bit is 0 the prediction is not taken; if it is 1 it is taken. If the prediction turns out to be wrong, the actual outcome is written into the buffer. That is how the one-bit predictor works, and we have seen that it does not give very good performance, so the two-bit branch predictor was introduced: two bits are stored instead of one, with 00 and 01 corresponding to not taken and 10 and 11 corresponding to taken; in other words, the branch is predicted taken whenever the counter value is two or more. Depending on the outcome, the entry in the branch prediction buffer is updated. The predictor can be described as a finite state machine and represented with the help of a next-state table.
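As a minimal sketch of the one-bit scheme just recapped, the following Python fragment models the buffer indexed by the low-order PC bits; the table size and class name are illustrative choices, not part of the lecture.

```python
# Minimal sketch of a one-bit branch prediction buffer.
# The buffer is indexed by the low-order bits of the branch PC; each entry
# holds a single bit: 1 = predict taken, 0 = predict not taken.
# On a misprediction the actual outcome simply overwrites the stored bit.

class OneBitPredictor:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)   # all entries start as "not taken"

    def predict(self, pc):
        return self.table[pc & self.mask] == 1  # True means "predict taken"

    def update(self, pc, taken):
        self.table[pc & self.mask] = 1 if taken else 0
```

Note that for a loop branch this scheme mispredicts twice per visit to the loop: once at the exit and once at the next entry, which is exactly the weakness that motivates the two-bit version.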
So, here you can see the previous state, which corresponds to the prediction: 00, 01, 10, or 11. If the outcome is not taken, the counter is decremented: 00 remains 00, 01 becomes 00, 10 becomes 01, and 11 becomes 10. On the other hand, if the outcome is taken, the counter is incremented: 00 becomes 01, 01 becomes 10, 10 becomes 11, and 11 remains 11, because we are realizing a saturating counter. The actual outcome is thus folded back into the two-bit predictor. Essentially, what the two-bit predictor does is add hysteresis to the decision-making process: unless two consecutive predictions are wrong, the prediction does not change. So this is how the two-bit branch predictor works. Coming to the third type, the correlating branch predictor, the basic idea is to capture the behavior of the last n branches and adjust the prediction of the current branch accordingly. The one-bit and two-bit predictors only consider what happened to the current branch; they are not concerned with the other branches present in the program. The correlating branch predictor, in contrast, takes a global view: what happens to the other branches is taken into account, as I introduced in the last lecture. And how is it done? The answer is to use an n-bit shift register and shift the outcome of each branch into this register as it becomes known.
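The next-state table above is exactly a saturating up/down counter, which can be sketched as follows; the table size and names are illustrative.

```python
# Sketch of the 2-bit saturating-counter predictor described above.
# States 00 and 01 predict not taken; 10 and 11 predict taken.
# The counter is incremented on a taken outcome and decremented on not taken,
# saturating at 0 and 3, so one wrong outcome in a strong state does not
# flip the prediction (hysteresis).

class TwoBitPredictor:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)   # 2-bit counters, start at 00

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2  # states 10, 11 -> predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(self.table[i] + 1, 3)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
```

For a loop branch this halves the steady-state mispredictions of the one-bit scheme: after the loop exit the counter drops only to state 10, so the branch is still predicted taken on re-entry.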
So, the previous n branches need not include the branch under consideration; they may be other branches, and their outcomes are stored in the shift register, which is then used for prediction. The question naturally arises: how many possible values can the shift register take? The number of bits decides the number of possible histories you consider. For example, with 3 bits there are 2^3 = 8 possible values, and you have to choose one out of 8 tables: imagine that many tables, one of which is selected on the basis of the shift-register value. As I discussed with the help of the diagram, here 2 bits of global history select one of 4 possible tables, in which 2-bit prediction values are stored and used for the purpose of prediction. So this is the correlating branch predictor. In general we can have an (m, n) predictor, which uses the behavior of the last m branches, so there are 2^m possible histories, and with the help of the lower-order bits of the branch address it selects an n-bit predictor that is used for the prediction. So this is the recap of the correlating branch predictor, and one popular correlating predictor is known as the gshare predictor. In the gshare predictor you can see how the global history is incorporated for the purpose of prediction: the lower-order bits of the branch PC are XORed with the global history register, and the result is used to index the branch history table, whose entry is then used for the prediction. So this is how the gshare predictor works.
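The XOR-based global-history indexing just described can be sketched in a few lines; the history width and names below are illustrative assumptions, not values from the lecture.

```python
# Sketch of gshare-style indexing: the low-order branch-PC bits are XORed
# with an m-bit global history register, and the result indexes a table of
# 2-bit counters (the branch history table).

HIST_BITS = 8
MASK = (1 << HIST_BITS) - 1

def gshare_index(pc, global_history):
    # Combine PC and history so the same branch gets different table
    # entries under different global histories.
    return (pc ^ global_history) & MASK

def shift_in(global_history, taken):
    # Shift the newest branch outcome into the global history register
    # as soon as it becomes known.
    return ((global_history << 1) | (1 if taken else 0)) & MASK
```

The XOR, rather than simple concatenation, lets the full table be shared between PC bits and history bits instead of partitioning it into 2^m separate sub-tables.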
Now, we shall focus on tournament predictors. Based on the observation that performance is improved by adding global information, the tournament predictor takes a step further. We have seen that correlating predictors, which combine local and global information, perform very well; can we go a step further still? That is the basic idea behind the tournament predictor. What is being done? It uses multiple predictors, one based on global information and the other based on local information. For simplicity you can have only two predictors. Local information means the history of this particular branch, whether it was taken or not taken earlier, and global information means the behavior of the other branches and loops; that is the difference between local and global. Now, with two predictors, one local and one global, you have to select one of the two for the purpose of prediction, and that is the basic idea of the tournament predictor. It adaptively combines local and global predictors with the help of a selector, and obviously it does so dynamically in hardware, because at run time a particular branch can be taken at one instant and not taken at another. So it adaptively combines the predictors and selects one of them at run time. This is the most popular among the multiple-predictor schemes; later on we shall look at contemporary processors and see how the tournament predictor is used in recent designs. The tournament predictor's strength is its ability to select the right predictor for the right branch: you may have many branches, and it identifies which predictor will give a good result for each.
So, it identifies the right predictor for the right branch; that is the basic approach used in the tournament predictor. Let me explain it with a state diagram. Assume we have two predictors, predictor 1 and predictor 2, and four states: two states in which predictor 1 is used and two states in which predictor 2 is used. How does it move from one state to another? Suppose the machine is in the strong "use predictor 1" state, and label each transition by the pair of outcomes (was predictor 1 correct, was predictor 2 correct). If the outcome is 0/0, both predictors turned out to be wrong, and the machine remains in the same state. If predictor 1 is right, outcome 1/0 or 1/1, then predictor 1 has proved itself, so again it remains in this state. Only when the outcome is 0/1, that is, predictor 1 wrong and predictor 2 right, does it move one step toward predictor 2; and if 0/1 occurs again, it crosses the boundary into the "use predictor 2" states. That means predictor 1 has to fail twice, with predictor 2 right both times, before control passes from predictor 1 to predictor 2.
So, for the outcomes 0/0, 1/0, and 1/1 the machine remains in the strong "use predictor 1" state, and only on 0/1, meaning predictor 1 has turned out to be wrong and predictor 2 right, does it move to the next state. In that intermediate state it again stays put as long as both predictions are wrong or both are right; if the opposite happens, outcome 1/0, predictor 1 right and predictor 2 wrong, it moves back toward predictor 1, and on another 0/1 it goes over from the predictor 1 side to the predictor 2 side. The "use predictor 2" states behave symmetrically: when predictor 2 is in use and the outcome is 0/0 or 1/1, both predictors wrong or both right, the state does not change.
Now the machine moves in the direction of predictor 2 and remains there as long as predictor 2 is correct and predictor 1 is wrong. Whenever the outcome is 1/0, predictor 1 right and predictor 2 wrong, it moves one step back, and if the same condition holds twice, that is, predictor 2 fails twice while predictor 1 is right, control goes back to predictor 1. In short, you move toward predictor 2 if predictor 1 fails twice while predictor 2 is correct, and toward predictor 1 in the symmetric case. This is the state transition diagram of the tournament predictor: the basic idea is that the selection counter is moved toward one predictor whenever that predictor is correct and the other is incorrect, and left unchanged otherwise. Based on this the tournament prediction works. Only two predictors have been shown here, but it is not necessary to have only two; we have considered two for the sake of simplicity, and there can be more in real processors, as we shall see later. In particular, we shall later look at the DEC Alpha processor, where a tournament predictor has been used, with much more complexity, which we shall discuss at that point. Now, the next diagram shows the fraction of predictions coming from the local predictor.
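The four-state selector above is again a 2-bit saturating counter. The sketch below uses one illustrative encoding, counter values 0-1 selecting predictor 1 and 2-3 selecting predictor 2; the encoding and function names are assumptions for illustration.

```python
# Sketch of the tournament selector: a 2-bit saturating counter.
# The counter moves toward predictor 2 when predictor 2 was right and
# predictor 1 wrong, toward predictor 1 in the opposite case, and stays
# put when both were right or both wrong.

def select(counter):
    return 1 if counter <= 1 else 2   # which predictor's prediction to use

def update_selector(counter, p1_correct, p2_correct):
    if p1_correct and not p2_correct:
        return max(counter - 1, 0)    # move toward predictor 1
    if p2_correct and not p1_correct:
        return min(counter + 1, 3)    # move toward predictor 2
    return counter                    # both right or both wrong: no change
```

The hysteresis here plays the same role as in the two-bit branch predictor: one predictor must beat the other twice in a row before the selection switches.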
So we have seen that either predictor 1 or predictor 2 can be selected, and this diagram shows, for different benchmarks, how often the local predictor was used versus the global predictor. The tournament predictor here selects between a local two-bit predictor and a two-bit gshare predictor, the correlating predictor using global information that I discussed earlier; each predictor has 1024 entries of two bits each. The local predictor is used a different percentage of the time for different applications, so the behavior is very much application dependent: it varies from 100 percent, where the local predictor is used all the time, down to about 37 percent, with the global predictor used for the remaining fraction of the time. This simulation shows how the tournament predictor works and how the use of the two predictors varies. It is a simple case with two predictors, but as I said, in real life there can be more; the application programs are the ones I have already mentioned earlier. The fraction of predictions made by the local predictor is shown in the diagram, and the remaining percentage corresponds to the global predictor. Next comes the performance comparison of different types of predictors. We can broadly categorize predictors into three types: local predictors, based purely on local information, among which the two-bit predictor gives the better result; correlating predictors, which use both local and global information; and third, tournament predictors.
So here you can see how the performance changes with predictor size for the different predictors, based on the misprediction rate on SPEC89 as the total number of bits is increased. Obviously, as you increase the number of bits, performance should improve, that is, the misprediction rate should fall. But for the local two-bit predictor the reduction is not much: from a little more than 7 percent to a little less than 7 percent, and beyond a point there is no further improvement even if you increase the size of the branch prediction buffer. For the correlating predictor, which uses both local and global information, the rate starts around 5 percent and decreases to a little less than 4 percent on the SPEC89 programs, which is a relatively good misprediction rate; the gshare predictor has been used for this purpose, and at each point the optimal correlating predictor has been chosen for plotting the curve. The last curve, with a misprediction rate below 3 percent, is for the tournament predictor that I have just discussed in detail. With the tournament predictor you get very good performance, down to less than 2 percent, which means the prediction is correct in more than 98 percent of the cases; that will definitely give a good improvement in processor performance. Now, coming to another very important requirement for good processor performance: the branch target buffer. So far we have focused on the branch prediction buffer.
The outcomes of past branches are stored in that buffer and used dynamically by the different predictors. Now let us see why we need a branch target buffer on top of the branch prediction buffer. In the classic five-stage pipeline an instruction is identified as a branch only in the ID stage, and the branch prediction buffer can help decide whether to fetch from the target address, as you have already seen, or from the fall-through path. However, instruction fetch still ends up fetching a possibly useless instruction: even with perfect branch prediction we cannot achieve zero-cycle branch latency. The reason is that the target address is calculated either in the execution stage, as is usual, or in a later stage. So even if your prediction is correct, taken or untaken, where the branch will go is not known in the second cycle. It is therefore necessary to have a branch target buffer; the solution is a branch target buffer that is accessed during the instruction fetch cycle: a cache that stores the predicted address of the next instruction after a branch. The branch target buffer stores the target address in a cache memory, and that address is used for jumping: you know whether the branch is predicted taken or not taken, and you also know where it will jump. Of course, if the prediction is not taken there is no problem, the next address is simply PC + 4; but when the prediction is taken, the target address has to be known, and it is calculated, when the address is PC-relative, as in most of our examples for the instruction set of the simple MIPS pipeline, where PC-relative addressing is used.
So in the PC-relative branch case, the content of the program counter is added to the displacement, or offset, to find the effective address. This effective-address calculation can be done in the instruction decode stage or in the execution stage, depending on whether we have additional hardware: if we do not have an extra adder it is done in the execution stage, and if we use a special adder dedicated to calculating the effective address it is done in the instruction decode stage. Either way, you have to wait until the end of the decode stage or the end of the execution stage to get the branch address, so obviously zero-cycle branch latency cannot be achieved; there will be a delay of one or two cycles depending on when the branch address is calculated. Now, there are some variants. You can store only the predicted-taken branches: a branch may be taken or not taken, and it is not really necessary to store the address when the branch is not taken, because in that case the next address is already known, PC + 4; only when the branch is taken must the target be calculated in this way, using the PC-relative addressing that is used for branch addresses. This works well with a one-bit local predictor, storing an entry when the prediction changes to taken, so that only taken-branch information is kept in the branch target buffer. Alternatively, you can use separate target and prediction buffers: two separate buffers, one storing the branch predictions and another storing the branch targets, that is, where the branch will jump, and this is done as follows.
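As a concrete sketch of the PC-relative calculation, the fragment below follows the actual MIPS-style encoding, where the 16-bit offset is sign-extended and shifted left by 2 because instructions are word-aligned; the lecture's simplified description just adds PC and displacement, so the shift and the PC + 4 base are encoding details assumed here.

```python
# Sketch of PC-relative branch-target calculation, MIPS-style:
# target = (PC of the following instruction) + sign_extend(offset) << 2.

def branch_target(pc, offset16):
    # Sign-extend the 16-bit offset field.
    if offset16 & 0x8000:
        offset16 -= 1 << 16
    return (pc + 4) + (offset16 << 2)
```

This is the addition that needs either the main ALU (execution stage) or a dedicated adder (decode stage), which is exactly why the target is not available during fetch without a branch target buffer.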
So in the branch target buffer, the branch address is used to index and obtain both the prediction and the target address, and you must check for a branch match, since you cannot afford to use a wrong branch address. The branch prediction buffer could be used without any tag, but the branch target buffer must use a tag, as is conventionally done in cache memories; this is very important, because a wrong entry would redirect fetch to a wrong address. This is how it is done: you can see that the cache memory in this case stores not only the predicted address, the predicted program counter value, but also the branch PC itself. After the instruction has been fetched and its PC is known, that PC is compared with the stored branch addresses, that is, the stored branch PCs. This is done in the manner of a content-addressable memory: you compare a lookup value against the stored contents, and the matching entry supplies the corresponding value. The cache is used in this manner, and if there is no match, the branch is not predicted, so fetch proceeds normally with PC + 4: in the content-addressable memory the PC of the fetched instruction is compared with the stored contents, and if there is no match the entry is not present, the branch is not predicted, and execution proceeds normally.
On the other hand, if there is a match with any one of the program counter values stored in the branch target buffer, then we say yes, the instruction is a branch, and we use the predicted PC as the next PC. That is, for the matching program counter value you get a predicted PC, the address to which the branch should go, and it can be used immediately for fetching instructions. Using this, you can have zero delay, because you get both pieces of information at once: whether the branch is predicted taken, and where it should go if taken. So in the very next cycle after the instruction fetch you can fetch the instruction from the target address and execution can proceed; the branch works with zero delay. Getting the prediction and the address at the same time is the basic idea of the branch target buffer. Its contents, the predicted target address, can be used along with a separate branch prediction buffer, and at the end of the instruction fetch stage we know whether the branch should be taken; if yes, and the target address is known, we can have a zero-cycle penalty for branches, as I have elaborated in detail. Now let us come to some intricacies of the branch target buffer. Which target addresses should be stored in the BTB? Will you store only the predicted-taken branches, and if the prediction is not taken, what happens? If the prediction is not taken, we know the next address is PC + 4, so it is not necessary to store a target address in that case. And can the branch prediction buffer and branch target buffer be combined?
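The lookup just described can be sketched as follows, using a Python dict to stand in for the tagged, content-addressable hardware; the class and method names are illustrative.

```python
# Sketch of a branch-target-buffer lookup during instruction fetch.
# Each entry stores the full branch PC as a tag plus the predicted target PC;
# the fetch PC must match the stored tag exactly, since acting on a wrong
# entry would send fetch to a wrong address.

class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}              # branch PC -> predicted target PC

    def lookup(self, fetch_pc):
        # Hit: this PC is a branch predicted taken; fetch from the target.
        # Miss (None): not a predicted-taken branch; fall through to PC + 4.
        return self.entries.get(fetch_pc)

    def insert(self, branch_pc, target_pc):
        self.entries[branch_pc] = target_pc

    def remove(self, branch_pc):
        # Used, e.g., when a stored taken-branch entry mispredicts.
        self.entries.pop(branch_pc, None)
```

The dict models a fully associative structure; real hardware would use a set-associative cache with the upper PC bits as the tag, but the hit/miss behavior is the same.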
Here we are asking: can we use a single buffer to store the prediction as well as the branch target address? This can be done very conveniently with a single-bit, one-bit predictor. However, with a two-bit predictor a complication arises. Why? We have seen that in a two-bit predictor the states 00 and 01 predict untaken while 10 and 11 predict taken, so an entry may exist for a branch that is currently predicted untaken. This means that target addresses for branches predicted untaken would also have to be stored: with a two-bit predictor it is not enough to store the addresses of the taken branches; you also have to store the addresses of branches currently predicted untaken. That is the complication that arises with a combined branch prediction and branch target buffer in the two-bit case. In many commercial processors, such as the PowerPC, separate branch target and branch prediction buffers have been used to avoid this complication; and since the predictor is invariably two bits rather than a single bit, the two buffers are kept separate, one for the branch targets and one for the branch predictions. Now, how the branch target and branch prediction buffers are combined in operation is illustrated with the help of this flow chart, which shows how the pipeline behaves when you have both. We have assumed that the instruction fetch stage accesses only the branch target buffer and that the instruction decode stage is responsible for the prediction. The first step is to send the program counter to memory and to the branch target buffer: in the instruction fetch cycle the PC is sent both to memory and to the branch target buffer.
After sending the PC, you have to check whether an entry is found in the branch target buffer; as we saw, that check is done in this step itself. Suppose the entry is not found. Then the question is whether the instruction is a taken branch, and there are two possibilities. If the answer is no, execution proceeds normally: the next instruction is fetched from PC + 4, as in the MIPS pipeline we have discussed, and this can be resolved in the ID stage. If the answer is yes, the instruction is a taken branch, then you have to enter the branch instruction address and the next PC into the branch target buffer, and that is done in the execution stage. Since the entry was not found and you had to go all the way to the execution stage, you lose two cycles in this case: no useful instruction could be fetched in those stages. Now consider the case where the entry is found in the branch target buffer. Then you can send out the predicted PC, and again there are two possibilities: the branch may be taken or not taken.
If the branch is taken, it has been correctly predicted, and execution continues with no stalls from the predicted address. But if the outcome turns out to be wrong, a mispredicted branch, then you have to kill the fetched instructions, restart the fetch at the other address, and delete the entry from the target buffer. We have seen that only the addresses of taken branches are stored in the branch target buffer, so when the branch is actually untaken you have to remove that particular entry, undo the instructions that have been fetched, and fetch from the other address; as a consequence, here too you lose two cycles. So in two of the cases you lose two cycles, and in two cases you lose none. Current processors combine the target and prediction logic into a separate instruction fetch unit that operates independently of the pipeline, and there is a special variant that stores decoded instructions, which I shall discuss later. The penalties I was describing, when there is a two-cycle penalty and when there is none, are summarized here, assuming a new target is written into the PC only at the end of the execution cycle.
When the instruction is in the buffer, the prediction is taken, and the branch is actually taken, there is no penalty; when the prediction is taken but the branch is actually not taken, there is a loss of two cycles, as I have already explained. If the instruction is not in the buffer and the branch is actually taken, then again you lose two cycles; but if it is not in the buffer and actually not taken, there is no loss, because fetch simply continues and the branch was indeed not taken. So this is how the branch target buffer works: there are losses in two cases and zero latency in the other two. Now there is another requirement, particularly for unconditional branches: the return address predictor. The techniques discussed so far work only with direct branches, whose target address is calculated PC-relative and which are then taken or not taken. But there are many indirect branches, whose target is known only at execution time and is not PC-relative. An important category of indirect jump is returns. Consider subroutine calls: you have a main program, say main 1, which calls a subroutine, so control jumps to the subroutine, and at its end there is a return, so control comes back to the point of call. But this subroutine may be called not only by main 1; another main program, say main 2, may call the same subroutine, and in that case the return must come back not to the first point but to a different one. So we find that for the same subroutine the return has to go to different addresses if the subroutine is called by different main
programs, or from multiple places. You will encounter such return instructions in many situations: for example, they account for more than 15% of the branches in SPEC89, and in object-oriented languages like C++ and Java such indirect, unconditional branches appear in many places. In these cases the techniques we have discussed will not work properly. Can we predict the return address? Just as the branch target buffer predicts where a branch will go, here we have to predict where the return will go, and there are two possible options. One is to use the branch target buffer: but if we use a simple branch target buffer to store the return address, the accuracy tends to be low, for the reason I have already explained. The return address does not correspond to a fixed target for a given branch PC; it depends on which program is calling the subroutine, that is, on the call site, and the call site can differ across different instances of a call to a single subroutine. The alternative is to use a return address stack, and this is how the problem can be overcome: push an entry onto the return address stack at a call, and pop the entry upon return. So in addition to the branch prediction buffer and branch target buffer, you will require a return address stack. This gives perfect prediction as long as the call depth does not exceed the size of the return address stack. Of course, one thing to keep in mind is that a stack has limited size: as long as there is no stack overflow, the prediction of the return
address will be always correct only when there is stack overflow you will not get correct branch prediction so as long as your I mean that stack depth is enough you will get perfect prediction but if there is stack overflow then of course it will not give correct result let us see what is the simulation results on misprediction rates for different sizes of return address carried out on spec CPU 95 benchmarks so we find that as the number of entries in the return address stack is increased the misprediction rate falls significantly so we find that for different benchmark programs the misprediction rate falls sharply as the number of entries in the return stack is increased but you can see you do not really require a very large stack size only by using a stack of 16 entries your misprediction rate is significantly lower except for I mean except for one brand one particular application program that is your Li for this particular benchmark it is not 100% prediction I mean prediction is misprediction rate is 0 for all other application except for only one application program that is your Li where the misprediction rate is about 2.5% even with this return stack size 16 entry if you increase the size of the return stack for this application also it may become 0 but this particular simulation result clearly demonstrates the usefulness of the return stack and the size need not be very high that is also clear from this particular simulation results. 
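The push-at-call, pop-at-return behaviour described above can be sketched as follows. This is a minimal illustration only; the 16-entry default and the evict-oldest-on-overflow policy are assumptions made for the example, not details from the lecture.

```python
# A minimal sketch of a return-address stack (RAS), assuming a simple
# fixed-size LIFO that discards the oldest entry on overflow.
class ReturnAddressStack:
    def __init__(self, size=16):
        self.size = size
        self.stack = []
        self.overflowed = False

    def on_call(self, return_pc):
        """Push the fall-through address when a call is fetched."""
        if len(self.stack) == self.size:
            # Overflow: the oldest entry is lost, so a later deep
            # return may be mispredicted.
            self.stack.pop(0)
            self.overflowed = True
        self.stack.append(return_pc)

    def on_return(self):
        """Pop the predicted return address when a return is fetched."""
        if self.stack:
            return self.stack.pop()
        return None  # underflow: no prediction available

# Two call sites invoking the same subroutine return to different
# places, which a PC-indexed BTB cannot capture but the stack handles.
ras = ReturnAddressStack(size=16)
ras.on_call(0x1004)   # call from "main one"
assert ras.on_return() == 0x1004
ras.on_call(0x2008)   # call from "main two", same subroutine
assert ras.on_return() == 0x2008
```

As long as the nesting depth stays within the stack size, every pop returns exactly the matching call site.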
Now, the last topic related to this is known as branch folding. Branch folding can be considered a variation of the techniques we have discussed. What have we done so far? In a branch prediction buffer we store the predictions (1 bit or 2 bits), and in the branch target buffer we store the target addresses; then, from the target address, we still have to fetch the instruction. Instead of that, why not store the instruction itself? That is the basic idea of branch folding: store one or more target instructions instead of, or in addition to, the predicted target address. The advantage is that it allows a larger branch target buffer, and it allows us to perform an optimization called branch folding, under which unconditional branches can run in zero cycles. Normally unconditional branches do not run in zero cycles, but using this technique even unconditional branches incur a penalty of zero cycles. How is it done? When the branch target buffer signals a hit and indicates that the branch is unconditional, the pipeline can simply substitute the instruction stored in the branch target buffer for the branch instruction. This is how we achieve a zero-cycle penalty for unconditional branches, and this technique is known as branch folding. Now let us consider how different types of predictors have been used in different processors. First we shall consider the Pentium processors. The original Pentium uses a 2-bit local predictor, so a simple 2-bit predictor is being used, but there is some difference from the saturating-counter predictor we have discussed: here a direct jump from the 00 state to the 11 state takes place.
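For comparison, the standard saturating 2-bit counter discussed in the earlier lectures can be sketched as below (the Pentium variant mentioned above differs in that a misprediction in the 00 state jumps straight to 11). This is an illustrative sketch, not any processor's exact implementation.

```python
# A minimal sketch of the standard 2-bit saturating-counter predictor.
# States: 00 = strong not-taken, 01 = weak not-taken,
#         10 = weak taken,      11 = strong taken.
class TwoBitCounter:
    def __init__(self):
        self.state = 0

    def predict(self):
        return self.state >= 2  # predict taken when counter is 2 or 3

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)  # saturate at 11
        else:
            self.state = max(self.state - 1, 0)  # saturate at 00

c = TwoBitCounter()
for outcome in [True, True, True, False]:  # three taken, one not-taken
    c.update(outcome)
# Hysteresis: a single not-taken outcome does not flip a strong-taken
# prediction; the counter moved 11 -> 10 and still predicts taken.
assert c.predict() is True
```

This hysteresis is exactly why the 2-bit scheme outperforms the 1-bit scheme on loop branches: one loop exit per iteration set does not destroy the prediction.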
So it is a little different from the saturating-counter 2-bit predictor that we have discussed. Then the Pentium MMX, Pentium Pro, Celeron and Pentium II use a (4,2) correlating predictor, a correlating predictor with 4 bits of global history and 2-bit counters. That means there are 2^4 = 16 different tables; one of them is selected using the 4-bit global history, the table is indexed using the lower-order bits of the branch address, and the 2-bit counter found there is used for the prediction. The Pentium III uses a two-level adaptive predictor, but unfortunately the details given for the Pentium III are very sketchy; as far as the branch target buffer is concerned, it uses a 512-entry branch target buffer. The Pentium 4, on the other hand, uses a 4096-entry branch target buffer, and it also uses an execution trace cache; we have not discussed the execution trace cache so far, and I shall discuss it later when we take up cache memories. Coming to the predictor in the DEC Alpha 21264: this chip uses the most sophisticated predictor, a tournament predictor. As we have seen, a tournament predictor requires three components: a selector, a global predictor and a local predictor, where the selector (the choice predictor) selects between the local predictor and the global predictor. The selector uses 4K 2-bit counters, indexed by the local branch address, to choose between the local and global predictions, as shown in this particular diagram. Then you have the global predictor: 4K entries indexed by the history of the last 12 branches, where each entry is a 2-bit predictor.
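A global predictor of this kind, where an m-bit global history register indexes a table of 2-bit counters, can be sketched as follows; m = 4 corresponds to the (4,2) Pentium Pro style predictor and m = 12 to the Alpha's 4K-entry global predictor. This is a simplified sketch (it indexes by history alone and omits the branch-address hashing), intended only to show the mechanism.

```python
# A sketch of a global correlating predictor: an m-bit global history
# shift register indexes a table of 2-bit saturating counters.
class GlobalPredictor:
    def __init__(self, history_bits):
        self.m = history_bits
        self.history = 0                         # last m branch outcomes
        self.table = [1] * (1 << history_bits)   # 2-bit counters, weak NT

    def predict(self):
        return self.table[self.history] >= 2

    def update(self, taken):
        ctr = self.table[self.history]
        self.table[self.history] = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
        # Shift the actual outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

gp = GlobalPredictor(history_bits=4)  # (4,2) style: 16 counters
for _ in range(10):                   # an always-taken branch stream
    gp.update(True)
assert gp.predict() is True           # history 1111 now maps to "taken"
```

Because the counter is chosen by the recent outcome pattern, the predictor can learn behaviour that depends on what the preceding branches did, which a plain 2-bit counter cannot.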
So the global predictor uses 4K entries, each a 2-bit predictor; 4K entries means a 12-bit index, so effectively a (12,2) correlating predictor is used as the global predictor. As shown, the global predictor is 4K x 2 bits, and the global prediction is made using the 12-bit global history; the same global history also drives the selection. Now let us come to the local predictor. The local predictor is itself two-level: the top level is a table of 1K entries, each holding a 10-bit history of that branch's recent outcomes. It detects patterns: if the branch is always taken the 10-bit sequence is all 1s, if it alternates the sequence is 1 0 1 0 1 0 ..., and if it is always untaken the sequence is all 0s. This 10-bit history is then used to index a 1K-entry table of 3-bit saturating counters. A total of 29K bits (4K x 2 for the selector, 4K x 2 for the global predictor, 1K x 10 for the local histories and 1K x 3 for the local counters) provides high accuracy in branch prediction. So, as you can see, the 10-bit entry from the local history table is used to look up the 1K x 3-bit counter table for the local prediction. Either the local prediction or the global prediction is used, and that is decided by the selector; it will use one of the two for the final prediction. So this is the predictor used in the DEC Alpha chip, possibly the most sophisticated one used so far in any processor, and modern processors use somewhat similar types of predictors. So we can now summarize: prediction has become an important part of superscalar execution.
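The three-component structure described above can be sketched as follows. This is a heavily scaled-down illustration, not the actual 21264 organization: the table sizes (16 entries instead of 4K/1K), 2-bit local counters instead of 3-bit, and the direct-modulo indexing are all simplifications made for the example.

```python
# A simplified tournament predictor in the spirit of the Alpha 21264:
# 2-bit selector counters choose between a global and a local component.
def saturate(ctr, taken):
    """2-bit saturating counter update."""
    return min(ctr + 1, 3) if taken else max(ctr - 1, 0)

class Tournament:
    def __init__(self, entries=16):
        self.selector = [1] * entries   # >= 2 means "trust the global"
        self.ghist = 0                  # global history register
        self.gtable = [1] * 16          # global: 2-bit counters
        self.lhist = [0] * entries      # per-branch local histories
        self.ltable = [1] * 16          # local: 2-bit counters

    def predict(self, pc):
        i = pc % len(self.selector)
        g = self.gtable[self.ghist] >= 2
        l = self.ltable[self.lhist[i]] >= 2
        return g if self.selector[i] >= 2 else l

    def update(self, pc, taken):
        i = pc % len(self.selector)
        g = self.gtable[self.ghist] >= 2
        l = self.ltable[self.lhist[i]] >= 2
        # Train the selector only when the two components disagree,
        # moving it toward whichever component was right.
        if g != l:
            self.selector[i] = saturate(self.selector[i], g == taken)
        self.gtable[self.ghist] = saturate(self.gtable[self.ghist], taken)
        self.ltable[self.lhist[i]] = saturate(self.ltable[self.lhist[i]], taken)
        # Shift the outcome into both history registers (4 bits here).
        self.ghist = ((self.ghist << 1) | int(taken)) & 0xF
        self.lhist[i] = ((self.lhist[i] << 1) | int(taken)) & 0xF

tp = Tournament()
for _ in range(20):
    tp.update(pc=0x40, taken=True)
assert tp.predict(0x40) is True
```

The key design point is that the selector is trained only on disagreements, so it learns, per branch, whether global or local history is the better guide.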
So, to summarize: the branch history table (2-bit) gives good accuracy for loops. Correlation exploits the fact that recently executed branches are correlated with the next branch, whether they are different branches or different executions of the same branch. The tournament predictor devotes more resources to competing solutions and picks between them. The branch target buffer includes branch addresses and predictions. Predicated execution can reduce the number of branches, and hence the number of mispredicted branches. And the return address stack is used for the prediction of indirect jumps, as I have already discussed. Here we can conclude that branch prediction techniques achieve 80 to 95 percent accuracy; the exact benefit varies with the type of program and the size of the buffers. Branch prediction is crucial for present-day processors because nowadays we use superscalar architectures where multiple instructions are issued, so it is extremely essential: we need to supply multiple instructions per cycle to the subsequent stages of the superscalar pipeline. We can also reduce branch penalties by reducing misprediction penalties: fetch from both the predicted and the unpredicted paths and store buffered instructions from both paths in the BTB; later on we shall see that an extension of this idea is the execution trace cache. You may be asking how all this is possible. The reason is that, as we have seen, Moore's law provides us with a large number of transistors: the dimensions of the transistors keep shrinking, so you can put more and more transistors on a chip, and that has helped us use sophisticated branch predictors in processors. With this we have come to the end of today's lecture. Thank you.