Welcome to today's lecture on dynamic instruction scheduling with branch prediction. In the last two lectures we discussed various techniques of branch prediction, and earlier we discussed dynamic instruction scheduling. Today we shall see how these two can be combined to achieve higher performance in a processor. Before that, let us have a quick recap of one very important concept used in dynamic instruction scheduling: the data flow architecture. Data flow architectures were originally proposed with the basic idea that the hardware represents a direct encoding of the compiler's data flow graph. For example, suppose you have to perform the computation y = (a + b) / x and x = a * (a + b) + b, where a and b are the inputs and y and x are the outputs. In a data flow architecture, a data flow graph is created for this computation: here you can see a and b are the inputs, y and x are the outputs, and the nodes are the various operations — the addition a + b, the multiplication, the second addition producing x, and the division producing y. Now, what is the basic idea of a data flow architecture? Data flows along the arcs in the form of tokens — the two circles shown here on the arcs are tokens. When tokens have arrived on both inputs of a compute box (the adder, for instance, is a compute box), the box fires and produces a new token at its output. That is, if there is a token on one input arc and also a token on the other input arc, then the compute box fires and produces a token at the output. And if the output of a node feeds more than one arc — a split — then copies of the token are produced: when the a and b tokens are present, the adder fires and the resulting token is made available on both of its outgoing arcs.
Now, tokens are available on a and on the arc carrying a + b, so the multiply compute box fires and produces a token at its output. Next, a token is present on that arc and a token for b is already present on the other arc, so the add compute unit fires. With its output token available, and the token for a + b that was already there, tokens are now present on both input arcs of the divide unit, so the divide unit fires and finally produces the token for y. So you see, this is how the different computations are controlled with the help of tokens in data flow machines. A similar thing is done in dynamic instruction scheduling, and particularly in Tomasulo's approach: this data flow graph is effectively built and executed by the hardware while the instructions run, with the help of the reservation stations and the functional units. So this is one very important concept. As we have already discussed, in dynamic instruction scheduling the hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior — the same data flow concept I just explained is borrowed here, and of course exception behavior must also be maintained, which is important. It has many advantages, as we have already discussed: it allows handling situations not known at compile time, so it is better than compile-time instruction scheduling; it simplifies the compiler; and it allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve, because it permits out-of-order execution.
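The firing rule described above — a node fires as soon as tokens are present on all of its input arcs — can be sketched in a few lines. This is a minimal illustration of my own, not the design of any real data flow machine, using the lecture's example computation.

```python
# A minimal sketch of data-flow firing: a node fires as soon as tokens are
# present on all of its input arcs, and its output token may feed many arcs.
import operator

def run_dataflow(nodes, tokens):
    """nodes: name -> (op, input arcs, output arc); tokens: arc -> value."""
    fired = True
    while fired:
        fired = False
        for name, (op, ins, out) in nodes.items():
            if out not in tokens and all(i in tokens for i in ins):
                tokens[out] = op(*(tokens[i] for i in ins))  # the box fires
                fired = True
    return tokens

# The lecture's example:  y = (a + b) / x,  x = a * (a + b) + b
nodes = {
    "add1": (operator.add,     ["a", "b"],        "a+b"),
    "mul":  (operator.mul,     ["a", "a+b"],      "a*(a+b)"),
    "add2": (operator.add,     ["a*(a+b)", "b"],  "x"),
    "div":  (operator.truediv, ["a+b", "x"],      "y"),
}
t = run_dataflow(nodes, {"a": 2.0, "b": 3.0})
# x = 2*(2+3)+3 = 13, and y = (2+3)/13
assert t["x"] == 13.0
```

Note that no program counter sequences these operations: the order in which `mul`, `add2`, and `div` fire is determined purely by token availability, which is exactly the property Tomasulo's reservation stations reproduce in hardware.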
So, if some instruction is delayed, other instructions that are later in program order can proceed. This mattered particularly in early processors without cache memory, where a memory instruction would take a long time; and even when a cache is present, a cache miss can make an access take much longer. This type of problem is resolved by executing other code while waiting for the miss to resolve — that is, the differing delays of different instructions are tolerated. It also allows code that was compiled with one pipeline in mind to run effectively on a different pipeline, as I have elaborated earlier. Now, in my next lecture I shall discuss another very important technique, hardware speculation. You will see that hardware speculation goes hand in hand with dynamic instruction scheduling, and dynamic instruction scheduling along with hardware speculation can lead to further performance advantage. And, as I have already mentioned, Tomasulo's scheme builds the data flow graph on the fly — the data flow graph shown here is created dynamically in hardware. We have already discussed the basic structure required to implement Tomasulo's scheme: you have the instruction queue, the floating point registers, and the key innovation, the reservation stations, along with load buffers and store buffers. The main feature is register renaming, which is achieved by the reservation stations — they buffer the operands of instructions waiting to issue — together with the issue logic that we have already discussed, and this avoids WAR and WAW type hazards without stalling.
Hazard detection and execution control are distributed: the reservation stations buffer the instructions and their operands, and by the help of this buffering the results are passed directly to the functional units from the reservation stations rather than going through the registers. This bypassing is done over the common data bus: the result is broadcast on the common data bus and received simultaneously by all the reservation stations waiting for that operand. These are the key contributions of Tomasulo's approach. Another point is that integer instructions can pass branches, which allows floating point operations in the floating point queue to go beyond the basic block — and this can be done only with the help of branch prediction, as we shall see. So let us illustrate this with Tomasulo's loop example. Earlier we have seen how dynamic instruction scheduling can be done over a basic block, and how you can increase the size of the basic block by loop unrolling and so on; but now you will see that loop unrolling can be done dynamically with the help of branch prediction. This is the example, and these are the assumptions we shall make: multiply takes 4 clock cycles; the first load takes 8 cycles because of a cache miss, and the second load takes 4 clock cycles. Later on I shall discuss cache memory in detail, and you will see that whenever there is a cache miss while reading a particular word — that word not being present — the entire block comprising several words is transferred to the cache memory. As a result, if you subsequently read another word from that block, it will lead to a hit. So the first access misses and subsequent ones hit; that is what is being assumed here, and later on we shall discuss in detail how the cache memory really works, along with the terminology of cache miss and cache hit.
But for the time being, let us assume that the first load takes 8 clock cycles because of the cache miss and subsequent loads take 4 clock cycles. We shall also show the clocks for the two integer instructions, which are essentially used for loop housekeeping: R1 is decremented gradually and acts as a pointer to successive elements of an array, and the branch decides how many times the loop will proceed by checking whether R1 is equal to 0 or not. The integer instructions will run ahead of the floating point ones, but we shall primarily focus on the floating point operations. Now, consider the hazards due to out-of-order execution. If a branch is predicted as taken, multiple iterations of the loop can proceed at once using reservation stations. Normally the execution is restricted to a basic block, but if the branch terminating a loop is predicted taken, then obviously the instructions of the later iterations can also be carried out — this is the basic idea here. How this happens we shall discuss shortly. Also, a load and a store can safely execute out of order provided they access different memory locations; this is another thing we have to take care of. We have already discussed read-after-write (RAW) hazards, which essentially arise out of true dependences and are taken care of by Tomasulo's algorithm, as we have seen. Now, there are the other two hazards: write after read (WAR) and write after write (WAW).
If these hazards involve registers, they can be taken care of using register renaming, which we have already discussed; and in Tomasulo's algorithm this register renaming is done dynamically, without actually requiring extra architectural registers, as we have discussed in detail. Now the question arises: if these hazards — the WAR and WAW hazards arising out of name dependences — involve memory, how are they taken care of? You may have a series of load and store instructions present in a program, and the sequence in which they appear is, as you know, the program order. Suppose you have one load, say L.D F0, 0(R1), and there can be several other loads and stores before and after it. Now consider another load, also into F0, say from 8(R1) — and in between, the value of R1 may have changed (which I have not written here), so the two may in fact access the same memory location. If these two loads read from the same memory location and you change their order, this will lead to a hazard — essentially a write-after-write hazard, because both are writing into the same register: you are reading from the same memory location and writing into the same register F0.
So in this case, although memory is involved, both instructions write into the same register, and if the later one is executed first, this will lead to a write-after-write hazard. How can it be overcome? In this way: whenever a load instruction is encountered, its effective address is calculated, and that address is available, as you know, in the address (A) field of Tomasulo's data structure maintained for the reservation stations and load/store buffers. After the effective address of a particular load is known, the hardware checks whether any earlier store that is still active — that is, issued but not yet carried out — has the same effective address; if so, this particular load has to be delayed, meaning it is not entered into the load buffer. (Remember, there is a load buffer and a store buffer.) Similarly, a store leads to the same kind of hazard: if a store to some memory location follows a load or store with the same effective address that has not yet completed, that store has to be delayed. So for store instructions as well, the calculated effective address stored in the A field has to be compared with the effective addresses of the preceding load and store instructions that are still active in the buffers. This is how you take care of loads and stores.
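The effective-address check just described can be sketched as a simple predicate. This is my own simplified illustration of the rule, not the exact IBM 360/91 buffer logic: an instruction may proceed only if no earlier, still-active memory instruction conflicts with it on the same effective address.

```python
# Memory disambiguation sketch: a load may issue only if no earlier active
# store has the same effective address; a store must wait for ALL earlier
# active loads and stores to that address.
def can_issue(kind, addr, earlier_active):
    """earlier_active: list of (kind, effective_address) pairs for memory
    instructions that precede this one in program order and have not
    yet completed."""
    for k, a in earlier_active:
        if a != addr:
            continue            # different locations: reordering is safe
        if kind == "load" and k == "store":
            return False        # load passing a store to the same address
        if kind == "store":
            return False        # store passing a load (WAR) or store (WAW)
    return True

assert can_issue("load", 80, [("store", 72)])      # different address: OK
assert not can_issue("load", 72, [("store", 72)])  # must wait for the store
assert not can_issue("store", 72, [("load", 72)])  # WAR through memory
```

Notice that two loads to the same address may still reorder freely — only store involvement creates a memory hazard, which is why the check is directional.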
So, if a load and a store access the same address, out-of-order execution leads to WAR or WAW type hazards, as we have already seen. To detect such hazards the processor must compute the effective addresses of the load and store instructions in program order. Tomasulo's scheme thus combines two different techniques: renaming of the architectural registers to a larger set of registers, with buffering of the source operands from the register file — this handles the register case — while for memory we have to do it in the way just described. Now let us consider the loop example I have already mentioned. You can see here that the instructions of two iterations of the loop are present in this instruction window, and this is the code to be executed. Initially register R1 holds the value 80, meaning the memory location from which the first load will fetch is address 80. With this starting point, let us see how the execution will proceed: the load and store buffers are shown here, the loop instructions are given here, the iteration count is given here — this is the first iteration, this the second — and the value of the register used for the address in the loop control is shown here. Now let us go to clock cycle 1: the first load instruction is issued, and you can see the load unit is busy, the effective address 80 is available, and the register into which it will load is recorded here; the arrow shows that this load instruction is being issued.
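The data structure being updated in this walkthrough is the classic reservation-station record. The field names (Busy, Op, Vj, Vk, Qj, Qk, A) follow the standard textbook presentation of Tomasulo's algorithm; the code around them is my illustrative sketch, and the operand value 2.5 for F2 is an arbitrary assumption.

```python
# Sketch of a reservation station: Vj/Vk hold operands already available,
# Qj/Qk name the station that will produce a not-yet-available operand,
# and A holds the effective address for loads and stores.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None  # first operand value, if available
    vk: Optional[float] = None  # second operand value, if available
    qj: str = ""                # producing station for vj ("" = ready)
    qk: str = ""                # producing station for vk ("" = ready)
    a: Optional[int] = None     # effective address (loads/stores only)

    def ready(self) -> bool:
        # execution may begin once both operands are actual values
        return self.busy and self.qj == "" and self.qk == ""

# Cycle 2 of the example: MUL.D reads F2 directly from the register file
# into Vk, but F0 must come from Load1, so Qj records the tag "Load1".
mult1 = ReservationStation("Mult1", busy=True, op="MUL.D",
                           vk=2.5, qj="Load1")   # 2.5: assumed F2 content
assert not mult1.ready()       # still waiting on Load1's broadcast
mult1.vj, mult1.qj = 80.0, ""  # CDB broadcast from Load1 fills Vj
assert mult1.ready()
```

The tag in Qj is exactly the arc of the data flow graph: the instruction does not name register F0 any more, it names the producer of the value.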
Now we go to the second clock cycle, where the MUL.D instruction is issued and the necessary data structure is updated to show it: the station becomes busy, the operation is recorded, and the operand sources are noted. One operand is read directly from register F2 — its value goes into Vk — while the other operand, F0, will come from Load1, so that tag is recorded. In the third cycle the store instruction of the first iteration is issued and again the corresponding data structure is updated: the store unit becomes busy, its effective address is 80, and the source of the F4 value to be stored is Mult1 — the first multiplier station. There is no other change, and we go on to the fourth cycle. Here, implicit renaming sets up the data flow graph. Implicit renaming means that you do not really perform any explicit renaming of these registers when you go to the second iteration; it is done implicitly, because the data will be taken from the reservation stations, as we shall see. You can see that F0 will come from the load, and that value is provided both here, to the multiplier, and here, for the store — both consumers are fed the same tag. Now the subtract-immediate instruction is dispatched, but it does not go to the floating point queue — what is shown is the floating point operation queue, and this integer instruction is executed without entering it.
This changes the value of R1: 8 is subtracted from it, and R1 now becomes 72. So that instruction is executed, then the branch is executed, and we go to the second iteration. You can see that cycles 4 and 5 show no issue before the second iteration starts: those two cycles were needed for calculating the address for the second iteration and for evaluating the branch condition. Then, in clock cycle 6, the load instruction of the second iteration is issued; the second load unit becomes busy, but here the effective address is 72, not 80, as shown. So two load instructions have now been issued, although neither has completed its operation. Now notice what has happened to the register status: earlier the entry for F0 said Load1 — that is, this floating point register was supposed to be written by the Load1 operation — but it has now been changed to Load2. In other words, before the first load's result was ever written back, the tag was overwritten by Load2, so F0 never sees the value fetched by the first load instruction. That value is not lost — it is available in the reservation stations that captured the tag — but because another load targeting the same register was issued before the first load completed result writing, the register status entry has been modified.
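This tag-overwriting behaviour is the heart of the implicit renaming. Here is a small sketch of my own simplification of the register-status table: the register file only accepts a broadcast from the station that is currently recorded as its newest producer, so an older in-flight load can never clobber it.

```python
# Register-status sketch: issuing a newer writer of a register overwrites
# its tag, so only the newest producer's CDB broadcast updates the register.
reg_status = {}               # register name -> tag of producing station
regs = {"F0": 0.0}            # architectural register file

def issue_writer(reg, station):
    reg_status[reg] = station  # newest writer wins the register

def cdb_broadcast(station, value):
    # the register file snoops the bus, but takes the value only if this
    # station is still recorded as the register's newest producer
    for reg, tag in list(reg_status.items()):
        if tag == station:
            regs[reg] = value
            del reg_status[reg]

issue_writer("F0", "Load1")      # cycle 1: first iteration's load
issue_writer("F0", "Load2")      # cycle 6: issued before Load1 writes back
cdb_broadcast("Load1", 111.0)    # Load1 completes: F0 must NOT change
assert regs["F0"] == 0.0
cdb_broadcast("Load2", 222.0)    # Load2 completes: F0 takes this value
assert regs["F0"] == 222.0
```

The value 111.0 is not wasted, of course: in the real machine the reservation stations holding the tag Load1 (the first iteration's multiply and store) capture it off the bus; only the register file ignores it.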
So F0 will not see the first load's value, and we proceed to the 7th cycle. Here the second multiplication is issued, because we have two multipliers, and the necessary updating of the data structure is visible: one operand comes from register F2, whose value is written into Vk, while the other will come from the load — the entries show the tags Load1 and Load2 for the two multiplies — and the value will arrive in Vj after the load is performed. With this instruction issued, the computation is now completely detached from the register file: execution will proceed without involving the registers. You can see that the first and second iterations are now overlapping — the instructions of both iterations have been issued, although no completion has yet taken place. The second store instruction is also issued, because a store unit is available, and the corresponding entry records its effective address, 72. So far there is no hazard present. Now the first load completes: we assumed it takes eight cycles because of the cache miss, and since it was issued in cycle 1, the load operation is completed by cycle 9. The value will then be available in the corresponding reservation stations — you can see the load completing, and obviously Mult1 is waiting for that data to become available.
Now the machine dispatches the next instruction — the second iteration continues — and, as I was saying, the first load has now completed, and the corresponding value is transferred directly into Vj, the operand register that is part of the reservation station. Both values are now available, so in the next cycle the multiplication instruction will start: the 4 written here indicates that the computation has started and will require four cycles. In the meantime, as you can see, the machine is already prepared for the third iteration, because R1 has been modified again: the effective address was 80 for the first iteration, 72 for the second, and is 64 for the third. Load2 is also completing now — it was started in cycle 6 and takes four cycles, so 6 + 4 gives cycle 10 for completion, with result writing taking place in the 11th cycle. The result — that is, the operand — is provided to Mult2: this load feeds Mult2, and the value goes directly into the reservation station buffers. And notice that in the meantime the register status for F0 has been updated to Load3, so the result coming from the second load will likewise not be written into the register, because yet another instruction has already been issued that will provide the value of F0. It continues in this way.
So the next load in sequence is coming — the third iteration has started. Now the question arises: can the following instructions issue? This third load can indeed be issued, although it is not shown here because the snapshot of the instruction window has not been updated. But can the third multiplication be issued? It cannot, and the reason is that both multipliers are now busy — we have only two multiplier stations — so this instruction cannot be issued until one of the two multipliers becomes free. As a consequence, although we have reached the third iteration, its multiplication cannot be issued because of a structural hazard. The third store cannot be issued either. Why? Because, as we have seen, Tomasulo's approach performs in-order issue: if the multiplication cannot be issued, no subsequent instruction can be issued, and so, since the multiplication has not been issued, the third store cannot be issued. Now the first multiplication is completing; its result will go to F4, and the store instruction is waiting for that data, so the store will begin in the next cycle. You can see it has provided the result, and that result will be used by the store instruction — but only starting in the next cycle, after completion. Meanwhile, the second multiplication is also about to complete.
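The in-order issue rule that blocks the third store can be sketched directly. This is my own illustration of the rule, not a cycle-accurate model: issue stops at the first instruction whose reservation station class is full, and everything behind it in program order stalls too, even if its own station class has room.

```python
# In-order issue with a structural hazard: stop at the first instruction
# that has no free reservation station of the class it needs.
def issue_in_order(window, free_stations):
    """window: (instruction, station class) pairs in program order;
    free_stations: station class -> number of free stations."""
    issued = []
    for instr, cls in window:
        if free_stations.get(cls, 0) == 0:
            break                      # structural hazard: stall here
        free_stations[cls] -= 1
        issued.append(instr)
    return issued

# Third iteration of the example: a load buffer and a store buffer are
# free, but both multiplier stations are busy.
window = [("L.D F0,64(R1)", "load"),
          ("MUL.D F4,F0,F2", "mult"),
          ("S.D F4,64(R1)", "store")]
done = issue_in_order(window, {"load": 1, "mult": 0, "store": 1})
assert done == ["L.D F0,64(R1)"]   # the store stalls behind the MUL.D
```

Even though a store buffer is available, the store cannot jump ahead of the stalled multiply at issue time — order is preserved at issue, and only execution and completion go out of order.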
You can see that the two multiplications complete just one cycle apart: the second was issued in cycle 7, but the operands arrived later, so one takes 10 + 4 and the other 11 + 4 — a multiplication requires four cycles — so one completes in cycle 15 and the other in cycle 16, as we shall see. Then the store of that data proceeds to completion: it requires four cycles, 14 + 4 = 18, so the first store completes writing its result, and in cycle 19 the second multiplication writes its result and the second store completes. So in this loop example, computation has been performed across the loop boundary. Looking at the table, we find that issue has remained in order — no instruction has been issued out of order: cycles 1, 2, 3, 6, 7, 8. However, execution and completion are out of order: one instruction completes in cycle 18 while a later one completed in cycle 10; another completes in cycle 15 while one before it completes in cycle 19. So out-of-order execution and out-of-order completion are taking place — instructions do not finish in program order, and one completes earlier than an instruction that precedes it. Now the question arises: why can Tomasulo's scheme overlap iterations of loops? How is it possible? Reason number one: multiple iterations use different physical destinations for the registers — it does register renaming.
Because of register renaming, different physical registers are used, and essentially you may consider this dynamic loop unrolling. Earlier we discussed in detail loop unrolling carried out by the compiler, where we saw that you must explicitly use the architectural registers available in the processor for the purpose of unrolling. Here, as you can see, even without additional architectural registers — by which I mean registers visible in the processor — it can do loop unrolling dynamically using the register buffers available in the reservation stations. That is the first reason. Second, the reservation stations permit instruction issue to advance past the integer control-flow operations, as we have already seen. They also buffer old values of registers, totally avoiding the WAR stalls that we saw in the scoreboard — starting from scoreboarding we have discussed in detail how the write-after-read stall is avoided — and as a consequence we are able to overlap iterations of loops in Tomasulo's scheme. Now, these are the three major advantages provided by Tomasulo's scheme. First, distribution of the hazard detection logic: we have seen there are several reservation stations, each detecting its own hazards, and if multiple instructions wait on a single result, they can all be supplied simultaneously by the broadcast on the common data bus. If a centralized register file were used, the units would have to read their results from the registers; this is avoided because we are not using the centralized register file but the operand buffers available in the reservation stations. Second, elimination of the stalls for WAR and WAW hazards — I have already elaborated how these stalls are overcome by dynamic register renaming and by effective-address comparison for the memory units. Third, it makes superscalar execution possible, which leads to another possibility.
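The first advantage above — many waiting instructions satisfied by a single broadcast — can be shown in a few lines. This is my own sketch: every reservation station snoops the common data bus, so all stations tagged with the same producer capture the result in the same cycle, with no trip through the register file.

```python
# Common-data-bus broadcast sketch: every station whose Qj tag matches the
# broadcasting producer captures the value simultaneously.
stations = [
    {"name": "Mult1",  "qj": "Load1", "vj": None},
    {"name": "Store1", "qj": "Load1", "vj": None},
    {"name": "Mult2",  "qj": "Load2", "vj": None},
]

def broadcast(tag, value, stations):
    for s in stations:
        if s["qj"] == tag:           # matching stations all capture at once
            s["vj"], s["qj"] = value, ""

broadcast("Load1", 3.14, stations)
assert stations[0]["vj"] == 3.14     # Mult1 got the value...
assert stations[1]["vj"] == 3.14     # ...and so did Store1, same cycle
assert stations[2]["vj"] is None     # Mult2 is still waiting on Load2
```

With a centralized register file, Mult1 and Store1 would instead contend for register read ports after the write-back; the broadcast turns that serial hand-off into a single parallel one.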
So far we have considered that the number of instructions issued per cycle is only one, but if we have enough instructions available that can be executed in parallel, we can go for a superscalar architecture. In other words, Tomasulo's basic scheme can be extended to superscalar execution. However, in that case it may be necessary to have more than one common data bus, and the reservation stations must be adapted for the multiple buses — but superscalar execution is possible. So, to summarize: the reservation stations provide renaming to a larger set of registers plus buffering of source operands, and this prevents the registers from becoming a bottleneck — the restriction imposed by the number of physical registers in the processor is overcome. As I have repeated many times, it avoids the WAR and WAW hazard stalls of the scoreboard. It allows loop unrolling in hardware through dynamic register renaming, and it is not limited to basic blocks, because the integer units get ahead and execution goes beyond branches, which also helps tolerate cache misses, as I have already explained. And these are the three lasting contributions of Tomasulo's approach: number one, dynamic instruction scheduling; register renaming; and load/store disambiguation. Load/store disambiguation means that the WAW and WAR type hazards arising out of loads and stores are overcome — the loads and stores are disambiguated, as is done in Tomasulo's algorithm. As a consequence, these ideas live on in the many descendants of the IBM 360/91 — recall that Tomasulo's approach was originally conceived for the IBM 360/91, to improve the performance of its floating point operations.
Subsequently these techniques have been incorporated in a good number of modern processors, such as the Pentium II, PowerPC 604, MIPS R10000, HP PA-8000, and DEC Alpha 21264. Of course, there are some drawbacks. The common data bus connects to multiple functional units, and because of its high capacitance it will be slower; moreover, the number of functional units that can complete per cycle is limited to one by the common data bus. So the common data bus is a major concern in Tomasulo's approach: all the functional units feed it, and as more and more functional units are attached the capacitance increases, and only one of them can output its result through it at a time. The result can go to multiple functional units, but only one functional unit can produce a result on the common data bus per cycle. This is a severe restriction, so the alternative approach is to have multiple common data buses, with more functional unit logic for parallel stores. This can be done, but it will definitely increase the complexity — and that is another drawback of Tomasulo's scheme: the hardware complexity is relatively high, because you require reservation stations holding a large number of buffers, and if you go for multiple common data buses the hardware complexity increases again. Another important aspect is imprecise exceptions, whose effective handling is a major performance concern. Let me briefly discuss the interrupts and exceptions which can arise in a processor. As you know, interrupts are essentially generated externally. Each processor — suppose you have a CPU — is usually provided with two interrupt inputs: one is usually the non-maskable interrupt and the other is the maskable interrupt.
They are available under a variety of names, but whatever they may be called, one is the non-maskable interrupt and the other the maskable interrupt. The non-maskable interrupt is commonly used for emergency situations such as power failure and is restricted to that; the maskable interrupt input, however, is commonly used for interrupts coming from I/O devices, and that has led to interrupt-driven I/O operations. That is, whenever an I/O device is ready to provide data to the processor, it generates an interrupt, and the processor then reads the data — that is the basic concept of interrupt-driven I/O. Interrupts may also come from the operating system. The other type of event is the exception. Exceptions are internal: they are generated as the instructions are executed. For example, an illegal opcode — the processor tries to execute an opcode that does not match any valid operation code of the processor, so it generates an exception. Similarly, divide by zero, overflow, underflow, and page faults are various situations that can happen dynamically as instructions are executed, and these lead to exceptions. Whenever this happens, the OS needs to intervene to handle the exception: control is usually transferred to the operating system, because the CPU cannot handle these things by itself. But then imprecise exceptions have to be taken care of. What do we mean by an imprecise exception? An exception is called imprecise when the processor state at the moment the exception is raised does not look exactly the same as it would if the instructions had been executed in order. That is, this problem arises out of out-of-order execution — since we are performing out-of-order execution, what can happen?
The exceptions generated by these instructions should appear exactly as if the instructions had been executed in order; this has to be ensured despite the out-of-order execution model. An imprecise exception is said to occur if, when an exception is raised by an instruction, some instructions before it may not be complete while some instructions after it are already complete. For example, a floating point exception could be detected after an integer instruction that is much later in program order has already completed. This type of imprecise exception has to be taken care of. So, we have discussed Tomasulo's approach in the context of branch prediction: along with branch prediction, Tomasulo's approach can help you go beyond loop iterations — you can execute instructions beyond loop boundaries, and multiple iterations can be executed one after the other if the branch is predicted taken, as we have demonstrated with the help of an example. However, certain things such as imprecise exceptions have to be taken care of, while the various other things, like the different types of hazards, are taken care of automatically, as we have discussed in detail. In my next lecture, I shall discuss another very important concept — not branch prediction, but speculative execution. We shall see what speculative execution is and how Tomasulo's scheme can be extended for speculative execution. Thank you.
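The definition just given can be stated as a simple test over the completion status of the instruction stream. This is my own sketch of the condition, nothing more: an exception at instruction i is precise only if every earlier instruction has completed and no later instruction has.

```python
# Precision test for an exception raised by the i-th instruction
# (program order): all earlier instructions done, no later one done.
def is_precise(completed, i):
    """completed[k] is True if the k-th instruction in program order
    has completed."""
    return all(completed[:i]) and not any(completed[i + 1:])

# A DIV.D at index 1 raises an exception after a later integer ADD
# (index 2) has already completed: imprecise, exactly the situation
# described in the lecture.
assert not is_precise([True, False, True], 1)
# If the later instruction had not completed, the exception is precise.
assert is_precise([True, False, False], 1)
```

Restoring this invariant under out-of-order execution is what the reorder buffer of the speculative schemes in the next lecture provides: instructions may execute out of order, but they retire, and take exceptions, strictly in order.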