Welcome to today's lecture. We shall continue our discussion of dynamic instruction scheduling. In the last lecture we discussed one technique, known as scoreboarding, and today we shall discuss Tomasulo's algorithm, which is an enhanced form of dynamic instruction scheduling that builds on the scoreboarding approach. As I have mentioned, dynamic scheduling means that the hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception behavior. So the scheduling is done by hardware instead of by the compiler, and this has a number of advantages. It allows the processor to handle situations that are not known at compile time. It simplifies the compiler: an aggressively optimizing compiler is not essential for each particular high-end processor. It allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve. That is, certain situations arise at run time that cannot be predicted at compile time, and when they arise the hardware takes care of them. It also allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline, so the code is not tied to a particular pipeline when we use hardware-based rather than compiler-based instruction scheduling. Later on we shall also discuss an advanced technique, hardware speculation, which builds on dynamic instruction scheduling and can give a further performance advantage. In particular, it increases the instruction-level parallelism that can be exploited, which is essential for superscalar and other multiple-issue processors.
Now let me very quickly review the scoreboard approach, which I discussed in the last lecture. We have seen that it gave a speedup of 1.7 for compiler-generated code and 2.5 for hand-written code, but the benefit was limited by slow memory, because at that time there was no cache memory. The scoreboard has a number of limitations. The first limitation is that there is no forwarding. In scoreboarding the data is taken from the actual register, so until the write-back into the register takes place you cannot read the data. Forwarding, which improves performance by reading data from the pipeline registers, is not available in scoreboarding, and that is a very strong limitation. Second, it is limited to the instructions in a basic block. As we have already seen, a basic block is small, maybe 6, 8 or 10 instructions. Without loop unrolling or other techniques that increase instruction-level parallelism, the parallelism available within a basic block is very restricted, and as a consequence you do not get much benefit. Third, the number of functional units is limited, and that determines the structural hazards: even when instruction-level parallelism is available, you cannot proceed if no functional unit is free. Fourth, in the case of a write-after-read (WAR) hazard, since operands are read from the registers, the scoreboard has to stall: the later instruction cannot write its result until the earlier instruction has read its operand. Similarly, it prevents write-after-write (WAW) hazards, but again it does so by holding back the instruction that comes later in the program code, so that the writes take place in program order.
So by stalling in this way it prevents WAW hazards, but the later instruction is held up until the earlier write has completed. These are the limitations of the scoreboard approach, and some of them have been overcome in Tomasulo's approach. It is a more sophisticated approach, and it was developed for the IBM 360/91. The basic goal was to keep the floating-point pipeline as busy as possible for the entire 360 family. The 360 family was a series of computers with many variations, and the scheme was meant for the entire series. The idea was not to develop a good compiler for the most advanced model and leave the low-end models as they were. Instead, this approach allows a simple compiler, with the same code used across the whole 360 family, while the hardware does the scheduling. That was the basic goal, keeping the floating-point pipeline in mind. In addition, the 360 had a very restricted number of registers. We have seen that scoreboarding, and also compiler-based instruction scheduling, need a large number of registers, but the 360 had only 4 double-precision floating-point registers, together with long memory access times and long floating-point delays. The reason was that in those days cache memory had not yet been introduced, so the 360/91 had no cache and reading data from memory took a long time. That is number one. Number two is that floating-point operations such as multiplication and division took a long time. This led Tomasulo to figure out how to achieve renaming in hardware, which is another innovative technique used here. We have already mentioned that renaming can help you to remove anti-dependences and output dependences. But whenever you do renaming, you will obviously require a large number of registers.
That is the case if you do it in software, with the help of the compiler. In Tomasulo's approach, however, renaming is done by hardware in a very innovative way, and we shall see how. Because of these advantages, variants of Tomasulo's approach have been used in more recent processors. Although it was developed for the 360/91, later processors such as the DEC Alpha 21264, HP PA-8000, MIPS R10000, Pentium III, PowerPC 604 and Pentium 4 all use it in some form. It may not be exactly the scheme that was developed for the 360/91, but some modified version of it is used in all modern processors. Let me give you a simple example before I proceed to Tomasulo's approach. This is a simple five-line code sequence consisting of DIV.D, ADD.D, S.D, SUB.D and MUL.D, and you can see that it has a large number of dependences. It has three true data dependences. For example, DIV.D and ADD.D have a true dependence, which can cause a read-after-write hazard: the operand F0 generated by the first instruction is used by the second. Similarly, ADD.D writes F6, which is read by the store, and SUB.D writes F8, which is used by the subsequent MUL.D instruction. Then there is an output dependence between ADD.D and MUL.D: the first writes into F6, and the second also writes into F6. And there is an anti-dependence between ADD.D and SUB.D: ADD.D reads F8, and SUB.D later writes into F8. So apart from the three true data dependences present in this code, there are name dependences, that is, output dependences and anti-dependences.
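The dependences in this five-instruction sequence can also be found mechanically. What follows is a minimal Python sketch and not part of the original lecture: the tuple encoding of the instructions and the function name `find_dependences` are assumptions made for illustration, and the check is a naive pairwise comparison that ignores any intervening redefinition of a register.

```python
from collections import namedtuple

# Hypothetical encoding: opcode, destination register (None for a store), source registers.
Instr = namedtuple("Instr", "op dest srcs")

CODE = [
    Instr("DIV.D", "F0", ("F2", "F4")),
    Instr("ADD.D", "F6", ("F0", "F8")),
    Instr("S.D",   None, ("F6", "R1")),   # the store reads F6 and R1, writes no register
    Instr("SUB.D", "F8", ("F10", "F14")),
    Instr("MUL.D", "F6", ("F10", "F8")),
]

def find_dependences(code):
    """Return (kind, earlier_index, later_index, register) for every instruction pair."""
    found = []
    for i, early in enumerate(code):
        for j in range(i + 1, len(code)):
            late = code[j]
            if early.dest and early.dest in late.srcs:
                found.append(("RAW", i, j, early.dest))   # true dependence
            if late.dest and late.dest in early.srcs:
                found.append(("WAR", i, j, late.dest))    # anti-dependence
            if early.dest and early.dest == late.dest:
                found.append(("WAW", i, j, early.dest))   # output dependence
    return found

if __name__ == "__main__":
    for kind, i, j, reg in find_dependences(CODE):
        print(f"{kind}: {CODE[i].op} -> {CODE[j].op} on {reg}")
```

On this sequence the sketch reports three RAW pairs (on F0, F6 and F8), one WAW pair on F6 between ADD.D and MUL.D, and two WAR pairs: on F8 between ADD.D and SUB.D, and on F6 between S.D and MUL.D.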
Now, if you want to avoid stalls, you may want to schedule the SUB.D instruction before ADD.D, but that cannot be done because it would lead to a WAR hazard on F8. Similarly, if MUL.D is scheduled before ADD.D, it leads to a WAW (write-after-write) hazard on F6. So because of these name dependences the instructions cannot be rescheduled, but register renaming can be used to overcome the problem: WAR and WAW hazards can be removed by register renaming. Let me illustrate this with the help of this simple example. The original code sequence is DIV.D F0, F2, F4; ADD.D F6, F0, F8; S.D F6, 0(R1); SUB.D F8, F10, F14; MUL.D F6, F10, F8. Now let us do register renaming. Let us assume that we have two temporary registers, S and T, which we shall use for renaming. DIV.D F0, F2, F4 stays as it is. In ADD.D, instead of writing into F6 we write into the temporary register S, so it becomes ADD.D S, F0, F8, and the next instruction, the store, also uses S: S.D S, 0(R1). Then SUB.D writes into T instead of F8: SUB.D T, F10, F14. Finally, in MUL.D the source F8 is replaced by T: MUL.D F6, F10, T. Now there are no output dependences and no anti-dependences. Of course the true data dependences are still present. So register renaming can be used to overcome WAW and WAR hazards. If we do this in software, we require a large number of registers. How it is done by hardware in Tomasulo's approach we shall see later. Now let me compare Tomasulo's algorithm with scoreboarding. It shares many ideas with the scoreboarding scheme: it combines the key elements of scoreboarding with the introduction of register renaming.
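The renaming idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `rename`, `CODE` and the temporaries `T0`, `T1`, ... are hypothetical, and unlike the hand-worked example, which renames only two destinations, the sketch gives every register write a fresh name, which is closer in spirit to what the hardware does.

```python
from itertools import count

# Hypothetical encoding: (opcode, destination or None, source registers).
CODE = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

def rename(code):
    """Give every register write a fresh temporary and redirect later reads to it."""
    fresh = (f"T{n}" for n in count())
    current = {}                      # architectural register -> newest temporary
    renamed = []
    for op, dest, srcs in code:
        # Sources are looked up first, so an instruction reads the old mapping
        # even when it overwrites one of its own source registers.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

if __name__ == "__main__":
    for op, dest, srcs in rename(CODE):
        print(op, dest, srcs)
```

After renaming, no destination name is written twice and no instruction writes a name that an earlier instruction reads, so only the true dependences (through T0, T1 and T2) remain.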
Register renaming by hardware is the additional new feature, and it is achieved with the help of hardware structures known as reservation stations. In the reservation stations the operands are buffered. Instead of being written into a register or a temporary register, the operand values are held in the reservation stations, which buffer the operands of instructions waiting to issue, and from there they are supplied to the execution unit. Reservation stations are provided for each functional unit, so operand values can be buffered without the need for additional registers in the processor. Hazard detection and execution control are distributed. We have seen that in the scoreboarding approach there was a single scoreboard, which detects hazards: if there is a structural hazard the instruction is not issued, and if there is a data hazard the instruction waits until its operands are available. Here the same thing is done in a distributed manner: instead of a single scoreboard there are several reservation stations, and they perform the hazard detection and execution control functions between them. Results are passed directly to the functional units from the reservation stations rather than going through the registers. In scoreboarding the results were first written into the registers and the execution units read them from the registers; here the reservation stations hold the values and pass them directly to the execution units. These are the main differences. Another very interesting feature is a common result bus, known as the common data bus (CDB), which allows all units waiting for an operand to be loaded simultaneously.
What can happen is that a particular functional unit produces a result that is needed by several other functional units; that is, the result produced by one instruction may be used by several other instructions. Here the execution unit places its result on the common data bus, and since the result is available on a common bus, all the units waiting for that operand get it simultaneously and their execution can start. This is another difference between Tomasulo's algorithm and scoreboarding. So these are the key differences between the scoreboarding approach and Tomasulo's algorithm. Now here is the structure of the hardware required to implement Tomasulo's approach. Instructions are fetched by the instruction fetch unit and placed in the instruction queue, and from the instruction queue they are issued to multiple functional units. Here you have the memory unit, several floating-point adders and several floating-point multipliers. The floating-point adders and multipliers get their inputs not from the registers but from the reservation stations. The reservation stations hold more than just values, and I shall describe the data structure they use shortly. You will see that the operand values are stored in the reservation stations and are supplied directly to the functional unit, and execution starts as soon as both operands are available. Loads and stores are handled a little differently: a load operation can be performed as soon as its address is available.
The address unit initially takes the value given in the instruction and computes the actual (effective) address from it, and that address is held in the load buffer or the store buffer. For a load, as soon as the address is available it is sent to the memory unit, which reads the data from memory and places it on the common data bus. From the common data bus the data goes not only to the register into which it has to be written but also to all the reservation stations that are waiting for it, and they capture it in their own operand fields. Similarly, whenever an operation is performed by an adder or a multiplier, the output is placed on the common data bus, and from there it is written into the registers and also goes to the reservation stations, where it is buffered if a subsequent instruction needs it. That is how it goes on. As far as the store operation is concerned, you require not only the address but also the data. The store buffer holds the address as well as the data, which arrives over the common data bus. As soon as the data is available it is stored in the buffer, and when both address and data are available the value is written to the appropriate memory location by the memory unit. That is how store and load differ. But as far as execution is concerned, loads and stores are treated just like any other instruction, that is, like ALU instructions performing addition, multiplication and so on.
So you do not have to treat loads and stores differently, except that their information comes not from the reservation stations but from separate buffers. There are buffers for the address and, in the case of a store, for the data, from which the memory unit gets the address, or the address and the data, and for a store the write takes place as soon as both values are available. So this is the structure required to implement Tomasulo's scheme. Now let me highlight some of the important features of the different units. First, reservation stations: the single-entry buffer at the head of each functional unit has been replaced by a multiple-entry buffer. In scoreboarding there was a single-entry buffer for each functional unit, and here there are multiple entries, which are held in the reservation stations. Second, the common data bus connects the outputs of the functional units to the reservation stations as well as to the registers. That means writing takes place in parallel into the register and into the buffers of the reservation stations from the common data bus, and the result is channelled to whichever reservation stations are waiting for it. To do that, a suitable data structure is maintained, which I shall discuss a little later. Third, there are register tags: a tag corresponds to the reservation station entry number of the instruction producing the result. That is, the information about which reservation station will produce a result to be used by subsequent instructions is maintained with the help of register tags. The basic idea is that an instruction waits in the reservation station until its operands become available. So the basic philosophy is the same as that of scoreboarding, like a dataflow machine.
As soon as the data is available, execute. This definitely leads to out-of-order execution and, obviously, out-of-order completion. Since an instruction waits for its operands to become available, read-after-write (RAW) hazards are overcome: execution starts only when the operands are available. A reservation station fetches and buffers an operand as soon as it is available. We have seen that the outputs of the different functional units go directly to the reservation stations, and this eliminates the need to get operands from the registers, which is what was done in the scoreboard. So instead of a centralized control and buffer, the control and buffers are distributed among the functional units in the form of reservation stations. The reservation stations can be considered part of the functional units: each functional unit is associated with reservation stations in which operand values are stored, and as soon as operands are available they are transferred to the reservation stations. They store the operands of issued but pending instructions. As soon as an instruction is issued, a reservation station of the corresponding functional unit is allocated to it, and from then on the reservation station keeps track of when the results are generated by the instructions that produce the operands it needs. Registers in the instructions are replaced by values or by pointers to reservation stations.
As I have already mentioned, we do not go through the registers: the values obtained by a computation are stored in the reservation stations, and pointers have to be kept so that the values are used appropriately. This achieves register renaming, as I have said, and it avoids WAR and WAW hazards without stalling. So register renaming is done without the need for additional architectural registers, and that is the key advantage. There are more reservation stations than registers, because each functional unit has reservation stations and each reservation station has the storage needed to hold its operands. So the hardware can do optimizations that compilers cannot. What the compiler cannot do can be done by the hardware with the help of these reservation stations, because at run time the reservation stations obtain data values that cannot be predicted or obtained at compile time. Results are passed to the functional units from the reservation stations, not through the registers, and this is similar to forwarding. Before the values are written into the registers, they are available in the reservation stations as soon as a functional unit generates them, so the purpose of forwarding is served without separate forwarding hardware. This happens over the common data bus, which broadcasts results to all functional units. We have already seen the structure of the hardware, in which results are broadcast through the common data bus and go to all the registers and reservation stations that need them.
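The broadcast over the common data bus can be pictured with a small Python sketch. This is a rough model under stated assumptions, not the 360/91 hardware: the class `Station`, the function `broadcast`, and the field names (a tag that names the producing unit, and a value slot for each of two operands) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    """Hypothetical reservation-station entry with two operand slots."""
    name: str
    qj: Optional[str] = None     # tag of the unit that will produce operand 1 (None: value present)
    qk: Optional[str] = None     # tag of the unit that will produce operand 2
    vj: Optional[float] = None   # value of operand 1 once known
    vk: Optional[float] = None   # value of operand 2 once known

    def ready(self) -> bool:
        return self.qj is None and self.qk is None

def broadcast(tag, value, stations, reg_tag, reg_val):
    """Place (tag, value) on the bus: every waiting station and register captures it."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, pending in reg_tag.items():
        if pending == tag:       # this register is still waiting for exactly this producer
            reg_val[reg] = value
            reg_tag[reg] = None

if __name__ == "__main__":
    stations = [Station("Mult1", qj="Load2", vk=2.5), Station("Add1", vj=1.0, qk="Load2")]
    reg_tag, reg_val = {"F2": "Load2"}, {"F2": 0.0}
    broadcast("Load2", 7.0, stations, reg_tag, reg_val)
    print([s.ready() for s in stations], reg_val)
```

A single call delivers the loaded value to both waiting stations and to the register at once, which is the forwarding effect described above: the consumers never read the value back from the register file.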
As I have already mentioned, loads and stores are treated as functional units with reservation stations as well. Integer instructions can go past branches; this I shall discuss in a subsequent lecture. We have seen that in scoreboarding instruction scheduling is restricted to a basic block, that is, it cannot go beyond branches. With Tomasulo's algorithm we shall see that it can go beyond the basic block, and as a consequence the number of instructions available in the instruction window is larger than in scoreboarding. Now let us see the different stages of Tomasulo's algorithm. It has three stages. The first stage is the issue stage. It gets an instruction from the instruction queue, so it serves the function of a dispatch unit, and it issues the instruction only if a matching reservation station is free. This is how structural hazards are handled: a reservation station must be available for the corresponding functional unit. You have multiple functional units, but their reservation stations may all be busy, and then the instruction that has been fetched is not issued, because there is a structural hazard. So an instruction is issued only if a matching reservation station is available. The operands are sent to the reservation station if they are in the registers. In most cases, however, the operands are taken directly from the output of a functional unit. If the operands are not yet in the registers, the reservation station keeps track of the functional units that will produce them. This is, in effect, renaming, and it overcomes WAR and WAW hazards.
The second stage is the execute stage, which operates on the operands. When both operands are ready, the instruction executes; if they are not ready, the reservation station monitors the common data bus for the result. As soon as the operand appears on the common data bus it is captured, and then execution can start. So this step checks for read-after-write hazards. Now several instructions may become ready at the same time for the same functional unit; that is, more than one instruction may have its operands available. In such a case the choice among them is arbitrary: for the floating-point reservation stations the choice is made arbitrarily. Loads and stores, however, take two steps, as I have already discussed. Loads in the load buffer execute as soon as the memory unit is available; stores have to wait for the value as well as for the address. To preserve exception behavior, no instruction is allowed to initiate execution until all branches that precede it in program order have completed. So exception behavior is also maintained by this approach. That is, an instruction after a branch may be issued and may wait in its reservation station, but it does not begin execution, and so cannot write into the registers, until all the branches that precede it in program order have completed. In this way program order is respected wherever there are branches in between, and that is why the instruction window can extend beyond branches. The third stage of Tomasulo's algorithm is write result. When execution finishes, this is the write-back: the result is written on the CDB and from there into the registers and into the reservation stations waiting for it.
The reservation station is then marked available. Writing is done by placing the data on the common data bus, from where it goes to the appropriate register as well as to the reservation stations that are waiting for it. Stores are buffered in the store buffer until both the value to be stored and the store address are available, and then the result is written as soon as the memory unit is free. So the store operation is also performed in this write-back stage, and it requires two steps because both the address and the value to be stored are needed. Now, the data structure has three parts. Number one is the instruction status: an instruction can be in one of three stages, issue, execute or write result, and which of the three steps the instruction is in is recorded. Number two is the reservation stations. Each reservation station has six fields. Op is the operation to perform on the source operands S1 and S2. Qj and Qk are the reservation stations producing the source operands. You can see that a reservation station keeps track of which other reservation stations are producing its operands, and as a consequence, as soon as the operands are available they can be transferred into the proper fields of the reservation station. Vj and Vk are the values of the source operands: the values generated by the functional units or by loads are stored in Vj and Vk. And Busy indicates that this reservation station and its accompanying functional unit are occupied.
This helps to detect structural hazards by looking at the busy bit. The third part is the register status. The register file and the store buffer each have a field Qi. If Qi is blank, no currently active instruction is computing a result destined for this register. That is, the register status records which issued instruction will produce the result to be stored in a register, and if the field is blank, none of the instructions issued so far is producing a result for that register. These are the fields that are maintained to keep track of the status of the processor. In addition, the load and store buffers each have a field A, which holds the effective address once the first step of execution has been completed. Initially the load or store buffer takes the address information from the instruction, that is, the offset. Then the effective address is calculated by adding the content of the base register to the offset given in the instruction, and that effective address is stored in the field A of the load or store buffer.
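The fields just described, and the way the issue stage fills them, can be put together in a short Python sketch. It is a simplified illustration under stated assumptions: the names `RS`, `operand` and `issue` are hypothetical, the register status is modelled as a dictionary `qi` from register name to the producing station's name, and an absent entry stands for a blank Qi.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RS:
    """One reservation station with the six fields named in the lecture."""
    name: str
    kind: str                    # which functional unit it feeds, e.g. "add" or "mult"
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None
    qk: Optional[str] = None

def operand(src, qi, regs):
    """Return (value, tag): the register value if it is current, else the producer's tag."""
    tag = qi.get(src)
    return (regs[src], None) if tag is None else (None, tag)

def issue(op, kind, dest, src1, src2, stations, qi, regs):
    """Issue one instruction; return the station name, or None on a structural hazard."""
    rs = next((s for s in stations if s.kind == kind and not s.busy), None)
    if rs is None:
        return None              # no free matching station: the instruction must wait
    rs.busy, rs.op = True, op
    rs.vj, rs.qj = operand(src1, qi, regs)
    rs.vk, rs.qk = operand(src2, qi, regs)
    qi[dest] = rs.name           # later readers of dest now wait on this station (renaming)
    return rs.name

if __name__ == "__main__":
    stations = [RS("Mult1", "mult"), RS("Mult2", "mult")]
    qi, regs = {"F2": "Load2"}, {"F2": 0.0, "F4": 2.5}
    print(issue("MUL.D", "mult", "F0", "F2", "F4", stations, qi, regs), stations[0])
```

In the usage example the multiply finds F4 in the register file but records the tag Load2 in place of F2, and the register status of F0 now points at Mult1, so the register names have been replaced by values and station names.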
Let me illustrate Tomasulo's algorithm with the help of an example. Here a clock cycle counter is maintained, and we shall start with clock 0. There are three load buffers, Load1, Load2 and Load3. This is the instruction stream, that is, the instruction window on which Tomasulo's algorithm will work; these are the instructions to be executed. Here the functional unit countdown is stored. Different functional units take different amounts of time: floating-point addition requires 2 clock cycles, multiplication requires 10 clock cycles, and division requires still more. That information is stored in the time field, so initially it has a value such as 2 or 10, and then it is decremented every cycle. These are the reservation stations for the different functional units. There are five of them: three reservation stations for the floating-point adder and two for the floating-point multiplier/divider. As we have seen, Tomasulo's algorithm was primarily developed to enhance the performance of the floating-point unit, which is why only the floating-point units are shown. Now we start the clock. This is clock cycle 1, and instruction 1 has been issued. It is a load instruction, so the Load1 unit has become busy, and the address to be calculated is 34 plus the content of R2.
That is, register R2 holds a value, and 34 has to be added to it to get the effective address. In the next cycle instruction 2 is issued, which is also a load instruction. Fortunately we have another load unit, Load2, which will perform that load; it becomes busy, and the corresponding effective address will be calculated here. You can also see the register status: F6 and F2 are the registers into which values will be loaded, Load1 will load floating-point register F6 and Load2 will load floating-point register F2, and this is recorded as well. Right now none of the reservation stations is busy, because only load instructions have been issued so far and no arithmetic instructions. Now suppose there were several load instructions in a row, say 4 or 5. In that case only 3 load instructions could be issued, and the next one would have to wait because of a structural hazard. But in this code sequence there are only 2 load instructions, so there is no problem. Now the load will be executed, that is, the value will be read from memory so that it can be loaded into the register; this takes 2 clock cycles. In the third clock cycle the MUL.D instruction has also been issued. Here F0 is the destination register, which is reflected in the register status, and the corresponding reservation station becomes busy: Mult1 is now busy. The operation to be performed is recorded here, the operand that is already available is recorded here, and the other operand, F2, will come from Load2. So you see, the entry records from which unit the data will become available, and that unit will deliver the value directly into this field. The register names are removed, that is, renamed, by the reservation stations.
So you can see that the register names are no longer there. Load1 is completing now, and its value will be loaded into register F6. It completes, and you can see that the write into register F6 has taken place. The fourth instruction has also been issued in the fourth cycle, and the second load instruction is in the execution stage, which is recorded here; its effective address has been calculated so that it can execute. You can also see that the register status records, for each destination register, the functional unit from which its value will come. Load2 is completing now, and it will write into register F2. The value M(A2), that is, the content of memory at address A2, is read from memory and written into the register. But it is not only written into the register; it is also written into the reservation stations. The value M(A2) for F2, which is needed here and also here, is written directly into the fields Vj and Vk, because Vj and Vk hold the operand values. So after being read from memory, the values are written directly into Vj and Vk. Both operand values are now available for Add1, so this instruction can start, and the time required to execute it is 2 cycles. So 2 is recorded in the time field, and it will be decremented as we go to the next cycle. Similarly, the floating-point multiplication requires 10 clock cycles, and it now has both operands as well. So both will start execution in the next clock cycle: the timer starts counting down for Add1 and Mult1, as I have said.
So, in the next clock cycle 2 has become 1 and 10 has become 9; execution has started for those two instructions, and all the instructions have now been issued. As we have reached clock cycle 6, all the instructions have been issued, because we have the necessary functional units and reservation stations. Since there is no structural hazard, all the instructions have been issued, but you can see that different instructions are in different stages. Instructions 1 and 2 have been completed, the remaining four instructions have been issued, and execution is proceeding for instruction 4 and for the multiplication, that is, instruction 3. So, the last instruction is issued here despite a name dependency. There is a name dependency, but in spite of that it has been issued. And now the execution of instruction 4 has been completed: as shown here, the timer value has become 0. The operands were available, and it will write into the appropriate register, that is, F8. So, Add1 will write into F8, as shown here; this instruction is completing. Now we have reached clock cycle 8, and the first, second and fourth instructions have been completed. So, you see, although instruction issue took place in order, execution has been done out of order, and completion has also taken place out of order, as for instruction 4. Now the other instructions will proceed. For example, the ADDD, that is, the sixth instruction, now has its operands, so it will be in the execution stage and will require 2 clock cycles, while the multiplication will require 7 more clock cycles. And here the execution of this instruction is complete, as you can see the timer is 0. So, execution has been completed by taking the operands from Vj and Vk.
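The countdown shown on the slides can be mimicked with a small sketch. It models only the two timers visible at clock cycle 5 (2 for Add1 and 10 for Mult1); the function name `tick` and the dictionary layout are hypothetical.

```python
def tick(timers):
    """Advance one clock cycle; return the stations whose timer reaches 0."""
    finished = []
    for name, left in timers.items():
        if left > 0:
            timers[name] = left - 1
            if timers[name] == 0:
                finished.append(name)
    return finished


timers = {"Add1": 2, "Mult1": 10}  # values shown at clock cycle 5
trace = {}
for cycle in range(6, 9):          # clock cycles 6, 7 and 8
    finished = tick(timers)
    trace[cycle] = (dict(timers), finished)
    print(cycle, timers, finished)
```

In this sketch Add1 reaches 0 in cycle 7, while Mult1, which was issued earlier, still shows 7 at cycle 8. That is the out-of-order completion described above.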
So, Vj and Vk have provided the operand values, this operation has been performed, and the result will be written into the F6 register in the next cycle. In the next cycle the writing has taken place, and now the only instruction in execution is the multiplication. The division has not yet started, because it has to get both operands; only then can the division start. One operand is yet to come: as you can see, Vj is not yet available, so it is waiting for that operand. The multiplication continues, and now the multiplication is completing; it will write the result in the next cycle. So, the multiplication is also complete, and the only instruction left is the divide, DIVD, which will require 40 clock cycles. Fortunately both operands are now available, so it will execute. We can skip a number of cycles and come to clock cycle 55. In clock cycle 55 we can see that the division will require one more clock cycle to complete. So, it will complete in one more clock cycle. Now execution is complete, and it will write into the appropriate register, that is, F10; Mult2 will write it into F10. So, you can see the execution of all these instructions has been completed. Issue has taken place in order, as you can see, 1, 2, 3, 4, 5, 6, but execution has taken place out of order and completion has also taken place out of order. Tomasulo's algorithm has overcome the hazards that can arise from this, that is, the WAW and WAR hazards, and we have seen how it has been done with the help of suitable hardware. So: in-order issue, out-of-order execution and out-of-order completion.
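The whole walkthrough can be cross-checked with a rough timing sketch. It is a simplification under several assumptions, not a full model of the hardware: the operand registers are assumed from the usual textbook form of this example, one instruction issues per cycle with no structural stall, there is no contention on the common data bus, execution starts in the cycle after the last operand is written (or after issue), and the result is written in the cycle after execution completes. The latencies are the ones quoted in the lecture (2, 2, 10 and 40 cycles); the names `schedule` and `PROGRAM` are hypothetical.

```python
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (operation, destination, floating-point sources); operands are assumed
PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program):
    """Return issue, execution-complete and write cycles for each instruction."""
    producer = {}  # register -> index of the latest issued instruction writing it
    rows = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1                    # in-order issue, one per cycle
        operands_ready = issue
        for reg in srcs:
            if reg in producer:          # wait for that producer's broadcast
                operands_ready = max(operands_ready, rows[producer[reg]]["write"])
        complete = operands_ready + LATENCY[op]
        rows.append({"n": i + 1, "op": op, "issue": issue,
                     "complete": complete, "write": complete + 1})
        producer[dest] = i               # later readers of dest wait on this one
    return rows


rows = schedule(PROGRAM)
for r in rows:
    print(r)
print([r["n"] for r in sorted(rows, key=lambda r: r["write"])])  # [1, 2, 4, 6, 3, 5]
```

Under these assumptions the division finishes executing in cycle 56, which agrees with the one remaining cycle seen at clock cycle 55. The division also takes F6 from the first load and not from the later ADDD, because the ADDD had not yet been issued when the division recorded its producers; that is how renaming avoids the WAR hazard here.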
So, these are the two major advantages. First, the distribution of the hazard detection logic, which, as we have seen, arises from the use of the distributed reservation stations and the common data bus: a single result can be broadcast to multiple functional units. Second, the elimination of WAW and WAR hazards by renaming registers with the help of the reservation stations. So, renaming has been done without additional registers, with the help of the reservation stations, by writing the values into the operand fields of the stations. And here is a comparison between Tomasulo's algorithm and the scoreboard. In the CDC 6600 multiple functional units were used; in the IBM 360/91 pipelined functional units have been used, with 6 load buffers, 3 store buffers, 3 adder stations, 2 multiplier/divider stations and so on. The window size, you can see, is larger than in the scoreboard approach: here it is about 14 instructions, in place of about 5 instructions in the case of the CDC 6600. An instruction is not issued on a structural hazard, and since issue takes place in order, that takes care of structural hazards. WAR and WAW hazards are avoided by renaming, whereas in the case of the CDC 6600, in the scoreboard approach, we have seen that they lead to stalls. Here the result is broadcast from the functional units, whereas in the CDC 6600 the results were written into the registers and then read from the registers. And in Tomasulo's algorithm control is distributed with the help of multiple reservation stations, whereas in the CDC 6600 control is centralized with the help of the scoreboard. Of course, in this world nothing is one-sided; there are some drawbacks. Number one is complexity: because of the complexity there were a lot of delays in the 360/91, the MIPS 10000 and so on, and you require many associative stores and a large number of registers in your hardware.
Then, performance is limited by the common data bus. As we know, whenever a large number of devices are connected to a bus, the capacitance is large and the bus becomes slow. As a consequence there is a performance limitation because of the common data bus. You can go for multiple common data buses, but then more logic is needed at the functional units for the parallel associative stores; with a single common data bus, those are the limitations. Let us stop here today. Subsequently we shall see how Tomasulo's algorithm can be taken beyond this, I mean to loops, and before that we shall discuss control hazards in my subsequent lectures. Thank you.