Welcome to today's lecture, "In Quest of Higher ILP". In the last lecture we started our discussion on this topic: we saw how we can achieve higher instruction-level parallelism and what machine organization facilitates that. In this lecture we shall continue on that topic, and before I proceed, here are some important parameters you should know. Number one is the instruction pipeline cycle. This is essentially the cycle time: we have seen that the machine is controlled by a clock, and when you pipeline, the clock frequency increases and the cycle time reduces. That reduced clock period is the instruction pipeline cycle. On the other hand, a machine cycle may take, let us assume, four such clock cycles; in other words, if the pipeline has four stages, then stages 1, 2, 3, 4 together make up the machine cycle, or rather the instruction cycle, while the pipeline cycle is the time of one clock. So this parameter you should remember. Second is the instruction issue latency (IL). You are issuing instructions one after the other, and the time that must elapse between issuing one instruction and issuing the next is the instruction issue latency. That means you cannot issue all the instructions simultaneously; there will be some delay between successive issues, and that delay is the instruction issue latency, abbreviated IL. Third is the instruction issue parallelism (IP): the number of instructions that can be issued simultaneously, that is, concurrently. And the fourth parameter is the simple operation latency (OL): the processor performs operations as it executes instructions, and the operation latency is essentially the time required to perform the operation of a particular stage.
The machine parallelism tells you the number of instructions in flight simultaneously at a particular instant of time. For example, if you have a K-stage pipeline, then when the pipeline is full, K instructions will be in flight simultaneously as they get executed. So the machine parallelism of a simple pipelined processor is equal to K. Now, whenever we discuss higher and higher ILP, we require a reference. Initially, when we discussed the pipelined processor, our reference was the non-pipelined processor, and with respect to it we derived parameters like speedup, throughput and so on. Since we shall now be concerned with pushing instruction-level parallelism beyond 1, our reference point will be the baseline pipelined processor, or machine, which I discussed in my previous lectures. This baseline pipelined processor obviously exploits temporal parallelism, as we have already seen, and by doing so it achieves instruction-level parallelism. We also call it a scalar pipeline because it issues only one instruction at a time. And we restrict our discussion to RISC processors, which is why it is called the baseline scalar pipelined RISC processor. For it, the issue parallelism IP is equal to 1, and the operation latency is also 1, because, as you can see here, it generates an output at intervals of one clock cycle. And the machine parallelism, as I have already told you, is equal to k.
K means, as you can see here, since this particular pipelined processor has got 4 stages, a maximum of 4 instructions will be simultaneously in flight during execution at any particular instant of time. Of course, they may be in different stages of execution, but the maximum number that can be present in the processor during execution is restricted to k, where k is the number of pipeline stages. And we have already seen the peak IPC, instructions per cycle, which is just the reciprocal of CPI, cycles per instruction; the peak IPC is equal to 1. This we shall use as our reference. There was a paper published by Jouppi in which he classified various ILP machines, and this is the first entry in that classification, the baseline scalar processor, with the parameters shown. Now, we have discussed that scalar pipelines have many restrictions, and the first one is that the maximum throughput is bounded by one instruction per cycle: IPC is at most 1, or CPI is at least 1. So what is the solution? How can we achieve better performance, that is, IPC greater than 1 or CPI less than 1? The obvious solution is to make the pipeline wider, which is known as superscalar. By doing that, your instructions per cycle can exceed 1 and your CPI can drop below 1. And we have seen that there are two paths in this direction. I already mentioned them in my last lecture: one is VLIW, very long instruction word, which we discussed in detail, where the compiler has complete responsibility for selecting the set of instructions to be executed concurrently.
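As a quick sanity check on these baseline parameters, here is a small sketch, with illustrative numbers of my own choosing, of the execution time k + N − 1 of a k-stage scalar pipeline; it shows the IPC approaching its peak value of 1 as the instruction count N grows.

```python
# Illustrative sketch: cycles needed by a k-stage baseline scalar
# pipeline to finish N instructions, and the resulting IPC, which
# approaches the peak value of 1 as N grows large.

def baseline_cycles(k, N):
    """k-stage scalar pipeline: k cycles to fill, then 1 result per cycle."""
    return k + N - 1

for N in (4, 100, 10_000):
    cycles = baseline_cycles(k=4, N=N)
    print(N, cycles, round(N / cycles, 4))  # IPC -> 1 as N -> infinity
```

With k = 4 and N = 4 the pipeline needs 7 cycles (IPC about 0.57), while at N = 10,000 the IPC is already 0.9997, which is why we treat 1 as the peak IPC of the baseline machine.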
So the compiler identifies instructions which can be executed in parallel, and then several instructions are issued concurrently; this of course requires multiple functional units in the processor. For the VLIW processor we have seen that the hardware requirement is simple, because it does not need complex issue hardware. However, it requires a smart compiler, what is known as an optimizing compiler. The compiler has to analyze the instructions: after they are fetched from memory they are kept in a buffer and analyzed to identify which of them can be issued in parallel, and then the compiler bundles several instructions together. We have already discussed this in detail in the last lecture. The other alternative is the superscalar processor, which we shall discuss shortly. Based on the classification done by Jouppi, the VLIW machine has an issue parallelism of M instructions per cycle: it can issue up to M instructions per cycle, although the actual number will depend on the parallelism available in the program. You cannot really fill all the fields, and as a result the parallelism actually achieved will be less than M. Here also the operation latency is equal to one cycle, because as you can see, the outputs are generated one cycle apart. So the issue parallelism is M instructions per cycle, where M is the maximum value, the operation latency is 1, and the machine parallelism is M into K, since M is the number of instructions that can be issued in parallel. If all the fields are filled up, then, with a pipeline depth of 4 as you can see in this example, you can have M into K instructions in flight simultaneously.
So with M equal to 4 and K equal to 4 in this particular case, 4 into 4 is 16: there is the possibility of 16 instructions being in flight, executing in parallel. That is why the machine parallelism is M into K, and the peak IPC that is possible is M instructions per cycle, or one VLIW instruction per cycle, since one VLIW instruction comprises M operations, M being the number of fields in the instruction word. Now, we have seen that VLIW processors have some drawbacks. Number one, a large number of registers is needed in order to keep the functional units active: registers are needed to hold operands and results, and whenever you pick, say, 4 instructions to be executed in parallel, you obviously require a large number of registers in the processor. So unless the processor has a large register file, VLIW is not feasible. Second, a large data-transport capacity is needed between the functional units and the register file, and between the register file and memory. You can see that the instruction is quite long; it has to be fetched from memory and loaded into the instruction register, and the instruction register then drives the execution units, which have to fetch their operands from the registers. You have to read operands from many registers at once, because you have to feed several functional units simultaneously. So large data-transport capacity, or bandwidth, is required between the functional units and the register file, and between the register file and memory. Third, high bandwidth is needed between the instruction cache and the fetch unit. As you know, nowadays all processors use a cache memory to store the program.
So instructions are stored in the instruction cache and data is stored in the data cache; later on we shall discuss cache memory in detail. Obviously you will require high bandwidth between the instruction cache and the fetch unit. Why is it required? Because one instruction with 5 operations, each of 32 bits, will require 160 bits per instruction. This example shows why you need high bandwidth between the instruction cache and the fetch unit. There are other drawbacks too: large code size, partly because of unused operations and wasted bits in the instruction words. Although we use a long instruction word, a large number of its fields often go unused, and primarily because of this you get large code size. Another very important aspect is incompatibility of binary code. Whenever a next-generation processor is introduced, it is expected to be backward compatible, that is, able to run programs written for the earlier generation. We have seen that you can execute an 80386 program on an 80486 machine, and an 80486 program on a Pentium. But in the VLIW case there is a problem. The problem arises because, for example, if additional functional units are introduced in a new version of the processor, then the number of operations that can execute in parallel increases. Say earlier you had four functional units, and so the VLIW instruction had four fields; now a new machine has been introduced with five functional units.
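The 160-bit figure above follows from simple arithmetic. As a back-of-the-envelope check, this sketch also computes the fetch bandwidth implied by issuing one such word per cycle; the 1 GHz clock is an assumption of mine for illustration, not a figure from the lecture.

```python
# Back-of-the-envelope check of the bandwidth argument: a VLIW word
# bundling 5 operations of 32 bits each is 160 bits wide, so fetching
# one word per cycle at an assumed 1 GHz clock needs 20 GB/s from the
# instruction cache.

ops_per_word = 5
bits_per_op = 32
word_bits = ops_per_word * bits_per_op          # 160 bits per VLIW word
clock_hz = 1_000_000_000                        # assumed 1 GHz clock
fetch_bytes_per_s = word_bits / 8 * clock_hz    # bytes/second from I-cache
print(word_bits, fetch_bytes_per_s / 1e9)       # 160 bits, 20.0 GB/s
```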
So the instruction format changes, and as a consequence a program compiled for the old machine cannot be executed on the new machine: backward compatibility is lost. This is what is referred to as incompatibility of binary code. Because of these limitations, VLIW processors, which I have discussed in detail for the sake of completeness, have not been very successful commercially, although commercial VLIW processors have been implemented, for example Transmeta's Crusoe processors, attractive because of their low power. The commercially successful processors are superscalar. But before I discuss superscalar and its variations, let me touch upon one important aspect: the limits on ILP. All the processors we shall be discussing rely heavily on the instruction-level parallelism that is available. If the instruction-level parallelism present in programs is small, then obviously there is no gain in implementing a processor with a wide instruction word, or a processor which can issue many instructions simultaneously. A lot of simulation studies have been carried out by researchers to identify the limits on instruction-level parallelism, and there are two extreme observations. Number one, due to Flynn, is known as Flynn's bottleneck: it says that the instruction-level parallelism available within a basic block is always less than 2, with measured values like 1.8 and 1.96. That means if you do not apply specialized transformations like loop unrolling or software pipelining, which I have discussed, and simply restrict yourself to the basic blocks as they appear in a program, then the instruction-level parallelism available is less than 2. However, contradictory results were published by other researchers. For example, Fisher, along with his colleagues, published a paper back in 1984 claiming that the available instruction-level parallelism is 90.
So there is a big gap between 2 and 90. What Fisher did was identify programs involving essentially numeric processing, and there he found instruction-level parallelism of about 90; in fact, this was later referred to as Fisher's optimism. So you can see there is a pessimistic view, that the available instruction-level parallelism is about 2, and on the other hand an optimistic view which says it is much larger. Subsequent researchers have confirmed that 90 is really too big, but definitely more than 2 is possible: values like 3, 5.8, 6 and 7 have been reported. Obviously people are interested in exploiting whatever instruction-level parallelism is available in programs, and, if necessary, special techniques like loop unrolling and software pipelining are to be incorporated to increase it. With this background, let us discuss the motivation for the superscalar processor: why is it proposed, or required? Consider the vectorizability of a program, that is, the fraction of a program which can be vectorized; vectorizability here stands for the parallelism present. It typically varies from about 0.4 to 0.8. Now suppose you have a simple pipelined processor with n stages; obviously the speedup will vary depending on the vectorizability, or parallelism, available in the program.
You can see that the speedup depends on the available parallelism: if n equals 4, that is, you have 4 stages, the maximum speedup is 4, and it drops sharply as the vectorizability parameter decreases. Similarly, for n equal to 6 stages the maximum possible speedup is 6, and this also drops rapidly as the vectorizability, which we may regard as the instruction-level parallelism present, decreases. But if we go for a superscalar processor, by which we mean a processor that can issue more than one instruction simultaneously, this is the corresponding curve. In this curve the minimum speedup is 2, and the maximum speedup can be 12 when the number of pipeline stages is 6. So for the same number of stages, if we increase the issue width from 1 to 2, then at the point corresponding to vectorizability f equal to 0.8 the speedup jumps from 3, which is what the conventional single-issue pipeline gives, to about 4.3 for n equal to 6 stages, f equal to 0.8, and m equal to 2, meaning two instructions are issued per cycle instead of one; m equal to 1 corresponds to the scalar pipelined processor. This simple observation tells you that it is essential to go for a superscalar processor to achieve higher speedup, and higher speedup means higher performance. Among superscalar processors, you can have the statically scheduled superscalar processor, which is one version, and later on we shall discuss the dynamically scheduled superscalar processor.
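The curves described above are consistent with an Amdahl's-law model, speedup = 1 / ((1 − f) + f/P), where f is the vectorizability and P the peak parallelism (pipeline depth n, times issue width m for the superscalar case). This is my reconstruction, not a formula stated in the lecture: at f = 0.8, n = 6 it reproduces the quoted speedup of 3 for the scalar pipeline, though for the two-issue machine it yields 3.75 rather than the 4.3 quoted, so the lecture's curve may come from a slightly different model.

```python
# Hedged sketch of the speedup-vs-vectorizability curves, assuming an
# Amdahl's-law model: fraction f of the work is parallelizable with
# peak parallelism P, the rest runs serially.

def speedup(f, peak):
    return 1.0 / ((1.0 - f) + f / peak)

print(round(speedup(0.8, 6), 2))    # scalar pipeline, n=6, f=0.8 -> 3.0
print(round(speedup(0.8, 12), 2))   # 2-issue superscalar, m*n=12 -> 3.75
```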
In the case of a statically scheduled superscalar processor you have multiple issue with in-order execution, while in the dynamically scheduled processor, which we shall discuss later, out-of-order execution is possible. So this is the superscalar proposal: go beyond the simple instruction pipeline to achieve more than one instruction per cycle by dispatching multiple instructions per cycle. That means, as the processor executes a program, it will issue more than one instruction at a time; it may be 2, 3 or 4. It dispatches multiple instructions per cycle and provides a more generally applicable form of concurrency. You see, vectorization, or vector processing, is a kind of parallelism where, when you execute a loop, the different iterations are independent, and since they are independent, a vector processor can execute different iterations simultaneously. But that kind of parallelism is present only when a program involves vector processing, and conventional programs may not always contain vectorizable loops. In such cases, how do we provide concurrency? That is what the superscalar processor does: it is geared for sequential code that is hard to parallelize otherwise, and it exploits fine-grained instruction-level parallelism. How this is done is shown with the help of this diagram. Here we are utilizing spatial parallelism: in conventional pipelining, as we have seen, we exploit temporal parallelism, so instructions were issued one after the other in time, at intervals of one clock cycle; here, instead, several instructions are issued simultaneously. So the issue parallelism is M instructions per cycle.
So this is the parallelism present in a superscalar processor: three instructions are issued here, in the next cycle another three instructions are issued, and in the cycle after that another three. The operation latency remains one cycle: the first results come out after k cycles, where k is the depth of the pipeline, and after each subsequent cycle you get another set of results. The machine parallelism is M into k: as I have already told you, the total number of instructions in flight simultaneously is M into k. And the peak IPC, the peak number of instructions per cycle that you can achieve, is M. That is the ideal case; when there is no stall, that is, no dependency, instructions can be issued in parallel up to the parallelism the processor can exploit, and we get the ideal speedup of M over the baseline pipeline. However, we cannot really achieve that with real-life practical programs, where there will be dependencies and, as a consequence, stalls; this particular diagram does not show any stalls. So the utilization factor will be less than one, the speedup will be M times some fraction, and the IPC achieved in practice will be less than M. Based on this, superscalar processors have been introduced commercially. Commercial desktop processors now do four or more issues per clock, and even in the embedded processor market, dual-issue superscalar pipelines are becoming common. We have already seen that the minimum instruction-level parallelism present in the basic blocks of a program, even without loop unrolling and other sophisticated transformations, is about two.
So a superscalar processor with an issue rate of two is quite common, and then of course you can go beyond two, to four or more. Here is an example showing two processors. The first one, (a), corresponds to the five-stage i486, the Intel 80486 processor, where no superscalar technique was used; it is a simple scalar pipeline with five stages: instruction fetch, decode stage 1, decode stage 2, execute, and write-back. Decoding was split into two stages because these processors use complex instructions, which makes decoding complicated; that is why the decoder was divided into decode stage 1 and decode stage 2, followed by the execution stage and the write-back stage. That was the 486 pipeline, a scalar pipeline. Then, when the Pentium was introduced, it had a parallel pipeline of width two: a superscalar processor of degree 2. As you can see, it fetches two instructions, decodes two instructions, and then issues two instructions, because it has two pipes, the U pipe and the V pipe, two separate pipelines through which two instructions can be issued and executed in parallel; after the D2 stage each pipe has its own execution and write-back stages. So this is the superscalar organization first introduced in the Pentium. Now let us focus on the performance you can achieve with superscalar execution. We have seen that a k-stage baseline pipelined processor executing N tasks requires k plus N minus 1 clock cycles. Now let us see the time required to execute the same program on the ideal k-stage, m-issue superscalar processor. Here, let us assume there are four stages in the pipeline and three instructions are issued simultaneously.
So three instructions are issued simultaneously, then in the next cycle another three, and in the cycle after that another three; in this way it continues. This is a superscalar processor of degree 3. Now, what is the time required to execute N instructions? For the sake of generality we shall consider degree m, although the degree shown in this example is 3. As you can see, after k clock cycles, k being the number of stages, the results of the first m instructions will be available; then in each subsequent clock cycle you get the results of another m instructions. So the first m instructions require k clock cycles, and you are left with N minus m instructions, whose results are produced at m per cycle, taking (N − m)/m further cycles. So the total time required is T(m,1) = k + (N − m)/m; this is the execution time of the superscalar processor. Now, what is the speedup of this processor with respect to our baseline processor? For the baseline pipelined processor the time required was T(1,1) = k + N − 1. So the speedup is S(m,1) = T(1,1)/T(m,1) = (k + N − 1)/(k + (N − m)/m) = m(k + N − 1)/(mk + N − m). This is the speedup you get with the superscalar processor, as shown here.
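The derivation above can be checked numerically. This sketch uses the same formulas with the lecture's example values k = 4 and m = 3 (the instruction counts are illustrative), and shows the speedup approaching its limit of m as N grows.

```python
# Superscalar timing from the lecture: a k-stage, m-issue pipeline
# finishes N instructions in T(m,1) = k + (N - m)/m cycles, so its
# speedup over the baseline scalar pipeline (T(1,1) = k + N - 1) is
# S(m,1) = m*(k + N - 1) / (m*k + N - m), which tends to m for large N.

def t_superscalar(k, m, N):
    return k + (N - m) / m

def speedup_superscalar(k, m, N):
    return (k + N - 1) / t_superscalar(k, m, N)

k, m = 4, 3
for N in (12, 300, 3_000_000):
    print(N, round(speedup_superscalar(k, m, N), 3))  # approaches m = 3
```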
Now, when you are executing a very large number of instructions, N tending to infinity, the speedup limit S(m,1) becomes equal to m. We have seen that for the pipelined processor the speedup was k with respect to the non-pipelined processor; in this particular case the speedup is m with respect to the baseline pipelined processor. So if we consider the speedup with respect to the original non-pipelined processor, the maximum speedup will be m into k, which is the machine parallelism, as expected; with respect to the baseline pipelined processor the speedup is m. Now, here is a comparison between the VLIW and the superscalar processor. In VLIW the compiler finds the parallelism; in superscalar the hardware finds the parallelism. So VLIW needs only simple hardware while superscalar requires more complex hardware. But one important point you should note is that VLIW can exploit less parallelism, because the parallelism is extracted by the compiler, and the compiler cannot identify all the parallelism present in the program at compile time. On the other hand, since the superscalar processor does it in hardware at run time, it can identify more instruction-level parallelism, and as a result superscalar gives better performance. Ideally, as we have seen, the speedup is m for both, but in VLIW fewer fields of the VLIW instruction will actually be filled, while in the superscalar processor more instructions will be executed in parallel because the parallelism is extracted by hardware. So superscalar gives you better performance.
Now let us look at another variation, the superpipeline, in which the machine cycle time is shorter than that of the baseline processor. We have discussed the simple scalar pipeline, VLIW, and superscalar; now we consider this further extension, known as the superpipelined processor. What is the basic idea behind it? It has been observed that the stages of a pipelined processor are not uniform in delay, because we are trying to accommodate different types of instructions which are not the same. As a consequence, there is the possibility of further dividing a particular stage of the pipeline. For example, suppose a processor originally had four stages; now each stage is further divided into sub-stages. You can say you have a major cycle, as defined by the original pipeline, and now minor cycles are introduced within each major cycle. What is the advantage you get? The advantage is that, in effect, the number of stages is increased. If you introduce n minor cycles per major cycle, the cycle time becomes 1/n of the baseline processor's. In this particular case I have shown n equal to 2, so the clock frequency doubles and the cycle time is halved. And what benefit do you get out of it? The benefit is that a new instruction is issued after every minor cycle: instead of waiting for a full major cycle, instructions are issued at minor-cycle intervals. As a consequence, the initial latency is the same as in the base pipelined processor; it will again take k base clock cycles, or k into n clock cycles in terms of this superpipelined processor's faster clock, for the first result to appear.
But subsequently you will get a result at intervals of 1/n of the cycle time of the base pipelined processor, so your throughput increases. And it has been found that a superscalar processor of degree m and a superpipelined processor of degree n give more or less the same performance when m and n are equal. But what is the difference between the two? The difference is that in the superpipelined case the clock frequency increases, the cycle time is reduced, and the processor has to run with a faster clock. The superpipelined processor is characterized by an operation latency of 1 base cycle, that is, n minor cycles; an issue latency IL of 1 minor cycle; and an issue parallelism IP of 1 instruction per minor cycle, or equivalently n instructions per base cycle. The machine parallelism here is again n into k, and in the superscalar processor we have seen it is m into k, so when m and n are the same, the parallelism available in both machines is identical; you are achieving the same performance using an altogether different approach. The superpipelined processor may be considered as a deeply pipelined processor with n into k stages. If you want to put it in simple terms, you may say it is nothing but a pipelined processor whose number of stages has been increased from k to n into k. Although this statement is correct, in practice there is some difference. What is the difference? We have seen that when we implement forwarding, results from the pipeline buffers are fed back to the functional units.
That means outputs are taken from the intermediate stage boundaries, but you cannot take them from within the minor cycles: the outputs produced inside the minor cycles are not accessible. So whenever you go to implement forwarding in hardware, which, as we have seen, is necessary to overcome data hazards, there you will feel the difference: you cannot treat the machine as a pipeline with n into k stages, you have to treat it as a pipeline with k stages, because the outputs of the minor-cycle sub-stages cannot be accessed for forwarding. There lies the difference. Some superpipelined processors have been designed: the CDC 6600 was one, the Cray-1 was also superpipelined, and the MIPS R4000 is a superpipelined processor with an 8-stage pipeline having 2 minor cycles per base cycle. Instruction fetch is divided into two stages, IF and IS, the first and second halves of instruction fetch; then the decode/register-read and execution portion is divided into register fetch (RF) and execution (EX); then data fetch is again divided into two stages, DF and DS; and similarly the final portion comprises tag check (TC), so named because you are reading from the cache memory, and write-back (WB). So you can consider it as an 8-stage pipeline, or as a 4-stage base pipeline with 2 minor cycles per stage. This is an example of superpipelined organization.
So, according to the classification of Jouppi, this is the superpipelined processor, where the cycle time is 1/n of that of the baseline processor, the issue parallelism is 1 instruction per minor cycle, the simple operation latency is n minor cycles (the result of an operation is generated after n minor cycles), and the peak IPC is n instructions per major cycle. The machine parallelism, as I have said, is n into k. In a similar way you can find out the performance of the superpipelined processor. The time required to execute N instructions is T(1, n) = k + (N − 1)/n base cycles: the first instruction takes k cycles to fill the pipeline, and thereafter one result appears every 1/n cycle. The speedup is S(1, n) = T(1, 1)/T(1, n) = (k + N − 1)/(k + (N − 1)/n) = n(k + N − 1)/(nk + N − 1). Whenever N is very large, S(1, n) approaches n, as expected. That means with respect to the base pipeline processor we get a speedup of n; however, whenever we consider the speedup with respect to a non-pipelined processor, we shall get a speedup of n into k. Here we are always comparing with respect to the pipeline processor, which is why the speedup limit is n, but that again is the ideal situation. Now we can extend the idea further and we can have a superpipelined superscalar organization.
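The speedup expression above can be checked numerically. This is a minimal sketch with illustrative values of k and n (assumed, not from the slides); it confirms that S(1, n) stays below n but approaches it as N grows:

```python
from fractions import Fraction  # exact arithmetic, no rounding error

def t_superpipelined(N, k, n):
    """Base cycles to execute N instructions on a superpipelined machine of
    degree n: k cycles to fill the pipeline, then one result every 1/n cycle."""
    return Fraction(k) + Fraction(N - 1, n)

def speedup(N, k, n):
    """S(1, n) = T(1, 1) / T(1, n) = n(k + N - 1) / (nk + N - 1)."""
    return t_superpipelined(N, k, 1) / t_superpipelined(N, k, n)

k, n = 5, 4                                 # illustrative values
assert speedup(10**7, k, n) < n             # always below the limit n ...
assert float(speedup(10**7, k, n)) > n - 0.001  # ... but approaches n for large N
```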
That means we can combine superscalar along with superpipeline, and commercial processors are available which do this. Here the processor issues m instructions every minor cycle, with a pipeline cycle 1/n of that of the baseline processor. It is characterized by a simple operation latency of 1 base cycle, that is n minor cycles; an issue latency IL of 1 minor cycle; and an issue parallelism IP of m instructions per minor cycle, or m into n instructions per base cycle. The machine parallelism is shown here as n into k; now, it should be n into m into k, I believe. So here it is wrong: the machine parallelism will be m into n into k, because the number of instructions in flight during execution will be equal to m into n into k. So this you should modify. The execution of instructions for a superpipelined superscalar processor of degree (3, 3) is shown here: 3 instructions are issued, and after each minor cycle another 3 instructions are issued. So this is how instruction issue and execution take place; this is the ideal situation. According to the classification by Jouppi, again IP is m instructions per minor cycle, the operation latency is n minor cycles, and the peak IPC is m into n instructions per major cycle. For the performance of the superpipelined superscalar processor of degree (m, n), the time required to execute N instructions will be T(m, n) = k + (N − m)/(mn). So you see how we get this: the first m instructions will take k cycles, and after that the remaining N − m instructions will require (N − m)/(mn) cycles. Based on that, you get a speedup S(m, n) = T(1, 1)/T(m, n) = (k + N − 1)/(k + (N − m)/(mn)) = mn(k + N − 1)/(mnk + N − m). So the speedup limit in this case is mn, as expected.
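The combined-case formula can be checked the same way. Again k, m and n are illustrative assumptions; the sketch verifies that S(m, n) is bounded by mn and approaches it for large N:

```python
from fractions import Fraction  # exact arithmetic

def t_combined(N, k, m, n):
    """Base cycles for N instructions on a superscalar-superpipelined (m, n)
    machine: the first m instructions take k cycles; the remaining N - m
    instructions complete at a rate of m*n per base cycle."""
    return Fraction(k) + Fraction(N - m, m * n)

def speedup(N, k, m, n):
    """S(m, n) = T(1, 1) / T(m, n) = mn(k + N - 1) / (mnk + N - m)."""
    return Fraction(k + N - 1) / t_combined(N, k, m, n)

k, m, n = 5, 3, 3                    # illustrative degrees (assumed)
s = speedup(10**7, k, m, n)
assert s < m * n                     # bounded by the limit m*n ...
assert float(s) > m * n - 0.01       # ... and approaches it as N grows
```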
Now, we have seen that inefficient unification of instructions leads to problems, because we try to combine different types of operations, such as ALU operations and memory operations, in pipelines which are not identical. Instead of that we can go for diversified, specialized pipelines, as shown here: ALU operations through one pipeline, memory operations through another pipeline, floating point through another pipeline. Instead of trying to push different types of instructions through a single pipeline, we have a separate diversified pipeline for each class, and along with that we can combine superscalar execution. Another limitation that we have seen arises from the rigid nature of the in-order pipeline. This problem can be overcome by out-of-order execution with distributed execution pipelines, as you can see here. However, this will require a dispatch buffer, that is, a multiple-entry buffer in which several instructions are temporarily stored; for example, in this case, through this dispatch buffer you will be storing multiple instructions. The order in which instructions are issued from the dispatch buffer may not be the same as the program order; that means instructions will be issued out of order. Whenever instructions are issued out of order, there is a possibility that they will also be executed and completed out of order, and the results will be stored in another buffer known as the reorder buffer. Whatever order the results arrive in the reorder buffer, ultimately you have to commit them in order, the way they appear in your program; so from the reorder buffer the outputs are retired in program order, and then the write-back to the processor state is performed.
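The dispatch/reorder-buffer idea above can be sketched in a few lines. This is a minimal illustration of in-order retirement, not the design of any specific processor; the instruction names are hypothetical:

```python
from collections import deque

def retire_in_order(program_order, completion_order):
    """Instructions may complete out of order, but results are retired
    strictly in program order, as a reorder buffer enforces."""
    done = set()
    rob = deque(program_order)          # reorder buffer holds program order
    retired = []
    for finished in completion_order:   # out-of-order completion events
        done.add(finished)
        while rob and rob[0] in done:   # retire the head only once it is done
            retired.append(rob.popleft())
    return retired

# i3 finishes first, yet retirement still follows program order.
assert retire_in_order(["i1", "i2", "i3", "i4"],
                       ["i3", "i1", "i2", "i4"]) == ["i1", "i2", "i3", "i4"]
```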
So, superscalar processor design will involve different functional units for fetch, decode, dispatch, execute, complete and retire, and along with that you will require different types of buffers: instruction buffer, dispatch buffer, issuing buffer, completion buffer, store buffer. In my next two lectures I shall discuss the design of superscalar processors involving these types of buffers and diversified functional units. Thank you.