We have discussed various techniques for increasing instruction-level parallelism to enhance processor speed in the last 8 or 9 lectures. Before we start a new topic, namely how to enhance the speed of memory by using a hierarchical memory organization, today we shall discuss some problems in this tutorial class. The first problem: using Amdahl's law, compare the speedups of a program that is 98 percent vectorizable for a system with 16, 64, 256 and 1024 processors. What would be a reasonable number of processors to build into a system for running such an application? As you know, the speedup can be expressed by the equation speedup = 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced), where fraction_enhanced is the fraction of the program that can be executed in the enhanced mode. The enhancement may come from several sources — pipelining, the various forms of parallelism we have already discussed, or a special unit such as a floating-point processor — and whatever you do, it may not be applicable to all parts of the program, so the enhancement takes place only in that fraction. Using this expression we can do the computation. For 16 processors, the speedup is 1 / (0.02 + 0.98/16): the program is 98 percent vectorizable — vectorizable means you have a vector processor that can execute that 98 percent in parallel — so the serial fraction is 1 − 0.98 = 0.02, and the vectorizable fraction 0.98 is divided by 16, the number of processors working in parallel. The speedup you get is 12.3 with 16 processors. In a similar way, with 64 processors you get a speedup of 28.3, and whenever you
use 256 processors you get a speedup of 42, and with 1024 processors you get a speedup of only about 47. As you can see, the number of processors you are utilizing keeps increasing, but the speedup does not increase at the same rate. Consider the cost/performance ratio, where cost in this particular case means the number of processors and performance means the speedup achieved with that many processors. In the first case the ratio is 16 / 12.3, about 1.3; in the second case it is 64 / 28.3, a little more than 2; with 256 processors and a speedup of 42, the ratio is 256 / 42, close to 6; and with 1024 processors, where the speedup rises only marginally from 42 to 47, the ratio jumps to about 21 (1024 / 47). Therefore, since the increase in speedup is marginal while the cost/performance ratio increases significantly when we go from 256 to 1024 processors, we can conclude that a reasonable number of processors to build into the system is 256, which already gives a speedup of about 42. The second problem that we shall discuss is related to this one: using Amdahl's law, compare speedups for programs with vectorizability decreasing from 98 percent to 95 percent, 90 percent, 85 percent and 80 percent. The objective is to see how the speedup changes as the vectorizability — or, as we shall later say, the instruction-level parallelism — decreases.
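Before taking up the second problem, the numbers of problem 1 can be reproduced with a short script. This is only a sketch of the Amdahl's-law calculation above; the function name and the formatting are mine.

```python
# Amdahl's law: speedup = 1 / ((1 - f) + f / n),
# where f is the vectorizable fraction and n the number of processors.
def speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

for n in (16, 64, 256, 1024):
    s = speedup(0.98, n)
    # cost/performance here means processors spent per unit of speedup
    print(f"n = {n:4d}  speedup = {s:5.1f}  cost/perf = {n / s:5.1f}")
```

Running it shows the speedup saturating toward the limit 1 / (1 − 0.98) = 50 while the cost/performance ratio keeps growing, which is the basis for stopping at 256 processors.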
Vectorizability and instruction-level parallelism are similar notions: both are parameters expressing how much parallelism exists in a program. As the instruction-level parallelism decreases, we will likewise find that we do not get good speedup. Here we consider the question in the context of vector processors, where the computation can be performed in parallel using a large number of processors, and the study we make from this problem is how the performance degrades as the vectorizability decreases, even if we increase the number of processors. So the vectorizability decreases while the number of processors increases from 4 to 16 to 64 to 256 to 1024. The solution is given in tabulated form. When the vectorizability is 98 percent, the speedup increases as we increase the number of processors, but obviously the cost/performance worsens as we do so: for 4 processors we get a speedup of 3.77, so the cost/performance ratio is close to 1; for 16 processors it is somewhat worse; and, as we discussed in the first problem, the reasonable number of processors is 256, giving a speedup of roughly 42. Now, if the vectorizability decreases to 95 percent, the speedup for the same number of processors is lower, and it falls further behind as we increase the number of processors: the speedup is 9.1 for 16 processors, a cost/performance ratio approaching 2 to 1; for 64 processors the ratio is about 4 to 1; and for 256 processors it grows to well over 10 to 1.
So you can see the speedup is lower than in the previous case, because we have decreased the vectorizability. Similarly, for 90 percent vectorizability the speedup still grows with the number of processors, but only by a small amount once the processor count gets large. For 85 percent vectorizability the speedup is 2.76 for 4 processors, 4.92 for 16 processors and 6.12 for 64 processors; going from 64 to 256 processors raises it only fractionally, from 6.12 to 6.52, and even with 1024 processors it is only 6.63 — a very small increase. For 80 percent vectorizability the situation is worse still: a speedup of 2.5 for 4 processors, only 4 for 16 processors, a little more than 4 for 64 processors, and it does not exceed 5 even for 1024 processors. What lesson do we learn from this observation? It is evident from the table that as the vectorizability decreases, the speedup keeps decreasing for a given number of processors, and moreover the speedup falls behind more rapidly at higher processor counts. Therefore it is cost-effective to use a large number of processors only when the vectorizability is high. Translated into ILP terms: unless you have high instruction-level parallelism, you cannot really go for a large number of stages in a pipeline, and if you are using a superscalar processor, you should not increase the number of functional units when the instruction-level parallelism is not high.
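The whole table of problem 2 can be generated the same way. Again this is only a sketch — the tabulation format is mine, the formula is the one used in the lecture.

```python
# Speedup table: rows are vectorizable fractions, columns are processor counts.
def speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

fractions = (0.98, 0.95, 0.90, 0.85, 0.80)
processors = (4, 16, 64, 256, 1024)

print("vect.  " + "".join(f"{n:>8}" for n in processors))
for f in fractions:
    print(f"{f:.2f}   " + "".join(f"{speedup(f, n):8.2f}" for n in processors))
```

Each row confirms the lesson: for 80 percent vectorizability the last three columns barely move, so the extra processors buy almost nothing.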
That is why the number of functional units used in superscalar processors does not exceed about 6 — typical values are 4, 5 or 6, not more — and this is what we learn from this problem. Let us now switch to the third problem. A workstation uses a 1.5 MHz processor with a claimed rating of 10 MIPS to execute a given program mix; assume a one-cycle delay for each memory access. What is the effective CPI of the processor? The MIPS rating can be expressed as MIPS = clock rate / (CPI × 10^6), so CPI = clock rate / (MIPS × 10^6). Substituting the values, CPI = 1.5 × 10^6 / (10 × 10^6) = 0.15. What does a CPI of 0.15 mean? It means it is a superscalar processor: a CPI of less than one can only be obtained with a superscalar processor, and here we get a CPI of 0.15. That is also the reason why, for superscalar processors, performance is normally stated in terms of instructions per cycle (IPC), which will be more than one, rather than in terms of CPI. Let us now go to the second part of the problem: the processor is upgraded to a 30 MHz clock, keeping the memory system unchanged. If 30 percent of the instructions require one memory access and another 5 percent require two memory accesses per instruction, what is the performance of the upgraded processor with a compatible instruction set and an equal instruction count in the given program mix? In this case we have increased the speed of the processor, and we use the fact that 30 percent of the instructions require one memory access and 5 percent require two.
Substituting, the average number of extra memory accesses per instruction is 0.3 × 1 + 0.05 × 2 = 0.4, and since each access costs one cycle, the number of extra cycles per instruction is 0.4 × 1 = 0.4. Adding this to the 0.15 we got in the previous part gives a CPI of 0.55: because of the memory accesses there is a delay, so the CPI is now 0.55. With this CPI we get a MIPS rating of 30 / 0.55, that is, 54.5. So we can see that although it is a superscalar processor of quite a large degree, we are not getting much benefit, because of the delays in the memory. Now let us switch to problem 4. Consider the following MIPS code sequence, where each instruction carries its usual meaning: (1) add R2, R5, R4 — the contents of R5 and R4 are added and the result written into R2; (2) add R4, R2, R5 — the contents of R2 and R5 are added and stored in R4; (3) sw R5, 100(R2) — a store word, storing the content of register R5 into the memory location computed by adding 100 to the content of R2; (4) add R3, R2, R4 — the contents of R2 and R4 are added and stored in register R3. The question was: list each of the data dependences present in this code along with its type, and specify which of the data hazards present in the code sequence can be resolved by forwarding, justifying your answer. Instructions 1 and 2 have a read-after-write dependence on R2, because instruction 2 reads R2 after instruction 1 writes it.
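As an aside, before continuing with problem 4, the CPI and MIPS arithmetic of problem 3 can be checked in a few lines. This is only a sketch; the variable names are mine, and the formulas are exactly those used above.

```python
clock_mhz = 1.5          # problem statement: 1.5 MHz clock
mips = 10                # claimed MIPS rating

# MIPS = clock_rate / (CPI * 10**6)  =>  CPI = clock_rate / (MIPS * 10**6)
base_cpi = (clock_mhz * 1e6) / (mips * 1e6)
print(f"base CPI = {base_cpi}")      # 0.15, i.e. a superscalar with IPC ~ 6.7

# Part (b): 30 MHz clock, memory system unchanged; 30% of instructions make
# one memory access and 5% make two, each access costing one extra cycle.
extra_cpi = 0.30 * 1 + 0.05 * 2      # 0.4 extra cycles per instruction
new_cpi = base_cpi + extra_cpi       # 0.55
new_mips = 30 / new_cpi              # about 54.5
print(f"upgraded CPI = {new_cpi:.2f}, MIPS = {new_mips:.1f}")
```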
Instructions 1 and 3 likewise have a read-after-write dependence on R2, because R2 is used to compute the effective address for the store: the content of R5 is stored at the memory location given by R2 + 100, so this too is a read-after-write dependence. Instruction 4 also has a read-after-write dependence on R2, since instruction 1 writes R2 and instruction 4 reads it. And there is one more dependence, between instructions 2 and 4 on R4: instruction 2 computes the sum of R2 and R5 and stores it in R4, and R4 is read by instruction 4. So we find there are four read-after-write dependences. The question naturally arises: which of them will actually lead to a hazard — all of them, or only some? Then we can answer the second part. Let us lay out the four instructions in the pipeline. Each instruction goes through instruction fetch (IF), instruction decode (ID), instruction execution (EX), memory access (MEM) and write-back (WB), each starting one cycle after the previous instruction, and pipeline registers are present between the stages in every case. The question, again, was to list the data dependences and to specify which of the resulting data hazards can be resolved by forwarding.
We have already listed the data dependences; now let us see which of them can be resolved by forwarding. When instruction 2 is in its execution stage, instruction 1 has already completed its execution, so the result can be provided directly by forwarding from the pipeline register to the ALU input. Thus the first hazard (1 → 2) can be overcome by forwarding. The second one (1 → 3) can likewise be overcome by forwarding. Now consider the third dependence, between instructions 1 and 4: although there is a read-after-write dependence, by the time instruction 4 reaches its execution stage, the data has already been written back, so the value is available in the register file and can be read from there directly. This dependence therefore does not lead to any hazard, and there is no need for forwarding: the value need not be read from a pipeline register, because it can be read from the architectural register file in the normal way and supplied to the instruction. Finally, consider the dependence between instructions 2 and 4: when instruction 4 needs R4 in its execution stage, the value computed by instruction 2 is available in a pipeline register, so forwarding can be done from that stage to the execution stage of instruction 4.
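The dependence-listing step can also be done mechanically. As a sketch — the tuple encoding of the four instructions is my own, not from the lecture — the following finds every read-after-write pair:

```python
# Minimal RAW-dependence detector for the problem 4 code sequence.
# Each instruction is (register_written_or_None, [registers_read]).
# This simple scan ignores intervening writes, which is fine here
# because no register is written twice in the sequence.
code = [
    ("R2", ["R5", "R4"]),   # 1: add R2, R5, R4
    ("R4", ["R2", "R5"]),   # 2: add R4, R2, R5
    (None, ["R5", "R2"]),   # 3: sw  R5, 100(R2)  (reads R5 and R2)
    ("R3", ["R2", "R4"]),   # 4: add R3, R2, R4
]

raw = []
for i, (dst, _) in enumerate(code):
    if dst is None:
        continue
    for j in range(i + 1, len(code)):
        if dst in code[j][1]:
            raw.append((i + 1, j + 1, dst))

for i, j, reg in raw:
    print(f"RAW on {reg}: instruction {i} -> instruction {j}")
```

It reports the same four read-after-write dependences identified above: 1 → 2, 1 → 3 and 1 → 4 on R2, and 2 → 4 on R4.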
So three of the four dependences require forwarding, and that is what is written in the solution: hazards 1, 2 and 4 can be resolved by forwarding the ALU output and the memory output, while the third one (1 → 4) does not lead to any hazard at all. This is the problem that deals with dependences and with the hazards that can be resolved by forwarding. Forwarding is expensive hardware, but it is very useful: as is clear from this example, a lot of hazards can be overcome with the help of the forwarding unit. Now let us come to problem 5. The time required for instruction fetch is 4 ns, instruction decode 2 ns, instruction execution 3 ns, operand fetch in memory 4 ns, and write-back 2 ns. So we are considering a processor where instruction fetch can be done in 4 ns, decode in 2 ns, execution in 3 ns, the memory operation again in 4 ns — instruction fetch reads from the instruction memory, while this memory stage is usually involved with the data memory — and the final operation, write-back, is performed in 2 ns. These are the values given. Now, when you implement a single-cycle, non-pipelined processor, you perform all the operations in one cycle: the 4 ns, 2 ns, 3 ns, 4 ns and 2 ns are all accommodated in a single clock period. Since it is single-cycle, the time required is 4 + 2 + 3 + 4 + 2 = 15 ns.
So 15 ns will be the clock period of the single-cycle implementation, and the clock frequency — the speed, specified in terms of clock frequency — will be 1 / (15 × 10^-9) Hz, or 1000/15 MHz, about 66.7 MHz. That is the single-cycle implementation. Now let us consider a multi-cycle implementation, where each of the operations — instruction fetch, instruction decode, instruction execution, memory and write-back — is performed in a different cycle. As we know, there are three possible alternatives: single-cycle, multi-cycle, and pipelined. Before we consider pipelining, what happens in the multi-cycle implementation? The clock period must accommodate the slowest stage, so it is 4 ns, and the clock frequency — the speed — will be 1000/4 MHz, that is, 250 MHz. In a multi-cycle implementation, not all the cycles are required to execute every instruction. For example, in a load-store architecture the memory operation is performed only by load and store instructions, while ALU operations are performed separately; for an ALU-type instruction the memory cycle is not necessary, whereas for a memory-type instruction it is required. So all five cycles may not be needed for every instruction. The clock frequency will be 250 MHz, and if all the cycles are required, the total time per instruction will be longer than in the single-cycle implementation. However, you may be asking: what is the benefit of the multi-cycle implementation?
The benefit of the multi-cycle implementation is that the hardware resources required can be less. Why? For example, you can use the same memory for instructions and data — there is no need for two separate memories — and similarly you will not require two ALUs for the two computations. That is why the multi-cycle implementation requires less hardware, and that is the reason it is also popular compared with the non-pipelined single-cycle design, where everything is performed in a single cycle and separate hardware resources are required for all of these operations. When we go for a pipelined implementation, we again require separate hardware resources, because different instructions are executed in an overlapped manner: two separate memories, one for instructions and one for data, and separate functional units — for instance one for instruction execution and one for computing PC + 4 during fetch. So if you consider the hardware requirement, the single-cycle and pipelined implementations require more hardware. Regarding the clock frequency, the pipelined implementation runs at the same clock as the multi-cycle implementation, so clock-speed-wise they are the same. Now let us look at the rest of the problem: given the clock speed of the non-pipelined single-cycle processor, a five-stage pipelined processor is designed using the same technology with latches requiring 0.5 ns; what is the clock speed of the pipelined processor?
Here an additional delay is incurred because of the latches. I said that the clock frequency would be the same for the multi-cycle and pipelined implementations, but that is true only if we assume the latches have zero delay; with a non-zero latch delay, the clock period of the pipelined processor is longer — you have to add the delay of the latches. And, assuming there are no stalls, what is the speedup of the pipelined processor with respect to the non-pipelined processor when executing 1000 instructions? We have already discussed the three situations, so let us go to the solution. The clock period of the non-pipelined single-cycle processor is 15 ns, as I said, so its frequency is 1000/15 MHz. For the pipelined processor the period is equal to the maximum stage time plus the latch delay: the maximum is 4 ns and the latch delay is given as 0.5 ns, so the period is 4 + 0.5 = 4.5 ns and the clock frequency is 1000/4.5 MHz — that is the speed of operation of this pipelined implementation. As I have already said, for a multi-cycle implementation it would be 1000/4 MHz. Now, the speedup is calculated as the execution time of the non-pipelined processor divided by the execution time of the pipelined processor. In our case, executing 1000 instructions, the non-pipelined time is 1000 × 15 ns, while the pipelined time is (k + n − 1) × 4.5 ns = (5 + 1000 − 1) × 4.5 ns, where k is the number of stages and n the number of instructions. The ratio is roughly 3.32.
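The timing claims of problem 5 can be verified with a short calculation. This is a sketch; the stage delays are those given in the problem, and the variable names are mine.

```python
stage_ns = [4, 2, 3, 4, 2]     # IF, ID, EX, MEM, WB stage delays in ns
latch_ns = 0.5                 # latch delay per stage boundary
n_instr = 1000
k = len(stage_ns)              # five pipeline stages

t_single = sum(stage_ns)                      # 15 ns single-cycle period
t_clock = max(stage_ns) + latch_ns            # 4.5 ns pipeline clock period
t_nonpipe = n_instr * t_single                # 15000 ns, non-pipelined
t_pipe = (k + n_instr - 1) * t_clock          # (5 + 999) * 4.5 = 4518 ns
print(f"speedup = {t_nonpipe / t_pipe:.2f}")  # ~3.32, well below the ideal 5
```

Note how the speedup falls short of the stage count k = 5 for two reasons: the clock is set by the slowest stage (4 ns, not the 3 ns average) and the latch delay adds another 0.5 ns per cycle.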
So the ideal speedup is 5, but instead of 5 we get 3.32. Even though there are no stalls, the clock period must correspond to the delay of the largest stage plus the delay of the latch, so the pipelined implementation gives less than the ideal speedup even without stalls. Now let us switch to problem 6. Consider a 4-stage floating-point adder with a 10 ns delay per stage. Name the appropriate functions performed by the 4 stages, and find the minimum number of clock periods required to add 100 numbers, that is, Z = A1 + A2 + ... + A100, using the pipelined adder; assume that the output of stage 4 can be routed back to either of the two inputs with delays equal to a multiple of the clock period. I have already discussed the 4-stage floating-point pipelined adder, and the functions to be performed by the 4 stages are: first, compare the exponents and align the significands — the two exponents will in general not be equal, so the significand of the number with the smaller exponent must be shifted to match the larger exponent; second, add the significands; third, normalize the sum; and fourth, round the sum. These are the 4 operations performed in the 4 stages, and I have already discussed them in one of my previous lectures with the help of an example. As to how the alignment of significands works: for example, with two numbers — although we consider decimal numbers, the situation is the same for binary — the exponents here differ by 1.
So the significand is adjusted to make the exponents the same — here both become 10^1 — and then the significands are added to get the sum. Now, you may ask one question: sometimes we use the term fraction for the significand, so why use the name significand? Conventionally, when we normalize a floating-point number, we shift it so that the first significant digit comes immediately after the radix point — for decimal, something of the form 0.1xxx — so the stored value really is a fraction. But there is the IEEE 754 standard, where after normalization the value has the form 1.xxx: there is a 1 before the binary point, and the remaining bits follow. Why was this done in IEEE 754? The reason is that the leading 1 need not be stored, so you get one additional bit in your fraction, and as a consequence the resolution increases; it is used precisely to increase the resolution. In such a case we cannot really call the stored field a fraction, so it is given a separate name: the significand. Anyway, those were the four stages. However, our pipeline has to be a little different from the one shown in this diagram, because, as you have seen, our requirement is that the output of stage 4 be routed back to either of the two inputs with delays equal to a multiple of the clock period. To achieve this we modify the pipeline a little, and the modified pipeline is shown in this diagram. The four traditional stages required for adding floating-point numbers are there — align significands, add significands, normalize the sum, round the sum — and the blue-colored blocks are the latches, shown between the different stages.
Now, the output is routed back to the input, but you require two separate latches for the two operands, and they are routed through multiplexers. Through the multiplexers you can feed either an incoming number (one of A1 to A100) or the fed-back output that has been stored in these latches. That means the two operands can both be new numbers, say A1 and A2; or one operand can be a new number while the other is the output of the pipeline, fed back through the multiplexer as one operand of the floating-point pipelined adder; or the output can be saved in the two different latches on successive cycles and both then fed to the inputs of the pipelined floating-point adder through the multiplexers. To achieve this flexibility we have added the multiplexers, and obviously separate clocks for the latches and control signals for the multiplexers will be required — in other words, the controller of this floating-point adder will be a little more complex. Now, using this type of pipelined adder, let us see how the computation is performed. The operations performed in the different cycles are tabulated, with the clock numbers 1, 2, 3, 4, ... on the left-hand side. In the first clock cycle we feed A1 and A2 to generate their partial sum: A1 and A2 are fed in, selected by the multiplexers and applied to the pipelined adder. This is how the first pair goes in; then the second pair goes in, and at the end of the fourth cycle the first output is available from stage 4, as shown in the right-hand column.
Similarly, A3 + A4 is fed in the second cycle and its output is available in the fifth cycle; A5 + A6 is fed in the third cycle, with output in the sixth; and A7 + A8 is fed in the fourth cycle, with output in the seventh. Now the control changes: what is done in the fifth and subsequent cycles? The output generated at the end of the fourth cycle — call it A12, the result of adding A1 and A2 — has been latched, and in the fifth cycle it is fed back through the multiplexer to one input of the adder, while a new number, A9, is fed to the other input. From then on, in each cycle one new number is fed in together with a fed-back partial sum; the partial sums keep accumulating in the pipeline, returning to the inputs where they are latched and used again as inputs, as shown for the different cycles. This continues through cycles 5, 6, 7, 8, 9, ... up to clock number 96, in each cycle feeding one input from the output of the pipeline and the other input from the given numbers. Because the pipeline is four stages deep, the additions are effectively grouped into four interleaved partial sums, which are now available in the pipeline registers. One group contains A1 + A2 + A9 + A13 + ..., continuing in this way up to A97.
Another group is A3 + A4 + A10 + A14 + ..., continuing up to A98; the third group is A5 + A6 + A11 + A15 + ..., up to A99; and the fourth group is A7 + A8 + A12 + A16 + ..., continuing up to A100. These are the four partial sums available in the pipeline registers. So all the numbers have now been added, but the totals sit in four different partial sums: the first is available at the end of cycle 96, the second at cycle 97, the third at cycle 98 and the fourth at cycle 99. Now these four partial sums must themselves be added. How are these additions carried out? To add the first two, you have to wait the appropriate multiple of cycles: the first sum is latched in one input latch in one cycle, the second in the other latch in the next cycle, and then, by selecting the multiplexers, they are fed into the pipelined adder. So in cycle 98 you can start adding the first two partial sums. Similarly, the addition of the third and fourth partial sums can start in cycle 100, because the fourth is available only at the end of cycle 99. The result of the pair fed at cycle 98 emerges at the end of cycle 101, and the result of the pair fed at cycle 100 emerges at the end of cycle 103; after latching, these two partial sums are usable from cycles 102 and 104, and they are shown as S102 and S104. Now these two must be added, and they can be fed to the pipeline only at the end of cycle 104.
So in the 104th cycle you feed these two numbers — in the meantime they were held in the two input latches — where S102 is the sum of the first two groups and S104 the sum of the last two. Feeding them in cycle 104, you get the result in cycle 107; that is, the final output is obtained at the end of the 107th cycle. To summarize the count: the first 4 cycles feed the first 8 numbers in pairs; the next 92 cycles, up to cycle 96, feed the remaining 92 numbers; then come the wait cycles for latching and pairwise adding the four partial sums; and the final addition, fed in cycle 104, completes at cycle 107 — so 107 cycles in total. This is the last problem that I wanted to discuss in this tutorial. I suggest that you try to solve problems from the book: solving problems is the best way to learn a subject, and that is the reason why I have discussed some problems in this particular tutorial. Maybe later on I shall discuss some more problems in other tutorials. Thank you.