Welcome to today's lecture on software pipelining. We are discussing data hazards, and in the last lecture we saw how we can minimize or reduce data hazards, that is, overcome the stalls, with the help of a software approach known as loop unrolling. What it essentially does is increase the size of the basic block. Suppose you have a loop whose body consists of three instructions A, B, C, and this is the data dependency graph. These three instructions are dependent; as a consequence, they cannot be executed simultaneously or in an overlapped manner. By loop unrolling, the body is replicated: if we unroll three times we get A, B, C, then A, B, C, then A, B, C. You now have nine instructions in your basic block, and since instructions from different iterations of the loop are usually independent, within these nine instructions a good deal of independence is obtained. Instruction level parallelism is thus increased in this larger basic block, and the instructions are then rescheduled so that the program executes with fewer stalls. That is the basic approach we discussed in the last lecture, and it is a software-based approach. We also saw that loop unrolling with instruction scheduling runs into three different limits. Number one is the diminishing return in the amount of loop overhead amortized with each unroll: for example, if the loop is unrolled eight times, the overhead per original iteration is cut by a factor of eight, and further unrolling buys less and less. Second is the growth of the code size due to loop unrolling, which leads to cache misses; as a consequence, you cannot really unroll many times. The third limit is the shortfall of registers created by aggressive unrolling and scheduling, and this is known as register pressure.
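To make the unrolling step concrete, here is a minimal Python sketch; the function name `unroll` and the instruction labels A, B, C are illustrative, not from any real compiler.

```python
# A toy model of loop unrolling: replicate the loop body `factor` times.
# Tagging each instruction with its copy index stands in for the register
# renaming a real compiler would perform on each copy.
def unroll(body, factor):
    return [f"{instr}_{k}" for k in range(factor) for instr in body]

unrolled = unroll(["A", "B", "C"], 3)
# The basic block now holds 3 * 3 = 9 instructions; copies belonging to
# different iterations are independent, so a scheduler can interleave them.
```

The larger block is what gives the instruction scheduler room to work, at the cost of the code-size growth and register pressure noted above.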
We have already discussed these issues in detail, and we have seen that loop unrolling improves performance by eliminating overhead instructions. Loop unrolling is a simple but useful method to increase the size of straight-line code fragments. It is, however, a sophisticated high-level transformation that leads to a significant increase in the complexity of the compiler; loop unrolling with instruction scheduling is done at the cost of increased compiler complexity. And in spite of that, it still has limitations, because it increases the size of the code, which leads to cache misses and related problems. We shall now discuss another approach in which some of these limitations are overcome. It is known as software pipelining, and it eliminates loop-independent dependences through code restructuring. By restructuring the code, the loop-independent dependences are eliminated, and that obviously leads to a reduced number of stalls. It helps to achieve better performance in pipelined execution, and one very important factor is that, compared to simple loop unrolling, it consumes less code space; that is, the size of the code is small. Now let me explain how exactly it is done. Let me start again with the same data dependency graph A, B, C. Whenever we try to execute the loop in a pipelined manner, we may write the iterations one after another: A, B, C; then in the next cycle A, B, C; then A, B, C; and so on. If we look at this arrangement, we find that we can slice it diagonally: instruction C of one iteration, say iteration i, instruction B of iteration i+1, and instruction A of iteration i+2 can be combined to form a single loop body in which these three instructions are executed.
Since they are taken from different iterations, they are presumed to be independent: this C, this B, this A belong to different iterations, and as a consequence they are independent. That means these diagonal slices C, B, A; C, B, A; C, B, A will form a loop; this repeated part can be executed as a loop with body C, B, A. Of course, you still have to execute the remaining instructions that are left out at the beginning and at the end, so there will be a kind of prologue and epilogue, but the middle part can be executed in the form of a loop, and since the three instructions in the body are independent, there will be no stall in the pipeline. That is what is being done in software pipelining: we are taking instructions from different iterations, just as it is done in a hardware pipeline. In a hardware pipeline we had the stages instruction fetch, instruction decode, execute, memory operation, and write back, and successive instructions were executed in an overlapped, pipelined manner. Here we are doing the same thing, except that where the hardware pipeline overlaps the stages of a single instruction stream, we have taken instructions from three different iterations. So, exactly as it happens in a hardware pipeline, in each iteration of the software-pipelined code some instruction from some iteration of the original loop is executed. This is what is shown here: we form a kernel C, B, A which can be executed in the form of a loop, and this is essentially software pipelining. The same thing is stated here as the central idea: reorganize the loop so that each new iteration is made from instructions drawn from different iterations of the original loop. Let me illustrate this with the help of an example.
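The slicing just described can be written down mechanically. The following Python sketch (the function name and the (instruction, iteration) representation are my own, not from the lecture slides) builds the prologue, the kernel C, B, A, and the epilogue for a body of any depth; the coverage check below mirrors the requirement that every instruction of every original iteration appear exactly once.

```python
def software_pipeline(body, n):
    """Software-pipeline a loop whose body is the instruction list `body`,
    executed for n iterations (n >= len(body)).  Returns (prologue, kernel,
    epilogue); entries are (instruction, iteration) pairs, and each kernel
    element is one trip of the new loop."""
    d = len(body)  # pipeline depth = number of instructions in the body
    # Kernel trip j issues the last instruction of iteration j, the
    # second-to-last of iteration j+1, ..., the first of iteration j+d-1.
    kernel = [[(body[d - 1 - s], j + s) for s in range(d)]
              for j in range(n - d + 1)]
    # Prologue: the early stages of the first d-1 iterations.
    prologue = [(body[s], i) for i in range(d - 1) for s in range(d - 1 - i)]
    # Epilogue: the late stages of the last d-1 iterations.
    epilogue = [(body[s], i) for i in range(n - d + 1, n)
                for s in range(n - i, d)]
    return prologue, kernel, epilogue

pro, ker, epi = software_pipeline(["A", "B", "C"], 5)
# Each kernel trip is [("C", j), ("B", j+1), ("A", j+2)]: three
# instructions from three different iterations, so the trip has no stalls.
```

Flattening prologue, kernel, and epilogue gives back exactly one copy of every instruction of every original iteration, which is the correctness condition discussed below.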
That means, for example, if we had iterations i0, i1, i2, i3, i4, and i5, then some instruction from i0, some from i1, some from i2, some from i3, and some from i4 are taken to form one new iteration; similarly, the next new iteration takes instructions from i1, i2, i3, i4, and i5. These are the software-pipelined iterations, and two of them are shown here. Let me illustrate this with an example, but before that, here is how software pipelining is implemented. Step number one: unroll the loop body with an unroll factor n. Here also, just as in the loop unrolling done to improve instruction level parallelism which I explained in the last lecture, unrolling is performed, but for a different objective; the way the instructions are finally executed is different from the previous case. Step two: select the order of instructions from different iterations to pipeline; you have to select instructions from different iterations so that they form a pipeline. Step three: paste instructions from the different iterations into the new pipelined loop body. Let me come back to our original example, the static loop with which we explained loop unrolling, and subsequently scheduling, in the last lecture. In this case you have three instructions in the main body of the iteration, and the fourth and fifth instructions are essentially loop-manipulation instructions used for housekeeping. As I mentioned, the loop body is first unrolled three times.
So you have nine instructions: the first three belong to iteration i, the next three to iteration i+1, and the last three to iteration i+2. Of course, whenever loop unrolling is done the loop-overhead instructions are not needed in every copy, so the two housekeeping instructions that were present have been removed, and the single body of the restructured loop can contain instructions from different iterations of the original loop body. Let us see how this is done. We take three instructions: the third instruction from iteration i, the second instruction from iteration i+1, and the first instruction from iteration i+2. Whenever you take instructions from different iterations, you have to select them in such a way that each instruction of the original body is selected at least once, to make sure that no instruction of the original loop is left out of the pipelined loop body. Ultimately your program has to give the correct result, and it will give the correct result only when all the instructions are executed, so you have to be careful about this. In this particular case, one iteration of the software pipeline executes the third instruction of iteration i, the second of iteration i+1, and the first of iteration i+2; the next iteration executes the corresponding instructions one original iteration later; and so on. This is how it will be done, and this is a very simple program having only three instructions in the loop body.
If you have a larger number of instructions in the loop body, you have to be very careful in picking up instructions and placing them in your software pipeline. With these three instructions we have now formed a loop, as you can see: instruction three from iteration i, instruction two from iteration i+1, and instruction one from iteration i+2 form the loop body of the software pipeline, and they belong to three different iterations. Then, since we are forming a loop, we have to put back the loop-manipulation instructions. One of them decrements the pointer so that it points to the next element of the array, and the other decides how many times the looping will be done. In our original program the loop was executed one thousand times; it is not shown here, but R1 was initialized accordingly, R2 holds the terminating value, R1 is decremented each time around, you keep comparing the two, and when they become equal you come out of the loop. So we now have the pipelined loop body, which is the fourth step. In the pipelined loop body you have to adjust the displacement values so that each effective address points to the right element of the array; that is why the store uses displacement 16 with R1, while the load brings the value of an array element into register F0, the add combines it with the constant held in F2, and the result goes to F4, which is the register the store writes. Since the three instructions have been taken from three different iterations, you cannot really follow the operation from the transformed loop alone; you have to go back to the original program to do that.
So this forms the basic loop, and its instructions have been taken from the iterations handling elements M[i], M[i-1], and M[i-2]. You also need a preheader, which holds the instructions to be executed before this loop, and similarly, after the loop completes, you will require several instructions in the postheader. Let us see what instructions will be present in the preheader and the postheader. The loop body is: Loop: S.D F4, 16(R1); ADD.D F4, F0, F2; L.D F0, 0(R1); then the pointer decrement DADDUI R1, R1, #-8, which decrements by eight; and BNE R1, R2, Loop. Now, with this as the main loop body, what will be in the preheader? We are left with the start-up instructions L.D F0, 0(R1) and ADD.D F4, F0, F2, which must be executed before the loop. If these two instructions are executed one after the other, we know this will lead to a stall; what can be done is that the second load, the load belonging to the next iteration, can be filled in between these two instructions. But if you do that, you cannot fill it in directly: you have to use a different register, and you have to adjust the displacement by eight, because that load points to the next array element. That adjustment you have to make.
So the first instruction is handled this way, and then the second; the preheader will consist of these three instructions, and perhaps a stall will still have to be inserted. The postheader will require three instructions: S.D F4, 0(R1); then ADD.D F4, F0, F2; and then another S.D F4, 0(R1). You have to adjust the displacements here so that the two stores write to two different places; one of them will actually use displacement -8 so that it points to the proper array element. That means the storing has to be done carefully. These instructions are executed before and after the loop body, and they form the preheader and the postheader. The preheader and postheader parts may have a few stalls, but the main body of the loop will not have any stall, because its instructions have been taken from different iterations and are independent. That is the basic idea of the pipelined loop body. Now let us consider some important issues related to software pipelining. Register management can be tricky: in more complex examples we may need to increase the number of iterations between when data is read and when the result is used. Indeed, if we go back to our problem, we find that in the preheader and postheader parts the register management has to be done properly for the program to give the correct result; the way I have written it may not give the correct result, because the register management has not been fully taken into consideration, and that you have to take care of. Then, optimal software pipelining has been shown to be an NP-complete problem.
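To check that the transformed loop really computes the same thing, here is a small Python model of the whole structure — preheader, stall-free kernel, postheader — for the lecture's add-a-constant-to-each-array-element loop. The variables f0 and f4 play the roles of registers F0 and F4; this is my own sketch of the data flow, not the exact register assignment of the slides.

```python
def pipelined_add(x, c):
    """Add constant c to every element of x using a software-pipelined
    schedule: each kernel trip stores iteration i, adds for iteration i+1,
    and loads for iteration i+2."""
    n = len(x)
    if n < 3:                 # too short to pipeline; do it naively
        for i in range(n):
            x[i] += c
        return x
    f0 = x[0]                 # preheader: load for iteration 0
    f4 = f0 + c               #            add  for iteration 0
    f0 = x[1]                 #            load for iteration 1
    for i in range(n - 2):    # kernel: three independent operations per trip
        x[i] = f4             #   store (iteration i)
        f4 = f0 + c           #   add   (iteration i+1)
        f0 = x[i + 2]         #   load  (iteration i+2)
    x[n - 2] = f4             # postheader: drain the last two iterations
    f4 = f0 + c
    x[n - 1] = f4
    return x
```

Running `pipelined_add(list(range(5)), 10)` gives the same result as the naive loop, which is exactly the "each instruction selected at least once" guarantee discussed above.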
I have illustrated this with a very simple example, and it appeared very simple; but when you consider real-life problems, where the number of instructions in the loop body is larger, and later on, when software pipelining is done in the context of superscalar architectures, it becomes a very non-trivial problem, and it has been found to be NP-complete. Whenever a problem is NP-complete, there is no known efficient deterministic algorithm that will give you an optimal result, so you have to use a heuristic-based approach. That is what is done in practice: present solutions are based on heuristics, and a lot of research has been carried out to achieve good software pipelining and improve the performance of program execution. Another very important aspect: if we compare loop unrolling and software pipelining, we find that software pipelining takes less code space. We have seen that one very important limitation of loop unrolling is that it increases the size of the code; since the code is big and has to be loaded into the cache memory before it is executed, it may lead to cache misses, but when the size of the code is small that problem does not arise. Software pipelining has this advantage of smaller code space. Moreover, software pipelining and loop unrolling reduce different types of inefficiencies: the two approaches attack the inefficiency in exploiting the instruction-level parallelism present in the program in two different ways. Loop unrolling reduces the loop-management overhead, as we have seen: each time you unroll, you reduce the number of those overhead instructions.
So those overheads are reduced, while software pipelining allows the pipeline to run at full efficiency by eliminating loop-independent dependences: by taking instructions from different iterations, which are independent, you form the loop body, and that is how you increase the efficiency of the program. You can visualize the two approaches in these diagrams. The top diagram corresponds to software pipelining: as I mentioned, there is start-up code and wind-down code, and in between the software-pipelined kernel, where the number of overlapped operations is maximum; in the start-up and wind-down parts the number of instructions that can be executed in an overlapped manner is reduced. The bottom diagram corresponds to loop unrolling, where only the middle part, whose length is proportional to the number of unrolls, has the maximum number of instructions that can be executed in an overlapped manner; at the boundaries, in the overlap between unrolled iterations, the number of instructions that can be executed in an overlapped manner is smaller. These two diagrams compare and visualize the two basic approaches. Now, so far, whether we unrolled the loop or applied software pipelining, what we have tried to achieve is a cycles-per-instruction (CPI) of 1, that is, a maximum throughput bounded by one instruction per cycle. That is the best that can be achieved in both approaches: one instruction completed per cycle, which can happen only when there is no stall.
Only when there is no stall shall we be able to reach the upper limit of CPI equal to 1, and beyond that it cannot be improved by these approaches. There are two further handicaps of a simple scalar pipeline. One is the inefficient unification of instruction types into one pipeline: ALU operations, memory operations, and floating-point operations are all forced through a single pipeline, where they are inefficiently combined. The other is the rigid nature of the in-order pipeline: we execute one instruction followed by another strictly in order, which is rigid in the sense that if a particular instruction is stalled for some reason, then the following instruction is also stalled and not allowed to progress. Those problems arise in a scalar pipeline. Now, how can we build a higher-ILP processor; that is, how can we have CPI less than 1? So far we have assumed that our upper limit is CPI equal to 1; CPI can be more than 1 whenever you have stalls, but now we are trying to achieve a CPI which is less than 1, in other words, to execute more than one instruction in a single cycle. There are two basic approaches: one is known as VLIW, very long instruction word, and the other is known as superscalar. In both cases the basic idea is to have more than one functional unit: the number of functional units present, in the VLIW approach as well as in the superscalar approach, is more than one. So far we assumed that you have only one functional unit, which is pipelined; in VLIW or superscalar processors we shall have more than one functional unit.
Since we have more than one functional unit, we shall be able to issue more than one instruction, more than one operation, at a time, and that will help in getting a CPI which is less than 1. So in superscalar or VLIW processors you have more than one functional unit in a single CPU: there is only one CPU, but unlike the single ALU of a simple processor, you have several functional units. That is the basic approach followed in both, but the way it is realized in the two cases is different. In VLIW, the compiler has the complete responsibility of selecting the set of operations to be executed concurrently: the instruction-level parallelism exploited in VLIW and in superscalar processors is found in different ways. In VLIW the responsibility is given to the compiler, which identifies which operations can be executed in parallel; a single long instruction then carries more than one operation, and these are fed to separate functional units. That means the compiler is given the complete responsibility for identifying the instruction-level parallelism. In the superscalar approach, on the other hand, the compiler is an ordinary, simple compiler, and it is the hardware that identifies which instructions can be issued simultaneously and executed concurrently: at run time, more than one instruction is issued and fed to the different functional units. So in both cases you have multiple functional units, but in VLIW the wide instructions are formed by the compiler and then executed in order, while in superscalar the hardware finds out which instructions can be executed together.
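As a rough illustration of the VLIW division of labour, here is a toy Python sketch of compile-time bundle formation; the function, the two-wide issue width, and the dependence representation are all invented for illustration, not taken from any real VLIW compiler.

```python
def form_bundles(instrs, deps, width=2):
    """Greedy in-order VLIW-style bundling.  `deps` is a set of
    (earlier, later) pairs meaning `later` depends on `earlier`; an
    instruction may not share a bundle with one it depends on."""
    bundles, current = [], []
    for ins in instrs:
        if len(current) == width or any((p, ins) in deps for p in current):
            bundles.append(current)   # close the bundle, start a new one
            current = []
        current.append(ins)
    if current:
        bundles.append(current)
    return bundles

# B depends on A, so A must finish its bundle before B can issue;
# C is independent, so it shares a bundle with B.
bundles = form_bundles(["A", "B", "C", "D"], {("A", "B")})
print(bundles)  # [['A'], ['B', 'C'], ['D']]
```

Each inner list models one long instruction whose slots go to separate functional units; a superscalar machine would make the same grouping decisions in hardware, at issue time.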
So there is an instruction-issue hardware which generates several operations to be performed by the different functional units. Superscalar processors can be built in two ways. Number one is the statically scheduled superscalar processor, where multiple issue is performed but in-order execution takes place. The other is the dynamically scheduled superscalar processor, which uses specialized features such as speculative execution and branch prediction, and which allows out-of-order execution; this requires more hardware functionality and complexity. Later on I shall discuss these two techniques, which provide higher ILP; of course, loop unrolling or software pipelining will still be useful to increase the available ILP. Now, what I have discussed so far is known as static instruction scheduling, which is done by the compiler. Another approach is known as dynamic instruction scheduling. Why is dynamic instruction scheduling needed? In pipelining we have seen hardware techniques such as forwarding and interlocking, where stalls are introduced when there is a hazard, or software-based instruction restructuring is done. But software-based restructuring is handicapped by its inability to detect many dependences. We have discussed different types of dependences: those that are visible at compile time can be handled by static instruction scheduling, but that scheduling must be very conservative in nature. In particular, there are situations, namely name dependences involving memory, where the dependence cannot be identified by the compiler at compile time, because it becomes evident only when the program is executed, that is, at run time.
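The memory name-dependence point can be made concrete with a tiny Python sketch (invented for illustration): a compiler that reordered the load above the store would be wrong whenever the two indices happened to be equal, and in general it cannot prove at compile time that they are not; dynamic-scheduling hardware instead compares the actual addresses at run time.

```python
# Whether these two memory operations touch the same location depends on
# the run-time values of i and j, so a static scheduler must assume the
# worst and keep them in order.
def store_then_load(mem, i, j, v):
    mem[i] = v        # store to address i
    return mem[j]     # load from address j: aliases the store iff i == j

buf = [10, 20]
same = store_then_load(buf, 0, 0, 99)   # i == j: the load sees the new 99
diff = store_then_load(buf, 0, 1, 55)   # i != j: the load sees the old 20
```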
The dependences that are not revealed at compile time become visible at run time, and that is what is exploited in dynamic instruction scheduling with the help of hardware: the hardware determines the order in which instructions execute. This is in contrast to a statically scheduled processor, where the compiler determines the order of execution. Later on I shall discuss a technique by which this hardware scheduling is done; you will require very specialized hardware for the instruction scheduling, and loop unrolling and the other transformations done by the compiler will not be necessary when it is done in hardware. Various other things, like register renaming, are also incorporated as part of dynamic instruction scheduling. With that we have come to the end of a very important topic: instruction-level parallelism on simple pipelines, where ILP is exploited to approach a CPI of 1. Some of the important points you should remember before we leave this topic are given here. First of all, what is pipelining? As I defined in the beginning, it is an implementation technique in which multiple tasks are performed in an overlapped manner; you may recall that. When can it be implemented? It can be implemented when a task can be divided into two or more subtasks which can be performed independently. The second point is that the earliest use of parallelism in CPU design to enhance processing speed was pipelining; pipelining was the first form of parallelism incorporated in processors. And pipelining does not reduce the execution time of a single instruction; it increases the throughput, as I have highlighted many times.
When you execute instructions on a pipelined processor, the time needed to execute a single instruction is not reduced; rather, it increases, because you are performing the different parts of an instruction in different stages, and in between you have the pipeline registers, or latches, which introduce some delay. So if you consider a single stage, instead of 10 nanoseconds it may take 11 nanoseconds or more. We saw that a single instruction took 40 nanoseconds in the non-pipelined processor, while 44 nanoseconds were required in the pipelined processor: for each latch one additional nanosecond of delay was added, which is why a single instruction takes 44 nanoseconds in the pipelined processor. However, if you consider the throughput, you will find that on the average you get one result every 11 nanoseconds, and that is how the throughput is increased, giving a speedup approaching 40/11. That is what is highlighted at this point. Another issue that I discussed in detail is the two types of processors, CISC and RISC, which have different features. CISC processors are not suitable for pipelining because of variable instruction formats, variable execution times, and complex addressing modes. RISC processors, on the other hand, are suitable for pipelining because of their fixed instruction format, fixed execution time, and limited addressing modes. So we have restricted our discussion to pipelining of RISC processors. Later on, however, I shall discuss the Intel series of processors, which are essentially CISC, but in which the complex instructions are internally decomposed into RISC-like operations and executed in a pipelined manner.
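The numbers quoted here work out as follows. This assumes the earlier lecture's example of four 10 ns stages with 1 ns of latch delay each; the per-stage figures are implied by the 40 ns and 44 ns totals rather than restated in this lecture.

```python
stages, stage_ns, latch_ns = 4, 10, 1
n_instr = 1000                                 # a long instruction stream

single_nonpipelined = stages * stage_ns        # 40 ns per instruction
clock = stage_ns + latch_ns                    # 11 ns pipeline cycle
single_pipelined = stages * clock              # 44 ns: latency got worse

# Throughput view: after the pipeline fills, one result every 11 ns.
total_nonpipelined = n_instr * single_nonpipelined
total_pipelined = (stages + n_instr - 1) * clock   # fill time, then 1/cycle
speedup = total_nonpipelined / total_pipelined     # approaches 40/11
```

So the single-instruction latency rises from 40 ns to 44 ns, while the speedup over a long run tends toward 40/11, a little under 3.64.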
Then I discussed hazards: there are situations, called hazards, that prevent the next instruction in the instruction stream from executing in its designated clock cycle. We have discussed three different types of hazards. Structural hazards arise due to the non-availability of enough hardware resources. Data hazards arise when the results of earlier instructions are not available: in the case of a data dependence, a result needed by a subsequent instruction is not available because the earlier instruction has not yet completed, and we have discussed various techniques for overcoming data hazards. Later on we shall discuss control hazards and the techniques for overcoming them: there, control decisions resulting from earlier instructions are not yet made, that is, a decision is sometimes required on whether a branch will be taken or not taken, and because of that delay some stalls have to be introduced; how that can be minimized we shall discuss later. So we have discussed techniques for overcoming structural hazards and data hazards, and in the next class we shall discuss the superscalar and VLIW processors in more detail. Thank you.