In this session we are going to take a detour from the ARM7 processor to understand the five-stage pipeline used in the ARM9 family. This is only to understand the five-stage pipeline; we will then come back to ARM7 to study its assembly instruction set and the features that ARM7 supports. So only for this session are we touching upon the five-stage pipeline supported in the ARM9 processor. With this session we will be completing the second unit of the course, and from the next class onwards we will start on the ARM instruction set. The focus of this session is to understand the five-stage pipeline of ARM9, the various stages of the pipeline, the pipeline hazards, and how one class of hazards is avoided by a hardware solution called data forwarding. This will help you understand what a five-stage pipeline is and how it is implemented in ARM.

The ARM9 family was announced in 1997. Because of the five-stage pipeline, an ARM9 processor can run at a higher clock rate than the ARM7 family. I explained earlier that when the number of stages is increased, there is less work to be done at each stage, so each stage takes less time, and we can run the processor at a higher clock rate. Since ARM9 has five stages compared to ARM7's three, we can run the ARM9 processor at a higher speed; the extra stages help its performance, and we will see how. Once the clock rate of the processor has increased, the peripherals, and especially the memory that is integrated with the processor, should also be able to run at that speed, so the processor and the memory have to stay in synchronization with the clock rate.

Another change from the ARM7 family is that ARM9 is a Harvard architecture: it has separate data and instruction memories, compared to ARM7, which had a single memory serving both instruction and data accesses; that is called a von Neumann architecture. From ARM9 onwards, the ARM family followed the Harvard architecture: data and instructions are maintained in two different memories, accessed over two different paths, so that an instruction access is not blocked by a data access happening while an instruction, which would of course be a load or a store, is executing. The ARM9 also has MMU support, a memory management unit, so virtual memory can be supported with the help of an operating system running on the processor. Since ARM9 has separate data and instruction paths, an SoC, a system on chip, built using ARM9 will have a data cache and an instruction cache inside the chip, apart from the MMU. This whole thing is called a CPU. The first such CPU was the ARM920T; the T means it supports Thumb, and the ARM920T has an ARM9 family core inside, together with the data and instruction caches and the MMU. By now you know the difference between a CPU and a processor: the processor is just the core, the ARM core, and the CPU is the processor along with the other submodules that are part of it, which could be the data and instruction caches, the MMU, and any co-processor that the vendor integrates.

That is the introduction to ARM9. Let us see how the ARM9 five-stage pipeline differs from the three-stage pipeline; I have shown both for you to compare. The fetch stage is the same in both. But when you come to the decode stage of the five-stage pipeline, you see it does one extra thing, Thumb decompression, which we will not worry about until we talk about Thumb.
The decode stage decodes the instruction, and here the register file is also read. In the three-stage pipeline, all the control signals required for accessing the operands mentioned in the instruction were generated in decode but actually used in the execute stage, where the operands were read, the operation was performed, and the result was written back into the registers. Reading the operands from the register file, performing the operation based on the instruction, and writing the result back all happened in the execute stage of the three-stage pipeline. Here, in the five-stage pipeline, the reading of operands has moved into the decode stage: in the second stage itself, not only are the register fields decoded, the operands are also read. You may wonder what the difference between the two is. In the three-stage pipeline, decode only identifies which registers are to be accessed and prepares the control signals for that, passes those control signals on to the next stage, and the actual reading of the operands from the register file happens during the execute cycle. In the five-stage pipeline, decode not only generates the signals for accessing the operands, it also reads those operands and makes them available to the ALU. So when the instruction moves out of the decode stage, the operands have already been read from the register file, and in the execute stage only the operation to be performed by the ALU and the barrel shifter is carried out.

The execute stage of the three-stage pipeline is also split into multiple stages here. If you remember, in the ARM7 family, if memory needed to be accessed, say the processor was executing a load or a store instruction, even computing the memory address was done in the execute stage: based on the addressing mode, pre-indexed or post-indexed, the indexing arithmetic was done using the ALU, the address was generated, the memory was accessed, and the value was written into the register, or written from the register to memory; everything happened there in one cycle. Whereas here, that work is split across multiple stages: any address arithmetic is done in the execute stage, the actual memory access happens in the memory stage, and writing back the result is done in the write-back stage. That is for a memory access. If there is a data-processing instruction that does not involve memory, only some arithmetic on registers with the result written back into a register, then in the memory cycle there is no work to be done, but the instruction still has to wait until it reaches the write-back stage, where the result can be written into the register file.

These are the different operations performed at each stage, and they all need to be done in synchronization, in lockstep with each other: a common clock drives all the stages, and at every clock edge the information one stage has to pass on to the next is handed over. I will explain how that is done internally. This class is meant to make you understand what happens inside a pipeline and how it is implemented; we are taking the ARM processor as the example, but this is the method followed in any instruction pipeline implementation. The exact splitting of work across stages is not uniform across processors; what is shown here is specific to ARM, and there may be minor changes when you study other processors. So what is given here is essentially what is followed in other RISC processors as well, with minor changes in which operations are done at which stage. I hope you understand the difference between these two pipelines; let me explain in more detail.
The execution of an instruction has been broken down from three components to five. Because of this, the maximum work to be done at any one stage has come down, so the combinational logic sitting inside a single stage introduces less delay, each stage can complete within a shorter clock period, and we can run the processor at a higher clock rate. This improves performance. But it is important to note that it is not sufficient to just increase the clock rate of the processor: within a given stage, if memory has to be accessed, the memory should also be able to provide the data within that time. And not only memory; if there are other peripherals that run in lockstep with the processor, they should also operate at the higher rate. Otherwise we need a mechanism within the processor to interface with the slower peripherals. For peripherals we can afford to do that, but if a program is running from a memory, the fetch stage of the pipeline needs that memory to run at the same speed as the processor. So the instruction memory and the data memory should all run at the same speed as the processor, and as I mentioned, ARM9 commonly has separate data and instruction memories; you can either have separate memories, or a unified memory with separate caches. Either way, this effectively improves the performance of the processor: the effective number of clocks per instruction, the CPI, comes down because of the pipelining. This five-stage pipeline is often called the classic RISC pipeline; although ARM was not initially designed around a five-stage pipeline, it maps onto it well.

A key difference in ARM9 is the provision of three source-operand read ports and two write ports in the register file. If you recall, typical classic processors have only two read ports and one write port: from the register file you can have two reads happening and one write happening in a cycle. In ARM9 they added one more read port and one more write port, apart from the dedicated address-incrementing hardware for R15, the program counter. So there are three read ports and two write ports in the ARM9 register file. What you gain from this is that, since in the five-stage pipeline the operands are read in the decode stage itself (remember, ARM7 accessed the operands in the execute stage, whereas ARM9 accesses them in decode), up to three source registers can be read in the same clock cycle. At the same time, R15 can also be accessed and written back, because this needs to be done for every instruction: every fetch has to increment R15, so there is a separate path for it, and R15 can be updated without interfering with the other source operands being read by the instruction.

Now let us look at the various stages in detail, though I showed them in the figure. The fetch stage accesses the instruction from memory and places it in the pipeline. When that instruction moves from the fetch stage to the decode stage, the instruction that was fetched in the previous cycle moves on to decode, and a new instruction gets fetched from memory by the fetch stage.
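To make the port discussion concrete, here is a minimal Python sketch of a register file with three read ports and two write ports, as just described for ARM9. The class and method names are my own illustration, not real ARM hardware or RTL; the point is only that two writes (for example an ALU result and the PC update) can happen alongside up to three reads in one cycle.

```python
# Hypothetical sketch of a register file with three read ports and
# two write ports. Not real ARM RTL; names are illustrative only.

class RegisterFile:
    def __init__(self):
        self.regs = [0] * 16  # r0..r15 (r15 is the PC)

    def read(self, a, b, c=None):
        """Up to three source operands can be read in one cycle."""
        vals = [self.regs[a], self.regs[b]]
        if c is not None:
            vals.append(self.regs[c])
        return vals

    def write(self, port1, port2=None):
        """Two independent writes per cycle: e.g. an ALU result and
        the incremented PC (r15), without contending with each other."""
        reg, val = port1
        self.regs[reg] = val
        if port2 is not None:
            reg2, val2 = port2
            self.regs[reg2] = val2

rf = RegisterFile()
rf.write((2, 10), (3, 32))                    # set up r2=10, r3=32
a, b = rf.read(2, 3)                          # two reads in the same cycle
rf.write((1, a + b), (15, rf.regs[15] + 4))   # ALU result and PC update together
```

With only one write port, the second write in the last line would have to wait a cycle; the extra port is what lets the PC update proceed in parallel with the instruction's own result write.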
When the instruction is in the decode stage, it has to resolve which operands are used by the instruction. If you remember, four bits each were reserved for Rn, Rm, and Rd, the two operand registers and the destination register. Based on those fields and on what operation is to be performed, those registers are read and their values kept ready in separate programmer-invisible pipeline registers. The operands are not used in this stage; they are only read and kept for the next stage, the execute stage, to use. This information has to be passed on, because once this instruction moves from decode to execute, the next instruction fills the decode stage; decode is now handling the next instruction, so the operands read for the previous instruction would no longer be available there. They have to move along with the instruction to the stage where the actual execution happens. So there must be a way of passing these values forward. It is similar to a line of people standing and passing items from one hand to the next: as the instruction moves from one stage to the other, the next stage takes control of that instruction, takes whatever information it needs from the previous stage, handles it, and passes the result on to the stage after it. That is how the handover happens inside the hardware.

In the execute stage, based on the operation, if the barrel shifter needs to be used, the Rm operand is shifted first and then the ALU does its job, all within the clock cycle. If the instruction happens to be a load or a store, the execute stage has to compute the memory address using the ALU, and then in the memory stage a data memory access is done: the address computed in the execute stage is put on the address bus. If it is a store, both the data and the address are driven out; if it is a load, the address is put out first and the stage waits for the memory to provide the data. In the write-back stage, whatever value the memory gives, the value that is to be loaded from memory, is written into the register. So writing back into the register file happens in the final stage.

Now, what happens with an instruction like ADD r1, r2, r3? We are trying to add r2 and r3 and put the result into r1. How is it done in the pipeline? The 32-bit ADD instruction stored in memory is fetched by the first stage. In the decode stage, the processor understands that it is an ADD operation; there is no barrel shifter operation here, since r2 and r3 are simply to be added and the result put into a register. All of that information is available while the instruction is being decoded, but the result will only be available, and written into the register, much later. So this information has to be passed along through all the stages, so that when the ADD instruction lands in the write-back stage some three clock cycles later, it will know the result is to be written into r1. It is passed through intermediate, programmer-invisible pipeline registers; I will explain those shortly, and then it will become clearer. The decode stage has understood that it has to access the operands in r2 and r3, so it reads them; r2 and r3 are read in this stage, and the values read from them are passed as the two operands to the ALU.
What does the ALU do? The ALU also knows that the operation is an ADD, since that control information has come down the pipeline, and that no barrel shifter operation is required. So it does the addition and keeps the result. It cannot write into r1 yet, because this is only the execute stage; only in write-back can that be done. The instruction then runs through the memory-access cycle doing nothing, since it has no memory work, just waiting for that cycle to get over, and in the write-back stage r1 is written with the result computed in execute. So the information is passed on through the different stages, and finally the result gets written. This is the way the pipeline works.

This is the typical instruction pipeline execution diagram: instruction fetch, decode stage, execute stage, memory-address calculation or memory-access stage, and then writing back the result. Suppose this is the instruction at address 1000 and this is the instruction starting at 1004; remember, all instructions occupy four bytes, so if an instruction starts at 1000, the next one, assuming there are no branches and a purely sequential stream is being executed by the pipeline, is at 1004, then 1008, then 100C, then 1010. All these instructions are in different stages of the pipeline at the same time: when the instruction at 1000 reaches write-back, the instruction at 1004 is one stage behind it, and similarly the remaining instructions move through the pipeline. At the end of five clock cycles you will see that one instruction gets completed per cycle; you will see an instruction coming out of the pipeline every clock, provided there are no stalls in the pipeline. I hope you understand the flow of the pipeline here. Now, I told you that for an instruction to be executed, information has to be passed from one stage to the next.
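The five-stage flow just described can be sketched as a small timing table in Python. The addresses 0x1000, 0x1004, ... and the stage sequence match the lecture's example; the code itself is only an illustration of the ideal, stall-free schedule.

```python
# A toy timing table for the five-stage pipeline: instruction i occupies
# stage s during clock cycle i + s, assuming no stalls and no branches.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_table(n_instructions):
    rows = []
    for i in range(n_instructions):
        addr = 0x1000 + 4 * i          # sequential 4-byte instructions
        row = {i + s: STAGES[s] for s in range(len(STAGES))}
        rows.append((addr, row))
    return rows

total_cycles = 5 + 5 - 1               # n + stages - 1 cycles for 5 instructions
for addr, row in pipeline_table(5):
    cells = [row.get(c, "..") for c in range(total_cycles)]
    print(f"{addr:#06x}: " + " ".join(f"{c:>3}" for c in cells))
```

Reading the printout column by column shows the point made above: once the pipeline is full (from cycle 5 onwards), exactly one instruction reaches WB, and so completes, in every clock.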
Each stage, when it operates in a given cycle, works on what the previous stage produced, and what it computes has to be passed on through an intermediate register, which gets latched on the same clock edge; each stage then does its own operation based on what is coming in. It is similar to something from your programming paradigm: each stage could be thought of as a function, the functions are executed every clock, and parameters are passed to each function; every clock each of these functions executes, but the parameters it receives come from the pipeline register. As instructions flow in, each stage gets different parameters every cycle, acts on them accordingly, and passes its result on to the next stage.

Now, assume the stages take delays tau_1, tau_2, tau_3, tau_4, tau_5: five different values. In a pipeline, all the stages have to be in lockstep; they have to operate in synchronization, completely synchronized with each other, which means one stage cannot complete ahead of another, and none of them can take more or less time than the common clock allows. Based on the propagation delays of the logic gates used to implement each stage's function, you have to assume the worst-case instruction in each phase, compute those delays, and set the clock period accordingly. Effectively, you take the maximum of these stage delays and make that the clock period. Then every stage will have completed its work before the clock edge arrives, and the instructions flow without any stalls: a new instruction is issued every clock cycle, and on every clock cycle the results of each stage move into the pipeline registers.

These pipeline registers are invisible registers: they do not have names like R0, R1, R2, and you cannot access them from your assembly program. When I say invisible, I mean you do not even know where they exist, but they are required for instructions to be executed through the pipeline, and they are used by the hardware internally. And as I told you, the maximum of the stage delays is used as the clock period for the complete pipeline to work in sync.

What do these pipeline registers hold? They carry data and control values. What do I mean by data and control values? Data is what was obtained from the instruction at decode: in the example ADD r1, r2, r3, decode came to know that r2 and r3 are the operands to be used, so it accessed those operands; it understood that an ADD operation is to be performed on them, that there is no barrel shifter operation, and that the result needs to be written into r1. This information, which is available in the decode stage, needs to be passed all the way to the write-back stage, so that when the result comes out of the ALU it can be written into r1. How does the write-back stage know that this instruction wants its result in r1? That has to be passed through the pipeline: the pipeline register records that the destination register chosen was r1, that value travels along, and when the ADD result finally lands in the write-back stage, the hardware knows to write it into r1. What is the control information? The operation to be performed, ADD, is control information for the ALU, because the ALU is capable of performing whatever operation it is programmed to do; so the ADD operation has to be passed along, so that the ALU knows that on the two operand values handed to it, it is supposed to perform an addition.
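The clocking argument above, that the clock period is the maximum of the stage delays, can be worked through numerically. The tau values below are made up, chosen only to show how the clock period and the resulting speedup over an unpipelined design are computed.

```python
# Sketch of the clocking argument: with stage delays tau_1..tau_5, the
# pipeline clock must accommodate the slowest stage, and throughput then
# approaches one instruction per clock. Delay values are hypothetical.

taus = [2.0, 1.5, 3.0, 2.5, 1.0]    # ns per stage (made-up numbers)

clock_period = max(taus)            # every stage waits for the slowest one
single_cycle = sum(taus)            # non-pipelined: all the work in one clock

n = 1000                            # instructions executed
pipelined_time = (n + len(taus) - 1) * clock_period   # fill + one per clock
unpipelined_time = n * single_cycle
speedup = unpipelined_time / pipelined_time
print(f"clock period = {clock_period} ns, speedup ~ {speedup:.2f}x")
```

Note that the speedup is well short of the ideal factor of 5: the stages are unbalanced (the slowest stage sets the clock), and the pipeline fill cycles also cost a little. This is exactly why unequal stage delays matter when splitting the work.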
The execute stage then passes on the result. It does not know whether the values came from r2 or r3; it simply got two values, added them, and in effect says: this is the value I got, and you are supposed to write it into r1. (A small correction on the slide: the stage labelled as the operand stage should have been the execute stage; instruction decode is done first, then execute, then the memory stage, which does the address calculation and memory access, and then write-back, where the value finally goes out to the register file.) Any data value needed in the later stages must be propagated through the pipeline registers; the most extreme example is the Rd field, the r1 I mentioned, which has to travel all the way to the write-back stage to be written into the register. The pipeline registers do the job of passing these values from one stage to the next.

What about the control signals? The control signals are generated in the same way as in a single-cycle processor, but they then have to travel through the pipeline. For example: for the execute stage, the ALU operation, whether it is an ADD or a rotate or any other operation, has to be passed on, along with the selection of the ALU sources, where the actual operand values themselves are passed. For the memory stage, whether the access is a read or a write, and which address it has to use. The result could even be a source for the PC: if it is a branch instruction, the new address from which instructions are to be fetched is passed on here, as a source address for the program counter. And for the write-back stage: whether there is a register write at all, whether the value goes memory-to-register or register-to-memory, and which register is the destination, for example that the ADD's result needs to be written into r1. So these signals, recorded at instruction decode, flow through the different stages of the pipeline for the later stages to use, and they are passed along together with the data values required for the operation.

Now, we said that having more stages in the pipeline helps a lot: it improves performance, and the clock rate of the processor can be raised. But it is not that having more stages is all good. There are some issues associated with it that need to be handled; they are called pipeline hazards. A hazard is something that disturbs the smooth flow of instructions through the pipeline. Let us see what they are. Let me give you an example. Assume, and this is not true of ARM7 or ARM9, that there is a unified cache with a single port: both instructions and data are accessed from the same cache, assuming they are already available there, so the processor reads from the cache rather than from main memory. Because of the limitation that this cache has only a single port, it cannot serve an instruction read and a data read at the same time. You may wonder how those two accesses could ever be required in the same cycle. It is possible because, in a pipeline, different instructions are at different stages at the same time.
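As a sketch of what one of these programmer-invisible pipeline registers might carry for ADD r1, r2, r3, here is a hypothetical Python model. The field names are my own illustration, not actual ARM9 signal names; the point is that data values (the operands read in decode) and control values (the ALU operation and the destination register number) travel together from stage to stage.

```python
# Hypothetical contents of the ID/EX and EX/MEM pipeline registers for
# ADD r1, r2, r3. Field names are illustrative, not real ARM9 signals.

from dataclasses import dataclass

@dataclass
class ID_EX:                 # programmer-invisible register between ID and EX
    op_a: int                # data: value read from Rn (here, r2)
    op_b: int                # data: value read from Rm (here, r3)
    alu_op: str              # control: operation for the ALU
    dest_reg: int            # control: Rd, carried along to write-back

@dataclass
class EX_MEM:
    alu_result: int          # data: what the ALU produced
    dest_reg: int            # control: Rd still travels with the instruction

# decode has read r2=10 and r3=32 and latched everything later stages need
id_ex = ID_EX(op_a=10, op_b=32, alu_op="ADD", dest_reg=1)

# execute consumes the latched values; it never touches the register file
result = id_ex.op_a + id_ex.op_b if id_ex.alu_op == "ADD" else 0
ex_mem = EX_MEM(alu_result=result, dest_reg=id_ex.dest_reg)
print(ex_mem)                # dest_reg tells write-back where the result goes
```

Notice that `dest_reg` is simply copied from one register to the next: that is the "most extreme example" mentioned above, the Rd field riding along untouched until write-back finally uses it.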
There may be a load or store instruction in the memory stage, the fourth stage of the pipeline, accessing some data from memory, and during that very cycle some later instruction is being fetched from the same memory. Both will be trying to access the cache in the same cycle, and because of that there is a hazard: the two accesses cannot be done together, so one of them has to be stopped; we have to wait for one of them, the instruction fetch or the data access, to finish before the other can start. The pipeline bubble that I explained for the three-stage pipeline will happen here.

Next, hazards due to data dependence. There can be a data dependency between two instructions which causes a hazard. Let us see how that can happen. In a program you have a set of instructions, and normally they operate on a set of registers and are related to each other: we add two values, the next instruction uses that result and carries it forward into another operation, or stores it into memory, whatever it is. So there will be dependencies between instructions that come one after the other. If two instructions are data dependent, they cannot execute simultaneously; what I mean by that will be clear as we go forward, and I will give an example. The dependent instruction is waiting for the result of the previous instruction to become available, so it has to wait for the previous instruction to complete before it can start accessing its operands. The dependency may be through a register operand, or it could be through memory as well: the earlier instruction has to write some result into memory, and the next instruction depends on whatever was written there. This kind of data dependence between instructions can also cause a hazard.

There are basically three types of hazards. The first is a data hazard: an instruction may require an operand that is the result of a preceding, still-uncompleted instruction. By uncompleted I mean the previous instruction has not yet written its result back into the register; its write-back stage has not been done. The instructions follow one after the other with a one-clock separation, so when the previous instruction is in the execute stage, the next one arrives in the decode stage, where the operand is required, but it cannot get that operand until the previous instruction writes its result back into the register. The second is a structural hazard, a resource conflict: I gave the example of the unified cache, and a limitation on the register-file ports could also cause a resource conflict; because of it there will be a stall in the pipeline. The third kind is called a control hazard. This arises when there is a jump or a branch to another address, so the flow is disturbed: the program is no longer accessing instructions in sequence, whereas the pipeline always fetches instructions in sequence, incrementing the PC by 4 and fetching the next instruction into the pipeline. Suddenly a branch instruction has come into the pipeline and says: I need to access instructions from some other memory location. Now the instructions fetched so far from the sequential addresses have to be abandoned, and new instructions have to be fetched from the new location. Because of that there is a pipeline hazard here as well, and it introduces bubbles in the pipeline; it is not that it blows up the pipeline, but it introduces delays, which means the pipeline cannot give you its ideal throughput.
In the ideal case, a five-stage pipeline completes one instruction every clock: once the pipeline is full, you should see one instruction coming out per clock. With hazards that will not happen; completions get delayed. The common solution is to stall the pipeline, but there are hardware features in processors to avoid these bubbles. For example, consider a true data dependency: one instruction depends on the final outcome of the previous instruction. This is called a flow dependency, or a read-after-write dependency: the previous instruction has to write its result into a register, and the next instruction is supposed to read from that register, so until the previous instruction writes, the next one cannot read the result from the register. Take the example: ADD r1, r2, r3, which means r1 = r2 + r3. (The second instruction on the slide is shown as a MOV; it should have been a SUB or some other arithmetic operation; read it as ADD r4, r1, r5.) So we have two ADD instructions, one after the other. The operand required by the second ADD, r1, has to come from the first, so until that addition is done and the result is written back into r1, the second cannot proceed forward. This is a true data dependency: the second ADD can be fetched, but it cannot proceed until the previous ADD completes.

Now take an example, and I am not talking about any particular processor here, of an ADD followed by another instruction that has to take its value from r2, the register written by the ADD. As per our decode stage, the operands are read in decode. So effectively, in its decode cycle the second instruction needs r2 to be available. But look back at the timing: the execute stage is done for the ADD, it has computed the value, but it has still not written that value back into r2, because it has to wait until its write-back stage before r2 is updated. If the second instruction reads whatever is the value of r2 in that clock, it will get the old value; it will not get the latest value, which is yet to come from the write-back stage of the first instruction. Because of this, the second instruction cannot proceed. There is hardware interlock logic in the pipeline that recognizes that there is a dependency in the register operand values; it will stall the second instruction, which will not proceed further until the first instruction's write-back is done, and only then will it read r2 and continue. Alternatively, the hardware can support some way of passing this value directly to the appropriate stage, so that the dependent instruction can proceed without any delay; we will see shortly how that is handled properly.

Here is another example with a load instruction. (The slide calls it a store, but it is actually a load, since it is bringing values from memory into registers, so it should read LDM or LDR.) It loads values from memory into registers such as r7 and r1, and those registers then have to be read as operands by a following ADD. The ADD's operands will not be ready until the register values have been fetched from memory, so there is a delay here. And in the same sequence, r3 is written into by the ADD and then used by the next instruction, so that too is dependent on this value. Because of this, these dependencies must either be resolved, or the dependent instruction must be stalled for the program to operate correctly.
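The read-after-write check that the interlock logic performs can be sketched in a few lines. The tuple encoding of instructions below is my own simplification, not real ARM instruction encoding; the check itself is just "does the consumer read a register the producer writes?".

```python
# A minimal read-after-write (RAW) hazard check for the lecture's example:
#   ADD r1, r2, r3     ; writes r1
#   ADD r4, r1, r5     ; reads r1 -> true data dependency
# Instructions are (op, rd, rn, rm) tuples, a toy encoding, not real ARM.

def raw_hazard(producer, consumer):
    """True if consumer reads the register that producer writes."""
    _, rd, _, _ = producer
    _, _, rn, rm = consumer
    return rd in (rn, rm)

i1 = ("ADD", 1, 2, 3)   # r1 = r2 + r3
i2 = ("ADD", 4, 1, 5)   # r4 = r1 + r5 -- needs r1 before it is written back
i3 = ("ADD", 6, 7, 8)   # independent instruction

print(raw_hazard(i1, i2))   # hazard: stall (or forward) required
print(raw_hazard(i1, i3))   # no dependency: no interlock needed
```

In a real pipeline this comparison is done in hardware, between the Rd field travelling in the later pipeline registers and the Rn/Rm fields of the instruction sitting in decode; when it fires, the interlock either stalls the younger instruction or, as described next, forwards the value.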
properly. Okay, now how is it handled? This is handled by data forwarding; I will tell you what is done. Take an example: an ADD instruction takes the values of r1 and r2, executes the addition, and the result is available. But, as I told you, through the pipeline registers these values are passed on from one stage to the other, and the result finally gets written into the register file only in the write-back stage; only then would r2 normally be read from there. Now, the processor supports, in hardware, a way of forwarding the result ahead of writing it into the register file. You know that after the execution is done, within that clock cycle, the result is available at the output of the ALU. If there is a way this ALU output can be given to the decode stage — where normally the value is copied from the register file and kept in the pipeline register for the next stage to consume — then, if this value can be substituted before the write-back, there is actually no need of introducing a bubble.

This concept is called forwarding. It is different from the pipeline registers, which pass information from one stage to the next as the usual flow of an instruction through the pipeline. Apart from that flow, between two instructions in flight there is an information flow happening across stages, based on the operands used by the different instructions: this is called data forwarding. We know that the addition is performed already at the execute stage; the only thing is that the result would otherwise have to wait two clock cycles to be written into the register file for the next instruction to access, so you would have a delay of two clock cycles. Instead of that, if the hardware can recognize that there is a dependency between the two instructions — that the second is actually waiting for the result of the first — the result can be passed on from the execute stage to the next instruction, which is in the previous stage, and this
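bypass can be sketched as a simple multiplexer in front of the operand read. This is a toy model — the `read_operand` helper, the register values, and the `ex_mem` tuple are all invented for illustration, not ARM's actual datapath:

```python
# Sketch of the forwarding mux: when the previous instruction's
# destination matches the operand we are about to read, take the fresh
# ALU result from the EX/MEM pipeline register instead of the stale
# register-file value. Toy model; names and values are made up.

def read_operand(reg, regfile, ex_mem):
    """ex_mem: (dest_reg, alu_result) sitting in the pipeline register."""
    dest, alu_result = ex_mem
    if reg == dest:            # hazard detected: bypass the register file
        return alu_result
    return regfile[reg]

regfile = {"r1": 10, "r2": 99, "r5": 7}   # r2 still holds the OLD value
ex_mem = ("r2", 42)                        # ADD just computed new r2 = 42

print(read_operand("r2", regfile, ex_mem))  # 42, not the stale 99
print(read_operand("r5", regfile, ex_mem))  # 7, from the register file
```

The mux decision is exactly the hazard comparison from before; when it fires, the dependent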
instruction — the dependent one — can proceed without any delay. Is the correctness of the program an issue here? No: the value it has got is correct; the fact that it gets written into the register file only later does not matter, because you have got the correct value for the addition to be performed. So effectively, the program correctness is assured, and there are no bubbles introduced in the pipeline, because of which the performance and the effectiveness of the pipeline are better. But it needs additional circuitry and, of course, additional load on the hardware, and this is supported in ARM: ARM 9 also supports this data forwarding, as you can see when we look at the data flow of ARM 9.

Now consider a load, say LDR r2, [r1]: we are loading the value pointed to by r1 from the memory into the r2 register, and r2 is required by the following ADD operation. Recall when the memory actually gets accessed: the computation of the address happens in the execute stage — here it is a simple case, where there is no indexing or anything, but typically one cycle is used for computing the address — and the next cycle, the memory stage, is used for accessing the memory. Only at the end of the memory cycle is the value available; it is still not written into the register file, which happens only in write-back, but at least the value is available at the end of the memory stage. So if you want the ADD to run without any delay, it is not possible: in that same clock cycle we need the operand for the ADD to proceed to the next stage, but we do not have the value, because only the address computation is completed and we do not yet know what is stored in memory. So we have to introduce one bubble here, and then, with data forwarding, once the value has come into the processor from the memory in that clock, it can be fed to the ID stage and the ADD can proceed. So with one bubble — when you are introducing one bubble,
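the consumer still gets its value a cycle late. The difference between the two cases — an ALU result forwarded with no stall versus a loaded value costing one bubble — can be put into a toy formula; the stage numbering below is illustrative, not ARM 9's exact timing:

```python
# Toy in-order 5-stage model (IF=0, ID=1, EX=2, MEM=3, WB=4): count the
# bubbles a dependent instruction needs, with EX->ID forwarding enabled.
# Illustrative stage numbers, not the real ARM 9 implementation.

def stalls_needed(producer_kind):
    """Bubbles the consumer must wait for the producer's result.

    With forwarding, an ALU result is ready at the end of EX (no stall);
    a loaded value is ready only at the end of MEM (one bubble)."""
    ready_stage = {"alu": 2, "load": 3}[producer_kind]
    needed_stage = 2               # consumer wants operands entering EX
    return max(0, ready_stage - needed_stage)

print(stalls_needed("alu"))    # 0 bubbles: ADD result forwarded from EX
print(stalls_needed("load"))   # 1 bubble: LDR value known only after MEM
```

So even with forwarding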
this instruction can proceed, but it cannot proceed without any delay. We cannot feed anything earlier: in the previous case the result was available out of the ALU, so the operand could be forwarded to the next instruction, whereas here we are accessing it from the memory, and it is not available in the execute cycle; it will be available only at the end of the memory cycle, and it will be written into r2 only in the write-back cycle. So we have to wait at least one cycle to get the value by forwarding it into the subsequent instruction. What I want to show here is that forwarding cannot solve all the problems: it can solve those cases where the operands are coming from a register, but if an operand has to come from memory, the pipeline has to wait for one bubble.

Here is the same example with the solution: we are accessing the value from the memory stage, and data forwarding is applied from there. If data forwarding were not supported, we would have to wait for the write-back to happen and then read the value from the register file — one more cycle on top; but here, with forwarding, it is only a one-bubble delay.

Okay, now, what does all this mean for you? As an assembly programmer, you may be writing code for ARM 7 which is then ported to ARM 9, and there are some subtle differences between the two because of the way operands are fetched in the 3-stage and the 5-stage pipelines. Assuming instruction 2 is data dependent on a load instruction — the LDR being instruction 1 — then instruction 2 has to be stalled at least until the memory stage. Even when forwarding is implemented from the memory stage, that one bubble cannot be removed; the only way to avoid it is through the compiler. How do you avoid it? If the compiler is aware — because it knows the instructions and whether there is any dependency between them — it can introduce another
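independent instruction in between. A minimal sketch of that rescheduling idea — with made-up instruction tuples and a deliberately simplified independence test — could look like this:

```python
# Sketch of the compiler trick: after an LDR, hoist an independent
# instruction into the load "delay slot" so the dependent one need not
# stall. Instructions are hypothetical (mnemonic, dest, sources...) tuples.

def schedule(prog):
    prog = list(prog)
    for i, ins in enumerate(prog[:-1]):
        nxt = prog[i + 1]
        if ins[0] == "LDR" and ins[1] in nxt[2:]:      # load-use pair
            for j in range(i + 2, len(prog)):          # find a filler
                cand = prog[j]
                independent = (ins[1] not in cand[2:] and
                               cand[1] not in nxt[2:] + (nxt[1],))
                if independent:
                    prog.insert(i + 1, prog.pop(j))    # move it between them
                    break
    return prog

prog = [("LDR", "r2", "r1"),        # r2 <- mem[r1]
        ("ADD", "r3", "r2", "r4"),  # uses r2: would stall one cycle
        ("MOV", "r6", "r7")]        # independent of r2 and r3
print(schedule(prog))               # LDR, MOV, ADD: MOV hides the bubble
```

A production compiler checks many more constraints (memory aliasing, condition flags, every instruction between the pair), but the principle of filling the load delay slot with unrelated work is the same — the compiler inserts the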
instruction without disturbing the flow, without disturbing the correctness of the program. Instead of placing the instruction which depends on the previous LDR immediately after it, if the compiler can put some other instruction there — one which can be executed without any dependency — then the bubble can be avoided. That needs to be done by the compiler, by scheduling in between an instruction that does not depend on the load.

Now, in the 3-stage pipeline this problem does not arise, because computing the address, accessing the operand value, and writing into the register all happen in the execute stage, and the next instruction coming in will get its operands easily, since it reads them only in that last stage. So the 3-stage pipeline does not have this kind of dependency; only the 5-stage has it, because there the operands are read in the second stage itself, whereas in the 3-stage pipeline reading the operands and writing the result back all happen in the third stage, the execute stage. So there will not be any issue of an operand not being available for the next instruction to proceed — this is specific to the 5-stage pipeline.

Okay, now let us see how it is implemented in the 5-stage design. Since you have understood the pipeline registers and data forwarding, you will be in a position to understand the whole diagram. It is a little different from what you saw for the ARM 7 processor. You have an instruction fetch within the processor, and the instructions flow through it. You see that there is the address incrementer, which is now specifically incrementing the value by 4; the address goes to the instruction cache because, if the instruction is available in the cache, it will be read from there, or else it will go to memory, and through the cache the instruction comes into the processor. That is the first stage. Now, whatever you see in grey — those are all pipeline registers, which I showed you in between the stages; through them you can pass on
the data as well as the control information from one stage to the other. These registers are not directly visible to the programmer, but they do the very important job of passing information between the different stages.

Now, notice the PC behavior in the 5-stage pipeline compared with the 3-stage one. In the 3-stage pipeline there is a fetch, then a decode, then an execute. If you remember, I told you that when a particular instruction is getting executed in the third stage, the PC value is PC + 8: during the first stage the PC points at the instruction itself, in the second stage it is PC + 4, and when the instruction comes to the third stage it is PC + 8 — that is the way the PC keeps incrementing. So PC + 8 is the value of the PC when an instruction is in the execute stage, and in the 3-stage pipeline any instruction having a dependency on the PC value accesses it from the register file at that point. Whereas in the 5-stage pipeline, an instruction will see PC + 4, because the operand access happens immediately after the fetch stage; it does not happen after the third stage. So what the ARM processor family has done, to keep the compatibility between the ARM 7 and ARM 9 families, is this: though the PC is really only incremented by 4, it is incremented by another 4 and then written into r15, so that a program originally written for ARM 7 behaves the same. Suppose you have written code for ARM 7 and you are running it on ARM 9: you, or the assembler, would have assumed that r15 reads as the instruction address plus 8. So, to give the same value that the code expects to have in r15, the PC is
incremented by one more 4 and then written into r15 — that is why you see this in the diagram. The rest of the datapath works the same way: the immediate value is also passed on, in case it has to be shifted, across to the ALU, and based on the load or store you get the address arithmetic, pre-index or post-index — we will look at this in a little more detail when we talk about addressing modes. The address computation is happening here because, as I told you, in the execute stage the address computation also happens if it is a load or store; the actual memory access of the load or store happens during the mem cycle, and writing that value back into the register happens in the write-back stage. If it was a typical ALU operation, it comes here, the ALU value is computed, and then it is passed on through the pipeline registers to the register write port, where it is written back into the register file. That is the way the datapath operates.

Now, I showed you that there is forwarding: whatever value you get during the execute stage, you feed it back toward decode. That means you do not wait for the result to be available in the register file; it is taken out from the ALU, and in the next clock cycle, if the next instruction has a dependency on this data, the data is available when that instruction enters the decode stage. So the operand movement I showed you from the execute stage to the decode stage is this forwarding, and that is how the values reach the operand registers.

So this is all about the pipeline. Now, what is the PC behavior? I will explain it with the diagram. In the 3-stage pipeline the PC was different: it was PC + 8 when the operands were read, whereas in the 5-stage pipeline it is only PC + 4. Because of this, to emulate the ARM 7 behavior, ARM adds 4 more to make PC + 8 and feeds it in: the incremented PC value is fed directly to the register file in the decode
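stage. In numbers, the compatibility rule works out as follows; the instruction address here is made up:

```python
# The ARM 9 compatibility rule in numbers: the instruction in decode
# sits at PC+4 relative to fetch, but code written for ARM 7 expects
# reading r15 to yield (address of current instruction) + 8.
# Hypothetical address; word size is 4 bytes.

instr_addr = 0x8000            # address of the instruction reading r15
pc_in_decode = instr_addr + 4  # 5-stage: PC has advanced once by fetch
r15_seen = pc_in_decode + 4    # extra +4 added before writing into r15

assert r15_seen == instr_addr + 8
print(hex(r15_seen))           # 0x8008, the value an ARM 7 program expects
```

That extra +4 is applied in the decode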
stage, bypassing the pipeline register, because the pipeline register holds only PC + 4 but we want PC + 8, in sync with the earlier code. So PC + 8 is passed on, so that when the instruction accesses r15 it gets the correct value it assumes to have there, which is PC + 8 as per the ARM 7 family of processors. That is the little difference in the behavior of the 5-stage pipeline, and this is how ARM handles it.

Now, what is the pipeline paradox? Do you think that pipelining actually makes an instruction run faster? Actually not: with more stages, an individual instruction takes a longer time to complete. One typical example: I told you that because of the 5-stage pipeline, even an ADD or any normal data-processing instruction, which does not involve any memory access, has to wait one extra clock cycle to write back its result. That means we are introducing a one-cycle delay because of the 5 stages, so it is in fact increasing the execution time of each instruction. But how do we get better performance? Because it increases the throughput: you get five instructions in flight, and every clock cycle one instruction comes out of the pipeline, so you get roughly five times the performance when you split the operation into multiple stages and execute them in an overlapped way. You get a throughput improvement, but we are not executing any single instruction faster than a non-pipelined processor would.

So what is the relation between the ARM architecture and pipelining — how does the ISA of ARM support pipelining effectively? Because all instructions are 32 bits long, fetching them is easier, and decoding them is easier because the operand fields are in fixed positions, so the decoding logic is simpler. Then we have a register-to-register architecture: the arithmetic operation in the execute stage can be completed in one cycle, including the barrel-shifter operation. And there is no delay because of memory, since ARM has removed all data-processing instructions using memory as another
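operand. Before moving on, the latency-versus-throughput trade-off just described can be put in numbers; all the figures below, apart from the stage count, are illustrative rather than real ARM timings:

```python
# Pipeline paradox in numbers: a deeper pipeline raises the latency of a
# single instruction but improves throughput. Figures are illustrative.
stages, cycle_ns = 5, 2.0              # assume 5 stages, 2 ns per stage

latency = stages * cycle_ns            # one instruction takes 10 ns end to
                                       # end, slower than a shallower core
n = 1000                               # a long run of instructions
total = (stages + (n - 1)) * cycle_ns  # fill once, then 1 result per cycle
throughput = n / total                 # approaches 1 instruction per 2 ns

print(latency, round(throughput, 3))
```

In steady state the pipeline retires about one instruction per cycle regardless of the per-instruction latency, which is exactly the point: register-only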
operand: all the data-processing instructions use only registers for their operands, so there is no delay because of memory. Whereas a CISC-style processor's data-processing instructions can also have a memory operand — directly accessing the operands from memory — because of which there will be a delay; and such processors also have variable-length instructions, because of which the decoding logic is complex. If you remember, the value of tau — how much time it takes for a particular stage to complete — will be larger because of this, and since you have to give the same time to all the stages, your whole pipeline becomes slower; the pipeline is only as fast as the slowest of its stages. Because of this, implementing a 5-stage or deeper pipeline is easier for ARM. So this is how an instruction set architecture affects pipelining, and that is the relationship I am trying to explain here — why it is easier to implement pipelining in this processor.

Okay, so we have come to the end of this class, where we understood what the 5-stage pipeline is and the various stages of the pipeline, how the 5-stage pipeline is organized internally in ARM, and how data forwarding between the different stages helps in completing the instructions moving through the pipeline. We also understood what pipeline hazards are and how they are solved in hardware. So thank you very much for your attention; this brings us to the end of unit two, and we will be starting with the instruction set of ARM 7 from the next session. Thank you very much for your attention.