We will look at the data flow model: how an instruction gets decoded inside the ARM core, how a particular instruction is executed by interacting with the different registers in the register file, and how the result is written back to a register. Then we will touch upon the pipeline, which is the new concept I will be introducing to you here. In this lecture we will talk about the ARM 3-stage pipeline, which is the ARM7TDMI 3-stage pipeline, and about the different stages and their limitations.

To start with, let us refresh the fundamentals on which the ARM processor is designed. All of you are aware that ARM instructions are all 32 bits long. They are aligned on word boundaries, that is, 4-byte boundaries: an instruction can start only on a 4-byte boundary, and all of them are 32 bits in length. As you are aware, this is a RISC processor with a load-store architecture. That means anything to do with memory is done using a load or store instruction: if the processor wants to read a particular data item from memory it has to use a load instruction, and if it wants to save a register value to memory it has to use a store instruction.

All of the data processing instructions, for example arithmetic or logical operations, can work only on register operands. That means if the processor wants to add two values which are stored in memory, it first has to load those contents from memory into internal registers, which we saw in the last session, and then it can perform the addition on the contents of those two registers. Once the result is in a register, it has to be written back into memory, wherever the result
is supposed to go, by using a store instruction. So an addition cannot be performed by directly picking up an operand from memory, which some other processors do support.

The ARM core is a 32-bit processor, and most instructions treat the registers as holding a signed or unsigned integer. I have mentioned signed and unsigned earlier: the processor does not recognize whether the value stored in a register is signed or unsigned. If the instruction allows it, it updates the condition flags, that is, the carry, zero and overflow flags and the sign flag, and based on whether the operands are signed or unsigned, the programmer is supposed to look at the relevant condition bits to understand the outcome. All the data processing instructions work only with 32-bit numbers. The memory and the processor do have a provision to read 8-bit or 16-bit data, but when it comes into a register it gets expanded to a 32-bit value, and any data processing instruction then operates on it assuming it is a 32-bit value. All the decoding of instructions is done using hardware circuitry; this is to make decoding faster.

Now let us see what a particular instruction looks like. As I told you, the instruction is 32 bits, so I have given a sample instruction: ADD Rd, Rn, Rm. This is the normal convention for the way an assembly instruction is represented in ARM. ADD shows that it is an addition; Rn and Rm are the two operands, they get added, and the result is put into Rd, the destination register. These registers could be anything from R0 to R15, and you have to remember that R15 has some special restrictions because that is the PC, the program counter. Other than that, any of the registers could be used here to perform this operation.

How is it encoded into a 32-bit instruction? The total width is 32 bits. I have not shown all the intricate details of this instruction; this is a general
format, where the function to be performed occupies f bits in the instruction, and then there are three operands: two source operands and one result. So three registers are to be represented in the instruction, and each one of them occupies n bits; because they are all internal registers, all of them require n bits to encode. In this three-address format, all the register specifiers are given: two are source operands, as I mentioned, and one is the destination register. There are some restrictions on which registers can be used in a particular position, based on the type of instruction; we will cover that when we go through each instruction in detail. So this is a general format, and these three can be any of the available general purpose registers.

I told you that each of the registers is represented using n bits. You know the general purpose registers which are available in the ARM processor. If you recall, in the last discussion we talked about the different modes, user mode, supervisor mode, FIQ and IRQ, and we also touched upon how many registers are visible overall in the different modes. So I want you to spend some time, maybe one or two minutes, possibly even less, and work out how many bits are required to encode each register addressed by these three operands. Is it 32 bits, 8 bits, 4 bits or 2 bits?

The answer is option C, that is, 4 bits to represent a register. Why do we need 4 bits? All of you know how many values can be represented using 4 bits: 0 to 15, that is 16 in total. If you recall the total number of registers in the ARM register file across all the modes, overall there are more than that, but when the processor is operating in a particular mode, it is going
to see only a subset of the registers which are physically available, and that subset happens to be 16. So 4 bits are sufficient to encode which register is used for each of the operands and the destination, and this has to be a 4-bit value.

Now let us look at the data flow model of ARM. This diagram shows the internal structure: how instructions get into the processor and how they are processed. Let me give an overview. This is the data in; where is it coming from? It is coming from memory. If you recall, the ARM7TDMI which we are talking about is a von Neumann architecture. A von Neumann architecture means that both instructions and data come from the same memory, so data as well as instructions come into the processor on this side. The instruction gets decoded, and based on the opcode and the operand registers mentioned in the instruction, the decode logic generates control signals to transfer information between the registers. So that is the decoding logic. Then during execution some of the registers get opened up, the data flows into the proper registers, and the instruction gets executed.

Leaving that part aside, let us understand the other portion of the circuitry. For a memory write, anything that is to be written into memory goes out through this data bus; the contents of one of the registers need to be written, so the data flows from here, through this path. For a memory read, if you recall, in the last discussion we said an 8-bit or a 16-bit content can come from memory and optionally be sign extended, based on the type of data coming in, and what is finally written into a register is 32-bit data, because, as I told you, the instructions operate only on 32-bit values in the registers. So even if the memory interface allows reading 8-bit or 16-bit data, that either gets
sign extended or has its top portion filled with zeros, and the complete 32-bit content of the register is used for any arithmetic operation inside the processor.

Now, a key register is the PC, the program counter, which is also part of this register file. Whenever the processor wants to access the next instruction, it has to send the content of R15, the PC, into an address register which is interfaced to the address bus of the memory, and the address goes out. The address which is going out could be the address of an instruction or the address of data. For a memory read, the address goes out here and the data comes in. For an instruction, the address goes out from here, and the instruction comes in on the data bus and goes into the decoder. So this is the flow of address and data.

Why is there an incrementer here? The PC needs to be incremented after each instruction is accessed. By how much does it have to be incremented? It has to be 4, because we are reading a 32-bit instruction from memory, that is, 4 bytes. When the next instruction needs to be read, the address is incremented by 4 and that value is put back, so that the processor can keep on reading instructions from memory. It can perform only instruction accesses provided there is no memory access in the instructions. Suppose we have a sequence of instructions which are all arithmetic or logical, like add, subtract or OR, operating internally on register contents with no need to access memory. Then the processor can keep sending the PC value to fetch the instruction and keep incrementing it, and if there is no jump or branch to some other address, it can access the instructions in memory sequentially and keep processing them inside. But no instruction
sequence will really be like this, because we need some data to be processed, and that means writing back into memory or reading from memory. So the instruction flow will be interspersed with some data accesses. During that time the instruction access is deferred and the data access is done. Each of these happens in one cycle: in one cycle the processor either reads an instruction or reads or writes data in memory. So this is the instruction flow.

Now, why does this value need to go here? Because this is actually the content of the PC, which needs to be written into R15, which is part of the register file. So the instruction address goes not only into the address register, it also goes into the register file to be written into R15. Look at how many ports the register file has: two operands are read from here, one result is written into the register file, R15 is read from here, and the incremented R15 is put back into the register file. If you remember, I mentioned that the register file in ARM has two read ports and one write port for the registers other than R15, and there is one special read port and write port for R15. So all these operations, reading two operands, writing the result back, reading R15 and writing it back, can happen in one cycle, because the register file has that many ports available.

This bit of introduction is enough; let us go through it one by one to summarize what I said. If we are reading 8-bit or 16-bit numbers from memory, they need to be extended to 32 bits based on whether the value is signed or unsigned. If it is signed, the MSB is extended into all the higher bits; otherwise the higher bits are filled with zeros. How do we know whether what is read from data memory is a signed or unsigned value? It can be indicated in the instruction, and the
programmer knows what kind of value was stored, signed or unsigned. Based on that, when reading that particular content from memory, the programmer can mention in the instruction whether it is a signed or unsigned value, and accordingly the sign extension circuitry comes into play; otherwise the value is written directly and the higher bits are filled with zeros.

The two source operands for any instruction, A and B, are read through the two read ports of the register file. The PC value, as I mentioned, gets read from here and is latched here, and the incremented value is written back into the register file. Suppose the PC wants to access the instruction stored at address 1000. 1000 is written here and goes out on the address bus, and the incremented value, 1004, overwrites it in the next cycle and also gets written into R15. So after reading the instruction at 1000, 1004 gets written in through this path, and the processor can access the instruction stored at 1004. That is the way instructions get accessed: the incremented address is written back into the address register as well as R15.

Now let us see what the ALU does. The ALU takes two operands; Rn and Rm could be anything from R0 to R15, subject to some restrictions based on the instruction being executed. Based on the opcode, the ALU knows which operation it is supposed to do. The contents of the relevant registers are put on these buses by the decode logic after decoding the instruction, then the operation is performed and the result is written back into the relevant register. Now, it could be a data processing instruction operating on registers, or it could be a load or store instruction. For a store, that means a particular register's content needs to be read from here and written to a particular address. Now, the address
may come from one of the existing registers, and there are different addressing modes, which we will talk about when we cover addressing modes. But suppose there is an addressing mode which involves incrementing an index: then the incrementing also happens here, that address goes out, and the data to be written goes out through this path. So during a data read or write this circuitry comes into the picture, and the data to be written goes out from the register file to the data bus through this path. That is the explanation of how load and store instructions get executed here.

Another important feature is the barrel shifter. I told you that a data processing instruction reads two operands from the register file. The second operand is called Rm within ARM; this is just a convention, nothing special about it. Rm is the only operand which goes through the barrel shifter. Within the cycle, the content of that register gets shifted or rotated by the barrel shifter, and the output is fed to the ALU. Suppose an addition is to be performed with R0: R0 comes from here, and the instruction says that the second operand, R1, is to be read and shifted left by 4 bits and then added to R0. The shift by 4 is done by the barrel shifter, the shifted value is given to the ALU, and the addition is performed. In the same instruction you say that the result needs to be written into some register, maybe R2, and the output is written into R2. All of this happens in one cycle, and when we say cycle we are referring to MCLK, the clock that is fed into the processor. So the complete operation is done. Optionally, based on the instruction, the barrel
shifter will be used, or it can be bypassed. Suppose you say ADD with R0 and R1 and want the result written into R2: then R0 and R1 come directly to the ALU and the result is written into R2. So this is the way the barrel shifter is used in the processor. Together, the barrel shifter and the ALU can calculate a wide range of expressions.

For load and store, as I mentioned, there are non-sequential accesses and sequential accesses to memory. In a sequential access, multiple consecutive locations are read or written in memory by one single instruction. Assume there is an instruction which says: starting from address 1000, access the following sequential words in memory, so that the contents of a set of registers, say R0 to R5, are copied to consecutive locations in memory. What happens is that the address is automatically incremented by 4, starting with the first access, and the remaining accesses are completed in sequential mode. So to perform sequential accesses of memory, the incrementer and the other registers are used for the data access.

Now, we mentioned that a whole lot of operations get done in a single cycle. What does a single cycle mean? When I say MCLK, it starts from here and ends here. Normally you would have seen a clock going up and down, and then again up and down, and from that single clock signal we say that one cycle is done. But in ARM it is done in a very innovative way. There are two signals, which are offset by a small time gap and do not overlap. Both are derived from a common source inside the processor: after this signal goes down, after a finite gap, the other signal goes up and then comes down again. These two signals together correspond to one clock cycle. Now, what happens during this time, and why is it
required? First of all, the entire circuitry in the ARM data path is not edge triggered. It is all level triggered, implemented using transparent latches, which you might have heard about in digital design courses. When this signal is high, one part of the circuitry gets activated, and whatever is on the internal buses gets transferred to that circuitry, whether it is a register, the ALU or something else. Then another set of circuitry in the processor gets activated when the other signal is high. So they are all enabled by these signals. Why is this required? Suppose you used a single clock, with one set of circuitry active while it is high and the remaining circuitry active when it goes low. Because of the transition there can be a race condition, where one set may not have stopped before the other starts, so there is a possibility of a mix-up. So they decided to build the system using two different clock phases which do not overlap at all: anything which is on during this time is completely off otherwise, and anything else that needs to be on comes up during the other time.

Now let us see how this is useful for reading values. Assume these two signals are coming in and the barrel shifter is ready to accept the data coming in through Rm. When the phase one clock goes up, the register read time starts: the contents of the selected registers are available on the buses and the ALU reads them. One of the operands, Rm, can either pass through the barrel shifter and then come to the ALU, or be read directly by the ALU. During this time the ALU is open to receive the operands coming from the register file. Rn, which is directly connected from the register file to the ALU, is available straight away, whereas the next
operand, Rm, could go through the barrel shifter and might arrive with some delay, so the ALU stays open for this time for both operands to be latched. This allows time for the barrel shifter to do its job and make its result available in the same clock. Immediately after getting both operands, the ALU operates on them, whether it is an add, a subtract or any other operation, and the result becomes available here. During that time phase two is active, and during phase two the register which is supposed to receive the ALU output is open, so the result gets written into that register. So from the start to the end of this period, the operands are made available to the ALU, the Rm operand can go through the barrel shifter on its way, the ALU performs the operation during this gap, and when the result is available before the end of phase two it gets latched. All these operations are performed in one cycle.

A little bit of explanation here for your understanding: the register read buses are dynamic, and they are precharged during phase two. What I mean by precharged is that the buses connecting the register file and the ALU are of a special nature. They are called dynamic because the buses get precharged, and then, when a particular register drives them, depending on whether each bit of the register content is zero or one, the precharged bus lines get discharged wherever there is a zero, and the result gets latched at the beginning of the next clock. Basically, this is provided so that the preparation for the data access can be done ahead of time: at the end of phase two, whichever buses are required to feed data in the next phase are ready. So when phase one goes high, the selected registers
discharge the read buses. When we say the selected registers discharge the buses, it means they put their values on the buses. The buses were already charged in the prior phase, and during phase one, when a particular register needs to drive a bus, it discharges specific lines of the bus based on the content of the register, and the values get latched into the circuitry which is enabled by this phase of the clock, that is, whichever transparent latches are enabled. So this is the way the two-phase non-overlapping clock functions in the ARM processor for transferring data between the registers and the ALU or barrel shifter.

The ALU has input latches which are open during phase one, as I mentioned, allowing the operands to begin combining. What I mean by that is that the ALU gives sufficient time for both operands to become available. In case the second operand needs to come through the barrel shifter, the barrel shifter takes a few nanoseconds to perform its operation, so the ALU allows both operands to settle after the barrel shifter operation is done, before latching the values at the end of phase one. The latches close at the end of phase one, so that the phase two precharge can happen for the next phase one. The ALU then continues to process the operands through phase two, producing a valid output towards the end of phase two, which is latched into the destination register at that point.

So, to summarize the operations done within one clock cycle: the ALU reads the operand Rn; the second operand, Rm, comes either directly or through the barrel shifter; the ALU operation is performed; and the result is sent out and latched into Rd. These ALU operations correspond to the execution part of an instruction. Please remember that this is only part of an instruction, and it all happens in one clock
cycle. So this is the summary of how an instruction gets executed; I have explained the clock as well as the data flow inside the processor. Now let me introduce you to the pipeline concept and how it maps onto the data flow that I explained just now.

First, the generic definitions of pipelining. All of you might have heard something about pipelining in your undergraduate classes, but even if you do not recall anything at all, this will be a good refresher to understand what a pipeline is. It is an implementation technique where multiple instructions are overlapped in execution, and it is not visible to the programmer. From the programmer's perspective an instruction gets executed in one clock cycle, but internally it is divided into multiple steps which get executed separately; we will talk about how that is done. Each step in the pipeline is called a pipeline stage or pipeline segment. The pipeline machine cycle is the time required to move an instruction from one step of the pipeline to the next. The throughput of a pipeline is the number of instructions that can leave the pipeline in each cycle. Just keep these in mind for a moment; they will become clear when I explain the concept. Latency is the time an instruction needs to pass through the pipeline. Based on the number of stages, the latency could be three cycles or five cycles, depending on whether it is a three-stage or a five-stage pipeline.

Before we see how it is implemented, let us first see why it is required. Suppose there is a particular task. By task I mean anything; it could be an add instruction or any other instruction, but for simplicity's sake let us take an add instruction. Suppose that on a processor which does not have any pipeline it takes P seconds. Then you divide the operation of the add instruction. What are the operations involved in performing an add? First of all, the add instruction needs to be
fetched from memory, and it needs to be decoded, as I showed you with the decode logic inside the processor. Based on the decoding, the relevant registers are accessed, the operation, whether an add or a subtract, is performed during the execute cycle, and the result is written back into the register; all of that happens in the execute cycle. Now, instead of doing all this in one go, the work is split into three stages, fetch, decode and execute, which happen in three different circuits.

Assume P is the time taken for all the operations, and to perform the same thing in stages you divide it into k sub-tasks. Any task which is divided will take less time per part: if the whole thing takes three seconds and you divide it into three stages, then each of them may take one second. So the time taken per stage comes down to the total time originally taken for the operation divided by the number of stages, that is, P/k. What does this mean? It means that any data transfer between two stages of the pipeline needs to happen within this time, which is much shorter than the original time P.

Now suppose there are sequential instructions which keep coming into the pipeline. Assume the processor has just been powered on, so the pipeline is empty. When the first instruction comes in, it will be only in the fetch stage and nothing comes out. In the second cycle, the second instruction is fetched and the first instruction moves to the decode stage. In the third cycle, the third instruction gets fetched, the second instruction enters the decode logic, and the first instruction comes to the execution stage.
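The way the empty pipeline fills up can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the actual ARM7TDMI hardware: every instruction is assumed to take exactly one cycle per stage with no stalls or branches, and the names `simulate`, `pipelined_finish` and `unpipelined_finish` are hypothetical helpers made up for this sketch.

```python
# Idealized 3-stage pipeline model (assumption: one cycle per stage,
# no stalls, no branches, no multi-cycle instructions).
STAGES = ("fetch", "decode", "execute")


def simulate(instructions, stages=STAGES):
    """Return one snapshot per cycle: {stage name: instruction or None}."""
    k = len(stages)
    n = len(instructions)
    timeline = []
    for cycle in range(n + k - 1):
        snapshot = {}
        for depth, stage in enumerate(stages):
            # The instruction sitting `depth` stages deep entered the
            # pipeline `depth` cycles earlier.
            idx = cycle - depth
            snapshot[stage] = instructions[idx] if 0 <= idx < n else None
        timeline.append(snapshot)
    return timeline


def pipelined_finish(p, k, n):
    """Completion time of each of n instructions when the total work P
    is split into k stages of P/k each."""
    return [(k + i) * (p / k) for i in range(n)]


def unpipelined_finish(p, n):
    """Completion time of each of n instructions without a pipeline."""
    return [(i + 1) * p for i in range(n)]


program = ["I1", "I2", "I3", "I4", "I5"]
timeline = simulate(program)
for cycle, snapshot in enumerate(timeline, start=1):
    print(cycle, snapshot)

# Using the lecture's example numbers: P = 3 seconds, k = 3 stages.
print(pipelined_finish(3.0, 3, 5))
print(unpipelined_finish(3.0, 5))
```

In this model the first instruction still needs the full P seconds to come out, but after that one instruction completes every P/k seconds, whereas the non-pipelined model completes one every P seconds.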
Now, after the end of the third cycle, the first instruction which came in has completed execution. After the fourth cycle and the fifth cycle, you will see that instructions keep getting completed, coming out every P/k time frame. Earlier the instructions were coming out every P seconds; now one comes out every P/k seconds. That means execution is faster, or at least the throughput is now higher than in the earlier case where there was no pipeline: after every P/k, one instruction or task gets completed. Pipelining is most important for instruction processing, because the instructions are sequential: they are accessed from memory, put into the pipeline, and come out of the pipeline after their completion.

Let me explain this with a picture. In a non-pipelined implementation, one instruction comes in and takes the whole time P, whatever the unit is, let us say seconds for the sake of discussion, and then it comes out. But if the work is divided into three stages, the original time is divided by three, and after the first three cycles you will see one instruction coming out of the pipeline every P/3 seconds. So the effect is a throughput three times that of the non-pipelined case.

What happens to the clock? MCLK, which originally spanned the whole operation, now has to be much faster. Every clock, the information about a particular instruction is passed on to the next stage, so the clock has to run three times faster than previously. And if the pipeline is making execution faster, the memory accesses to get instructions also have to become faster. So the memory needs to be better
compared to the previous case, so that it can supply instructions at the speed at which the processor wants them. So the memory also needs to get faster.

Let me explain how the three-stage pipeline is implemented in the ARM core. What is the fetch stage? In the fetch stage, an instruction is fetched from memory and placed in the instruction pipeline. The 32-bit value read from instruction memory is brought in and kept inside the data flow machine. It is not in a visible register: the instruction is kept in an instruction register or some other circuitry which is not visible to the programmer.

The decode stage then interprets the instruction based on its format. I told you about the three-operand format, and there is a function code which tells what operation the instruction expects the processor to perform; that information is decoded. Why do we decode the instruction? Because only the instruction has the information about which registers are the operands, which register is the destination register, and what the operation is. All of that becomes available when the instruction is decoded, and it has to be communicated to the data flow during execution, so that the relevant registers get opened up, the operands are passed to the ALU, and the results are written back to the register file. The control signals for putting a particular register's content on a bus, and for writing the ALU output into a particular register, all come from the decode logic. So the execute stage always follows the decode stage. The execute stage owns the data path, which means the flow of data between the register file and the ALU happens during this time, and what
signals are to be used during the execute stage is decided by the decoding here. So the decode logic passes that information on to the execute stage, and it gets executed. During execution, as I told you, the stage owns the data path: the particular instruction really gets executed here, and the values are written into their respective registers as per the instruction that was read.

Now it is a good time for another simple question: MCLK corresponds to which one of the options here? This is a pipeline where every stage takes P/3 seconds, or whatever time unit you can think of. Does MCLK have to run with a period of P, P/3, 3P, or none of these? Please spend some time and choose one of the options.

The correct option is B. Why? Because MCLK is a signal which also goes to the memory. As I told you, the whole two-phase signal corresponds to one MCLK. Within that clock the memory has to provide the instruction, and within that clock the execution has to happen. P/3 is the time taken by every stage for a particular instruction, so the correct option is P/3.

Just to summarize what happens in the instruction pipeline: this is the fetch stage. The decode stage works out what operation is to be performed and which registers are to be selected for the operands. During the execute stage, as I mentioned in the data flow explanation, the registers are read over the A and B buses from the register file, the barrel shifter performs a shift or rotate operation, the ALU performs whatever operation needs to be done on the two operands, Rn and Rm, and the result gets written into Rd. All this happens in one cycle, as we saw with the two-phase non-overlapping clock of ARM. So this is the operation done in each of the stages. The Thumb instruction part we will cover later.

Now, to make it easy for you to understand: at any time three different
instructions will be in the different stages, not the same one. Once an instruction has gone through to the end of the pipeline, there will be one instruction here in execute, the next instruction, which was in the memory, will be in the decode logic, and the instruction after that will be in fetch. So different instructions occupy each stage of the pipeline. If only simple data processing operations are done, one instruction gets completed every clock cycle. And what is the latency? There is a three-clock latency, which is the time taken for one instruction to go through the pipeline and come out of it after execution.

Now, when a multi-cycle instruction is executed, the flow is less regular, as shown below. What is a multi-cycle instruction? As I told you, there can be sequential access to memory, and one instruction can copy multiple values from memory, or transfer multiple values between the registers and the memory. During such a multi-cycle operation, the execute part of that instruction gets extended. Let us not worry about that too much here.

Take a store instruction as the example. Suppose the following instructions are coming from the memory. The first one was an ADD. It was fetched in this clock cycle and it moved on to decode, and during that time an STR instruction was fetched from the memory. What is STR? It is a store instruction, which means it is supposed to copy a particular value from a register to memory. Now assume that there are other ADD instructions following the store instruction.

Let us go through it. The ADD instruction takes a single cycle in each stage, so it goes through and completes after the third cycle. The STR instruction, on the other hand, takes one additional cycle. Why? Because the STR instruction can use different addressing modes, and there could be indexing, so the address which is there in one of the
registers may have to be added to an offset, and only then is the final memory address that needs to be accessed calculated. During that time the ALU is also in use, and for that reason you cannot have any other instruction coming in here. This instruction takes one more cycle to compute the address to which the register content is to be stored. After the computation of the address, the value is put in the address register and the actual transfer of the register content to the memory happens.

So what I am saying is that when some instruction takes more cycles, the remaining instructions get blocked until it completes. When I say cycle, I mean the cycle time of MCLK, which is the cycle time for one stage of the pipeline. So during this time you see a gap, and then the decode starts.

Now why is there a gap here? Because the decode for this instruction is happening, the decode of another instruction cannot happen here, even though it has already been fetched. Only after this decode is completed can that decode start. If we draw a vertical line at any point in time, only one fetch unit can be in use, by one of the instructions, and only one decode can be in use, by one of them. Here there is no decode for the following instruction because the STR is occupying the decode logic: it is generating the control signals for the data transfer. The decode logic cannot be used by another instruction here, and during the address calculation the data path is in use, so you cannot have another decode stage here.

The cycles that access main memory are shown with light shading. You can see that memory is accessed in every cycle: these are all instruction accesses, and this is the only data access. Now why is there no fetch during this time? Because data is being accessed during this cycle, the next instruction cannot be fetched. You have to remember that it is a von Neumann architecture, where main memory is used for
instructions and data. Because a data access is happening here, the instruction access has to wait. After the data access is completed and the transfer is done, the fetch of the next instruction happens. So you have to understand this diagram by going through it in terms of time, and then see what gets completed in the different stages of the pipeline.

The data path is likewise used in every cycle, being involved in all the execute cycles. The data path is what I explained to you earlier: the ALU, the barrel shifter and so on. The decode logic is always generating the control signals for the next data path cycle, and in addition to the data path control signals, the control signals for the data transfer are also generated by it. During this time the control signals for the data transfer are being generated here, and that is why the decode logic is not available to the following instruction during this time.

Just to make you aware, these gaps are called bubbles in the pipeline. Because of one STR instruction coming here, there are two bubbles in the pipeline. So effectively the throughput cannot reach the ideal of one instruction completed per cycle, because in some time frames some stages are not active. During this time only these two stages are active, so not all the pipeline stages are used. Here also only two stages are used and one more is not used. The rest of the time all three stages are used. Only when all three stages are effectively used in every time frame will the speed-up be equal to the number of stages in the pipeline. Otherwise it will be a little less than the expected value.

I am giving an example with addresses 1000, 1004 and 1008. Suppose this ADD instruction was at 1000; then the next one would have been at 1004 and this one at 1008. Now you have to understand one more thing, that
though the memory is determining the flow and the speed of operation through the pipeline, if you look at this particular time frame, when the execution of the instruction which was fetched from 1000 is happening, the instruction which is being fetched from the instruction memory is the one at 1000 plus 8.

What does that mean? During this execution time, r15, that is the PC value, will not be pointing at the instruction which is being executed. It will be pointing at the next instruction but one. That means not even the next instruction, but the one after that, will be the value of r15. So suppose the programmer assumes that while an ADD instruction, or whatever it is, is being executed, r15 is pointing at its own address or at the next instruction. That is not true. r15 is pointing at the second instruction after the one currently being executed. So any dependency on r15 in the assembly code needs to take this pipeline effect into account during programming. When we talk about specific instructions and their impact, we will build upon this. So please remember this particular case: when the actual execution of the instruction at 1000 is happening, r15 is pointing at 1000 plus 8 to fetch that instruction from the memory. This is the anomaly that we should be aware of as assembly programmers.

Let me also explain what happens if there is a branch instruction. We will talk later about what a branch instruction means. It means that the control flow is jumping to some other address, so the branch instruction is modifying the value of r15. It will not be the same as what is being fetched here. After this execution, r15, that is the PC, will be loaded with a new value. So whatever has been fetched or decoded prior to this has to be flushed from the pipeline, and a new instruction has to be
fetched from the memory, and the pipeline has to be restarted from there. So these stages have to be stopped and a new instruction has to be fetched. That is the impact of a branch instruction on the pipeline.

Now, PC behaviour. One consequence of pipelined execution is that the program counter, which is visible to the user as r15, must run ahead of the current instruction. How far ahead is it? Let us see. If the instruction is a single-cycle instruction, say an ADD, then when that instruction gets executed in the execute stage, the PC will be two instructions ahead. This needs to be accounted for in your assembly program. And if the instruction you are executing is a multi-cycle instruction, and your instruction is also modifying the value of r15, then it has to be handled very carefully. Otherwise there will be unexpected, unpredictable results because of the pipeline's influence on the value of r15.

What I mean by that is: during the execution, if your instruction itself is modifying the value of r15, what happens to the current value? It is overwritten by your own instruction. So the outcome depends on which instruction and what operation the programmer has performed on r15, because r15 is also one of the general purpose registers and can be used as an operand in any of the data processing instructions. It is completely dependent on the instruction being executed. That is why ARM describes this as a characteristic of the pipeline.

A branch instruction, or branching by direct modification of the PC, causes the ARM core to flush its pipeline, as I mentioned to you. Another possible reason is an interrupt. When an interrupt happens, after the instruction which is being executed is completed, the flow goes to the ISR, the interrupt service routine,
which is reached through the interrupt vector, and the core starts executing the ISR from some other address. So the pipeline needs to be completely flushed, a new PC value based on the interrupt vector has to be loaded, and then it has to start afresh. When the ARM returns from the interrupt and r15 is restored, the pipeline will be refilled again starting with the instruction just after the one that completed. Do not worry if you do not understand this at this moment. We will be devoting one whole session to interrupts, and at that time all these things will become clearer. While I am discussing the pipeline, I just want you to be aware that interrupts and branches can cause a disturbance to the pipeline, and it needs to be handled accordingly.

So, just to summarize, what are the limitations of the pipeline? Because of the von Neumann architecture, with a single instruction and data memory, the performance is limited by the available memory bandwidth. The ARM core accesses memory on almost every cycle, either for accessing an instruction or for accessing data.

Suppose you want to improve the performance and get a better cycles-per-instruction figure. I told you that because of the store instruction one or two cycles were missed out; you saw two bubbles in the previous case. If you want to avoid such a penalty to the throughput, we need to do something. ARM7 has this limitation, but in the subsequent ARM family processors they have split the instruction and data memories, so that instruction fetches and data accesses do not contend and compete for the same bus in the same cycle. Instruction access and data access can happen in parallel if there are two different buses, so you may get better performance there.

Another way of improving the performance is increasing the number of stages in the pipeline. Why does
increasing the pipeline stages improve the performance? If you remember, with a three-stage pipeline up to three instructions are in progress at the same time, so the ideal speed-up of the pipeline is three. Suppose you subdivide it further and make it into five stages; then up to five instructions can be in progress at once, and the ideal speed-up is five. That means five instructions may be overlapped in every clock cycle, instead of three in the three-stage pipeline. So even if you are running the processor at the same clock, that is, MCLK is the same, say 10 MHz or whatever it is, a five-stage pipeline may perform better than the three-stage pipeline. Of course it depends on the flow of instructions, but hypothetically, by increasing the number of stages we can get a better performance.

That five-stage pipeline is implemented in the ARM9 processor. It is not in ARM7. ARM7 has a three-stage pipeline with a common instruction and data memory. Though our goal across all these sessions is to study only the ARM7TDMI in detail, I wanted to cover some part of the five-stage pipeline also. So we will be taking a detour to ARM9 in the next class to talk about the five-stage pipeline, and then we will come back to ARM7 again to continue further.

I just wanted to give you a flavour of where the ARM family processors are going. You can see that the ARM10 family has a six-stage pipeline and the ARM11 family has an eight-stage pipeline. You can also see that the von Neumann architecture stops here, at ARM7 itself, and going forward everything is Harvard, meaning there are separate instruction and data memories. There are also multipliers with more complex operations in the later families. So we will go to ARM9 only to understand the five-stage pipeline in the next session, and then for the rest of the discussion we will concentrate only on ARM7, going deeper into what instructions are supported in
ARM7. Whatever you learn in ARM7 is relevant for the rest of the ARM families, so I am taking this detour only to help you understand the five-stage pipeline, and then we will come back.

So these are the topics that we covered in this session. Today we touched upon the data path operation: how a particular instruction is decoded, how the operands are read, and how the operation is then performed in a single cycle. I also explained the clock cycle in detail, the two-phase non-overlapping clock. Then we looked at the three-stage instruction pipeline and how different instructions get executed using the pipeline. Then we talked about the limitations of the three-stage pipeline, and why we need more pipeline stages, using a comparison of multiple ARM family processors.

This brings us to the end of this session. I hope you enjoyed it. These are the books I have been referring to for all the sessions, apart from the ARM manual. Thank you all for your time. Let us meet again for the five-stage pipeline, which will be very interesting, with its own problems also associated with it. Thank you very much, and have a nice day. Bye bye.
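As a recap of the pipeline diagram from this session, here is a minimal sketch in Python of the timing we walked through. It is a simplified model under stated assumptions, not a cycle-accurate ARM7TDMI simulator: only ADD (one execute cycle) and STR (two execute cycles: address calculation, then data transfer) are modelled, the single memory port cannot fetch during a data transfer cycle, and the names `schedule` and `pc_during_execute` are made up for this illustration.

```python
# Simplified timing model of a 3-stage (fetch / decode / execute) pipeline
# with one shared memory port (von Neumann). Assumption for illustration:
# ADD needs 1 execute cycle, STR needs 2 (address calculation + data transfer).
EXECUTE_CYCLES = {"ADD": 1, "STR": 2}


def schedule(program):
    """Return one (fetch, decode, first_execute, last_execute) tuple per
    instruction, with cycles numbered from 1."""
    timings = []
    data_cycles = set()  # cycles in which memory is busy with a data transfer
    next_fetch = 1
    next_execute = 3     # first instruction: fetch 1, decode 2, execute 3
    for op in program:
        # The shared memory cannot supply an instruction during a data access.
        while next_fetch in data_cycles:
            next_fetch += 1
        fetch = next_fetch
        first_ex = next_execute
        last_ex = first_ex + EXECUTE_CYCLES[op] - 1
        # An instruction is decoded in the cycle just before it starts executing.
        decode = first_ex - 1
        assert fetch < decode, "this simple model assumes fetch precedes decode"
        if op == "STR":
            data_cycles.add(last_ex)  # the data transfer is the last execute cycle
        timings.append((fetch, decode, first_ex, last_ex))
        next_fetch = fetch + 1
        next_execute = last_ex + 1
    return timings


def pc_during_execute(address):
    """Value read from r15 while the ARM-state instruction at `address`
    is in the execute stage: two 4-byte instructions ahead."""
    return address + 8


if __name__ == "__main__":
    program = ["ADD", "STR", "ADD", "ADD", "ADD"]
    for op, (f, d, e1, e2) in zip(program, schedule(program)):
        print(f"{op}: fetch={f} decode={d} execute={e1}..{e2}")
    print(hex(pc_during_execute(0x1000)))
```

With the ADD, STR, ADD, ADD, ADD sequence, the model delays the decode of the instruction after the STR by one cycle and pushes the fetch of the fifth instruction from cycle 5 to cycle 6. Those are the two bubbles we saw, and the whole sequence finishes one cycle later than five ADD instructions would.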