So, welcome to the 23rd session. In this session we are going to see an actual implementation of a floating point processor. We saw the floating point format in the last class, and this session introduces the actual implementation of a floating point processor, the Vector Floating Point (VFP) family, one of the coprocessors that can be connected to the ARM core when building a system, if you are interested in adding floating point capability to the chip you are making. Here I am going to briefly talk about the different architecture classifications, where SIMD belongs, and give a very brief introduction to vector processors, because since we are going to talk about a vector floating point processor it is logical to make you aware of vector processors and their advantages. Then we will go into the details of a VFP implementation and see one example, not a complete application, but a small implementation which will show you the different features of the VFP. The processor architecture classification was given in 1966 by Dr. Michael Flynn; processors have since grown into many different forms, but this classification still gives you a broad view of where a particular processor belongs. It is called a taxonomy, that is, a branch of science concerned with classification, especially of organisms, but here we are using it for processor architectures. There are four broad categories, and any processor in the world will fall under one of them. So let us see what they signify. SISD is single instruction, single data stream: there is a CPU and a memory.
So, the CPU accesses a stream of instructions and, based on what each instruction wants, it accesses the data, and the processing continues. The processing unit executes the instructions, the control unit drives the signals to the memory and controls how instructions and data are accessed, and the data stream brings all the required data from memory to be processed. So a single processor gets one instruction stream coming in and one data stream, and it operates on them; this is what is called a single instruction, single data stream processor. Any uniprocessor, including our ARM processor, falls under this category. Now, what is SIMD? This is really what we started off towards, because vector processors, including the vector floating point processor, fall under this category. So what is the difference? Here the instruction stream coming from memory is spread to multiple processing units, or processing elements as we call them. If there are n processing elements, all of them get the same instruction stream, but they get different data streams; they happen to work on different sets of data. Though it is shown here as separate memory blocks M1 to Mn, they can all be kept in a single physical memory; it is not that they need separate physical memories. The main thing to keep in mind is that they get different data but all the same instruction. So what do you get? You get as many results as there are processing elements in the system. If there are n processing elements and the instruction being performed is, say, a floating point addition working on n sets of data, with n elements in the processors, then
suppose the addition takes 3 cycles: after 3 cycles, n results will be available, because all of these processing elements run in parallel. That is very important; if we keep n processing elements and they do not run in parallel, there is no use. All the processing elements run in parallel: they get the instruction, all of them operate on different data, and they give you different sets of results. Note that the processing elements need not be full processors in that sense; they could be multiple arithmetic units, maybe multiple add units in the ALU, or multiple multiplication units. Each add unit needs at least two inputs and gives you one output, so a processing element can be mapped to an add unit in the ALU or to a multiply unit, and multiple such units in the system all operate in parallel. So this is SIMD, and the vector floating point processor falls under this category. The next category is the reverse: multiple instruction and single data stream (MISD). What we saw earlier was single instruction and multiple data streams; this is the opposite. What does it mean? The units get different instructions, but the same data. When I say the same data, the data may be modified by one element and then the modified data stream passed on; but if you consider the whole thing as a single system, there is only one data stream coming into the system, and multiple processing units pass on the same data, possibly modifying it based on their instruction; then a different instruction operates on it, the data element gets changed, and it goes on.
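The SIMD idea above can be sketched in a few lines of plain Python. This is a software simulation only: real SIMD hardware runs the n lanes in parallel in the same cycles, whereas the Python loop only models the behaviour, not the timing.

```python
# Minimal sketch of the SIMD execution model: one instruction (here a
# floating point add) is applied by n processing elements, each to its
# own pair of operands from its own data stream.
def simd_add(stream_a, stream_b):
    # index i plays the role of processing element i
    return [a + b for a, b in zip(stream_a, stream_b)]

# 8 processing elements, one "instruction", 8 results at once
results = simd_add([1.0] * 8, [0.5] * 8)
```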
A typical example is a network processor. Assume you have a network router: input packets come in, there are many input ports and many output ports, and each packet may go out through one of them. Now, while these packets are getting routed, if you have a networking background you will know that some of the packets may get changed, because at the L2 layer the addresses will be different, since the network the packet is going into is different from the network it came through, and at L3, the Internet Protocol layer, there may also be changes happening to the packet. You must know that time-to-live (TTL) is one field that is updated based on the number of hops, so the TTL entry in the packet gets changed. What I am trying to say is that packets which get routed to different networks get changed, maybe at the L2 level or in the L3 data. The processing elements may also work on different layers: one may work on the L2 layer data, another on the L3 layer, one may do encryption, which falls under the higher layers, and some elements may modify some other layer. So different operations are performed on the same packet before the packet goes out; that is a typical MISD example. Then there is the highest level of parallelism, called multiple instruction and multiple data stream (MIMD), where you have multiple processing elements connected over a network, or they could be in the same physical location; they have their own local memory, or there could be a shared memory.
So it could be a distributed system or a shared memory system, where each of the processing elements gets a different data stream as well as a different instruction stream. They work on different parts of the data, but they all work on a single application; they may do different jobs which all combine together to achieve the application's goal. They all run in parallel, working on different sets of data, with different sets of instructions running on them. This is called multiple instruction, multiple data; a network of workstations may fall under this category. That was just to give you an overview; now let us see what a SIMD processor is. A SIMD processor, also named a vector processor or array processor, runs a single operation, say an add, on multiple data simultaneously; that is very important. It was common from the 1970s to the 1990s as a supercomputer architecture; we called such machines supercomputers, though now even the home desktops we have do a lot of jobs in parallel, since they all have multiple processors running in parallel. But in the 1970s to 90s, a machine having a vector processor was considered a supercomputer. So it operates on vectors of data. One example I can give you is a matrix. Suppose you have n rows and m columns, an n by m matrix, and you want to multiply it by another matrix, say m by p; you want to do that kind of matrix multiplication. Or take matrix addition: an n by n matrix plus another n by n matrix (sorry about the squares you see drawn, take them as n by n). Now, each element here needs to be added to the corresponding element there, so we can take a particular row from one matrix and the corresponding row from the other and add them all.
So, these operations done on a matrix operate on different data, but the intent, whether it is a multiplication or an addition, is a single instruction. Basically, a vector processor is built to operate on a matrix of data or a vector of data. A vector of data may be a single row or a single column, and the corresponding elements are operated on with a multiplication, an addition, or any other operation we want: scaling, rotation of a matrix, transformations; all these kinds of matrix operations fall under vector data. As an example, take the matrix of data I showed you: we fill in the different data, each vector element held in some storage, another set of vector registers gets a new row of elements, then all the processing elements can do a parallel addition, and the results can be placed in another vector. Typically, in a scalar operation, one set of data, say two operands for an addition, is taken in and one result comes out; whereas in the case of a vector processor, a vector of data is taken: these two vectors are the inputs, and an output vector is given. That means the parallelism is as much as the number of elements in the vector, or the number of processing units running in parallel. Suppose there are n processing units: you get n additions or subtractions or whatever done in a single cycle; or, if an addition takes 3 cycles, in those 3 cycles you may perform n operations. So this is the operation and the advantage of vector processors; just a high-level overview of the vector processor.
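The scalar-versus-vector contrast can be made concrete with a small sketch in plain Python (illustrative only): a scalar add produces one result per operation, a row-wise vector add produces a whole row of results, and matrix addition then becomes one vector operation per row.

```python
def scalar_add(a, b):
    # scalar operation: one pair of operands in, one result out
    return a + b

def vector_add(row_a, row_b):
    # vector operation: two vectors in, one vector of results out
    return [x + y for x, y in zip(row_a, row_b)]

def matrix_add(m1, m2):
    # n x n matrix addition = n independent vector additions (one per
    # row), each of which a SIMD machine can do with a single instruction
    return [vector_add(r1, r2) for r1, r2 in zip(m1, m2)]
```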
So, vector instructions access memory with a known pattern. What I am saying is, you are giving a single instruction for a set of processing elements, so organizing the instructions and the data is very, very important; a vector of data needs to be in a contiguous space. The memory configuration for a vector processor needs to be set up properly; only then can the accesses and the operations be done with high performance. All of you are aware of interleaved memory: even in a simple DIMM package of DRAM there are multiple chips, and we access data from all the chips; we interleave multiple banks of memory so that they all operate in parallel. The address resolution, absorbing the address and getting the data out, happens in all the banks of the memory in parallel, so you get a performance benefit. Especially in a vector processor, we need multiple data items to be fed to the processing elements, so we need an interleaved memory to feed the processor at the required rate. The memory latency is amortized over multiple elements, because we are accessing multiple elements at the same time; data is accessed from memory and put in order into a large set of registers. One more thing we need to remember is what I told you about the ALU having multiple add units, multiple arithmetic units. If each of them needs two operands, then we need that many registers, so a vector processor will have more registers in the system. It has more processing elements, more multiplication and arithmetic units, as well as more registers, so that the operands can be kept in them and accessed by the multiple processing elements simultaneously. The units then operate on the elements, and the processor writes the results back into memory.
So, this reduces the branch problem, because we are now getting the results of multiple data operations per instruction. And there are fewer instructions, because one instruction itself does multiple operations; these are among the advantages of vector processors. Some other advantages: a single vector instruction specifies a great deal of work, so the instruction fetch and decode bandwidth needed is dramatically reduced. When is more bandwidth required? When the memory is not able to feed instructions at the speed the processor wants. Now, one instruction goes in, multiple elements in the CPU work on the data in parallel, and I told you the data elements are arranged in interleaved memory so they can feed the CPU; the instruction fetch then need not be so fast, because we are getting multiple operations done at the same time with a single instruction. So the instruction fetch and decode bandwidth is dramatically reduced. Results computed by the various elements in a vector are independent of each other. One more thing we should remember: in a matrix multiplication, the multiplications done by the multiple elements are all independent; we only add them afterwards. There is no data dependency among them, which means we do not have to wait for one operation to be completed before another can start; they can all start in parallel. So the data dependencies or data hazards that we have seen in uniprocessor systems are reduced. There can still be a data hazard between vectors, but within a vector there is no data hazard. It is also power efficient, because of the reduction in instruction bandwidth and in data hazard checking.
So, if you need to check for a data hazard on every arithmetic operation you perform, there must be some digital circuitry in the system for it, which will consume power; that is reduced here. So we get the advantage of power efficiency because of the reduced instruction bandwidth and the reduced data hazard verification while running the code. Fine; that was a very quick recap of what a vector processor is and what the SIMD architecture is. In a nutshell, a vector processor falls under SIMD, single instruction multiple data: it gets a single instruction, but it has multiple processing elements which work on multiple data and produce multiple results. It is a parallel architecture because multiple elements are running in parallel. Now, let us see the VFP architecture; VFP is our Vector Floating Point architecture. The vector floating point processor is actually a coprocessor extension to the ARM architecture. So now you should remember that I have mentioned coprocessors all along; the VFP is nothing but a coprocessor. There is the data bus and the address bus to memory; the VFP is connected to the data bus. If you recall, let me quickly recall the nCPI coprocessor instruction signal, the one input coming from the ARM, and the two signals going back, CPA and CPB, coprocessor absent and coprocessor busy, if you remember; all this handshake takes place. There is no restriction that the coprocessor cannot drive the address bus, but it takes the data in, and it has got its own instruction pipeline which stays in sync with the pipeline in the ARM; all that should come to your mind. When I am talking about the VFP, it is basically a coprocessor using coprocessor IDs 10 and 11; let me tell you now itself, it is CP10 and CP11, not 16: CP10 is for single precision floating point and CP11 is for double precision floating point.
So, it uses the coprocessor IDs 10 and 11; these are reserved for it. Rather than reserved, it is a convention: the IDs 8 to 11, if I remember correctly, are given for the vector floating point processor or other coprocessors that you can design, so the coprocessor ID could in principle be anything, but 10 and 11 are the ones used by this vector floating point processor which comes along with the ARM SoC, the ARM IP. It works with the ARM as a coprocessor, so you should remember all the coprocessor instructions it can be driven with; basically, every operation done by the vector floating point processor is done through a coprocessor instruction. So please remember all of those that I mentioned in the earlier slide. It provides single precision as well as double precision: single precision is the 32-bit floating point representation and double precision is the 64-bit representation of the IEEE 754 standard, and it can work on either, based on the type of the VFP we have. Just now I introduced you to vectors: a VFP vector has up to 8 elements that can be processed in parallel. This number cannot be arbitrary, because it is tied to the vector length the processor can operate on; I am saying indices 0 to 7, which means 8 elements. When I say it can work on 8 sets of data in parallel, that means that many processing elements need to be in the system; the VFP that has been designed needs to have that many processing elements to process the multiple data elements in parallel. So this 8 is fixed, meaning it can work on either 8 single precision operations or 4 double precision operations, because double precision is 64-bit data while single precision is 32-bit, so the number of elements operating in parallel is halved for double precision.
So, please remember that the VFP can work on vectors of data; but that does not mean it cannot work on scalar data. What do I mean by scalar data and vector data? Scalar data means operating on single registers, say operating on S2 and S3 and putting the result in S1; that is scalar. When I say vector data, a set of registers is identified, the operation works on that set, and the results are written into another set of registers. So that is the vector of data: the vector data has to be stored in a group of registers, the registers are used for that purpose, the vector processor operates on those registers, and the result should also be a vector of data, because if it is a vector operation on two vectors, it will produce a vector of results coming out, so we need to identify the set of registers where the result should be stored. Most arithmetic instructions can be used on these vectors, allowing both scalar operation and SIMD parallelism; SIMD parallelism means a single instruction operating on multiple data, the vector of data. Further, the floating point load and store instructions have multiple-register forms; even the ARM has multiple-register forms, LDM and STM, which you have heard about. The floating point processor can similarly load multiple words from memory into its registers, or the other way around, the registers can be saved into memory; multiple data transfers can happen in one instruction. Now, I told you that the VFP also has double precision support, and that is indicated by the letter D in the naming convention.
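As a rough model of the vector operation just described, a VFP-style vector instruction with a configured vector length can be pictured as operating on that many consecutive single-precision registers. This is a deliberate simplification in plain Python: among other things, the real register file groups the registers into banks and wraps within a bank, which is ignored here.

```python
# Simplified model of a vector add over a VFP-style register file.
# S models the 32 single-precision registers S0..S31; `length` plays
# the role of the configured vector length (1 = scalar, up to 8).
S = [0.0] * 32

def vector_add_regs(dest, src1, src2, length):
    for i in range(length):          # one iteration per processing element
        S[dest + i] = S[src1 + i] + S[src2 + i]

# load two 4-element vectors into S8..S11 and S16..S19, then one "add"
S[8:12] = [1.0, 2.0, 3.0, 4.0]
S[16:20] = [10.0, 20.0, 30.0, 40.0]
vector_add_regs(0, 8, 16, 4)         # results land in S0..S3
```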
So, suppose you come across something like VFPv1-D: that means it has got double precision support as well. Whenever there is double precision support, single precision support is implicitly there too; you cannot have a processor with only double precision and not supporting single precision. But the other way around is possible: if the D is absent, only single precision is supported. What happens is, when the processor supports both single and double precision, there is more complex circuitry in the system to perform the operations; whereas if it handles only single precision data, because the data width is smaller, the complexity of the processing elements also comes down; that is very natural. Now, what is the internal organization of the VFP processor? As I told you, the VFP is connected to the same data bus the ARM is also connected to. The VFP cannot exist alone; it is a coprocessor, please keep that in mind; you cannot have a system with only a VFP and write code for it. There has to be an ARM core, and this coprocessor is an add-on to it; with the help of the ARM, the coprocessor does its job. The data bus is common both for getting data from memory and for instructions. If it is an instruction, it is fetched by the instruction issuer, which tracks the pipeline of the FPU and keeps track of the ARM's pipeline so that they are in sync; and based on the instruction, if it is a load or store multiple, it activates the load/store unit, so that the addresses are generated by the ARM and the data coming over the data bus from memory can either be transferred into the registers or be copied back to memory.
So, this unit takes care of interacting with the memory, and then there are the handshake signals, which by now you should be comfortable with: nCPI, and the two going back to the ARM processor, CPA and CPB; this is the coprocessor interface, the handshake with the ARM processor. One thing we should see from here is how the arithmetic unit and the register banks are connected; what does it mean? I told you that a vector processor needs to have more registers, and I will show you in a short while how many registers there are in this vector processor. Because there are more registers, one set of registers can be getting loaded from memory while another set is being operated on by the arithmetic unit; these two can happen in parallel. Similarly, data can be copied back from the registers. Most of the long operations in the vector processor, especially multiplication, division and square root computation, take more cycles to execute, division and square root in particular. While one of these is being executed, if there is a queue of instructions (I told you in the last class that a FIFO of instructions can be pending in a coprocessor), then while one division instruction is being handled by the arithmetic unit, a load or a store multiple, LDM or STM, can happen in parallel, because the load/store unit can do that job. So parallelism is possible because of the number of registers supported in the system and the way it is organized: as shown, the register bank has multiple read and write ports, and because the number of registers is large, one set of registers can be getting loaded from memory while the other registers are operated upon by the arithmetic unit.
So, all of that can happen in parallel within the VFP; and apart from this parallelism within the VFP, the ARM is also executing instructions. Once it hands over a CDP instruction, a coprocessor data processing instruction like a multiply or an add, it goes ahead with its own ARM instructions, so it is executing control code in parallel. There is a lot of parallelism now, you see: parallel execution between the ARM and the coprocessor; within the coprocessor, the load/store unit and the arithmetic unit running in parallel; and within the arithmetic unit, a vector of data being operated on, up to 8 single precision operations in parallel, as I told you. So much work is getting done in the same clock cycles. Keep all these things at the back of your mind while reading, so that you understand how much performance improvement you get with this kind of advancement in the architecture. The load/store unit operates concurrently, which I already told you, and the arithmetic unit can work on previously loaded operands; hardware interlocks protect against data hazards. I told you that between two vectors there can be a data hazard, but not within a vector; so there are hardware interlocks. Suppose there are two vector operations; I will give you an example, and the exact syntax later in the class: an add of two vectors is being done, and then you say a multiply is also being issued.
Now, one register of the first vector may be used here as one of the operands of the multiply, so between the two vectors there can be a dependency, and a data hazard can happen. Looking at the operands of the two instructions inside the VFP, you may see that, say, S3 appears in both places: the first instruction has to write into the register first, and only then can the second take it. The operand becomes available only after the completion of the first instruction; only then can the second operation start. That kind of data hazard dependency is caught by the hardware interlocks in the system; there is support for identifying those kinds of data hazards. Now, let us see what the support code is. Again I have to give you a little bit of background; the diagram is similar, ARM and VFP, and now I will not talk about just any coprocessor, it is the VFP coprocessor. If you recall the IEEE 754 floating point format, it has got lots of exceptions: you have seen 0 divided by 0, infinity divided by infinity; there are many kinds of operations which can result in exceptions. There could be an abort, or an error, or the possibility that the result you are getting goes below the normalized range, a very low value close to 0.0, which then has to be written as either minus 0 or plus 0; that decision has to be taken. These kinds of data-dependent exceptions are all possible with floating point arithmetic. When such things happen, how does the vector floating point processor give control to an exception handler? I have to give you a little background, because it will be difficult if you do not have this at the back of your mind.
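The interlock check just described amounts to a register-overlap test between two instructions. A hypothetical sketch in Python (the function names and interface here are mine, not the real hardware's):

```python
# Sketch of the check a hardware interlock performs between two vector
# instructions: the second must stall if any register it reads lies in
# the destination range the first is still writing (e.g. S3 in both).
def reg_range(base, length):
    return set(range(base, base + length))

def must_stall(prev_dest, prev_len, src_bases, src_len):
    written = reg_range(prev_dest, prev_len)
    read = set()
    for base in src_bases:
        read |= reg_range(base, src_len)
    return bool(written & read)      # overlap -> wait for the write
```

For example, if a vector add is still writing S0 to S3 and the following multiply reads a vector starting at S2, the ranges overlap and the multiply must wait.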
Now, assume some floating point operation was done by the coprocessor; let me choose another color. Say it did a floating point division, FDIV, and a divide by zero has happened: some number n divided by 0. This has to be handled in a separate way, and how depends on the application. Now, see who executed this instruction: the FDIV was done by the vector floating point processor, but it has encountered an error condition. It should bring whatever has happened inside the coprocessor to the notice of the ARM. How does the exception handling work? You know that there is a vector table in memory; in the vector table there is an undefined instruction entry, so an undefined instruction exception has to be generated. That needs to be signaled to the processor, so that whatever the ARM processor is currently doing, it stops; control goes through the vector table, the undefined instruction vector is picked up, and the handler stored in memory is executed. So something happening in the coprocessor needs to generate an exception, and that needs to be processed by the ARM processor. Please remember, the VFP, or any other coprocessor in the system, does not have its own exception handling capability. Any exception handler specific to a particular coprocessor needs to be integrated into the ARM's exception handlers, and then, based on which coprocessor or which instruction has caused the exception, the handler can act on it accordingly.
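The dispatch path can be sketched as follows. The handler table and the function names here are illustrative, not a real ABI; only the coprocessor-number field, bits 11:8 of an ARM coprocessor instruction encoding, is taken from the actual instruction format.

```python
# Sketch of how the ARM undefined-instruction handler could route a
# trapped VFP instruction to vendor-supplied support code.
SUPPORT_CODE = {}      # coprocessor number -> handler from the VFP vendor

def register_support_code(cp_num, handler):
    SUPPORT_CODE[cp_num] = handler

def undefined_instruction_handler(instr):
    cp_num = (instr >> 8) & 0xF          # bits 11:8 hold the coprocessor number
    if cp_num in SUPPORT_CODE:           # CP10 / CP11 -> VFP support code
        return SUPPORT_CODE[cp_num](instr)
    raise RuntimeError("genuinely undefined instruction")

# the system software integrates the vendor's handlers at build time
register_support_code(10, lambda instr: "handled single-precision trap")
```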
Now, how, you may wonder? The undefined instruction handler is called, and, if you recall, the handler can also find out which instruction caused the exception: the instruction was being executed while other instructions were in flight, so you have to trace back to the instruction that caused the exception and then process it. So handling specific to the particular coprocessor has to be done, and the job of the support code is to provide that capability. Whoever supplies the VFP, the vector floating point coprocessor, will also provide the support code for it, the exception handlers; and the system software developer, while developing the software for the whole SoC, will integrate those exception handlers given by the VFP vendor into their application and build the system. That is how it works, and that software component is called the support code. The support code provides features of IEEE compliance that are not supported by the hardware. Basically, as you can see, the support code is software, because it has to be executed as part of exception handling. If the VFP hardware cannot take care of a specific exception condition itself, the reason may be that it is very complex to implement in hardware, so it is put in software; or it could be that the probability of that particular exception happening is so low that we do not want to increase the footprint of the processor with more complexity, which would also make it power hungry. Those occasional cases can be handled in software without any real performance drawback; if the application warrants it, the handling of those cases can be given to software rather than hardware.
So, it is actually an implementation choice whether the handling of a particular case is done by hardware or software; but if it is done by software, that software has to be provided by the VFP vendor as support code, and the interface involved is called the sub-architecture. The definition of the interface between the vector floating point hardware and the support code is known as the sub-architecture. So, that is the definition of support code, what I mean by support code. And as I told you earlier, CP10 and CP11 are the coprocessor numbers normally used for the VFP: CP10 for single precision arithmetic and CP11 for double precision. What do I mean by that? Suppose the ARM code is giving a coprocessor instruction to the VFP: the coprocessor number, the CP ID, is put right in the instruction format, in bits 8 to 11 of the coprocessor instruction encoding. The assembler or the compiler generates the CP ID in the code based on whether a single precision or a double precision operation has to be done by the VFP. Based on that, the particular coprocessor instruction is generated, and the VFP looks at the value of the CP ID and comes to know whether it is supposed to do a single precision or a double precision operation. Let us now see the applications that will benefit from the VFP, the vector floating point processor: image processing applications, for example. Let me give an example of scaling. Suppose you have a cube in a graphics world, and you want to scale it to a bigger size.
What is being done in typical image processing is this: the image occupies some number of pixels, and it gets enlarged because of scaling, say scaling by a factor of two; whatever was represented by two pixels here gets expanded. To do the scaling operation, a scale factor has to be multiplied with the pixels of the image. Real graphics is not this simple, of course, but when you do a scaling operation, multiple parts of the image are handled, and they can be done in parallel. Image processing gives you lots of scope for parallelism: the same scaling logic can be applied to this part of the image and to that part of the image, and both can happen in parallel, because there is no dependency between the two; in terms of scaling, this part gets expanded into this region and the other part into that region. So, these two parts of the image can be acted on in parallel, which makes it a good candidate for a vector processor: multiple data, with the same scaling factor applied by a single instruction. That is a correct fit for a SIMD operation. Next, 2D-to-3D transformation: suppose you want to transform a 2D image into 3D, then you have to give it a depth perspective, one more dimension has to be added. These kinds of graphics or image processing tasks are good candidates for SIMD or floating point operation, and so is font generation, maybe converting from one kind of font to another.
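The scaling idea above can be sketched in a few lines. This is only a toy model (the pixel values and scale factor are made up): one scale factor, corresponding to a single instruction, is applied independently to many pixel values, and the two halves of the data have no dependency on each other, so a vector unit could process them in parallel.

```python
# Toy sketch of why image scaling suits SIMD: one scale factor is
# applied independently to every pixel value.
pixels = [10, 20, 30, 40, 50, 60, 70, 80]
scale = 2.0

# The two halves have no dependency on each other, so a vector unit
# could process them in parallel.
left = [p * scale for p in pixels[:4]]
right = [p * scale for p in pixels[4:]]
print(left + right)  # [20.0, 40.0, ..., 160.0]
```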
Digital filters are serial processing algorithms which operate on parallel data, and if you want good precision, the processing has to be done in floating point; you cannot achieve the same quality of image generation with an integer data format. So, you have to have a floating point format, so that the image quality is good, and digital filters that do signal processing jobs will also produce good results when they operate on a floating point number system. For any such scientific application, that is, one dominated by floating point operations, the VFP is the right choice. Now we have enough background: why we need a VFP, how it is implemented, how it is connected to the ARM, and how exceptions are handled. Now we will see what is inside the VFP. The first thing you should want to learn about any processor is its register set. The VFP has 32 general purpose registers; if you recall, the ARM had 16 general purpose registers, R0 to R15, apart from the banked registers. Maybe if you add all the banked registers and the SPSRs and CPSR it comes closer to this number, but here in the vector processor we have 32 general purpose registers. I told you vector processors work on vectors of data, so they need more registers; having more register storage is natural for them. Each register can store either a single precision value or a 32 bit integer. One thing you should remember: a 32 bit register just holds a bit pattern.
So, we can store either a floating point value or an integer, signed or unsigned; whether you are storing floating point or integer depends only on how you interpret the bit pattern in the register. There is nothing special about the register itself. How the ones and zeros of the 32 bit value are interpreted depends on what kind of value you decided to keep there; and this is not restricted to registers, even a memory location just stores a bit pattern. How you interpret it depends on what you know about the data: if it is floating point data, you interpret it that way; if it is integer data, you interpret it that way. So, the VFP registers are not aware of whether they are storing a floating point value or an integer value. Now there may be a confusion: how will I know what is stored there, and how do I operate on it? Who decides? It is decided by the instruction. Please remember, as an application developer writing assembly code, you write the instructions and you move the data into the registers, because you are the one writing the load instruction from memory. So, as a programmer you are supposed to know what data you are moving into the registers and what instructions you are issuing to operate on those registers. Suppose you move integer values from memory; you could execute some integer arithmetic on them, although why would you want to do integer arithmetic in the VFP registers when you can as well do it in the ARM processor? What I am saying is that you can interpret the bits in a different way: you can convert the integer to floating point and then operate on it internally in the VFP, because the input may be an integer; you do a conversion to a floating point value and then operate on it inside the floating point processor.
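The "same bits, two interpretations" point can be demonstrated directly. This sketch uses Python's `struct` module to view one 32-bit pattern both as an unsigned integer and as an IEEE-754 single precision float, exactly the choice a VFP instruction makes for a register (the particular bit pattern chosen here is just an example):

```python
import struct

bits = 0x40490FDB  # one 32-bit pattern; as a float it is roughly pi

# Interpret the same 32 bits two ways, like a VFP register allows.
as_int = bits                                                # integer view
as_float = struct.unpack('<f', struct.pack('<I', bits))[0]   # IEEE-754 single view

print(as_int)    # 1078530011
print(as_float)  # ~3.1415927
```

The register (or memory word) never changes; only the instruction's interpretation does.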
So, this kind of thing can be done using these registers; you can freely move either integer or floating point values into them. The processor understands the floating point format, but it does not know whether a particular register is holding an integer value or a floating point value; only the instructions know. So, based on the instruction: suppose the instruction names two VFP registers, say S0 and S1 (this is the naming convention followed in the floating point register file), declares them as single precision data, and performs a floating point addition. Then both bit patterns are treated as floating point values, they are added as floating point, and the result is written into some register, say S4, in floating point format. That is what I am trying to say here. In the D variants of the architecture, these registers can also be used in pairs: two adjacent registers together are considered as holding one double precision value; I will show you the register file. There are also three or more system registers, something similar to the CPSR: system registers to configure the floating point processor, as well as to find out the flags coming out of a floating point operation. Those things are managed by the status and control registers of the floating point processor. Now, this is the register file: S0 to S31, that is, 32 registers. As I told you, these are single precision, meaning each register is 32 bits wide. What about double precision? A double precision register is a 64 bit wide data register: it combines S0 and S1 as a single element, D0.
So, if you have a floating point operation which is double precision arithmetic, say a double add with D0 and D1 as operands and D3 as the destination, the hardware combines the underlying pairs of 32 bit registers and accesses them as single elements D0 and D1, and writes the result into D3. This is how the register files overlap with each other. Please remember, physically there is only one register file; there are no separate physical double precision registers, only the 32 single precision registers. Whether you consider a pair as one double precision value, or treat a single element as a single precision value, depends on the operation; again, whether the bits are combined as double precision data or treated as single precision data depends on the instruction. The registers themselves carry no indication of whether they hold an integer value, a double precision value, or a single precision value. One more important thing I want to explain here is called register banks; this is different from the banked registers we saw in the ARM. Here, the 32 registers, when treated as single precision registers, are laid out as four groups of 8 registers each: 8 single precision registers here, 8 here, 8 here, 8 here. These groups are called banks: this set of registers is bank 0, then bank 1, bank 2 and bank 3.
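The S/D overlap is again just reinterpretation of bits. The sketch below concatenates two 32-bit halves into one 64-bit word and reads it as a double; which S register holds the low word of D0 is taken here as an assumption (S0 = low word, S1 = high word) for illustration:

```python
import struct

# Two 32-bit register halves; concatenated they form one 64-bit double.
# Assumption for this sketch: S0 holds the low word of D0, S1 the high word.
s0 = 0x54442D18  # low 32 bits
s1 = 0x400921FB  # high 32 bits

d0_bits = (s1 << 32) | s0
d0 = struct.unpack('<d', struct.pack('<Q', d0_bits))[0]
print(d0)  # 3.141592653589793
```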
Now, what does this arrow indicator mean? It is not that the values are rotated, that this content goes into this register and so on; you may have seen those kinds of arrows for rotating bits within a register, but that is not it. I told you the VFP is a vector processor; it works on vectors. Suppose I give an instruction, say FADDS S8, S16, S24: I named these registers, that is all; adding these two and putting the result here, you would assume. But apart from this there is the FPSCR, the floating point status and control register I mentioned; it has fields, and let me talk about the length field, LEN, for which 3 bits are reserved. There is only one such register in the VFP. Suppose the LEN field has the value 101, that is 5; the hardware adds 1 to it, giving 6. When this FADDS instruction is executed by the VFP (please remember, this instruction is executed by the VFP), it looks at this bit pattern in the FPSCR, sees the 5, adds 1, and assumes that you are not interested in performing a single scalar operation, not just adding S16 plus S24 and writing into S8: you are interested in performing a set of 6 operations, a vector operation. What does that mean? You add these two elements and write into this one, then the next two elements into the next one, and so on, six times.
So, the registers named in the instruction are the starting points, got it? Suppose, with the same FPSCR value, I instead write the instruction FADDS S11, S19, S27; the length is still the same, meaning 6 operations, but now you are starting from S11, S19 and S27. Starting from each register you go down, and if you reach the end of its bank you go back up to the top of that bank and pick the remaining operands, until the 6 operations are done: S19 and S27 are added and put into S11, the next pair into the next register, and so on; after 5 are done going down within the banks, one more is done by wrapping around and taking the first register of each bank. That is what the register banks mean and how an operation works across a bank. Now let me erase this. It is not by chance that I picked those particular registers. Remember this bank, bank 0: its registers cannot be taken as operands or destination for vector arithmetic. You can do vector arithmetic among the other 3 banks, but not with bank 0. You may wonder what the use of that is; are we wasting one bank of registers? I will give you examples where we want to do a scalar operation: then bank 0 is implicitly used for that purpose. These registers are used for scalar operations, and the registers in the other banks can be used for scalar as well as vector operations. What do I mean by a scalar operation? Operating on one register and another register and putting the result back into a single register. In a vector operation we operate on sets of registers. So, if you want to perform a vector operation you take registers from the 3 vector banks, and if you want to do only a scalar operation, bank 0 is used; that is the indication.
So, the register file is divided into 4 banks, 8 registers in each bank, and please remember you can do a floating point vector operation in single precision or double precision. If you do double precision, a maximum of 4 data elements can be picked up as the vector; the vector length can be a maximum of 4 for a double precision operation, and a maximum of 8 for a single precision operation. It can be less than that, no issue, but the maximum is 8 because you cannot have more than 8 elements in a bank at a time. Otherwise you have to copy from memory and then perform the operation; I will show you an example of that kind at the end of the class. CDP instructions access the banks in a circular manner, but load/store instructions do not; that is a very, very important difference. Only CDP instructions, that is, add, multiply, square root, or any other arithmetic or data processing operations, treat these registers as banks: they combine them and operate on a vector of data, as I told you, but only during arithmetic operations, not during memory access. Memory access treats the registers as one sequential set, from S0 to S31, or as D0 to D15 for double precision. Suppose you want to copy data from memory into the registers: based on the length of the copy you want to perform, starting from the register you give, the transfer goes up towards the end; it will not wrap around. Please remember, load/store instructions do not wrap around; they operate from a lower register number to a higher register number, the same as in the ARM processor.
So, load/store instructions do not treat these banks of registers as banks; they do not access them in a circular manner. The vector length is decided by the LEN field in the FPSCR, as I mentioned. Another special case concerns destination bank 0: if the destination register happens to be in bank 0, then irrespective of the value in the LEN field, the instruction is treated as a scalar operation. Let me give an example. Suppose you have FADDS S0, S8, S16: S8 and S16 are added and the result put into S0. Now, this destination register, Fd (the floating point destination register), happens to be from bank 0 (let me write it clearly: S0). What happens? Once the VFP sees S0 as the destination, it will not assume that you want to perform more operations, continuing with S9 plus S17, S10 plus S18, and so on; it will not do that. Even though the LEN field may say 6, it will not do a 6-element vector operation; it performs only a scalar operation. That is what is conveyed in the last bullet of this slide: if the destination is a bank 0 register, the operation is scalar only, regardless of the value in the LEN field. I hope this is clear; it is very critical, and you will be able to appreciate it when I show you an example. Let us go a little faster here. For loading floating point values into registers from memory, and storing floating point values from registers to memory, there are load/store instructions, and some of them allow multiple registers as well, similar to LDM. Such instructions can be used to load short vectors of floating point values; transferring 32 bit values directly between the VFP and ARM registers is also possible.
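The short-vector rules described above, the LEN+1 count, the in-bank wraparound, and the bank-0-destination scalar rule, can be captured in a small software model. This is an illustrative sketch of the semantics, not the real hardware (register stride is assumed to be 1 throughout):

```python
# Software sketch of VFP short-vector semantics: S0..S31 split into
# 4 banks of 8; arithmetic wraps within a bank; a destination in
# bank 0 forces a scalar operation regardless of LEN.
def fadds(regs, d, n, m, len_field):
    length = 1 if d < 8 else len_field + 1   # bank-0 destination => scalar
    for i in range(length):
        # step within each operand's bank, wrapping circularly
        di = (d // 8) * 8 + (d + i) % 8
        ni = (n // 8) * 8 + (n + i) % 8
        mi = (m // 8) * 8 + (m + i) % 8
        regs[di] = regs[ni] + regs[mi]

regs = [float(i) for i in range(32)]
fadds(regs, 11, 19, 27, 0b101)   # LEN=5 => 6 operations, wrapping in-bank
print(regs[11], regs[8])         # 46.0 40.0 (the 6th op wrapped to S8)

scalar = [float(i) for i in range(32)]
fadds(scalar, 0, 8, 16, 0b101)   # bank-0 destination: only S0 is written
print(scalar[0], scalar[1])      # 24.0 1.0
```

The first call reproduces the lecture's FADDS S11, S19, S27 example; the second shows that the same LEN value is ignored when Fd is in bank 0.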
If you remember the MCR and MRC instructions, these are the coprocessor register transfer instructions: coprocessor register to ARM register, and ARM register to coprocessor register. These instructions are also supported by the VFP, because the VFP is a coprocessor, so all the coprocessor instructions are supported: transferring 32 bit values directly from VFP system registers to ARM general purpose registers, and so on. The VFP can also perform many kinds of operations on either a vector of data or a single floating point value; a single floating point value is called a floating point scalar, and a set of data elements a floating point vector. You can also copy a floating point value between registers, within the VFP itself, and while copying, the sign bit can be inverted. That means if you want to convert a negative value to positive while copying, say from S0 to S1, that is possible: while transferring, you invert the sign bit. All these instructions can also be used on short vectors, so you can perform these operations on vectors of data too. You can also perform a combined multiply-accumulate: you multiply two operands and add the result to an accumulator normally kept in another register. The MAC, multiply-accumulate, is a very popular signal processing operation, and such instructions are also supported by the VFP.
Conversions between single precision and double precision values are supported, as well as conversions between signed and unsigned integers and floating point; but remember that if you are converting a huge unsigned value, sometimes it may not fit into the signed representation, because the signed format represents only half of the unsigned range, since one bit goes to the sign. Comparing floating point values in registers with each other, or with 0, is also supported; those compare operations are there. Now, I touched upon exceptions earlier; let me give you an overview. Invalid operation: this exception happens when you do an operation like 0 divided by 0; it may generate a not-a-number, with the result set to a NaN. What I mean is this: suppose you are dividing S0 by S1 and you want to write the value into S3; S3 is a single precision register. You have asked for a divide operation, but suppose S0 held +0 and S1 also held +0; now you are dividing 0 by 0, which is not a valid operation. There are two possibilities: one, an exception is generated, it goes to the ARM, the ARM raises the undefined instruction exception, and the exception handler, the support code, handles it; or two, a NaN is written into S3. What is a NaN? It is a specific representation which says: I do not want to generate an exception, I want to suppress it by filling in the result with a NaN. It is a bit pattern which is not a valid floating point value; it is not-a-number, not actually holding a floating point value, but an indication. Whenever this bit pattern is seen by the floating point processor, or by your application, you should act on it accordingly. In that kind of situation a NaN can be returned. The standard typically says: either you trap, or you return the NaN.
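The NaN "indication" can be poked at directly. The sketch below builds a single precision quiet-NaN bit pattern (0x7FC00000 is one common encoding; the exact payload bits are implementation-defined) and shows the two properties the lecture relies on, that it is recognised as not-a-number, and that it never compares equal:

```python
import math
import struct

# A quiet single-precision NaN: exponent all ones, non-zero fraction.
qnan_bits = 0x7FC00000
qnan = struct.unpack('<f', struct.pack('<I', qnan_bits))[0]

print(math.isnan(qnan))   # True
print(qnan == qnan)       # False: a NaN never compares equal, even to itself
```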
So, when we have a VFP with support code, you can decide whether an exception is raised, or whether, on this kind of division, a NaN is generated and execution just goes ahead to the next instruction. Division by zero: again two options. Suppose you are dividing a positive value by 0; there is a representation for positive infinity, so you can generate that and go ahead with the execution, or you generate an exception. Two possibilities; these are the behaviours defined by IEEE 754. Inexact: suppose you are not able to represent a value exactly; then you round it, because you have at most 23 bits of fractional part in single precision. If you exceed that limit, the value is rounded and shown in the floating point representation; and whenever rounding happens, an exception can be raised, so that an application developer who wants to catch this kind of scenario can do so. What are the other kinds of exception? The overflow exception occurs when, say, adding huge floating point values produces a result beyond the representable range, towards plus infinity or minus infinity. In that situation you can either fill in the infinity value and suppress the exception, or generate an exception; it is based on what the application developer wants. Now underflow; it was explained in the last class, but anyway let me come back and explain here. There is a possibility that underflow happens, meaning you are not able to express the small value you are getting in the normalized notation.
If you recall, the normalized notation is 1.f × 2^e, with the implicit leading 1 and f the fractional part. When you are not able to represent a value in this normalized form, you can use the denormalized notation, for which there is a separate encoding: the exponent field is 0 and the fractional part is non-zero, with the sign bit used as usual. A denormalized number extends the range of representable small numbers below the smallest normalized value, on the positive side and, via the sign bit, on the negative side as well. So, when an underflow happens, there is a choice: if the value is too small to be represented in normalized form, it can either be denormalized, or it can be made plus or minus 0; you can say it is closer to zero and leave it at that, or you use the denormalized notation. Now, exception handling: how is it done? There are two ways. Untrapped means you do not want to trap the exception: the condition causes the appropriate cumulative flag in the FPSCR to be set to 1. If you do not want to generate a trap, the cumulative exception flag is set, indicating that there was a condition which exceeded the limits; because no trap was wanted, it is just recorded by setting this flag. That is the internal handling. In addition, the result register of the exception-generating instruction is set to the result value specified by the standard.
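The normalized/denormalized boundary is easy to observe in any IEEE-754 environment. Python floats are IEEE-754 doubles, so this sketch shows a value below the smallest normalized double that is still non-zero thanks to denormals, and what happens when even the denormal range runs out:

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022, smallest normalized double
denormal = smallest_normal / 4         # representable only as a denormal

print(smallest_normal)    # 2.2250738585072014e-308
print(denormal > 0.0)     # True: still non-zero thanks to denormals
print(denormal / 1e16)    # 0.0: flushed to zero once too small even for denormals
```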
So, as I told you: suppose the number you are getting after the division becomes too small to be represented; you can decide to set it to plus 0 or minus 0, based on which direction the result is coming from, or, for overflow, to minus infinity or plus infinity. Those specific values can be loaded as the result instead of generating any exception. So untrapped means you do not generate an exception: you set the flags and set the result values according to the standard. If it is trapped, you want to generate a trap: this is selected by setting the appropriate flag, the exception happens, and the support code gets executed; and who raises it? The ARM traps the exception. This is how the floating point processor handles exceptions. The support code exists in a complete implementation of the VFP architecture due to the existence of trapped floating point exceptions; basically, support code comes in only when there is an exception which needs to be trapped, that is, when you want to generate an exception. The support code is typically entered through the ARM undefined instruction vector, when the VFP hardware does not respond to a VFP instruction. It is possible that when an operation cannot be performed, the vector processor decides not to take that instruction: it does not respond with the busy signal, and the exception is generated; or it could happen later, after execution has started. Handlers can be used for rare conditions, and wherever an operation cannot be implemented in hardware, the exception handling can be done in software. The division of labour between hardware and software is an implementation-dependent decision.
So, there is no hard and fast rule that it needs to be done one way; I am just saying there is a possibility of the VFP adding to the interrupt latency of the ARM processor, and I should be aware of that, because the ARM is executing and the VFP is also there. If you recall, suppose there is a coprocessor load multiple instruction: the coprocessor is loading values from memory into its own registers. You are clear now that there are 32 single precision registers, so it can decide to transfer 32 × 4 bytes of data from memory in one go. This coprocessor load instruction has to be done with the help of the ARM, because the ARM generates the addresses while the data goes to the coprocessor. During this time there is a possibility of an interrupt happening, an IRQ or an FIQ. In that case, because 32 × 4 bytes of data are being transferred, this interrupt latency increases. So, do not think that because the coprocessor runs in parallel it will not impact the performance of the ARM; we cannot assume that. There is a tie-up between them: because this particular instruction is being executed, the ARM interrupt latency may increase. As system designers we need to keep all this in mind before deciding what kinds of instruction we will allow and what interrupt latency we can expect from the system. Other than that, there is the undefined instruction trap that I mentioned: the trap is raised for the vector processor and the ARM executes the handler, and when the undefined instruction exception is taken, IRQ is disabled; that is the default behavior.
If we do not re-enable interrupts inside the handler (I told you in the exception handling discussion that interrupts can be re-enabled internally inside a handler), then IRQ handling is delayed; the interrupt latency of IRQs increases because of this, and you should keep that in mind. Use of the VFP in a system therefore increases worst-case IRQ latency considerably; it is possible to reduce this IRQ latency penalty by explicitly re-enabling interrupts inside the undefined instruction handler. Also, though FIQs are not disabled by entry to the undefined instruction handler, it is recommended that FIQ handlers themselves should not use the VFP. What I mean is this: FIQ interrupts are not disabled; even if the undefined instruction handling for a vector processor instruction is running when an FIQ happens, the FIQ is serviced, because by default FIQ is not disabled there. But if the FIQ handler itself uses the VFP, it may cause a delay, because if an instruction used inside the handler generates a vector processor undefined instruction exception, it will further delay the execution. So, take care that FIQ handlers do not use any VFP instructions; remember, VFP instructions can otherwise be used anywhere, and this restriction on handlers is there to improve the interrupt latency of the system. Now, how do the VFP and ARM interact? I told you that VFP load/store instructions are allowed to produce data aborts, so VFP implementations have to be able to cope with data aborts; and, as explained earlier, the ARM takes care of generating the addresses, while the VFP decides the number of words transferred by a multiple load/store instruction.
I am just summarizing what you already saw for coprocessors in general: load/store multiple is possible with the VFP, as with any coprocessor. And you know the ARM coprocessor instruction classes: the data processing instruction, the coprocessor load/store instructions, and the coprocessor-to-ARM register transfers. The VFP uses coprocessor numbers 10 and 11, and if the condition code is not met, both the VFP and the ARM treat the instruction as a NOP. One more important thing, which I mentioned not in the last class but in the coprocessor class: conditional execution is handled by the CPSR values in the ARM processor, the N, Z, C, V flags. Conditional execution is based on these flags, not on the VFP flags; the VFP also has equivalent flags, but conditions are not evaluated on those. Suppose you want to conditionally execute some instruction based on a VFP comparison: the VFP flags need to be transferred to the ARM N, Z, C, V flags first, and then the conditional instruction can be performed. There is a provision for that, the FMSTAT instruction. What does it do? It transfers the floating point status flags into the CPSR. How is it done? It is an MRC-class transfer with the target register given as R15: the FPSCR is read and its top four bits, the condition flags, are transferred into the CPSR flags. I think I explained this in the previous discussion, so we do not have to go into it again. Now we are going to conclude this class with one small example, which you will love.
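What this flag transfer amounts to can be sketched in a few lines: the condition flags N, Z, C, V live in the top four bits of a 32-bit status word (bit 31 down to bit 28), and only those bits matter for conditional execution. The status value used below is made up for illustration:

```python
# Sketch of FMSTAT-style flag transfer: only the top four bits
# (N, Z, C, V) of the status word are copied across.
def extract_nzcv(status_word):
    return {
        'N': (status_word >> 31) & 1,
        'Z': (status_word >> 30) & 1,
        'C': (status_word >> 29) & 1,
        'V': (status_word >> 28) & 1,
    }

flags = extract_nzcv(0x60000000)  # example word with Z and C set
print(flags)  # {'N': 0, 'Z': 1, 'C': 1, 'V': 0}
```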
Let me explain this now; do not get overwhelmed by the length of the mnemonic. The F is nothing but floating point — it is a floating-point LDM, and you recognize load multiple very well. IA is not the government service, it is increment-after: it can be used as a stack operation or as a plain multiple data transfer. We are saying: take the address the base register is currently pointing at, load from there, and increment the address after each transfer, if you recall. And the trailing S indicates that this instruction works on the single-precision registers. Now, the base register has to be an ARM register, because address generation is done by the ARM, and the base register gets incremented as the transfer proceeds. And it is a load multiple — loading into which registers? The floating-point registers S8 to S15. So basically what is happening here: a set of 8 single-precision values, 4 bytes each, is copied from memory into registers inside the floating-point processor. The assembler generates a coprocessor instruction carrying this information to the ARM and the coprocessor, and they work hand in hand to transfer the values from memory to the coprocessor registers. Now, in the next instruction, base register R2 is used, the incremented address is written back, and the data is copied into another set of registers. So two sets of registers are loaded; let me go to the next one. Look at the memory — if you understand this, you will know how the floating-point processor works. In the VFP I have a set of registers, S8 to S15. Which bank does it belong to? Bank 1; bank 0 is S0 to S7.
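The load-multiple with writeback can be sketched in Python. This is a model of the data movement only — the function name, word-addressed memory list, and register dictionary are illustrative assumptions, not ARM behaviour in every detail:

```python
# Model of FLDMIAS R1!, {S8-S15}: fetch 8 single-precision words from
# memory starting at the address in ARM register R1, place them in VFP
# registers S8..S15, then write the incremented address back to R1.

def fldmias(memory, regs, s, base_reg, first, count):
    """memory: word-addressable list of floats; regs: ARM registers;
    s: the 32-entry single-precision VFP register file."""
    addr = regs[base_reg]                      # ARM supplies the address
    for i in range(count):
        s[first + i] = memory[addr // 4 + i]   # one 4-byte word each
    regs[base_reg] = addr + 4 * count          # increment-after writeback

memory = [float(i) for i in range(16)]   # 16 single-precision values
regs = {"R1": 0}
s = [0.0] * 32                           # S0..S31
fldmias(memory, regs, s, "R1", first=8, count=8)
print(s[8:16])      # bank 1 now holds the first 8 memory values
print(regs["R1"])   # 32, i.e. 8 words x 4 bytes past the old base
```

A second call with `regs["R2"]` and `first=16` would model the companion load into bank 2, S16 to S23.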
So, bank 0 is not used here; bank 1 is used, and then another bank — S16, let me change the colour, S16 to S23, 8 registers — and this is bank 2. Now look at how it is copied: the ARM is here, and two registers R1 and R2 are involved. R1 is pointing at some location — suppose this is the lower address and this is the higher address — the address is incremented after each fetch, and R2 is pointing to some other location in memory, and that is fetched too; this data goes into bank 1, and this data goes into bank 2. So the ARM is generating the addresses and the data is going into the VFP — it is not coming back to the ARM processor, it is going into the VFP, because it is a coprocessor instruction. Got it? I think you understood these two instructions. Now let us see what this FADDS with S24 means — only S24 appears as the destination. S24 up to S31 — this is bank 3. Now the FPSCR value becomes very important here: when this instruction is to be executed by the floating-point processor, it looks at the LEN field of the FPSCR and sees a value that specifies a vector of length 8 — that is, whatever value is in the field, plus 1. A vector of length 8 means it is supposed to do vector arithmetic. What do I mean by that? Starting from S8 and starting from S16, which you have already loaded with data, perform the arithmetic element-wise: S8 is added to S16 and written into S24. And do not stop there, because the field holds 7, and 7 plus 1 is 8 — so 8 operations need to be done, in sequence. That means S9 is added to S17 and written into S25, and so on, until S15 plus S23 is written into S31.
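The element-wise expansion driven by the LEN field can be written out as a small loop. Again this is only a Python model of the semantics described above (register numbers are indices into a 32-entry list; the function name is illustrative):

```python
# Model of vector FADDS S24, S8, S16 with FPSCR LEN = 7.
# Because S24, S8 and S16 all lie outside bank 0 (S0-S7), the single
# instruction expands to LEN+1 = 8 element-wise additions:
#   S24 = S8 + S16,  S25 = S9 + S17,  ...,  S31 = S15 + S23.

def fadds_vector(s, fd, fn, fm, len_field):
    for i in range(len_field + 1):
        s[fd + i] = s[fn + i] + s[fm + i]

s = [0.0] * 32
s[8:16]  = [1.0] * 8    # bank 1, loaded by the first FLDMIAS
s[16:24] = [2.0] * 8    # bank 2, loaded by the second FLDMIAS
fadds_vector(s, fd=24, fn=8, fm=16, len_field=7)
print(s[24:32])         # eight sums land in bank 3
```

One instruction, eight additions — that is the whole point of the vector mode.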
So, who does this? We have not written all of those register numbers explicitly — only the first register of each group is given — and it is implicit that if the operands belong to banks other than bank 0, and the LEN field says how many elements the vector operation covers, then that many operations are done. This one instruction performs 8 additions — remember that. Now let me erase the whole thing, no harm, you remember the banks now. I want to show you one more example, a scalar operation. Here again look at the banks: the input operands are S8 to S15 and S1, and the destination is S24. Which bank does S1 belong to? S0 to S7, which is bank 0. So if one of the operands — specifically the Fm operand, if you remember the Fn and Fm naming from the encoding — belongs to bank 0, then it is a scalar register, and this instruction performs a scalar operation, even though the LEN value has not been modified between the two instructions; it remains 8. What do I mean by scalar? When it sees S1, it takes only the S1 value and multiplies it with everything in S8 to S15, writing into S24 to S31. That means S8 times S1 is written into S24, then S9 times S1 — the same register S1, please keep that in mind — into S25, and so on; S31 gets S15 times S1. So you are actually performing a scaling operation by the fixed factor in S1, applied to 8 data values — it is a vector of data, but one of the operands is a scalar. Why multiplication? Because it is an FMULS; if it were an addition, everything would have been added to S1 instead of multiplied. So this is one flavour of another instruction; I just wanted to show you that a VFP instruction can be written this way.
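The mixed vector-scalar case differs from the pure vector case only in that the Fm index does not advance. Here is the same kind of Python model (illustrative names; it simply hard-codes the bank-0 scalar rule for the Fm operand):

```python
# Model of FMULS S24, S8, S1 with LEN = 7: the Fm operand S1 lies in
# bank 0 (S0-S7), so it is treated as a scalar and reused for every
# element:  S24+i = S(8+i) * S1  for i = 0..7.

def fmuls_mixed(s, fd, fn, fm, len_field):
    for i in range(len_field + 1):
        s[fd + i] = s[fn + i] * s[fm]   # fm stays fixed - the scalar

s = [0.0] * 32
s[1] = 0.5                             # scaling factor in bank 0
s[8:16] = [float(i) for i in range(8)] # the data vector in bank 1
fmuls_mixed(s, fd=24, fn=8, fm=1, len_field=7)
print(s[24:32])                        # every element scaled by 0.5
```

Compare the loop body with the vector case: only `s[fm + i]` has become `s[fm]`.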
Now, what happens next? We have done some work inside the VFP, but unless the computed values come out into memory, there is no use for this operation, right? Only once they are in memory can the ARM do something with them. So you have to get the values in the internal registers out to memory. That is the FSTMIAS — store multiple, from registers to memory. The results are in S24 to S31, the bank holding the result, so you pick up all 8 registers, and R0 is pointing to the destination location in memory where they are copied. That is done, and now I am showing you one more instruction, a subtract on R3. Suppose R3 holds a count — suppose you have multiple sets of values, not just 8 but some 80 values. That means a set of 80 values starting at the address R1 was pointing to, and another set starting at the address in R2, all of them single-precision floating-point values of 4 bytes each. Now, we have taken 8 data values from memory, multiplied and scaled them, and written them to the location pointed to by R0 — we have stored one set of 8 values. Next you want to take up the following set of 8 values, perform the operation, and write into the next 8 result slots. How many times will you do this? R3, the count, will be initialized to 10, so you perform this loop 10 times. Because the registers are limited to only 32 — you cannot have infinite registers — yet you are interested in processing a set of 80 floating-point values stored in memory and writing out 80 result values, you do it in 10 passes, and that control is done by this subtract. Now, where is R3? R3 is in the ARM; please remember, the ARM is holding the count in R3.
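The whole loop — load two blocks of 8, operate, store 8, decrement the count, branch back — can be sketched end to end. This Python model follows the add flavour of the example; the comments map each step to the instruction the transcript describes, and the function and variable names are illustrative:

```python
# Model of the full example loop: process 80 single-precision values,
# 8 at a time, over 10 iterations. In ARM terms each iteration is:
#   FLDMIAS R1!, {S8-S15}   ; load 8 values from stream A
#   FLDMIAS R2!, {S16-S23}  ; load 8 values from stream B
#   FADDS   S24, S8, S16    ; vector add with LEN = 7
#   FSTMIAS R0!, {S24-S31}  ; store the 8 results
#   SUBS    R3, R3, #1      ; decrement the count (in the ARM)
#   BNE     loop            ; branch back while R3 != 0

def vfp_loop(mem_a, mem_b, out, count):
    for _ in range(count):                       # SUBS / BNE control
        s8, mem_a  = mem_a[:8], mem_a[8:]        # FLDMIAS R1!
        s16, mem_b = mem_b[:8], mem_b[8:]        # FLDMIAS R2!
        s24 = [a + b for a, b in zip(s8, s16)]   # vector FADDS
        out.extend(s24)                          # FSTMIAS R0!
    return out

a = [float(i) for i in range(80)]   # 80 values at the address in R1
b = [1.0] * 80                      # 80 values at the address in R2
out = vfp_loop(a, b, [], 10)        # R3 initialized to 10
print(len(out))                     # all 80 results written out
```

The loop counter and the branch live in the ARM; only the three F-instructions per iteration are the coprocessor's job.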
So, who does the looping back? The ARM processor, because the instruction sequence is controlled by the ARM. The subtraction is done in the ARM — this is an ARM instruction, and you have to keep straight which instructions are VFP instructions, done by the coprocessor, and which are ARM instructions. After the ARM subtract comes the BNE, branch if not equal: R3 has been decremented by 1 because one set of values is done, and if it is not zero, you go back. Now you may wonder: what happens if my coprocessor is still busy with the previous iteration of the loop, and I am issuing another floating-point instruction? If you recall, the ARM asserts the coprocessor handshake signals, and the floating-point (or any other) coprocessor responds whether it is present or absent and whether it is busy. If the coprocessor is busy with the previous iteration, it may not accept the new instruction — it may accept a load if the load unit happens to be free, but it will not accept the next add or multiply while the previous one is still executing. Effectively, the ARM pipeline will stall, because the coprocessor instruction has not been taken up by the VFP. There is no need for any other waiting mechanism — the waiting is implicit, because that is the rule the ARM processor enforces: unless the instruction is accepted by the floating-point processor, do not go ahead with execution. So the ARM will get blocked, since the subtract and branch execute quickly — the ARM is doing small integer arithmetic — whereas the floating-point work takes longer.
So it will get blocked when multiple iterations of this have to be executed. You should keep in mind how the whole thing functions: to understand VFP you have to understand how this handshaking happens, how the ARM generates the addresses, how the registers are organized into banks, and how the LEN field is used — all of these things should come to mind when you try to trace this flow of instructions. I hope this explanation was clear to you. This is a typical example of how a floating-point processor can be used for processing a set of data. So we have come to the end of this class, and I am very happy to have shared this very interesting material. We covered the ARM instructions, and now we have completed the coprocessors; next we travel outside the ARM. We have come out of the ARM, we went into the coprocessor, and in the subsequent classes we will roam around the SoC. You know that the SoC has the ARM processor, a coprocessor, and some memory; we will visit all the friends in the SoC as part of the different units we will be covering. So, see you in the next session with some other interesting topic. With this we are completing the coprocessor-related instructions: I explained the coprocessor interfaces, then the coprocessor instructions, and now I have shown you how a VFP is implemented using the coprocessor interface and the coprocessor instructions — and before that, we saw how floating point is represented. So now we have covered all the related topics of the coprocessor, and we are ready to get into some other interesting topic in our journey towards ARM SoC design.
So, I am very happy to have shared all this information with you. These are the reference books — more often I will be using this one, this other one is also referred to, and of course the ARM manuals are used in all the lectures. Thank you very much for your attention. I hope this was useful, and I will talk to you in the next class. Bye bye.