So, we are discussing precise exceptions. These are exceptions that happen within a certain instruction and are restartable. That is the definition of a precise exception, and the name comes from the fact that when the exception happens, the system state is such that everything before it has completed and nothing after it has started. So, the system state is precise in that sense when the exception gets handled. And these are exceptions that happen within an instruction: an instruction executes, and as it goes through the pipeline, in some stage it raises an exception. However, not all within-instruction exceptions may be precise. But precise exceptions are always within-instruction, which means a precise exception is always tagged with a particular instruction. And that is why the restartable clause is important, because it says that after you handle the exception you should be able to resume the instruction and execute it successfully. So, an exception occurs in some pipeline stage, and the exception must be taken transparently, meaning that you save state, transfer control to the operating system, and then restore state and resume. In a pipelined processor an instruction may take an exception deep into the pipeline. For example, it may happen in the memory stage: it could take a page fault. By this time quite a few subsequent instructions are already moving in the pipe. The next instruction must have been fetched, something must have been decoded, something must have been executed. Because if you look at the pipeline timing, when this instruction gets to M, the next one is in X, the next-to-next is in decode, and another is being fetched. So, that means by the time the exception happens there are three more instructions inside the pipeline. And to maintain the precise semantics you have to do something to make sure that these instructions do not get to complete. 
They do not get to modify any state of the processor, and of course, you have to make sure that everything above it has completed by the time you take the exception, because the previous instruction is currently in write-back. So, the way it is done is that each instruction carries an exception vector with it, which tells if this instruction took an exception and, if so, in which stage. It is called a vector because usually the length of the vector is equal to the number of pipe stages. So, you mark the corresponding bit: for example, if an instruction takes an exception in the X stage it will mark the third bit, and so on. The vector is examined at the end of MEM or the beginning of the write-back stage, because notice that the write-back stage cannot raise an exception. So, when you are in the write-back stage you know you are done; you have no further computation left; the only thing that you have to do is store your result back to the corresponding register. So, the write-back stage is usually exception-free, and that is why this is the stage where you usually check this vector, which tells you whether this instruction should write the register or not. So, in case of a marked exception, all pipe stages are filled with zeros, so as to turn off any state change. Essentially what I am saying is this: suppose this particular instruction takes a page fault in the memory stage. You mark the vector, and when this instruction reaches the write-back stage you examine the vector. The vector says that this instruction actually took an exception. So, now you have to nullify the instructions which are already in the pipeline. That is exactly what I am saying: all pipe stages are going to be filled with zeros, so that essentially they become no-ops. They move through the pipe, they do nothing. And in our five-stage pipe we know that by the time this instruction is in the write-back stage, the previous one has completed. So, this is in fact enough to maintain preciseness. 
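The exception-vector mechanism just described can be sketched as a small toy model. This is a minimal illustration, assuming a five-stage pipe; the stage names, the `Instruction` class, and the function names are all made up for this sketch, not taken from any real design.

```python
# Toy model: per-instruction exception vectors in a five-stage pipeline.
# An exception is only *marked* in the stage where it occurs; it is
# *handled* at write-back, where all younger instructions are squashed.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

class Instruction:
    def __init__(self, pc):
        self.pc = pc
        # One bit per pipe stage; bit i set => exception raised in stage i.
        self.exc_vector = [False] * len(STAGES)
        self.nullified = False

    def raise_exception(self, stage):
        # Only mark the vector; handling is deferred to write-back.
        self.exc_vector[STAGES.index(stage)] = True

def writeback(instr, younger_in_pipe):
    """At WB, examine the vector. On a marked exception, squash every
    younger instruction still in the pipe (feed zeros / no-ops)."""
    if any(instr.exc_vector):
        for younger in younger_in_pipe:
            younger.nullified = True   # turned into bubbles
        return "trap"                  # transfer control to the OS handler
    return "commit"                    # normal register write-back

# A load takes a page fault in MEM; three younger instructions follow it.
ld = Instruction(pc=0x400)
ld.raise_exception("MEM")
younger = [Instruction(pc=0x404 + 4 * i) for i in range(3)]
assert writeback(ld, younger) == "trap"
assert all(i.nullified for i in younger)
```

Note that nothing is done in the MEM stage itself; deferring the decision to write-back is what makes handling come out in program order.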
As long as you do this, you know that nothing after this particular instruction has done anything to the processor state. And then what happens is a trap instruction is fetched, and that transfers control to the operating system. The trap handler saves the program counter of the excepting instruction, so that after the exception is handled we can restore the program counter and restart execution. So, the program counter of this instruction will be saved. This is the general mechanism at play here. So, a processor is said to support precise exceptions if all instructions before the excepting instruction execute normally, all instructions after the excepting instruction do not change any programmer-visible state of the processor, and, after the exception is handled, if it is restartable, execution must begin at the excepting instruction. Okay. An in-order pipeline must implement restartable exceptions to be able to implement page faults and TLB misses. These are the two most important restartable exceptions that we have to implement. What is a TLB? Does anybody know? Translation lookaside buffer. Now, in the floating point pipeline this instruction might be a multiplication, which takes longer. So, there is no such guarantee that by the time this instruction completes the previous one has completed, and there is no guarantee that by the time this instruction raises an exception nothing after it has completed, because it could be that the following instruction has a latency less than this one. So, by the time the exception is raised, that instruction has already completed. 
So, now how do you really maintain preciseness? We will talk about this soon. Normally the way this is handled is that there are two floating point modes supported in processors: one is called an imprecise mode and the other one is a precise mode. In precise mode, overlapping between floating point instructions is limited, and it is usually at least an order of magnitude slower. The imprecise mode has no such problem: essentially you can ignore all exceptions. It is designed to give you performance. So, we will talk about how exactly you can implement a precise mode; first of all, that is what needs to be thought about: how can I implement a precise mode with reasonable performance? Let us first take a look at the integer pipeline. So, we have this, our five-stage pipeline. The first thing we need to understand is what kind of exceptions can arise in each of the five stages. In the first stage, the fetch stage, what can happen? You can have a page fault: you are trying to fetch an instruction and the instruction page is not in memory. So, that is an instruction page fault. You can run into memory protection exceptions: you may be trying to access somebody else's code, or you may be trying to access your own data in the fetch stage. So, that will be a protection violation. Could there be a misaligned access in the fetch stage? Is that possible? Our instructions are all four bytes and word-aligned. So, can I have a misaligned access in the fetch stage? Which essentially means: can I ever have a misaligned PC, where my program counter is not 0 modulo 4? Is that possible? So, what are the sources of PC? We have a bunch of sources: it can be PC plus 4, it can be a predicted target. If it is PC plus 4 all the time, it can never become non-zero modulo 4 if it started out 0 modulo 4. So how can it happen? Sorry, say that again? 
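The alignment question above has a one-line answer in code. A minimal sketch, assuming word-aligned 4-byte instructions as in the lecture's pipeline; the function name is made up:

```python
def misaligned_pc(pc):
    # A legitimate instruction address is always 0 modulo 4. Sequential
    # fetch (PC+4) and BTB-stored targets preserve this; only a garbage
    # register value used as an indirect-jump target can break it.
    return pc % 4 != 0

assert not misaligned_pc(0x1000)   # PC+4 path stays aligned
assert not misaligned_pc(0x1004)
assert misaligned_pc(0x1002)       # arbitrary data used as a jump target
```

This is exactly the check the fetch stage would perform to raise the misaligned-access exception discussed next.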
I am asking how, how can that happen? You are using compiler generated code, so it cannot happen, right? Predictions are coming from the branch target buffer; it stores previously seen targets, it does not cook up anything, it stores only what it has seen previously. So, those at least cannot be misaligned. So, what if, let us say, you mispredict a branch, and the predicted target is something which is a legitimate instruction. So, you are going along the predicted path and you encounter a return instruction, but there is no matching call, because you are just going along the wrong path; you should not be going along this path. In the correct execution you should never be taking this path without taking some other path on which you actually called the particular procedure. So, now you have a return instruction, which is an indirect jump and uses a register as a target, and that register content was never set before, something like that. So, that can lead to a misaligned target. Does anybody follow what I am saying? So, you have a branch somewhere in my program. Through some control path I have reached a point; this one also may be predicted. And here I take a prediction, I am going along this path, and here I encounter a return statement. The point is that through whatever path I came, there was no procedure call statement, simply because this path is actually wrong. I should not be following this path at all, but because of prediction I am following this path and I encounter this JR $31. $31 was never saved, because only if you have a corresponding JAL or JALR instruction would you actually put the return address in $31, but you never came across any such instruction, because this whole thing is actually wrong: you are going along a predicted path. The path is wrong. Because of prediction. You will always be inside the function. Yes, that is right. There has to be a return address; yes, you have to return from the function. 
Yes, provided that register has not been saved to the stack. That is right. Suppose your procedure wants to use this register for storing some temporary computation. So, it saved it to memory, but hasn't yet restored it. And then you start going along the wrong path, suddenly encounter this return, and start using $31, which has some arbitrary data value sitting in it. Does the hardware check it? It does not. It is just a 32-bit value, so it simply uses it as the jump target. You could do what you are suggesting: when you execute a JAL instruction, it could actually store only the first 30 bits or something. You could do that. So, this will happen with very low likelihood in this pipeline. It will happen more if you have a deeper pipe, and it will happen more when instructions start executing out of program order, which we will discuss very soon. So, we will revisit this example again, but keep in mind that it can happen. So, this is the misaligned access in the fetch stage, arising because of execution along this particular wrong path, and it arises particularly from indirect jumps where you end up using a wrong register to take the target. Second stage, decode: an illegal opcode again arises for the same reason, because of this particular wrong path. So, I would actually use the same example. It could happen that in such a case this turns out to be an aligned access. So, the fetcher does not know; it innocently goes and fetches whatever is there at that particular address. I mean, it interprets the register content as an address, goes and fetches whatever is there. It may be a complete garbage value that does not even make sense to the decoder. So, it immediately raises an illegal opcode exception in the second stage. The third stage can have arithmetic exceptions. For example, in the integer pipeline the only exception that can happen is overflow. So, you can go and look up your list of instructions. 
This is the only bad thing that can happen in the third stage. The fourth stage can have page faults, memory protection violations, and misaligned accesses. Again, the MIPS compiler is actually very careful about this, so legitimate load/store operations will never have misaligned addresses, except for those LWL, LWR, SWL, SWR. So, this can again happen because of execution along wrong paths: you will start using wrong addresses. The write-back stage does not have any exceptions. Any question? So, now the problem is that in the same cycle multiple instructions can take exceptions, and even worse, exceptions can occur out of order. So, let us take a look at that. Consider this instruction and this instruction, these two. This instruction can take an exception in the MEM stage. This later instruction can take an exception in the fetch stage. So, a later instruction can actually take an exception earlier in time: this exception will show up earlier than that exception, because this is how my time goes. So, that is exactly what we are saying: exceptions can occur out of order, and in the same cycle multiple instructions can take exceptions. So, you can look at this cycle: there could be an exception here, here, here, here. All these four instructions can raise exceptions in the same cycle. So, we have already devised a method to handle this. The exception vector associated with each instruction provides a way to handle this in order, because we said that, well, let this instruction take an exception; I am not going to handle it immediately. I will only mark it in the exception vector. Finally, when this instruction reaches the write-back stage, I will actually handle the exception. 
Fortunately, by then the earlier exception would already have been handled, so this instruction will actually not be in the pipeline anymore, because we said that when we handle an exception we feed zeros into the pipeline. So, these three instructions will actually get nullified. When the first exception is handled and execution restarts, this instruction re-executes. Then again this instruction will appear in the pipeline, may again take an exception, and we will handle it then. So, exceptions will be handled one after another, exactly in program order, because of this exception vector. Any questions? Yes. So, you are saying, when we actually handle the exception, at that point we feed zeros into the pipeline. So, suppose the third instruction takes a page fault in instruction fetch. This one? Yes, this one. So, will instruction fetch actually stop? No, it will not stop. Whatever is in the IF/ID latch will just be carried forward; it does not matter. It will mark a single bit in the exception vector for the fetch stage and that is it. It will not fetch anything. Is an exception possible in the write-back stage? Sorry, right? In write-back we are just writing registers. What kind of exception could it have? The registers are not protected; they belong to you. Any other questions? Is this clear, how I handle exceptions in a standard 5-stage pipeline? So, there are a few small problems that we have to worry about. I mention this because you will have to handle system calls in your second assignment, and these system calls are actually exceptions. So, the handling will be exactly the same and you will face the exact same problems. One small problem arises with the branch delay slot. So, let us try to understand why it is very special. 
Why is it different from others? So, let us suppose that you have a load instruction in the branch delay slot taking an exception. So, essentially I have a branch instruction, in the delay slot I have a load instruction, and then I will have either the target or the fall-through, depending on which way the branch goes. So, let us assume that the load in the branch delay slot takes an exception. The question is, how do you really handle this? Clearly, first of all, to maintain the preciseness of the exception you cannot execute the next instruction at this point. You have to take the exception of the load, come back, re-execute the load, and only then can you execute the next instruction, which essentially means when you take this exception of the load you have to remember which way the branch went, because only then will you know which PC to use after the load instruction. So, the exception PC is this load instruction. In the normal case you will only remember this PC and that is it. If you only do that, the problem is: you take the exception, then you resume execution here. What is the next PC? Is it PC plus 4? It may not be; you have to remember which way the branch, the previous instruction, went. It could be PC plus 4 or it could be some branch target. So, there are two solutions. One: let the branch PC be the exception PC. Essentially what you do is you take the exception but start execution from the branch: you re-execute the branch, then re-execute the load, and then continue. Then you do not have to remember which way the branch went. The second option is to remember multiple PCs and some more state: essentially you remember the load PC and also the next PC to execute after the load. Of course, the good news is that none of the codes handed to you in the assignment have a system call in the branch delay slot. 
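The first solution, reporting the branch PC as the exception PC, can be sketched in a few lines. This mirrors what real MIPS hardware does with the BD (branch delay) bit in the Cause register, but the function here is an illustrative toy, not the actual hardware logic:

```python
def exception_pc(faulting_pc, in_delay_slot):
    """Pick the PC the OS handler should save. If the faulting
    instruction sits in a branch delay slot, report the *branch* PC
    instead, so on restart the branch is re-executed first and its
    outcome never needs to be remembered. Returns (epc, bd_flag)."""
    if in_delay_slot:
        return faulting_pc - 4, True   # back up to the branch itself
    return faulting_pc, False

# Load at 0x2004 sits in the delay slot of the branch at 0x2000.
epc, bd = exception_pc(0x2004, in_delay_slot=True)
assert epc == 0x2000 and bd            # resume at the branch
epc, bd = exception_pc(0x3000, in_delay_slot=False)
assert epc == 0x3000 and not bd        # normal case: resume at the load
```

The trade-off is that the branch is executed twice, which is harmless because a branch has no side effects beyond choosing the next PC.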
So, you will actually never face this particular problem of an instruction in the branch delay slot taking an exception. And of course we do not model page faults in our simulator, so there is no question of a load instruction taking an exception either. The only exceptions in your second assignment are system calls; that is what you should remember. So, this case will not arise in your assignment. Now, the harder problem: what about the floating point pipeline? First let us try to understand why this is harder. Just to remind you, we had this floating point adder with a 4-cycle latency, the multiplier with a 7-cycle latency, and the divider with a 25-cycle latency, right? So, what can happen in the pipeline is this. Let us suppose I have a multiplication operation. Your book actually puts a memory stage here, which is unnecessary because the multiplier does not need a memory stage; things really do not change if you put an extra stage here. All right, so that is the multiplication instruction. Suppose the next instruction is an add instruction. What will it look like? Let us suppose that this multiplication instruction raises an exception in stage M7, the last stage of its pipeline. By now the add instruction is complete; it has written back already. How do you maintain a precise exception in this case? That is exactly what the first point says there: instructions complete out of order. Is the problem clear to everybody? Now somehow, to maintain preciseness, you have to recover the value overwritten by this instruction. For example, this instruction could be something like add.d F2, F0, F2, a double precision add. So, it has overwritten F2 already. You have to recover that value to maintain preciseness, so that when the multiplication goes into the operating system's exception handler, the OS sees the precise state of the processor: it knows that everything before it has completed. 
And nothing after it has started. And there could be other problems too: think about the previous instruction. That instruction could be a division, which takes 25 cycles. So, now if you try to replicate the solution that we used for the integer pipeline, that does not really work, because by the time this instruction reaches write-back and I am examining its exception vector, there is no guarantee that the previous instruction has completed, and there is no guarantee that the following instruction has not completed. So, this might have completed, and this may not have completed. That does not really work anymore; you have to do something else to maintain preciseness. So, here are four solutions. One is the imprecise mode. That is the easy way out: you say that I do not have precise exceptions, but you get good performance. The second solution is to have a history file. Can anybody guess what it might be from the name? What could be a history file? Right, exactly. Before an instruction overwrites a register, it saves the old value of the register to a separate register file. That is called a history file. It tells you what the register was. So, how does it help to maintain preciseness? How do I use a history file? For example, when the add writes F2, the history file will tell you what F2 held before the add wrote it. Now, why do we care about them writing the same register? What if they write different registers? Is there still a problem? So, basically the point of a precise exception is that I do not want anything to be modified after this multiplication if the multiplication takes an exception. So, whatever register this add instruction has written to must be recovered now, and that is what you use the history file to recover. 
Before you allow the add instruction to write to F2, you actually save the previous value of F2 in the history file. So, when do you finally copy the value into the main register file? Sorry, say that again? No: the new value goes directly into the main register file; it is already written back there. With a history file, the new value is already in the main register file and the history file stores only the old value, that is it. So, you do not have to do anything extra as such; you get the new value for free, right? Is that okay? But now, what if the register is being written multiple times? What if I have another add instruction here? This one also completes before the multiplication gets to the write-back stage, where I examine the exception vector. And this instruction may be writing to F2 as well. So, now the history file will actually have the value produced by the first add, not the original value. I want the value from before the first add, but that has now been overwritten, because every time you update a register in the main file, you copy whatever the main file was holding into the history file. So, what do you do? Right: you do not let the second one write back; only one outstanding write per register is allowed in the history file. So, you stall the instruction. Okay. So, that is one possible solution. What is being suggested is that you have a bit vector with length equal to the number of registers, and if a register's old value is already in the history file you mark that bit. When the second instruction comes to write-back and the bit is already marked, you say it cannot write. How long do you keep that bit set? Some structure will have to maintain this order, right? You essentially have to have a one-to-one correspondence between the instructions and the history file entries. 
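The history file with its one-outstanding-write-per-register rule can be sketched as follows. This is a simplified model under the assumptions just discussed; the class and method names are invented for illustration, and the fetch-order FIFO that decides when to retire is left implicit:

```python
# Toy history file: the new value goes straight into the main register
# file; the *old* value is parked in the history file. A per-register
# busy bit permits only one saved version, so a second writer stalls.

class HistoryFile:
    def __init__(self, nregs):
        self.regs = [0] * nregs        # main register file
        self.history = {}              # reg -> value before the write
        self.busy = [False] * nregs    # one outstanding saved version max

    def write(self, reg, value):
        if self.busy[reg]:
            return "stall"             # a second writer must wait
        self.history[reg] = self.regs[reg]
        self.busy[reg] = True
        self.regs[reg] = value         # main file updated immediately
        return "ok"

    def retire(self, reg):
        # Instruction leaves the fetch-order FIFO: its saved copy is dead.
        self.busy[reg] = False
        self.history.pop(reg, None)

    def recover(self, reg):
        # Older instruction faulted: roll the register back.
        self.regs[reg] = self.history.pop(reg)
        self.busy[reg] = False

hf = HistoryFile(4)
assert hf.write(2, 10) == "ok"     # add writes F2; old F2 (0) saved
assert hf.write(2, 20) == "stall"  # second add to F2 must stall
hf.recover(2)                      # multiply faulted: restore old F2
assert hf.regs[2] == 0
```

After `recover`, the stalled second add would be squashed and re-executed after the exception handler returns, exactly as in the integer-pipeline scheme.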
So, you have a separate FIFO which maintains this fetch order: whenever you fetch an instruction you enqueue it in the FIFO. Instructions are removed from the FIFO in that same order. That will make sure that whenever an instruction is removed from the FIFO, you know its history file entry can now be overwritten. Okay. So, it is not as easy as it first sounds; there are many small things to take care of. The other option is the future file. You can probably guess now what that is. Essentially, in this case you do not actually change F2 in the main file; you store the new value somewhere else. That tells you what the future value is; the main file does not change in this case. So, what has to happen now is again driven by a FIFO: whenever an instruction goes out of the FIFO, you copy its value from the future file into the main file, because now that value becomes visible to everybody. All right. Do you have the same multi-version problem here that you had in the history file? Multiple versions of values, again when instructions write the same register? Essentially what you can do is this: since you are maintaining the FIFO anyway, why don't we give each entry of the FIFO a value field which stores the value that its instruction produced? That becomes your future file. So, you have a FIFO: whenever an instruction is fetched you put its ID in an entry, and the entry also has a value field which will be populated by the value this instruction produces. And whenever the entry moves out of the queue, you move its value to the main file. That automatically takes care of the multiple versions. Yeah. Where do you read the value from? The future file, for the next instruction. Yes, exactly. 
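The future file folded into the fetch-order FIFO is, in effect, a tiny reorder buffer. A minimal sketch under the assumptions above; names and structure are illustrative, not from any particular processor:

```python
# Toy future file kept inside the fetch-order FIFO: each entry carries
# the value its instruction produced; the main register file is updated
# only when the entry retires (leaves the FIFO in fetch order).

from collections import deque

class FutureFile:
    def __init__(self, nregs):
        self.main = [0] * nregs
        self.fifo = deque()            # entries in fetch order

    def dispatch(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.fifo.append(entry)
        return entry

    def complete(self, entry, value):
        # Functional unit finishes (possibly out of order): value is
        # parked in the FIFO entry, not in the main file.
        entry["value"], entry["done"] = value, True

    def retire(self):
        # Only the oldest entry, once done, may update the main file.
        if self.fifo and self.fifo[0]["done"]:
            e = self.fifo.popleft()
            self.main[e["dest"]] = e["value"]
            return True
        return False

ff = FutureFile(4)
mul = ff.dispatch(dest=2)          # older multiply, still executing
add = ff.dispatch(dest=2)          # younger add, same destination F2
ff.complete(add, 20)               # add finishes out of order...
assert not ff.retire()             # ...but cannot retire past the multiply
ff.complete(mul, 10)
assert ff.retire() and ff.main[2] == 10
assert ff.retire() and ff.main[2] == 20
```

If the multiply had faulted instead of completing, you would simply discard the FIFO; the main file already holds the precise state.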
Your future file now becomes part of your bypass network, because what if the next instruction needs F2? The main file doesn't have the most recent value; it's sitting here in the queue. So you have to bypass from the future file. Okay. So anyway, none of these is an easy solution; both have complications. So the P6 microarchitecture enhances the future file to a retirement register file. We'll talk about P6 once we cover a few more concepts. The P6 microarchitecture is used in the Pentium Pro, Pentium II and Pentium III. So these processors actually use this particular structure, called a retirement register file, together with this particular ordering queue. The queue has a certain size, of course, which essentially means that if the queue is full you cannot fetch anymore; you have to stall. We'll talk about this queue in more detail later. It has various other names, but just keep in mind that it is essentially maintaining the order of instructions, the order in which they are fetched. Any question on the future file? The third solution is that you can let the software handle preciseness. That is: trap immediately and forget about whatever is unfinished. What happens is that, let's say this multiplication instruction has taken an exception. When it reaches write-back I check its exception vector and immediately take the exception, without worrying about what happens to the instructions before it or after it. So the software handler that handles the exception has the responsibility to finish the incomplete instructions, ignore the completed ones, and resume after the last completed instruction. Some of the instructions may already be completed here, so it will actually resume execution after the last completed instruction. The last solution is: issue only if all previous instructions are guaranteed to complete without taking an exception. These are very hard things to guarantee, actually. 
So essentially what I am saying here is that you detect exceptions as early as possible. Whenever you are issuing an instruction you ask the following question: tell me whether the instructions before me will take an exception or not. You are asking the decoder this question. So whenever you issue this load instruction, you ask: will any of the instructions currently in the pipeline before me take an exception? If the answer is yes, then you stall this instruction; you don't issue it. That really solves the problem, because then you know there won't be any instruction after the excepting instruction in the pipeline. But how do we answer this question? It's not at all easy. How do you know that this multiplication is going to take an exception in the M7 stage? How can the decoder sitting here say that? So you have to design the hardware in such a way that every functional unit checks for exceptions before starting the computation. In its very first stage the multiplier could actually check whether this instruction is going to take an exception or not. That's possible: it can examine the operands and figure out whether there is going to be an exception. This is exactly what is done in the MIPS R2000, R3000, R4000 and the Intel Pentium: you stall instruction issue if there is a chance of an exception happening in any of the instructions prior to this one. We'll talk about the R4000 very soon and see how it actually does that. Any questions on this? Now, a little bit about pipelining a CISC ISA. Till now we have been looking at a RISC ISA, where we have simple instructions and you know which resources an instruction requires in which stage, whether it will access memory, and so on and so forth. In a CISC ISA you have widely varying instruction latencies, which magnifies the problems of floating point pipelining, which we have seen here, by a large amount. 
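The "check the operands before starting" idea can be illustrated with toy exponent arithmetic. This is a deliberately simplified sketch: the exponent limit and the overflow test are illustrative stand-ins, not real IEEE 754 handling, and the function names are invented:

```python
EXP_MAX = 127   # illustrative single-precision-style exponent limit

def multiply_may_overflow(exp_a, exp_b):
    # A multiplier's first stage can look at the operand exponents alone:
    # the product exponent is roughly exp_a + exp_b (+1 for a possible
    # mantissa carry), so overflow is decidable before computing anything.
    return exp_a + exp_b + 1 > EXP_MAX

def may_issue(older_in_flight_exponents):
    # Issue logic: stall if any in-flight older instruction might fault.
    return not any(multiply_may_overflow(a, b)
                   for a, b in older_in_flight_exponents)

assert multiply_may_overflow(100, 100)   # product clearly overflows
assert not multiply_may_overflow(10, 10)
assert not may_issue([(100, 100)])       # stall behind the risky multiply
assert may_issue([(10, 10)])             # safe: issue proceeds
```

Real hardware does something in this spirit but more conservatively; the point is only that an early, possibly pessimistic test lets the issue stage guarantee no older instruction will fault later.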
And worse, there could be data hazards within an instruction. We never encountered that in MIPS, where hazards usually go from one instruction to another. Here you can have a data hazard within an instruction, because the same register may be read and written multiple times in one instruction. So the VAX 8800 introduced something called micro-instructions. What is that? You essentially translate a CISC instruction into a sequence of RISC-like simple instructions. And since 1995 your Intel architectures use this: internally, your x86 instructions get translated to RISC-like micro-instructions. That takes care of many bad things about the CISC ISA. What about precise exceptions in a CISC ISA? They look extremely hard to support, because an instruction modifies state at different times and possibly multiple times. It's not as well defined as in the RISC case. For example, think of a string copy instruction, which could take multiple page faults, because you may be copying a string spanning multiple pages to some other place spanning multiple pages. Source pages can take page faults and destination pages can take page faults. So it can happen that you have copied some part of the string and then you encounter a page fault, because when you move to the second page you find the page is not there. So there could be many problems in handling precise exceptions. One easy solution here could be that before you start doing anything, you make sure that all the pages that you need are in memory: you touch all the pages, bring them all in, and then start the string copy. You can use a history or future file as we have discussed, but CISC makes that hard to do. Why is that? We have actually already discussed why; in CISC it's just magnified multiple times. Exactly: because of this problem, the same register may be written multiple times within one instruction. 
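The "touch all the pages first" idea can be sketched directly. A minimal model under stated assumptions: `page_present` and `fault_in` are hypothetical stand-ins for an MMU probe and the page-fault handler, and the 4 KB page size is just an illustrative choice:

```python
PAGE = 4096   # illustrative page size

def pages_spanned(addr, length):
    first, last = addr // PAGE, (addr + length - 1) // PAGE
    return list(range(first, last + 1))

def safe_string_copy(src, dst, length, page_present, fault_in):
    # Probe (touch) every source and destination page up front, so all
    # page faults happen before a single byte moves; the copy itself
    # then cannot fault part-way through.
    for page in pages_spanned(src, length) + pages_spanned(dst, length):
        if not page_present(page):
            fault_in(page)
    # ... the actual byte copy would run here, fault-free ...
    return "copied"

resident = {0}                        # only page 0 is in memory
faults = []
status = safe_string_copy(
    100, 8100, 5000,
    page_present=resident.__contains__,
    fault_in=lambda p: (faults.append(p), resident.add(p)))
assert status == "copied"
assert faults == [1, 2, 3]            # all faults taken before the copy
```

Note this only works if the OS pins the touched pages until the copy finishes; otherwise a page could be evicted again in between, which is one reason real CISC machines preferred resumable partial state instead.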
So the problem of versions gets magnified several times in this particular audience. So the same instruction may require multiple versions of the same register. Because you can have an exception anywhere in an instruction. Suppose you have written to some register two, three times you need to write to it maybe seven more times now you get an exception. So you may have to write all these versions. So we have decided to save and restore partially completed instructions. Essentially what we are saying now is that you maintain state to decide where to start. So maybe you have copied part of the string and then take an exception. You just remember that that I have done this much of work in this instruction. I will resume the execution in the middle of that instruction. So now essentially you are changing the precise exception definition a little bit. We are saying that you can actually resume at the beginning of an instruction. You can actually resume in the middle of an instruction also. Because exceptions now can happen in the middle of instructions. So I will quickly go over this particular processor. It just shows that the pipeline, bypass and all these things just follow a slightly different perspective compared to a five-stage pipe. So this particular processor implements a 64 between this and that your MIPS R3000 which is a 32 bit MIPS processor. And R4K is a family and R4400 is a member of that family which we will discuss. It has an eight-stage pipeline. Essentially what they have done is they have taken this five-stage pipe and has decomposed further to gain in terms of frequency. So what are these five stages? Selects PC you have the multiplexer which selects your PC and it starts instruction access. Okay. And instruction fetch yeah, so starts instruction access. And there is second cycle which actually completes the instruction fetch. So now we have two cycle fetch all the pipeline. 
The third stage is register file access, and this is also where you actually get to know whether the instruction cache hit or not. This is very interesting. You start the second stage of the pipe without actually knowing whether you hit in the instruction cache or not; you get to know it only in this cycle. If you hit the cache then of course everything is fine. Otherwise whatever you have decoded will be discarded, because you also start decoding in parallel. In this stage you decode, you do the hazard checks, and you activate interlocks if needed. Remember that the MIPS philosophy was that you catch all hazards in the decode stage and introduce interlock cycles at that point. Okay. The fourth stage is execution, where you execute branches — both the condition and the target — you do ALU operations, and you compute the effective address of loads and stores. Then there are three stages of memory access: the data access stages DF and DS, which are the two stages that do the data access, and TC, where you get to know if you hit in the data cache. So you still don't know, until the beginning of this cycle, whether the data you are dealing with is actually correct or wrong. Also in this stage you complete the store. So how do you do a store? Suppose you have a cache block which is 64 bytes, and a store operation will be modifying, say, 4 bytes of it. You first read out the cache block completely, modify those 4 bytes, and then write the block back to the cache. So the data access stages actually get you the 64 bytes out of the cache, and in the TC cycle you get to know whether the access actually hit in the cache, and then you do the store. So the question now is: what am I really accessing in the cache without knowing if it is a hit or not? Can you guess what is going on? Same thing here, right? I am saying that you start an instruction access, you do an instruction fetch, but you really don't know whether you hit in the cache or not.
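As a quick sanity check on the stage ordering just described, here is a tiny model. The stage mnemonics (IF, IS, RF, EX, DF, DS, TC, WB) are the ones MIPS documents for this pipe; the helper itself is mine and assumes one instruction issued per cycle with no stalls.

```python
# Eight stages of the R4400 pipe as described above: two fetch cycles,
# register read / decode, execute, two data-access cycles, tag check,
# and write-back.
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def cycle_of(instr, stage):
    """Cycle (0-based) in which instruction `instr` (0-based, issued one per
    cycle, no stalls) occupies `stage` of the eight-stage pipe."""
    return instr + STAGES.index(stage)
```

For example, the gap between when a load is validated (TC) and when the very next instruction would need the value (EX) comes out to two cycles — which previews the two-cycle load delay discussed below.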
Because usually the way you understand it is that you first look up the cache, and if you hit, you access the data, right? So what is going on here? I am doing it the opposite way. How do you figure out if there is a cache hit? Let's make a clear understanding. Do you know what a cache is? Did 220 cover caches? Who does not know about caches? There is no answer — oh, everybody knows. So how do you check the cache? Match the tag, and see whether the valid bit is set or not. The valid bit of what? Of the cache block, okay? First it should be valid, then the tag should match. What does matching mean? From the given address we take out some part of it, which is called the tag, and we match that against whatever tag is stored in the cache, in that block. If they match and the valid bit is set, we say it is a cache hit, which means I can now access the data. I am doing it here in the opposite way: I first access the data and then check whether the data was valid or not. Can I do that? How do I access the tag — how do I locate it, actually? The cache is an array of blocks. So how do I locate this particular tag? How do I locate the block? I will tell you. There are some bits in the address, called the index bits, which we use to find out which block to look at, right? Okay. So this will actually point to — this is my cache — point to some block, and this block will have data, a tag and a valid bit, right? Okay. Essentially this is what I am using here. So nothing really stops me. I have the address from my instruction, or if it is a fetch I have the program counter, so I have the full address with me. So I can calculate the index, and nothing stops me from accessing the data without accessing the tag, because the data will also be at the same index. So that is exactly what I am doing here: I access the data and I access the tag in parallel.
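The index/tag/valid mechanics just described can be sketched like this. This is a direct-mapped toy with assumed sizes (64-byte blocks as in the store example above, and 256 blocks picked arbitrarily) — the R4400's real cache parameters differ. The key point is that the data read needs only the index, so it can proceed in parallel with the tag comparison that later validates it.

```python
OFFSET_BITS = 6    # 64-byte blocks, as in the lecture's store example
INDEX_BITS = 8     # assumed: a 256-block direct-mapped toy cache

def split_address(addr):
    """Break an address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(cache, addr):
    """cache: list of (valid, tag, data) blocks.  The data is read out
    unconditionally, using only the index -- 'in parallel' with the tag --
    and the tag comparison then decides whether that read was a hit."""
    tag, index, offset = split_address(addr)
    valid, stored_tag, data = cache[index]   # data access needs only the index
    hit = valid and stored_tag == tag        # tag check validates it afterwards
    return (True, data[offset]) if hit else (False, None)
```

On a miss the data that was read out is simply thrown away — which is exactly the wasted energy the lecture complains about next.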
And while I am accessing the data I do the tag comparison and figure out whether this data was actually correct. Okay, so that is what I am doing here. Why does it help? Because now these two operations — tag access plus tag check, and data access — are not serialized. I can actually do them in parallel, at the same time. What am I losing? I am losing a lot of power in my cache. I actually spend a lot of energy here now, because some of the data accesses are going to be useless. Okay, so I end up burning energy. Student: Sir, for a set-associative cache, don't you have to read out all the ways? Yes — for a set-associative cache this is going to be much more costly. You have to access the data in all the ways and pick one of them depending on the tag match. So there you will end up spending an even larger amount of power. This R4400 uses — I don't remember exactly — at least a two-way set-associative cache. Does everybody follow what he is mentioning? He is saying that with a set-associative cache I end up doing even more useless work. But these things are done to save time; otherwise I would not be able to do this in three cycles, I would need more time. Same here, exactly the same thing. And then finally I do the register write. So what are the implications? I have a load delay of two cycles. Let's see what that means. Here is the pipeline diagram. Suppose that this particular instruction is a load. The value is available here, right? But I don't know whether the value is correct until I get to this stage, because that is where I get to know whether the cache block I am using is valid or not. Okay? So now think of an instruction that uses the loaded value. For example, this could be lw R2, 0(R1), and after that you have an add instruction which uses R2 — say add R3, R2, R2. All right? So it needs the value. Let's try to see what happens to this instruction.
So this instruction fetches here, issues here, does register file access here, so it needs the value here. There is no bypass possible at this point — the value is only available here, and I only get to know whether it is correct here. So — I am of course again assuming that all interlock stalls are introduced after the decode stage — I stall it for two cycles and move the execution here. Then I can at least get the value. I still don't know whether it is correct or not. And this is where things get very interesting. So they say that the load delay stall is two cycles, and these are the two cycles. But I am not yet sure whether the computation is correct. So what you do is actually make a speculation here. You say: well, what is the common case? The cache hit rate is usually very high. So most of the time I am going to be correct if I use the value. There will be some cases where I am going to be wrong, and in those cases I will have to roll back one cycle and re-execute. So it may happen that at the end of this stage you get to know: oh, this access was actually a cache miss. Essentially what you do then is introduce no-ops in the pipe and keep these instructions stalled here until the load miss resolves — it is going to take some time for the data to come back. This is actually called a blind speculation, where you are not using any history; you are just relying on the fact that caches are good, so they are going to give you a majority of hits and a minority of misses. This is called load hit/miss speculation, and it is widely used today in all microprocessors — though they use a more sophisticated prediction. This is a prediction, after all. They will actually look at the history of this particular load instruction and decide whether it is going to hit or miss in the cache. The miss case costs three cycles, as I just mentioned; you also need hardware to back up by one cycle.
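The cost of this blind speculation can be put in a tiny model. The numbers follow the lecture (a two-cycle interlock, plus a one-cycle backup when the guess was wrong), not any official R4400 figure, and the function name is mine.

```python
# Toy cost model for the load-use interlock plus hit speculation described
# above.  The decoder always inserts two interlock cycles; on a wrong guess
# the dependent instruction is squashed, the pipe backs up one cycle, and
# then waits out the miss.
LOAD_USE_STALL = 2

def dependent_issue_penalty(cache_hit, miss_latency):
    """Extra cycles a load-dependent instruction waits beyond back-to-back
    issue.  On a hit the blind speculation pays off; on a miss we pay the
    one-cycle backup plus however long the miss takes to resolve."""
    if cache_hit:
        return LOAD_USE_STALL
    return LOAD_USE_STALL + 1 + miss_latency
```

Since hits dominate, the expected penalty stays close to two cycles, which is why speculating on a hit is the right default.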
Because the miss may take longer, the backup hardware turns the dependent instruction issued in the last cycle into a no-op, and then stalls the pipe until the miss resolves. So essentially you turn this into a no-op and keep this instruction stalled here until the load data is available. The pipeline interlock is implemented to stall the dependent instruction for two cycles, and the interlock is actually decided here, by the decoder. So whenever this situation arises the decoder can figure it out, and it will not issue this instruction for two more cycles: wait here, then send it here, again do a tag check here, and if needed introduce a no-op and stall the instruction. Is that clear? The branch delay is three cycles. This is much easier to understand, actually. So assume that this is a branch instruction, and the resolution is here. So I cannot fetch here, I cannot fetch here, I cannot fetch here — I can fetch here. And one of these is filled by the compiler, which is the
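The three-cycle branch delay can be counted the same way. This assumes, as the stage description above says, that the branch condition and target are resolved in the execute stage (stage 4, counting 1-based from fetch); MIPS exposes one of the wasted fetch cycles as the architectural delay slot for the compiler to fill.

```python
# The branch resolves in the execute stage, so the fetches made in the
# cycles under it are wasted unless filled.  Stage numbers are 1-based,
# following the lecture's counting of the pipe.
BRANCH_RESOLVE_STAGE = 4   # EX
FETCH_STAGE = 1            # first fetch cycle

def branch_bubbles(delay_slots_filled=1):
    """Wasted fetch cycles after a branch, minus compiler-filled delay slots."""
    return (BRANCH_RESOLVE_STAGE - FETCH_STAGE) - delay_slots_filled
```

With the one architectural delay slot filled by the compiler, two bubble cycles remain out of the three.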