you compare with respect to this particular point in time, you will find that all these instructions are actually using the value after it is produced. For example, sub needs the value here, the and instruction needs the value here, and the or instruction needs the value here. So it is quite possible to forward the value produced here to these instructions. You just need some extra hardware to be able to do that, and that is what is called a bypass path. For example, the output of the ALU here will get bypassed to the input of the ALU in the next cycle, so that the sub instruction can consume the value. So this is what it looks like, logically speaking: the add instruction produces the value here and forwards it to the sub instruction in the next cycle. And it also forwards the value further. Remember that the value keeps getting carried forward through the pipeline registers until it reaches this pipe stage. So here, in this register, the value is still available, and it can be forwarded to this instruction so that it can be consumed here. So the idea is that you read a wrong value from the register file in the decode/RF stage, for example, but the bypassed value will overwrite it. And we also discussed how implementing this will require a bunch of comparators, which will drive the selection of the multiplexers at the inputs of the ALU, so that you can decide which value to consume. So the value in this particular latch will essentially get bypassed to the inputs of these two multiplexers. Or, if you want to imagine it separately, you can also have two separate multiplexers, which will select between the register-file value and the forwarded value based on the comparison of the register identifiers. Similarly, you can have an extra multiplexer here, which will select between this value and the value that is being forwarded from the output of the ALU. And then you have this multiplexer as usual, which will select, based on the instruction, whether to take the immediate or the next PC.
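The comparator-driven selection just described can be sketched in a few lines of Python. This is a behavioral simplification, not a hardware description; the register ids and values are illustrative only:

```python
def select_operand(src_reg, regfile_value, latch_dest, latch_value):
    # The comparator checks the consumer's source register against the
    # destination id carried in the EX/MEM pipeline latch; on a match,
    # the multiplexer picks the forwarded value, overriding the stale
    # value read from the register file in the decode/RF stage.
    if latch_dest is not None and src_reg == latch_dest:
        return latch_value
    return regfile_value

# add R1, R2, R3 followed by sub R4, R1, R5 -- R1 must be forwarded.
# The register file still holds a stale 0; the EX/MEM latch holds 42.
assert select_operand(1, 0, 1, 42) == 42   # bypass taken
assert select_operand(5, 7, 1, 42) == 7    # no match: register-file value used
```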
So that is the basic idea of bypassing. And then, of course, when you want to bypass from here, you do the same thing; the only difference is that it will require longer wires. So that path will also come here, and you will now select between this value and the value forwarded from here. So, as you can see, your multiplexer is getting bigger now. Previously we had to select between this value and the value forwarded from this latch; now we have one more value to select from. So it essentially means that as the number of source points of your bypass network increases, you will have bigger multiplexers. For example, if you look at this instruction, at its multiplexer you will have a value forwarded from here, and you will also have a value forwarded from here. You cannot disable those bypasses; they are always on. And there is, of course, another value coming from the register file. Out of these values you will select the correct one based on the register identifiers. Which identifiers will we compare here? So this particular instruction will take R1 and R7; these are its sources. What will it compare against? It will compare against the destination of this instruction, which does not match, in which case that bypass is discarded. But you will find that this one matches with this one, so it will actually consume this particular bypass value. So, as you can see, as the span of your bypass network increases, you will also require more and more comparisons. For example, here both of these sources will be compared against both of these destinations, and of course only one of them will match. So that is the basic concept here. We also mentioned one thing, about a phased register file, and that helps you resolve the hazard for the or instruction without a bypass.
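With several bypass sources feeding one multiplexer, the selection must prefer the youngest in-flight producer, since it holds the most recent value of the register. A minimal sketch of that priority rule (my own formulation, not from the lecture slides):

```python
def select_with_priority(src_reg, regfile_value, bypass_sources):
    # bypass_sources is ordered from the youngest producer (e.g. the
    # EX/MEM latch) to the oldest (e.g. the MEM/WB latch).  The youngest
    # matching producer must win, because it computed the register's
    # value most recently; the register file is the fallback.
    for dest, value in bypass_sources:
        if dest == src_reg:
            return value
    return regfile_value

# Two in-flight producers of R1: the newer result (99) must be chosen
# over the older one (42) and over the stale register-file value (0).
assert select_with_priority(1, 0, [(1, 99), (1, 42)]) == 99
assert select_with_priority(7, 5, [(1, 99), (2, 42)]) == 5
```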
So here the point is that the value is written in this particular cycle, and the or instruction reads the value in exactly the same cycle from the register file. If you do not take any extra care, the value that is read by this particular instruction will be wrong; it is guaranteed to be wrong, because the value that is written in this particular cycle will appear at the output of the register file only in the next cycle. That is how the register file is designed. So what we are saying is: if we can squeeze the register write into only half the cycle, that is, the positive (first) half, and we read from the register file only in the negative (second) half of the cycle, then this problem goes away, because now R1 will be written in the first half of this cycle, I will read R1 in the second half, and I will get the right value. That makes sure that this or instruction does not require any bypass; you get the right value from the register file itself. Of course, the condition here is that, given your processor frequency, you must be able to squeeze the register write into half a cycle and the register read into half a cycle; otherwise it is not possible. Otherwise, what will you do? Well, you have to enable another bypass, where this instruction forwards its value from the write-back stage to the execution stage. Just one small issue here. This is the end of the pipeline, so there is probably no latch here, no pipeline register here. The question is: if I really do not have this phased register-file write and read, so the register write actually takes the full cycle, then where do you bypass from? How do you accomplish this bypass? What is the meaning of this bypass? Where does the value come from? There is no pipeline register here to bypass from, and the register file will have the value only after the write completes.
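The phased (split-cycle) register file can be modeled directly: the write lands in the first half of the cycle and the read happens in the second half, so a same-cycle write and read of the same register see the new value. A minimal sketch, with the cycle split made explicit:

```python
class PhasedRegisterFile:
    # Writes commit in the first (positive) half of the cycle; reads
    # happen in the second (negative) half.  A write and a read of the
    # same register within one cycle therefore return the new value,
    # so no write-back-to-execute bypass path is needed.
    def __init__(self, n=32):
        self.regs = [0] * n

    def cycle(self, write=None, read=None):
        if write is not None:            # first half: commit the write
            rd, value = write
            self.regs[rd] = value
        if read is not None:             # second half: perform the read
            return self.regs[read]

rf = PhasedRegisterFile()
# Same cycle: the or instruction reads R1 while write back writes R1 = 10.
assert rf.cycle(write=(1, 10), read=1) == 10
```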
So you need a bypass from the register file itself, right? That is the important point here: you need a bypass path originating at the register file itself, which will forward the value to this particular stage. Student: But we would have to identify which register we are forwarding from. All registers hold values, so there should be a selector which chooses which register to forward from. Yes, your question is very relevant. You have 32 registers here; which one do you source the value from? That is what you are asking, essentially. So, in effect, we have to remember what value was just written in the previous cycle. Essentially you have to use an extra register which holds, for one more cycle, the value that was written in the last cycle. So in this case R1 will be held in this special register, because it was written in the last cycle, and it will be used to bypass to this particular stage. It is just one register, which always holds the value written to the register file in the last cycle. Student: And do we also need the destination register of that instruction, to guide the comparators? The opcode? No, we do not need the opcode; we need the destination register, to guide the comparators. Right, that is what I said: if you look at this particular instruction, you will compare R1 and R7 against R4 and R1. And where do you get that destination from? It is also in the latch; up to this point it is in the pipeline register, because you need it to write to the register file. Beyond this point it is not there, so we also have to remember the register identifier. So here you will require a special register which holds the value along with the register identifier. Yes.
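The special register described here, holding the identifier and value of the last completed write back, can be sketched as a small class. The names are mine; this is just a behavioral model of the idea:

```python
class WriteBackBuffer:
    # When the register file is not phased, one extra register keeps the
    # (id, value) pair written in the previous cycle.  The comparator
    # checks a consumer's source against this stored id and, on a match,
    # the buffered value overrides the stale register-file read.
    def __init__(self):
        self.last_dest = None
        self.last_value = None

    def on_writeback(self, dest, value):
        self.last_dest, self.last_value = dest, value

    def read(self, src_reg, regfile_value):
        if src_reg == self.last_dest:
            return self.last_value       # bypass the just-written value
        return regfile_value

buf = WriteBackBuffer()
buf.on_writeback(1, 123)                 # R1 was written in the last cycle
assert buf.read(1, 0) == 123             # stale RF read gets overridden
assert buf.read(7, 55) == 55             # other registers read normally
```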
And that will be used here to activate the bypass. So what have we done? In effect, we have added one more bypass path to this multiplexer. Now we have a bypass from here to here, here to here, and here to here, and of course one path coming from the register file itself, that is here. So out of these four we will select one based on what? This instruction will compare R1 and R9 against R6, R4, and R1, and this one will match with this one. So that bypass value will override the others; this value will come directly from that special register. What if we had a longer pipeline here? We would require more bypass paths; we would require a much bigger multiplexer. So remember this implication: if you make your pipeline deeper, two things happen. One, your branch penalty is going to increase; that we already discussed. That happens if you increase the pipeline depth on this side, before the branch executes. And if you increase the pipeline depth after the execution stage, you will make your bypass network more and more complicated. It will require more space, it will require bigger multiplexers, and it will probably slow down your pipeline. We also used a particular acronym to identify these hazards: they are called read-after-write (RAW) hazards. Why? Because you are reading after this register is written. We will look at other types of hazards as well. So the phased register file solves the three-cycle-apart RAW hazard; otherwise we require one more bypass path. And you always feed bypass values to the inputs of a stage. How many sources are there in the bypass network? In this case, if you imagine that your register file is phased, we have just two sources: one coming from the execution stage, one coming from here. There are two sources of bypass. And till now we have only seen bypasses going to the execution stage; the only destination we have seen so far requires a bypass into the execution stage.
So the question that is asked here is: do we need a bypass to the mem stage also? Can you think of an example where there will be a bypass path terminating in the mem stage? That is, the mem stage needs some value coming from somewhere else. So what are the values that the mem stage requires? What are the inputs to the mem stage? There are two things that a memory operation needs. What are they? Address. Address and size; yes, you are right, size is also one thing, and a value. What kind of instructions require a value? Stores, exactly. So the memory needs an address for sure. Somebody mentioned that it needs a size also; however, the size is not very important here, because it comes from the instruction itself. There is no dependence there; the size is not generated by another instruction. But the address may get generated by another instruction, and of course there is a value, for store instructions, which may also get generated by another instruction. So there are two things that we need for a memory operation. Now, can you think about a dependence? There is a memory instruction which needs this address and value; can you think of an example where one of these, or maybe both, get generated by another instruction? So here is an example. We have an add instruction which adds R2 and R3 and puts the result in R1. Then there is a load instruction which uses R1 to compute the address and puts the loaded value in R4. And immediately after that, it stores R4 back to some other address. So actually I am copying a value from one part of memory to another part of memory. So let us see how the pipeline will work in this case. Here is my pipeline diagram; remember that time goes in this direction. The load instruction requires R1, which is generated by the add instruction, so there is a bypass path that will be activated here. R1 will be consumed from this instruction through this bypass path.
And otherwise the load does not require anything else, so it will compute fine. The store instruction has a double dependence: it needs R1 from the add to compute the address, and it needs the value from the load as the store value. So let us see how it works. The mem-to-x bypass for this instruction will get activated to pass the value of R1. R1 is needed here to compute the address; remember that memory operations compute the address in the x stage. And the load instruction will bypass the value from here to here for the store. It is getting generated here, and you can bypass it here. So this is an example where you need a mem-to-mem bypass. Is there any other situation where the mem stage needs a bypass? So here is a general question that you might think about, which will help you answer many of these questions. Suppose you are focusing on a particular stage of the pipeline, say S, and you ask: which pipe stages can bypass values to stage S? What can you say about those stages? Will they come before S or after S in the pipeline? For example, let us look at the mem pipeline stage: will I ever require a bypass coming from the x stage? Can you see why not? Because x always appears before mem. If you lay out the pipeline, you will always find that x appears before mem, so there is no reason I would ever need a bypass from x to mem. Whatever instruction is currently in x has to be younger than the instruction that is currently in mem. If the instruction currently in mem is, let us suppose, this load instruction, then the instruction currently in x comes after it in program order. So a bypass always comes from a previous, older instruction; it cannot come from the next instruction, that is not possible.
So whenever you are looking at a bypass and trying to work out the possible sources, they can only come from stages after the consuming stage; they cannot come from stages before, that is not possible. And of course, a bypass can come from the same stage itself; here, a mem-to-mem bypass is possible. So in this case, since the only stage after mem is write back, there could be a possibility of having a bypass path from write back to mem, and we will come to that possibility very soon. So, anyway, how many destinations are there in the bypass network? In this case we have two, x and mem. So how many multiplexers do you require? Whatever number of destinations you have and whatever number of sources you have, there will be a cross product of those, and of course you have to look at how many inputs each stage has. So in this case, for example, the x-to-x bypass will require two multiplexers, because there are two inputs to the execution pipeline stage. And there is a mem-to-x bypass also. So there will be two multiplexers, each accepting two inputs from the bypass network and one from the register file, if I ignore the write-back-to-x bypass for now. And the mem stage will have just one bypass path for now; we will come back to the write-back-to-mem bypass very soon. There will be one multiplexer, selecting between this bypass value and the value coming in from the execution stage. So let me show you the diagram for the store. This is the store value, coming from the register file. You will be selecting the store value between this and the value that is getting bypassed from the output of the memory stage. Where is my value? Yeah, this value. This value will get bypassed, and I am selecting between this and this, based on the comparison of the register identifiers. So in this case I will be comparing R4 with the destination of this one, and there will be a match.
So I will say: oh, I have to take the bypass. So, as you make the pipeline deeper, you will need more and more bypass paths. Now the question is this: till now we have seen that, in all the examples, even though we do not have the value in the register file in time, we can avoid all stalls by bypassing. There is no bubble in any of these examples; we have bypassed the values exactly in time. Is that always possible? The answer is no; there will be stalls in certain situations. So here is an example, a typical scenario where you require a bubble: an instruction consumes a value that is loaded by the instruction just before it. Whenever you find that a value is produced by a stage S and the next instruction needs it in a stage that comes before S, that is not possible to fix by bypassing; that is impossible. Why is that? This is how it is described pictorially: this is where the value is produced, and this instruction needs the value here. That is impossible; it would mean going backward in time. This and instruction is okay, because it requires the value in the x stage, by which time the value here can be bypassed. So in this case you really have no choice but to stall the sub instruction for one cycle. You have to delay its execution by one cycle, to the point where the value can be bypassed. The technical term for this is a pipeline interlock; these are actually stall cycles. You need a hardware pipeline interlock to stall the sub instruction by a cycle. So this brings us to the acronym MIPS; now I can explain what it stands for. It stands for Microprocessor without Interlocked Pipeline Stages. And this load-use case was the one interlock that the original MIPS pipeline would have needed. So the question now is: the name suggests that there is no interlock; how can that be?
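One way to see when bypassing alone cannot help is to count cycles. The stage numbering and the formula below are my own way of stating the lecture's timing argument, not something from the slides:

```python
def stall_cycles(produce_stage, consume_stage, distance):
    # Stage numbering: IF=0, ID=1, EX=2, MEM=3, WB=4.
    # A producer makes its result available at the end of produce_stage;
    # an instruction `distance` slots younger reaches consume_stage at
    # relative cycle consume_stage + distance.  A bypass works only if
    # the value already exists by then; otherwise the pipeline interlock
    # must insert this many stall cycles (bubbles).
    return max(0, produce_stage - (consume_stage + distance) + 1)

EX, MEM = 2, 3
assert stall_cycles(MEM, EX, distance=1) == 1  # load, then dependent ALU op
assert stall_cycles(EX,  EX, distance=1) == 0  # ALU-to-ALU: bypass suffices
assert stall_cycles(MEM, EX, distance=2) == 0  # one instruction in between
```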
So there has to be a way to avoid this stall, and they relied on the compiler. They called this particular slot a load delay slot, and they relied on the compiler to fill it with an independent instruction. So correctly compiled MIPS code, in the early generations of MIPS, never had an instruction dependent on a load sitting in the load delay slot. That rids you of this particular hazard, which is why it is called Microprocessor without Interlocked Pipeline Stages: it relied on the compiler to fill the load delay slot with something independent, or a nop. Which essentially means that the code I am showing here is actually illegal MIPS code. Of course, the load delay slot got removed in later generations of MIPS, but in the earlier generations this was the case; this code would never be generated by a MIPS compiler. For your second assignment, the compiler that is handed out is actually a little more advanced: it can generate this piece of code, so you have to actually implement this particular interlock. Keep that in mind. Any question? Student: Sir, would phased execution be possible in this case? That is independent; the phased operation of the register file is independent of this. Student: But might there have been frequency constraints? Could be possible; they simply did not do it, that is all. There must have been frequency constraints. You can phase every stage if you want; that is possible. But by phasing a stage, essentially what you are saying is that I am splitting the pipe stage into two halves with no latch in between; it is really a multi-cycle pipe stage, that is what it means. Any other question? So, a little bit more on the bypass and stall logic. What does the bypass logic look like? How many forwarding paths are there? How large are the muxes? As you work through your second assignment, these concepts will become even clearer.
Because you get to draw the multiplexers on a piece of paper to figure out what the logic should be. Or, if you can work it out in your head without drawing, that is even better. But anyway, the point is that as you do that, it will become very clear how many of these things you require. What about load interlocks? The question is: where should I detect this interlock? If you go back to this example, there are several places where I could detect that there is a problem. I can detect it as early as here, as soon as the instruction is decoded, or I can detect it in this stage, where the value turns out not to be available; there is a hazard, so I can stall here. These are the only two possibilities in this case, but if you have a deeper pipeline, there are other options. However, detecting possible hazards early simplifies things, because you are not too far into the pipeline, which means you do not have too many instructions behind you; you can stall the pipeline early. And the fixed positions of rs, rt, and rd, the register identifier fields in the MIPS encoding, are important for register file access and hazard detection. This simplifies matters a lot, because you know exactly which bit positions of your instruction to compare. So in MIPS, all interlocks can be implemented in the decode stage itself. That is very good; it makes the design simple, because you do not have to distribute the interlock logic all over the pipeline; you can implement everything in one particular pipe stage. So what do you need? If you implement all your interlock logic in the decode stage, what do you have to do? You have to tell the fetcher to stall; that is one thing. The second thing is that you have to inject no-ops into the forward direction of the pipeline: you send all zeros to the execution stage, and you keep doing that until the interlock resolves.
So, MIPS R3K does not have any hardware interlock; as I just mentioned, the compiler fills the load delay slot. But the generations after that actually have the load interlock implemented in the decode stage. So in this case, how do you implement the interlock? It is fairly simple, actually. As soon as this instruction is decoded, you know its sources. All you have to do is check the previous instruction, which is currently sitting where? In which latch? In this latch, currently executing here, right? So all you have to do is take the contents of this latch and check two things: is the previous instruction a load instruction, and does the destination of that load match one of my sources? That is it. If the answer to both is yes, it means I have to stall. So, essentially, I have to tell the fetcher to stall for one cycle, and I have to feed all zeros into this particular latch in this cycle. From the next cycle onward, the proper action resumes. And this one we have already discussed: the branch target is computed in decode/RF. You can actually bring the branch condition evaluation into this stage too, if you want; that will lengthen your cycle a little bit, but it will rid you of all branch problems. Anyway, I am not going into that; we have discussed branches in great detail. So, the next thing: till now we have looked at this simple pipeline, where execution takes a single cycle. But that is far from reality. There are multi-cycle execution stages, and the primary reason for them is to support floating-point operations, which are much too complex to finish in a single cycle. For example, if you think about a floating-point multiplication, it is almost impossible to finish in a single cycle. Also, multiple functional units may be needed to avoid structural hazards.
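The decode-stage check just described, look at the latch of the instruction currently in execute, test whether it is a load, and compare its destination against the freshly decoded sources, can be sketched directly. A minimal model, with my own names for the latch fields:

```python
def must_stall(decoded_srcs, ex_latch):
    # ex_latch models the pipeline latch of the instruction currently in
    # the execute stage, as a pair (is_load, dest_reg).  Stall exactly
    # when it is a load AND its destination matches one of the sources
    # of the instruction just decoded: the classic load-use interlock.
    is_load, dest = ex_latch
    return is_load and dest in decoded_srcs

# lw R1, 0(R2) followed by add R3, R1, R4 -> one-cycle stall needed.
assert must_stall({1, 4}, (True, 1)) is True
assert must_stall({2, 4}, (True, 1)) is False   # sources are independent
assert must_stall({1, 4}, (False, 1)) is False  # previous op is not a load
```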
So, if you cannot pipeline these multi-cycle functional units, you may require multiple copies of them to avoid structural hazards, because if there is a delay of, say, 10 cycles in a functional unit which is not pipelined, the next operation cannot go in until this one finishes. So, for the following discussion, we will assume that we have four functional units: one integer ALU, which does all the integer operations except multiplication and division; a floating-point and integer multiplier, so we will carry out all our integer multiplications on the floating-point multiplier, which we can do without any problem; a floating-point adder, which also handles subtraction; and a floating-point and integer divider, so all integer divisions will be carried out on the floating-point divider. Is the architecture clear? Now, the latency of an instruction is defined as the number of cycles needed to produce the result from the time it issues. The textbook defines it in a slightly different way, but we will follow this definition in the lectures, so do not get confused. Assume that integer ALU instructions have a latency of one cycle; these are simple operations, so they finish in a cycle. Loads have a latency of two cycles. Why is that? Because, as we have already seen, the load computes the address in the execute stage and accesses memory in the mem stage; that is why it is two cycles. From the time the load instruction issues to the time the value is produced, it takes two cycles: the x stage and the mem stage. Floating-point add is four cycles, floating-point and integer multiply is seven cycles, and floating-point and integer divide is 25 cycles. That is the model of execution. Maybe I will list it here so that we can remember it: floating-point add, remembering that subtractions also happen in the same unit,
is four cycles; floating-point multiply, and integer multiply on the same unit, is seven cycles; and floating-point divide is 25 cycles. Any question on this particular model of functional units? We will use it in the subsequent slides to demonstrate how multi-cycle execution stages make your life much more complicated. There is one more term to define, the repeat interval of an instruction: the number of cycles between two instructions of the same category that can execute without a structural hazard. For example, the number of cycles between two floating-point additions that can execute back to back is the repeat interval of the floating-point adder. It depends on how the functional units are pipelined; for example, if the adder is not pipelined, you will have to wait four cycles before feeding in the next addition. So let us assume that all the units other than the divider are pipelined. Division then has a repeat interval of 25 cycles, while the other instructions can issue back to back with a repeat interval of one cycle. Which means there could be seven different multiplication operations in flight in the multiplier pipeline. So what does the pipeline look like? I have a fetch stage, I have a decode/register-file stage, and then I have different pipe stages: an integer ALU x stage of one cycle; a load pipeline, which is just x and mem; a floating-point adder, which has four cycles, so I will say a 1, a 2, a 3, a 4; similarly m 1, m 2, m 3, m 4, m 5, and so on up to m 7 for the multiplier; and the divider will have d 1, d 2, up to d 25. At the end I have a write-back stage, where all the instructions write their values. So this looks very different from our previous architecture. Let me also put in the green one, the load pipeline.
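The latency and repeat-interval model assumed above can be recorded as a small table; the structure and names here are mine, but the numbers are the ones just stated:

```python
# (latency, repeat interval) per functional unit, as assumed in the
# lecture's model: latency = cycles from issue to result; repeat
# interval = minimum cycles between two operations of the same category.
FU_MODEL = {
    "integer ALU":         (1,  1),
    "load":                (2,  1),
    "FP add/sub":          (4,  1),
    "FP/integer multiply": (7,  1),
    "FP/integer divide":   (25, 25),  # the divider is not pipelined
}

# A fully pipelined unit (repeat interval 1) can hold latency-many
# operations in flight, e.g. up to 7 multiplies in the multiplier.
latency, repeat = FU_MODEL["FP/integer multiply"]
assert latency // repeat == 7
```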
Now there are four different pipelines going on, and for the load instruction, if I want, I could even put a separate ALU to compute the address, making it an address stage followed by a memory stage; maybe I should do that. So there are five different pipelines. They all diverge from the decoder: depending on the decode outcome, I decide which pipe to send the instruction to, and finally they all write back. Clearly, you require more pipeline latches, or pipeline registers; between all these stages you require pipeline registers. Any other complications? That is what the remaining slides are about. So let us start with a simple one: structural hazards. The divider is not pipelined, and that causes a big problem, because if you have two division operations separated by less than 25 cycles, you will have a problem: the second instruction has to be stalled. And I can decide this in the decoder itself, since I want to implement all my interlocks in the decode stage. So whenever I have a division instruction, I check whether there is currently anything in the divider; that state can easily be maintained here, so that I can stall the instruction. Clearly, that is going to introduce bubbles into my pipeline. Now, there are two things you can do here. One is that whenever you push a division in, you say that for 25 cycles I am not going to issue anything else, because the register file will be written after 25 cycles. The other thing you can do is to say: well, I will continue issuing instructions, and only when I get the next division will I check and then stop. That is slightly better, in fact much better, because what you are exploiting there is that division operations are not that frequent, so you get a lot more instructions into these four pipelines while the divider operates.
That, of course, makes your life much more complicated, because now you have to be very careful, while feeding these instructions into these pipelines, to make sure that they are independent of the running division operation. So that is one problem, concerning the divider. The second structural hazard arises due to register write ports. Here is an example; let us take it and see what actually happens. I have a multiplication instruction. How does it execute? Fetch, decode, m 1, m 2, m 3, m 4, m 5, m 6, m 7, write back; that is the timing of the multiplication instruction. Then I have two more instructions, which I do not care about; they fetch here. Then I have an add instruction; let us see how it executes: fetch, decode, a 1, a 2, a 3, a 4, write back. Then I have a couple more instructions, and then I have a load instruction, which has fetch, decode, x, mem, and then a write-back stage there. The problem is with these two instructions: they try to access the register file in the same cycle. They want the write port in the same cycle. They may be writing to different registers, but they would need two separate write ports to be able to write. So that is another structural hazard that comes up, because of the different latencies of instructions. You have to be careful scheduling the register write port; you have to make sure that there is no clash in the write-back stage, because unless you have more ports, you cannot really allow this to happen. So, again, you have a choice: either you add more write ports, or you have to introduce hardware interlocks, which means stall cycles. So where do you introduce this particular register write-port interlock?
So, what are the options? You could detect right here, at decode, that there is going to be a clash at the register write port when this instruction goes through, because I know exactly the latency of every instruction; the decoder sitting here can actually calculate whether there is a conflict for the register write port or not. Usually this is implemented with a shift register. Essentially, you maintain a shift register, and when this multiplication operation issues from the decoder, you calculate how many cycles hence it will actually access the register file. The bits in my shift register correspond to future cycles: 1, 2, 3, 4, 5, 6, 7, 8. In this cycle m 1 will execute, then m 2, m 3, m 4, m 5, m 6, m 7, write back; so I will mark a 1 in the shift register at the position corresponding to the write to the register file. Every cycle, as I decode a new instruction, I shift this register by one position. So by the time I decode this instruction, this bit has moved here; for this instruction, this bit moves here. So when I decode this add instruction, this bit is sitting at this position. Now I calculate for the add: a 1, a 2, a 3, a 4, write back, and I find that there is a clash: this instruction is going to access the register file in the same slot where that bit is already set. So whenever there is a clash, I introduce an interlock: let us delay the add by one cycle because of this interlock. The marked bit moves on by one position, and then when this instruction issues, its own bit will follow it, so there will be two ones in the register. Subsequent instructions will check both bits. Essentially, whenever I issue a new instruction, I prepare a mask with exactly one bit set and AND it with this shift register.
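The shift-register write-port scheduler above can be modeled with an integer bitmask: shift right once per cycle, and AND a one-hot mask against it at issue time to detect a clash. A behavioral sketch, with my own class and method names:

```python
class WritePortScheduler:
    # The decode stage keeps a shift register with one bit per future
    # cycle: bit k set means an in-flight instruction will use the single
    # register write port k cycles from now.  Shift by one each cycle;
    # AND a one-hot mask for each newly decoded instruction against it,
    # and stall the instruction until the AND comes out zero.
    def __init__(self):
        self.busy = 0

    def tick(self):
        self.busy >>= 1                    # one cycle elapses

    def try_issue(self, cycles_to_writeback):
        mask = 1 << cycles_to_writeback
        if self.busy & mask:
            return False                   # clash: stall this instruction
        self.busy |= mask                  # reserve the write-back slot
        return True

sched = WritePortScheduler()
assert sched.try_issue(7)                  # multiply: writes back 7 cycles hence
for _ in range(3):
    sched.tick()                           # three unrelated instructions go by
assert not sched.try_issue(4)              # FP add would write back same cycle
sched.tick()                               # stall the add for one cycle...
assert sched.try_issue(4)                  # ...now the two slots differ
```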
If the output of the AND is non-zero, I know that there is a clash and I have to stall the current instruction. How many cycles do I have to stall? I have to re-check on each cycle of stalling to see whether the conflict resolves, until I get a zero outcome on the AND operation. So that is one option: I can detect in the decode stage. The second option is to let the instruction go and detect the clash in the memory stage or the write back stage. Essentially, memory or write back is the last stage in this case, so I delay the check until the instruction reaches there and then detect it, which is fine also. The only problem is that you may have to introduce stalls into multiple different pipelines. Here I am showing only two instructions that clash; there can be more. For example, if I did not have this instruction here and I had the load instead, that would also clash: if this instruction were a load, it would actually have a write back here as well. So if you delay detection until the write back stage, you may have to stall multiple pipelines. Which one should I stall? Of these two instructions, maybe more, everybody except the first one: the oldest instruction is allowed to proceed, the others are stalled. Is this clear so far? So we have seen two structural hazards, one involving the non-pipelined divider and another involving the register write port. Any non-pipelined unit may cause a hazard, and your write ports may cause a hazard, depending on how you resolve the schedule. You will also have new RAW hazards. Here is one more example; let us work through this. By looking at it, let us see what the dependencies are. This add instruction consumes the result of the multiplication, and it also consumes the result of the load. So both of its sources depend on two previous instructions. This is my load: fetch, decode, execute, the traditional pipeline for a load. The mult: fetch, decode; does it have any dependency? No.
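The shift-register interlock described above is easy to model in software. Here is a minimal Python sketch, assuming the latencies from the example in the lecture (multiply writes back 8 cycles after decode, add 5 cycles after decode); the class name and exact modeling of stall cycles are my own, for illustration only.

```python
class WritePortInterlock:
    """Shift-register scoreboard for a single register write port.

    Bit i of `schedule` is 1 if some in-flight instruction will use
    the write port i cycles from now.
    """

    def __init__(self):
        self.schedule = 0  # bit vector of reserved write-back slots

    def tick(self):
        # One cycle passes: every reservation moves one slot closer.
        self.schedule >>= 1

    def stalls_needed(self, writeback_latency):
        """Cycles to stall an instruction whose write back would occur
        `writeback_latency` cycles after its decode."""
        stalls = 0
        # AND a one-bit mask against the schedule; keep delaying while set.
        while self.schedule & (1 << (writeback_latency + stalls)):
            stalls += 1
        return stalls

    def reserve(self, writeback_latency):
        self.schedule |= 1 << writeback_latency


# Hypothetical decode-to-write-back latencies, per the lecture's example:
MUL_LAT = 8   # M1..M7, then write back
ADD_LAT = 5   # A1..A4, then write back

port = WritePortInterlock()
trace = []
# Decode order: mult, two don't-care instructions (write no register), add.
for lat in [MUL_LAT, None, None, ADD_LAT]:
    if lat is not None:
        s = port.stalls_needed(lat)
        trace.append(s)
        for _ in range(s):
            port.tick()       # stall cycles also advance the schedule
        port.reserve(lat)
    port.tick()               # the decode slot itself takes one cycle

print(trace)  # -> [0, 1]: the add must stall one cycle, as in the lecture
```

The add's write back would land 3 + 5 = 8 cycles after the mult's decode, exactly the mult's write-back slot, so the interlock inserts the single stall the lecture describes.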
So this is M1, M2, M3, M4, M5, M6, M7, write back. The add: fetch, decode, and it needs $F0, which is produced only here. So you have a very long stall: I cannot start A1 until this cycle, where you can actually have a bypass from M7 to A1. However, the load instruction will be complete by then, so there is no problem on that side as such; its value can be bypassed, since it is available by this particular cycle. Then I have a store: fetch, and then it is stalled and cannot do anything; it can only decode here, and you continue like this. So you now have more hazards to take care of. Next, there is a new type of hazard, which is write after write. You have an add instruction which produces $F2, then some instruction in between, and then a load instruction which also produces $F2. Let us try to see what happens in the pipeline. The add goes through fetch, decode, A1 to A4, and write back; that cannot be stopped. Then I have the load: fetch, decode, X, M, and the write back stage. So you can see that there is a problem: both instructions write into the same register in the same cycle. Now, if you have already implemented the write port interlock, it will have resolved the port clash: say the load's write gets scheduled a cycle ahead, since its value is already available after the memory stage, so the two writes land in different slots and both write to the register file with no port conflict. So there is no write port clash any more, but there is still a problem. What is that? Exactly: what is the final value of $F2 that survives? It is the add instruction's write, which is not correct; $F2 should hold the value of the load instruction at the end of this sequence of instructions. This is called a write after write hazard: two instructions have the same destination, and the earlier instruction writes later than the subsequent one.
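The length of a RAW stall like the one above can be computed from when the producer's result becomes available for bypass and when the consumer first needs it. The helper below is my own sketch, with made-up stage offsets counted from issue; it assumes, as a simplification, that a value finishing in M7 can be bypassed into A1 in that same cycle.

```python
def raw_stall_cycles(producer_issue, producer_ready_stage,
                     consumer_issue, consumer_need_stage):
    """Stall cycles so that the consumer's need stage does not come
    before the cycle in which the producer's value can be bypassed.

    Stages are counted as offsets from the issue cycle (fetch = 0).
    """
    value_ready = producer_issue + producer_ready_stage
    value_needed = consumer_issue + consumer_need_stage
    return max(0, value_ready - value_needed)

# The mult issues at cycle 0 and its result leaves M7 at stage offset 8;
# the dependent add issues at cycle 2 and needs both sources in A1
# (stage offset 2).  With the M7-to-A1 bypass, the add stalls 4 cycles.
print(raw_stall_cycles(0, 8, 2, 2))  # -> 4

# The load issues at cycle 1 and its value is ready after M (offset 3);
# by the time the add finally reaches A1, that value is long available.
print(raw_stall_cycles(1, 3, 2, 2))  # -> 0
```

This matches the observation in the lecture: the multiply forces a long stall, while the load's result is complete well before the add can start A1, so it causes no extra delay.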
Of course this is a problem, but before handling it, let us ask: is this realistic at all? Can it really happen that two back-to-back instructions write to the same register? Think about what that means in a program: I do an addition and write the result to a variable x, and in the very next instruction I say x equals some other value. That is equivalent to C code which says x = y + z and then x = 0. Why would I ever write that? I never used the first result. So the question really is whether this can happen, and here is a piece of code straight from a MIPS compiler. Of course, there could be many other situations where this can happen. For example, if instead of an addition you did a multiplication here, you would have a much longer pipeline, and even with a few instructions in between before the load, you would have the same problem. Here, what happens is that the compiler fills the branch delay slot with an instruction whose target is $F0. That instruction will always execute, and then the branch jumps to a label where you actually load a new value into the same register. This is perfectly legitimate code; there is no problem with it, and it will lead to the same WAW problem with the result. So the question is, how do we really handle this? Again, we take the same approach: introduce an interlock to delay issue of the later instruction, or prevent the earlier one from writing. So there are two things we can do. One is that we can figure out that this load instruction's write, sorry, this add instruction's write, is not going to be of any use.
So we can nullify that write back completely. The second option is that we can delay this particular instruction by however many cycles are needed to make sure that the hazard goes away, and again we can do this in the decode stage. Again we can take the help of the shift register circuit; it may be a slightly different one, because we should look not only for a clash on the write port, but also check the register destinations, which are also important here, and in that case introduce the delay. So what is the hardware? You can do this with a shift register. For hazard detection, we need to look across integer as well as floating point instructions; that is essentially what we have just seen here, and focusing on the integer side alone is not enough. Integer and floating point instructions use separate register files; there are two separate register files for these two. So you might wonder: could there be a bypass coming from an integer instruction going to a floating point instruction? The answer is yes, even though they operate on two different register files; we will look at examples very soon. Floating point loads and stores use integer registers as the base. This is one example where a floating point instruction may require a value to be bypassed from an integer instruction: for example, if $2 was being computed in the integer pipeline, you might require a bypass from that instruction to this floating point load, even though the load itself is a floating point load. So there can be hazards between integer and floating point instructions as well. There are also the move instructions, MTC1 and MFC1; we discussed these earlier when we were discussing the MIPS ISA. They move a value to or from the floating point register file and the integer register file.
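To catch WAW hazards, the shift register has to track, per future write-back slot, which register will be written, not merely whether the port is busy. The following Python sketch is my own illustration of that idea; the dictionary-based "pending writes" structure and the latencies are assumptions, not the lecture's exact hardware.

```python
def waw_stalls(pending, latency, dest):
    """Stall cycles for a newly decoded instruction.

    `pending` maps cycles-from-now -> destination register of an
    in-flight write.  We stall until (a) our write-back slot is free
    on the port, and (b) no earlier in-flight write to the same
    destination register lands *after* ours, which would be a WAW
    hazard.
    """
    stalls = 0
    while True:
        slot = latency + stalls
        port_busy = slot in pending
        overtaken = any(reg == dest and s > slot
                        for s, reg in pending.items())
        if not port_busy and not overtaken:
            return stalls
        stalls += 1

# The add will write $f2 six cycles from now; a later load also targets
# $f2 with a three-cycle latency.  The load must wait until its write
# lands strictly after the add's write, i.e. four stall cycles.
print(waw_stalls({6: "$f2"}, 3, "$f2"))  # -> 4

# A write to a different register only has to avoid the port clash:
print(waw_stalls({6: "$f2"}, 3, "$f4"))  # -> 0
```

Delaying the later instruction this way is the second option from the lecture; the first option, nullifying the older useless write, would instead remove the entry for the add from the pending set.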
So why are these needed? We have also discussed this; does anybody remember why I need to move a value from the floating point register file to the integer register file, or vice versa? Conversion, exactly. If you want to cast a floating point value to an integer value, you make a move from the floating point register file to the integer register file, and the other direction is required for the opposite cast. So in these last two cases, we need to detect hazards between integer and floating point instructions as well, because MTC1, move to coprocessor 1, that is, to the floating point file, has a floating point register destination and an integer register source, and that source may be produced by an integer instruction; so there can be a hazard coming from an integer instruction to this one. Similarly, MFC1 has a target integer register and a source floating point register, so it may actually start a bypass toward an integer consumer, because its destination is an integer register. Otherwise, hazards can happen between integer instructions only, or between floating point instructions only; that simplification is made possible by having separate register files. Are there any problems with having separate register files? The good thing is that this simplifies your bypass network: if you shared a single register file, you would actually have a lot more cases of bypassing. That is one reason why you would not unify the register files. The other problem is that a unified file would probably have to be a bigger register file, which may actually slow down your processor. So it makes sense to keep them divided; it is a useful simplification.
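The cross-file hazard cases above reduce to knowing, per opcode, which register file each operand lives in. The little classifier below is a sketch of that bookkeeping; the opcode names follow MIPS, but the table itself (and its simplification of stores to their base-register source) is my own illustration.

```python
def register_files_used(opcode):
    """Return (source_file, dest_file) for hazard checking.
    'int' and 'fp' name the integer and floating point register files;
    None means the instruction writes no register."""
    table = {
        "add":   ("int", "int"),
        "add.s": ("fp",  "fp"),
        "lwc1":  ("int", "fp"),   # FP load: integer base register, FP destination
        "swc1":  ("int", None),   # FP store: integer base (FP data source omitted here)
        "mtc1":  ("int", "fp"),   # move to coprocessor 1: int source, FP dest
        "mfc1":  ("fp",  "int"),  # move from coprocessor 1: FP source, int dest
    }
    return table[opcode]

def raw_hazard(prod_op, prod_dest, cons_op, cons_src):
    """A RAW hazard needs matching register numbers *and* matching
    register files; with separate files, same numbers in different
    files are different registers."""
    prod_file = register_files_used(prod_op)[1]
    cons_file = register_files_used(cons_op)[0]
    return (prod_file is not None
            and prod_file == cons_file
            and prod_dest == cons_src)

# Integer add producing $2, feeding the base register of an FP load:
print(raw_hazard("add", 2, "lwc1", 2))   # -> True
# MTC1 reads an integer source, so an integer add can feed it:
print(raw_hazard("add", 2, "mtc1", 2))   # -> True
# A pure FP add writing register 2 cannot clash with integer register 2:
print(raw_hazard("add.s", 2, "add", 2))  # -> False
```

This captures the lecture's point: besides the loads, stores, and the two move instructions, a producer and a consumer can only conflict when they are both integer or both floating point, which is the simplification that separate register files buy you.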