We want to pipeline this function, where these two computations depend on a value r which is itself computed from x. So this is the computation we want to pipeline. As you can see, for a particular value of x you still need the previous value of r, and then you update r. A combinational implementation would take the value of r from the latch, whatever was latched by the previous computation, and the new value of r would go into the latch at the same time as you take the current output. The way we tried to pipeline it was to say: put a pipeline latch here, so this is stage one and this is stage two. That does not work, because this particular value is not available in time; the forward path here goes into the latch. A particular value of x comes in, I compute this, and on that clock the result is still here, not yet there. So the next value of x cannot take it from the latch; it should take it from before the latch, where it is already available. That is called a bypass. Now, can we bypass r in this case? We cannot, and here is why: bypassing comes into the picture only if the value has already been computed, and in this case it is computed after this point in the stage. Essentially, you need the value of r in a particular cycle, and it is being computed in that very same cycle; that is not possible. So let us draw the timing diagram that we did last time. In this case we have 4 stages, and time goes in this direction.
So this is, let us say, x1, and this is x2. In this particular cycle s4 is computing r for x1, and that r is needed in exactly the same cycle, which is not possible. Such hazards are not easy to resolve. As we formalized last time: if your data is produced in a stage that comes after the stage where it is needed, the hazard is not easy to resolve. So can somebody suggest the only way to resolve it? Going back to the example: this is my pipeline, and I cannot change the pipeline. r is computed in s4 but needed in s3 by the next computation. What would you do in this case? A queue? What kind of queue? Remember, at time 20 r is being computed, and at the same time I need it. It is not there yet; it will be available only in the future, so no buffer helps by itself. Delay, yes: I can delay s3 by a cycle. But then for every computation of x I lose one cycle. What do you normally do to fill such a slot? Schedule some other instruction. But I have nothing else here; this is a very specialized computer which computes only this particular function. Someone suggests forwarding the value before it reaches storage. Well, yes, but the earliest we can get it is here; it is impossible to read it any earlier, because the value will be computed only at the end of this particular cycle. Remember that these are clocked stages; you cannot start something at an arbitrary point.
So how do you know what tomorrow brings? You predict. Can I predict the value of r? That is possible if the values I am computing are not completely random. If there is enough of a pattern, I should be able to predict the next value. So I can predict the value of r here and use it, but I will know whether the prediction was correct only one cycle later. In case of a wrong prediction I have to undo what I did. This scheme works only if your prediction accuracy is more than 50 percent, which means in more than half of the cases you are going to win; otherwise you will be hurting yourself. Is this whole prediction concept clear to everybody? Often, though, the case is that when you detect a misprediction, undoing the wrong work and redoing the right thing is so expensive that you require a much higher prediction accuracy to outweigh the loss. For example, you may need 90 percent prediction accuracy to outweigh the cost of 10 percent mispredictions. We will look at this particular problem again in a different context, but I wanted to bring it up here. There will be some piece of hardware sitting here which monitors the pattern of the values of r and says, for example, that in the past 10 instances it has seen the value 0 from r. A pattern miner will be sitting here, giving me the predicted next value of r, which is verified here. So, going back to where we started: we were looking at this particular computation, and we wanted a bypass path to fix it. And we said that if the source stage of the data comes before or is equal to the destination stage, even if the data is written to storage in a later stage, we can resolve such a hazard by a bypass.
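As a software analogy to the hardware predictor described above (the class name, the API, and the accuracy bookkeeping are my own illustration, not the circuit itself), a minimal last-value predictor might look like this:

```python
class LastValuePredictor:
    """Predicts that the next value of r equals the last observed value."""

    def __init__(self, initial=0):
        self.prediction = initial
        self.hits = 0
        self.total = 0

    def predict(self):
        # The speculative value fed to the consuming stage this cycle.
        return self.prediction

    def verify(self, actual):
        """Called one cycle later, when the real r is known.
        Returns False when the pipeline must undo and redo."""
        correct = (actual == self.prediction)
        self.hits += correct
        self.total += 1
        self.prediction = actual   # learn the newly observed value
        return correct

p = LastValuePredictor()
stream = [0, 0, 0, 5, 5, 5, 5]              # mostly-repeating values predict well
results = [p.verify(v) for v in stream]     # one verify per cycle
```

The verify step models the one-cycle delay in the lecture: the stage speculates with predict(), and only a cycle later does verify() say whether to keep the result or squash it.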
In this case the source stage, where I produce r, is this one, and the destination stage is also this one, so with a bypass I can resolve it. And if I change the computation further, say I want x times r cubed, I have to keep on bypassing until the value ends up in storage. If you look at a particular stage s, say this one, it is demarcated by two latches from the remaining pipeline. The latch in front of it contains the r produced by the current computation: the current value of x produces an r which gets latched here. The latch behind it contains the r produced by the previous computation, and that is what I want, which is why I take the bypass from this latch as opposed to that latch. Is the bypass concept clear to everybody, in particular which latch the bypass comes from, the front latch or the back latch? This one will eventually contain the final value of r, but of course it takes several cycles to propagate through the latches. Bypassing only works when you take the value from the latch behind; otherwise it is not a bypass, you are just consuming the value that you yourself produced, nothing else. As I said last time, the computer design cycle today is about 5 or 6 years; that is the time it takes to design a microprocessor. For example, for a microprocessor that ships in 2016, the design started in 2010. It goes through several phases, and the work usually starts with simulation, because you cannot just sit at a round table, decide the design, and send it straight to the fab. That is costly, and if the design has bugs you lose a lot of money.
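In software terms, a rough analogy I am adding here (with f standing in for the stage's combinational logic), the rule "bypass from the latch behind" amounts to reading the previous r before overwriting it, never after:

```python
def f(x, r):
    # Stand-in for the stage's combinational logic that updates r.
    return x + r

def run(xs):
    back_latch = 0                 # r produced by the previous computation
    outputs = []
    for x in xs:
        r_new = f(x, back_latch)   # bypass: consume the previous r
        outputs.append(r_new)
        back_latch = r_new         # new r is latched at the clock edge
    return outputs
```

Reading back_latch after the assignment instead would just hand the stage its own freshly produced value, which, as the lecture says, is not a bypass at all.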
You have to simulate your design to make sure that it is correct, which is the most important criterion; that it gives good performance, the second criterion; and that it is robust and does not fail. So simulation is very important, and when you simulate these computers you have to simulate a pipeline. By simulating I mean that you write a piece of software which mimics the behavior of this particular hardware. Then, without building the hardware, you can feed the software your input x, produce the output f(x), and measure the performance: how many cycles you consume, and so on. So let us take a look at the simple example we started with yesterday, the cubing circuit. It is a two-stage pipe; call the stages m1 and m2, with a latch in between. You can see that I can easily write down a piece of software describing each block; it is a multiplier. The question arises about the latches: this latch here, and this one. How do you manage them? You might wonder what this has to do with the subject, but let us think about it. The first question that arises is: in a sequential simulator, in which order would you simulate these two pipe stages to capture the correct behavior in each cycle? Remember that in a particular cycle both stages must be working; it is a pipeline, with m1 working on the current input x and m2 working on the previous input, in the same cycle. So, simulating a particular cycle, I have to decide the order. Of course, there are two possibilities: first m1 then m2, or first m2 then m1. Which one is correct?
In a sequential simulator, remember, your simulator is a sequential piece of software. If you first simulate m1, latch the value produced by m1 here, and then invoke m2, m2 will see the new value and immediately compute on it. That is not correct behavior; m2 should be computing on that value only in the next cycle. So it has to be the other way; I do not have any other option. What if I compute m2 first and then m1? One suggestion was to not update the latches, which essentially means keeping some internal storage for the value computed by m1; we can avoid that. Suppose I do m2 first and then m1. In the first cycle m2 goes first, finds nothing, and has nothing to do; then m1 computes something. In the next cycle m2 picks up what m1 computed, and m1 picks up the new input. That is right. So, rule 1, keep this in mind: a pipeline simulation should always go from back to front through the pipe stages, and this is the reason why. Is it clear to everybody why you cannot do m1 before m2? It is a sequential piece of software: if you do m1 first, m1 updates the latch, and then when you invoke m2, m2 immediately sees the updated latch and computes on it. In effect, in the same cycle you have invoked m1 and m2 on the same input. Maybe I will show you. Suppose the main loop of the simulator is: while(1) { take input x; L1 = m1(x); fx = m2(L1); cycle++; }. That is one cycle of the simulator. I take the input x, compute, put the result into L1; m2 picks that up and computes f(x); the whole computation finishes in one cycle.
So I take an input x and produce the output in one cycle. That is wrong; it should have taken two cycles. But if I switch the order, see what happens. First cycle: m2 gets invoked and does nothing, because there is nothing in the latch; m1 takes x and puts the result into L1; the cycle counter increments. Next cycle: m2 picks up whatever was put there by m1 and computes f(x). So it takes two cycles before you see the first output, and after that one output comes out every cycle. Yes, at the very first clock you may get some garbage value, which is why initializing this latch matters. What is the problem if m1 and m2 both pick their values from the latch? Well, if you do m1 and then m2, you take an input x and produce f(x) in one cycle. That is the problem, because L1 here is just a variable, nothing else. If you write fx = m2(L1) after updating L1, the function takes up L1 and produces f(x) in the same cycle. That is not correct; that is wrong behavior. My pipeline is supposed to produce an output two cycles after it takes the input; the wrong order effectively mimics the combinational circuit. And no, you do not miss the first output in the correct order; in the wrong order the latched value actually gets overwritten. I mention the sequential simulator just to make you comfortable with this way of thinking, because when you go to a parallel simulator the problem does not change. In a parallel simulator I might have one thread simulating m1 and another thread simulating m2, and then I have simply pushed the problem to the scheduler.
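The two orders can be compared in a small sketch (Python is my choice here; the stage functions and the tuple-valued latch are stand-ins for the cubing circuit, with the latch carrying x alongside x squared so that m2 can multiply by x):

```python
def simulate(inputs, back_to_front=True):
    """Two-stage cubing pipe: m1 computes x*x, m2 multiplies by x.
    The latch L1 carries (x, x*x) between the stages."""
    L1 = None                                # empty latch at reset
    outputs = []
    for x in inputs:                         # one loop iteration = one cycle
        if back_to_front:
            # Correct order: simulate m2 first, then m1.
            if L1 is not None:
                x_old, sq = L1
                outputs.append(sq * x_old)   # m2 works on last cycle's latch
            L1 = (x, x * x)                  # m1 latches for the next cycle
        else:
            # Wrong order: m2 sees the value m1 latched this very cycle.
            L1 = (x, x * x)
            x_old, sq = L1
            outputs.append(sq * x_old)       # cube appears in the same cycle
    return outputs
```

With the correct back-to-front order the first output appears a cycle after its input and the last input is still in flight when the loop ends; with the wrong order every input yields its cube in the same cycle, which is the combinational behavior, not the pipeline.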
The scheduler has to decide which one to simulate first, m1 before m2 or m2 before m1. Same problem, exactly the same problem. So that was our cubing circuit. Now let us take up a more general problem: how to simulate a general pipeline. I have two stages here again, a and b, and suppose the latency of b depends on the value a computes from x. Whatever value gets latched here determines how long b is going to take to compute. And the maximum latency of b is much higher than the fixed latency of a, which means the computation time of b is highly non-deterministic: sometimes it takes a very small amount of time, sometimes a very large amount, depending on the value of x. Now, if you clock the pipe at 1/max(a, b), you get the worst-case performance all the time. That is correct, there is no problem with that, and it is what we learnt: if you have two stages, you should take the max. But in this case it is going to give you very poor performance, and in most cases we should be able to do better, because we are essentially designing the circuit for the worst case. So the question is: what if I clock it faster, at 1/a? By the way, small a and small b are the latencies of stages a and b. Of course, in some cycles b might still be busy with its value when a is already done. So what is the solution? Replace the latches by queues. I put a queue here as well, replace these latches by queues, and clock the pipe at 1/a. What is now going to happen? In some cases the queue drains at a faster rate than it gets filled, or maybe at the same rate; in some cases it drains at a slower pace.
So this queue will grow and shrink depending on how fast b computes. Of course, there will be cases where the queue is full because b is running slow, so a cannot inject any more. That automatically puts pressure on the input queue, and eventually the environment backs off. But eventually b finishes computing, the queue starts draining, and the pipe starts moving again. This is a very typical scenario in a processor, where you will find that different pipe stages have different latencies. And of course you will not clock at 1/max over all the pipe stages, because that is the worst case. You will be optimistic, clock at a faster rate, and put queues in between to take care of this slack. Is it clear to everybody why I want to replace latches by queues? Yes, you can think of a latch as a one-entry queue, exactly. But will some subsequent stages then sit empty, with only the part up to the queue moving? Possibly, yes. Why empty? Because b is busy computing something, which is why the queue is full; beyond that point, yes, a stage may be empty, that is possible. But what I am saying is that the worst case, max(a, b), will hopefully be rare; in most cases the queues will deliver performance much better than worst-case clocking. So now let us go back to the same question: how do you simulate this pipeline? Are there any new problems, or should I just go ahead and simulate b first and then a? The question is the same: I can write down a piece of software that tells me what a does, and a piece of software that tells me what b does. The question is about the interface: how do you simulate the queue? So the fundamental question I am interested in is, in which order should I simulate, a before b, or b before a?
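To make the queue idea concrete, here is a rough back-to-front sketch I am adding (stage a injects one item per cycle, b's latency is a function of the value it receives, and the bounded queue provides the back-pressure; all names and the latency convention are my own assumptions):

```python
from collections import deque

def simulate(inputs, b_latency, cycles, q_capacity=2):
    """Stage a (one item per cycle) feeds stage b (data-dependent
    latency) through a bounded queue; stages simulated back to front."""
    q = deque()                  # queue between a and b
    pending = deque(inputs)      # environment feeding a
    item, busy = None, 0         # b's current item and remaining cycles
    done = []
    for _ in range(cycles):
        # Stage b first (back of the pipe).
        if busy == 0 and item is not None:
            done.append(item)    # b finished its item on an earlier cycle
            item = None
        if item is None and q:
            item = q.popleft()   # pick up the next queued value
            busy = b_latency(item)
        if busy > 0:
            busy -= 1            # one cycle of b's work
        # Stage a: runs only if the queue has a free slot (back-pressure).
        if pending and len(q) < q_capacity:
            q.append(pending.popleft())
    return done
```

Running fewer cycles simply leaves the slow items in flight, which is exactly the draining behavior the lecture describes.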
I have a queue in between, and as you know the queue offers 4 operations: enqueue, dequeue, full, and empty. a before b, or b then a? I tell you that both have problems. So let us suppose that I simulate a before b. Since I have already told you there is a problem, can you think of a situation where I might not be able to mimic the behavior of the pipeline? When a and b have equal latencies, we get the same problem as the earlier pipeline: in the same cycle both stages process the same data. Right, that problem remains: I enqueue something and b picks up that queue entry immediately, in the same cycle. Now, what is the problem with b before a? Does somebody see a problem? One worry raised is that the queue then has no meaning, that b's timing relative to a is being modeled wrongly. No; the point is that the simulation of b will still be a multi-cycle simulation. I will still mimic that behavior, which means that while b is working the queue may actually grow gradually. But do I know the number of cycles taken by b? I do: the latencies are parameters of the simulation, I know small a and small b. Except, remember that b's latency depends on the value of x, so only when an input arrives do I know the latency of that particular computation. When the next value arrives, I invoke b for that many cycles, those cycles are charged, and within those cycles the queue may grow, perhaps up to its limit, and become full. But these details aside, what I am asking is this: scheduling b before a has a problem. What is it?
If the queue is empty, b keeps finding nothing to compute; but an empty queue just means no input has come in, and that is fine. What about overflow? Which queue? This one: when this queue fills up, a stops accepting inputs. So when you invoke a, there is a condition that this queue has at least one free slot; only then can a complete. So, scheduling b before a: what is the problem? b might be busy, but a is free to go. Do not worry about exactly how I structure the code; what matters is that on every cycle I have to decide whether to invoke b or a. When does b do something? When there is something in its queue; otherwise b has nothing to do. Its input queue not being empty, and its output queue not being full: that is also a condition you would check, but b interfaces with that side of the environment. Now imagine a situation where this queue is full in a particular cycle. How should the pipeline behave? a should not do anything in that cycle. b should pick up an item, process it, and put it in its output queue; assume that queue has at least one slot. That should be the correct behavior. So let us see what our software simulator does. It invokes b; b picks up an item from the queue and thereby frees a slot. Then it invokes a; a sees that there is now one free slot, picks up an x, and processes it. That is wrong. That is the wrong behavior. Is it clear to everybody? a should not have been invoked in this cycle: the change in this queue's state should not have been visible within the same cycle. But it became visible. How do you fix it? Update the state only after the computation is completed. Exactly.
So, in such a simulator, and this is again very typical of processor simulators, every cycle's operations have to be separated into two parts, with the state updates happening at the end of the cycle. So what you should do is hook up a new function, q.head(), which returns the head of the queue but does not dequeue it. b will operate on that head, and a will operate on the head of its input queue; let us call it q1. And then, at the end, you do the state updates, q.dequeue(), q1.dequeue(), q1.enqueue() and all the other things, and then cycle++. So essentially you split the computation of one cycle into two parts: one is computation, the other is state update. Then, even though b got the head of the queue, a will still think that the queue is full, and so it will correctly not wake up in that cycle.
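Putting the head idea into code, here is a minimal sketch I am adding (head is modeled by peeking at q[0], the squaring is a placeholder for stage a's work, and every dequeue and enqueue is deferred to the end of the cycle):

```python
from collections import deque

def run_cycle(env, q, capacity):
    """One simulated cycle in two phases. Stage b reads the head of q;
    stage a reads the head of env. All queue mutations are deferred to
    the end of the cycle, so a sees q as full even if b drains it."""
    actions = []
    # Phase 1: computation. Read state, never mutate it.
    if q:                                        # b: peek, no dequeue yet
        actions.append(('b', q[0]))
    if env and len(q) < capacity:                # a: needs a slot at cycle start
        actions.append(('a', env[0] * env[0]))   # placeholder stage-a work
    # Phase 2: state update, at the end of the cycle.
    outputs = []
    for stage, val in actions:
        if stage == 'b':
            q.popleft()
            outputs.append(val)
        else:
            env.popleft()
            q.append(val)
    return outputs

# Full-queue scenario from the lecture: b drains q this cycle,
# but a must not see the freed slot until the next cycle.
env, q = deque([4]), deque([9])
first = run_cycle(env, q, capacity=1)    # b emits 9; a correctly stays idle
second = run_cycle(env, q, capacity=1)   # now a runs and enqueues 16
```

Because a's full-queue check runs in phase 1, before b's dequeue takes effect, the simulator reproduces the hardware behavior: a sleeps in the cycle where the queue started out full.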