What I will talk about should hopefully be a recap of some portion of 220. It is basically about pipelining, the basics of pipelining. And keep in mind that essentially we are trying to implement the instruction set that we discussed, and we will be focusing mostly on this. So, we start with a very simple operation, or rather a function, which is a cubing function. So, we have to compute x cube given x. So, let us try to see how we can implement such a function. So, we can write f(x) as m(m(x, x), x), where m(x, y) is x times y. That is your cubing. So, this decomposition of f allows us to implement f with two serially connected multipliers. So, this is how it looks. First we apply m, and to the output of it we apply m again. So, this is clear. And suppose the multiplier latency is m nanoseconds. Then the question arises: how frequently can a new input be sent to this hardware, and how frequently should the output be sampled? So, keep in mind that this is a purely combinational circuit. There is no clock, nothing; you give x, and after some time out comes f(x). However, it is very important to realize that in the real world there is nothing that is purely combinational, because a hardware that is useful will be used repeatedly. It will be reused, and that essentially means that I give an input, I get an output f(x); I will probably give it one more input, get one more output, give one more input, get one more output. And immediately when you start doing that, these two questions become very relevant: how fast can I reuse this particular piece of hardware? I give x; can I give the next x in one microsecond? So, that is a very useful question. So, in this case, what is the answer? 2m, right? So, I give x, the output comes out after 2m nanoseconds, and without any difficulty I should be able to give one more x once the previous f(x) has already been computed. So, in this case I should be providing inputs at a rate slower than or equal to one input every 2m nanoseconds.
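To make the decomposition concrete, here is a minimal software sketch (my own, not from the lecture slides) of f(x) = m(m(x, x), x); the two chained calls to m stand in for the two serially connected multipliers.

```python
def m(x, y):
    # one combinational multiplier
    return x * y

def f(x):
    # first multiplier computes x * x = x^2,
    # second multiplier computes x^2 * x = x^3
    return m(m(x, x), x)
```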
It is one input per 2m nanoseconds, and I should be collecting my output at the same rate. So, what happens if I provide input at a faster rate? Suppose I give x, and after less than 2m nanoseconds I provide the second x. What will happen? What do you think? Will it be an incorrect result? Will the result be something wrong? So, there is no register here. This is a multiplier; it takes x, gives me x square, and this one takes x square and x and gives me x cube. So, I give x, and within less than 2m nanoseconds I give the second x. What will happen in the computation? What do you expect to see here? There is no register here, right? So, what value am I going to get here? I am saying I give x, and before I reach 2m nanoseconds I give the second x. What will happen here? What value am I going to get? So, it depends on when I sample here, right, and on what exactly the paths are through this hardware: how fast this particular line is going to be affected by this new input. So, to be safe, we have to obey this particular law: for this particular circuit, I should not be providing inputs at a rate faster than one per 2m nanoseconds, and I should be sampling my output at the same rate. So, the point I am trying to make here is that there is nothing that is purely combinational in the real world; there has to be a sampling rate somewhere at which you provide inputs and sample outputs. So, this is what it looks like. These are some funny structures, but very important. What are they called? Flip-flops; that is one way to realize this. These are usually called latches; often they will be called pipeline registers. So, I will be using all these synonyms in this course. So, the function of this is very interesting: when this signal is present, it will take the value here and transfer the value to the output after some delay. So, that is what it does.
So, in this particular course we will be using one particular type of these. That is, we will assume that the signal given here is a square wave, and this one is going to transfer the input to the output only either on a rising edge or on a falling edge. That is called an edge-triggered latch. So, this signal is called a clock, and whenever the clock edge strikes, it will take this input value, and after a small delay that value will appear here. And we will also distinguish between positive-edge-triggered latches and negative-edge-triggered latches, the ones that operate on the positive edge versus the negative edge. So, these are the positive edges; these are the negative edges. So, this is my combinational cubing circuit. I will continue to call it combinational even though I have two latches separating this combinational circuit from the environment. The environment is providing me x, I am giving f(x) to the environment, and I am doing that in a combinational fashion. The frequency is 0.5/m gigahertz, that is, 1/(2m); the throughput is one output every 2m nanoseconds. Any questions so far? So, one obvious observation that we can make here is that exactly one multiplier is active at any point in time. I give an input; this one computes the square, then it goes idle while this one computes the cube; and then when this one computes the next square, that one is idle. So, why not use just one multiplier? Obvious question. So, if we want to do that, we will require some sequencing logic, and we will get slightly worse throughput at almost half the area cost. So, let us see what that means. So, conceptually, what we really want to do is first use this multiplier to compute the square, remember the output of that, and feed it back to this same multiplier. That is roughly what we want to do. So, first I am showing it here in a conceptual fashion, that is, by unfolding the whole thing. So, we still have two multipliers, but I have introduced this latch now in the middle.
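One simple way to model a positive-edge-triggered latch in software is sketched below. The class and method names are my own invention, and real hardware behavior (setup/hold times, propagation delay) is deliberately not modeled; the point is only that the output changes on a 0-to-1 clock transition.

```python
class PosEdgeLatch:
    """Toy model of a positive-edge-triggered latch (pipeline register)."""

    def __init__(self, init=0):
        self.q = init      # current output
        self._d = init     # input value to sample on the next rising edge
        self._clk = 0      # last seen clock level

    def drive(self, d):
        # combinational logic continuously drives the D input
        self._d = d

    def tick(self, clk):
        # transfer D to Q only on a 0 -> 1 transition (rising edge)
        if self._clk == 0 and clk == 1:
            self.q = self._d
        self._clk = clk
```

A negative-edge-triggered latch would be the same sketch with the transition test reversed.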
So, that is a pair of latches, latching x and latching x square, and this one has to be operated faster: it has to be operated at 1/m gigahertz, because this output gets ready in m nanoseconds. So, I must take a clock and latch it so that in the next m nanoseconds this one can work. So, the frequency seen by the environment is still 0.5/m gigahertz, and the output is one output every 2m nanoseconds. In reality, the throughput will actually be slightly worse. Why is that? Well, we have not gone there; I am just talking about this one now. We have a latch in between, but I still have two multipliers. In the next slide I am going to fold them, but right now I am just talking about this one. Why is it worse? Exactly. So, there will be some propagation delay going through these latches. What will happen is that every m nanoseconds this clock is going to strike; on the rising edge I will take whatever value is here, but only after some delay will this value appear here. So, that delay adds to the time it takes for the output to get out. You have introduced hardware; it is not coming for free. You are losing time there. So, here we still waste half the work unless each multiplier is gated in alternate internal cycles. That is one option: I can say that in every alternate internal cycle I will disable this multiplier, if there is some way of doing that. Then essentially that multiplier will go to sleep, and I will not waste any energy or power in it. So, of course, the obvious solution is to fold them together. Let us see what that looks like. The circuit gets slightly more complicated; let us see why that is so. Essentially, I have brought this multiplier and folded it onto that one. The latch remains unchanged here. The only problem now is that you have to resolve what the second input to this multiplier is going to be.
One input was x, of course. So, if you look at the second multiplier, it was taking x as one input, and the other input was x square. And if you look at this multiplier, both inputs are x. So, when you bring this here, you have to resolve the second input. Let us see how to do that. We put a multiplexer here. One input is x all the time; the other input switches between x and x square, and this is a toggle latch which keeps on toggling every cycle: 0, 1, 0, 1. So, whenever this is 0 it will pick up x; whenever this is 1 it will pick up x square. So, that is the sequencing enforced by the toggle latch. Otherwise, everything else remains unchanged. The only thing that we have done here is that we have added some more latency in the circuit. So, it is going to be even worse than the previous one. You have this latch now, and you have this multiplexer now in the critical path. The toggle latch is not exactly in the critical path, but the multiplexer will be in the critical path of the combinational circuit, because the second input has to go through the multiplexer before getting to the multiplier. So, this is a multi-cycle design, and the reason is obvious: I have a faster clock, 1/m gigahertz, but I take multiple cycles of that clock to compute the output. However, if you look at the environment, the environment still sees the same sampling rate. It does not improve, and of course it should not improve: we have not done anything smart here. We have just folded one multiplier onto the other; we are just using that multiplier twice. So, the environment still sees the same sampling rate. The overall throughput will be worse than one output every 2m nanoseconds, because now the critical path involves this multiplexer and this latch. So, a multi-cycle design makes sense only if there is a chance of reusing hardware, so that we can save area. Otherwise, there is no point in doing a multi-cycle design. Here, we could actually use one multiplier; that is why we did a multi-cycle design.
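Here is a cycle-by-cycle software sketch (my own simplification) of the folded design: one shared multiplier, a multiplexer choosing between x and the latched partial product, and a toggle bit as the sequencing logic. Two internal cycles produce one cube.

```python
def multi_cycle_cube(x):
    toggle = 0
    latch = None              # pipeline latch holding the partial product
    for _ in range(2):        # two internal clock cycles per input
        other = x if toggle == 0 else latch   # the multiplexer
        latch = x * other                     # the single shared multiplier
        toggle ^= 1                           # toggle latch: 0, 1, 0, 1, ...
    return latch
```

After cycle 1 the latch holds x squared; after cycle 2 it holds x cubed, which is what the environment samples.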
So, a multi-cycle design will actually go through the hardware multiple times before providing the output. Otherwise, we will lose in throughput. As you can see here, we lose in throughput and increase area compared to a combinational design. So, if you go back to the one that we started with, this one still gives you the best throughput out of all three options that we have evaluated till now; it will be very close to one output every 2m nanoseconds. And what I have done gradually is that I have reduced area, but I have lost throughput and increased my latency as a result. So, that is exactly the point I am trying to make here: you should be doing such an optimization only if there is a chance of winning in some department. In this case, we win in terms of the area of the circuit, but we lose in throughput compared to the combinational design. So, as a different example, suppose f(x) is this: g(x) times x. And let us suppose the latency of g is g nanoseconds. So, before we move on to look at this particular function, any question on this? Is it clear? So, you are saying that this toggle latch and the multiplexer are control. So, is the entire circuit part of a functional unit, or is there a separate control? Well, you can make that logical division, but that is an abstraction. In reality, things will all be sitting together side by side, depending on the layout. One primary optimization goal will be to minimize the wire length, because we talked about in the first lecture that wires are so slow; communication is slow. So, where exactly this will sit, this is still very high level. In the floorplan, this may go somewhere else. Any other question on this? Is the multi-cycle design of the cubing function clear to you? So, let us take this one now. This is slightly different. The difference comes from the fact that it does not really use the multiplier a second time; it invokes a different function g in addition to using the multiplier once. So, this is my combinational design.
My environment has to sample at the rate of one per m plus g nanoseconds; it should not be faster than this rate. And the throughput is going to be one output every m plus g nanoseconds. So, now if I ask you, should I go for a multi-cycle design with this? What do you think? This is what it looks like. I have an internal clock and an external clock, and internally I essentially latch this. Of course, I cannot fold this onto this, because these two are different units. So, this is all I can get. So, now the question is: what do I gain by doing this? Clearly I do not gain anything in terms of area; I actually lose in terms of area, because I have an extra latch here. Do I gain in terms of throughput? So, how do you resolve these two question marks here? Start with the internal clock. How fast should I clock this one? What are the options we have? m, g, m plus g? Any other option? So, he says m plus g. Both places m plus g? Yes? Are you sure? But with some shift; the second clock will be with some initial shift. Initial shift? Yes, it will be there. Sorry, what? For the first output, my starting point will be slightly shifted. Why is that needed? Because the second clock's m plus g cycle is different from the first clock's m plus g. That is okay. So, you are worried about the fact that the first output may be garbled. But forget about that. m plus g? Yes? Is that right, do you think? Why is it m plus g? What is the reason? So, if you go back to the previous one, the internal clock was 1/m there; it was not 1/(2m). And this is correct, right? Is everybody convinced, or are you just believing what I was saying? You can argue that this is correct; it is going to produce correct output. So, yes, I could do 1/(2m); that would still be correct, but then it would slow down your computation. So, what should I do here? What is the correct thing to do? What is the fastest clock that I can feed here? What about m? Why?
Because the output of the m part will be available after m nanoseconds. Is that enough? Whichever is the slower one? Yes. So, this is going to be 1 over max of m comma g: whichever is the slower one. We have to make sure that when I sample the output of this one here, this one had better be free; it should not be working on something else. Otherwise, I will mess up whatever computation is going on. Is this clear to everybody? The slower one. What about this one? What is the environment frequency? m plus g? Are you sure? If I sample internally at 1 over max of m comma g, am I going to get an output by m plus g? How long is it going to take? Twice the max. See, once you realize that these two clocks are actually tied together, that they depend on each other, you cannot decide them independently; you can see it immediately. When you say this is 1 over max of m comma g, how fast can you sample the output? Louder. Max plus g? Does anybody see a problem with max plus g? 1 over twice the max. What is the problem with max plus g? Exactly: there is a chance of overwriting the computation. This one actually has to be 1 over twice the max of m comma g. This is what it looks like. So, what have we gained by doing this? There is a lot of wasted work and energy, because half the time only one unit is working. No improvement at all over the combinational design. In fact, it is worse in all departments. Look at the throughput: it is going to be worse. Twice the max of m comma g is going to be bigger than or equal to m plus g. Area: we are worse; we have added latches. And energy: we are probably going to be worse because of the energy consumed by the latch, and I cannot gate the latch; the latch has to be working. So, this is a typical scenario where you should not be going to a multi-cycle design, whatever may be the case. Now, moving on to pipelining. That is the next enhancement. The multi-cycle design gave you a gain in terms of area, but you lost in terms of throughput.
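The clock-rate argument above can be checked with made-up latencies, say m = 3 ns and g = 5 ns (numbers are mine, purely for illustration):

```python
# Multiplier latency m and g-block latency g, in nanoseconds (made-up values).
m, g = 3, 5

combinational_period = m + g      # combinational design: one output every m+g ns
internal_period = max(m, g)       # fastest safe internal clock for the multi-cycle design
external_period = 2 * max(m, g)   # environment samples at 1 / (2 * max(m, g))

# The multi-cycle version never wins: 2*max(m, g) >= m + g always holds,
# since max(m, g) >= m and max(m, g) >= g.
assert external_period >= combinational_period
```

With these numbers the multi-cycle design delivers one output every 10 ns versus 8 ns combinationally, which is exactly the "worse in all departments" conclusion.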
Now, if you pipeline it, of course, we will actually put the second multiplier back. We need two multipliers, because this is going to be one stage of the pipeline and this is going to be the second stage of the pipeline. Now, the environment actually sees the improvement. That is the final goal: I want the environment to see the improvement. This is the only case where you actually start seeing the environment improving. So, now I can actually sample the input and output at a much faster rate, that is, 1/m, because what I can do is take the input and feed this multiplier while the other multiplier is working on the previous input. And the throughput is one output every m nanoseconds. In reality, again, the throughput will be slightly worse because of the latch delay. So, the fundamental requirement of pipelining is that you should be able to decompose the function to be computed into a series of functions. In other words, you should be able to express f(x) as such a sequence of functions: in one stage I do f1 of x, in the next stage f2 of f1 of x, and so on up to fk. Ideally, you would expect a k-times faster clock and k-times higher throughput. Is it clear? That is the basics of pipelining; that is how you should start. You can pipeline any computation, but the computation should be able to be broken down into such a form. Question? Yes. Sir, in the previous slide that we discussed, the one with m and g: the combinational circuit output remains the same? Suppose that g is greater than m, so we are sampling at 1 over 2g, while the multiplier latency is m nanoseconds. So, the combinational circuit for m will keep on outputting the same value? Yes, it will. It will keep on computing the same thing. So, as long as you sample at 1 over 2g, you are still going to collect the correct output. Anything else you do will make it incorrect. Pipelining. So, going back to the other one: how do you pipeline this one?
So, here is how you pipeline this computation, and your environment will also see the improvement immediately. It will be 1 over max of m comma g. So, that is a definite improvement over the combinational circuit, which was giving you a throughput of one output every m plus g nanoseconds: max of m comma g is going to be less than or equal to m plus g. So, the frequency is 1 over max of m comma g here, and the throughput is one output every max of m comma g nanoseconds. And again, in reality the throughput will be slightly worse, but still better than the combinational implementation. That is the whole point. So, pipelining, if done well, should always bring benefit over a combinational implementation in terms of throughput. We are not commenting on area, because in most cases we will actually end up increasing the area, since we have to introduce latches. Now, let us go a little deeper into pipelining. Imagine feeding a series of inputs x to a k-stage pipelined implementation of some computation f(x). We feed x1; the pipeline completes f(x1). Next I give x2; the pipeline completes f(x2), and so on. So, usually you would expect the computation on two inputs to be independent of each other. This is true for a class of functions which are called stateless or memoryless. That is, when you are computing f(x1), whatever state you produce, you forget it, and it does not influence your computation of f(x2). These are called stateless or memoryless functions. Now consider the following function with a global state R. So, now we will start talking about functions which are state-holding, and let us see how pipelining gets complicated when you bring in such states. So, here is a function definition: f(x) returns y. And what does f(x) do? There is a global state R. If R is 0, then it computes y equal to x cube; else, it computes y equal to 3x.
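A behavioral sketch of the two-stage pipelined cubing circuit, one loop iteration per clock cycle (the model and names are my own): stage 1 squares the new input while stage 2 finishes the previous one, so after the fill cycle one result comes out per cycle.

```python
def pipeline_cube(xs):
    latch = None              # pipeline latch holding (x, x_squared)
    outputs = []
    for x in xs + [None]:     # one extra cycle to drain the pipe
        # stage 2: finish the previous input using the latched values
        if latch is not None:
            x_prev, sq = latch
            outputs.append(sq * x_prev)
        # stage 1: square the incoming input and latch it for next cycle
        latch = (x, x * x) if x is not None else None
    return outputs
```

Feeding three inputs takes four cycles total: one fill cycle, then one cube per cycle, which is the throughput improvement the environment sees.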
It either multiplies three values or adds three values. And finally, it updates R equal to y mod 3. So, now what do you think? I feed x1. What I should do for x2 is not really known until I finish computing f(x1). So, how do we pipeline this function? We will assume that we have two multipliers and two adders. The computation of one input depends on the previous one; that is pretty obvious, right, because the value of R is going to change. So, these are called pipeline hazards: the two inputs cannot be computed independently, and you cannot compute a particular value ignoring the history of the pipeline. That is impossible. So, let us see how to pipeline this one. First, we will start, as usual, with a combinational representation of f. This one computes x square; this one computes 2x; this one computes x cube; this one computes 3x. Then I take R, feed it to a multiplexer, and select, based on R, which one I should take. Then I apply the result to a mod 3 block to update R, and this is my output y. Clear to everybody? So, what is the latency of this? The latency is going to be twice the max of m comma a, because these two levels are concurrent, plus the multiplexer latency, plus the mod 3 latency. Now, there is a question here: should I include this mod 3 block in my latency or not? Because my output is actually ready at this point; this is what the environment samples. What do you think? Should the mod 3 block be inside my latency or not? And the latency will decide at what rate my input and output will be sampled. It should be. Why should it be? Because it depends on R, and R might be delayed. Exactly right. Remember that I am using R later: my mod function may be so slow that the next input comes in, finishes its computation, and arrives at the multiplexer before I have updated R. That is actually possible. So, it is always safe to include this in my latency. So, it determines the throughput and the IO sampling rate.
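The state-holding function can be written down directly as a sketch (R as a Python global stands in for the global state register; its initial value of 0 is my assumption):

```python
R = 0  # global state register, assumed to start at 0

def f(x):
    global R
    if R == 0:
        y = x * x * x   # cube: multiply three values
    else:
        y = 3 * x       # 3x: add three values
    R = y % 3           # update the global state from this output
    return y
```

Running it on the same input twice shows the dependence: the result for one input decides which path the next input takes.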
So, as you can see here, there is a lot of wasted work, because I will be either taking this one or taking this one, right? But I am computing both. I have done it this way because it is somewhat easy to pipeline. Otherwise, what you could do is gate M or A based on the value of R, that is, disable whichever path you want to disable, and you could move the multiplexer to the front and select which path to take. But that becomes very, very difficult to pipeline. We will see exactly this case when pipelining your processor, because here the fundamental problem is that when x comes in, we do not know which way to go; we do not know whether we should take this path or that path until and unless we know the value of R. So, we will stick to this particular design. We waste a lot of hardware, but this is easier to pipeline. Any question on this? So, let us see. First option: suppose we want a two-stage pipeline. So, we put a pipeline latch here. The first question that has to be answered whenever you try to pipeline something: does it produce a correct result? Forget about performance; that is the first question. What do you think? How do you analyze this? Okay, I assume that you will think about it later, so I will tell you the answer. Yes, it does, because this stage computes your x square and 2x, that passes on to the next stage, and I will take in the next input here. And there is no problem as such. The only question was about R, but there is no issue with R, because if you notice, I am clocking R with the same clock. So, R gets modified only after I read out my output; it is the same clock. So, the R value that we feed to the multiplexer is the value that the latch had when the computation started. There is no problem with this. What about performance? How good is this pipeline? What is the clock rate of this pipeline?
How fast can I run this pipeline? Which stage is going to determine the pipeline clock frequency, do you think? This one, right? Why is that? Because apart from the multiplier, you have something more here. So, this is going to be the slower stage. And as we have already seen, we should take the max of the two, right? So, these are often called unbalanced pipelines: we have two stages which are not balanced. And that means there is a chance of improving, because if I can balance them, I should be able to gain clock frequency, right? The slower stage determines my clock frequency. So, I make it a three-stage pipeline. Again, you have to ask the same question: is it correct? It is still correct, because, as I just mentioned, the value of R that a particular x is going to use is going to be correct: R is modified only here, and the value is read out before that. But you have to be a little careful here: you are reading from and writing to the same register in the same pipe stage. That was true even for the previous one; we were writing to this particular register and reading from this register in the same pipe stage. So, let us see what happens exactly. Let us call these S1 and S2, the two stages, and time goes in this direction. For a particular value of x, say x1, it first occupies S1 and then S2. x2 starts here: when x1 is doing S2, x2 will be in S1, and then x2 does S2 here. So, is that a problem? No, because the value of R that x2 is going to use here has already been updated by x1. Does everybody see that? Because what really is going to happen is: this is where x1 gets sampled, this is where this latch will be sampled, this is where the R value will be updated, and this is exactly where x2 will start using it. What about performance? Did we improve upon the previous pipeline? What is the expected clock frequency?
Who determines the clock frequency of this pipeline? The max of all three stages. So, this is definitely going to be better than the previous one, because the longest stage in the previous case has now been broken into two. And if everything else is the same, you can expect that either this stage or this stage will be the limiting one, because this one is going to be a very fast stage: it is just doing a mod 3 operation and a multiplexing. But suppose I had mod 3 implemented as a divide. Then you might say, well, let me isolate that. Is this correct? Are we good here? There is some problem with this one. There is a comment here saying the back edge crosses a pipeline latch. Why is that important? What is the problem? He says that there is some problem with the R value. What is the problem? Can you explain more? It might be a problem in case mod 3 is faster? There is no "might be" here; I tell you that it is wrong in all cases. So, let us try to do the timeline again for this pipeline. We have S1, S2, S3, S4, then S1, S2 for the next input. So, I need the value of R at the S3 stage; x2 needs this value here. But the value is latched here, after S4. So, there is no way to get the right value: x2 is actually going to get the R value produced by x0, not x1. So, keep this in mind. Whenever you see in your design that a value is getting produced in a stage which comes after the stage that consumes the value, you know that there is a problem; this pipeline is not going to work. And that essentially translates to a back edge crossing a pipeline latch in the pipeline diagram. So, these are called pipeline hazards, and resolving this one is not going to be easy. There is no easy way to resolve this: you need a value here which is latched here, which in fact stabilizes only at the end of S4, and I need the value at the beginning of S3. There is no easy solution. So, here is another hazard of a slightly different nature. In the earlier hazard, the problem was that given x, we did not know what to do with this x, which way to go.
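The stage-distance argument can be written down as a little formula. The indexing convention is my own: input i enters stage 1 in cycle i, stages are numbered from 1, and a value latched at the end of a cycle is visible only from the next cycle.

```python
def r_producer_seen_by(i, read_stage, write_stage):
    # Input i reads R in cycle i + read_stage - 1. Input j's R is latched at
    # the end of cycle j + write_stage - 1 and visible one cycle later, so the
    # newest R that input i can safely see comes from input:
    return i + read_stage - write_stage - 1

# Four-stage pipeline from the lecture: R read in S3, written after S4.
# x2 sees the R produced by x0, one input too old -- that is the hazard.
assert r_producer_seen_by(2, read_stage=3, write_stage=4) == 0

# Two-stage pipeline: R read and written in the same stage (S2).
# x2 sees x1's R, exactly what the function definition requires.
assert r_producer_seen_by(2, read_stage=2, write_stage=2) == 1
```

The general rule falls out of the formula: whenever the write stage comes after the read stage, the result is too old by exactly their distance, which is also the number of stall cycles needed.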
And which way to go depends on the computation of the previous x. Here, we know what to do, but we may not have the data ready to do it. So, here is the function: f(x) returns y, where y is x square times r, and r is y mod 3. So, here the computation to be performed is known, but the required data may not be available. When we try to pipeline it, we will realize that. Another similar example, where we make it x times r square as opposed to x square times r, and r is again y mod 3. So, let us see how to pipeline these two. First, y is x square times r. Again we will start with the combinational circuit. We have two multipliers; the second multiplier takes one input from r, and the remaining things are the same. The latency is 2m plus the mod latency. That determines your throughput and IO sampling rate. So, I put the pipeline latch there. Are we good? It looks like r is produced and consumed in the same pipe stage, so we should be fine. Are we really? No, we are not. Here we have a back edge crossing this latch, and if you look at what is actually happening, r is essentially not ready when you need it. The data is not ready. So, here is the hazard rule; keep this in mind: if the source stage of a data value comes after the destination stage, that creates a major problem in pipelines, and there is no easy solution that can win back the lost performance. The only solution is that you have to wait for one cycle. In general, you have to wait for a number of cycles equal to the distance between the source stage and the destination stage. In this case, the distance is 1: if you stall this particular stage for one cycle, you will get the right outcome. Now the other one, x times r square. Here you can easily see the problem: r has far-reaching influence; it goes all the way to the first stage. You need it from the very beginning of the computation. So, the latency is again 2m plus mod.
So, when I try to pipeline it, there is a problem: we have the same back-edge problem. So, here is a somewhat easier computation to pipeline. We keep y equal to x times r square, but we change r to be equal to x mod 3 instead of y mod 3. So, I can compute r first. But remember that the computation of one x still depends on the previous computation, because the r value comes from the previous x; it is not this x's own r that we are using in the computation. So, there is still a dependence on the previous computation. So, can we pipeline it? The answer is yes. Let us see how to do that. This is my combinational circuit: I bring the mod block here, I take x, produce r, and everything else is unchanged. And now my delay becomes the max of 2m comma mod, because I can run the mod in parallel with the 2m. So, I put a latch there. Are we good? Yes? Why? What value of r am I using? I can logically drag it here, right? That is what was suggested. So, if you look at it, the edge crossing the latch is now a forward edge. That means that when this cycle finishes, I can latch the value of r, so that the next input can take the value of r from here, right? That is called a bypass path. So, if the source stage of a data value comes before or is equal to the destination stage, then even if the data is written to the storage element in a later stage, we are fine; it can be handled with a bypass path. In this case, this is the stage that is producing the value; it is written much later, but the value is here, actually ready for you to consume. So, if the destination stage is equal to this stage or comes after it, we are okay; we can handle that. Is it clear to everybody? That is resolving a hazard with a bypass. So, I think I am going to stop here. We have covered pretty much all the cases of pipelining that you see when you are pipelining a processor.
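Functionally, the bypassable variant computes the following sequence (a behavioral sketch of my own, with the initial r assumed to be 0): because r = x mod 3 is produced in the first stage, each input's r already exists when the next input enters the pipe, so the hardware can forward it without stalling.

```python
def f_sequence(xs, r0=0):
    # y_i = x_i * r^2, where r comes from the PREVIOUS input (x_{i-1} mod 3);
    # r0 is the assumed initial state before the first input arrives.
    outs, r = [], r0
    for x in xs:
        outs.append(x * r * r)   # stage using the bypassed r from last input
        r = x % 3                # r for the NEXT input, ready in stage 1
    return outs
```

Note the contrast with the y mod 3 versions: there, r depends on the output y, so it cannot be produced early, and a stall or worse is unavoidable.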
So, believe me, pipelining a processor is orders of magnitude more difficult than pipelining these simple computations. You will see the problems. Before we go into that, I will spend a few minutes next lecture trying to explain how to simulate a pipeline, because, as we discussed in the first lecture, in a computer design cycle you usually start with simulations before you finalize the design, before it goes to the concrete hardware design. So, you have to understand how to simulate a pipeline. It may seem very trivial, but it has to be done in a particular way. So, we will talk about that for a few minutes next class, and then we will take up pipelining a processor.