This time we're going to look at how we can apply the concept of pipelining to the architecture we've built. Currently we have this single-cycle architecture, and the way it works is that an instruction comes in: the program counter tells us where to fetch, we fetch our instruction, we send it off to be parsed and turned into all sorts of control signals and operands, and we gradually move from left to right through our diagram. But once we've computed something, those units never stop putting that information out. Once our registers know what data they need to produce, they will continue producing that data for the rest of the cycle. This means that even when we come around ready to write something back into our registers, they're still putting out data as though we'd never done anything with it. So we're effectively using a lot of our combinational units, like our ALUs, our multiplexers, and our sign-extension units, as state units. They're actually holding data for us throughout the entire cycle. But we don't really need them to do this. Once we've computed the result of our ALU, we really don't care what the ALU does after that. We just need those results to be handy for whatever we're doing next with them, whether that's putting them into the data memory, storing them back to registers, or determining a branch. As long as those results are still available, we'll be in good shape. So we'd like to break this architecture into a series of stages, each of which is relatively self-contained: we won't need to go backwards through any of the stages at a later point, and we'll just worry about how to keep data around for the next stage. And ideally all of our stages will take about the same amount of time to compute.
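To make the combinational-versus-state distinction concrete, here's a minimal Python sketch (all names are hypothetical, not from the lecture) contrasting a pure combinational unit, whose output simply tracks its inputs, with a register that latches a value on the clock edge and holds it:

```python
# A combinational unit is just a pure function: its output tracks its
# inputs continuously, and it holds no state of its own.
def alu(a, b, op):
    if op == "add":
        return a + b
    if op == "sub":
        return a - b
    raise ValueError("unsupported op: " + op)

# A state element latches a value on the clock edge and keeps putting
# it out until the next edge, regardless of what its input does later.
class Register:
    def __init__(self):
        self.value = 0    # what the register is currently putting out
        self._next = 0    # what it will latch on the next clock edge

    def set_input(self, value):
        self._next = value

    def tick(self):       # rising clock edge
        self.value = self._next

# In the single-cycle design, the ALU's inputs must be held steady all
# cycle so its result stays valid. Latching the result into a register
# instead frees the ALU (and everything feeding it) for other work.
r = Register()
r.set_input(alu(3, 4, "add"))
r.tick()
# r.value now holds 7, even if the ALU's inputs change afterwards.
```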
This will give us the maximum amount of speedup we can possibly get through pipelining, since the clock period will be set by the slowest stage. So we're going to be especially interested in breaking things up into well-balanced blocks. In this case, I'm going to start off with an instruction fetch stage. This will be our first stage, and the only thing it's going to do is fetch an instruction out of memory and then go fetch the next instruction. This will be great for most of our instructions: it will allow us to fetch one instruction every cycle and then just continue fetching instructions every cycle. For the next stage, I'm going to have an instruction decode stage. Decoding the instruction is actually really simple. We're just routing the bits from our instruction wherever they need to go, so that really doesn't take up much time. So little, in fact, that I'm going to include the register reads in with this stage. We'll be able to decode our instruction and go get data out of the registers in the second cycle. Our third stage is going to be our ALU stage. The ALU is the big time-consuming part of this stage. The other things we've got in here, such as our branch hardware, are not terribly expensive; they will take far less time to run than our ALU, so they're not really going to slow this stage down. The fourth stage will just be our memory stage. Our memory unit can certainly take a lot of time on its own, so it will get a stage all to itself. The fifth and final stage is going to be for write-back: we'll take whatever results we've gotten and store them back into the registers. It turns out stages 1, 3, and 4 will be relatively balanced. Fetching anything out of memory takes about the same time as accessing data memory, and running the ALU will take about as much time as that. Our decode and write-back stages are actually a whole lot faster.
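To see why balance matters, here's a rough back-of-the-envelope sketch in Python. The latencies are made-up illustrative numbers (not from the lecture): fetch, ALU, and memory each take one full unit, while decode and write-back take about half a unit, matching the relative proportions described above:

```python
# Hypothetical stage latencies in nanoseconds (illustrative only):
# fetch, execute, and memory are balanced; decode and write-back
# take roughly half as long.
stage_latency = {
    "fetch": 2.0,
    "decode": 1.0,
    "execute": 2.0,
    "memory": 2.0,
    "writeback": 1.0,
}

# Single-cycle design: one clock period must cover every stage in series.
single_cycle_period = sum(stage_latency.values())   # 8.0 ns

# Pipelined design: the clock period is set by the slowest stage.
pipelined_period = max(stage_latency.values())      # 2.0 ns

# Speedup falls short of the ideal 5x (one factor per stage) because
# the two fast stages sit idle for half of each cycle.
speedup = single_cycle_period / pipelined_period    # 4.0x
```

With perfectly balanced stages, every latency would equal the maximum and the speedup would reach the full stage count; any imbalance eats into that.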
Those take about half as much time as the other stages, which means we're going to be able to write data back to the registers in the first half of our clock cycle and then read data out of those registers in the second half. This means we'll actually be able to access the registers both for writing and for reading in the same cycle without having any real worries there. The next thing to consider is what we do at these boundaries. I've drawn in these boundaries separating the various stages, but how do we establish them? How do we enforce a boundary? And since we're no longer passing the same information through our architecture for the entire cycle, what do we do to make sure that, say, our ALU has the information it needs? Essentially, what we're going to do is turn these nice dividing lines into register blocks. These won't be registers like the block of registers we're used to accessing with instructions. They're just going to be pass-through registers: we read some data out of them at the beginning of our clock cycle, and at the end of the clock cycle we write some new data in, but we're not going to be able to control what data they hold without extra hardware built specifically to access them. These registers give us a way to hold on to our state just long enough to use that information in the next clock cycle. This way, the registers aren't going to have to sit around putting out data for five clock cycles. They only need to put out that data for one cycle. Then the ALU can use it in the subsequent cycle while we're pulling something else out of the registers.
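Here's a small sketch, again with hypothetical names, of how those boundary registers behave: each one is read at the start of a cycle and overwritten at the end, carrying one stage's outputs into the next stage for exactly one cycle:

```python
# A pipeline register carries one stage's outputs to the next stage.
# It is read at the start of a cycle and overwritten at the end; no
# instruction can address it directly.
class PipelineRegister:
    def __init__(self):
        self.out = None    # visible to the next stage during this cycle
        self._in = None    # staged value, latched on the clock edge

    def write(self, value):
        self._in = value

    def tick(self):        # clock edge: latch the staged value
        self.out = self._in

# The ID/EX register hands the register-file operands to the ALU one
# cycle after they were read, so the register file is free to serve
# the next instruction in the meantime.
id_ex = PipelineRegister()

# Cycle 1: the decode stage reads operands and stages them.
id_ex.write({"a": 5, "b": 7, "op": "add"})
id_ex.tick()

# Cycle 2: the execute stage uses last cycle's operands...
fields = id_ex.out
alu_result = fields["a"] + fields["b"]    # 12
# ...while the decode stage stages operands for the next instruction.
id_ex.write({"a": 1, "b": 2, "op": "add"})
```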