This time we're going to start looking at how Tomasulo's algorithm processes instructions. Tomasulo's algorithm is the standard method for implementing out-of-order execution in a superscalar processor. In this case, each of our instructions is going to go through some common instruction fetch, instruction decode, and instruction commit hardware. But in between, the instructions will be issued to functional units based on what they need to do. If it's an add instruction, we'll send it off to an ALU. If it's a load instruction, we may send it off to a memory unit. If it's a floating-point addition instruction, we'd send it off to a floating-point adder. We can also take into account how full each of these functional units is. So if I have several integer adders, whenever an instruction comes in, I can send it to the one that's the least full, whichever one is most likely to be able to run this instruction quickly.

All of these functional units operate independently, though. Each one can be processing a separate instruction at the same time, so you can actually be processing as many instructions in one cycle as you have functional units. But all of our instructions are going to be issued in order; they'll come in through that common instruction fetch and instruction decode hardware. And then they're going to be committed in order. In between, though, since the functional units operate independently, we'll be able to process those instructions out of order, which makes this an out-of-order processor. We want to be sure to issue our instructions in order and commit them in order to avoid a whole bunch of hazards in the process.

So this is a generic diagram of how Tomasulo's algorithm works. We have our common instruction fetch and decode hardware, our commit unit, as well as two functional units in this case. I also have the block of registers way up at the top.
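The issue step described above can be sketched in a few lines of code. This is a minimal illustration, not real hardware: the opcode names, unit kinds, and station counts are all made up for the example. It shows the two decisions the issue hardware makes, matching the instruction to the right kind of functional unit, and then preferring the least-full one.

```python
# Illustrative sketch of in-order issue in Tomasulo's algorithm.
# All names (opcodes, unit kinds, capacities) are assumptions for the example.

from collections import deque

# Which kind of functional unit handles each opcode.
UNIT_FOR_OPCODE = {
    "add": "int_alu",
    "sub": "int_alu",
    "load": "mem_unit",
    "fadd": "fp_adder",
}

class FunctionalUnit:
    def __init__(self, kind, num_stations):
        self.kind = kind
        self.stations = deque()          # occupied reservation stations
        self.capacity = num_stations

    def free_slots(self):
        return self.capacity - len(self.stations)

def issue(instruction, units):
    """Issue one instruction: pick the least-full unit of the matching kind."""
    kind = UNIT_FOR_OPCODE[instruction["op"]]
    candidates = [u for u in units if u.kind == kind and u.free_slots() > 0]
    if not candidates:
        return None                      # structural hazard: issue must stall
    target = max(candidates, key=FunctionalUnit.free_slots)
    target.stations.append(instruction)
    return target

units = [FunctionalUnit("int_alu", 2), FunctionalUnit("int_alu", 2),
         FunctionalUnit("fp_adder", 3), FunctionalUnit("mem_unit", 4)]
issue({"op": "add", "dst": "r1", "src": ["r2", "r3"]}, units)
```

Note that when every matching unit's reservation stations are full, `issue` returns `None`: in hardware, that's the structural hazard that stalls the in-order front end.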
An instruction comes in from the instruction fetch and decode hardware, and it's going to be sent to one of the two sets of reservation stations, depending on which one matches up with the functional unit it needs. If your instruction already has all the parameters it needs, you can just pull all of them from the registers, put that information into the reservation station, and the functional unit can run your instruction as soon as the unit is free. When it's done, it will take that result and put it out on the data bus. That will go off to the commit unit, where it will be committed in time, but it will also be sent to all of the other reservation stations in the system.

This is because when an instruction comes in that doesn't already have all of its parameters, it's going to get to a reservation station and write down where each missing parameter is going to come from: which of these reservation stations holds the instruction that will compute the data this instruction needs. So then, when that producing instruction finally moves through the pipeline and comes out the other end, it will send that data back around the bus, and each of the reservation stations will have a chance to grab it. If my new instruction needs that data, it's going to grab it off the bus, toss it into its operand slot, and mark that it's got that data now. If that's all the data it needed, then it's ready to run, and that's great. If not, it'll sit around and wait until it sees the other piece of data that it needs.

Put together, these mechanisms will allow us to avoid all of the hazards we could potentially have without any real effort. We're not going to need any special hardware for forwarding or for keeping track of when instructions are writing to registers. Some of our hazards are handled just by the commit unit, because it commits things in order.
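The operand-waiting behavior described above can also be sketched in code. This is an illustrative model with made-up names (there's no timing, and real reservation stations are just comparators and latches): each operand slot holds either a ready value or a tag naming the station that will produce it, and every station watches the shared bus for results whose tag matches one it's waiting on.

```python
# Illustrative sketch of a Tomasulo reservation station capturing
# operands off the common data bus. Names and structure are assumptions.

class ReservationStation:
    def __init__(self, tag, op):
        self.tag = tag          # this station's name, e.g. "RS3"
        self.op = op
        self.vals = {}          # operand slot -> value already in hand
        self.waits = {}         # operand slot -> tag of the producing station

    def set_operand(self, slot, value=None, source_tag=None):
        if source_tag is None:
            self.vals[slot] = value          # value was ready in the registers
        else:
            self.waits[slot] = source_tag    # note who will produce it

    def snoop(self, tag, value):
        """Watch the data bus; grab any operand we wrote down as coming from `tag`."""
        for slot, src in list(self.waits.items()):
            if src == tag:
                self.vals[slot] = value
                del self.waits[slot]

    def ready(self):
        return not self.waits    # ready to run once nothing is outstanding

rs = ReservationStation("RS3", "add")
rs.set_operand("a", value=5)              # this operand came from the registers
rs.set_operand("b", source_tag="RS1")     # this one is still being computed
rs.snoop("RS1", 7)                        # RS1's result broadcasts on the bus
```

After the broadcast, `rs.ready()` is true and the station can be dispatched to its functional unit; had a different tag come by first, the station would simply keep waiting.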
The forwarding will be replaced by the reservation stations just waiting around for their data to be available. They don't immediately grab things out of the registers; instead, they sit around waiting for data to come around on this bus.

Tomasulo's algorithm is actually used in real hardware these days. Here's an example: the microarchitecture used in the Zen processors that AMD released in 2017. You can see a lot of the same things that we saw in the previous picture. In this case, we have a really complicated instruction fetch and decode unit because we have a much more complicated architecture. It's got a lot of things you'd expect in there: we've got the decode unit all the way on the far end, we've got a branch predictor, and we've got some caches to speed up the fetching. Once we've fetched an instruction, we're going to send it off to one of the functional units. We've got a whole bunch of them on this row, with their reservation stations in the rows above: all of the integer ALUs on the far side, and some floating-point adders and multipliers on this side. This hardware doesn't really have separate load/store address hardware; it actually reuses some of the hardware from the integer ALUs and then sends the target address to the load/store queue.

One other interesting thing to note about this architecture is that we can issue six instructions into our functional units every cycle, which gives us lots of instructions to process constantly, but the hardware is able to commit eight instructions every cycle. This matters when we have some long string of dependencies, a whole bunch of instructions that depend on each other; they can create a bottleneck in our system.
If we can commit more instructions than we issue every cycle, then once that pile of dependencies resolves itself, we're able to flush out those instructions pretty quickly and clear out a lot of our functional units so that they're free for new instructions when they come in. If we didn't do that, then all of our reservation stations would gradually fill up as dependencies accumulate: an instruction here and an instruction there that we just can't commit as fast as the six instructions we're issuing every cycle.

So Tomasulo's algorithm really is still used. It's a great way to partition our hardware so that we get a lot of use out of it and actually get really good performance as well. There's a lot more to keep track of than there was in our single-stage architecture or our five-stage pipeline, but we're still using a lot of the same elements. We still have the same instruction fetch and instruction decode sort of hardware, and we still have some instruction commit hardware. We've just changed how we process those instructions in between.
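The arithmetic behind that design choice is worth making concrete. The sketch below is back-of-the-envelope only (not a simulator, and it ignores everything except the two widths): if a dependency chain has left a backlog of completed-but-uncommitted instructions, and commit retires more per cycle than issue adds, the backlog drains at the difference of the two widths.

```python
# Illustrative arithmetic for why a commit width (8/cycle in Zen) wider
# than the issue width (6/cycle) helps: after a dependency-chain stall,
# the backlog drains at (commit_width - issue_width) instructions/cycle.
# This is a toy model, not a simulation of the real pipeline.

def cycles_to_drain(backlog, issue_width=6, commit_width=8):
    """Cycles until the backlog clears; requires commit_width > issue_width."""
    cycles = 0
    while backlog > 0:
        backlog += issue_width   # new instructions keep arriving
        backlog -= commit_width  # but commit retires more of them per cycle
        cycles += 1
    return cycles
```

With equal widths, the backlog would never shrink, which is exactly the "reservation stations gradually fill up" failure mode described above.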