So today we will talk about one of the last processors that Compaq produced, the Alpha 21264. The last one was the 21364; the 21264 came before it. The 21464 was still in design when Compaq was bought by HP, and it was eventually cancelled. We will talk a little bit about the 21464 also if we get time. Okay. So we will go through this the same way we did the R10K. The first silicon that came back from the factory for this processor had roughly three times the transistor count of the R10K: about 15 million transistors on a 310 mm² die, in a 0.35-micron process. If you go back and look at the MIPS R10K, it had a die size of about 290 mm² and about 6 million transistors. So you can see that this reflects the ingenuity of the designers: in roughly the same die area, at the same feature size, they fit three times as many transistors. The MIPS R10K was also a 0.35-micron design. Just like the MIPS R10K, the 21264 decodes and renames four instructions per cycle. The pipeline was taken from its predecessor, the 21164; essentially, the out-of-order mechanisms such as renaming were added on top of it, while the basic pipeline was borrowed from the predecessor. It has on-chip 64 KB, 2-way set-associative instruction and data caches, and an off-chip direct-mapped backup cache of variable size: it can range from 1 MB to 16 MB, configured when the system is built. It fetches four pre-decoded instructions from the instruction cache every cycle. There is a 32-entry return address stack, which helps predict return addresses. This processor introduced a new feature: line and way prediction. It is essentially a branch target buffer fused with the instruction cache. What it means is that for each block of four instructions, the line predictor tells you which cache index to fetch from next, and the starting block offset within that index. So let's say this is my instruction cache, which has two ways. Currently I am fetching from this particular cache index and this way; I fetch four instructions from this index and this way, and then I ask the line predictor to tell me where my next four instructions are. Most of the time, for sequential code, it will simply point to the sequentially next four instructions. However, if one of these instructions is a branch, it will tell you where the target is, maybe somewhere else entirely. That is the line predictor: it predicts the next line to fetch. The way predictor tells you which of the two ways to pick: at this index you have two possibilities, and the way predictor tells you which way to go. If the prediction is correct, it provides a fast way to access a large two-way set-associative cache. Otherwise, they found it was pretty much impossible to access a 64 KB two-way cache at whatever frequency the processor was targeting. Because, if you notice, here we are avoiding the time to decode the index completely, and we are avoiding the time to do the tag comparison and the way multiplexing completely; we are just asking the predictor which line and which way, and we go and access it. But of course, in the background, we have to keep doing the normal thing: we will update the PC, decode the index, look up all the tags, do the comparison, and find out which way is the right one.
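To make the line and way prediction idea concrete, here is a minimal software sketch. The entry format (next_index, next_way, next_offset), the function names, and the two-way organization are illustrative assumptions for a 2-way cache as described above; this is a model of the mechanism, not the actual 21264 structure.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical line/way predictor entry: one per fetched block of four
 * instructions.  It predicts the next I-cache index (the "line") and
 * which of the two ways holds it. */
struct line_way_entry {
    uint16_t next_index;   /* predicted I-cache set index of the next block */
    uint8_t  next_way;     /* predicted way (0 or 1) within that set        */
    uint8_t  next_offset;  /* predicted starting instruction offset         */
};

struct fetch_group { uint32_t insts[4]; };

/* Fast path: trust the predictor and read the data array directly, with no
 * index decode, tag compare, or way multiplexing on this path. */
struct fetch_group fast_fetch(const struct line_way_entry *pred,
                              struct fetch_group icache_data[][2])
{
    return icache_data[pred->next_index][pred->next_way];
}

/* Slow path, done in the background: compute the index from the PC, compare
 * the tags, and find the correct way.  If it disagrees with the prediction,
 * the speculatively fetched block is cancelled and the entry is retrained. */
bool verify_and_train(struct line_way_entry *pred,
                      uint16_t true_index, uint8_t true_way)
{
    bool correct = (pred->next_index == true_index) &&
                   (pred->next_way == true_way);
    if (!correct) {                 /* misprediction: retrain, then refetch */
        pred->next_index = true_index;
        pred->next_way   = true_way;
    }
    return correct;
}
```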
If the prediction was correct, fine; otherwise, you cancel whatever you fetched based on the predictor and fetch the right thing. It also acts as a target predictor for unconditional branches, subroutine calls, and other predictable branches. As I said, it is essentially a BTB fused into the instruction cache. Is this particular concept clear? Now, more prediction: it has a fairly sophisticated branch predictor. The 21264 pipeline has a seven-cycle branch misprediction penalty; that is the minimum time between prediction and branch execution, and it can be larger. This is much larger than the MIPS R10K, which had a much shorter pipeline. It uses a hybrid branch predictor called the tournament predictor. The basic idea is to have two different types of predictors that are good at predicting different classes of branches, and a third meta-predictor, or chooser, that decides which of the two predictors to pick for each branch based on their past behavior. We have already discussed these predictors in the class, and you have implemented one in your homework. This way you can pick the predictor that is good at predicting a particular branch. So what are the two components? It actually uses a SAg and a GAg, with a bimodal meta-predictor. In your homework you implemented a SAg and a gshare. One interesting thing that the 21264 did concerns the global history. If you recall, your GAg predictor has a global history register. In the 21264 it is 12 bits, which indexes a table of 4096 entries; these are basically two-bit saturating counters, and these counters give you the prediction of the GAg component. If you just look at the GAg predictor in isolation, the problem is that there is a minimum seven-cycle latency from the prediction to the point where the correct outcome of the branch is known. So it is clear that your global history register will lag behind the correct outcomes: you predict a branch whose outcome will only be known seven or more cycles later, and in between there will be branches coming in that access a stale GHR. One way to avoid this problem is to believe your predictor all the time: whatever the predictor tells me, I push that into the history. That is called speculative update of the global history register. What it buys you is that your GHR is up to date most of the time, provided the prediction accuracy is high; otherwise, of course, the quality of your GHR will go down. But once in a while your predictor will make mistakes, so the question is how you are going to fix that, because now I have corrupted the GHR. What you do is this: as we already mentioned, for rolling back your register map you need to make a checkpoint whenever a branch is renamed. You make these 12 bits also part of that checkpoint. So whenever you detect a branch misprediction, you also restore the GHR from that checkpoint. In the case of a correct prediction there is no bubble, and the 21264 supports 20 in-flight branches; compare that with 4 in the case of the R10K. This means you need a higher prediction accuracy from your predictor. As a rule of thumb, if I allow n outstanding branches and a prediction accuracy of p for each branch, then I want p to the power n to be more than half, assuming the predictions are independent. If you plug in n = 20, p comes out to about 0.97.
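Writing out the rule-of-thumb arithmetic explicitly (independence of the 20 branch predictions is the simplifying assumption mentioned above):

```latex
P(\text{all $n$ outstanding branches correct}) \;=\; p^{\,n} \;\ge\; \tfrac{1}{2}
\quad\Longrightarrow\quad
p \;\ge\; 2^{-1/n}, \qquad
n = 20 \;\Rightarrow\; p \;\ge\; 2^{-1/20} \approx 0.966 .
```

That is the roughly 0.97 figure quoted above.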
So you need an extremely accurate branch predictor to make any sense of this design; in other words, you are, not guaranteed, but most likely to be on the wrong path if p falls below that. The second stage is called Slot. Here the branch predictor's outcome is compared against the line and way prediction outcome for conditional branches, and in case of a mismatch the branch predictor overrides the result of the line and way prediction, which results in a one-cycle bubble: essentially, whatever you have fetched believing the line and way predictor has to be discarded. So that is the first thing done in this particular stage. Now, the 21264 has two integer ALU clusters, and each cluster has two sub-clusters; we will look at these sub-clusters in more detail very soon. The point here is that during this stage, the integer instructions are statically assigned to one of the sub-clusters based on a scheduling heuristic. Essentially you are saying: I will statically bind an instruction to one of the sub-clusters of an ALU cluster. What may happen is that doing it dynamically would probably have been better, because when the instruction finally becomes ready, the statically assigned sub-cluster may actually be busy doing something while the other sub-cluster is free. But they chose this particular design for simplicity. So instructions are statically slotted here, and that is where the name of the stage comes from. This sub-cluster assignment will be used by the integer issue queue logic later for sending instructions to the proper sub-cluster. Next is Rename. This is very interesting: the way they rename is fundamentally different from the way we have talked about in the class and from what the R10K does. It supports 80 physical registers in the integer register file and 72 in the floating-point register file, and for every instruction the register map is saved. This gives you immense flexibility, because now for every instruction you have a specific rollback point: you are checkpointing the register map on every instruction. But this also means that your register map checkpoints must not be too heavy. They provide 80 register map checkpoints; essentially they allow 80 in-flight instructions. So saving 80 register map checkpoints should not require too much space; that is what is needed. It is a completely different implementation compared to the R10K, so let's look at it more closely and see what they actually do. Recall that what we have discussed so far, which is also what the R10K does, is this: if you have n logical registers, you have a map table which is n entries tall, and if you have p physical registers, each entry is log2(p) + 1 bits wide; the extra bit is the valid bit we talked about. If you plug in the numbers, n is 32 as usual, and p here is 80, so each entry is 8 bits: 7 bits to be able to encode 80 physical registers, plus the valid bit. So one checkpoint would be 256 bits, and we have 80 checkpoints; that is a lot of data. So instead of this, what they do is to have a p-entry table, one entry per physical register, and each entry is log2(n) + 1 bits wide; again the extra bit is the valid bit, and the content of the entry is the logical register number to be compared against. So how do you do the mapping?
Whenever you get a logical register, you make a fully associative comparison, and exactly one entry is going to match with its valid bit set; the row number of that entry is the physical register. So which one is better? Roughly, you are comparing n log p with p log n, which is like comparing p to the power n with n to the power p. We can assume that p is bigger than n, but is that enough to establish a relationship? Sorry? Which one is bigger? n to the power p is bigger? I will give you one example for each: p to the n is bigger than n to the p if you take p = 3, n = 2, but it is the other way around if you take p = 4, n = 3. So there is no general rule, but for this particular design point you can establish the inequality; I have not calculated it, but this one is going to be larger. That is not really important, though. What is important is how you checkpoint this gigantic table of p log n bits. You can calculate it: with n = 32, this is going to be 480 bits, because each entry is 6 bits. Sir, how do you look up this map table? I do a comparison, an associative comparison with all entries; it is a CAM, not a RAM. So this table is 480 bits, whereas the other one was 256 bits; you have a larger map table. If you decide to checkpoint this whole table for every checkpoint, it will be gigantic; that is not the way to do checkpoints. And this was a major contribution of this design. One suggestion: store only the differences between checkpoints. Between consecutive instructions there will not be many differences; most of the entries will remain the same, and only a few mappings change. So you could save one full copy of the table, save differences for the next five, six, seven instructions, and then checkpoint a full table again. But what is the worst case? And how do I restore an arbitrary checkpoint — is that easy? I would have to apply the differences sequentially, right? Can I do something better? I claim that it is enough to save p bits. Think of an instruction being renamed right now; certain things are mapped at this point. How many entries of the table are going to be valid? Two logical registers cannot — sorry, one logical register cannot have multiple existing mappings, right? So there can be only n rows that are valid. If I look at the valid-bit column, I will find that there are exactly n bits turned on; the remaining bits are off. Is that clear? So currently, our n logical registers are mapped to certain n physical registers. Now, until this instruction commits, if I pick up any of these currently allocated physical registers, it will still have the same mapping, right? Look at some valid entry here: some logical register has this particular physical register mapped to it. When does this mapping get destroyed? When the same logical register gets redefined, and the instruction that established the new mapping actually graduates. But that new mapping does not even exist at this point; it is in the future.
So from the viewpoint of the current instruction, until it graduates, I claim that all these mappings will remain unchanged; none of the mapped physical registers can be freed. Is that correct? Because any of the mapped physical registers can be freed only when the next producer of the same logical register arrives and commits, and that has to happen in the future; it has not arrived yet. Sorry? Then why is this particular logical register mapped here — shouldn't it be mapped somewhere else? Well, currently I am looking at only the most recent producer of each of the logical registers in this map; that is the view that I have, and that view is not going to change before this particular instruction graduates. So I claim that a checkpoint is just the valid-bit column. If I take an arbitrary instruction and I want to restore the register map to the point when it was renamed, all I do is copy that valid-bit column back into the table. That's it. How do you know which logical register each physical register is mapped to? The restored column will have exactly n bits on, and whichever n rows it activates, that is exactly the state of the map table when this instruction was renamed. That is what I am claiming. How can this be possible — only n bits are valid, so they pick out n physical registers, but what is the mapping? Which physical register maps to which logical register? How do we get to know that? No, that is what I am saying: consider an instruction which is being renamed right now. This is the state of the table; currently it has some n bits on in the valid-bit column. What I am saying is that in the future, at any point in time before this instruction graduates, if I want to roll the register map back to the point when this instruction was renamed, it is enough to just copy back the valid-bit column as it was at that point, even though the mappings may have changed in between. But when I took the checkpoint, there might have been instructions ahead of me, before me — right, and they are already renamed, so their renaming is already reflected here. And there might be some other instruction after me which creates a new mapping — so let us be precise: we are talking about renaming the current instruction I. Consider an instruction before me, call it I-prime, which is already renamed; the mapping it has produced is already in the table, I-prime will commit, and suppose there is no other instruction in between. Say I-prime produces some logical register N0, which gets mapped to P0; the previous mapping of N0 was, say, P0-prime; and suppose that in between there was no other instruction producing N0. So when I is renamed, P0 says N0 and is valid, while P0-prime also says N0 but with its valid bit off. So the point is that you can checkpoint just this: just 80 bits per instruction. What penalty do you pay? You pay a bigger size penalty for your map table, but the checkpoints become very thin. So the point is that from the renaming of I to its commit, none of the mappings that are valid at rename time go away. From renaming to commit of I, the table does not change? Of course it changes.
There are instructions continuously being renamed. What I am saying is that none of the physical registers which are mapped at this point can be freed; they will remain allocated, and that is what I am exploiting to argue that if you want to roll the map back to this particular state, you just copy back the valid-bit column. So the end result is that for storing 80 checkpoints you essentially need 6400 bits, because each checkpoint is 80 bits. That is of course larger than your MIPS R10K checkpoint storage, because the R10K was keeping only 4 checkpoints, which required about 1024 bits; but you can imagine that having 80 R10K-style checkpoints here would be huge. Any questions? Not clear? Sir, actually, when the next instruction producing the same register commits, we recycle the previous one, right? Yes, recycle the previous one. As I said, here is the example: I-prime produces logical register N0. If I go to the row of P0, it says N0 and its valid bit is on. The previous mapping was P0-prime; that row also says N0, but its valid bit is off. And finally, only when I-prime commits does P0-prime get recycled; not until then. And what if you want to roll back to an instruction? Consider, for example, I-prime. Before we make this change at its rename, we checkpoint the valid-bit column. So if I want to roll back to I-prime, this P0 bit actually becomes off when I copy back the valid-bit column, because in that checkpoint the bit was off; it is turned on only after I-prime establishes the new mapping. So the valid-bit column tells me exactly what I need. But of course you cannot say that, after I-prime and I have committed, a few cycles later I now want to roll back to I; that is not possible, because the register map will by then have changed irreversibly, and you cannot reverse that. Okay. So that was pretty neat, actually. The end result is a much more efficient register map saving and restoring hardware. Sir, why do they store this for each instruction? Because, they said, this is one way of decoupling your pipeline implementation from the register map. What happens is that, depending on your pipeline implementation and how your system is designed, certain instructions may raise certain types of exceptions. This allows you to support an arbitrary exception for any instruction: you can just roll back to its checkpoint. So this allows you to not bake the definition of exceptions into the processor; if you have this facility, you can support it. And of course, later you can also modify your design and say that, although I could checkpoint all of these, I only checkpoint for branches; you only turn on that particular case. So it is possible to restore the register map of any arbitrary instruction in a single cycle, not just branches.
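Here is a minimal software model of the CAM-style map table and the valid-bit-column checkpointing described above. The 32 logical / 80 physical register sizes follow the lecture; the function names and sequential loops are illustrative (the hardware does the associative search and the column copy in parallel, in a single cycle).

```c
#include <stdint.h>

#define N_LOG  32   /* logical (architectural) registers              */
#define N_PHYS 80   /* physical registers: one CAM row per phys reg   */

/* CAM-style map table: indexed by physical register, each row holds the
 * logical register it currently renames plus a valid bit.  At any instant
 * exactly N_LOG rows are valid. */
struct cam_map {
    uint8_t logical[N_PHYS];   /* log2(32) = 5 bits of content per row */
    uint8_t valid[N_PHYS];     /* the 1 extra bit per row              */
};

/* Lookup = associative search: the row whose content matches the logical
 * register and whose valid bit is set gives the physical register number. */
int lookup(const struct cam_map *m, uint8_t logical_reg)
{
    for (int p = 0; p < N_PHYS; p++)      /* done in parallel in hardware */
        if (m->valid[p] && m->logical[p] == logical_reg)
            return p;
    return -1;                            /* should not happen            */
}

/* Rename a new producer of logical_reg to a free physical register new_p:
 * checkpoint the 80-bit valid column first, then clear the old mapping's
 * valid bit and set the new one.  The old physical register is actually
 * freed only when the new producer commits. */
void rename_producer(struct cam_map *m, uint8_t logical_reg, int new_p,
                     uint8_t checkpoint[N_PHYS])
{
    for (int p = 0; p < N_PHYS; p++)
        checkpoint[p] = m->valid[p];      /* the 80-bit checkpoint        */

    int old_p = lookup(m, logical_reg);
    if (old_p >= 0)
        m->valid[old_p] = 0;              /* old row keeps the logical reg
                                             but is no longer the current map */
    m->logical[new_p] = logical_reg;
    m->valid[new_p]   = 1;
}

/* Roll back to an arbitrary checkpoint: just copy the valid column back. */
void restore(struct cam_map *m, const uint8_t checkpoint[N_PHYS])
{
    for (int p = 0; p < N_PHYS; p++)
        m->valid[p] = checkpoint[p];
}
```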
In this stage, every instruction is also assigned a reorder buffer entry. This serves the same purpose as the active list, just a different name, but much bigger: the active list was 32 entries in the R10K, while this is 80 entries; it allows you to have 80 in-flight instructions in the pipeline. Each instruction is also assigned to one of the two issue queues: one is the integer queue and one is the floating-point queue. The integer queue holds all integer ALU and load/store instructions, essentially all instructions that need to read the integer register file. Notice the difference from the R10K, which had a separate queue for load/store operations. The second one, the floating-point queue, holds all FP ALU and floating-point store instructions, essentially all the instructions that need to read the floating-point register file. There is a slight problem with floating-point stores, because they need to read the integer register file for computing the address, but they need to read the floating-point register file for getting the value to be stored. So floating-point stores are special: they need to read both the integer file for the address operand and the floating-point file for the data operand, and they are split into two parts, with some inter-queue communication; the address part goes to the integer queue and the data part goes to the floating-point queue. What about floating-point loads — they write the floating-point register file? Yes, but floating-point loads are included in the integer queue, because the queues are defined in terms of the read ports, that is, which register file an instruction reads, not which one it writes. So how do you issue? The 20-entry integer queue can issue at most four instructions every cycle, out of order, to the pre-assigned ALU sub-clusters; we have already covered the slotting part. The 15-entry floating-point queue can issue at most two instructions every cycle, out of order. Because issue is out of order, these queues are collapsible: you can pull out instructions from the middle, which creates holes in the queue, so you collapse the queue to create space at the tail for new instructions. One requirement is that whatever instructions are present in the queue must be maintained according to their age. That is very important, because it guarantees fairness: in a cycle where multiple instructions become ready, the older instructions are given priority, so that they can move forward. (A small sketch of this oldest-first selection appears below.) The issued load instructions are then allocated in a 32-entry load queue and stay there until they retire: once they have issued from the issue queue, they sit in a separate queue called the load queue. The store instructions go to a 32-entry store queue. The store queue entries are wider than the load queue entries, to be able to hold up to 64 bits of store data; remember that the data gets moved to the cache only at commit. You can think of the load and store queues as the two halves of the R10K address queue, with one major difference: the load instructions are now executed out of order, which was not the case for the R10K. Stage 5 is operand read. Each integer ALU cluster has its own copy of the integer register file; these copies are kept coherent with a one-cycle inter-cluster transfer. Remember, there are two ALU clusters, and each cluster has its own copy of the integer register file. That creates a coherence problem, because one cluster may modify some register in its copy, and that change may not be reflected in the other copy. So there is a one-cycle inter-cluster transfer cost: within a single cycle, a newly produced value can make it over to the other cluster.
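Going back to the issue queues for a moment, here is a minimal sketch of age-ordered, oldest-first selection from a collapsing queue. The 20-entry size and 4-wide issue come from the integer queue parameters above; the single ready bit, the entry format, and the fact that per-sub-cluster arbitration is ignored are simplifying assumptions of mine.

```c
#include <string.h>

#define Q_SIZE      20   /* integer queue size from the lecture  */
#define ISSUE_WIDTH  4   /* at most four integer issues per cycle */

struct iq_entry {
    int valid;
    int ready;       /* all source operands available (simplified)     */
    int subcluster;  /* statically assigned at Slot, consulted on issue */
    int inst_id;
};

/* Entries are kept in age order: index 0 is the oldest.  Each cycle we scan
 * from the oldest end and pick up to ISSUE_WIDTH ready entries, so older
 * ready instructions always win over younger ones. */
int select_oldest_first(struct iq_entry q[Q_SIZE], int issued[ISSUE_WIDTH])
{
    int n = 0;
    for (int i = 0; i < Q_SIZE && n < ISSUE_WIDTH; i++) {
        if (q[i].valid && q[i].ready) {
            issued[n++] = q[i].inst_id;
            q[i].valid = 0;                 /* leaves a hole in the queue */
        }
    }
    /* Collapse: squeeze out the holes so age order is preserved and space
     * opens up at the tail for newly renamed instructions. */
    int w = 0;
    for (int i = 0; i < Q_SIZE; i++)
        if (q[i].valid)
            q[w++] = q[i];
    for (; w < Q_SIZE; w++)
        memset(&q[w], 0, sizeof q[w]);
    return n;
}
```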
Coming back to the two clusters and their register file copies, this actually has an implication: suppose you send an instruction which produces, let's say, register 10 to cluster 1, and you send an instruction which consumes register 10 to cluster 2. The cluster 2 instruction is going to be delayed because of this one-cycle inter-cluster transfer. So the scheduler is smart enough to push instructions belonging to the same dependence chain to the same ALU cluster. That also has a downside, because what may happen is that one cluster becomes very crowded while the other cluster is mostly empty; their scheduling algorithm takes care of these things. So the issued integer instructions read their source operands from the respective cluster's register file, and the floating-point instructions read their operands from the floating-point register file. In the next cycle, the operand values may be overridden by bypassed values; this is the traditional bypass path, which we have also seen. Sir, why do they have separate register files? Yeah, why don't you guess? That is a very good question: why would you split the register file, why not a unified register file? If you think about it, suppose the clusters are symmetric and, for the sake of argument, each has its own copy of the register file with some number of ports. If you unified them into one register file, the numbers of ports would have to be added up, and that would slow down the register file. What they found was that this one-cycle penalty was far better than a unified register file, and the scheduler makes sure that dependent instructions get directed to the same cluster, so the one-cycle transfer is rarely needed. From stage 6 onward, the instructions execute on the appropriate functional units. The two integer clusters are not exactly symmetric. Each cluster is divided into an upper sub-cluster, U0 or U1, and a lower sub-cluster, L0 or L1. The lower sub-clusters are identical, and each contains logic units, one adder, and a load/store virtual address calculator; so that is your L0 and L1: you have logic, you have an adder, and you have an address calculator. U0 contains an integer multiplier, a branch unit, logic units, and an integer adder. U1 is identical to U0 except that it does not have the multiplier; instead it has a motion-video unit and some special instructions like population count and leading-zero count. Leading-zero count tells you how many zeros there are on the MSB side, the most significant side, until you reach a one. So, as you can see, overall what you get is one multiplier; four logic units, because all the sub-clusters have one each; three adders, in L0, L1, and U0; two branch units; two shifters; and two address calculators in the lower sub-clusters. The floating-point units are four in number: a fully pipelined multiplier, a fully pipelined adder, an unpipelined divider, and an unpipelined square-root unit. You might have noticed that there is no integer divider. Loads and stores calculate their virtual addresses in L0 or L1; they issue out of order from the integer queue, as already mentioned, and at this time they are also placed into the load queue or the store queue. These two queues are maintained in order, that is, older instructions are closer to the head.
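As a side note on the special count instructions in U1 mentioned above, the loops below are plain-C software equivalents that pin down what population count and leading-zero count compute; the hardware (or a compiler builtin) does each in a single operation, so this is only for illustration.

```c
#include <stdint.h>

/* Population count: number of 1 bits in a 64-bit value. */
int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { n += (int)(x & 1); x >>= 1; }
    return n;
}

/* Leading-zero count: number of 0 bits starting from the MSB until the
 * first 1 bit is seen (64 if the value is zero). */
int count_leading_zeros64(uint64_t x)
{
    if (x == 0) return 64;
    int n = 0;
    while (!(x & (1ULL << 63))) { n++; x <<= 1; }
    return n;
}
```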
Back to the load and store queues: loads may execute out of order, but stores execute in order. When a store issues, it checks all the instructions in the load queue that are after it in program order, and in case of an address conflict, meaning a later load has already gone ahead and read a stale value, it initiates a squash starting from the offending load, and execution is restarted from that load. For that you need to restore the register map, and this is also one reason why you would like to checkpoint your register map for load instructions: any load can run into this particular problem. The R10K did not have this problem because it did not issue loads out of order. To reduce the chance of such squashes due to out-of-order load issue, the processor uses a load wait table. What does this unit do? It decides whether to hold a load back until all stores before it have issued. How do we decide this? The table is updated whenever a load gets squashed due to a store conflict, so that in the future the same load will be held back and will not be issued early. It is a very simple table which essentially identifies load instructions that are prone to this problem. Also, a load checks all instructions in the store queue before it, and in case of a perfect match, meaning the address and the size match, it simply picks up the store value and does not even access the cache. This is the load forwarding case: you have a load, and a store before it, and the load completely overlaps with the store, so the value is taken from the store queue itself. Next, there is speculation in connection with issuing load dependents. Should we issue the dependents of a load speculatively, without knowing if the load is going to hit in the cache? The R10K always optimistically speculates that the load will hit, which is why it always issues the dependents speculatively, and if the load misses, it has to squash the dependents and reissue them; essentially, issue cycles are wasted whenever the load misses. The 21264 instead uses a load hit/miss predictor: when a load issues, it predicts, using a table based on past history, whether the load is going to hit or miss this time, just like a branch predictor; it is a binary predictor. If the prediction is hit, the issue logic may consider issuing the dependents of the load speculatively; if the prediction is miss, you hold back the dependents until you know the outcome of the load for sure.
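Here is a minimal sketch of such a load hit/miss predictor. The PC-indexed table, its size, and the 2-bit saturating counters are my assumptions for illustration; the lecture only says it is a table-based binary predictor trained on past hit/miss history.

```c
#include <stdint.h>
#include <stdbool.h>

#define HM_ENTRIES 1024   /* table size is an assumption, not from the lecture */

/* Hypothetical load hit/miss predictor: small saturating counters indexed by
 * a hash of the load's PC.  High values mean "this load usually hits", so
 * dependents may be issued speculatively assuming the hit latency; low
 * values mean "hold the dependents back until the load resolves". */
static uint8_t hm_counter[HM_ENTRIES];   /* 2-bit counters: 0..3 */

static unsigned hm_index(uint64_t load_pc)
{
    return (unsigned)((load_pc >> 2) % HM_ENTRIES);
}

bool predict_hit(uint64_t load_pc)
{
    return hm_counter[hm_index(load_pc)] >= 2;
}

/* Train on the actual outcome.  A mispredicted "hit" costs a replay of the
 * speculatively issued dependents, exactly the penalty described above. */
void train_hit_miss(uint64_t load_pc, bool did_hit)
{
    uint8_t *c = &hm_counter[hm_index(load_pc)];
    if (did_hit) { if (*c < 3) (*c)++; }
    else         { if (*c > 0) (*c)--; }
}
```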
The last stage of the pipeline is retirement, or commit. The 21264 commits up to 8 instructions every cycle, and under certain situations it can sustain a retirement bandwidth of 11 instructions per cycle for a short period of time. Why is the retirement rate more than my fetch rate? I am fetching 4 every cycle, yet I am saying that I can sustain up to 11 retirements per cycle. What is the rationale behind it? If an instruction is stuck at the head, completed instructions pile up behind it: we always fetch 4, but we may not retire 4 every cycle, so at times there can be many more than 8 completed instructions waiting. These have already executed, they are complete instructions, so the sooner you let them go the better, because retiring them quickly frees up resources: registers, queue entries, everything. So when an instruction comes to the head of the reorder buffer, it can be considered for retirement. What does it do when it retires? It updates the branch predictor; it frees the register map checkpoint for this instruction (remember that every instruction has a register map checkpoint); it frees the load or store queue entry if it is a load or store; it frees the old physical register and updates the free list; and it frees the reorder buffer entry. So, next time, we will talk about Intel.