Last time we discussed register renaming in detail and how you recycle registers. The conclusion concerned the reorder buffer size: the reorder buffer is a FIFO structure for maintaining the order of instructions, and an entry is dequeued, and thus recycled, when its instruction retires. We also deduced a relationship between the ROB size and the register file size, and the conclusion was that if you have more physical registers, more in-flight instructions will be possible, and that will allow you to extract more parallelism from the program. The fundamental assumption here is that to accommodate a large number of in-flight instructions you need a large ROB, and to support a large ROB you require an equally large physical register file.

One small thing that was left was memory renaming. Register renaming essentially deals with the problem of name dependences on registers: the compiler ran out of registers, so it picked a register for storing the result of an instruction, and that register might interfere with the output or input of some other instruction. These are the write-after-write and write-after-read dependences, and they are purely artifacts of the compiler running short of registers. Register renaming, with a much larger register set inside the processor, solves that problem; ideally, with a large enough register file, you would only be limited by the true data-flow dependences of the program.

Memory renaming deals with the analogous problem for memory. There are two situations; essentially we are talking about WAR and WAW hazards. WAW is between the outputs of two instructions: the outputs happen to use the same storage, so you have a write-after-write dependence. WAR is between the output of one instruction and the input of another: the read happens before the write, so you read from a register and later write to the same register in a different instruction, and the second instruction had to write to that register simply because the compiler could not find a different one.

The memory dependences are exactly the same two. The WAW case arises between two store instructions that write to the same memory location. However, this one is not really an artifact of the compiler; it is a program property that the two stores write to the same location. That is a write-after-write hazard, and as such it is impossible to remove entirely, because both writes really do have to reach the same memory location. The WAR case is a load followed by a store: a load reads from a memory location x, and the store writes to the same location x. You can see the problem here: if you switch the order of these two instructions, the load is going to read the wrong value. So you have a load of some register R1 from some address, then after a bunch of instructions a store to the same address, and you have to maintain this order; the load has to happen before the store, there is no way around it, otherwise the load would get a wrong value from that address. That is the WAR hazard.
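A tiny illustration of why this order matters; the address and values here are hypothetical, my own, not from the lecture:

```python
# The memory WAR hazard just described: a load from location x, followed
# in program order by a store to the same location x.
memory = {0x100: 7}          # location x initially holds 7

r1 = memory[0x100]           # load  R1 <- [x]; must see the old value
memory[0x100] = 42           # store [x] <- R10; writes the new value

assert r1 == 7               # if the store had gone first, R1 would be 42
```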
Now, the question we are trying to answer is this. The WAW case is a store from some register to some address and then, after a bunch of instructions, another store from some register to the same address. Is there any way we can actually ignore this order? The reason for wanting to ignore it is that the second store may get ready before the first, so there would be reason to execute it early. The same question applies to the WAR case: the store may get ready before the load. By ready we mean that R10 becomes available before R1, and of course the address operands are available as well. So the question is, can we do that? The answer is yes, and we will use exactly the same philosophy as in the register case. Inside the processor we will have some special storage for holding the values of the store instructions; this is essentially called a store buffer, and the renaming is done through the store queue entries.

You remember that we distributed our issue queue across functional units: we had an integer ALU issue queue, a floating-point ALU issue queue, a store queue, and a load queue. The store queue entries are a little special in the sense that they have a field for storing the value that the store will write, and you need such a field anyway, because a store may execute at any point in time but it cannot update memory until it comes to the head of the ROB. So now we can actually execute the two stores out of order: the value of the first is copied from R1 into its store queue entry, the value of the second from R10 into its store queue entry, and ultimately, of course, they will update memory in order when they reach the head of the ROB. The advantage is that we are able to overlap, or even reorder, the address computations of these two instructions; we have enough resources and we can execute either one whenever it gets ready, compute the addresses, read the register file out of order, and put the values in the store queue entries. Eventually the memory updates will happen in order. Similarly, in the WAR case I can issue the store, ignoring the load completely, and put the value in the store queue entry of that store; however, you have to be careful when you eventually issue the load, and that has to be taken care of.

So here the whole process is summarized. Memory renaming is done through store queue entries: different stores to overlapping addresses can issue out of order, compute addresses, and read the values to be stored from the register file. I mention overlapping addresses in particular because that is the problematic case; that is where the dependence arises. If the addresses are non-overlapping then of course I can execute the stores in any order I want. Stores commit in order, and that is when memory is updated with the new data. What is the load issue protocol? Loads issue out of order when ready and when all stores before them have already computed their addresses; the sketch below puts these pieces together.
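A minimal sketch of this protocol, with made-up structures of my own (the lecture describes hardware, not code): stores capture their values out of order, memory is updated strictly in program order, and a load checks all older stores:

```python
from collections import deque

memory = {}
store_queue = deque()                     # entries kept in program order

def new_store():
    e = {"addr": None, "value": None}     # allocated at dispatch
    store_queue.append(e)
    return e

def issue_store(e, addr, value):
    e["addr"], e["value"] = addr, value   # out of order: capture addr + value

def retire_store():
    # in order: only the oldest store may update memory, and only once done
    if store_queue and store_queue[0]["addr"] is not None:
        e = store_queue.popleft()
        memory[e["addr"]] = e["value"]

def load_can_issue(older_stores):
    # the conservative rule above: all older stores have computed addresses
    return all(s["addr"] is not None for s in older_stores)

def load_value(addr, older_stores):
    # compare against all older stores; forward from the youngest match
    for s in reversed(older_stores):
        if s["addr"] == addr:
            return s["value"]
    return memory.get(addr)
```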
This rule we have already discussed: we had an order between loads and stores, and you can imagine a fused load/store queue as a single FIFO. Say I have a load instruction somewhere in it. What the rule says is that you can issue this load whenever its operands are ready; however, you have to check one thing, namely that all the stores before this load have already issued, and by issued we mean issued and computed their addresses. That has to be guaranteed because, whenever this load issues, you will check whether its address overlaps with any of those stores' addresses, and if it does, we discussed in the last lecture what needs to be done: depending on the type of overlap, you may either take the value from the youngest store before the load, or let the load wait until the conflict has been resolved. This rule looks a little naive, and we will soon improve it, for the simple reason that there should be a mechanism to avoid waiting for all the stores: the load may not depend on all of them, it may depend on only one of them. So this is a very conservative approach, which is still correct, and we will improve upon it very soon.

Load execution happens in two phases, as we discussed earlier. First the load issues and computes its address, and then it compares that address with those of all the stores before it. Of course, to know whether there is a match, the load first has to compute its address. So the way it is done is that whenever the load's operands are ready I issue it; it computes its address, then checks all the older stores' addresses. If there is a match it does something; if there is no match then of course there is no problem. What we have effectively achieved is improved concurrency between the memory instructions, and it is done by doing the memory renaming through the store queue entries: the store queue entry is a temporary incarnation of a particular memory address, temporary in the sense that as long as the store remains inside the processor, that store queue entry stands for that address. Ok.

So, after you do all these things, what is left? Number one is instruction fetch latency. You would have to go to memory to bring in instructions every cycle, and that is going to be very slow, so you need an instruction cache; we will talk about caches after this. The instruction cache holds, depending on the management policy of the cache, the most recently used instructions, and hopefully you can reuse them over time. The second problem is branch misprediction. Observe that you predict a branch in decode and the branch executes much later down the pipeline, so there are several pipeline stages before the outcome is known. A misprediction amounts to a loss of at least N·F instructions, where N is the number of cycles before the branch resolves and F is the fetch width. That is pretty obvious: every cycle I fetch F instructions, so with an N-cycle branch penalty, a misprediction means I have fetched N·F useless, wrong-path instructions. That is a huge loss, actually, and if you have a very deep pipe or a very large fetch width, the implication is that you had better have a very good branch predictor.
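To make the N·F loss concrete, a quick sketch; the numbers are illustrative, not from any particular processor:

```python
# Wrong-path instructions fetched before a mispredicted branch resolves:
# N cycles of penalty times F instructions fetched per cycle.
def misprediction_loss(penalty_cycles, fetch_width):
    return penalty_cycles * fetch_width

print(misprediction_loss(10, 4))   # 40 useless instructions per misprediction
```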
The third problem is slow memory. Just to give you a feel for this problem, here is one particular data point; it has nothing to do with any particular processor, it is just an arbitrary example to convey the problem. Let us assume that you have an issue width of 4, meaning that every cycle you can issue at most 4 instructions. Assume that the frequency of the processor is 3 GHz and the memory latency is 120 nanoseconds. Now suppose one memory operation, say a load instruction, is currently in flight. To be able to hide the latency of this particular load, how many instructions do I need? What does it mean to hide this much latency? It means the processor should look as if there were no memory operation: during this 120-nanosecond window the processor should continue to issue 4 instructions every cycle, so it must somehow find 4 independent instructions every cycle. 120 nanoseconds at 3 GHz is 360 cycles, and 360 multiplied by 4 is 1440 instructions. So if a load is outstanding and I want to hide its whole latency, I must be able to find 1440 independent instructions to issue during this time. The fundamental requirement for being able to do that is a ROB of size at least 1440; otherwise I will not be able to do this.

Essentially, this is impossible; it is just not practical, and the main problem is shortage of resources. You cannot support such a large ROB, and if you wanted to, you would need an equally large register file; we talked about that relationship, where the register file size differs from the ROB size only by roughly the number of architectural registers. So this is what ultimately becomes the problem: resource constraints, the ROB size, the register file size, the issue queue size. These structures ultimately limit your ILP, and that is related to this slow-memory problem, in addition to the two problems we already saw. We will talk about this particular problem in great detail in the remaining portion of the course. It is a very hard problem, because this is where the cycles go today: when you run a program and try to find out where exactly it spends its time, you find that 70 to 80 percent, or even more, goes into just accessing memory. Memory has become extremely slow relative to the processor; that is the big problem. The processor computes very quickly and asks for data, and we simply do not have the bandwidth or the intelligence in the system to provide the data that fast. We will discuss some of the solutions that are out there, but to this day it is an open problem, very much open actually. Alright, any questions? Ok.
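Before moving on, the arithmetic behind the 1440-instruction figure above is worth writing out; a minimal sketch using the same assumed numbers:

```python
freq_hz     = 3e9       # 3 GHz clock
mem_lat_s   = 120e-9    # 120 ns memory latency
issue_width = 4         # instructions issued per cycle

cycles = mem_lat_s * freq_hz      # cycles the processor must keep busy
insts  = cycles * issue_width     # independent instructions needed
print(int(cycles), int(insts))    # 360 1440
```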
Going back to the basics a little bit, recall that this was our equation for computing execution time: cycles per instruction times instruction count times cycle time. So far we have focused on the first component, that is, how to reduce the CPI. The instruction count is determined by the quality of the compiler; the architecture as such cannot do much about it. The compiler compiles a program into however many instructions it creates, so as far as the architect is concerned this is more or less constant, modulo, of course, the instruction set architecture, which does influence the number and quality of instructions produced by the compiler. The microarchitecture, the implementation of the ISA, has nothing to do with the instruction count; that is constant for our purposes.

So let us now talk a little about the third component. Cycle time reduction is another technique to boost performance, and it essentially translates to having a faster clock. You want more gigahertz, but one should understand that execution time is really a product of three terms: you cannot ignore the other two terms, and you definitely cannot ignore the first one. In fact, the first and the last terms are not independent: if you build a very fast processor, your CPI can occasionally become very large. So you have to be careful, and that is exactly what we are going to discuss now, why that is so.

How do you get a faster clock frequency? From an architect's point of view, ignoring the device physics and electrical engineering part, there is essentially only one way: you make your pipe stages smaller. If you have a 5-stage pipeline and you want to double the frequency, you split each pipe stage and get a 10-stage pipeline; if you want even more frequency, you keep on halving whatever you can. You have already seen in an example how deepening the pipe gives you more frequency. That is what an architect can do to get a faster clock; of course, there are other ways, such as smaller transistors, which can be clocked faster. Each pipeline stage should fit in one cycle, and for a balanced pipeline the stages should have equal delay; essentially you need to break pipe stages into smaller stages to get a faster clock.

So what do we know about this? What are the problems of making a pipe deeper, can anybody tell?

Student: The branch misprediction penalty increases.

Yes, exactly: the misprediction penalty increases. The penalty increases in terms of cycles, not necessarily in absolute time; we are not talking about absolute time here, but more penalty cycles can still mean an increase in time. So lowering your cycle time may not buy you a lower execution time all the time. What else?

Student: The bypass paths increase in number.

Right. And what does that matter from the point of view of execution time? It is not a question of whether the wires can be designed; the point is the latency. The wires get longer, you need larger multiplexers and more comparators, and the latency of the bypass network between the stages increases. All of these taken together may mean you do not actually reach the target clock frequency. You find that, ok, I had a balanced 5-stage pipe at 2 GHz, each stage equal; I make it a 10-stage pipe and I may get up to 3 GHz, but I may not get 4 GHz, for exactly these reasons. These are a couple of the things that eat into your target frequency, and ultimately what it means is that at that cycle time, whatever your target was, your effective CPI is actually bigger.
You end up losing the whole gain. Super-pipelining is the term used by computer architects to describe the phenomenon of pipelining a processor very deeply, and that is what is used for getting a faster clock frequency. Each pipe stage contains a small amount of logic so that it can fit in a small cycle time, but it will degrade your CPI if you are not careful; we have just discussed two reasons why. The branch penalty gets even bigger, as we talked about. Intel Prescott was one of the processors Intel has produced in its history with the highest frequencies, and it had a 31-cycle branch penalty. At that point Intel pretty much had to stop pushing frequency, because there were many other problems as well. Branch mispredictions cause a massive loss in performance: Prescott had a fetch width of 3 micro-ops, so every cycle it fetched 3 micro-ops, and a 31-cycle loss essentially means 93 micro-ops are lost. Every misprediction you make, you lose 93 micro-ops; that is a big loss. You fetch for 31 cycles along the wrong path and you have to throw it all out.

Long pipes also put more pressure on resources such as the ROB and registers, because instruction latency increases. Can somebody explain this particular point? What I am saying is that if you have a deeper pipe, you might have to increase your ROB size and register file size.

Student: Since you have a deeper pipe, you will have more instructions in the pipeline at any point in time compared to a shallow pipe.

Exactly. So you will have to increase your ROB to accommodate that many instructions, and we already know that the ROB size and the register file size are connected to each other: if you want to be able to use the entire ROB, you should have a larger physical register file also. And why is that a problem? Well, if you make the register file larger, its access latency is going to increase. Smaller is faster: if you make a structure bigger, it is going to be slower. So register file reads slow down, and if you want to hold on to your target frequency, you have to pipeline the register file accesses and introduce more pipeline stages. Essentially it is now a feedback loop: you make the pipe deeper and deeper, you need more and more registers, and the pipe keeps getting deeper because you are trying to hit that frequency. At some point you have to say, well, I cannot make it any deeper, and the penalty you pay is that you cannot reach the target clock frequency; you have to run at a slower clock.

So, yes, instructions occupy the registers longer, and the design becomes increasingly complicated because wire delay does not scale. As somebody just mentioned, the bypass may be slow; essentially what that means is that bypassed values may not be delivered in time to their destination. Until now, whenever we talked about bypassing, we said you bypass from one stage to another and the value reaches within one cycle, or immediately. That may not actually happen: even if you have a full bypass network, you may have to wait for some time simply because the bypass path, the long wire plus the multiplexer, just cannot deliver the value in time.
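A rough way to see the resource-pressure argument above; the depth-times-width model and the numbers are my own illustration, not a formula from the lecture:

```python
# Instructions in flight grow roughly with pipeline depth times issue width,
# and the ROB (and hence the physical register file) must hold all of them.
def rough_in_flight(pipe_depth, issue_width):
    return pipe_depth * issue_width

print(rough_in_flight(10, 4))   # shallow pipe: ~40 in-flight instructions
print(rough_in_flight(31, 4))   # deep pipe: ~124, forcing a much larger ROB
```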
So deep pipelining has many downsides, and this is the point: it may severely degrade CPI. Here is an alternative: what if I do not make my pipeline deeper? I keep a shallow pipe but a wider issue. Say I have a 5-stage pipe and I am happy with a modest clock frequency, but I issue 10 instructions every cycle, a very wide issue. Then I do not have all these problems I have just discussed; it is a shallow pipe. But now the burden is that you should be able to find that many independent instructions, otherwise you will be wasting issue slots. So a deeper pipeline gives you a faster clock at the expense of an increased branch penalty and possibly slower bypass, while a wider issue reduces CPI by exposing more parallelism, and which one is better depends on whether your program can offer so much parallelism, that is, whether you can fill up your issue slots. Usually there is a limit; it varies from program to program, but beyond a certain point you will find that you are unable to fill all the issue slots.

And what are the other problems of having a wider issue?

Student: It becomes very complex.

Very complex, yes. Suppose there are n issue slots; then you need an n-by-n bypass, because every instruction will potentially have a bypass to every other instruction; that is the maximum. And in addition to that, the circuitry to find n ready instructions will need more comparators, so the wakeup logic will be very complicated; a tiny sketch of this growth follows.
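```python
# The n-by-n bypass growth mentioned above; the widths are just illustrative.
def bypass_paths(n):
    # every producing slot may need a forwarding path to every consuming slot
    return n * n

for n in (2, 4, 10):
    print(n, bypass_paths(n))   # 2 -> 4, 4 -> 16, 10 -> 100 paths
```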
Ok, so we will wrap this up with a review of the P6 microarchitecture. This goes back a long time, actually: it was introduced in the Pentium Pro and was extended with MMX and SSE in the Pentium II and Pentium III. What do these acronyms stand for? I hope everybody has heard of them. What is MMX? Multimedia extension. What is SSE? Streaming SIMD extensions, where SIMD is the acronym for single instruction, multiple data. So these were introduced in the Pentium II and Pentium III. Pentium Pro was the first processor after the Pentium. The Pentium was a superscalar processor with two pipes, the U and V pipes, but it was not out of order; the Pentium Pro was the first architecture that introduced out-of-order execution and all these algorithms. It had a 14-stage pipeline: as I mentioned, even though x86 is a CISC ISA, internally the instructions get translated to RISC micro-ops, so internally the pipeline looks very similar to the MIPS RISC pipeline. Eight stages are spent in the front end; here is the breakup: 2-cycle fetch, 3-cycle decode, 1-cycle rename, 1-cycle ROB allocate, and 1-cycle dispatch to the reservation stations. That is the front end; then, in the reservation station, the instructions wait for an arbitrary number of cycles, and then there are 3 stages in the execution units and 3 stages in commit. That is your 14-stage pipeline.

Can anybody guess why the decode is so long, 3 cycles?

Student: You need ROB access for ROB allocate?

No, ROB allocate is its own stage. Why is the decode so long?

Student: You need to check for dependences?

No, not in the decode stage; that happens in the rename stage. You are working with a processor in your homework and you have seen its decode logic, right? Well, the instruction set is very complicated, extremely complicated, so it takes a lot of logic to decode x86 instructions. That is the reason the decode is so complicated: it has to decode from complex CISC instructions to micro-operations, so that is where the translation also takes place, and for lengthy instructions the execution stage gets elongated beyond 3 cycles; the 3 stages are the minimum.

It fetches 3 IA-32 instructions every cycle. Who hasn't seen this particular term IA-32? What does it stand for? Intel Architecture, 32-bit. So it fetches 3 IA-32 instructions every cycle, with the limit that it can fetch up to 16 bytes; essentially it fetches the minimum of the two, 3 instructions or 16 bytes, every cycle. The decoder translates these into up to 6 RISC-like micro-ops; again, this is a maximum bound on the decode bandwidth, it can translate at most 6. Of the decoded micro-op slots, 4 are allocated to the first instruction and 1 each to the next 2. This is a static allocation of decode slots to the translated instructions: the first instruction gets up to 4 micro-ops, and 1 micro-op each goes to the next 2. If any instruction needs more than 4 micro-ops, it is handled by the micro-instruction sequencer, because you can see that 4 is the maximum the decoder can handle in translation. In the decode stage, if the second instruction generates more than 1 micro-op, it is actually decoded in the next cycle: what they do is take these 3 instructions and internally reorder them, steering the complex instruction to the first slot and giving the other two slots to the remaining instructions, and even that may not always work out. This should just give you an idea about the complexity of the decoder; it is fairly complicated, actually, as the sketch below suggests.
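To make the 4-1-1 decode rule concrete, here is a small sketch; this is my own illustration of the steering constraint, not Intel's actual decoder logic:

```python
# Per cycle, slot 0 can emit up to 4 micro-ops, slots 1 and 2 one each; an
# instruction that does not fit its slot waits and is steered to slot 0 in
# the next cycle (more than 4 goes to the micro-instruction sequencer).
def decode_cycle(uops_needed):
    """uops_needed: micro-op counts of the next three IA-32 instructions."""
    limits = (4, 1, 1)
    done = []
    for need, limit in zip(uops_needed, limits):
        if need > limit:       # complex instruction: defer to slot 0
            break
        done.append(need)
    return done                # instructions successfully decoded this cycle

print(decode_cycle([2, 1, 1]))   # [2, 1, 1]: all three decode together
print(decode_cycle([1, 3, 1]))   # [1]: the complex one waits for slot 0
```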
Ok. It renames 6 micro-ops every cycle, and this bandwidth has to match the decode bandwidth to make sure the pipeline flows smoothly; otherwise, if the renamer could not rename 6 micro-ops, the renamer would become a bottleneck, so you make sure the bandwidths match. Interestingly, the Pentium Pro did not have a separate physical register file for renaming, so it looked like our initial design, with the renaming done through the ROB slots, and Pentium Pro called these rename slots the retirement register file. Essentially, what they did was that each ROB entry had a special field which would store the value produced by the instruction, and that particular field, taken across all the ROB entries, is called the retirement register file (RRF). The name comes from the fact that the instruction executes, produces its value into that field, and the value remains there until the instruction retires; at that time the value is copied from the RRF to the actual register file. So that is where the renaming happens. In some academic literature a similar structure is called the RUU, the register update unit, although the RUU is actually more complicated than the RRF; we will not talk about the RUU, but if you want to read about it I can give you a reference. It also allocates 6 ROB entries per cycle; again, this bandwidth has to match the decode and rename bandwidths, because if you rename 6 per cycle you had better allocate 6 per cycle. The ROB size is 40, and the RRF is also 40, as you can guess, because the RRF is essentially a field of each ROB entry: if this is my ROB, it is the same size, and you can see that field as an extension of each entry, so the total size is 40.

There is a unified 20-entry reservation station, which is very different from the distributed architecture we talked about; it is like a single shared issue queue. There is a register alias table to keep track of the logical-register-to-RRF map, just like the rename map we talked about. Between every two consecutive front-end stages there is a queue so that some slack can be absorbed; that is very natural, between fetch and decode there will be a queue of instructions so that fetch and decode can each work at their own bandwidth, and the same holds for the remaining stages. For example, for some reason the renamer may stall while the decoder does not; if there is a queue between the decoder and the renamer, the decoder can proceed until the queue fills up while the renamer remains stalled. What could be the reason the renamer is stalled?

Student: There are no free registers.

Here that means the RRF, right, but yes, you are right: in our terms, the RRF is full. And if the RRF is full, that means the ROB is full, because the two are allocated together. So the allocate stage stalls first, that backs up into rename, and gradually it fills up the pipeline.

Student: Doesn't the rename stage itself allocate the RRF entry?

They are two different stages. Effectively, rename reserves the RRF entry; by allocate I mean the stage where the instruction is finally written into the ROB and RRF. The reservation already happens at rename, where you mark which six entries will be allocated, but you finally write the contents of the instructions, opcodes and everything else, in the next cycle. Ok.

The fetcher uses a branch target buffer with a two-level adaptive predictor: there is a 4-bit history with each BTB entry, and it is a 4-way, 512-entry BTB. There is a 10-cycle branch penalty; why? Think of how many stages a branch travels before it resolves. It commits 3 micro-ops per cycle, and the maximum operating frequency was 200 MHz, which might sound funny by today's standards.

So that is your P6 microarchitecture. Here is a highly simplified block diagram of what it looks like: the instruction fetch unit talks to the BTB; then there is the instruction decoder and the micro-instruction sequencer; the register alias table, which is the rename mapper, talks to the RRF and the ROB; and the reservation station issues instructions into the execution units. Among the units we have the integer execution unit, the floating-point execution unit, the address generation unit, where the addresses of the memory operations are actually computed, and the memory interface unit, which talks to the data cache, which in turn goes through the bus interface unit to the next level of the cache; and there is a memory order buffer, which maintains the order of the memory instructions that go out to memory. If you are interested in reading about this, I can give you the reference and you can read more about it.