So, the one that we will discuss first is based on Intel's Willamette microarchitecture, the first Pentium 4 core, announced in early 2000. It is a completely new architecture that came after the Pentium III; in class we mentioned that the Pentium III uses the P6 microarchitecture, which we discussed in some detail. In Pentium 4, a trace cache was introduced; it replaces the conventional instruction cache. It also uses very high speed integer ALUs, often called double-pumped ALUs; we will discuss those. Essentially, what this means is that if the processor runs at 3 GHz, these ALUs effectively run at 6 GHz: they complete an operation in half a cycle. Willamette had 42 million transistors on a 217 mm² die in a 180 nm process. So, this is a fairly advanced, completely new design compared to the R10000 and the 21164, which are mid-90s designs. The core we will be discussing now was called Willamette. You may not know the Pentium 4 well, because it is no longer really in the market; I mean, why would anyone buy a Pentium 4 now? Pentium 4 had several generations; we will discuss three of these: the first one is Willamette, next Northwood, and then Prescott. These three pretty much show the new design ideas that were introduced as the line evolved. Prescott was the last one, actually. Willamette consumes 55 watts at 1.5 volts, and it introduces a large number of new SIMD instructions; those are the SSE2 instructions.

So, this is just a partial pipeline; I am just trying to show you how long it takes to verify a branch prediction. Calculation of the next instruction pointer takes two cycles. Trace cache fetch takes two more cycles. You will occasionally find pipe stages labelled drive. These pipe stages do not compute anything; they only communicate signals from one part of the chip to another. This happens more as you try to achieve higher frequency, because wires do not get faster as you reduce transistor size. Transistors get faster, so you get faster computation, but since wires do not scale as well, your communication time stays more or less constant. So, as you try to scale to a higher frequency, you suddenly find that you cannot communicate from function A to function B and also accomplish the task at B in a single cycle. A has produced a result which you want to communicate to B, and B has to compute something new with it; you find that you cannot fit this communication plus the compute of B into one cycle. Then you have to divide it into two parts: one of the cycles will just be communication and nothing else. That is this particular drive stage. Then you allocate: allocating queue entries, ROB entries, etc. Rename takes two cycles; then the micro-ops are written into queues; scheduling of the micro-ops, which we will discuss, takes three cycles. Then you dispatch the micro-ops to the functional units, which takes two cycles; register file read takes two more cycles; and you execute. This is the fastest possible execution that I am showing here; there are instructions that take longer. And after execution, in the x86 ISA there is something called a flags register that you have to update based on the outcome of the instruction, for example overflow caused by an addition operation. There are several flags, mostly used by branch instructions to decide whether the branch is taken or not: you set a flag bit and then you decide which way the branch should go.
And finally, the branch outcome is driven for one more cycle to the front end of the pipeline. So, if you count from the beginning of the pipeline to the point where you get to know whether the branch prediction was correct or not, it takes 20 cycles; you can just add up the stage counts and it comes to 20. It tells you that there is a 20-cycle branch misprediction penalty: I predict a branch, and a minimum of 20 cycles later I get to know whether I was correct or not. And we will soon see that every cycle this particular processor sustains three x86 instructions. So, a 20-cycle loss means 60 lost instruction slots; you have essentially wasted 60 instructions' worth of pipeline work, which is a huge resource. The implication is that you need smart branch predictors to hide this long branch misprediction penalty.
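Since the lecture adds these stage counts up, here is the arithmetic written out in Python; the exact stage split below is the commonly quoted one for this design and should be treated as illustrative rather than as Intel's official stage naming.

    # Widely quoted 20-stage Willamette mispredict pipeline (stage split is illustrative).
    stages = [
        ("trace cache next IP", 2), ("trace cache fetch", 2), ("drive", 1),
        ("allocate", 1), ("rename", 2), ("queue", 1), ("schedule", 3),
        ("dispatch", 2), ("register file read", 2), ("execute", 1),
        ("flags", 1), ("branch check", 1), ("drive to front end", 1),
    ]
    penalty_cycles = sum(c for _, c in stages)
    issue_width = 3                           # x86 instructions sustained per cycle
    print(penalty_cycles)                     # 20 cycles to learn a branch was mispredicted
    print(penalty_cycles * issue_width)       # 60 instruction slots wasted per misprediction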
Next, there is a block diagram of the microarchitecture. There is a front-end branch target buffer with 4096 entries, which talks to the instruction TLB and the instruction prefetcher. This path is used only on a trace cache miss, that is, when we do not find the instructions in the trace cache. We discussed how a trace cache actually works long back. The trace cache has a branch target buffer of its own as well. Since a trace cache line contains non-contiguous instructions, there will be branches in between, and the trace is built depending on the targets observed in the past. So, when you read a particular trace from the trace cache, you have to verify whether those branches are still behaving the same way; if they are not, you have to discard the trace. That is the purpose of this trace BTB: you ask it what the outcomes are for the branches that fall in the line, and only if they match exactly can you use the trace; otherwise you discard the trace and go to the L2 cache. There is no L1 instruction cache in this architecture. The trace cache provides micro-ops, which is why the decoder is not on the path out of the trace cache; it sits before it. Only if you miss the trace cache do you decode; otherwise the trace cache gives you decoded micro-ops, which go directly to the micro-op queue, or, if the instructions are fairly complicated, they are expanded from the microcode ROM. The micro-op queue sends instructions to the allocator and the renamer. They rename the instructions, which we have discussed already, and the allocator places the instructions into two different queues: a general purpose queue containing integer and floating point operations, and a memory micro-op queue. From these two queues you send instructions to the respective scheduler queues. These two queues are pretty large, but the scheduler queues are smaller, so that you can compare the instructions in them and figure out which ones are ready. These queues feed what they call schedulers, which are very small; you can see that there are five of them: a memory operation scheduler, which schedules out of the memory queue, and four others: a fast ALU scheduler, a slow ALU scheduler, a general floating point scheduler, and a simple floating point scheduler. All of them send their instructions to the register files.

So, for example, the floating point schedulers send instructions to the floating point register file, while the others send instructions to the integer register file. Then you have several functional units: on the floating point side there is a floating point move unit and a floating point unit that executes FP arithmetic, MMX, SSE, and SSE2. On the integer side there are two address generation units, which compute load and store addresses, two double-pumped ALUs that operate at double the frequency of the processor, and a slow ALU, which executes complex instructions and operates at the same frequency as the processor. Memory operations continue after the address computation to the L1 data cache; on a miss there they go to the L2 cache, and on a miss there they go over the bus to the system memory. So, that is the rough microarchitecture. You can also see that there is a direct path from the L2 cache to the instruction TLB and the instruction prefetcher; the reason is that if you do not hit in the trace cache, you have to go to the L2 cache to access instructions, since there is no L1 instruction cache. What is the difference between the general FP and simple FP schedulers? We will discuss that; you will see what instructions go to each.

So, let us start with the front end. Here I will not try to go stage by stage, because that would take a long time; there are 20 stages up to the branch check and many more after that. I will just club them together in terms of what they functionally do. The front end includes a trace cache, a trace predictor, a decoder, an instruction TLB, a branch predictor, a branch target buffer, a return address stack, a microcode ROM, and a micro-op queue. Decoded micro-ops are fetched from a trace cache of capacity 12K micro-ops; we have already discussed that trace caches may hold duplicate copies of instructions. Each trace cache line contains 6 micro-ops, but only 3 can be fetched every cycle. The trace cache uses a trace predictor, essentially a branch target buffer working within the trace cache to carry out branch predictions; its sole purpose is that when you are consuming a trace it might contain several branches, and you have to know whether those branches are going to behave the same way or not. It also consults a 16-entry RAS on return instructions. On a trace cache miss, the ITLB is accessed to generate a physical address for accessing the L2 cache. While fetching from L2, a bigger BTB with a branch predictor is used; it has an organization similar to the one in the P6 microarchitecture, which we have discussed. Why should the L2 fetch path have a BTB? Imagine the L2 cache as your L1 instruction cache; then you can see why you need a BTB, because it is just as if this is the first level of instruction cache from which you are fetching, and you have to know what the next instruction pointer is going to be. That is the purpose of this BTB and why there is a BTB in the fetcher. These two predictors, this BTB and the trace predictor, are managed differently. So, how does the trace cache get filled? On a trace cache miss, as the IA-32 instructions are fetched from the L2 cache, they are decoded into simple RISC-like micro-ops; we are talking about this point in the path here. You have missed the trace cache, so you go through this path: you bring something from the L2 cache, it goes to the decoder, and then it is put in the trace cache.
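Putting the trace-consumption rule described above into a tiny sketch (hypothetical data structures, not Intel's actual interface): a trace built in the past is usable only if the trace BTB still predicts the same branch directions that the trace assumed.

    def trace_usable(trace_branch_dirs, btb_predictions):
        """A trace embeds the branch directions it assumed when it was built.
        It can be consumed only if the trace BTB currently predicts the same
        directions; otherwise the trace is discarded and fetch falls back to L2."""
        return all(assumed == predicted
                   for assumed, predicted in zip(trace_branch_dirs, btb_predictions))

    print(trace_usable(["T", "N", "T"], ["T", "N", "T"]))  # True  -> consume the trace
    print(trace_usable(["T", "N", "T"], ["T", "T", "T"]))  # False -> discard, fetch from L2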
So, the decoder can handle IA-32 instructions that translate into at most 4 micro-ops; more complicated instructions are executed from the microcode ROM. This also we discussed earlier when we talked about the Pentium Pro architecture; so that is the decoder. On a front-end BTB miss, static prediction is used to guide the L2 fetch. And what is the static prediction? Forward not taken, backward taken. This one also we discussed in the last class. As the instructions are decoded, they are put in the micro-op queue. The trace gets built dynamically as and when branches are verified, and as soon as a trace reaches a length of 6 it is sent to the trace cache, because a trace cache line contains 6 micro-ops. The trace cache instruction pointer and the L2 cache instruction pointer are managed differently; so, there are 2 different instruction pointers.

Once this is done, we move on to the allocator and the renamer. The allocator consumes 3 micro-ops every cycle from the micro-op queue and allocates an ROB entry out of 126 entries. Why is it 126? Does anybody have an answer? Because you want something that is divisible by 3, since you are allocating 3 every cycle; that way you can bank it, organizing it as 3 banks of 42 entries each. It allocates the necessary physical registers out of a 128-entry register file. Why is it 128? Because it has to be a power of 2. It allocates one entry in either the general purpose queue or the memory operation queue; both queues are FIFOs. It allocates a load queue entry for a load (the load queue has 48 entries) and a store queue entry for a store (the store queue has 24 entries). The renamer renames 3 micro-ops every cycle, maintains an 8-entry register alias table for the 8 architectural registers of x86, and writes the renamed instructions into the general purpose queue or the memory operation queue. The next stage is micro-op scheduling, which consumes instructions from the general purpose and memory micro-op queues and sends them to the respective schedulers. There are 5 schedulers for different types of instructions, each with an 8 to 12 entry collapsing scheduler queue. Each scheduler handles different types of instructions, as already mentioned. The schedulers work under three constraints: availability of operands, availability of issue ports, and availability of functional units. There are 4 issue ports in Willamette. Port 0 is shared between a fast ALU and the floating point move/store/exchange unit; you can see what this fast ALU does: add and subtract, logic operations, store data computation, and branch operations. Port 1 is shared between the other fast ALU, the slow ALU, and the general floating point unit, so there are 2 fast ALUs in total; the general FP unit on this port handles SSE and MMX. Port 2 is for loads and port 3 is for stores. Those are the 4 ports we are talking about, and here they are shown: these are ports 2 and 3, and here you can see the other 2 ports. There are ports that are shared between units: you can see that this is one port and this is another port.
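As a rough sketch of the port arrangement just described (the unit names and the check below are illustrative, not the real scheduler logic):

    # Hypothetical encoding of Willamette's four issue ports and the units behind them.
    ISSUE_PORTS = {
        0: ["fast ALU 0 (add/sub, logic, store data, branch)", "FP move / FP store data / FXCH"],
        1: ["fast ALU 1 (add/sub)", "slow ALU (shift/rotate)", "general FP (FP/MMX/SSE)"],
        2: ["load address"],
        3: ["store address"],
    }

    def can_issue(unit_name, port, busy_units):
        # A micro-op can issue only if its unit sits behind this port and is not busy.
        return any(unit_name in u for u in ISSUE_PORTS[port]) and unit_name not in busy_units

    print(can_issue("fast ALU 1", 1, busy_units=set()))    # True
    print(can_issue("load address", 1, busy_units=set()))  # False: loads only issue on port 2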
So, what is the implication of sharing issue ports? The question is, of course, why not design the processor so that every unit has a dedicated issue port. Again, we discussed a bit earlier why you might want to share, for example, register file ports; the reason is that some instructions are very common while others are often not. So, you can assume that most of the time the fast ALU will actually get a port whenever it needs one. Similarly, here the floating point instructions are not very common, but the fast ALU instructions are; the slow ALU instructions are also not that common, although shift and rotate do appear. Most of the time you can assume that the fast ALU will get a port, and load and store are given dedicated ports. The reason for this is interesting: memory operations are almost always given fast, dedicated paths in x86 designs, because x86 has a very small number of architectural registers. Those of you who have taken the compilers course will understand that with a small number of registers you get a lot of load/store operations in your program, because the compiler has to keep spilling data to memory and then loading it back. Floating point exchange (FXCH) exchanges a floating point register with the top of the floating point stack; it is not an arithmetic operation, it is just an exchange. The fast ALUs can receive an instruction on every clock edge from port 0 or port 1, because each of those ports is connected to a fast ALU. The slow ALU can receive one instruction from port 1 every cycle. The general FP and simple floating point units can each receive one instruction per cycle from their ports. At most one load and one store can be fed to the cache every cycle, because they get only one port each. Issued instructions proceed to read operands from the register file, and those operands may get overridden by a bypass. So, how many read and write ports does the integer register file need? You can compute that from these issue requirements; similarly you can compute it for the floating point register file.

Pentium 4 had a multi-cycle, pipelined bypass network. So far, whenever you designed a bypass network for a pipeline, you could be sure that you can bypass values from one stage to another in a single cycle; as you go toward higher frequency, this may no longer hold. What you may have to do then is this: at the destination stage, of course, the bypass is just a bunch of wires going to the input of a multiplexer; that is your bypass path. Now you may have to put latches into it, because the wires simply take time, and the communication from here to there will no longer fit in a cycle; so you pipeline the bypass path. Suppose this is my destination pipeline register, I have to bypass from here, and this wire is so long that you cannot traverse it in a single cycle. What options do we have? Well, you can say that this value will be held for several cycles; but during those cycles this register cannot generate any more bypass values, otherwise the value in flight would get corrupted. The other option is to put latches in between: you segment the wire so that each segment can be covered in a single cycle; then, while the first value is sitting in one segment, you can launch new values behind it. That was a key requirement for meeting the high frequency; otherwise it was almost impossible to get high performance out of this processor.

The fast ALU bypass is carefully designed to operate in half a cycle; the next slide shows how. There are two double-pumped ALUs that produce results in half a cycle. How do they do that? The first adder acts on the low 16 bits in a quarter of a processor cycle, which is half a cycle of the ALU clock; the carry is forwarded to the upper-half adder, and the lower half of the result is bypassed to dependents. The upper 16 bits are processed in the next quarter cycle, and the flags are updated in the quarter cycle after that. Essentially, in half a cycle you are done with the operation, but the flags are not yet updated; the flag update takes another quarter of a cycle. The lower 16 bits are enough to begin the data cache access, because this adder is also used by the address generation path, and the index part of the cache address falls within the least significant 16 bits. That is how they sized the ALU and the cache: the first 16-bit adder is enough for generating the cache index, and the upper part of the address is needed only later. So, the critical ALU loop only has to cover the lower 16 bits and its input muxes; that is basically enough. Pictorially, this is what it looks like: this is the first adder, which operates on the lower 16 bits, and you can see that its result is bypassed immediately here, so that it can start operating on the next operation's lower 16 bits; it is also sent, along with the carry, to the next stage to operate on the higher 16 bits, and the cache index can be sent out in the meantime. Then the output updates the flags; that is basically the carry out, which signifies overflow or borrow, whatever you want to record.
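Here is a minimal model of the staggered, double-pumped add described above, assuming 32-bit operands split into 16-bit halves; the quarter-cycle timing is only indicated in comments, and the function is purely illustrative.

    def staggered_add(a, b):
        """Sketch of a Willamette-style staggered add on 32-bit values.
        Quarter cycle 1: low 16 bits (enough to start the D-cache index and to
        bypass to a dependent's low half).  Quarter cycle 2: high 16 bits using
        the carry.  Quarter cycle 3: flag (carry-out / overflow) update."""
        lo = (a & 0xFFFF) + (b & 0xFFFF)
        carry = lo >> 16
        lo &= 0xFFFF                        # available after the first quarter cycle
        hi = ((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF) + carry
        carry_out = hi >> 16                # drives the flag update a quarter cycle later
        hi &= 0xFFFF
        return (hi << 16) | lo, carry_out

    result, cf = staggered_add(0x0001FFFF, 0x00000001)
    print(hex(result), cf)                  # 0x20000 0: low-half carry propagated, no carry-out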
The barrel shifter takes four cycles; there is a 14-cycle multiply and a 60-cycle divide, and of course there are the MMX and SSE units. Those are the functional units that you have. The cache hierarchy has a small and fast L1 data cache. Load access latency is critical, and as I have just mentioned, Intel architectures pay special attention to memory operations. The L1 data cache is 8 kilobytes, four-way set associative, with 64-byte lines, and write-through; it is virtually indexed. Integer load latency is two cycles and floating point load latency is six cycles. Loads speculatively wake up their dependents assuming a hit, as we have discussed for the R10000 and the 21164, with the help of a predictor based on a partial address match; in case of a misspeculation, all the dependents are re-executed. Here, unlike those other architectures, they do not squash and refetch; they only re-execute the dependents. The way they do it is that they have an upper bound on the number of dependents: you can calculate exactly how many cycles it takes to learn the hit/miss outcome of the cache, and from the machine parameters you can calculate how many dependents can issue within those cycles. They maintain a buffer of that size, where they dynamically remember the dependents of each load whose outcome is not yet confirmed. There is a four-entry L1 MSHR; we have discussed the purpose of this particular structure in class. Store-to-load forwarding is allowed if the store address matches the load address and the load size is no larger than the store size. This is a little conservative, because it rules out a common case. Suppose I have a load instruction that loads these bytes and a store instruction that stores these bytes; this is the store and this is the load, and this is my program order, so the store comes before the load. In this case the load could actually take its value entirely from the store, because the load's bytes are fully contained in the store's bytes, but that is disallowed: the requirement is that the starting addresses must match and the load size must not exceed the store size. So, this covers only a subset of the forwarding opportunities.
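A small sketch contrasting this conservative forwarding rule with the fully-contained case it gives up (addresses and sizes in bytes; the function names are made up):

    def willamette_can_forward(load_addr, load_size, store_addr, store_size):
        """Conservative rule: forward only if the starting addresses match exactly
        and the load does not read beyond the stored bytes."""
        return load_addr == store_addr and load_size <= store_size

    def fully_contained(load_addr, load_size, store_addr, store_size):
        """More general case: the load's bytes all lie inside the store's bytes,
        even if the starting addresses differ.  Willamette does not forward here."""
        return store_addr <= load_addr and load_addr + load_size <= store_addr + store_size

    # 8-byte store at 0x1000; 2-byte load at 0x1004 reads bytes the store just wrote:
    print(willamette_can_forward(0x1004, 2, 0x1000, 8))  # False: no forwarding, must wait
    print(fully_contained(0x1004, 2, 0x1000, 8))         # True: the data is actually there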
Behind the L1 cache we have the L2 cache: 256 kilobytes, 8-way set associative, 128-byte line size, write-back, and unified. It has a seven-cycle round-trip load latency. It has a multi-stream hardware prefetcher that identifies up to eight independent streams using simple pattern-based predictors, mostly stride predictors that try to pick up arithmetic progressions in the addresses. The prefetcher stays two cache lines ahead of the current request. The chip interfaces to a 100 MHz quad-pumped 64-bit bus, which is the bus that connects the processor to the rest of the system. Any questions? What is quad-pumped? So, it actually maintains four clocks at four different phases, and on every edge of each of those phased clocks you can transfer 64 bits; essentially it has an effective transfer rate of 400 MHz. Is that the same as the processor clock? No, it is not the same as the processor clock; there would be a synchronizer at the interface. Synchronizers are essentially buffers: from this side we have fast requests coming in, and the buffer lets you have a slower interface on the other side. Any more questions?
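The quad-pumped arithmetic works out as follows, using the 100 MHz and 64-bit figures from the lecture (just a quick sanity check):

    base_clock_hz = 100e6        # front-side bus clock
    transfers_per_cycle = 4      # quad pumped: four phased clocks, one transfer per edge used
    bits_per_transfer = 64

    effective_rate = base_clock_hz * transfers_per_cycle          # 400 million transfers/s
    bandwidth_bytes = effective_rate * bits_per_transfer / 8      # bytes per second
    print(effective_rate / 1e6, "MT/s")                           # 400.0 MT/s
    print(bandwidth_bytes / 1e9, "GB/s")                          # 3.2 GB/s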
So this one, Willamette, was 180 nanometer. In between there was Northwood, built with 130 nanometer transistors: 55 million transistors on a 146 mm² die. The only thing they changed from Willamette to Northwood was doubling the size of the L2 cache; that's it. The next one was more interesting, and that's what we're going to discuss: the 90 nanometer Pentium 4 core, the Prescott core. This one was actually the last in the Pentium 4 family. It came after Northwood and it redefined the concept of deep pipelining. It was the first processor in industry designed in a 90 nanometer CMOS process, with a few microarchitectural enhancements over the original Northwood, which is what we will discuss. It had 125 million transistors, more than double compared to Northwood, so they must have done something, on a smaller die of 112 mm². The store buffer size was increased to 32, from 24 earlier. The L1 data cache size was doubled to 16 kilobytes and was also made 8-way set associative. It has a 31-cycle branch misprediction pipeline, up from 20 to 31, so it really had a very deep pipeline. The first version of Prescott that came out ran at 2.8 GHz plus, and it was projected to reach 4 GHz by the end of 2004, but Intel could not achieve that because of power problems. Intel stopped at 3.8 GHz, and that is what you could buy in the market as the fastest Pentium 4: a Prescott clocked at 3.8 GHz. You can go and read up on this; I don't know if the link is still available, it's an old one, but you can read the story of why Intel had to kill the 4 GHz part. So, this was the termination of the Pentium 4 line, and we will of course discuss what happened after that. But first let's try to understand what these enhancements are in Prescott.

First, optimizing store-to-load forwarding: they found that it is impossible to do a complete address match between a load and all the stores within the time budget, and this match is needed for two purposes. Imagine a single queue of load and store instructions, and look at a load that has a bunch of stores before it and a bunch of stores after it. You need to care about the stores before it, because you may have an address match with one of them, in which case you may have to get the data from that store instead of from the cache. So, loads have to care about the stores before them for this reason, and stores have to care about the loads after them for a symmetric reason: when a store issues, it has to make sure there is no later load that has already gone ahead with a wrong match. Now, the fundamental question is: when you issue a load, should it get the data from a store, should it wait, or can it get the value from the data cache? It can get the value from a store if the store has already executed and there is an address match. It must wait if there may be an address match but you do not know yet. And it can go to the L1 data cache if you know for sure there is no address match. For all of these decisions, the fundamental primitive you need is an address match: you have to compare the load address with all the store addresses here. Prescott found that, since the target frequency is very high, they just cannot accommodate a full address match within the cycle; you cannot compare the full 32-bit address of the load with the full 32-bit store addresses. So what they said was: we will be happy with a partial address match; you pick up the low-order bits and just do the match on those. The main point was the timing constraint: the store-forwarding logic must not have higher latency than the L1 data cache. Of course you look up the L1 data cache in parallel with this check, but the check should not become the critical path; its latency should ideally match the L1 data cache latency and certainly should not be larger. However, the latency of a full address match was exceeding the L1 data cache latency, so they speculated based on a partial address comparison. If the partial addresses match, the load takes its value from the store, which may be wrong. If the partial addresses do not match, then of course you are sure there cannot be a full match, so the load is safe to go and access the L1 data cache. They also initiate a full address comparison in the background and override the earlier decision if needed; if there is an override, the load and its dependents must be re-executed with the correct outcome. They also added new logic to forward bytes contained fully within the store data even if the addresses are misaligned. Essentially, they finally addressed that particular case: what they did was rotate the store data so that the addresses are aligned first. In this case the load should get only these bytes; they would shift the store data over, because the circuitry is such that it can only take the leading bytes. The circuitry remained unchanged from the previous generation, which allows you to take only the leading bytes of the store; they kept that unchanged, but they shifted the store data by whatever amount is needed, so that you can still take the data from the leading position, but now the right bytes land there. That requires a shifter.
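A minimal sketch of the Prescott-style decision, assuming a hypothetical 16-bit partial compare (the lecture does not give the exact width); the full compare runs afterwards and forces a replay if the speculative choice turns out wrong.

    PARTIAL_BITS = 16                        # assumed width of the fast partial compare
    MASK = (1 << PARTIAL_BITS) - 1

    def speculative_forwarding_decision(load_addr, store_addr):
        """Fast path: compare only the low bits.  If they differ, a full match is
        impossible, so the load can safely read the L1 D-cache.  If they match,
        speculatively forward from the store; the later full compare may prove
        this wrong, forcing the load and its dependents to re-execute."""
        if (load_addr & MASK) != (store_addr & MASK):
            return "read L1 D-cache (safe)"
        return "forward from store (speculative)"

    def full_check(load_addr, store_addr, speculative_choice):
        forwarded = "forward" in speculative_choice
        return (load_addr == store_addr) == forwarded   # False => replay load + dependents

    choice = speculative_forwarding_decision(0x1234_0010, 0xABCD_0010)  # low bits match
    print(choice, "| full check ok:", full_check(0x1234_0010, 0xABCD_0010, choice))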
There could be two possible bad effects of this speculation. One is that the load could have consumed wrong data from a store because of a spurious partial address match. The other is that the load consumed a wrong value from the L1 data cache because there was a store that had not computed its address yet, but the load bypassed it: the load can only check the stores that have already computed their addresses, and there could be stores that have not yet executed. In both cases, the load and its dependents are re-executed. What about the stores that had not executed? Yes, when such a store issues, it checks all the loads behind it.

Next, the static branch predictor. Recall that on a BTB miss, Willamette used a static branch predictor that said forward not taken, backward taken; Northwood did the same. The observation the Prescott team made was that not all backward branches are loop branches, and hence they may not always be taken. I should mention that these are actually not very frequent cases; they are mostly gotos and other cases where a compiler, optimizing certain pieces of code, generates a backward branch that is not a loop branch. The Prescott team did a study and found that there is a threshold distance between a backward branch and its target below which it is likely to be a loop branch. In other words, if the distance between the target and the branch instruction is larger than a threshold, it is unlikely to be a loop branch and hence unlikely to be taken. So they concluded: predict as taken only those backward branches whose target distance is below this predetermined threshold. This threshold was hardwired inside the processor; it was not a dynamic threshold. The team also observed that there is a correlation between the condition type of a backward branch and its behavior: backward branches with certain conditions are almost always not taken, so on a BTB miss you could just predict fall-through in those cases. This whole thing is still static prediction; there are no dynamics here. All you do is analyze your benchmarks, fix a threshold that separates the two classes as well as possible, look at the condition types, and decide statically which conditions should be treated as loop branches and which should not; then you hardwire that and do static prediction. That's it; there is no dynamic learning going on. Then Prescott enhanced the indirect branch predictor as well, because for indirect branches a plain BTB is not good enough; you must have seen that in your homework. The observation was that data-dependent indirect branches have targets correlated with the global path history. You have already implemented one such predictor in your homework, and hopefully you have seen that a plain BTB is worse than this kind of predictor. This idea was borrowed from the team designing the Pentium M, the mobile line. In addition to a conventional BTB, they essentially had a table of targets tagged with global history, and this global-history indirect branch predictor gets priority over the BTB prediction.
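As a sketch of the idea (the table size and hash below are invented for illustration, not Intel's design): alongside the plain BTB, keep a table of targets tagged with the global history, and give it priority when it hits.

    class IndirectTargetPredictor:
        """Toy global-history-tagged target table sitting alongside a plain BTB."""

        def __init__(self, size=256):
            self.size = size
            self.table = {}          # hashed (pc, history) -> predicted target
            self.btb = {}            # pc -> last seen target

        def _index(self, pc, history):
            return (pc ^ history) % self.size

        def predict(self, pc, history):
            # The history-tagged table gets priority over the plain BTB.
            return self.table.get(self._index(pc, history), self.btb.get(pc))

        def update(self, pc, history, actual_target):
            self.table[self._index(pc, history)] = actual_target
            self.btb[pc] = actual_target

    p = IndirectTargetPredictor()
    p.update(pc=0x400100, history=0b1011, actual_target=0x400800)
    p.update(pc=0x400100, history=0b0010, actual_target=0x400900)
    print(hex(p.predict(0x400100, 0b1011)))   # 0x400800: same history -> correlated target
    print(hex(p.predict(0x400100, 0b0010)))   # 0x400900: different path -> different target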
Some more microarchitectural enhancements. XOR is often used to load a zero value into a register, especially when you do not have a hardwired zero register, as in x86. You typically come across this kind of code in x86, where you are essentially loading a zero into EBX; EBX is just standing in as some register here, it could be anything else. The problem is this: consider an instruction before it that produced EBX; there will also be a bunch of instructions that consume EBX. This XOR instruction also appears to consume EBX, but that is not actually true: its result has nothing to do with the old value in EBX. So it introduces an unnecessary dependence between the previous producer of EBX and this instruction. The point is that this instruction will probably be held up until the instruction that produced EBX has executed, which makes no sense; it is not actually a dependent instruction. The same problem arises with EAX or whatever register you use: you are setting up a spurious dependence chain. The Northwood scheduler could actually detect these situations and ignore the dependence. The logic was very simple: they had a table of such opcodes that could be problematic; they compare the opcode of the instruction against the table, and if both source operands are the same register, they let it go without holding it up. Prescott expands the set of such instructions; x86 has many more instructions where you can do such interesting things because there is no real dependence.
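A sketch of the scheduler-side check described above; the opcode set and register names are illustrative.

    # Opcodes whose result is independent of the source value when both sources
    # are the same register (zero idioms).  Northwood recognized a few;
    # Prescott expanded the list.  The exact set here is illustrative.
    DEPENDENCE_BREAKING_OPCODES = {"xor", "sub", "pxor"}

    def breaks_dependence(opcode, src1, src2):
        """True if this micro-op need not wait for the previous producer of its sources."""
        return opcode in DEPENDENCE_BREAKING_OPCODES and src1 == src2

    print(breaks_dependence("xor", "ebx", "ebx"))   # True: xor ebx, ebx just zeroes ebx
    print(breaks_dependence("xor", "ebx", "ecx"))   # False: a real dependence on both sources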
Northwood also used the floating point multiplier for integer multiplication, and this had the overhead of moving the integer sources to the floating point datapath and back, similar to what the 21064 did for divide; last time we discussed how the 21064 used the floating point divider by moving the integer operands over. Prescott implemented a dedicated integer multiplier, so this movement is no longer required. Some more enhancements: the general purpose and memory micro-op queues are made longer, the scheduler queues are made bigger to get a larger selection window, there are 13 new SSE3 instructions, and the L2 cache is increased to 1 megabyte. There is enhanced software prefetching: Northwood used to cancel software prefetches that caused DTLB misses; Prescott just continues them. There is an enhanced hardware prefetcher, an eight-entry L1 MSHR, and support for multiple concurrent page table walks, needed for hyper-threading, which we will discuss next week.

Let us see what happened after that. Once the Pentium 4 line got terminated, Intel essentially took their Pentium M core, which was designed for mobile devices, and improved it for their desktop, server, and notebook markets. The improved design was called the Core microarchitecture, and on it are based all the processors from then down to today's Core i-series; they all use this basic microarchitecture. As such, nothing changed fundamentally: it is still an out-of-order issue processor, but some numbers changed; the fetch and rename width increased from 3 to 4. The pipeline was made shallower to save energy: Prescott had a 31-plus stage pipeline, while this one has a much shorter pipeline. They also introduced something called fusion, which was very interesting. There are actually two types: macro-fusion and micro-fusion. Macro-fusion takes multiple x86 instructions and fuses them into a single operation at runtime, entirely in hardware, while micro-fusion takes the micro-ops of a single instruction and fuses them to reduce the number of micro-ops inside the pipeline. That saved a lot of energy and also execution time. I should say that this was done this way because they did not want to change their decoder; they could have changed the decoder to not generate the extraneous micro-ops in the first place. Instead, after the decoder they put an optimizer which examines the micro-ops and fuses them. Other enhancements are improved rounding mode control in the floating point pipe and faster integer division on average.

Two further changes are very interesting and gave a very good performance boost in these processors. The first is early detection of problem loads; this is about loads that bypass a store and later get caught, because that was a mistake. We have seen in the 21264 that they use a load wait table to remember loads that have caused such problems in the past. The Core microarchitecture incorporates a similar predictor, which tries to identify the loads that repeatedly cause such problems; those loads are not issued speculatively, they wait until all the stores before them have completed. They also incorporated an L1 data cache prefetcher that is instruction-pointer directed. This one is very interesting. The observation, which had already been made in several research papers, is that if you take a program and look at a particular load instruction, you find that the data addresses generated by that one instruction are highly correlated; they will probably be sequential if you are accessing an array. What I am talking about is that a statement like this will be broken down into a load and a store plus the arithmetic operation, and if you look at the addresses that the load instruction generates, they will all be sequential. This is just one instruction that you are monitoring, so the pattern is fairly easy to predict, and that is exactly what they did: they maintain a table where, for each load or store instruction address, there is a prefetcher entry that learns this particular pattern; in this case the pattern is a stride. The advantage of an instruction-pointer-directed prefetcher over an address-based prefetcher is exactly that it gives you a lot of compression: with a single program counter you capture the whole pattern, instead of dividing the address stream into a bunch of streams and learning each of them separately. The address-based prefetchers are still there, because they are needed to catch other, more complicated patterns, but this one can catch the simple patterns and works fairly well with a very small storage budget.
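A compact sketch of an instruction-pointer-indexed stride prefetcher of the kind described; the table organization and confidence handling are made up for illustration.

    class IPStridePrefetcher:
        """Per-load-PC stride detector: remembers the last address and stride seen
        for each load instruction and issues a prefetch once the stride repeats."""

        def __init__(self):
            self.table = {}   # pc -> (last_addr, last_stride, confident)

        def access(self, pc, addr):
            last_addr, last_stride, confident = self.table.get(pc, (None, None, False))
            prefetch = None
            if last_addr is not None:
                stride = addr - last_addr
                if stride == last_stride and stride != 0:
                    prefetch = addr + stride          # pattern repeated: prefetch ahead
                    confident = True
                else:
                    confident = False
                self.table[pc] = (addr, stride, confident)
            else:
                self.table[pc] = (addr, None, False)
            return prefetch

    pf = IPStridePrefetcher()
    for addr in (0x1000, 0x1040, 0x1080, 0x10C0):     # one load PC walking an array
        hint = pf.access(pc=0x400123, addr=addr)
        print(hex(addr), "->", hex(hint) if hint else None)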