So, today what we will do is, we will start with something called a cache hierarchy; that is what we will cover in today's class. The observation is that ideally I want to hold everything in the cache, so I completely want to avoid going to memory. But as we know, with increasing size the access time increases. For any structure, if you make it larger, it will take more time to access, which essentially means a large cache will slow down every access: whenever we access the cache, it will take longer. So, the trade-off should be very clear here. If I have a very large cache, I get a large number of hits; if you make it large enough, I will probably never miss in the cache. But every hit will be extremely slow. So, if you look at the total time spent accessing the data, that is essentially the number of hits multiplied by the time to access the cache, plus the time to service the misses, that is, the time it takes to get the data from memory. Normalized per access, that is often called the average memory access time. So, total time = (number of hits × cache access time) + (time to service the misses), and the second term can be decomposed into the number of misses times the miss penalty. So, essentially a large cache nullifies the second term almost completely; an infinite cache will not have the second term. Well, not exactly true: initially, when you start executing, you have to bring some data from memory, but once you have all the data in the cache, it will never miss again. Nonetheless, a large cache tries to maximize the first quantity, the number of hits, but the time per access is also going to be very large. So, it depends on how these are balanced, and a single large cache normally is not very good. So, what people do is, they put increasingly bigger and slower caches between the processor and the memory. The rationale here is, again, the locality principle that we discussed last time: when you access a particular data point, you speculate that in the near future you will access that data point once again; that is temporal locality. So, what you are doing here essentially is that you put a small cache close to the processor. You bring a data point, put it in the cache, and the hope is that you will access it again in the future and hit in the cache. Over time, this data will become old and will probably become useless, and it will eventually get replaced by some other data. So, this small cache close to your processor is a very fast cache, and it is giving you most of the hits, except the first time you touch a piece of data, when it will have to miss. Then you put a slightly bigger cache after it, which will accommodate some more data and will be slightly slower, and so on. So, overall this hierarchy is going to be better than a single large cache. And the punch line is: keep the most recently used data in the nearest cache, which is the register file. However, what you put in the register file is not really under the control of the hardware or the architecture; this is decided by the compiler when you compile your program: what data goes into what register and at what time. The next level of the hierarchy is the L1 cache, the level-one cache. It is the same speed or slightly slower than the register file, but normally much bigger than the register file. Then you have the L2 cache, way bigger than L1 and much slower.
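A quick aside to make this trade-off concrete. Below is a minimal sketch of the average-memory-access-time arithmetic described above; all the hit times, miss rates, and the 100 ns memory latency are made-up illustrative numbers, not figures from the lecture.

```python
# Average memory access time (AMAT): every access pays the hit time, and a
# fraction miss_rate additionally pays the penalty of going to the next level.
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    return hit_time_ns + miss_rate * miss_penalty_ns

one_big_cache   = amat(hit_time_ns=8.0, miss_rate=0.02, miss_penalty_ns=100.0)  # 10.0 ns
small_fast_only = amat(hit_time_ns=1.0, miss_rate=0.10, miss_penalty_ns=100.0)  # 11.0 ns

# Small, fast L1 backed by a bigger, slower L2: the L1 miss penalty is itself an AMAT.
two_levels = amat(1.0, 0.10, amat(5.0, 0.20, 100.0))                             # 3.5 ns

print(one_big_cache, small_fast_only, two_levels)
```

The two-level case is the point of the hierarchy: a stack of gradually larger, slower caches beats either a single fast cache or a single huge one. Back to the hierarchy itself.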
Then you have the L3 cache, which is even bigger and even slower. And today the industry is even contemplating an L4 cache, beyond which there is the main memory. So, you put gradually larger and slower caches between the processor and the memory, bridging this particular latency gap. So, here are a couple of examples. Intel Pentium 4 is an old processor; it had several incarnations, gradually with increasing frequency. The microarchitecture is called NetBurst; here is the memory hierarchy of the first NetBurst core that came out. It had 128 registers, accessible in two cycles. In this class, whenever I talk about registers, I really mean physical registers; we talked about the distinction between physical and logical registers. Of course, all 32-bit Intel processors would have only 8 logical registers; here I am talking about the physical registers inside the processor, not visible to the compiler, managed by the hardware. So, it has 128 registers, accessible in two cycles. However, even though this is under the management of the hardware, remember that what data goes into what logical register is still decided by the compiler, because there is a strict mapping from logical to physical registers at all times. Then you have the L1 data cache, which is 8 kilobytes in size, with a 64-byte line size, accessible in two cycles for individual loads. Here is the L2 cache, which is 256 kilobytes, 8-way set associative, 128-byte line size, accessible in seven cycles. So, as you can see, as you go down, the latency increases and the size also increases. Intel Itanium, again, has several incarnations; this is the Madison processor. It has 128 registers, accessible in a cycle; L1 instruction and data caches of 16 kilobytes each, four-way set associative, 64-byte line size, accessible in one cycle; a unified L2 cache of 256 kilobytes, five cycles; a unified L3 cache of 6 megabytes, fourteen cycles. A couple of things to notice here. You might ask: both of these processors have the same number of registers, but this register file is accessible in one cycle while this one takes two cycles. Why is that? In fact, the cycle times are different: the Intel Pentium 4 has a much faster clock frequency, so the actual time is roughly the same. Now, if you look at the total size of the register file here, how much is it? The Pentium 4 that we are talking about is a 32-bit machine, so each register is four bytes; it amounts to 512 bytes, accessible in two cycles. Whereas the L1 data cache is 16 times larger, 8 kilobytes, and has the same latency. How is that possible? At one point we made a statement that in a memory structure, if you have more ports, it slows down. So, can you argue along that line? What decides the number of ports in the register file? It depends on how many instructions you want to issue. Right, the number of read and write ports in the register file is decided by the issue width. What about the cache? Who decides the number of read and write ports in the cache? Sir, only one write port is required. One write port is required? Why? When do you write to the cache? Sir, when you write back from the cache; it will be written back to memory eventually.
That is a replacement; that is a write to memory. We are talking about writes to the cache. When is there a write to the cache? When there is a cache miss, you have to bring in the new data; at that time, you write to the cache. And when do you read from the cache? Whenever an instruction requires data, it will read from the cache. So, what do you think the requirement would be, compared to the register file ports? What would be the requirement on the read ports of the cache? Sir, it would be half of the register file requirement. Not really half. And it is not only loads: stores also do some extra work; you first read the data block, do the modification, and write the data back to the cache. So, the number of data cache ports is decided by a subset of the peak issue width, because the peak issue width will contain all types of instructions, and only a subset of those will go to the data cache. Whereas the register file has to serve potentially all the instructions, so it needs a much larger number of ports. For example, take the Pentium 4. If you assume that the issue width is 6, then your register file will require at most 18 ports: 12 read, 6 write. Whereas your data cache will probably require a couple of ports, assuming that you can only send a pair of loads and stores, that is, two memory operations, per cycle. So, that is what is impacting the latency here. If the register file had the same number of ports as the cache, it would have a much, much smaller latency than the cache; it is only because of the number of ports that the register file latency is not smaller. So, that is about the hierarchy. Just notice how the latency increases; usually it jumps a lot, for example from 5 to 14 cycles going to the L3 cache. Of course, the size is also much larger. Today's processors, normally the high-end server-grade processors, have much larger L3 caches. Although in this course we probably won't have time to see how exactly these very large caches are organized, just keep in mind that they are very large: they routinely cross 10 megabytes, and often they are more than 20 megabytes. So, we talked a little bit about the states of a cache line when we went through an example; let's try to open that up a little more. The life of a cache line starts off in the invalid state, that is, the line is not in the cache. An access to that line takes a cache miss and fetches the line from main memory. If it was a read miss, the line is filled in the shared state; we may discuss the shared state later if there is time, but for now just assume that it is equivalent to a valid state. In case of a store miss, the line is filled in the modified (M) state, because you know that you are going to modify the line. Instruction cache lines normally do not enter the M state, because there is normally no store to the instruction cache: instructions are read-only; you read instructions and execute them. Eviction of a line in the M state must write the line back to main memory; this is called a write-back cache. Otherwise, the store would be lost. There is a second type of cache, called the write-through cache: whenever you do a store, you not only update your cache, you update the main memory also. In that case, you won't require the M state, because the memory is always up to date; you only require two states, valid and invalid. For write-through caches, there are two flavors: one is called write-allocate, the other write-no-allocate. In a write-allocate, write-through cache, what you do on a store miss is allocate the block in your cache.
And you do the store in both places, in the cache and in memory. In a write-no-allocate cache, you don't even allocate the block in the cache on a store miss; you only send the data to memory, without updating the cache. Most of the time we will be dealing with write-back caches; we will touch upon write-through caches a little, and in your homework you will actually get to compare write-back and write-through cache performance and see how the two compare. Next, there is something called an inclusion policy in a cache hierarchy. How is it defined? Normally, the contents of the level-n cache (excluding the register file) are a subset of the contents of the level-(n+1) cache. If this property is satisfied for all values of n, then you say that the cache hierarchy is inclusive. So, essentially what I am saying is that whatever I have in my L1 cache is guaranteed to be there in the L2 cache in an inclusive hierarchy. Now, today's processors usually implement something hybrid: usually part of the hierarchy is inclusive and part of the hierarchy is non-inclusive. For example, in a three-level hierarchy in Intel processors, typically L1 and L2 will not be inclusive; they won't actually satisfy this property. But L3 will be inclusive with respect to the L1 and L2 contents, so anything in the union of L1 and L2 is guaranteed to be in L3. That's what it really means. But if something is in L1, it is not guaranteed to be in L2, so this property is not satisfied between L1 and L2. However, in this discussion, we are going to assume an inclusive hierarchy where the contents of L1 are guaranteed to be in L2. So, eviction of a line from L2 must ask the L1 caches, both instruction and data, to invalidate that line if present. The reason should be obvious: if you don't, you are going to lose this property. Because the line is being evicted from L2, it must also be evicted from the L1 caches at that time; otherwise, there would be a line in an L1 cache which is not in the L2 cache, and that violates inclusion. Is this clear? Any questions? Yes? Okay. Now, a store miss fills the L2 cache line in the M state, but the store really happens in the L1 data cache. So, we have the L1 instruction cache, the L1 data cache, the L2 cache, and the main memory; that is what we are looking at here. And this is my pipeline, the processor pipeline, which accesses instructions from the instruction cache and does reads and writes to the L1 data cache. So, what I am saying here is: suppose there is a store instruction which will follow this path. It will first go to the L1 data cache, then to the L2 cache, then to main memory if you don't find the data; here I am assuming that I miss in all the caches. So, the store access first goes to the L1 data cache, misses there, then moves to the L2 cache, misses there, brings the data from memory, and fills the data on its return path into both L2 and the L1 data cache, to maintain inclusion. Now, the question is: what would be the state of the data block in these two caches? What it says is that the data block in the L2 cache will be in the M state, even though the store actually happens in the L1 data cache. So, when you modify the data, you don't actually send the new data to the L2 cache; the L2 cache still has the old data. The L2 cache does not have the most up-to-date copy of the line. The reason is that it's not needed.
So, implicitly, I am actually saying that the L1 data cache is a write-back cache. Only when that particular block is evicted will the new data be written back to the L2 cache; that's what it really means. So, eviction of an L1 line in the M state writes the line back to L2, and this is when the L2 cache gets the new data; until then, it doesn't have it. Now, the natural question is: why did I do this, then? Why should I fill the line in the M state in the L2 cache, even though the L2 cache is holding the old data? The reason is this. If the line is later evicted from the L2 cache while in the M state, then unless you send the correct data back to memory, there is a chance of losing the new data. So, whenever you evict an M-state L2 cache block, you first ask the L1 data cache to send the most up-to-date copy, if any; then you write the line back to the next level, that is, L3 or main memory. So, why is inclusion important? It simplifies something called a coherence protocol, which we will talk about only a little, towards the end, and not in much detail. Just keep in mind that there is a reason why this is done; it is not that one fine morning somebody decided to do this. Any question on inclusion? Sir, why do we need to put the data block in the M state in L2? Whenever L2 evicts the block, it anyway asks L1 to evict the data block, so at that time, using the state bits of the L1 cache, we can know whether it was modified. True, but you have to think about the low-level implementation details. When the L2 cache evicts the block, it needs to know whether it should expect a data response from the L1 data cache or not. If it does, it has to allocate a buffer to put the data in. So, the purpose is reservation, nothing else: you reserve some buffer space for holding the data that comes back from the L1 data cache. Sir, this means that after every store we have to modify the state bits of the L2 cache? Right, we have to tell the L2 cache that I am going to do a store. Well, it is actually needed anyway in a multiprocessor environment, and today every processor is a multiprocessor; in a uniprocessor you could avoid doing that. In a multiprocessor, if you are doing a store, you have to tell the others that you are doing a store, because others may have a copy of the same data. Just to clarify what we are actually discussing: consider a sequence of accesses. The processor first loads the data, in a valid state, into the L1 data cache and also into the L2 cache. Subsequently, the processor wants to write to that particular cache line. So then what has to happen? What he is saying is that at this point, I must tell the L2 cache to move to the M state, because that is the invariant: if you have a dirty block in the L1 data cache, the corresponding block in the L2 cache must be in the M state. And he is saying that that is a big overhead. The answer is yes, it is an overhead, which we actually cannot avoid in multiprocessors; I won't fully justify that here. But intuitively, why you need to do this is that you may have a copy of a piece of data shared by two processors. If one processor modifies the value of that data, the other must know about the modification in time. So, before one processor does a store to that data, you must tell the other guy that its copy of the data is no longer correct. So, a notification has to go out. Any other questions?
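To tie the last few points together, here is a minimal sketch of what handling an L2 eviction in the inclusive two-level hierarchy described above might look like. The data structures, state names, and method names are invented for illustration; a real cache controller works at a much lower level than this.

```python
# Minimal sketch: inclusive L2 eviction. On eviction, the block must also
# leave both L1 caches, and a dirty L1 copy supplies the data to write back.

INVALID, SHARED, MODIFIED = "I", "S", "M"

class Cache:
    def __init__(self):
        self.lines = {}                      # block address -> (state, data)

    def probe_and_invalidate(self, addr):
        """Invalidate addr if present; return data only if it was modified here."""
        state, data = self.lines.pop(addr, (INVALID, None))
        return data if state == MODIFIED else None

def evict_from_l2(l2, l1d, l1i, addr, memory):
    state, data = l2.lines.pop(addr, (INVALID, None))
    if state == INVALID:
        return
    # Inclusion: the block must also be invalidated in both L1 caches.
    dirty = l1d.probe_and_invalidate(addr)
    l1i.probe_and_invalidate(addr)
    if state == MODIFIED:
        # L2's M state told the controller that a data response may come from
        # the L1 data cache, so it would have reserved a buffer for it.
        # Write back the freshest copy: L1's dirty data if any, else L2's own.
        memory[addr] = dirty if dirty is not None else data
```

With that sketch in mind, back to the question of why the L2 copy is marked M at all.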
Because otherwise, if the block gets evicted from L2 while it is not marked in the M state, the memory may not get updated with the correct data. But let me reiterate what he has suggested: in any case, an eviction from the L2 cache will always ask the L1 caches, for maintaining inclusion; then why have an M state at all? It is just a matter of low-level implementation. If the block is in the M state, L2 will know that there is a potential data response that can come from the L1 data cache, so it may reserve some buffer space to hold that data. It is just a low-level implementation detail; if you want to ignore that, then yes, you don't strictly need to switch the L2 cache line to the M state. Any other questions? There are actually many other states which I am not really talking about, which complicate matters but give you some extra benefits; the case that we are discussing may actually be handled with some extra states, but I will not go into that detail. Alright, so with that, let's try to trace the sequence of events that happen when the first instruction of a newly started program executes. You take the starting program counter. You access the instruction TLB, because the program counter is a virtual address that has to be translated to a physical address. So you access the instruction TLB with the virtual page number extracted from the PC. And since it is the first instruction, you are going to have an instruction TLB miss; you won't find the translation. So you invoke the ITLB miss handler. The ITLB miss handler has the responsibility of calculating the page table entry address. If the page table entries are cached in the L1 data and L2 caches, and we discussed a little bit last time that you can do this, you look them up with the page table entry address. However, in this case, for the first instruction, you are going to miss there also. Then you access the page table in main memory. And the page table entry is going to be invalid in this case, which means you take a page fault. So you invoke the page fault handler: allocate a page frame, read the page from disk, update the page table entry, load the page table entry into the instruction TLB, and restart the instruction fetch. So now you have the physical address, because this time, of course, the fetch will get the translation from the ITLB. So you can access the instruction cache, which is going to be a miss, because you won't have the instruction in the cache yet. So you send a refill request to the higher levels, and you will miss everywhere. So in this case, we follow this path: you miss here, then you look up L2, you are going to miss there, so you go to memory now, and you send the request to the memory controller, which used to be called the northbridge long back. That is not really relevant anymore, because memory controllers are now inside the processor chip. Now, what guarantees that this particular instruction page will be found in memory? How do you know? The memory controller is a passive device: it gets a request, it gets an address, it sends the request to the memory, that's it. What guarantees you that this particular page is in memory? How do you know the data that the memory will return to the memory controller will be correct? Because at this point, there is no check: the memory controller gets an address and forwards it to the memory, that's it. The answer was already discussed, on the last slide: this particular action, the page fault handler bringing the page in, makes sure that the data will be in memory.
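Before following the refill back up the hierarchy, here is a minimal sketch of the translation step just walked through: extract the virtual page number from the PC, look it up in the ITLB, and on a miss walk the page table, taking a page fault if the entry is invalid. The 4 KB page size, the flat single-level page table, and the allocate_frame helper are simplifying assumptions for illustration; real page tables are multi-level and the fault handler does much more.

```python
# Minimal sketch of ITLB lookup, page table walk, and page fault handling.
PAGE_SHIFT = 12                       # assume 4 KB pages
PAGE_SIZE = 1 << PAGE_SHIFT

def translate(pc, itlb, page_table, allocate_frame):
    vpn = pc >> PAGE_SHIFT            # virtual page number
    offset = pc & (PAGE_SIZE - 1)     # offset within the page
    if vpn not in itlb:               # ITLB miss: walk the page table
        pte = page_table.get(vpn)
        if pte is None or not pte["valid"]:
            # Page fault: allocate a frame, read the page from disk,
            # and update the page table entry (disk read omitted here).
            pte = {"valid": True, "pfn": allocate_frame(vpn)}
            page_table[vpn] = pte
        itlb[vpn] = pte["pfn"]        # load the translation into the ITLB
    # Physical address = physical frame number concatenated with the offset.
    return (itlb[vpn] << PAGE_SHIFT) | offset
```

Now back to the refill path that went out to the memory controller.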
So you read the cache block from main memory. Now the cache block comes back, and while coming back, on its way, it fills the L1 instruction cache, and the particular instruction word is returned to the processor: you extract the appropriate instruction from the cache line with the block offset, and then the pipeline can start. So now the processor finally has the instruction; it can decode and execute it. This is the longest possible latency in an instruction or data access; nothing more than this happens. Is the sequence clear to everybody? Now, a little bit more on the TLB access. The important observation to make here is that for every cache access, instruction or data, you need to access the TLB first, because that is what gives you the physical address. So it puts the TLB on the critical path of every instruction: if you have a slow TLB, the instruction execution is very slow. If you remember, the point is that you take the virtual page number and look up the TLB; you get a physical page number; you concatenate that with the page offset; you get the physical address; and only then can you look up the cache. So that puts your TLB on the critical path: the cache access cannot be done until the TLB access is completed. So this gives the physical address; I put a plus here, but this is really just a concatenation. So this is a sequential operation: I look up the TLB and the cache in sequence, with no overlap. What I ideally want is to start indexing into the cache and reading the tags while the TLB lookup takes place; I want to do these two lookups in parallel. How can I do that? The only way to do it is to index the cache with the virtual address, because to begin with, I only have the virtual address and nothing else. So the question is, can I use part of the virtual address to index the cache? And I can have my tag in the cache derived from the physical address, because by the time I look up the cache and the tag comes out, I will have the physical page number from the TLB, so that I can do a tag comparison. This is called a virtually indexed, physically tagged cache. That is what is used today in all commercial processors, at least for the L1 cache. So you extract the index from the virtual address and start reading the tags while looking up the TLB; once the physical address is available, you do the tag comparison. It overlaps the TLB lookup and the cache tag read. And it relaxes these latencies a little more, because now you can afford to make your TLB a little slower, or your cache a little slower, and still not lose anything, because these two proceed in parallel. So here are typical latencies in the memory hierarchy. Don't take these as authoritative numbers; these are just examples, just to show you what the gaps are. L1 hit latency is about a nanosecond today. L2 hit latency is about 5 nanoseconds. L3 is 10 to 15 nanoseconds. Main memory is the DRAM access time plus bus transfer and so on, which gives you about 110 to 120 nanoseconds, and if you have a more complicated system, things may vary more. Here I am talking about a variation of about 10 nanoseconds; the gap between the minimum and maximum may be larger depending on the system. So the point here is that there is a very big jump from L3 to main memory. The L3 cache, or the last-level cache, is your last line of defense: if you miss there, you are going to take a very big performance hit. That's pretty obvious. So your last-level cache should be very intelligent.
It should be able to identify which data blocks are important; it should retain them, and make room for them by throwing away the useless blocks. So it has to do some form of prediction: in some way it decides that this block is going to be reused in the future, so it will keep it, while that block doesn't look like it will be reused, so it will throw it out and make room for some other block. So your cache controller has to be fairly smart to be able to do these things. It's a very hot research topic, and you can see why; there are good reasons for that. If your last-level cache is a dumb cache, your processor is not going to give you good performance. That's pretty obvious, because this jump is enormous. So, just to give you the problem a little more concretely: if a load misses in all caches, it will eventually come to the head of the ROB. You have the ROB; this is my ROB, this is the head, this is where my retirement is currently running; I am retiring instructions from here. And let's suppose there is a load here which is currently executing; this one is currently looking up the cache. So this particular load instruction looks up the L1 data cache, misses in the L1 data cache, then goes and looks up the L2 cache. Remember that in the meantime the ROB is progressing: the head is moving gradually, retiring instructions. So the point is that if, before I get the data for the load, the head moves down to here, I have to stop, because I cannot retire the load at this point. And now not only can I not retire the load, I cannot retire any of the instructions after it, because retirement has to be in order. That's a big problem, which means I have only this much time to get the data for the load, to make sure the processor doesn't stall. Now, if a particular load instruction misses in all the caches and has to go to memory, can you imagine the large amount of time it is going to take? We did a calculation last time; let me remind you about it. Suppose your processor runs at 3 GHz, and suppose you retire 4 instructions per cycle. What does that mean? Your cycle time is one third of a nanosecond. So if I assume, let's be optimistic, a 100-nanosecond memory latency, that is essentially how many cycles? 300 cycles. I retire 4 instructions every cycle, so I would need an ROB of size 1200 to be able to hide the latency of this load. Which is impossible, because ROB sizes are around 200. So this is the problem; this is why loads pose a problem: you cannot retire anything beyond a certain point. And if you have an ROB of length 100, it will take only 25 cycles to get to the stall point, which is a very small fraction of a 300-cycle latency. This is why, if you look at applications that access big data, a large amount of data, you'll hear numbers like 90% stall time today. This is the reason. So one way of resolving it is to have a smart last-level cache: make it smart enough that it can retain the important data, so you won't have to go to memory too often. Otherwise, gradually the pipeline backs up, the processor runs out of resources, and ultimately the fetch stalls, which severely limits ILP. So that's the basic problem. We'll talk about some of the simple solutions that the processor industry has adopted over time to alleviate this particular problem. So essentially what you need is memory-level parallelism.
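The back-of-the-envelope calculation above, written out as a small sketch. The numbers are the ones used in the lecture: a 3 GHz clock, 100 ns memory latency, 4 instructions retired per cycle, and a roughly 200-entry ROB.

```python
# How big would the ROB need to be to hide a full memory-latency miss?
clock_ghz = 3.0
memory_latency_ns = 100.0
retire_width = 4                                    # instructions retired per cycle

cycle_time_ns = 1.0 / clock_ghz                     # one third of a nanosecond
miss_cycles = memory_latency_ns * clock_ghz         # 100 ns at 3 GHz = 300 cycles
rob_needed = miss_cycles * retire_width             # 1200 entries to fully hide the miss

rob_actual = 200
cycles_until_stall = rob_actual / retire_width      # 50 cycles (25 with a 100-entry ROB)

print(miss_cycles, rob_needed, cycles_until_stall)
```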
So till now we have been talking about instruction-level parallelism, where you find instructions which can be executed in parallel and you let them execute in parallel; you offer resources and all that. The same question we can ask about memory operations: can I execute multiple memory operations in parallel? That is memory-level parallelism. Simply speaking, you need to mutually overlap several memory operations. Why is that useful? Because then, if you have two memory operations executing concurrently, for the two of them together you will see a latency of 100 nanoseconds, not 200 nanoseconds. So that's the approach: you get overlap. How do you achieve that? The first step is to have a non-blocking cache. What is that? You allow multiple outstanding cache misses. That is the first requirement. That is, you cannot say that whenever my cache takes its first miss, it rejects all subsequent requests until this miss is resolved; then of course you are not going to get any memory-level parallelism, because the cache itself is acting as a blocking agent. So this is the first requirement: the cache must allow multiple outstanding cache misses. Even when it has a bunch of misses outstanding, it should still be able to accept more requests and allow them to go through the cache; maybe some of them will hit, some of them will miss, but they should proceed. So you mutually overlap multiple cache misses. This is supported by all microprocessors today; for example, the Alpha 21264 supported 16 outstanding cache misses, and today's numbers are roughly around this. The reason why there has to be a limit on how many outstanding cache misses you can support is that you have to remember somewhere what the misses are that are still outstanding. There has to be a table, and that limits the number you can actually support. And this table is pretty much on the critical path, which is why it has to be small; it cannot be very large. Is this solution clear to everybody? That's the first step: you cannot have a blocking cache. The second one is out-of-order load issue, that is, issuing loads out of program order. This one also we discussed earlier. The address is not known at the time of issue, so what happens is that the load, when it issues, first computes its address, comes back, compares its address with the stores before it, and only then can it go and access the memory. So how do you know whether a load issued before a store to the same address? Issuing stores must check for this memory order violation. Let's talk about it in more detail. Here's an example: I have a store here, there are a bunch of instructions, and then I have a load. Let's assume that the load issues before the store, because R20 gets ready before R6 or R7. For the store to issue, I need to make sure that both R6 and R7 are available; for the load to issue, I need to make sure that R20 is available. So let's assume that whoever was generating R20 completed first, and so the load can issue. The load accesses the store buffer; this is essentially the value field in the store queue that we talked about, which is used for holding already-executed store values before they are committed to the cache at retirement. If it misses in the store buffer, it looks up the caches and, say, gets the value somewhere. After several cycles the store issues, and it turns out that these two addresses are actually the same, or they overlap.
So now that means the load must have got a wrong value. Maybe you have forgotten, so let me try to remind you. Suppose we have this single issue queue, and this load instruction here is this one, and this is the store instruction. Previously we said that the condition for a load to issue is that all the stores before it must have computed their addresses; that's what we said. Now I am going to relax that, because it looks very conservative: many of the loads will not depend on any of these stores, so why delay the issue of those loads? So now I am saying that, well, I don't care; I'll issue the load as soon as its operand gets ready. This is R20. Which means some of these stores may not have executed yet, which means some of these stores may not have computed their addresses. So what does a load do when it issues? It first computes its own address, comes back, compares its address with all the stores before it, and here we are assuming that it doesn't match anybody; the reason is that this store hasn't yet executed, because R6 or R7, one of them, may not be ready. So then the load happily accesses the cache, gets the data from somewhere, and supplies the data to its dependents that are sitting here; they also start executing. Eventually R6 and R7 get ready, the store issues and executes, and you find that these addresses are actually the same. So the load has not only consumed the wrong data, it has supplied the wrong data to many dependents. How do you recover from this error? Any suggestion? Is the problem clear to you? The problem arises because we are trying to issue loads very aggressively, out of order. So how do I fix it? On the previous slide, there was a comment: issuing stores must check for this memory order violation. Can somebody expand on that? Is the load instruction still inside the processor somewhere, or has it retired? It cannot have retired yet. Why not? Exactly, because retirement is in order. So the load hasn't yet escaped; you can still catch it and fix it. How? Sir, we can check if the load has loaded from the same memory address, and then fix it; we can raise some signal. So an issuing store should check all loads after it that have already executed and do a comparison of addresses. And if there are multiple matches, what should we do? Sir, there will not be multiple loads executed to the same address. Fix all of them. Fix all of them; by fixing, what do you mean? Can you elaborate? Sir, overwrite the value: the load has loaded into some register, so we change the value. That's all? The instructions that were supposed to read those registers may have already read them and executed. What about those instructions? So you have to also find the dependents of these loads which have executed. So there are two ways of fixing this. The first step, of course, is that an issuing store must check all subsequent loads that have already executed; by checking, I mean it compares the addresses. Of course, a store cannot compare addresses before it issues; first it issues, computes its address, and only then can it do that. So it checks all the loads. The simple solution is that it picks the oldest load that matches and removes all instructions after it, so everything after that point re-executes. Of course, this does some extra work: it will re-execute many instructions which do not depend on the offending load. But it's a simpler solution, because you don't have to keep track of dependencies.
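Here is a minimal sketch of that simple policy: when a store finally issues and computes its address, it checks all younger loads that have already executed, and on a match it squashes from the oldest matching load onward so everything after it re-executes. The record format and the squash_from callback are illustrative simplifications, not any real pipeline's interface.

```python
# Sketch: memory order violation check performed by an issuing store.
def store_issues(store_addr, store_seq, executed_loads, squash_from):
    """executed_loads: list of (seq_number, load_addr) for already-executed
    younger loads; squash_from(seq) flushes that instruction and everything
    younger than it so they re-execute."""
    offenders = [seq for seq, addr in executed_loads
                 if seq > store_seq and addr == store_addr]
    if offenders:
        # The matching load got stale data and may already have fed wrong
        # values to its dependents, so squash from the oldest offender onward.
        squash_from(min(offenders))
```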
A slightly more complicated solution would actually maintain the dependence tree of each load. Starting from the load, which you can think of as the root, it supplies its value to a bunch of instructions, which form a tree. You keep track of that tree; that is what today's processors actually do. They keep track of the tree of every load in a buffer, and that minimizes the wasted work. So that's precisely what the solution is. This is called speculative memory disambiguation. Essentially, we are doing a form of speculation here. Why do we issue the load? Because we are speculating that this load won't have any problem, that there won't be a conflicting store. Computer architects are very optimistic about what they do, so here again we are being optimistic and saying that there won't be any problem. The good news is that it is correct most of the time; that's why it works. Otherwise, of course, you would drown in poor performance. So we assume that there will be no conflicting store. If the speculation is correct, we have issued the load much earlier and allowed the dependents to also execute much earlier. If there is a conflicting store, you have to squash the load and all the dependents that have consumed the load value and re-execute them. It turns out that the speculation is correct most of the time. Often this is called blind speculation, because you are not really using any property of the load; you are oblivious to the history of this load. In the past this load might have conflicted with the same store, but here you are not caring about that; you are saying, I will do a blind speculation, I will issue this load whatever its history may be telling me. So you can improve upon that, and that's what today's processors do: they use simple memory dependence predictors, which predict whether a load is going to conflict with a pending store based on that load's past behavior. You can actually make an association between a pair, a load and a store, keep the association in a table, and figure out in the future whether this load is going to conflict with any of the stores that are waiting here. If it is, then you won't actually issue the load; you will wait until all the stores, or at least the predicted conflicting store, have gone. This is what processors do today. I won't be able to discuss the actual predictor implementations; if you want to read about that, I can give you some papers. They look very much like branch predictors, but they have to be a little smarter, because here you are talking about establishing an association between a pair, a load and a store essentially. But still the query is a binary query, so a yes/no answer will come, just like a branch predictor: you take a load and ask, tell me if it is safe to issue this load, and the predictor will give you an answer. Next, today's microprocessors also try to hide cache misses by initiating early prefetches; that's another solution. Hardware prefetchers try to predict the next several load addresses and initiate cache line prefetches if the lines are not already in the cache. So I hope you are noticing that there are many flavors of pattern predictors that go into a processor, at various levels. This is again another pattern predictor, one that mines the pattern of addresses accessed by the cache and tries to predict the future: tell me what addresses are going to come in the future, can I fetch them now, so that by the time they are needed they are already in the cache.
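Here is a minimal sketch of one common hardware prefetcher flavor, a per-load-PC stride detector: it watches the addresses a load instruction issues and, once the same stride repeats, prefetches a few lines ahead. This is an illustrative simplification under the stated assumptions (the cache object with a prefetch method, the table format, the prefetch degree), not any specific processor's design.

```python
# Sketch: simple stride-based hardware prefetcher.
class StridePrefetcher:
    def __init__(self, cache, degree=2, line_size=64):
        self.cache = cache            # assumed to expose a .prefetch(addr) method
        self.degree = degree          # how many lines ahead to fetch
        self.line_size = line_size
        self.table = {}               # load PC -> (last_addr, last_stride)

    def observe(self, pc, addr):
        last_addr, last_stride = self.table.get(pc, (None, None))
        if last_addr is None:
            self.table[pc] = (addr, None)
            return
        stride = addr - last_addr
        if stride != 0 and stride == last_stride:
            # Pattern confirmed twice in a row: prefetch the next few lines.
            for i in range(1, self.degree + 1):
                self.cache.prefetch((addr + i * stride) & ~(self.line_size - 1))
        self.table[pc] = (addr, stride)
```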
All processors today also support prefetch instructions, so you can actually specify in your program when to prefetch what, and the compiler can insert these instructions at appropriate places. This gives much better control compared to hardware prefetchers, because here you can actually see the whole program and you can make use of the semantics of the program. For example, you can see that it's an array access, so you know that it's going to be very predictable: you will be accessing the array locations sequentially, and the addresses are extremely predictable. Whereas the hardware prefetcher is a finite state machine sitting there monitoring addresses and just making a blind prediction, so there is a bigger chance that it may be wrong. But of course, the advantage is that it can see all addresses, which the compiler cannot; for example, dynamically allocated memory addresses won't be visible to the compiler. Another solution that researchers have explored is load value prediction. That is, I am loading a particular piece of data, like here: can I predict what value this load is going to return? That also has been explored. It's a much more difficult problem, because now you are trying to predict a full 32-bit or 64-bit value, so the entropy is expected to be much higher. It's not very easy, and the accuracy is usually low, which is why the industry hasn't yet adopted this particular solution. But there is a large body of research on how to predict load values. The good thing is that programs often load constants; a lot of constants are loaded, and the most frequently loaded constant is zero, so these values can be predicted quite easily, without much work. Even after doing all these things, the memory latency remains the biggest bottleneck, and that's called the memory wall. It's a very hard research topic, and probably it will remain a hard research topic for a long time, because there is no easy solution visible in the future. Processors are expected to get faster; memory is not getting faster at the same rate, so the speed gap is going to increase, and your 90% stall time may become 95% stall time in the future. Okay. So I'm going to stop here. Next time we'll delve deeper into caches.