So, today we will try to quantify the performance of caches and address the system components that affect cache performance. This is the basic equation that dictates average memory access latency: (1 - miss rate) multiplied by hit time, plus miss rate multiplied by miss penalty. So, this fraction of accesses hit; multiply that by the time taken for each hit, and the remaining term accounts for the time spent on misses. And the overall execution time can now be framed as busy time plus memory stall time. Busy time is essentially the time where the processor is doing something useful, while memory stall time is the time the processor is stalled on memory. Here I also include other stall times in the busy time; there could be other stalls, for example short periods for which the machine is stalled for other reasons. Busy time is determined mostly by the ILP exploited by the processor, and when I say exploited by the processor, that of course depends on how rich the processor is in terms of machine resources, for example whether you have a full bypass network or not; if it is a partial bypass, there will be other stall cycles because of the missing bypass paths. Those are not part of the memory stall time; they are part of the busy component here.

So, cache misses become more expensive as processors get faster. Why is that? Suppose I have a fixed memory module and two processors, one running at 2 GHz and one at 4 GHz, and I am saying that for the 4 GHz processor the cache misses are relatively more expensive. If you look at this equation, as you increase the processor speed the busy time is likely to go down, so the portion of memory stall time is going to increase because it stays constant. In the limit you will be left with only memory stall time. That limit, of course, is not really achievable; you cannot nullify the busy time completely unless you have a super fast processor which takes no time to execute instructions, and that is not possible.

And when you are really talking about this, it often becomes important to measure the memory stall time. For example, if you are analyzing a program, a computer architect routinely tries to split the execution time into these two parts. The question is how you really measure this. Even if I give you a simulator, the problem is that in an out-of-order issue processor there are many things happening concurrently. There may be a memory request outstanding, say the cache has taken a miss and the request has already gone out, and in the meantime the processor may be doing something else, maybe actually working on some independent instructions. So, how do you really quantify this particular component? That is very important; only then will you be able to understand the bottlenecks in your system. So, how do you do this? Is the problem clear, why this is a question? Because there may be certain memory operations the latency of which is partially overlapped with something useful. So, how do you quantify this? I will give you one answer, although this is not the only possible answer; you could come up with other techniques also. Usually what computer architects do is look at the commit stage of the pipeline, where the instructions are leaving the pipeline.
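To make the two equations above concrete, here is a minimal numeric sketch (not from the lecture; all numbers are made-up illustrative values) of the lecturer's AMAT formula and the busy/memory-stall split, showing why a fixed memory stall time becomes a larger fraction of execution time on a faster core.

```python
# Sketch of the lecture's AMAT formula and the busy-time / memory-stall-time split.
# All values below are assumed, purely for illustration.

def amat(miss_rate, hit_time, miss_penalty):
    """(1 - miss rate) * hit time + miss rate * miss penalty, as stated in the lecture."""
    return (1.0 - miss_rate) * hit_time + miss_rate * miss_penalty

if __name__ == "__main__":
    # Same memory system, two core frequencies: 2 GHz (0.5 ns cycle) vs 4 GHz (0.25 ns cycle).
    # Memory stall time stays roughly constant, so its *fraction* grows on the faster core.
    memory_stall_ns = 1_000_000.0      # assumed fixed, set by the memory system
    busy_cycles = 4_000_000            # assumed amount of useful work
    for cycle_ns in (0.5, 0.25):
        busy_ns = busy_cycles * cycle_ns
        total_ns = busy_ns + memory_stall_ns
        print(f"cycle={cycle_ns} ns  memory-stall fraction={memory_stall_ns / total_ns:.2%}")

    print("AMAT =", amat(miss_rate=0.05, hit_time=2, miss_penalty=200), "cycles")
```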
And let us suppose that you have a commit width of W, meaning you can retire up to W instructions every cycle. Then what we say is that if you can retire at least one instruction in a cycle, that is a busy cycle. If you cannot retire any instruction in a cycle, that is a stall cycle, and the stall cycles may come from various reasons. Now, a memory stall cycle is a stall cycle where the commit stage could not retire any instruction and the instruction at the head of the ROB is a memory operation. In that case you know that the reason you could not retire an instruction was a memory operation waiting at the head of the ROB which could not complete for some reason; it is stalled. So, that is the memory stall, and everything else is counted as busy. It is a very simple way of accounting cycles: you look at the retirement stage and account every cycle as either a busy cycle or a stall cycle, and among the stall cycles, if the instruction at the head of the ROB is a memory operation, that is a memory stall. Any question on this?

Sir, in case we can retire one instruction but cannot use the full commit bandwidth because of a memory operation, will it still be counted as a busy cycle? Yes. But the stall may be because we cannot commit more instructions because of a memory operation. So you are saying that suppose in a cycle I retire n instructions, n less than W, and I could not retire more because the (n+1)-th instruction is a memory operation. Yes, you could go ahead and count that also, but then the problem is that you might end up with a very small busy count, because it is very unlikely that you will have a lot of cycles where you can consume the full commit bandwidth. So, here I am saying that if you could do anything useful, that is a busy cycle, and if you could not do anything, it is a stall cycle. You can come up with other techniques also; the only point is that you should be able to argue that your accounting is correct. You should convince yourself that it is correct.

So, what I do now is take each of the components of this equation. Essentially there are three terms: one is miss rate, one is hit time, and the other one is miss penalty, and we will see how to improve each of them. As you can see, the latency is a monotonically increasing function of each of these, so the goal should be to reduce each of these terms. Let us see how to do that.

Let us first look at miss penalty. Miss penalty is essentially the number of cycles you spend on a cache miss waiting for the data to come from the next level of the memory hierarchy; it could be the next level of cache. The first question is: which instructions are more affected by miss penalty, loads or stores? We have only these two kinds of memory instructions. So, which one is more affected: a store that misses in the cache, or a load that misses? Why? There may be other operations depending on a load, so the load value is more critical for the progress of the program, whereas the value written by a store instruction is not feeding anything. Stores actually terminate a chain of dependences; the store is the last instruction in the chain, whereas a load usually sits at the root of a dependence chain, and you must execute the root of the dependence chain as early as possible. So, loads are the most affected by miss penalty. How do we reduce miss penalty? Here are some simple solutions. You can have multi-level caches; we have already discussed that.
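Here is a small sketch of the commit-stage accounting rule just described (busy if at least one instruction retires, memory stall if nothing retires and the ROB head is a load/store). The `CommitEvent` record and its field names are hypothetical; a real simulator would expose its own per-cycle commit statistics.

```python
# Sketch of commit-stage cycle accounting: busy / memory-stall / other-stall.
from dataclasses import dataclass

@dataclass
class CommitEvent:
    retired: int               # instructions retired this cycle (0..W)
    rob_head_is_memory: bool   # is the instruction at the head of the ROB a load/store?

def classify(cycle: CommitEvent) -> str:
    if cycle.retired >= 1:
        return "busy"          # retired at least one instruction -> busy cycle
    return "memory_stall" if cycle.rob_head_is_memory else "other_stall"

# A made-up five-cycle trace just to exercise the rule.
trace = [CommitEvent(4, False), CommitEvent(0, True), CommitEvent(0, True),
         CommitEvent(1, True), CommitEvent(0, False)]
counts = {"busy": 0, "memory_stall": 0, "other_stall": 0}
for ev in trace:
    counts[classify(ev)] += 1
print(counts)   # {'busy': 2, 'memory_stall': 2, 'other_stall': 1}
```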
You can have levels of caches in a hierarchy, and you can derive an equation that looks very similar to this for a multi-level cache hierarchy. This one is for a single-level hierarchy: you have the miss rate of the cache, the hit time, and the miss penalty. If you have two levels of cache you can derive a similar formula: this term becomes the L1 hit time, and whatever L1 has missed now goes to L2; some portion of it will hit in L2, and the remaining portion goes out as the L2 miss rate multiplied by the miss penalty of the next level.

Now, when you have a multi-level cache hierarchy, how do you compute the miss rate? That is a very important question. Suppose you have two levels of cache, L1 and L2. What is the miss rate of this hierarchy? Is it the L1 miss rate? Is it the L2 miss rate? What is it? There are two terms that are important. One is called the local miss rate, which is attached to a particular level of the cache; so you can talk about the L1 miss rate and the L2 miss rate. The other one is called the global miss rate: if you observe the misses that go out at the bottom of the hierarchy against the total number of accesses that come into the hierarchy, that is the global miss rate. Accesses come in at the top, something happens in between, and whatever misses you see at the bottom divided by the number of accesses is the global miss rate. Each level also has a local miss rate: the L1 miss rate is the count of L1 misses divided by the count of L1 accesses, and the L2 miss rate is the L2 misses divided by the L2 accesses. People also use misses per instruction: you take the total number of misses and divide it by the total number of instructions. It is only that instead of dividing by the number of accesses, you divide by the number of all instructions. This is a better representative of performance, because you may have a very high miss rate and it may still not matter. For example, suppose your program has, let us say, 100 load/store instructions. You observe 100 accesses coming in and all 100 of them miss, so you conclude that you have a 100 percent miss rate. Is that good or bad? The miss rate alone is not really enough to say. But misses per instruction will actually tell you how much it is going to impact performance, because it may be that you are running a program of 1 billion instructions of which only 100 are load/store operations; even if you miss all of them in the cache it does not matter much. So, misses per instruction gives you a very nice metric for estimating the performance impact of misses.

When you have a multi-level cache hierarchy you have to decide whether the levels should be inclusive or exclusive or something in the middle. We talked about inclusion last time; let me introduce exclusion a little bit here. Inclusion says that the contents of L1 are a subset of L2; that is what we said. When you have exclusion, the contents of L1 and L2 are disjoint: whatever you have in L1 will not be in L2. How do you really guarantee that? One way to guarantee it, and this is how processors implement exclusion, is the following. Whenever the processor accesses something in L1 and it misses, the request goes to L2 as usual; L2 is looked up, and if it misses in L2 the request goes to memory.
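The following tiny arithmetic sketch (with made-up counts) shows the local versus global miss rate distinction, the misses-per-instruction metric, and the two-level form of the latency equation referred to above.

```python
# Local vs. global miss rates and misses-per-instruction for a two-level hierarchy.
# All counts and latencies are assumed, purely for illustration.
l1_accesses  = 1_000_000
l1_misses    =   100_000      # these become L2 accesses
l2_misses    =    20_000      # these go out to memory
instructions = 5_000_000

l1_local_miss_rate = l1_misses / l1_accesses          # 0.10: local to L1
l2_local_miss_rate = l2_misses / l1_misses            # 0.20: local to L2 (of what reached L2)
global_miss_rate   = l2_misses / l1_accesses          # 0.02: misses leaving the hierarchy
mpki               = 1000 * l2_misses / instructions  # misses per kilo-instruction

# Two-level version of the average access latency (cycles assumed):
l1_hit, l2_hit, mem_penalty = 2, 12, 200
amat = l1_hit + l1_local_miss_rate * (l2_hit + l2_local_miss_rate * mem_penalty)

print(l1_local_miss_rate, l2_local_miss_rate, global_miss_rate, mpki, amat)
```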
On the return path the block is not filled into L2; it is directly filled into L1. When the block is later evicted from L1, it is allocated in L2. So, you can see that at all times the contents of the two levels remain disjoint. Now, what may happen is that the processor accesses something, misses in L1, the request goes to L2, and the block is actually in L2. Then what you do is deallocate the block from L2 and copy it into L1; that guarantees exclusion.

What are the advantages and disadvantages of inclusion versus exclusion? What is good about exclusion compared to inclusion? Do you see an advantage of exclusion over inclusion? More data can be stored in the cache. More data can be stored; why is that? Because the two levels are disjoint. Exactly. So, you can make full utilization of the cache space given to the processor; there is no duplication of data. That is the major advantage of exclusion: it offers more effective capacity. Any disadvantage of exclusion?

So, this is exclusion; with inclusion, what you do on the fill path is fill both levels: when the block comes from memory you fill both L2 and L1. When the block gets evicted from L1, what do you do? It is evicted, so of course it is gone from L1; do you do anything extra? What if the block is dirty? You write it back to L2. If the block is not dirty you just drop it. And what happens here, in exclusion, when you evict from L1? The block has to be allocated in L2 at that time. Do you see any problem with that? With exclusion, every eviction from L1 means you have to allocate the block in L2, whereas with inclusion, on an eviction from L1 you only have to send the copy back to L2 if it is dirty; otherwise you can drop it. So, which one is better? One more eviction? That is fine; that allocation has effectively taken place already in inclusion at the time of fill, so there is nothing extra going on in terms of evictions. But how much data is transferred between L1 and L2 in each case? Do you see the problem? In exclusion, on every eviction the data has to go from L1 to L2, which requires a much bigger bandwidth, whereas in inclusion only the dirty blocks go from L1 to L2. So, in exclusion that is the major problem: the bandwidth required between the levels of the hierarchy. Even though it offers a larger capacity, it has a bigger bandwidth requirement. So, you decide based on your design whether to support inclusion or exclusion. We will talk a little more about exclusion later, including how this bandwidth issue can be alleviated to some extent.

So, when you have a two-level hierarchy, how do you really design the second level of the cache, the L2 cache? Should it be larger or faster? These two things cannot go together; larger ones will be slower. How about the block size and associativity, how do they compare against L1? Let us go one after another. Should it be faster or larger? We should have it larger; that is the whole point. It should be at least larger than the L1 cache, otherwise there is no point in having a multi-level hierarchy. In fact, if you have an inclusive hierarchy and your L2 cache is equal in size to the L1 cache, then your L2 cache will contain exactly the same data as the L1 cache; there will not be anything extra. So, in an inclusive hierarchy it should be at least bigger than your L1 cache.
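As a rough illustration of the fill and eviction rules just contrasted, here is a toy two-level model (my sketch, not from the lecture). Caches are modeled as plain sets of block addresses, so replacement policy, sets, and ways are ignored; only the data-movement rules for inclusion versus exclusion show through.

```python
# Toy model of the fill/evict paths for inclusive vs. exclusive two-level hierarchies.
class TwoLevel:
    def __init__(self, exclusive: bool):
        self.exclusive = exclusive
        self.l1, self.l2 = set(), set()

    def access(self, block):
        if block in self.l1:
            return "L1 hit"
        if block in self.l2:
            if self.exclusive:
                self.l2.discard(block)   # exclusive: move the block up, keep no L2 copy
            self.l1.add(block)
            return "L2 hit"
        # miss in both levels: fetch from memory
        self.l1.add(block)
        if not self.exclusive:
            self.l2.add(block)           # inclusive: fill both levels on the return path
        return "memory"

    def evict_from_l1(self, block, dirty: bool):
        self.l1.discard(block)
        if self.exclusive:
            self.l2.add(block)           # exclusive: every L1 victim is transferred to L2
        elif dirty:
            self.l2.add(block)           # inclusive: only dirty victims need a write-back
        # inclusive + clean: just drop the block, L2 already has it
```

The extra `self.l2.add(block)` on every L1 eviction in the exclusive case is exactly the bandwidth cost discussed above.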
However, in an exclusive hierarchy you can actually have the same size; there is no problem with that. But you probably still want a bigger L2, so that you can hold more data; that is the whole point of having a multi-level hierarchy.

How about the block size and associativity of the L2 cache? Should the block size be more than L1, less than L1, equal to L1? Why do we have blocks at all? Why do we organize caches in terms of blocks? What advantage do we get? Bigger blocks offer you spatial locality: you can bring a larger chunk of data in one shot and hope to use it in the near future. So, in light of that, can you answer the question about the L2 block size? It can be larger. Can it be larger? Yes, it can be larger; to exploit spatial locality you may want a bigger L2 block. Can it be smaller? Why can't we have a 64 byte L1 block size and a 32 byte L2 block size? What is the problem? Remember that the capacity of a cache is the product of the block size, the associativity, and the number of sets, so with a smaller block you could still make the L2 large by having a gigantic number of sets. But think about what happens on a fill: with a 64 byte L1 block and a 32 byte L2 block, how many L2 entries do I need to access to fill one L1 cache block? Two. Two, right? Yes, there would be a need for two L2 accesses to fill one L1 cache block, and that does not make any sense at all. So, the L2 block size should be at least as large as the L1 block size.

Now, do you see anything special about exclusion? Do we need to consider something extra about block size? Can I have bigger block sizes in an exclusive hierarchy, that is, an L2 block size bigger than the L1 block size? We have just concluded that the L2 block size should be at least as large as the L1 block size. For exclusive hierarchies there is no more option: they have to be equal, in the sense that you cannot do otherwise without introducing more complexity. Why? Suppose I have larger L2 blocks. Can you explain why that is a problem? Consider what we do on an L2 hit: I deallocate the block from L2 and copy it into L1. What happens if I have a 32 byte L1 cache block and a 64 byte L2 cache block? Only 32 bytes of the L2 block are needed in L1, but the whole block has to be deallocated from L2, so something has to be done with the remaining data. Either you introduce two tags per L2 cache block to maintain the status of both halves, which essentially boils down to saying that the L2 cache block size is 32 bytes, or you take the simple solution. For exclusive hierarchies, the simple solution is to have equal block sizes. Which is why, if you look at the history of AMD processors, which actually implement exclusive hierarchies, you find equal block sizes at the two levels, whereas if you go back in the history of Intel processors, you will find that up to a certain point, for example in the Pentium and so on, they had L2 blocks bigger than L1 blocks. Recently they have switched to equal block sizes for some other reason, which we may not discuss in this class. So, the point is that for exclusive hierarchies, you cannot have a bigger block size in L2 than in L1. What about associativity? What do you think?
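Here is a tiny sketch (my own, hypothetical helper name) of the fill-access argument above: how many L2 lookups are needed to fill one L1 block for a given pair of block sizes, which is why a smaller L2 block makes no sense and why exclusion pushes you toward equal blocks.

```python
# How many L2 accesses does one L1 fill need, as a function of the two block sizes?
def l2_accesses_per_l1_fill(l1_block_bytes: int, l2_block_bytes: int) -> int:
    # If the L2 block is smaller, one L1 fill needs several L2 lookups --
    # the case the lecture rules out as making no sense at all.
    return max(1, l1_block_bytes // l2_block_bytes)

print(l2_accesses_per_l1_fill(64, 32))  # 2 -> L2 block smaller than L1: disallowed
print(l2_accesses_per_l1_fill(64, 64))  # 1 -> equal blocks: required for exclusive hierarchies
print(l2_accesses_per_l1_fill(32, 64))  # 1 -> bigger L2 block: fine for an inclusive hierarchy
```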
How should the L2 cache associativity compare with the L1 cache associativity? Why do we need associativity in the first place? What is the purpose? It improves the access speed? No, no, no. Let us understand the purpose first. Why do we need associativity? To reduce conflict misses. Correct. So, should we have more associativity in the L2 cache, or less? In theory I can do pretty much anything, right? But practically, what makes sense, what do you really want? Less associativity? Why? Because associativity means comparisons also; you need a comparator per way. That is right; if you give a higher associativity to the cache at that level, the number of comparators will be more and it is more expensive, and as you make it more associative the cache becomes slower. Yes, that is true. But what are we competing against? Think about the conflicts: the number of conflicts should reduce as you go down the hierarchy. Two accesses which conflict in L1 will both miss in L1 and come to L2; if they also conflict in L2 they will keep on thrashing L2. The purpose of L2 should be that they do not thrash L2 as well, so you want to make it more associative. And yes, making a cache more associative normally slows it down, but the question is, what am I competing against? What is my next level? That is a huge latency, so a somewhat slower but more associative L2 is perfectly acceptable. So, that is your multi-level cache hierarchy.

Another technique for reducing miss penalty is critical word first. This is pretty much common sense. The critical word is the word within a cache block on which the miss actually happened; the processor wanted to access this particular word within the cache block and missed. So, it makes sense to send this particular word first and then follow it with the other words of the cache block. Early restart is very similar to critical word first, except that you fetch in order: you start from word 0 and go in sequence, but you restart the processor as soon as you get the critical word. That is the early restart; you do not wait for the entire cache block. What is the problem with this? The problem is that if the processor is going to access the cache block sequentially, it will probably soon stall again on the next part of the cache block, so you have to take care of that. That is one problem. And the last solution for reducing miss penalty is to prioritize loads, because we have already concluded that loads are affected most by miss penalty. Essentially what it means is that stores sit in write buffers for a write-through cache, and you send the loads first to the next level of the cache. And you can fill before spilling in a write-back cache: when the new block comes in, you may have to replace a block to make room for it, so you complete the fill first before sending the write-back of the victim block; that allows the processor to restart earlier.

Let me illustrate the early restart problem with an example. Suppose I have a 32 byte block, so it has 8 words, each 4 bytes. Let us suppose that we miss on this word here; that is the one we want. What early restart says is that you fetch in order, word 0, word 1, word 2, and so on, and as soon as you get the critical word you restart the processor.
Now, what if the processor is actually accessing this block sequentially? Immediately after the critical word it will probably stall on the next word if that word is not there yet. That is what the problem statement is saying: if the processor is doing a sequential access, early restart may not buy much; the processor will stall again very soon. And the same problem is there for critical word first as well. Any question on this? Normally more, much more. And yes, you normally start from the critical word, and different processors follow different protocols; a simple protocol is to wrap around the block. Some have a strange ordering of the words in the returned block: the first one is the critical word, but after that it is not really circular, you get some other ordering of the words. Any other question?

So, continuing with miss penalty. As we said, we want to deprioritize the stores, so we put the stores in the write buffer for a write-through cache. In a write-through cache you will be writing all the values to the next level of the hierarchy anyway, so the next level of the cache will pick up the values from the buffer one after another and perform the stores. Essentially what we are saying is that the L1 cache will not wait for the stores to complete. This has other implications which we will not discuss in this class, but this is what it is about. The next technique talks about merging writes: send writes to the same cache line together. Why is it useful? Because it uses your bandwidth more effectively; instead of sending one word at a time you can send a bigger chunk of data together. The benefit depends on the width of the data bus to the next level. The same technique can be used for gathering uncached writes; these are the writes that do not go to the cache but go directly to memory, and you can apply the same technique to make better use of the memory bus. Same thing.

And finally, victim caching; this is very common in today's processors. You put a small fully associative buffer containing the last few evicted cache lines. Essentially, between L1 and L2 you put a small buffer which can hold a few cache blocks, and whenever something is evicted from L1 you put it there. The hope is that this block was evicted from the L1 cache because of some conflict, so it will be needed again very soon and can be served from this buffer; that is why you put it there. A victim cache hit swaps the hit line with the replaced line: whenever you hit in the victim cache, you swap that line with the conflicting line in the L1 cache. So, it increases the effective associativity of the cache; that is what it is actually achieving. The AMD Athlon has an 8-entry victim buffer, and that is pretty much the typical size of a victim buffer, 8 or 16 entries; it is a small buffer. It is usually effective for L1 caches, not for L2 caches, because the L2 cache already has a large associativity, whereas L1 caches normally have small associativity, so putting a victim cache next to them helps. Any questions on this?

Non-blocking caches; we have discussed this a little bit. Essentially the idea was that your cache should be able to handle multiple outstanding misses, and we talked about this in connection with memory-level parallelism. Essentially you make the loads non-blocking: a load instruction misses in the cache, and the request is sent to memory or to the next level.
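A minimal victim-cache sketch follows (my own illustration, assuming a direct-mapped L1 so the swap-on-hit behaviour is easy to see; the sizes and the FIFO replacement of the victim buffer are arbitrary choices, not a description of any particular processor).

```python
# Direct-mapped L1 backed by a small fully associative victim buffer.
from collections import OrderedDict

class L1WithVictimCache:
    def __init__(self, num_sets=64, block_bytes=64, victim_entries=8):
        self.num_sets, self.block_bytes = num_sets, block_bytes
        self.l1 = {}                         # set index -> resident block address
        self.victim = OrderedDict()          # small fully associative FIFO buffer
        self.victim_entries = victim_entries

    def _index(self, addr):
        return (addr // self.block_bytes) % self.num_sets

    def access(self, addr):
        block = addr // self.block_bytes * self.block_bytes
        idx = self._index(addr)
        if self.l1.get(idx) == block:
            return "L1 hit"
        if block in self.victim:             # victim-cache hit:
            del self.victim[block]           # swap it with the conflicting L1 line
            self._install(idx, block)
            return "victim hit"
        self._install(idx, block)            # otherwise fetch from L2/memory
        return "miss"

    def _install(self, idx, block):
        old = self.l1.get(idx)
        if old is not None:
            self.victim[old] = True          # evicted L1 line goes into the victim buffer
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)   # FIFO eviction from the victim buffer
        self.l1[idx] = block
```

Two addresses that map to the same L1 set ping-pong between L1 and the victim buffer instead of going to L2 every time, which is the effective-associativity gain described above.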
In the meantime the cache does not really stop; it continues to serve subsequent requests. It is easy to make stores non-blocking because nobody is waiting for the store to complete. It is not that easy for loads, because there are dependent instructions, so you have to stall those instructions even though the cache may be able to handle other memory requests in the meantime. So, a non-blocking cache allows hits under misses and misses under misses. It keeps a table to remember the outstanding requests, one entry per cache line. This is called the miss status holding registers (MSHRs), or the miss handling table. It essentially holds the outstanding miss addresses, the addresses that are pending. Whenever a reply comes back, it will snoop the load queue to wake up the matching instructions. So, there may be a bunch of load requests currently outstanding to memory, let us say; when one of them comes back, it looks up the load queue, and there will be at least one match, which is why the miss was sent in the first place. That instruction now wakes up, and probably in the next cycle it will retry the cache, and this time of course it will hit. Why is the MSHR needed, given that the load queue already has the outstanding instruction? I just said that we keep the addresses in this table, but then I am saying that the replies will come back and look up the load queue to find the instruction. Why do we need the table at all? Could there be a situation where the instruction is not found in the load queue?
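Here is a rough sketch of the MSHR idea for a non-blocking cache (my own illustration; the structure only tracks outstanding miss addresses and the loads waiting on them, and the field names and sizes are assumptions, not a specific processor's design).

```python
# Sketch of a miss-status-holding-register (MSHR) file for a non-blocking cache.
class MSHRFile:
    def __init__(self, num_entries=8):
        self.num_entries = num_entries
        self.entries = {}          # block address -> list of load ids waiting on it

    def on_miss(self, block_addr, load_id):
        """Record a miss. Returns False if the MSHR file is full (the cache must stall)."""
        if block_addr in self.entries:
            self.entries[block_addr].append(load_id)   # secondary miss: merge, no new request
            return True
        if len(self.entries) >= self.num_entries:
            return False                               # structural hazard: no free MSHR
        self.entries[block_addr] = [load_id]           # primary miss: allocate, send the request out
        return True

    def on_fill(self, block_addr):
        """Data returned from the next level: wake up every load waiting on this block."""
        return self.entries.pop(block_addr, [])

mshr = MSHRFile()
mshr.on_miss(0x1000, load_id=1)
mshr.on_miss(0x1000, load_id=2)    # miss under miss to the same block: merged into one request
mshr.on_miss(0x2000, load_id=3)    # miss under miss to a different block
print(mshr.on_fill(0x1000))        # [1, 2] -> these loads retry the cache and now hit
```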