Welcome to today's lecture on cache optimization techniques. This is the second lecture on this topic, and in the last lecture we discussed the reduction of hit time. As you know, the average memory access time from the cache depends on three important parameters: number one is hit time, second is miss rate, third is miss penalty; in fact, average memory access time = hit time + miss rate × miss penalty. To improve performance, you have to reduce one or more of these parameters. That means the first technique you can use is to reduce hit time, and in my last lecture I discussed in detail various techniques by which the hit time of the cache can be reduced. Because the CPU reads instructions and data from the cache memory, and writes to it as well, overall performance depends critically on the performance of the cache. That is the reason why it is very important to discuss the various techniques by which cache performance can be improved. So, today we shall focus on the second approach, reduction of miss rate: how can we reduce the miss rate? And in my next lecture, I shall focus on reduction of miss penalty. For reducing the miss rate, you can use the approaches listed here. Number one, you can use a larger cache size; second, higher associativity; third, a larger block size; fourth, various compiler optimization techniques. We shall discuss them one after the other, but before we discuss these techniques, let us focus on another very important aspect which will give you the necessary background for the various miss-rate reduction techniques, and that is the anatomy of cache misses: why cache misses occur, and what the different types of cache misses are. To be able to reduce the miss rate, we should be able to classify misses by their causes. Based on the causes, the classification falls into three broad categories. Number one is compulsory misses. These occur because you have to bring blocks into the cache for the first time. As you know, when the power is turned on, the cache does not contain any useful instruction or data; it contains garbage. So, for the first time you have to transfer instructions to the instruction cache and data to the data cache. This is inevitable, and you cannot really avoid compulsory misses. They are also called cold-start misses or first-reference misses: the first time you refer to a particular block (as you know, the cache is accessed in terms of blocks), that miss will occur. It is quite obvious that these misses will occur even with an infinite cache. Normally we know that with an infinite cache the miss rate would be very low; however, compulsory misses will still occur, because the first transfer from main memory to cache cannot be avoided. Second is capacity misses: the cache is not large enough, so some blocks are discarded and later retrieved. This is due to the limited capacity of the cache. As we know, because of cost we cannot have a very large cache memory, and as a consequence you cannot really transfer all the blocks from the main memory to the cache memory.
So, you have to transfer only a subset of the blocks from the main memory to the cache memory. And whenever a particular block frame in the cache is mapped to by a large number of main-memory blocks, quite often you have to replace the content of that frame and put a new block into it. So capacity misses happen because some blocks are discarded and later retrieved, and these misses occur even in a fully associative cache. As we shall see, there are different degrees of associativity you can use, and even with a fully associative cache, where a block can be placed anywhere in the cache rather than in a fixed location, capacity misses will still occur. Last but not least, the third type is known as conflict misses: blocks are discarded and later retrieved because too many blocks map to the same set. Conflict misses occur because, as I mentioned, many main-memory blocks map to a particular set, and as a consequence you have to replace blocks, bring them back in, and so on. These are also known as collision misses or interference misses. So, these are the three types of misses that can occur. Conflict misses arise in direct-mapped and set-associative caches: as you keep increasing the associativity for a fixed cache size, this component changes, and later on we shall discuss this in more detail. Although I have broadly categorized the misses into three types, there is another cause which we may call the fourth C. We have considered three C's; this is the fourth: coherence misses, caused by cache coherence. Later on we shall be discussing multiprocessor-based systems, where you have multiple processors with their own private caches and a shared memory. In such a situation another type of miss will occur, the coherence miss, and when we discuss multiprocessor systems this particular type of miss will be covered in detail. Now, let us consider the three C's and see what the miss rate is for each type of miss. As you can see, the thin band at the bottom of the plot corresponds to compulsory misses: their rate is much smaller than that of capacity and conflict misses, almost insignificant in comparison. The next portion corresponds to the capacity misses, the misses that remain even with a fully associative cache, and as you increase the cache size, this miss rate decreases. The conflict misses form the remaining bands, one for each degree of associativity. That means capacity misses are due to the limited size of the cache: as you increase the size, the capacity-miss rate decreases. And if you increase the associativity, the conflict-miss rate decreases, as the figure shows. The topmost band corresponds to direct mapping, where the miss rate is maximum, and it shrinks as you increase the associativity from one-way (direct-mapped) to two-way, four-way, and eight-way set associativity.
As you can see, the miss rate gradually decreases with increasing associativity, as represented by the different bands in the figure: successive bands correspond to one-way (that is, direct-mapped), two-way, four-way, and eight-way set associativity, and the band below them, as I have told you, represents the capacity misses that remain even with a fully associative cache. Now, let us focus on how the miss rate changes with cache size. For example, if the size of the cache is 2 KB, we find the miss rate is a little more than 0.04; that is, about 4 percent of accesses miss with a 2 KB cache. As you increase the size from 2 KB to 4 KB, the miss rate decreases to a little more than 3 percent. So there is a decrease from a little more than 4 percent to a little more than 3 percent as you double the size, and the rule of thumb is that doubling the cache size, say from 2 KB to 4 KB, reduces the miss rate by about 25 percent. These figures correspond to full associativity. Now consider the direct-mapped case: with a 2 KB cache the miss rate is a little more than 0.1, that is, a little less than 10 percent, and when you increase the size to 4 KB, the miss rate reduces to a little more than 6 percent, roughly 7 percent. So here also a roughly 25 percent reduction in miss rate occurs. The question naturally arises: which type of miss accounts for this reduction? It is mainly the capacity misses, as I have already mentioned with the help of the earlier slides. Earlier we saw the miss rates for the different causes: compulsory, capacity, and conflict misses. Here the relative miss rates are shown, that is, each type as a fraction of all misses. We find that the bulk of the misses are capacity misses, and as I mentioned earlier, the share due to compulsory misses is quite small. Note that, contrary to the fall in total miss rate, the relative share of compulsory misses increases as the cache gets larger: the absolute number of compulsory misses stays essentially the same, while capacity and conflict misses shrink, so the compulsory percentage grows. The capacity-miss share is the largest component, ranging up to around 60 percent, and the conflict-miss share varies with the degree of associativity: one-way (direct-mapped), two-way, four-way, eight-way. For each degree of associativity, the percentage of misses is shown in this particular diagram. Now we can see that the miss rate can be reduced by controlling three parameters: cache size, associativity, and block size. How each of these affects the three types of misses, the three C's, we shall now see in a little more detail. So, first let us focus on larger cache sizes. We have seen that as we increase the size of the cache, the different types of misses change; how the three types of misses are affected as we increase the cache size is considered here.
So, larger caches are the obvious way to reduce capacity misses: capacity misses will reduce as we increase the size of the cache. However, larger caches have higher hit times. As we increase the size of the cache, it becomes increasingly complex; the decoder portion in particular becomes more complex, and the access delay grows. So hit time increases with cache size. Larger caches also have higher cost: as we put in more and more cache memory, the cost increases. These are the factors that change. In practice, L2 caches have become larger, but this is not true for the L1 cache. For the L1 cache, hit time is a very important parameter, because most of the time data and instructions are read from L1; so there we try to keep the size quite small, not very large, so that the hit time is very small. For the L2 cache, however, the size is significantly larger than that of L1.

The second important parameter, as I told you, is higher associativity. Higher associativity reduces conflict misses, as we have seen in the diagram: with a fully associative cache the conflict misses are minimum, and as you reduce the associativity towards direct mapping, the conflict misses, and hence the miss rate, increase. However, higher associativity increases hit time. The reason is that with higher associativity the cache becomes more complex. Direct mapping is the simplest type of cache; you can use a conventional memory as the cache when you use direct mapping. But when you go for two-way set associativity, four-way set associativity, or full associativity, the cache becomes increasingly complex, and as a consequence the hit time increases as we increase the associativity. Another observation from the diagram is that associativity higher than eight-way is likely not useful: we saw in the previous diagram that beyond eight-way set associativity there is no significant further reduction in miss rate, so there is no point in going beyond it; we reach a point of diminishing returns. That is the reason why associativity is usually restricted to eight-way. The second observation is the 2:1 cache rule of thumb. What is that? The miss rate of a direct-mapped cache of size N is about the same as that of a 2-way set-associative cache of size N/2. That means one parameter is cache size and the other is associativity: a cache of size N that is direct-mapped, or one-way set associative, gives about the same miss rate as a cache of size N/2 with 2-way set associativity. For example, a 64 KB direct-mapped cache misses at about the same rate as a 32 KB 2-way set-associative cache. So we find we can achieve the same miss rate either by increasing the associativity and halving the cache size, or by going for direct mapping and doubling the size relative to a 2-way set-associative design. This, you can say, is a tradeoff between associativity and cache size for the same miss rate, stated compactly below. This particular rule of thumb actually holds for caches of 128 KB and smaller; beyond 128 KB we do not apply it.
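Stated compactly, the rule of thumb (this is just a restatement of what the lecture says, not a formula quoted from the slides) is:

$$\text{MissRate}_{1\text{-way}}(N) \;\approx\; \text{MissRate}_{2\text{-way}}(N/2), \qquad N \le 128\,\text{KB}$$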
Now, let us consider the block size. A larger block size reduces compulsory misses, and that is because of spatial locality. We know this is governed by the locality of reference: with a larger block size, adjacent words are brought from main memory into the cache together, and as a consequence, when the next consecutive address is referenced, it is already available in the cache. That is the reason why increasing the block size reduces compulsory misses: spatial locality is exploited better. However, a larger block size increases the miss penalty. The reason is that whenever a cache miss occurs, if you have many words per block, you have to transfer all of them before the processor can resume execution of the instruction; that means, before control is returned to the processor, all the words of the block must be transferred from main memory to the cache. So the miss penalty increases. This problem can be reduced to a great extent by using critical word first: although all the words of the block are still transferred, the critical word, the one actually referenced by the processor, is transferred first, so the processor can resume operation as soon as the critical word has been delivered to the processor and to the cache, and the remaining words are transferred subsequently. We shall discuss this in my next lecture, where we consider various techniques for reducing miss penalty; this will be one of those techniques. Furthermore, larger blocks may increase conflict misses. You see, although a larger block reduces compulsory misses, the conflict misses can increase, and the reason is that for a cache of a given size, larger blocks mean fewer block frames. Suppose for a given block size the cache has n frames; if we double the block size without changing the cache size, the number of frames drops to n/2. Fewer frames means more blocks competing for each frame, and that leads to an increase in conflict misses. So, therefore, there is a tradeoff, and the best block size must be chosen carefully: you can see there are conflicting outcomes, since a larger block reduces compulsory misses but increases miss penalty and can also increase conflict misses. So you have to choose the block size very judiciously, as is evident from the diagram. You can see here that as you increase the block size, the miss rate initially decreases, but after reaching a minimum value it increases again. In particular, when the cache is small, we should use a relatively small block size; in other words, the block size should never be comparable to the size of the cache. When the cache is large, as you can see, increasing the block size does not affect the miss rate much.
So, the tradeoff is this: based on the available cache size, you have to decide the size of the block. Now, a simple case study of a real-life processor: the Intrinsity FastMATH processor, an embedded microprocessor that uses the MIPS architecture and is used in many embedded applications. It has a 12-stage pipeline and split instruction and data caches of 16 KB each; 16 KB means 4K words, organized as 256 blocks of 16 words each. So you can see that the block size is quite large, each block containing 16 words. The 32-bit address is divided into the different fields you can see in the figure: with 64-byte blocks and 256 block frames, that works out to an 18-bit tag, an 8-bit index, a 4-bit word offset within the block, and a 2-bit byte offset. A small sketch of how these fields are extracted is given below.
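As an illustration (not taken from the slides; the field widths are simply derived from the 16 KB, 16-word-block configuration just described, assuming a direct-mapped organization), the address fields could be extracted like this:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: splitting a 32-bit address for a 16 KB direct-mapped cache
 * with 16-word (64-byte) blocks, i.e. 256 block frames:
 *   bits  1:0   byte offset within a word  (2 bits)
 *   bits  5:2   word offset within a block (4 bits)
 *   bits 13:6   cache index, 256 frames    (8 bits)
 *   bits 31:14  tag                        (18 bits)
 */
int main(void) {
    uint32_t addr = 0x00403A64;   /* an arbitrary example address */
    uint32_t byte_off = addr & 0x3;
    uint32_t word_off = (addr >> 2) & 0xF;
    uint32_t index    = (addr >> 6) & 0xFF;
    uint32_t tag      = addr >> 14;
    printf("tag=0x%05X index=%u word=%u byte=%u\n",
           (unsigned)tag, (unsigned)index,
           (unsigned)word_off, (unsigned)byte_off);
    return 0;
}
```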
Now, we shall focus on various compiler optimization techniques: ways in which code can be modified so as to have fewer misses. So far, everything we have discussed has to be implemented in hardware, and programmers or compiler writers do not have to bother about it, whether you increase the size of the cache, increase the associativity, or increase the block size; all these are changes in hardware, and the programmers are quite happy with that. But now we shall discuss some techniques that are based on optimization by the compiler, and this optimization can be done either by the compiler or by the programmers themselves. McFarling reported back in 1989 that a 50 percent reduction in cache misses on a 2 KB cache, and a 75 percent reduction on an 8 KB direct-mapped cache, can be achieved by adopting compiler optimization techniques. Essentially, what is being done? Instructions and data accesses are reordered so as to reduce conflict misses: the cache size is kept fixed, and simply by reordering, the conflict misses are reduced, which leads to a reduction in miss rate. We shall discuss several of the most popular compiler optimization techniques. Number one is merging arrays: we shall see how we can merge more than one array into one, which improves spatial locality by using a single array of compound elements instead of two separate arrays. Second is loop interchange: by changing the nesting of loops, the order in which data is accessed changes; there is no change in the number of instructions executed, but simply by changing the order of the loop nesting we shall be able to reduce the misses. Third is loop fusion: you may have two independent loops with the same loop bounds and some variable overlap, that is, some variables are accessed in common by the two loops; instead of accessing them separately in two different loops, you can have one loop, and once a variable has been brought into the cache, it is used for both computations. The last technique is blocking, which improves temporal locality by accessing sub-blocks of data repeatedly, rather than sweeping whole rows or columns; for processing arrays this is very important. These techniques we shall discuss one after the other. So, this is the first technique, merging arrays. Here we have two separate sequential arrays: an array val and an array key, declared with the same size and accessed together, val[i] along with key[i]. Because they are declared separately, val[i] and key[i] can lie far apart in memory and may even compete for the same cache block. After merging, we instead declare a single array of structures, struct merge { int val; int key; }, so that the val and key for the same index sit next to each other in memory, and you access an element's val and then its key from the same cache block. By doing this, the miss rate decreases: it reduces potential conflicts between val and key, and it improves spatial locality, because the pair of values that are used together are now stored together. A sketch is shown below.
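A minimal sketch of the idea (SIZE is an illustrative constant; the lecture does not fix one):

```c
#define SIZE 1024

/* Before: two parallel arrays. val[i] and key[i], which are used
 * together, live in two different regions of memory and may map to
 * conflicting cache blocks. */
int val[SIZE];
int key[SIZE];

/* After: one array of records. The val and key for the same index
 * now share a cache block, so accessing one brings in the other. */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
```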
The second technique is loop interchange. Consider a 2D array initialization: int a[200][200], a two-dimensional array. One way to initialize it is to put the loop over j outside and the loop over i inside, writing a[i][j] = 2 in the body: for each value of j, you run through all values of i. Alternatively, you can simply change the order of nesting: put the loop over i outside and the loop over j inside, so that for one value of i you access all the elements a[i][0] through a[i][199] and then move on to the next i. So you have simply interchanged the nesting; the question arises, which one will give the better result? To answer this, one must understand the memory layout of the 2D array, the way it is stored in memory. Memory is one-dimensional: you are accessing a two-dimensional array, but it is stored in a linear memory that is accessed by a single address. Loop interchange reduces misses by improving spatial locality, and it improves cache performance without affecting the number of instructions executed: in both versions the number of executed instructions is the same, only the order of accesses changes. The elements of the 2D array are stored in contiguous memory cells, so there must be a mapping from the 2D abstraction to the 1D implementation, and you must understand how the rows and columns are stored. Take the two-dimensional array a[i][j]: the rows run in one direction and the columns in the other, with elements a[0][0], a[0][1], and so on along the first row. There are two common layouts: one is known as row major, the other as column major. If the elements of a row are stored contiguously, one row after another, it is called row major; on the other hand, if the elements of a column are stored contiguously, first column, then second column, then third column, it is called column major. C uses row major: first row 0 is stored, then row 1, then row 2, so as you increase the address you move along the elements of a row, and these rows map onto consecutive cache blocks. Since the matrix elements are stored contiguously in row-major order, when you access them row by row, with the loop over j innermost, you get good spatial locality and good results. On the other hand, if you access the array column by column while it is stored row major, the spatial locality is lost: consecutive accesses are 200 words apart, and as a result many more misses occur compared with the row-by-row traversal. Both versions are sketched below.
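A minimal sketch of the two orderings, using the 200-by-200 array from the example:

```c
#define N 200

int a[N][N];

/* Column-by-column traversal: consecutive iterations touch
 * a[0][j], a[1][j], ..., which are N ints apart in memory,
 * so spatial locality is poor for C's row-major layout. */
void init_column_order(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 2;
}

/* Row-by-row traversal after loop interchange: consecutive
 * iterations touch adjacent words in the same cache block. */
void init_row_order(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 2;
}
```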
Here is another example of loop interchange. The original code has three nested loops: the outer loop for (k = 0; k < 100; k++), then for (j = 0; j < 100; j++), and innermost for (i = 0; i < 5000; i++), with the body x[i][j] = 2 * x[i][j]; that is, you are reading each element of x and writing back twice its value. Now interchange i and j: keep the outer loop over k the same, make the middle loop for (i = 0; i < 5000; i++) and the inner loop for (j = 0; j < 100; j++), with the same body. This gives sequential accesses to the array elements instead of striding through memory every 100 words: in the first version, consecutive iterations of the inner loop touch elements 100 words apart, while in the second version the accesses are sequential. So if you perform this loop interchange, the second version gives the better result because of improved spatial locality, and the computation itself is unchanged. So that is the loop interchange example; the third is the loop fusion example. In this case, separate sections of code access the same arrays, with the same loop bounds, performing different computations on the common data. As I mentioned earlier, you have two different loops. The first loop runs i from 0 to n-1 and j from 0 to n-1 and computes a[i][j] = 1/b[i][j] * c[i][j]; so this is one computation, where elements of a, b, and c are being accessed. The second loop runs over the same index ranges and computes d[i][j] = a[i][j] + c[i][j], again accessing elements of a and c. So you find that the accesses to a and c overlap, although you are performing two different computations in two different loops. Instead of keeping two different loops, you can merge them into a single loop that performs both statements in one pass, as sketched below. By doing this loop fusion you get a better result: when the loops run separately there are two misses per access to a and c, whereas after fusion only one miss per access occurs. The two misses per access are reduced to one because of improved temporal locality: as you know, temporal locality means retaining something that will be needed again in the near future, and that is exactly what is happening here; for the first computation and the second computation, a[i][j] and c[i][j] are brought into the cache once and reused.
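A minimal compilable sketch of the fusion (the array types and the size N are illustrative; the lecture only dictates the loop structure):

```c
#define N 256

double a[N][N], b[N][N], c[N][N], d[N][N];

/* Before fusion: a[][] and c[][] are swept twice, so each of the
 * two loops can miss on the same elements. */
void unfused(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0 / b[i][j] * c[i][j];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];
}

/* After fusion: a[i][j] and c[i][j] are still in the cache when the
 * second statement needs them -- one miss per access instead of two. */
void fused(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
}
```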
So, loop fusion improves temporal locality and gives you a better result. The final technique that we shall discuss is known as blocking. When you perform dense matrix multiplication, the straightforward code is: for (i = 0; i < n; i++) for (j = 0; j < n; j++) { r = 0; for (k = 0; k < n; k++) r = r + y[i][k] * z[k][j]; x[i][j] = r; }. When you run this, as you can see, the two inner loops read all n-by-n elements of z, read the n elements of one row of y repeatedly, and write the n elements of one row of x. If n is large, the row of y and the whole of z will not fit in the cache at the same time, and as a consequence you will have a lot of misses. So instead of reading entire rows and entire columns, you can compute on a B-by-B submatrix that fits into the cache memory: you divide the computation into blocks of smaller size, as shown here, and the way the accesses change is shown in this particular diagram. Without blocking, the figure shades each element of the arrays x, y, and z by the age of its access: the white portion has not yet been touched, the light portion corresponds to older accesses, and the dark portion to the newer accesses. You can see that for x the accesses so far use only a small portion; similarly for array y only a small portion is in use; but for z a large portion corresponds to older accesses that must be revisited again and again. Instead of this, if you use blocking, changing the code to introduce a blocking factor B and dividing the matrices into blocks, the computation proceeds block by block rather than over entire rows and columns, and the loops access the arrays accordingly. Because of this blocking, the number of memory words accessed (and hence the capacity misses, in the worst case) drops from about 2n³ + n² to about 2n³/B + n². You can see that the dominant n³ term is divided by B: you are transferring only a small submatrix at a time instead of entire rows and columns, and as a consequence the capacity misses will be much smaller. A sketch of both versions follows.
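A compilable sketch of both versions (N and the blocking factor B are illustrative values, and N is assumed to be a multiple of B to keep the sketch short):

```c
#include <string.h>

#define N 512   /* matrix dimension (illustrative) */
#define B 32    /* blocking factor (illustrative); assumes N % B == 0 */

double x[N][N], y[N][N], z[N][N];

/* Unblocked: the inner loops sweep a whole row of y and all of z
 * for every row of x, so z is re-fetched again and again. */
void matmul(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked: each (jj, kk) pair confines the inner loops to a B x B
 * tile of z and a 1 x B strip of y that fit in the cache, so the
 * same words are reused before being evicted. x must start at zero
 * because partial sums are accumulated tile by tile. */
void matmul_blocked(void) {
    memset(x, 0, sizeof x);
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```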
And this exploits a combination of spatial and temporal locality, because the computation does not involve the entire arrays simultaneously; it is done over a small portion of the arrays, and by exploiting that idea you get the benefit of both kinds of locality. This same idea can also be used to help register allocation. After blocking, the access pattern is shown here: in contrast to the previous figure, a smaller number of elements is touched at any time for array x, for array y, and also for array z, and the computation is done over that small area only. You finish one block, then go to another block, and in this way, block by block, you do the computation; the total number of computations does not change, but the performance definitely improves. So, with this we have come to the end of today's lecture. We have discussed various techniques for reducing the miss rate: we have seen how we can use hardware, by increasing the cache size, by increasing the associativity, and also by increasing the block size, and we have also briefly discussed various compiler optimization techniques that can be used to reduce the miss rate. In my next lecture, I shall discuss the reduction of miss penalty, because that is the third important parameter which dictates the performance of cache access. Thank you.