Hello viewers, welcome to today's lecture on cache optimization techniques. In the last two lectures we discussed how you can reduce hit time and how you can reduce miss rate. Today we shall focus on another very important parameter, the miss penalty, and discuss various techniques for reducing it. There exist a large number of techniques; this is one of the most thoroughly researched areas, and many techniques have been proposed by different researchers, so I shall focus on a subset of them and give an overview. The techniques we shall briefly consider in today's lecture are: multilevel caches, write buffers, victim caches, read priority over write on a miss, sub-block placement, early restart and critical word first, non-blocking caches, hardware prefetching of instructions and data, and compiler-controlled prefetching.

First let us focus on multilevel caches. Whenever we go for a multilevel cache, we have to modify our definition of miss rate by introducing two terms: the local miss rate and the global miss rate. The local miss rate is the number of misses in a given cache divided by the total number of memory accesses made to that cache. Let me briefly explain. The processor first tries the L1 cache, and only if the instruction or data is not present in L1 does it go to the L2 cache. The number of accesses reaching L2 is therefore much smaller, and the local miss rate of L2 is computed over only those accesses: the misses in L2 divided by the number of accesses actually tried on L2. The global miss rate, on the other hand, is the number of misses in a cache divided by the total number of memory accesses generated by the CPU. For the first-level cache the global miss rate is simply miss rate L1, but for the second-level cache it is miss rate L1 multiplied by miss rate L2, so the global miss rate of L2 is obviously much smaller than its local miss rate.

Now let us see how the equation for the average memory access time changes when we add a second-level cache. As you know, AMAT = hit time + miss rate x miss penalty, and for the L1 cache each term carries the suffix L1. With an L2 cache present, the miss penalty of L1 is itself served by L2: miss penalty L1 = hit time L2 + miss rate L2 x miss penalty L2. Substituting, the AMAT for a two-level cache becomes hit time L1 + miss rate L1 x (hit time L2 + miss rate L2 x miss penalty L2). This is how the AMAT gets modified when we have a second-level cache. At the lower-level caches, L2 or L3, the global miss rate provides the more useful information, because it relates the accesses reaching that cache to the total number of CPU accesses, and so indicates how effective the cache is in reducing AMAT.
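Written compactly, the definitions and the substitution above are just the lecture's formulas in one place:

```latex
\begin{aligned}
\text{GlobalMissRate}_{L2} &= \text{MissRate}_{L1} \times \text{MissRate}_{L2}\\
\text{AMAT} &= \text{HitTime}_{L1} + \text{MissRate}_{L1} \times \text{MissPenalty}_{L1}\\
\text{MissPenalty}_{L1} &= \text{HitTime}_{L2} + \text{MissRate}_{L2} \times \text{MissPenalty}_{L2}\\
\Rightarrow\ \text{AMAT} &= \text{HitTime}_{L1} + \text{MissRate}_{L1}\left(\text{HitTime}_{L2} + \text{MissRate}_{L2} \times \text{MissPenalty}_{L2}\right)
\end{aligned}
```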
So a question naturally arises: what about the miss rate of L3? There is actually a possibility that the local miss rate of L2 or L3 is higher than the miss rate of L1, and that is fine. Who cares if the local miss rate of L3 is 50 percent? It may be 50 percent or more, but if only about 1 percent of the processor's memory accesses ever reach L3, it hardly hurts. In other words, if we get an overall improvement in performance, a large local miss rate in the L2 or L3 caches does not really matter; the overall performance, the AMAT, is the most important parameter.

Let me illustrate with an example. The assumptions are: first, the ideal CPI is 1, that is, 1.0 cycles per instruction whenever we get a hit in the L1 cache; the clock rate is 4 GHz; the miss rate of L1 is 2 percent; and the main memory access time is 100 ns. First, what is the miss penalty without L2? At 4 GHz one clock cycle is 0.25 ns, so the miss penalty is 100 ns / 0.25 ns = 400 clock cycles. Since the miss rate of L1 is 2 percent, the overall CPI without an L2 cache is 1 + (2/100) x 400 = 9 cycles per instruction.

Now let us consider the situation when we add a second-level cache. Assume the global miss rate for the L2 cache is 0.5 percent, which is quite small, and that the L2 hit time is 10 ns, much smaller than the 100 ns main memory access time. The penalty for an L1 miss that is served by L2 is then 10 / 0.25 = 40 clock cycles: whenever the data is available in the L2 cache, you need 40 clock cycles to get it. The total CPI in this situation is 1 + 2% x 40 + 0.5% x 400 = 3.8. Earlier we saw that the CPI was 9; now we get 3.8, so the performance improvement is 9 / 3.8, roughly 2.37. This simple problem illustrates how performance improves by using an L2 cache: with only an L1 cache the CPI is 9, and with L1 and L2 together the CPI drops to 3.8 cycles, an improvement of about 2.37 times.

From this we may also observe that the speed of the L1 cache affects the clock rate of the CPU. The CPU directly communicates and interfaces with the L1 cache, so most of the traffic between processor and memory takes place with L1, and that is why the hit time of L1 largely decides the clock rate of the CPU. The speed of the L2 cache, on the other hand, only affects the miss penalty of L1, as we have seen from this calculation.
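The same calculation can be checked with a few lines of C; every parameter value below is simply one of the lecture's assumptions, not a measurement:

```c
#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.0;    /* CPI on an L1 hit                 */
    double cycle_ns     = 0.25;   /* 4 GHz clock -> 0.25 ns per cycle */
    double l1_miss_rate = 0.02;   /* 2% misses in L1                  */
    double mem_ns       = 100.0;  /* main memory access time          */
    double l2_hit_ns    = 10.0;   /* L2 hit time                      */
    double l2_global_mr = 0.005;  /* 0.5% global miss rate for L2     */

    double mem_penalty = mem_ns / cycle_ns;     /* 400 cycles */
    double l2_penalty  = l2_hit_ns / cycle_ns;  /*  40 cycles */

    double cpi_no_l2   = ideal_cpi + l1_miss_rate * mem_penalty;  /* 9.0 */
    double cpi_with_l2 = ideal_cpi + l1_miss_rate * l2_penalty
                                   + l2_global_mr * mem_penalty;  /* 3.8 */

    printf("CPI without L2: %.1f\n", cpi_no_l2);
    printf("CPI with L2:    %.1f\n", cpi_with_l2);
    printf("Speedup:        %.2f\n", cpi_no_l2 / cpi_with_l2);    /* ~2.37 */
    return 0;
}
```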
With an L2 cache in place, we can follow two different policies regarding its contents. One is the multilevel inclusion policy, which I mentioned in the beginning; with inclusion, many designers keep the L1 and L2 block sizes the same, meaning the number of words present in a block is identical in L1 and L2. Otherwise, if the L1 and L2 block sizes are not the same, several L1 blocks may have to be invalidated whenever an L2 block is replaced; that is the cost of maintaining inclusion with mismatched block sizes. The alternative is multilevel exclusion, in which L1 data is never found in L2. Normally L2 includes L1, but here that inclusion property is deliberately not preserved. Some commercial processors have used this; for example, the AMD Athlon follows the exclusion policy, so the inclusion property is not used in its memory hierarchy.

Now let us consider another technique, the use of a write buffer. A write buffer is a queue that holds data waiting to be written into memory. As we know, whenever we write data, we write it into the cache, and sooner or later it must also be written into main memory to keep the two levels consistent. To avoid stalling the processor for that, we can add a write buffer. Consider the situation with a write-through policy: on a write hit the data is written into the cache, which takes one clock cycle, but with write-through it must also go to main memory on every store. If we use write-through without a write buffer, and assume a memory write takes 100 cycles while 10 percent of instructions are stores, the CPI becomes 1 + 100 x 10% = 11 cycles. This cost is not incurred if we use a write buffer: the data is written into the cache and also placed in the write buffer, and the actual writing into the lower-level memory takes place offline. The CPU is not concerned with that, because as far as the processor goes the store completes as soon as the data is in the L1 cache and the buffer.

Another approach can be used along with the write buffer, known as write merging. Here, when the address of new data matches the address of a valid write buffer entry, the new data is combined into that entry instead of occupying a fresh one. Suppose one word of a block has been modified and is sitting in the write buffer, and remember that a buffer entry may hold multiple words of a single block. Now another word is written whose address falls in the same block, matching a valid write buffer entry. The buffer is checked, and rather than allocating a new entry, the existing entry is modified in place. For this, the write merging technique requires an additional valid bit per word within each entry.
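As a rough sketch of the write-merging idea, here is a toy buffer in C. The sizes (four entries of four words each) and the word-addressed interface are illustrative choices, not taken from any real processor:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define WB_ENTRIES      4
#define WORDS_PER_ENTRY 4

typedef struct {
    bool     in_use;                  /* entry holds a pending write  */
    uint32_t block_addr;              /* one tag shared by all words  */
    uint32_t word[WORDS_PER_ENTRY];
    bool     valid[WORDS_PER_ENTRY];  /* per-word valid bits          */
} WriteBufferEntry;

static WriteBufferEntry wb[WB_ENTRIES];

/* Returns true if the write was merged into an existing entry or
 * placed in a free one; false means the buffer is full (CPU stalls).
 * addr is a word address for simplicity. */
bool write_buffer_put(uint32_t addr, uint32_t data) {
    uint32_t block  = addr / WORDS_PER_ENTRY;
    uint32_t offset = addr % WORDS_PER_ENTRY;

    for (int i = 0; i < WB_ENTRIES; i++) {
        /* Write merging: the new address matches a valid entry's
         * block, so update that entry instead of taking a new slot. */
        if (wb[i].in_use && wb[i].block_addr == block) {
            wb[i].word[offset]  = data;
            wb[i].valid[offset] = true;
            return true;
        }
    }
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!wb[i].in_use) {          /* allocate a fresh entry */
            memset(&wb[i], 0, sizeof wb[i]);
            wb[i].in_use        = true;
            wb[i].block_addr    = block;
            wb[i].word[offset]  = data;
            wb[i].valid[offset] = true;
            return true;
        }
    }
    return false;                     /* buffer full */
}
```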
As a consequence you have a valid bit for each word, but this definitely saves tags: all the words of a block share one common tag, which is exactly why writes to the same block address can be merged into one write buffer entry. So write merging is an extension of the write buffer approach, in which you check for a matching entry in the buffer before allocating a new one. A write-back policy is much more complex to implement with a write buffer, so we shall not discuss that case in this lecture.

Now we shall focus on another technique for reducing miss penalty, the victim cache. What is a victim cache? Alongside your main cache you add another small, fully associative buffer, which keeps recently thrown-out, discarded blocks from the cache. Why is it called a victim cache? As we know, whenever a line has to be replaced, its block is thrown out; that block is the victim. Instead of discarding it, you write it into the victim cache, so the victim cache contains the recently evicted data. The next time you try to access such a block, you may get it from the victim cache instead of reading it from main memory. The idea is to combine the fast hit time of a direct-mapped cache, which is very simple and lets you read directly from a single location, with a small buffer that absorbs its conflict misses. It has been found that a 4-entry victim cache removed 20 to 90 percent of the conflict misses of a 4-kilobyte direct-mapped data cache. For example, it is used in DEC Alpha and HP machines, and the AMD Athlon uses an 8-entry victim buffer; you need not stop at one entry, and a larger victim buffer helps reduce the miss penalty further. This technique is very useful and is applied in many situations.
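A toy sketch of the victim-cache lookup path may make the flow concrete. The 4-entry, fully associative size follows the figure quoted above; the round-robin replacement and single data word per entry are illustrative simplifications:

```c
#include <stdint.h>
#include <stdbool.h>

#define VC_ENTRIES 4

typedef struct { bool valid; uint32_t tag; uint32_t data; } VCEntry;
static VCEntry victim[VC_ENTRIES];

/* On a miss in the direct-mapped cache, probe the victim cache before
 * going to the next memory level. Fully associative: compare the tag
 * of every entry. Returns true on a victim hit. */
bool victim_lookup(uint32_t tag, uint32_t *data_out) {
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (victim[i].valid && victim[i].tag == tag) {
            *data_out = victim[i].data;  /* saved a trip to memory */
            return true;
        }
    }
    return false;                        /* fall through to L2/memory */
}

/* When a block is evicted (the "victim"), park it here instead of
 * discarding it; round-robin replacement keeps the sketch short. */
void victim_insert(uint32_t tag, uint32_t data) {
    static int next = 0;
    victim[next] = (VCEntry){ .valid = true, .tag = tag, .data = data };
    next = (next + 1) % VC_ENTRIES;
}
```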
Let us now consider the technique called read priority over write on a miss. We are aware of read-after-write conflicts: because writes sit in a write buffer, main memory reads on cache misses can conflict with pending writes. Consider first a write-back cache. Normally the dirty block being replaced is placed in the write buffer, and the straightforward approach is to write all the blocks from the write buffer to memory and then do the read. Instead, what can be done is this: copy the dirty block to the write buffer, then do the read, and only then do the write. That means you do not drain the whole write buffer into main memory before reading; the stall is shorter, since the processor restarts as soon as the read completes. This is how read-after-write conflicts with main memory reads on cache misses can be avoided, and the approach is used in some processors.

The same issue arises with write-through. A write buffer with write-through, as we have already discussed, allows cache writes to occur at the speed of the cache: the data is written into the cache and into the buffer, and the writing from the buffer into main memory takes place offline, so as far as the processor is concerned a store takes no more than the time needed to write into the cache. That is the benefit of the write buffer with write-through, which I have discussed in detail. However, write buffers complicate memory accesses: the buffer may hold the updated value of a location that is needed on a read miss, that is, a write may be immediately followed by a read miss to the same location.

Let me illustrate with an example. Consider the sequence: SW 512(R0), R3 which stores R3 into memory location 512; then LW R1, 1024(R0) which loads R1 from location 1024; then LW R2, 512(R0) which loads R2 from the same location 512. You can see that you write a register's contents into memory and then read from that same memory location into another register. With a write buffer, what can happen is this: the stored word goes into the write buffer, but if the load of R2 misses in the cache and reads main memory before the buffer has drained into that location, it will get the previous data, not the updated data. In such a case the contents of R2 and R3 may not be the same, even though both instructions refer to the same memory location. So this possibility, the buffer holding the updated value of a location needed on a read miss, must be taken care of whenever you perform a read; proper precautions are needed so that the contents of R3 and R2 end up identical.

For a write-through cache with a write buffer, read priority over write therefore works like this: check the write buffer contents before you read, and if there is no conflict, let the memory access continue. That means before performing a read from memory you check whether that location's value is present in the write buffer; if it is, you take it from there, and only if there is no conflict do you let the memory access proceed. A minimal sketch of this check follows, and then we look at the opposite policy, write priority over read.
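Here is a minimal sketch of "check the write buffer before reading memory". The one-entry buffer and the flat memory array are obvious simplifications for illustration; addr is treated as a word index:

```c
#include <stdint.h>
#include <stdbool.h>

static uint32_t memory[4096];
static struct { bool valid; uint32_t addr, data; } write_buf;

/* On a read miss, a pending store to the same address must supply the
 * data; otherwise stale memory contents would be returned and, in the
 * SW/LW example above, R2 would differ from R3. */
uint32_t read_on_miss(uint32_t addr) {
    if (write_buf.valid && write_buf.addr == addr)
        return write_buf.data;   /* forward the newest value          */
    return memory[addr];         /* no conflict: read memory directly */
}
```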
With write priority over read, you wait for the write buffer to empty first, which can increase the read miss penalty: whatever has been written into the write buffer must complete its writing into main memory, and only after the buffer is empty do you perform the read. So you have two different approaches. In the first, read priority over write, you give the read priority over draining the write buffer into main memory, but you must check the write buffer contents before reading from memory. In the second, you wait until the write buffer is empty, and the read miss penalty obviously increases because the processor has to wait until the buffer empties and the data is written into main memory. That is the situation with read priority over write on a miss.

Next, let us consider sub-block placement. A single word per block requires a large number of tag bits. You may recall the basic organization of a cache memory: each line has a valid bit, possibly various other management bits (read-only, write-only and so on), a tag field, and then the data, whether instructions or data words. With only one word per block, every word carries its own tag. To reduce the tag storage, as we have already discussed, you can place multiple words in a single block: the tag field stays the same, but the number of data words per block grows, so a single valid bit and a single tag field are shared by all the words of the block, and the total number of tags in the cache is reduced.

However, this requires loading the full block on a miss: whenever there is a cache miss, the entire block, all its words, has to be transferred, so the miss penalty goes up. This problem can be overcome by having a valid bit per sub-block. You add a separate valid bit for each word (or sub-block) of the line. Then, instead of loading the full block, you can load only the particular word that missed: on a miss for one word, you transfer just that word and set its valid bit to 1, while the other valid bits may remain 0. So instead of reading all the words from main memory, you read only one word and update its valid bit. This definitely increases the number of valid bits, but the benefit is that the miss penalty is significantly reduced.
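A short sketch of sub-block placement in C; the 4-word block is illustrative, the tag comparison is omitted for brevity, and fetch_word is a stand-in for the lower-level memory access:

```c
#include <stdint.h>
#include <stdbool.h>

#define WORDS_PER_BLOCK 4

typedef struct {
    uint32_t tag;                      /* one tag for the whole block */
    uint32_t word[WORDS_PER_BLOCK];
    bool     valid[WORDS_PER_BLOCK];   /* one valid bit per sub-block */
} CacheLine;

/* Stand-in for a lower-level memory access. */
static uint32_t fetch_word(uint32_t addr) { return addr * 2u; }

uint32_t read_word(CacheLine *line, uint32_t addr) {
    uint32_t off = addr % WORDS_PER_BLOCK;
    if (!line->valid[off]) {
        /* Sub-block miss: load only the missing word, not the whole
         * block; this is what shrinks the miss penalty. */
        line->word[off]  = fetch_word(addr);
        line->valid[off] = true;
    }
    return line->word[off];
}
```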
The next technique is early restart and critical word first. With early restart, you do not wait for the full block to be loaded before restarting the CPU. The words of a block are read one after the other, because the width of the processor-memory bus is usually one word, so only one word can be transferred at a time. As soon as the requested word of the block arrives, you send it to the CPU and let the CPU continue execution; you do not make it wait for the remaining words. If the requested word is, say, the second word of the block, you mark it valid and immediately forward it to the CPU. This is the early restart technique, and it again reduces the miss penalty, though it complicates matters a little: the controller still has to read the other words from memory after sending the requested one to the CPU.

Then comes critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives, letting the CPU continue execution while the rest of the words in the block are filled in. This is also known as wrapped fetch or requested word first. Here you do not read the block sequentially from its first word; you read the critical word first. If the requested word is, say, word 3, the addresses are generated so that word 3 is read first and transferred to the CPU, and the remaining words of the block are then read one after the other while the CPU continues execution, so the miss penalty is significantly reduced. This technique is used in several processors and is particularly useful with large block sizes; with only one or two words per block it is not very effective, but with many words per block it pays off.

However, spatial locality is a problem: the CPU tends to want the next sequential word, so it is not always clear whether early restart gives a benefit. Suppose the critical word, word 3, is fetched first and the remaining words follow; the processor may ask for word 4 immediately after word 3, and if that word has not yet arrived, the processor stalls or misses again. We normally read from sequential locations precisely to exploit spatial locality, and since this scheme disturbs the sequential order, the benefit of early restart is not guaranteed. Here is an example: the AMD Athlon has 64-byte cache blocks; its L2 cache takes 11 cycles to deliver the critical 8 bytes and then 2 cycles per 8 bytes to fetch the rest of the block, and the Athlon can issue two loads per cycle. Compute the access time for 8 successive data accesses; a small sketch of the arithmetic follows, and then we work through it.
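The arithmetic of the example as a tiny C sketch; every figure (11 cycles for the critical 8 bytes, 2 cycles per further 8 bytes, 64-byte block, 2 loads per cycle) is the one just quoted:

```c
#include <stdio.h>

int main(void) {
    int chunks        = 64 / 8;  /* 8-byte transfers per 64-byte block */
    int first_chunk   = 11;      /* cycles until the critical 8 bytes  */
    int per_chunk     = 2;       /* cycles per remaining 8 bytes       */
    int loads_per_cyc = 2;

    int with_cwf    = first_chunk + (chunks - 1) * per_chunk;  /* 25 */
    int without_cwf = with_cwf + chunks / loads_per_cyc;       /* 29 */

    printf("with critical word first:    %d cycles\n", with_cwf);
    printf("without critical word first: %d cycles\n", without_cwf);
    return 0;
}
```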
So we can compute the access time for the 8 successive data accesses. With critical word first, the answer is 11 + (8 - 1) x 2 = 25 clock cycles for the CPU to read the full block: 11 cycles for the critical 8 bytes, which the CPU has already consumed, plus 2 cycles for each of the remaining seven 8-byte chunks. Without critical word first, it would take 25 clock cycles to get the full block, and after the block is delivered the eight accesses, at two loads per cycle, take another 8 / 2 = 4 clock cycles, for a total of 25 + 4 = 29 clock cycles. This example thus compares the access time for 8 successive data accesses with and without critical word first.

Now let us consider the non-blocking cache. A non-blocking cache allows the data cache to continue to serve other requests during a miss. Normally, as we know, on a cache miss the processor is blocked, in the sense that some clock cycles must elapse before the processor can resume execution; that is why an ordinary cache is called blocking. In a multiprogrammed environment, if one process is blocked the processor can switch to another process, but within a single program this technique is meaningful only with out-of-order execution processors. For such processors it is very useful, and it requires multi-bank memories; the Pentium Pro, for example, allows 4 outstanding memory misses. Even while misses are outstanding, execution continues out of order: instructions need not be executed in the order in which they appear in program memory.

Non-blocking caches reduce stalls on misses. "Hit under miss" reduces the effective miss penalty by continuing useful work during a miss; "hit under multiple misses" or "miss under miss" may lower the effective miss penalty further by overlapping multiple misses. This significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses, and it definitely requires multiple memory banks, which I have already mentioned; without them the scheme is not effective, since addresses are issued to different banks and data is supplied from them in parallel, at a faster overall rate.
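A very reduced sketch of the bookkeeping a non-blocking cache needs: miss status holding registers (MSHRs) that track outstanding misses so hits, and further misses, can keep being serviced. The four entries mirror the "four outstanding misses" figure above; everything else is illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

#define MSHRS 4   /* outstanding misses allowed, as in the example */

typedef struct { bool busy; uint32_t block_addr; } MSHR;
static MSHR mshr[MSHRS];

/* On a cache miss, record it so execution can proceed. Returns false
 * when all MSHRs are busy: only then must the processor stall. */
bool mshr_allocate(uint32_t block_addr) {
    for (int i = 0; i < MSHRS; i++)
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            return true;                 /* secondary miss: merge  */
    for (int i = 0; i < MSHRS; i++)
        if (!mshr[i].busy) {
            mshr[i].busy       = true;   /* primary miss: track it */
            mshr[i].block_addr = block_addr;
            return true;
        }
    return false;                        /* no free MSHR: stall    */
}

/* Called when the fill for a block returns from memory. */
void mshr_release(uint32_t block_addr) {
    for (int i = 0; i < MSHRS; i++)
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            mshr[i].busy = false;
}
```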
For out-of-order processors, cache performance is difficult to characterize: it is hard even to define the miss penalty in a way that fits the out-of-order execution model. We have so far discussed miss penalty in the context of in-order execution, but with out-of-order execution, computation and memory accesses overlap, so the effective average memory access time is much reduced. We may assume a certain percentage of overlap, but this percentage varies from application to application; the degree of overlap varies significantly, and that is exactly why the miss penalty is difficult to define for out-of-order processors.

Now we shall discuss another technique, known as prefetching, which in fact reduces not only the miss penalty but also the miss rate. Prefetching can be done in different ways. The first is hardware prefetching, in which both data and instructions are prefetched. Instruction prefetching is done in almost every processor; it is a very common technique. Recall that to exploit spatial locality, a cache miss normally brings in an entire block comprising multiple words. Prefetching goes one step further: processors usually fetch two blocks on a miss, the requested block and the next one. Although the requested word is present in the first block, you fetch that block and also the next adjacent block. Several commercial processors go further still; the UltraSPARC III, for example, computes strides in the data accesses and prefetches data based on them. This technique helps reduce the miss penalty because of spatial locality: the next time you try to read a word from the second block, it has already been prefetched, so no cache miss occurs. It therefore reduces the miss rate as well as the miss penalty.

The second technique is software prefetching; the previous one was done automatically by hardware. Here you can prefetch data into a register, as is done with HP PA-RISC loads, or perform a cache prefetch, loading into the cache or a special prefetch buffer instead of a register, as in MIPS IV, PowerPC, and SPARC V9. Obviously, doing it in software requires special prefetching instructions, and these instructions cannot cause faults; they are a form of speculative fetch. You speculate that a word or instruction from the next block will be required and prefetch it using the special instructions provided for the purpose.

The third approach based on prefetching is compiler-controlled prefetching, in which the compiler inserts instructions to prefetch data before it is needed. There are two flavors. A binding prefetch loads the data directly into a register, so the address and the target register must be correct: the value is bound at prefetch time and cannot simply be discarded. A non-binding prefetch loads the data into the cache instead; it is allowed to be wrong, because a useless or incorrect block is simply replaced later, and no fault is raised. That cannot happen with a binding prefetch. A small sketch of a non-binding software prefetch follows.
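As a concrete, if compiler-specific, illustration, GCC and Clang expose a non-binding cache prefetch through __builtin_prefetch. It cannot fault, and a wrong guess only wastes a fetch; the look-ahead distance of 8 elements here is a tuning assumption, not a rule:

```c
#include <stddef.h>

/* Sum an array while prefetching a few elements ahead of the loop. */
long sum(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8]);  /* request the line early  */
        s += a[i];                          /* by now it may be cached */
    }
    return s;
}
```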
Whether prefetching is done by software or inserted by the compiler, prefetch instructions are additional instructions, so there is an overhead: they make the executed instruction count, and potentially the execution time, longer. So you have to check whether the benefit matches the cost, that is, whether the loss due to these additional instructions is outweighed by the reduction in misses and miss penalty. Prefetching is worthwhile only if the cost of issuing the prefetches is less than the saving from the reduced misses. It is a trade-off that has to be evaluated; you may not always gain from prefetching, and that has to be checked.

We have now discussed various techniques for cache memory optimization. Let us quickly summarize the techniques covered over these three lectures, together with a rough hardware complexity level from 0 (trivial) to 3 (very complex). First, the techniques for reducing hit time. A small and simple cache reduces hit time, though it may increase the miss rate, with complexity 0. Avoiding address translation during cache indexing, that is, a virtually addressed cache, reduces hit time with complexity 2. Pipelined writes reduce hit time with complexity 1. Simultaneous tag comparison and data reading, which we discussed, reduces hit time with complexity 2. A suitable write strategy, write-through or write-back, whose advantages and disadvantages we have discussed, carries complexity 2; normally write-through is used, with suitable precautions. Way prediction and pseudo-associative caches, which we have also discussed, reduce hit time with complexity 2.

Then there are the techniques for reducing the miss rate. A larger block size reduces the miss rate but increases the miss penalty, since the whole larger block has to be read, with complexity 0. Higher associativity also reduces the miss rate, at the cost of increased hit time owing to the more complex memory organization, with complexity 1. A pseudo-associative cache can be used to reduce the miss rate with complexity 2. And the various compiler techniques we discussed reduce the miss rate with no change in hardware, so the complexity level is 0.
Finally, the techniques we discussed today for reducing the miss penalty. A second-level cache reduces the miss penalty with complexity 2. A write buffer reduces it with complexity 1, as does a victim cache. Read priority over write on a miss has complexity 1. Sub-block placement, as we have seen, reduces the miss penalty as well as the hit time, with complexity 1. Early restart and critical word first reduce the miss penalty and are fairly complex, at level 2. Non-blocking caches, which we discussed, are really quite complex, at level 3. Hardware prefetching of instructions and data reduces both the miss rate and the miss penalty with complexity 2, and compiler-controlled prefetching, which also reduces both the miss rate and the miss penalty, has complexity 3. This summary covers the techniques widely used in commercial processors to reduce the hit time, the miss rate, or the miss penalty. These are the various cache optimization techniques we have discussed, and in the next lecture we shall focus on the main memory. Thank you.