lecture on hierarchical memory organization. In the last couple of lectures we have discussed the important issues of hierarchical memory organization, focusing particularly on cache memory. Today we shall start our discussion on how the performance of the cache memory organization can be improved. This is the outline of today's lecture; in fact, these topics will be covered over a couple of lectures. First, I shall consider a case study which will crystallize the idea of hierarchical memory organization and also highlight the important steps through which a memory access takes place. Then I shall discuss memory system performance, which is the main topic of today's lecture, and we shall introduce a concept called average memory access time, by which the performance of a memory system is specified. Then we shall see that cache performance is governed by three important parameters: number one is the cache hit time, the second is the cache miss rate, and the third is the cache miss penalty. Obviously, to improve the performance of a hierarchical memory system it is necessary to improve one or more of these parameters. What do you have to do? You have to reduce the cache hit time, reduce the cache miss rate, and also reduce the cache miss penalty. So, we shall discuss various approaches that can be used for reducing these parameter values; obviously this cannot be covered in a single lecture, so we shall consider it over a few lectures. First, let us consider a case study with the help of the DEC Alpha 21264 processor's cache memory, particularly the data cache. As I have already mentioned, all present-day processors have two separate memories at the L1 level, a separate data cache and a separate instruction cache. We are focusing on the data cache; the operation is quite similar for the instruction cache, so it is not necessary to discuss both. The DEC Alpha processor presents a 48-bit virtual address which is translated to a 44-bit physical address. This concept will be introduced in detail when we consider virtual memory, but so that you can view today's discussion in the proper perspective, let me briefly introduce what we really mean by virtual address and physical address. There are several levels of memory hierarchy; we have introduced only the cache memory level, and similarly the main memory may also be considered another level of the hierarchy. In that situation, the CPU generates an address which we usually call the virtual or logical address. The address generated by the CPU does not go directly to the physical memory, and that is why the notion of a virtual or logical address is introduced. This address is then translated by some translation mechanism or mapping in the virtual memory system into a physical address. That means the virtual memory system translates the virtual or logical address into a physical address, and that address goes to the main memory. Since nowadays we use cache memory, the same address also goes to the cache memory. So, there is a translation involved from virtual address to physical address, and it is the physical address that is used by the cache memory we are focusing on. We have not yet discussed this translation from virtual to physical address; I shall discuss it in detail later on when we consider the virtual memory concept.
So, in the case of the DEC Alpha processor, a 48-bit virtual address is generated and then translated into a 44-bit physical address. You see, the number of address bits on the two sides need not be the same: the CPU may generate m bits while the physical address is n bits, and usually m is greater than n. The reason is that the size of the virtual memory is much larger than the size of the physical memory; that is why m is larger than n, and the translation is done with the help of memory mapping, which I shall discuss later. In this particular case we assume that the translation has been done and we have a 44-bit physical address comprising 29 tag bits, 9 index bits and 6 offset bits. The cache contains 64 kilobytes of data in 64-byte blocks; so in the DEC Alpha processor each cache, the instruction cache and the data cache, is 64 kilobytes with a block size of 64 bytes, and the cache is two-way set associative. It does not use direct mapping, it uses two-way set-associative mapping. With this background, let us now look at the data cache present in the DEC Alpha 21264, a very important processor in which many innovative, novel concepts were used, and which was one of the fastest processors when it was commercially introduced. That is why this particular processor is referred to in many case studies. Now, as I have mentioned, the physical address has a 29-bit tag field, a 9-bit index field and a 6-bit offset field. Since the cache is two-way set associative, there are two blocks in a single set, which are shown here, and since 64 bytes are present in each block, that is also shown. You can see that the tag and valid-bit part and the data part are shown separately: the data part exchanges data with the CPU and with the main memory, while the valid bit and the tag field are used essentially for finding out whether there is a cache miss or a cache hit. The read operation can be considered as taking place in four steps, so let me discuss the four steps by which a read is performed. In step one, the CPU generates a 44-bit address, and the address is split into a 29-bit tag, a 9-bit set index and a 6-bit block offset. Since a block is 64 bytes, that is 2 to the power 6, you get 6 offset bits; dividing 64 kilobytes by the 64-byte block size gives 2 to the power 10 blocks, and since each set holds two blocks there are 2 to the power 9 sets, which gives the 9-bit index field. This physical address is applied to the cache memory; that is what happens in step one. Then in step two, the index field identifies the right set: the index field is applied and it identifies which particular set is associated with this address. This is done in step two, and you can see it pointing to the set that has been identified by the index field.
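The field widths just derived can be checked with a few lines of code. This is only an illustrative sketch of the arithmetic from the lecture, not part of the original material; the variable names are my own.

```python
import math

# Parameters of the Alpha 21264 L1 data cache as given in the lecture.
physical_address_bits = 44
cache_size_bytes = 64 * 1024   # 64 KB
block_size_bytes = 64          # 64-byte blocks
associativity = 2              # two-way set associative

offset_bits = int(math.log2(block_size_bytes))                  # 6
num_blocks = cache_size_bytes // block_size_bytes               # 2^10 = 1024 blocks
num_sets = num_blocks // associativity                          # 2^9  = 512 sets
index_bits = int(math.log2(num_sets))                           # 9
tag_bits = physical_address_bits - index_bits - offset_bits     # 29

print(tag_bits, index_bits, offset_bits)   # prints: 29 9 6
```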
So, the right set is selected using the index bits, and then in step three the tag is compared with both tags in the set. As we know, in a two-way set-associative memory each set has two tag fields, and you require two comparators so that the two comparisons can take place in parallel; that is what is happening here. This is one block and this is the other block, and you can see the two comparators where the comparison takes place in parallel, the address tag being compared with both tag fields. If there is a match and the valid bit is one, it is a hit. Obviously the valid bit has to be considered, because after the power is turned on all the valid bits are zero, and only after data has been transferred from the main memory into a cache block does its valid bit become one. So it is necessary to consider the valid bit as well when identifying whether it is a miss or a hit. That means, if the valid bit is one and the comparison of the tag fields gives a match, then it is a hit; otherwise it is a miss, as we know. Then we go to step four: if there is a match, select the matching block and return the byte at the right offset. Obviously there cannot be a match with both blocks; either there is no match at all, or there is a match with exactly one of the two blocks in the set. Using a multiplexer you select the right block, and that data is provided to the CPU; it goes to the data-in of the CPU. This is the case for a read. For a write it is a little different, because you have to write the data into the proper block, so it differs somewhat from a read; since we are considering a read, which is quite simple, the data goes directly to the CPU. So the data is read in the case of a hit; in the case of a miss you have to read it from the lower level of memory, which involves additional steps. Let us skip those steps for the time being. Now, let us focus on memory system performance, based on the various discussions we have had and the case study I have considered in this lecture. There are three important parameters which capture the performance of a memory system: number one is latency, number two is bandwidth, number three is average memory access time. What do we really mean by these three parameters? The first one, latency, is the time it takes from the issue of a memory request to the time the data is available at the processor. Latency depends on various factors such as the technology being used, the size of the memory, whether it is on chip or off chip, and so on; all of these decide how much time it takes from the issue of the memory request to the time the data is available at the processor. The bandwidth, on the other hand, is the rate at which data can be pumped to the processor by the memory system. The rate at which the memory system can provide data to the processor again depends on the organization of the memory, whether it is direct mapped or set associative, and on other parameters.
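The four steps of the read can be put together in a small sketch. This is only a software illustration of the lookup logic, not the actual Alpha hardware; the data structures and function names are invented for the example, and in hardware the two tag comparisons of step three happen in parallel rather than in a loop.

```python
OFFSET_BITS, INDEX_BITS = 6, 9   # widths from the Alpha 21264 data cache example

def split_address(addr):
    """Step 1: split the physical address into tag, set index and block offset."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def cache_read(cache, addr):
    """cache is a list of sets; each set is a list of (valid, tag, data_block) ways."""
    tag, index, offset = split_address(addr)
    selected_set = cache[index]                       # step 2: index selects the set
    for valid, way_tag, data_block in selected_set:   # step 3: compare the tag with both ways
        if valid and way_tag == tag:                  # hit only if the valid bit is also set
            return data_block[offset]                 # step 4: return the byte at the right offset
    return None                                       # miss: read from the lower level of memory
```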
Then the third parameter is the average memory access time, that is, the average time it takes for the processor to get a data item it requests. You are fetching instructions and data continuously; in particular, instruction fetching takes place continuously, one instruction after another, and that is what the processor does after it is turned on. The time required to get data or an instruction from memory is not uniform: it depends on whether there was a hit or a miss, how much time a hit takes, how much time is needed to decide that it is a miss, what the penalty is, from where the data will then be read, and so on. So the time it takes to get a requested data item to the processor is a variable quantity, and it is variable because of the memory hierarchy used in the system. The average memory access time can be expressed as: average memory access time (AMAT) = cache hit time + miss rate x miss penalty. That means, whenever there is a hit, the other two factors are not involved and the time is decided by the cache hit time alone; whenever there is a miss, the contribution is decided by the miss rate, since we are considering an average, and by the miss penalty, that is, how many clock cycles are necessary to get the data whenever a miss occurs. These three parameters together give you the average memory access time. Let us see how this is expressed in terms of probability. Let the hit ratio h_i be the probability of finding the requested item at level i of the hierarchical memory system; the miss ratio is then 1 - h_i. Considering a hierarchy with one level of cache and the main memory, the average memory access time is h1 x t1 + (1 - h1) x (t1 + t2), where h1 is the hit ratio of level 1, t1 is the access time of level 1, and t2 is the access time of the second level; on a miss you first access the cache and then the main memory, so both times are involved. We are also interested in the effective cost. As I mentioned in the beginning, our objective is to have a memory that is as large as the largest-capacity memory, as fast as the fastest memory, and at the same time of optimum cost. That is why you have to take the effective cost into consideration: the total cost is c1 x s1 + c2 x s2, where c1 is the cost per byte and s1 is the size of the memory at level 1, and c2 and s2 are the cost per byte and size at the second level; dividing by the total size s1 + s2 gives the effective cost per byte. Let me consider an example: assume the cache memory is 1 kilobyte and the main memory is 1 megabyte. You can see there is a three-orders-of-magnitude difference between the cache memory and the main memory, and in fact it may be more. How fast is it? Let us see.
So, assume that the cache hit ratio is 95 percent and the access time of the cache memory is 10 nanoseconds. The remaining accesses, that is 1 minus 0.95 or 5 percent, have to go to the main memory, and for those you have to add the access time of the main memory, which is 100 nanoseconds. So it has been assumed that the cache access time is 10 nanoseconds and the main memory access time is 100 nanoseconds. The average memory access time is therefore 0.95 x 10 + 0.05 x (10 + 100). As you can see, whenever there is a miss you have to take both times into account: first you access the cache memory and then you access the main memory, so it involves accessing both. Taking this together, the average memory access time comes to 15 nanoseconds. We find that 15 nanoseconds is very close to the access time of the cache memory; it is much closer to 10 nanoseconds than to 100 nanoseconds, which means we are getting an average memory access time close to that of the fastest memory in the system, and that is one of our objectives. As far as size is concerned, as I mentioned, to the user the size of the memory is the size of the main memory. The user does not even know whether a cache memory is present or not; as long as a program can be accommodated in the main memory, that is good enough. So to the user the capacity of the main memory is effectively the size that the programmer visualizes, and that corresponds to the larger, higher level of the memory hierarchy. Now let us focus on the cost. Assume the cache memory costs one rupee per byte and the main memory costs 0.01 rupee per byte. This is just a representative example, nothing to do with actual prices, because as I mentioned the cost keeps decreasing with time; in the beginning I showed you a figure where the cost of main memory and the cost of cache memory are both decreasing with time. So you should not take these as absolute values. If you compute the average cost per byte, (c1 x s1 + c2 x s2) / (s1 + s2), you get approximately 0.011 per byte, which is very close to the cost of the main memory, not the cost of the cache memory. So we can see that our objective, as fast as the fastest memory, as large as the largest memory, and with a cost close to that of the slower memory, is being achieved, and that is clear from this example.
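As a quick check of the two formulas, here is a short worked sketch using the figures from this example; it simply evaluates the expressions given above, so the numbers, not the code, come from the lecture.

```python
# Average memory access time for the two-level hierarchy:
#   AMAT = h1*t1 + (1 - h1)*(t1 + t2)
h1, t1, t2 = 0.95, 10, 100     # hit ratio, cache access time (ns), main memory access time (ns)
amat = h1 * t1 + (1 - h1) * (t1 + t2)
print(amat)                     # 15.0 ns, close to the 10 ns cache rather than the 100 ns memory

# Effective cost per byte: (c1*s1 + c2*s2) / (s1 + s2)
c1, s1 = 1.00, 1 * 1024         # cache: Rs 1 per byte, 1 KB
c2, s2 = 0.01, 1024 * 1024      # main memory: Rs 0.01 per byte, 1 MB
cost = (c1 * s1 + c2 * s2) / (s1 + s2)
print(round(cost, 4))           # about 0.011, close to the cost per byte of the main memory
```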
Now, let us focus on performance. As I have already mentioned, the performance of a cache is largely determined by the cache hit time, the time between sending the address and the data returning from the cache; the cache miss rate, the number of cache misses divided by the number of accesses; and the cache miss penalty, the extra processor stall cycles caused by the access to the next level of the hierarchy. We have to reduce these three parameters to improve the performance, and we shall consider each of them one after the other. We shall focus on the cache hit time first; the reason for considering the cache hit time first is that, as we know, the probability of a hit is much higher, 95 percent or it may even be 99 percent, depending on the size and other parameters. So, let us consider the common case first: the common case is a hit, and whenever a hit occurs the questions are what the cache hit time is and how we can reduce it. For reducing the cache hit time many techniques have evolved, and even today research is going on; those working in the area of computer architecture are still working on how the cache hit time can be reduced. I shall consider some of the popular techniques which have been implemented in real-life processors. The first technique is small and simple caches; the second is simultaneous tag comparison and data reading; the third is the use of an appropriate write strategy; the fourth is way prediction and pseudo-associative caches; the fifth is avoiding address translation, which is known as a virtual cache; the next is pipelined cache access; and finally, trace caches. Let us discuss these techniques one after the other. The first one is small and simple caches. What do we really mean by small and simple? By simple we mean direct mapping. As you have already seen, the cache memory is simple whenever we use direct mapping, the reason being that with direct mapping you can use standard off-the-shelf memory; there is no need to modify the memory. It receives an address and, after the access time, produces the data; whether you store instructions or data in it does not matter: it produces instructions if it is used as an instruction cache and data if it is used as a data cache. So you can use standard off-the-shelf memory, which is simple in organization, and the access time will be smaller compared to other types of memories. If, say, you use full associativity, you have to use a content-addressable memory, which is quite complex: it involves comparison with the contents of each and every memory location, that comparison has to be done in parallel, it requires a lot of hardware, and whenever you put in a lot of hardware it introduces more delay, meaning the access time will be longer. That is why simple means direct mapped. Then the question of small: why will the access time be different for small and large memories? The reason is that the first component in a memory is the decoder. You apply an n-bit address, and the decoder generates 2 to the power n lines; the complexity of this decoder depends on n. If the size is large, the decoder becomes quite complex because it has to produce many lines, and the access time is heavily dependent on the complexity of the decoder. Those who have attended a course on VLSI circuits know that, for example, the delay of a 2-input NAND gate is much lower than the delay of, say, a 3-input or an 8-input NAND gate; I am not going into the details of that, but as the number of lines increases the decoder requires gates with larger fan-in and fan-out, and you have to use multi-level circuits to realize it. In other words, the decoder takes longer. That is the reason why a small and simple cache is advised when we want to reduce the hit time.
So, this is possibly the reason why all modern processors have a direct-mapped and small L1 cache. I have already given you some examples, and there you have seen that the L1 cache is usually direct mapped and usually small in size. For example, in the Pentium 4 the L1 cache size was reduced from 16 kilobytes to 8 kilobytes in later versions; when it was introduced it had a larger size, but subsequently the size was reduced to reduce the hit time. The L2 cache, however, can be larger and can use set-associative mapping, two-way or four-way set associative. Another technique that can be used is to keep the tags on chip and the data off chip for the L2 cache. In this variation, to improve performance, the tags of the L2 cache are kept on chip so that the comparison can be done quickly, while the data of the L2 cache is off chip because you want a larger size. That is the reason why in the earlier diagram I showed the tag part and the data part separately: the tag part may remain on the chip while the data can be off the chip. This diagram shows how the access time changes with the size as well as with the associativity. You can see this is the case for one-way, that is, direct mapping: as you increase the size from 4 kilobytes to 256 kilobytes, the access time increases from maybe 3 nanoseconds to roughly 9 nanoseconds, so it becomes about 3 times larger as you increase the cache size from 4 kilobytes to 256 kilobytes. Similarly, whenever you increase the associativity you make the cache more complex: compared to direct-mapped (one-way), this is two-way set associative, this is four-way set associative, and this is fully associative, and you can see that for the same size the access time increases with higher associativity. That is true irrespective of the size; for example, for a 256-kilobyte cache memory the access time for direct mapping is a little more than 8 nanoseconds, it becomes more than 11 nanoseconds for two-way and four-way set associative, and close to 14 nanoseconds for fully associative. So you can see that as the size increases and as the complexity increases, the access time increases; that is the reason why a smaller and simpler cache is advised for the L1 cache to reduce the hit time. Now, another way of reducing the hit time is to perform the tag comparison simultaneously with the data reading. We know that normally we first compare the tag to check whether it is a hit or a miss; if it is a hit we read the data from the cache, and if it is a miss the data has to be read from the main memory. So this is done sequentially: first the tag comparison, and then, if there is a match or hit, we read from the cache memory, otherwise we read from the main memory. To reduce the hit time, what you can do instead is compare the tag and fetch the block at the same time: the tag comparison and the reading of data from the cache memory can be initiated together. If it is a hit there is no problem, no harm done; however, if it is a miss, then you have to discard what was read and read the data from the main memory.
So, in any case a miss will involve more time: whenever you read from the main memory, the data has to be loaded into the cache memory and also transferred to the processor. So you can see how the hit time reduction is achieved by this simultaneous tag comparison and data reading. Now, reducing the hit time by using an appropriate write strategy. It has been observed that there are many more reads than writes. All instructions must be read; about 37 percent of instructions are loads, which are essentially reads, and only about 10 percent are stores. So you see that reads greatly outnumber writes, and the fraction of all memory accesses that are writes is quite small: 0.10 divided by (1 + 0.37 + 0.10) gives about 7 percent. That is, the fraction of memory accesses that are writes is about 7 percent, while the fraction of data memory accesses that are writes is 0.10 divided by (0.37 + 0.10), about 21 percent. So we find that the number of accesses for writes is small, and that is the reason why, in the beginning, more attention was focused on reads and on how to reduce the read time, which I have already discussed; less attention was given to writes. However, remember the fundamental principle, make the common case fast; that was done, but we still have to focus on writes, because if writes are extremely slow, then Amdahl's law tells us that the overall performance will be poor. So writes also need to be made faster: instead of considering only the common case, which is essentially reads, we have to pay attention to writes as well, although the number of writes is small. Let us see how writes can really be made fast. Several strategies exist for making reads fast, which I have already discussed, such as simultaneous tag comparison and block reading; unfortunately, this cannot be done for writes. For a read you can read from the cache without making any modification; if you read from the cache and then do not use the data, there is no harm, because if it is a miss you simply will not use it and will ultimately read from the main memory. But in the case of a write this cannot be done: if you make a change in the cache and it then turns out to be a miss, you have done something that cannot be undone. That is why this approach cannot be used for writes; making writes fast cannot be done in the same way, and the tag comparison cannot be performed simultaneously with the block write. One must be sure that one does not overwrite a block frame that is not a hit; as I have already explained, if it is a hit there is no problem, but in the case of a miss it leads to a problem. So we shall consider two write policies that are used: one is write through and the other is write back; I have already introduced these two concepts earlier. In the case of write through, as you know, all writes update the cache as well as the underlying memory. So the cache data can always be discarded, because the most up-to-date data is in memory: whenever you use write through you make the change in the cache memory as well as in the main memory, so in that case there is no problem.
With write through, the cache control information requires only a valid bit. However, whenever you go for write back, writes update only the cache. That means, in the write-back scheme, as I have already mentioned, you write only into the cache: you have the cache memory and the main memory, and when you perform a write you modify only the cache memory, not the main memory. In such a case, how do you keep track of this? Because at the time of replacement you have to write the modified data from the cache back into the main memory; you have to transfer the cache block to the main memory when it is replaced. So a main memory write has to be performed during block replacement, and you require one additional bit, an update bit, for housekeeping purposes. That means, along with the valid bit, as I explained in my earlier lecture, you require one update bit. If that update bit is 0 when replacement takes place, it is not necessary to copy the cache block into the main memory; you can avoid that. On the other hand, if the update flag is set, it means you have modified the cache contents, so it is necessary to copy the updated data into the main memory. The update field has to be checked, and you require this additional housekeeping bit whenever you use the write-back scheme. This will definitely make the cache hit faster; the write-back scheme makes the cache hit time faster, but with additional complexity. The relative advantages are these: with write through, the memory always has the latest data and the management of the cache is simple. The write-back policy, on the other hand, requires much lower bandwidth, so it is faster, since data is often overwritten multiple times; this is based on the observation that once you transfer a block from the main memory to the cache memory, many writes to it will take place, so why write it back into the main memory each time? Only when the block is replaced do you write it back; that is what is being stated here. It also gives you better tolerance to long-latency memory: if the main memory has a long latency and you use write through, then each write hit takes longer, whereas with write back it is smaller because you are dealing only with the cache memory. So write back has better tolerance to long-latency memory. In fact, there is a kind of tradeoff based on the difference in access times of the two levels of the memory hierarchy. For example, if the difference is small, suppose the cache access time is only 1 nanosecond and the main memory access time is 10 nanoseconds, not 100 or 1000 nanoseconds, then the write-through approach may be used; on the other hand, if the main memory access time is 100 nanoseconds or longer, then it is better to use write back.
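The housekeeping just described can be sketched as follows. This is a rough illustration of the update (dirty) bit idea, not the organization of any real processor; the memory object and its write_back method are hypothetical stand-ins for the lower level of the hierarchy.

```python
class CacheBlock:
    def __init__(self):
        self.valid = False
        self.update = False   # the "update bit" (dirty bit) of the write-back scheme
        self.tag = None
        self.data = None

def write_hit(block, data):
    """Write back: a write hit changes only the cache copy and marks it modified."""
    block.data = data
    block.update = True       # main memory is now stale; remember to copy the block back later

def replace_block(block, new_tag, new_data, memory):
    """On replacement, only a block whose update bit is set is copied to main memory."""
    if block.valid and block.update:
        memory.write_back(block.tag, block.data)   # hypothetical call to the lower memory level
    block.tag, block.data = new_tag, new_data
    block.valid, block.update = True, False
```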
Another write policy choice is write allocate versus write no-allocate. What is the basic idea of write allocate? Allocate a new block on a write miss. We have seen that whenever there is a miss you have to transfer a block from the main memory to the cache memory; so what you do is allocate a new block each time a write miss occurs, and as you do that, since a block contains a number of words, you have to read the remaining words, that is, service a read miss, to fill the rest of the cache line. That is one approach. The other approach is write no-allocate, also called write around: the write data is simply sent through to the underlying memory and the cache does not allocate a new cache line, so in that case no such problem arises. These two approaches are available, just like write through and write back, and depending on the application and the requirement you can use one of them; obviously each has its own advantages and disadvantages. Now, let us consider the fourth technique for reducing the hit time, which uses way prediction and pseudo-associative caches. You may recall that we discussed branch prediction: we predict that a branch will be taken or not taken, and accordingly we proceed to fetch instructions in that direction. In a similar way, here the hardware predicts the next block to be accessed: extra bits are associated with each set, and they predict which block within the set will be accessed next. As you do the prediction, the multiplexer can be set early to select the predicted block only, and a single tag comparison is required. This is in the context of a two-way or higher set-associative memory. As we know, a single set will have more than one block, and each block has a valid bit, a tag and data. Normally, the tag field of the address coming from the processor must be compared with the tag of each block in the set; the index field, which is separate from the tag field, selects the set. So for a two-way set-associative cache you have to perform two tag comparisons, and you also require a multiplexer: the two data outputs go to the multiplexer, you select one of them, and then the block offset is applied to select the appropriate data. That is how it normally proceeds. However, if you do the prediction, you can apply the block offset early; there is no need to wait for the comparison, and there is no need to perform two tag comparisons. If one particular way is the predicted one, you compare only that tag and select only that data. That means, assuming a two-way set, the predicted block is fetched by setting the multiplexer accordingly, and only the corresponding comparator output is checked, not the other one. If you have a four-way or an eight-way set-associative cache, then instead of four or eight parallel comparisons only one comparison is required, and you can apply the appropriate select value to obtain the data quickly. In such a case the multiplexer can be set early to select the predicted block and only a single tag comparison is involved, as I explained; a misprediction results in checking the other blocks present in that set.
However, you cannot always expect the prediction to be correct, as you have seen with branch prediction techniques. Here also the prediction may turn out to be wrong, and when it is wrong the access to the predicted block misses and the other block in the set has to be checked. This has been used in real-life processors: the DEC Alpha 21264 uses way prediction on its two-way set-associative instruction cache. If the prediction is correct, the access time is one cycle. What is the penalty? If the prediction is right, the access requires one cycle, and if it is wrong it involves two cycles, because the block we checked first is not the right one and the other block also has to be compared; two cycles, because the Alpha chip's instruction cache is two-way set associative. Experiments with SPEC95 suggest better than 85 percent prediction success, which means that for about 85 percent of the cases you require one cycle and only for about 15 percent of the cases you require two cycles to access the cache. This is a substantial gain, and that is the reason why this way prediction technique is used not only in the DEC Alpha 21264 but also in the Pentium 4 processor. Now, let us consider another approach, known as pseudo-associativity. What do we really mean by pseudo-associativity? We have seen that direct mapping gives you a fast hit time; on the other hand, two-way set-associative mapping gives a slower access time but fewer conflict misses, that is, higher performance because the hit rate is higher. How can we combine the advantages of both? That is what is done in the pseudo-associative approach. The cache is divided into two parts, two halves; on a miss you check the other half of the cache to see if the data is there, and if so you have a pseudo hit. That is, instead of a two-way set-associative cache you have the cache divided into two halves: you check one half first, and if it is a hit, the hit time is small, as represented here, and there is no need to check the other half; but if it is a miss in the first half, you have to check the other half, and that takes additional time, known as the pseudo hit time. Of course, there is a possibility that the data is not in the other half either, that is, it is not present in the cache at all; in that case it is a miss and you incur the miss penalty. The drawback is that the CPU pipeline has to cope with a hit taking either one or two cycles, which may force a slower cycle time: as we know, the clock rate of the processor is tied to the speed of the memory, and since an access can take either one or two clock cycles, the processor and the CPU pipeline cycle time have to be organized accordingly. So this technique is suitable for a cache not tied directly to the processor, that is, not the on-chip cache but an off-chip cache; that is why it is used in the L2 cache of the MIPS R10000 and also in the UltraSPARC processor.
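Both way prediction and pseudo-associativity trade a fast common case against a slower fallback, so the gain can be checked with a one-line expected-value calculation. The figures below are the way-prediction numbers quoted above (85 percent success, one cycle on a correct prediction, two cycles otherwise); the calculation itself is my own illustration.

```python
# Expected cache access time with way prediction on a two-way set-associative cache.
p_correct = 0.85                                    # prediction success rate reported for SPEC95
hit_cycles = p_correct * 1 + (1 - p_correct) * 2    # 1 cycle if correct, 2 cycles if not
print(hit_cycles)                                   # 1.15 cycles on average
```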
Next, let us see how the hit time can be reduced by using the concept of a virtual cache. As I have already mentioned, we have virtual memory, and correspondingly a virtual address and a physical address. If the cache is accessed using the virtual address itself, it is called a virtual cache. In the case of a physical cache, a physically indexed and physically tagged cache, we have the virtual address and a translation is done to generate the physical address from the virtual address with the help of a piece of hardware known as the translation look-aside buffer (TLB), which I shall introduce later on. The TLB does the translation, we get the physical address, and the physical address is used both for indexing and for the tag comparison: the index part of the physical address is used, and the tag field is used as well. That is the physical cache. In the case of a virtual cache, on the other hand, we use the virtual address both for indexing and for the comparison of the tag field: a virtually indexed and virtually tagged cache. In such a case the cache must be flushed as the processor switches processes; that is, whenever a context switch or process switch takes place you would have to flush the cache memory, and this is taken care of with the help of additional bits known as a process identifier tag (PID), which has to be added as part of each cache entry. It must also handle aliasing, where different virtual addresses map to the same physical address; this aliasing arises because, for example, the operating system and a user process may have different virtual addresses for the same physical address, and that problem has to be overcome. So in a virtual cache both the index and the tag check use the virtual address, and virtual-to-physical translation is not necessary in the case of a cache hit. There are several issues involved in this. The first is how to get the page protection information: whenever you use virtual memory there are page protection bits, read only, write only, execute only, which can be different for different users, and this page-level protection information is normally checked during the virtual-to-physical address translation, so that checking has to be provided in some other way. The second is how to handle a process context switch: as I was saying, the same virtual address can be mapped to different physical addresses at different instances of execution, as happens in multitasking computer systems, so different physical addresses for the same virtual address have to be dealt with. The third is how to handle synonyms: as I told you, the operating system and user programs may have two different virtual addresses for the same physical address, and this has to be handled; the solution can be provided in hardware, an approach known as anti-aliasing, or by a software-based approach, and we are not going into the details of that. There is also a middle ground: a cache that is indexed using the virtual address but whose tag is compared using the physical address. The indexing is carried out while the virtual-to-physical translation is occurring, and the various issues I have mentioned go away: no PID needs to be associated with the cache blocks, no protection information is needed in the cache blocks, and synonym handling is not a problem.
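A minimal sketch of this virtually indexed, physically tagged lookup is given below. It assumes (my assumption, not something stated in the lecture) that the index and offset bits fall within the part of the address that is unchanged by translation, which is what allows the set to be selected while the TLB is still translating; the tlb object and its translate method are hypothetical.

```python
def vipt_read(cache, tlb, vaddr, index_bits, offset_bits):
    """Virtually indexed, physically tagged read: index taken from the virtual address,
    tag comparison done against the physical tag supplied by the TLB."""
    offset = vaddr & ((1 << offset_bits) - 1)
    index = (vaddr >> offset_bits) & ((1 << index_bits) - 1)
    selected_set = cache[index]              # indexing starts from the virtual address...
    phys_tag = tlb.translate(vaddr)          # ...while the TLB produces the physical tag (hypothetical API)
    for valid, way_tag, data_block in selected_set:
        if valid and way_tag == phys_tag:    # the tag check itself uses the physical tag
            return data_block[offset]
    return None                              # miss: fetch from the next level of memory
```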
So, this is the situation for the virtually indexed, physically tagged cache, and here is the motivation: you can see that the cache is indexed with the virtual address but tagged physically, so a cache hit is determined quickly by accessing the TLB in parallel and using the physical tag for the comparison while the data is being read. This also avoids having to associate a process ID with each cache entry, so you do not require the PID field here, as you can see. So, these are the techniques, and finally another approach that can be used is known as pipelined cache access. An L1 cache access can take multiple cycles, so you can use pipelining and the cache memory access can be pipelined. Whenever you do that, it gives you a fast cycle time and a slower hit, but a lower average cache access time. The Pentium takes one cycle to access the cache, the Pentium Pro through the Pentium III take two cycles, and the Pentium 4 takes four cycles with this pipelined approach to cache memory access. With this we have come to the end of today's lecture, where we have discussed various approaches to reduce the cache hit time; in my subsequent lectures we shall discuss the techniques for reducing the other two parameters. Thank you.