We discussed two pipeline stages last time: fetch, and decode combined with rename. So, moving on, today we look at the remaining pipeline stages of the R10K. The third stage is the instruction issue stage, and there are three issue queues, as we mentioned last time. We will talk about the integer queue and the address queue in particular. The integer queue is a collapsible, CAM (content-addressable memory) type of structure. As we discussed last time, a CAM allows you to compare a value against all the entries and select the matching entries. This is needed for comparing the physical register identifiers to wake up instructions, and there is selection logic for deciding which instructions among the ready ones to issue. The integer queue can issue at most two instructions per cycle to the two integer ALUs, and it can issue dependent instructions back to back. This is possible because the consumer can pick up the values from the bypass. The address queue is slightly more complicated; it is not a collapsible queue. The integer queue is collapsible because instructions can issue out of program order, which means whichever gets ready will issue, and that creates holes in between the queue entries. So, to free up slots towards the tail of the queue you have to collapse it, so that the holes move to one end where you can keep allocating. The address queue, on the other hand, is a FIFO queue, meaning that loads and stores cannot issue out of program order; they only issue in order. When a load or a store is issued, the address is still not known, and we have already discussed the kinds of complications that can arise because of this. So, to simplify matters, what R10K does is issue loads and stores in order, which is why the address queue is a FIFO queue and not a collapsible queue. That is the basic structure of the issue queues.
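To make the CAM-style wakeup concrete, here is a minimal Python sketch; the structure and names (Entry, broadcast, select) are mine for illustration, not taken from the R10K documentation. A completing instruction broadcasts its destination physical register tag, every entry compares it against its source tags in parallel, and the selection logic then picks at most two fully ready entries.

```python
# Minimal sketch of CAM-style wakeup in an issue queue.
# All names (Entry, broadcast, select) are illustrative, not R10K's.

class Entry:
    def __init__(self, op, src1, src2, ready1, ready2):
        self.op = op                  # opcode, for display only
        self.src = [src1, src2]       # source physical register tags
        self.ready = [ready1, ready2] # operand availability bits

queue = [
    Entry("add", 12, 7, True, False),   # waits on p7
    Entry("sub", 7, 3, False, True),    # waits on p7
    Entry("mul", 9, 2, False, False),   # waits on p9 and p2
]

def broadcast(tag):
    """Compare 'tag' against every entry's source tags in parallel
    (in hardware, one comparator per tag per entry)."""
    for e in queue:
        for i in (0, 1):
            if e.src[i] == tag:
                e.ready[i] = True

def select(max_issue=2):
    """Pick at most 'max_issue' fully ready entries (selection logic)."""
    ready = [e for e in queue if all(e.ready)]
    return ready[:max_issue]

broadcast(7)                      # a producer of p7 completes
print([e.op for e in select()])   # -> ['add', 'sub']
```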
There are a couple more pieces of logic associated with the address queue. First, there is a load retry unit. Let us first understand why we would need to retry a load instruction. The first reason is a data cache miss, that is, a load instruction that misses in the data cache. What we essentially have to do is lodge the miss request so that the data comes back from somewhere higher up in the memory hierarchy, and when the data comes back the load has to retry: we issue the load again and hopefully this time it hits in the cache. The second situation where we want to retry a load is a memory address conflict. We touched on this when discussing caches, so let me explain it now. Suppose you have a cache miss. You issue the miss request to the next level of the memory hierarchy, and you also select a replacement candidate, which is where the data will go when it comes back. Now, the problem arises if, while that miss is outstanding, you get one more request to the same cache set which also misses the cache. Of course, you could say that the second miss can run the replacement policy excluding the already selected replacement candidate: if I have a 4-way cache, I have four candidates to select from for replacement; the first cache miss has already reserved one of them, so before that refill comes back, a second miss to the same set can only select one of the remaining three ways. R10K does not do that. What R10K does is disallow the second request; the second request has to be retried once the first miss resolves. We essentially say that you can have at most one cache miss outstanding per cache set at any moment. That is one example of a memory address conflict. The second example is easier. Suppose you have a load instruction that missed in the cache, and the request has gone out to fetch the block. Then you get a second load instruction which maps to the same cache block that has just missed; this is possible because a cache block is several bytes. If this is my cache block, I can have one load instruction loading from here and another load instruction loading from there; the loads are maybe 4 bytes each, whereas the cache block is, say, 32 bytes. Suppose this one happens first and that one goes second. When the second miss happens, you have already sent a request for this cache block, so it makes no sense to send the request again. In this case we reject the second load and retry it later; this is a cache block conflict. That is the other example of a memory address conflict; it is not exactly a conflict, it is essentially a request mapping to a cache block for which a miss is already outstanding. For both of these cases you retry the load instruction later. There are two 16x16 matrices that track address dependence information; the rows and columns correspond to address queue entries, because there are 16 entries in this queue. The first matrix avoids unnecessary thrashing by allocating one way in a set to the oldest conflicting address queue entry; it makes sure that the oldest entry always makes progress by reserving a replacement candidate for it. The second matrix records load-store dependencies, to enforce ordering and to carry out store-to-load forwarding. We discussed earlier that if a load overlaps with an already outstanding store, the load can take the value from the store provided it overlaps completely; that is what is tracked by the second matrix. Next, a returning refill: a refill is essentially the message that comes back in response to a cache miss. You send a cache miss request and eventually the refill comes back to the cache. A returning refill snoops the address queue and wakes up all matching instructions. There may be multiple load/store instructions waiting for this particular cache block; when it comes back, the refill's block address is compared with all the pending addresses in the 16 entries of the queue, and whichever entries match will retry through a dedicated cache port. This port allows one retry per cycle, so we retry one of the matching entries at a time. That is pretty much the address queue logic. As you can see, there are two simplifying factors that help the design. One is that the address queue is a FIFO: we issue loads and stores in order, and that takes care of a load bypassing a store to the same address and getting a wrong value. The second simplification is allowing at most one outstanding cache miss per cache index. The first matrix essentially keeps track of these index conflicts, and if there are two address queue entries that need the same cache index, it makes sure that the oldest entry gets a way reserved, so that it makes progress. If you did it the other way and reserved the way for the younger entry, the older entry might actually replace that block before the younger entry's refill comes back, and then they keep on replacing each other, which does not make any sense. Remember that between a refill and a retry there is a window of time in which the refilled block may actually be replaced, and then the retry generates one more cache miss. You definitely want to worry about that, because it may lead to a livelock: the two entries keep thrashing each other forever. The problem is more severe in the R10K because the L1 caches are only two-way set associative, which makes it quite vulnerable to these conflicts; with more ways the problem becomes much easier. Any more questions?
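Here is a rough sketch of the two retry rules under a simplified model of outstanding misses; the real hardware encodes the conflicts in the 16x16 matrices, which I collapse into a dictionary here.

```python
# Sketch of the address queue's retry rules (simplified; the real
# hardware uses 16x16 dependence matrices, modeled here as a dict).

NUM_SETS = 512          # 32 KB cache / 2 ways / 32 B lines = 512 sets
BLOCK = 32

outstanding = {}        # cache set -> block address of pending miss

def issue_load(addr):
    block = addr // BLOCK
    cset = block % NUM_SETS
    if cset in outstanding:
        if outstanding[cset] == block:
            return "retry: block conflict (refill already in flight)"
        return "retry: set conflict (one miss per set allowed)"
    outstanding[cset] = block   # oldest miss reserves the way
    return "miss sent to next level"

print(issue_load(0x1000))   # miss sent
print(issue_load(0x1008))   # same block -> block conflict
print(issue_load(0x5000))   # same set, different block -> set conflict
```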
So, the next point is about dependent instructions that take their value from a load. This is essentially a situation where you have a load instruction, which may look like this, and then, let us say, an add instruction that consumes the loaded value; the add is a dependent of the load. The load takes two cycles to execute: during the first cycle the address is computed, and in the second cycle the data cache is looked up. So, if you look at the pipeline of the load instruction, it looks like this: the first stage is fetch, the second stage is decode/rename, then it issues, then comes what I will call the address generation stage, then the memory stage, and then it commits some time later. This is roughly what your load pipeline looks like. Ideally, I want to issue an instruction dependent on the load so that it can pick up the load value from the bypass just in time. For the add instruction, I would want its execution stage to be positioned right here, so that it can pick up the value from the bypass out of the memory stage. Moving backwards, what that means is that I must issue the add here, because for the add instruction the stage right after issue is essentially the execution stage, preceded by fetch, decode, and rename. So, assume that the load issues in cycle 0 — time moves in this direction — computes its address in cycle 1, and looks up the cache in cycle 2. Remember that the cache stage may take several cycles depending on whether I hit or miss in the cache; what I am showing here is the best possible scenario, assuming a cache hit. I want to issue the dependent in cycle 2, as I have shown here, so that it can pick up the load value from the bypass just as the value is coming out, in cycle 3. The commit may not be exactly here; it may be further away. So, the load looks up the cache in parallel with the issuing of the dependent: while the load is looking up the cache, I am issuing the dependent. What I am assuming here, essentially, is that this load is going to hit in the cache, which is why I am issuing the dependent at this point. This is a speculation; we have talked about this earlier. The dependent issues even before it is known whether the load will hit in the cache. This is called load hit speculation, and the speculation goes wrong in the cases where the load misses the cache.
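Here is a toy timeline of load hit speculation under the cycle numbering used above (issue at 0, address at 1, cache at 2, data on the bypass at 3); the scheduling function is a sketch, not the real R10K logic.

```python
# Toy timeline for load hit speculation.
# Load: issue @0, address @1, cache access @2, data on bypass @3.
# To catch the bypass, a dependent must issue @2 (execute @3),
# i.e., before the hit/miss outcome of the load is known.

LOAD_ISSUE, ADDR, CACHE, BYPASS = 0, 1, 2, 3

def schedule_dependent(assume_hit=True):
    return CACHE if assume_hit else None   # speculate on a hit

def outcome(hit, issue_cycle):
    if hit:
        return f"dependent issued @{issue_cycle}, picks value off bypass @{BYPASS}"
    return f"dependent issued @{issue_cycle} is squashed; issue slot wasted"

c = schedule_dependent()
print(outcome(hit=True, issue_cycle=c))
print(outcome(hit=False, issue_cycle=c))
```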
Now, you might ask: what is the big deal? We are going to lose a large number of cycles anyway, since the load has missed the cache. What is important is that, had I known this load was going to miss the cache, then instead of issuing the dependent in this cycle I could have issued an independent instruction, which would not have wasted the cycle. Essentially, I am now wasting this particular issue slot because of the mis-speculation. The good news is that this speculation is correct most of the time, because hit rates are normally high in the L1 cache. For most programs you can easily assume that your hit rate will be in excess of 90 percent, so with a likelihood of more than 0.9 I am assured that the load will not miss the cache. All right, is this speculation clear? Now, when we look at the Pentium 4, probably sometime next week, we will find that it actually has multiple pipe stages between issue and execute, because the Pentium 4 pipeline is much deeper than this one. You can imagine what happens there: the dependent issues here, goes through several pipe stages before it reaches execution, and in the meantime dependents of that dependent instruction also issue. For example, there could be another instruction that consumes $3: I assume that the add will pick up the loaded value at the right time, and I issue this instruction as well, before knowing anything about what happens to the load. So, there we would be wasting not just one issue slot; we will come back to this point when we get there. Any question on this? Note that R10K does not actually do anything to improve this speculation; it always blindly assumes that the load will hit. You could have a predictor here which learns the behavior of a load; for example, it is known that certain load instructions miss heavily in the cache, they just keep missing, and for the dependents of those loads you could refrain from this speculation. But R10K does nothing of that sort. We will look at one such predictor when we discuss the Alpha 21264 later. Next, this is a summary of the functional units; sorry, the slide is very dense. Right after an instruction is issued, it reads its source operands, indexed by the physical register numbers, from the register file, and moves on to execute. Let me point out that in the earlier example I skipped one stage, just for the sake of explanation; actually there is one more stage in between. So, the pipeline is fetch, decode/rename, issue, then register fetch, and then execute; the register fetch is essentially stage four. In the earlier picture, register fetch and execute were fused together into one stage. Now, there are two integer ALUs. Branch and shift instructions can execute only on ALU1; multiply and divide can execute only on ALU2; all other integer instructions can execute on either of the two ALUs. ALU1 is responsible for triggering rollback in case of a branch misprediction, because it is the one that executes branch instructions. Misprediction recovery we have already talked about: it marks all instructions after the branch as squashed, restores the register map from the corresponding branch stack entry, and sets the fetch PC to the correct target. There are four floating point units: one dedicated to floating point multiply, one to floating point divide, one to floating point square root, and most of the other floating point instructions execute on the adder. Finally, the load/store unit has two address calculators, but the result of only one of them is actually selected; I will explain shortly why it is done this way.
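A small sketch of the steering rules just described; the function and set names are mine. Branches and shifts must go to ALU1, multiply and divide to ALU2, and everything else to whichever is free.

```python
# Sketch of integer ALU steering in the R10K style (names are mine).

ALU1_ONLY = {"branch", "shift"}       # ALU1 also triggers rollback
ALU2_ONLY = {"mult", "div"}

def steer(op, alu1_busy, alu2_busy):
    if op in ALU1_ONLY:
        return None if alu1_busy else "ALU1"
    if op in ALU2_ONLY:
        return None if alu2_busy else "ALU2"
    # anything else may use whichever ALU is free
    if not alu1_busy:
        return "ALU1"
    if not alu2_busy:
        return "ALU2"
    return None   # both busy: wait for a later cycle

print(steer("branch", alu1_busy=False, alu2_busy=False))  # ALU1
print(steer("add",    alu1_busy=True,  alu2_busy=False))  # ALU2
print(steer("mult",   alu1_busy=False, alu2_busy=True))   # None -> wait
```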
The data TLB is fully associative with 64 entries and translates a 44-bit virtual address to a 40-bit physical address. The physical address is used to match the data cache tags; the data cache itself is virtually indexed. So, the load pipeline looks like this: issue, register fetch, then execute — or what I call the address generation stage here — then memory, and then commit some time later. Any question on this? So, can somebody guess why this is done like this? There are two address generation units, but you select the result of only one of them. Why should I have two? Under what circumstances would you actually do that? It is clear that you are computing something on two ALUs, and the reason you use both is probably that you do not know which one will produce the correct result; at a later point you get to know which one should be selected. So, essentially you are doing some kind of speculation. But speculation on what? [A student suggests program counters.] No, these two ALUs only receive load/store instructions, so there is no ambiguity there; program counters are not routed to this unit. Branch targets are computed early, in the second stage itself; we will come to branch targets separately. [Another student asks whether one of the two addresses is selected based on the branch being taken.] No, nothing like that; there is no PC here, we are mixing this up with branch targets. What goes to these two ALUs is a load, so you have a base register with an offset. [The student elaborates: if you have an if-else structure, the base register can get modified in two different ways, in the if part and in the else part, so a load after the if-else could see two different addresses.] That is not what is done here. The answer is that R10K has two different formats of load instructions, and the format is not decoded by the time the load reaches this point. So, you send the instruction to both ALUs: one ALU assumes one format and computes the address, the other ALU assumes the other format and computes the address. In parallel, you decode the format of the load instruction, and finally a multiplexer selects one of the two results based on the decoded format.
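The dual address calculator trick can be sketched like this; the two "formats" below are invented stand-ins, since the lecture does not spell out the actual R10K load formats. The point is only that both candidate addresses are computed before the format is decoded, and a multiplexer picks one.

```python
# Sketch of the dual address calculators. The two formats here are
# hypothetical stand-ins; the point is that both candidate addresses
# are computed before the format is known, then one is selected.

def addr_format_a(base, imm):
    return base + imm                # e.g., base + plain offset

def addr_format_b(base, imm):
    return base + (imm << 2)        # e.g., a scaled-offset variant

def load_address(base, imm, decoded_format):
    a = addr_format_a(base, imm)     # ALU 1 assumes format A
    b = addr_format_b(base, imm)     # ALU 2 assumes format B
    # ...format decode happens in parallel with the two adds...
    return a if decoded_format == "A" else b   # final multiplexer

print(hex(load_address(0x1000, 8, "A")))   # 0x1008
print(hex(load_address(0x1000, 8, "B")))   # 0x1020
```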
So, let us see how many read ports and write ports there are in the register file. Note that there are two separate register files: one for integer values and one for floating point values. The integer register file has seven read ports. What are these seven read ports? Remember that we have two integer ALUs, which we just discussed on the last slide. Those require two read ports each, because there are two operands for each of these units; that makes four. There are two read ports for the address generation ALUs — each of the two may need to read one base register — and that makes six. The seventh port is shared by three classes of instructions: a store needs to read one register for the value it is storing; JR and JALR need to read one register to know the branch target, because these are indirect branches; and the move-to-floating-point instructions — these are your MTC1 instructions, move to coprocessor 1 — need to read the integer register whose value goes to the floating point register file. These three share one read port, which means there is a scheduler that will hopefully arbitrate fairly among these three types of instructions for access to the port. That makes seven. Again, notice an interesting design trade-off here. You could ask: why didn't they provide nine ports and get rid of this port sharing? The point is that these three types of instructions are infrequent, so it makes no sense to dedicate a port to each of them; those ports would be very rarely used. Even with one shared port, there will rarely be a cycle where two of these instructions actually contend for it; most of the time at most one instruction wants this port. There are three write ports: one for each integer ALU — the address generation ALUs do not need write ports, because they do not write to the register file; they generate an address which is used to look up the TLB and the cache — and one shared by load, JAL/JALR, and move-from-floating-point. JAL and JALR need to write the return address to a register. The same argument applies again: the ALU instructions are frequent, these others are not, and only rarely will a cycle have two of them ready and contending for the same port. Now, we talked about the predicate bit last time, when we discussed conditional move instructions. There is a 64-bit predicate vector attached to the integer register file, needed for executing the conditional-move-on-zero instructions. Conceptually there is a bit attached to each integer register; of course, the way it is implemented is as a separate 64-bit register treated as a vector, where the i-th bit is the predicate for register i, saying whether register i is zero or not. Whenever you produce a value in register i, you also compute this predicate bit at the same time and set it. Now the floating point register file: what are its read ports? There are two each for the adder and the multiplier, since each requires two operands, and one shared between store and move. Why do we not need any read port for floating point load instructions? [A student answers: the address comes from an integer register.] Exactly. The only thing a load instruction needs to read is an integer register, which is required to compute the address; the address is an integer value. But a floating point load does require a write port, and that is what is mentioned here: three write ports, one each for the adder and the multiplier, and one shared between load and move. What are these move instructions? They are the moves from the integer file to the floating point file, and the floating point move instructions that move a value from one floating point register to another; those execute on the remaining unit, the adder. Of the four floating point units, one is for multiply, one for divide, one for square root, and add, subtract, move, negate, and everything else execute on the adder. Floating point stores, on the other hand, do require reading a floating point value, which is why the store shares a read port.
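Going back to the predicate vector for a moment, here is a minimal sketch of the idea: bit i tracks whether register i is zero, it is updated in the same step that writes the register, and a conditional move consults only that bit instead of reading the full value. The function names are mine.

```python
# Sketch of the 64-bit predicate vector attached to the integer file.

regs = [0] * 64
predicate = (1 << 64) - 1   # initially every register holds zero

def write_reg(i, value):
    """Write a register and update its predicate bit in the same step."""
    global predicate
    regs[i] = value
    if value == 0:
        predicate |= (1 << i)
    else:
        predicate &= ~(1 << i)

def movz(dst, src, cond_reg):
    """Conditional move: dst = src if regs[cond_reg] == 0.
    Only the predicate bit is consulted, not the 64-bit value."""
    if (predicate >> cond_reg) & 1:
        write_reg(dst, regs[src])

write_reg(3, 42)
write_reg(5, 0)
movz(1, 3, 5)       # r5 is zero -> the move happens
print(regs[1])      # 42
```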
[A student asks whether floating point loads and stores use the same address generation.] Yes, floating point and integer loads and stores share the same address generation units. Okay, next is write-back: as execution completes, the result is written back to the register file and also launched on the bypass network. As I mentioned, there is no need to wait, because renaming has guaranteed that each physical register is associated with a unique instruction. The results are also forwarded from the outputs of the functional units to their inputs, which guarantees that dependent instructions can be issued back to back and still receive the correct values. For example, here is the load whose dependent picks the value off the bypass; this works only because, as soon as an instruction completes, its result is launched on the bypass. So, for example, I can issue these two add instructions back to back. If you look at the pipeline timing for the two instructions, the first add executes here, and I want to position the second add so that it can pick up the first one's result from the bypass; so I issue it here. Essentially, on two consecutive cycles I issue the two add instructions one after another, without any bubble in between, and the second add still gets the value on time. The second add will actually read R3 from the register file here, which is a stale value, but that gets overridden by the bypass here. Retirement, or commit, is the last phase of an instruction's life. Immediately after instructions finish execution, they may not be able to leave the pipeline, because you have to guarantee in-order retirement, which is necessary for precise exceptions. An instruction retires when it comes to the head of the active list, and R10K retires up to four instructions every cycle; we discussed that last time, which is why the active list is four-way banked. So, what does retirement involve? It updates the branch predictor and frees the branch stack entry if the instruction is a branch. It moves the store value from the address queue entry to the L1 data cache if the instruction is a store; remember that the store value cannot go to the cache earlier, because you do not really know whether the instruction will retire — there could be a branch misprediction before the store, in which case the store may never retire. This has an implication: the address queue entries must have a field to hold the value of the store. Retirement also frees the old physical register mapped to the destination and updates the register free list; this we discussed last time.
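A compact sketch of these retirement actions, assuming one record per active list entry; the field names and the predictor stub are mine.

```python
# Sketch of in-order retirement from the head of the active list.
# Field names are illustrative; up to four entries retire per cycle.

def retire(active_list, free_list, dcache, predictor, width=4):
    retired = 0
    while active_list and active_list[0]["done"] and retired < width:
        e = active_list.pop(0)
        if e["kind"] == "branch":
            predictor.update(e["pc"], e["taken"])   # train the predictor
        elif e["kind"] == "store":
            dcache[e["addr"]] = e["value"]          # write the held value
        if e.get("old_preg") is not None:
            free_list.append(e["old_preg"])         # recycle old register
        retired += 1
    return retired

class StubPredictor:
    def update(self, pc, taken):
        pass

al = [{"done": True, "kind": "store", "addr": 0x40, "value": 7,
       "old_preg": 19}]
fl, dc = [], {}
print(retire(al, fl, dc, StubPredictor()), dc, fl)   # 1 {64: 7} [19]
```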
An integer or floating point instruction frees its issue queue entry immediately after it issues, but a load/store instruction holds its address queue entry until it retires. Why is that? What is the difference? What I am saying is that this add instruction, for example, frees its issue queue entry right here, as soon as it leaves the issue queue; what is the reason for a load to keep its entry? [Students suggest: a load may have to retry, and a store has to hold the store value.] Stores I can understand, but consider the retries: a load that misses the cache could free its entry once it completes all its retries, and that is still earlier than retirement. Why delay it longer? You do not know the number of retries in advance, which is fair enough, but the real reason lies elsewhere. The reason is purely a multiprocessor one. Let us say you are loading from address X, and some other processor modifies address X. What may have happened is that you have already loaded the value from address X, because the load has completed, but you have not yet retired, because you have not yet come to the head of the active list. While you are sitting in the active list waiting to retire, the other processor modifies address X. The question is: is it okay to retire this load without any concern? Remember that this load has loaded the previous value; it does not reflect the value produced by the other processor. Essentially, I have two processors P1 and P2; one has a load from address X, the other has a store to address X; the load happened before the store, but it retires after the store. In certain situations it is not correct to retire this load; you have to retry it even after it has completed. To catch that, you keep the address queue entry, and when the other processor's store becomes visible to this processor — it does not happen in exactly this way, but roughly speaking this is what happens — any processor whose address queue holds this address will retry the load and all its dependents, so the load gets the new value and the dependents read it. This is the reason why you hold the address queue entry until you retire. And the reason you cannot do this search on the active list instead is that the active list is not a searchable structure; it is not designed to be searched. Finally, you free the active list entry itself, and that completes the life of an instruction. Any questions?
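The multiprocessor check can be sketched as a snoop over the address queue: when a remote write to address X becomes visible, any completed but not yet retired load to the same block is forced to replay. This is a simplification of what the coherence protocol actually delivers to the processor.

```python
# Sketch: snooping the address queue on a remote write to address X.
# A completed load that has not yet retired must be replayed if its
# address matches; the active list cannot be searched, the queue can.

BLOCK = 32

address_queue = [
    {"op": "load",  "addr": 0x1000, "completed": True,  "retired": False},
    {"op": "load",  "addr": 0x2000, "completed": True,  "retired": False},
    {"op": "store", "addr": 0x1008, "completed": False, "retired": False},
]

def snoop_invalidate(addr):
    """Called when another processor writes the block holding addr."""
    block = addr // BLOCK
    replayed = []
    for e in address_queue:
        if (e["op"] == "load" and e["completed"] and not e["retired"]
                and e["addr"] // BLOCK == block):
            e["completed"] = False   # force the load (and deps) to retry
            replayed.append(hex(e["addr"]))
    return replayed

print(snoop_invalidate(0x1010))   # -> ['0x1000'] (same 32-byte block)
```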
So, all that is left is the memory hierarchy, which we have already discussed in part. There are on-chip instruction and data caches; both are two-way set associative. The data cache has a 32-byte line size, while the instruction cache has a 64-byte line size. Why? Instructions usually have higher spatial locality, because execution is usually sequential until we come to a branch that takes us somewhere else. That is the reason for the longer cache block size for instructions. Both caches are virtually indexed and physically tagged; note that the cache index comes from the virtual address. This has implications for how you design the cache, so let us try to understand them. With a 4-kilobyte page size, the data cache runs into the synonym problem. We discussed in class what the synonym problem is; just to remind you quickly, imagine that this is my data cache, which is virtually indexed, meaning I index it with the virtual address. Let us figure out how the address bits break down. The data cache has a 32-byte line size, so the block offset is 5 bits. How many bits is the index for the L1 data cache? You can verify that it is 9 bits, and the remaining bits form the tag. And we said the 44-bit virtual address translates to a 40-bit physical address. The page size is 4 kilobytes, so the page offset is the low 12 bits. Now, imagine two virtual addresses V1 and V2 belonging to two processes that want to share data through them. Essentially, the pages containing these two virtual addresses point to the same physical page. Remember that when you translate a virtual address to a physical address, the low 12 bits are unchanged, because they are the offset within a page. That means if I take V1 and translate it, those 12 bits remain unchanged, and since the two addresses share the same physical page, it is guaranteed that whenever they refer to the same physical cache block, those 12 bits are identical. What is not guaranteed is the upper two bits of the index — bits 12 and 13 — which come from the virtual page number; they may or may not be identical. Consider the situation when they are not. Then V1 maps to one cache index and V2 maps to a different cache index. Now the first process modifies this cache block through V1 and gets context-switched out; the second process reads through V2, from the other cache block, and gets a stale value. The sharing invariant is violated. This is called the synonym problem: these two blocks should behave as synonyms, which is not guaranteed. Is it clear? So, that is the first problem mentioned here: the upper two bits of the index create it. The second problem concerns the tag. We said this is a physically tagged cache, so the tag of each cache block comes from the physical address: the virtual address gets translated, and suppose we take the bits of the physical address above the index and offset — the upper 26 bits of the 40-bit physical address — as the tag. Let us see what problem that can create. Imagine two distinct physical addresses whose upper 26 bits are identical; their cache blocks will have identical tags. Now, I have a real problem if the lower 14 bits of the corresponding virtual addresses are also identical, because then the two cache blocks sit at the same cache index with exactly identical tags: there is no way to distinguish them; same index, same tag. Is it clear?
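The bit arithmetic behind the first problem is easy to check in code: with a 5-bit block offset, a 9-bit index, and a 12-bit page offset, index bits 12 and 13 come from the virtual page number, so two virtual aliases of the same physical data can land in different sets.

```python
# Synonym demo: 5-bit block offset + 9-bit index = 14 bits,
# but only the low 12 bits survive translation unchanged.

def l1_set(vaddr):
    return (vaddr >> 5) & 0x1FF     # 9-bit index from the virtual address

# two virtual aliases of the same physical page:
# same page offset (0x340), but bits 12-13 differ
v1 = 0x00001000 | 0x340
v2 = 0x00003000 | 0x340

assert (v1 & 0xFFF) == (v2 & 0xFFF)  # page offset identical
print(hex(l1_set(v1)))               # 0x9a
print(hex(l1_set(v2)))               # 0x19a -> different set: synonym!
```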
To be concrete, I am talking about two cache blocks whose physical addresses agree in the upper 26 bits and whose virtual addresses agree in the lower 14 bits; that can happen. So, there are two problems; how do you fix them? The slide gives the solution to the second problem: MIPS R10K uses the complete physical page number as the tag, that is, all 28 upper bits of the physical address. Does this solve the second problem? What do you think? If it did not, there would be two distinct physical addresses with identical upper 28 bits and identical lower 12 bits; but those are identical addresses, because translation does not change the lower 12 bits. So, using the complete physical page number as the tag solves the second problem, and it costs only two extra tag bits per block. Now, what is the usual solution for avoiding synonyms? [A student answers: page coloring — you bin the pages so that all synonyms get the same index bits.] Can you explain what it does? You have these two bits, bits 12 and 13, in the virtual address. You divide your virtual pages into four colors based on these two bits, and whenever two processes request two virtual pages to be shared, the pages are picked from the same color, which guarantees that these two bits are identical. That takes care of the synonym problem, but there is an additional issue. Beyond this we have an L2 cache, and we said that R10K has an inclusive cache hierarchy: if you replace a block in the L2 cache, you must invalidate that block in the L1 cache to maintain inclusion. Now imagine the problem: your L2 cache is physically indexed and physically tagged, so when you replace a block in the L2 cache, you can derive its physical address from the cache index and tag. You then send this physical address to the L1 cache, asking it to invalidate the block. But the L1 cache has no way of finding the block given only a physical address, because it is indexed by the virtual address. However, if I also give you the two relevant bits of the virtual address, you can derive the virtual index, because the low 12 bits are the same in the physical and the virtual address: take the two stored virtual bits, combine them with the physical bits, form the L1 index, and invalidate the block. This is why the L2 cache maintains these two bits of virtual index with each block: so that the invalidations needed for inclusion can be performed. When you go to multiprocessors, there are many other messages that must look up the L1 cache in the same way. Remember, in addition to this you still have to do page coloring; storing the virtual bits does not by itself solve the synonym problem. What it does take care of is designing an inclusive hierarchy in which the innermost cache is virtually indexed. Is it clear?
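Here is a sketch of the inclusion invalidation using the two stored virtual index bits; the data structures are simplified placeholders for the L1 state.

```python
# Sketch: invalidating a virtually indexed L1 block on an L2 eviction.
# The L2 stores the two virtual index bits (vaddr bits 13:12) per block,
# so the L1 set can be reconstructed from the physical address.

l1_valid = {}     # (l1_set, tag) -> True, a toy stand-in for the L1

def l1_index(paddr, vbits):
    # low 12 bits are shared between vaddr and paddr; bits 13:12 are
    # the stored virtual bits; drop the 5-bit block offset
    return ((vbits << 12) | (paddr & 0xFFF)) >> 5

def l2_evict(paddr, stored_vbits):
    ppn = paddr >> 12                      # full physical page number tag
    l1_valid.pop((l1_index(paddr, stored_vbits), ppn), None)

# fill one L1 block, then evict its L2 copy
pa, vb = 0x24681340, 0b10
l1_valid[(l1_index(pa, vb), pa >> 12)] = True
l2_evict(pa, vb)
print(l1_valid)    # {} -> inclusion maintained
```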
Our data cache has four ports. What are these four ports? One is used by refills, one by the issued addresses, one by the load retry unit, and one by store graduation. Remember that when a store commits, it writes to the cache; the retry unit has a dedicated port; the issued addresses are the usual addresses that come to the cache from the address generation unit; and the refills come from the next level of the hierarchy. Next, the cache reads the data RAM of both ways speculatively. As we said, the L1 cache is two-way, so it actually reads the data from both ways in parallel and then selects one of them, or neither, based on the tag RAM comparison. What was the other option? The other option is to first do the tag RAM lookup, compare, figure out whether you actually have a hit, and only then look up the data RAM for the matching way. By doing the two in parallel you save time: you look up the data RAM and the tag RAM together, both come out together, you do the hit check on the tags, and then you decide whether to pick one of the two data RAM outputs or nothing. You save time; what do you sacrifice? You consume more power. The power consumption can in fact be nearly double, because even on a hit, when only one way is needed, I am accessing both ways in all cases. This is often called parallel tag/data lookup, and it is traditionally used in L1 caches to save time; that is the main purpose. Since L1 caches normally have small associativity, the energy consumption stays within limits; but if you are really targeting a low-power design, then you probably will not do this: you sacrifice latency but save energy. I think the slide on the L2 cache was the last one; I will either comment on it next time, or let us try to take it now.
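The trade-off can be made concrete with a toy cost model; the cycle and energy numbers below are invented, and only the relative comparison matters.

```python
# Toy cost model for parallel vs. serial tag/data lookup in a 2-way cache.
# The numbers are invented; only the relative trade-off is the point.

TAG_CYCLES, DATA_CYCLES = 1, 1
TAG_ENERGY, DATA_ENERGY = 1, 4        # per-way access energies

def parallel_lookup(ways=2):
    latency = max(TAG_CYCLES, DATA_CYCLES)        # overlapped
    energy = ways * (TAG_ENERGY + DATA_ENERGY)    # all ways read always
    return latency, energy

def serial_lookup(ways=2):
    latency = TAG_CYCLES + DATA_CYCLES            # tags first, then data
    energy = ways * TAG_ENERGY + DATA_ENERGY      # one data way on a hit
    return latency, energy

print("parallel:", parallel_lookup())   # (1, 10): faster, more energy
print("serial:  ", serial_lookup())     # (2, 6):  slower, less energy
```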
So, let us try to take it now. The L2 cache is a two-way pseudo set-associative cache, configurable at boot time. What is two-way pseudo set-associative? You look up one of the ways first, and if the block is not there, you look up the second way; that is, you serialize the lookups to the two ways. So, there is a fast hit and a slow hit. Why was it done this way? [A student suggests: we speculate that one of the ways is more likely to hold the block — put the most recently used block in that way, and then you often do not have to look up the other way.] What do you gain? Time: the fast hit latency is the same as that of a direct-mapped cache. That is what the slide says. But they did not serialize the lookups just for the sake of direct-mapped latency on fast hits; there is a separate, more compelling reason, and it has to do with the L2 cache being off chip. The processor package does not contain the L2 cache; it sits outside in separate chips, so you have to read the tags and bring them on chip. [A student suggests bandwidth: looking up one way at a time moves less data per access.] Yes, and why is that important? Because the wires go outside the chip, on the board. Think of the reason why addresses are multiplexed in main memory: it is a pin limitation, and that is the main reason here too. You need pins to bring the tags on chip, because the L2 cache is off chip, and to read two tags in parallel you would need roughly twice the tag pins; the pin budget simply cannot accommodate that, so the two tag reads are serialized. Now, there is an MRU way-selection RAM that is maintained on chip. It is a kind of predictor, you can say: for each L2 cache set, it records which way is the most recently used (MRU) way, and the prediction is that this is the way where you are likely to hit this time, because it was the MRU way last time. In the first cycle, 16 data bytes of the selected way are read in parallel with its tag; in the next cycle, the next 16 data bytes of the selected way are read in parallel with the tag of the alternate way — the alternate way is addressed by toggling one extra address bit. As the tags arrive on chip, they are compared; a hit on the first tag returns the predicted data in the same cycle. Notice the 32-byte transfer here: the L1 cache line is 32 bytes, so whenever the L1 cache requests data from the L2 cache, it always comes as a 32-byte chunk, 16 bytes per cycle over the 128 data pins. And however many bits you need for the tag is determined by the capacity of the cache.
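Finally, a sketch of the two-cycle MRU-first lookup: the on-chip MRU RAM chooses which way to try first, a first-way hit has direct-mapped latency, and the alternate way costs an extra cycle.

```python
# Sketch of the two-way pseudo set-associative L2 with MRU prediction.
# tags[s] holds the two tags of set s; mru[s] is the way tried first.

tags = {5: [0xAAA, 0xBBB]}
mru = {5: 0}

def l2_lookup(cset, tag):
    first = mru[cset]                 # way predicted by the MRU RAM
    if tags[cset][first] == tag:
        return "fast hit (direct-mapped latency)"
    alt = 1 - first                   # toggle the extra address bit
    if tags[cset][alt] == tag:
        mru[cset] = alt               # update the prediction
        return "slow hit (one extra cycle)"
    return "miss"

print(l2_lookup(5, 0xAAA))   # fast hit
print(l2_lookup(5, 0xBBB))   # slow hit; 0xBBB becomes MRU
print(l2_lookup(5, 0xBBB))   # fast hit now
```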