 discussed how to reduce these three parameters. So, today what I want to do is cover some discrete topics in memory systems before we move on to the discussion of main memory. So, last time we discussed a little bit about the exclusive hierarchy. Essentially we are talking about two levels of the cache hierarchy which are exclusive, and the reason for this was that inclusion has a couple of drawbacks. What are they? First, there is replication across the two levels of the cache: if one level is inclusive of the other, then L1 will be a subset of L2, which necessarily means a waste of space. The second problem is that when you evict a block from the outer-level cache, when you evict a block from L2, you have to invalidate the block in L1 also; otherwise L1 may not remain a subset of L2. Now, this is a little problematic, because you are essentially taking away a block from the L1 cache simply because you are evicting something from the L2 cache, but it may very well happen that the block in the L1 cache is heavily used by the processor. So, if you examine it a little more carefully, consider a cache block which is very heavily used by the processor. The processor pipeline will essentially access the block from L1; it will have several hits which L2 will not get to know about, because we do not really access L2 if we get the block in L1. So, what may happen is that this block eventually becomes the LRU block in the L2 cache, because there is no access to it in L2. Eventually the L2 cache may actually evict the block because it is the LRU block, and when it evicts the block, to maintain inclusion it will also invalidate the block in L1, which may actually hurt performance, because this is a very hot block heavily accessed by the processor. So, these are called back invalidations, triggered by L2 evictions. What will happen when you evict such a block? Immediately the processor will access the block, have a cache miss, and bring the block back into L1 and L2, and this cycle will get repeated: again you access the block only from L1, L2 will think it is a cold block, it will become LRU eventually, and then it will again be evicted. So, ultimately, because of the back invalidations you will see extra cache misses which you would not have if you did not have to worry about inclusion. These are called back invalidations, or sometimes the inclusion overhead: just because of inclusion you are introducing these extra cache misses. Is that clear? If you have an exclusive hierarchy, both of these problems go away: you do not have any replication across levels, and you do not have to worry about back invalidations; whenever you evict a block from L2 you can just evict it, because there is no subset guarantee to maintain. All you have to make sure is that the two levels stay exclusive, that is, disjoint. So, whenever a block is in L2 it is not in L1, you know that, so when you evict a block from L2 you can just evict it without notifying L1. So, what is the tradeoff between inclusion and exclusion? This also we discussed last time: exclusion buys you capacity, because there is no duplication, but it requires more bandwidth, because whenever a block is evicted from L1 it has to be allocated into L2.
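To make the back-invalidation mechanism concrete, here is a minimal toy sketch in C of what an inclusive L2 eviction has to do. All of the structures, sizes, and names are invented purely for illustration; a real controller implements this in hardware, not software.

/* Toy model of back-invalidation in an inclusive two-level hierarchy.
 * Sizes, structures, and the choice of victim are purely illustrative. */
#include <stdio.h>
#include <string.h>

#define L1_LINES 4
#define L2_LINES 16

typedef struct { unsigned long tag; int valid; } line_t;

static line_t l1[L1_LINES], l2[L2_LINES];

/* Evict one L2 line (the LRU victim, chosen elsewhere). */
static void l2_evict_inclusive(int victim_way)
{
    unsigned long tag = l2[victim_way].tag;
    l2[victim_way].valid = 0;

    /* Inclusion: the same block must also be thrown out of L1,
     * even if the processor is still hitting on it there. */
    for (int i = 0; i < L1_LINES; i++) {
        if (l1[i].valid && l1[i].tag == tag) {
            l1[i].valid = 0;                            /* the back-invalidation */
            printf("back-invalidated tag %lx from L1\n", tag);
        }
    }
}

int main(void)
{
    memset(l1, 0, sizeof l1);
    memset(l2, 0, sizeof l2);
    l1[0] = (line_t){ .tag = 0xabc, .valid = 1 };   /* hot block, hit often in L1 */
    l2[3] = (line_t){ .tag = 0xabc, .valid = 1 };   /* its mandatory copy in L2 */
    l2_evict_inclusive(3);   /* L2 sees no hits on it, so it becomes the LRU victim */
    return 0;
}

The point to notice is the loop at the end: L1 loses the block not because L1 wanted to evict it, but because L2 did.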
So, the evicted block has to traverse the bus or interconnect, whatever sits between L1 and L2, and it has to be allocated in L2 on every L1 eviction, which is a cost that is not present in inclusion. Now, it is possible to make exclusive caches bandwidth efficient. Essentially the idea is that when you evict a block from L1, you put a pattern learner here which monitors the blocks that are getting evicted from L1 and figures out which blocks are not likely to be reused in the future; those blocks will not actually go to L2, they will just be dropped, because there is no point keeping them in the cache. I am not going to go into the details of this particular gadget; essentially the idea is that you monitor the accesses these blocks receive inside L1 and, based on that, decide whether a block is dead or live. Now, there is a middle ground that processors often take, which is called non-inclusive non-exclusive. What is that? Essentially the idea is that your L1 is not a subset of L2, but it is also not disjoint from L2. What happens in this case is that when you bring a block from memory you fill it into L2 and into L1, both, but when you evict something from L2 you do not invalidate the block in L1. So, essentially I am taking away this part of the problem, the back invalidations, but I still have a window where blocks are duplicated across the levels of the hierarchy. When you evict a block from L2 you do not notify L1, and this is where the subset property gets violated. So, what you get is neither strictly inclusive nor strictly exclusive.
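Before moving on, here is a rough sketch of the eviction filter mentioned above for the exclusive hierarchy. The "dead if never reused since fill" heuristic is just one plausible policy assumed for illustration, not the actual mechanism of any particular design, and the helper functions are stubs.

/* Sketch of an L1 eviction filter for an exclusive hierarchy: decide whether
 * an evicted block is worth allocating in L2 at all.  The reuse-counter
 * heuristic and all names are assumptions for illustration. */
#include <stdbool.h>

typedef struct {
    unsigned long tag;
    bool valid;
    bool dirty;
    unsigned reuse_count;    /* L1 hits seen since the block was filled */
} l1_line_t;

static void write_back(l1_line_t *l)     { (void)l; /* push dirty data toward memory */ }
static void allocate_in_l2(l1_line_t *l) { (void)l; /* insert the block into L2 */ }

/* Hypothetical predictor: a block that was never reused in L1 is assumed
 * unlikely to be reused soon, so we do not spend L2 space and bandwidth on it. */
static bool predicted_dead(const l1_line_t *line)
{
    return line->reuse_count == 0;
}

void on_l1_eviction(l1_line_t *victim)
{
    if (victim->dirty)
        write_back(victim);              /* dirty data must survive regardless */
    else if (!predicted_dead(victim))
        allocate_in_l2(victim);          /* the exclusive fill-on-eviction */
    /* else: drop the block silently and save the L1-to-L2 transfer */
    victim->valid = false;
}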
So, as I said, I will essentially go through some discrete topics today. The second one is the trace cache. This was used in Intel processors; it was introduced in the Pentium 4. Essentially the idea was that you need to supply good instructions quickly into the pipeline. By good instructions I mean that you almost always want to remain on the correct path: you have to make correct branch predictions and stay on the correct path, because if you are not, you are essentially processing instructions which will eventually be thrown away, and you are wasting time. Of course, one way to achieve this is to have smart branch predictors; another way is to have a trace cache. So, what is this? Often the predicted target of a branch falls in a different cache block, and maybe in the middle of that block. This also we talked about last time: you are currently executing an instruction in this particular cache block, suppose you have a branch instruction somewhere here, and the target of this branch falls in the middle of some other cache block. You execute this branch, and the next instruction you execute is the target. So, clearly parts of both cache blocks are going to be wasted: it wastes the tail part of the cache block containing the branch, and it wastes the head part of the target cache block. And the worst part is that you cannot fetch from the target in the same cycle, even if the prediction is correct, for the simple reason that you usually fetch from a single cache block in a particular cycle; you cannot fetch from multiple cache blocks in one cycle. You could have a multi-ported instruction cache, and we will talk about multi-ported caches, but normally it is very hard to design one at high frequency, because additional ports usually slow down the instruction cache. So, what the Pentium 4 did was build traces dynamically, and branch prediction validation becomes part of the hit test. What it does is that on the fly, whenever you have a branch instruction like this one, you find the target and you prepare a trace of instructions where the target instruction appears right after the branch instruction. So, essentially you are preparing a trace cache block where this instruction appears right after this instruction, and you prepare these traces on the fly as and when you encounter these branches and targets. It was introduced in the Pentium 4, which did not have an L1 instruction cache but only had a trace cache. And while building the trace lines, the Pentium 4 also translated the IA-32 instructions into micro-operations, off the critical path, so it actually saves a lot of decoding time there. So, here is what is happening: I am executing like this, I hit a branch, then I jump here, and then I again execute like this; maybe I encounter one more branch here, and then I jump to some other cache block with the target here. So, what will the trace have? Let us mark this first portion as A, this branch as B, this target instruction as C, this next portion as D, this second branch as E, and this target portion as F, and so on. The trace will have A B C D E F, etcetera. The trace will not have the skipped parts; they are dropped completely. This trace is prepared on the fly, it becomes a trace cache block, and it is stored in your trace cache. Next time, whenever you encounter A, you will pull this whole trace cache block out. However, now the question is, next time when I encounter B, will it go to C again? That may not be the case; it may actually go somewhere else, for example through the fall-through path. So, that is the point here: branch prediction validation becomes part of the hit test. The trace cache lookup not only checks the tag of the block, it also checks the branch predictions for the branches that are inside the trace. They also have to match; otherwise the trace will be rejected and fetch will fall back to the traditional path. Is the concept of trace caches clear? Now, one major problem of the trace cache was that it may contain duplicate partial traces. Why is that? How do you get duplicate partial traces? The same piece of code may be part of two different paths: you can execute through this path or through this path, and there can be a common portion of both of these traces, so it may appear in two different trace lines in two different portions of the trace cache. It is really very hard to avoid duplicate partial traces. Anyway, I will keep it to that only. If you really want to delve deeper into trace caches, let me know, I can give you a paper which talks about them; also, if you want to know more about the exact architecture of the Pentium 4 trace cache, let me know and I will give you a paper. So, when is the trace cache looked up? At fetch time, yes, ahead of the instruction cache; if it misses in the trace cache, only then would the instruction cache be accessed. That is right. Interestingly, the Pentium 4 did not have an L1 instruction cache, so on a trace cache miss it accessed the L2 cache. What about self-modifying code? Yes, in that case the affected traces would have to be flushed out of the trace cache. Any other questions? All right.
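To summarize the hit test in code form, here is a small sketch of the check a trace cache lookup has to perform. The field names, the maximum number of embedded branches, and the overall layout are assumptions made for illustration; they are not the actual Pentium 4 organization.

/* Sketch of a trace-cache hit test: the lookup must match the starting
 * address (tag) and every branch direction that was assumed when the trace
 * was built.  Layout and sizes are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_BRANCHES 4

typedef struct {
    uint64_t start_pc;                 /* address of the first instruction of the trace */
    int      num_branches;             /* branches embedded in the trace */
    bool     taken[MAX_BRANCHES];      /* direction assumed for each embedded branch */
    /* ...the decoded micro-operations of the trace would be stored here... */
} trace_line_t;

/* A hit requires the tag match and agreement with the current predictions. */
bool trace_hit(const trace_line_t *t, uint64_t fetch_pc,
               const bool *current_predictions)
{
    if (t->start_pc != fetch_pc)
        return false;
    for (int i = 0; i < t->num_branches; i++)
        if (t->taken[i] != current_predictions[i])
            return false;              /* reject the trace, fall back to normal fetch */
    return true;
}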
So, we touched on multi-ported caches there; let me try to offer some more detail here. What is a multi-ported cache? Essentially the idea is very simple: you want to support multiple accesses to the cache in the same cycle, that is, multiple parallel accesses. How do you do that? You need one port per access; for example, if I have a dual-ported cache I may be able to access two different cache blocks in the same cycle. Now, instruction caches are normally not multi-ported. So, the question is, how can I fetch multiple instructions per cycle? We said that this is one important requirement of a superscalar processor. One simple solution is that you have multiple instructions in a cache block: you just bring one cache block and hopefully you have many instructions within that block. If you really want the effect of a multi-ported cache, at least to some extent, you can have a bank-interleaved cache. For example, if you have two separate banks of the instruction cache, you can access both banks in parallel while keeping each bank single-ported. That will still give you some benefit, but it will not give you the full freedom, because if the two cache blocks you want to access fall in the same bank, that will not be allowed, whereas it would be allowed in a true dual-ported cache. So, that is what is actually used: to the external world, a bank-interleaved cache more or less looks like a multi-ported cache, but the memory cells are not multi-ported; you just have a bunch of single-ported banks, that is all. So, you get some amount of concurrency, but not full concurrency. That is about the instruction cache. The question now is, do we need multiple data cache accesses per cycle? The answer is normally yes, because memory is the bottleneck, so it always helps to have multiple concurrent accesses to the data cache; it is good to have better data cache throughput. So, normally you find that data caches are dual-ported, at most dual-ported; normally they do not have more than two ports. We have already discussed what a store does when it executes, so I am not going to get into that; you might want to remind yourself of it, which is why this question is here. So, that is about multi-ported caches. To summarize, instruction caches are normally not multi-ported; we will revisit this question when we discuss hardware multithreading, which we pick up after the break, where we will see another motivation for having multi-ported instruction caches and see how to get around it. Data caches are normally multi-ported, but the number of ports is small.
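Here is a small sketch of the bank-conflict check that a bank-interleaved cache effectively performs: two accesses can proceed in the same cycle only if they map to different banks. The block size, bank count, and interleaving function are example choices, not any particular machine's parameters.

/* Sketch of bank interleaving: the low-order bits of the block address pick
 * the bank, and two same-cycle accesses conflict if they pick the same bank.
 * All constants here are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BANKS    2             /* each bank is single-ported */

static unsigned bank_of(uint64_t addr)
{
    return (addr / BLOCK_BYTES) % NUM_BANKS;   /* interleave at block granularity */
}

/* True if both accesses can be serviced in the same cycle. */
bool can_access_in_parallel(uint64_t addr_a, uint64_t addr_b)
{
    return bank_of(addr_a) != bank_of(addr_b); /* same bank => serialize */
}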
So, now a little bit about input/output. I am just trying to touch upon the different things that interact with the memory hierarchy. We have not yet discussed I/O in this course, but I assume all of you have seen at least some basics of input/output. Essentially, the point is that there are many peripheral devices that the computer talks to: a keyboard, your display device, printers, CD drives, USB drives, speakers, etcetera. The question that I am trying to ask here is as follows; let me give you an example. Suppose I am reading some input from a keyboard; the user is typing in certain things. How does it work? Usually, the data that is being typed is stored by the keyboard controller in a small buffer, and when the buffer fills up, it sends an interrupt to the CPU. The CPU then picks up the data and copies it from the keyboard controller's internal buffer into memory. So, the question is: does it copy it into the memory or into the cache? Where does the data go? Once the data has been copied, the CPU can work on it; it can parse the data, interpret it in whatever way it wants. For example, when you say scanf with %d and &x, what does it ultimately do? It reads in an integer and puts the value at that particular address, whatever is specified there. Is this address cacheable or not? Does the data go into the cache or does it not? Next, I may say x++; that is a CPU instruction, and x will be brought into the cache to be incremented. So, essentially there are two copies going on. The first one: scanf involves a system call, in which the operating system copies the data from the keyboard controller's buffer into a kernel-space buffer, and then the contents of the kernel-space buffer are copied to this particular address, &x. The question is, where is the kernel-space buffer? Is it inside the cache or not? The address &x is of course in user space; it is an arbitrary user address, so it is cacheable and will be inside the cache. But what about the kernel-space buffer, the buffer used for doing the I/O? Someone says maybe it is there in the cache. Sorry, say it again? No, &x is a user-space address, so that one is of course cacheable and will be inside the cache. The question is about the intermediate copy: the data is copied from the keyboard buffer into a kernel-space buffer, and then the kernel-space buffer contents are copied to that particular address, which completes the scanf. Where is this intermediate buffer? Is it inside the cache or is it not? Does it have to be in the cache, because there is no way to avoid the cache when the processor issues ordinary addresses? Well, what if I mark that particular page as un-cacheable? It can be like that: the hardware does not know, but the operating system knows, and the operating system may choose to make the I/O buffer un-cacheable. It can generate a TLB translation where the attribute bits say that the page is un-cacheable; that is possible. But what happens if I make it un-cacheable? It will take a long time; exactly. Here I am talking about a single variable, but you may take a bunch of inputs; for example, you may be reading from a file, and then it would take a long time to copy from kernel space to user space if every access had to go to memory. So, normally the kernel-space buffers are cacheable. Now, that leads to a big problem. After a while, I am fine, I am happy, I have read x, I continue, and then I say scanf with %d and &y. Remember that the kernel-space buffer is now cached. The problem is that when I try to read y, I may actually get a stale value, because the old contents of the kernel-space buffer, the copy that is inside the cache, may get copied into y. This is called a coherence problem. So, that is what is being summarized here: the data comes in as a stream and you read from the head of the stream.
So, once you read, the stream advances to the next element. Now, what the stream contains is copied into an internal kernel-space buffer. This buffer must be written first: there is a copy from the keyboard controller to the kernel space, and that copy must execute before this scanf can complete. That is exactly what is being said here. I/O through the cache poses no correctness problems: the point is that when you keep the buffer in the cache, as was already mentioned, the I/O results in a write to this kernel-space buffer, which is already cached, so when the processor tries to copy from that buffer it will get the correct contents and put the value at the address of y. But direct I/O to memory may leave stale data in the cache. Here the point is the following. Suppose the buffer actually was in memory, it was not cached, but the processor has to copy x from the buffer. What does it do in the process? It actually brings the buffer into the cache, because think of what kind of instructions it will generate: it will say, load from some address in the buffer space, and then store that value into the program's address. So, there is a pair of a load and a store, and the load goes through the cache. The point is that, in doing this, you have essentially copied the buffer from memory into the cache. Now suppose the I/O happens directly to memory, as I said: the kernel copies the keyboard buffer contents directly to the memory buffer, bypassing the cache. And then the processor does the same thing again; it starts copying from the kernel-space buffer to the destination address, but now it is going to take a cache hit. Let me show you. This is my memory area, and this is the cache. Let us talk about this case, direct I/O to memory. What does the kernel do? The kernel copies the keyboard controller's buffer contents into memory; let us suppose this is my input buffer, and it puts the data from the keyboard controller's buffer into it. Then it is the processor's job: the system call handler has essentially done its part, it has read from the keyboard into the kernel-space buffer, and now the processor has to copy the contents of this buffer into that particular address. So, it generates a bunch of instructions, two of which, the most important ones, would be a load word that loads from the input buffer address into some register, say into $2, and then a store word of $2 into the address of x, and that completes the first scanf. What is the effect of this? Essentially, by these two instructions I have done two things: I have made a copy of the cache block containing the input buffer address inside my cache, and I have made a copy of address x inside my cache. Now the second scanf happens. What does the kernel do? It reads the keyboard controller's buffer and copies the contents into the memory buffer; remember that I am doing direct I/O to memory, and that is the system call handler's job. Now, what does the processor do? The processor has to copy these contents into &y, so it starts the process again; it does exactly the same thing, except that the destination will now be y. So, what is going to happen? As soon as it executes the load instruction it gets a cache hit; whatever stale value is sitting in the cache is loaded into $2, which is actually the wrong value, and then it executes the store.
So, essentially, address y gets the value of x, whatever the value of x was, while the actual new contents are sitting in memory. That is the problem here: direct I/O to memory may leave stale data in the cache. On the other hand, if the I/O actually happened through the cache, this problem would not arise, because the kernel-space input buffer would be updated in the cache; I would always get the correct contents from the cache, and I would not be making two inconsistent copies. So, you could let the I/O device write into the cache as well; that works without any problem. But if you say, well, I do not allow my I/O devices to interfere with the processor caches, then you have to come up with a solution for this problem. So, how do you solve it? Here are a couple of solutions for direct memory I/O. The first: the syscall handler must ensure that cache lines belonging to the input buffer are flushed to memory before starting the I/O; essentially, what I am saying is that this particular cache line should be flushed from the cache before the input/output operation starts. Processors come with privileged selective or full cache flush instructions, and you can make use of them to do this. The second solution is that the hardware must check whether some I/O device is reading or writing a memory line that is currently cached. This requires some extra hardware support: whatever address flows between the I/O device and memory must be snooped by the processor cache to see whether that particular address is currently cached. If yes, the cache line must be invalidated in the entire cache hierarchy, and written back first in case it is dirty. Now, direct I/O to memory is very attractive to operating system designers, because the I/O does not interfere with the processor cache; there are many pieces of data, a lot of kernel-space data, that the processor cache should not have to hold, and that pollution can be avoided completely if you do the I/O in memory. The downside is that you have to handle this coherence problem. I have given you an example with input here; the exact same problem can arise with output also: you may end up sending to the display device something you did not want to print, some old data. We will soon pick up input/output in more detail, and there again we will come back to this problem with more examples. Any question on this? Is this clear? The takeaway point here is that even in a single-processor system you have to worry about the problem of cache coherence, which people usually talk about in the context of multiprocessors. Even on single processors you have to support some minimal amount of cache coherence to be able to do I/O correctly, because the I/O devices can be seen as independent processing entities: you have a main processor, and you have certain other I/O devices that can independently manipulate data, so you have to have some solution for coherence. Question? Yes, it can be cacheable; you can make it either way. Depending on the operating system, it varies exactly which kernel data is cacheable and which is not; usually that comes as part of your page table attributes, so the page table entry will say whether the page is cacheable.
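Going back to the first of the two solutions above, here is a rough sketch of the flush loop the syscall handler would conceptually run over the input buffer before starting direct-to-memory I/O. The flush_line() helper is a hypothetical stand-in for whatever selective flush/invalidate instruction the ISA provides; the line size is an example value.

/* Sketch: flush every cache line covering an I/O buffer before the device
 * writes that buffer directly in memory.  flush_line() is a hypothetical
 * hook, not a real API; a real kernel would use the ISA's flush instruction. */
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64                   /* example cache-line size */

static void flush_line(volatile void *p)
{
    (void)p;  /* would execute the machine's line flush/invalidate here */
}

void flush_buffer_before_io(void *buf, size_t len)
{
    uintptr_t start = (uintptr_t)buf & ~(uintptr_t)(LINE_BYTES - 1);
    uintptr_t end   = (uintptr_t)buf + len;

    for (uintptr_t p = start; p < end; p += LINE_BYTES)
        flush_line((void *)p);
    /* Only after this may the device deposit data in the buffer; otherwise a
     * later load could hit a stale copy of the buffer in the cache. */
}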
So, next I wanted to touch upon multi-level paging. How many of you have seen this in the operating systems course? Multi-level paging. Good. So, I will just go over this quickly, since I have already discussed paging. There is a problem with handling large page tables; let us try to understand what the problem is. Consider the Alpha 21264; this example is taken directly from your book, so you can go back and read it in more detail, this is just a summary. The Alpha 21264 implements segmented paging, but we are going to ignore that point and assume a flat paging scheme. It has an 8-kilobyte page size, which means a 13-bit page offset, and each page table entry is 8 bytes wide. It has a 30-bit virtual page number, so that gives you a 43-bit virtual address; actually the full virtual address is wider, and the remaining bits are used for another purpose that we will come to. For our purposes, what is important is that we have a 30-bit virtual page number. With 8-byte page table entries, this leads to an 8-gigabyte page table: 2 to the power 30 entries multiplied by 8 bytes, which is gigantic. You can imagine: just to support virtual memory, you would have to devote 8 gigabytes of memory for storing the page table. This does not make any sense; you cannot hold the entire page table in memory. What you do is swap page table sections in and out, which means you page the page table: the simple solution is to treat the page table as normal memory and page it also. So, the 21264 provides three levels of page tables. The page table base register holds the base address of the L1 page table, and each level can hold 1024 page table entries. This is what it looks like: the virtual page number is split into three 10-bit fields, and each field indexes into one level of the tables. The page table base register holds the base address of the L1 table, and these first 10 bits tell you where to index into it. Now, what does that entry hold? What do I need to know to continue the walk? The base address of the page table for the next 10 bits, that is, the base of the L2 table. The next 10 bits give the offset into the L2 table, and that entry holds the base of the L3 table. And what does the L3 entry contain? The base address of the page, that is, the physical page number. These are the page table entries. So, that is how you access it. Essentially, all you need to keep in memory is these three tables. How big are they? Each one contains 1024 page table entries, each entry is 8 bytes, so each table is 8 kilobytes. And how much is that? Exactly a page. So, I just need to hold three pages, that is it. Usually the page tables are sized in such a way that each one is exactly a page. So, at any point in time you have to hold one page of each of these three levels: instead of holding an 8-gigabyte page table, you just need to hold three pages. Now, the Alpha 21264 is a 64-bit processor, and this only accounts for 43 bits of the virtual address. What about the upper 21 bits? The remaining 21 bits of the virtual address are used to identify the segment; it supports three segments, so there are many unused bits there, since three segments do not need 21 bits to be identified. You can read up on this example in your book; it has more details about the Alpha 21264 virtual memory.
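Here is a small sketch of what the three-level walk just described looks like if you write it out in software, using the same field widths (8 KB pages, 13-bit offset, three 10-bit indices). The PTE layout is deliberately simplified and assumed for illustration: each entry holds just the base of the next level, or the physical page number at the last level, with no valid or protection bits.

/* Sketch of a three-level page-table walk with 8 KB pages: three 10-bit
 * indices plus a 13-bit offset.  The PTE format here is a simplification
 * assumed for illustration (no valid/protection bits, raw next-level bases). */
#include <stdint.h>

#define PAGE_SHIFT 13                         /* 8 KB pages  */
#define INDEX_BITS 10                         /* 1024 entries per level */
#define INDEX_MASK ((1u << INDEX_BITS) - 1)

typedef uint64_t pte_t;                       /* 8-byte page table entries */

/* ptbr points at the L1 table; each entry gives the base of the next level,
 * and the L3 entry gives the physical page number. */
uint64_t translate(const pte_t *ptbr, uint64_t vaddr)
{
    uint64_t i1  = (vaddr >> (PAGE_SHIFT + 2 * INDEX_BITS)) & INDEX_MASK;
    uint64_t i2  = (vaddr >> (PAGE_SHIFT + INDEX_BITS))      & INDEX_MASK;
    uint64_t i3  = (vaddr >>  PAGE_SHIFT)                     & INDEX_MASK;
    uint64_t off =  vaddr & ((1u << PAGE_SHIFT) - 1);

    const pte_t *l2 = (const pte_t *)(uintptr_t)ptbr[i1];    /* base of L2 table */
    const pte_t *l3 = (const pte_t *)(uintptr_t)l2[i2];      /* base of L3 table */
    uint64_t ppn    = l3[i3];                                 /* physical page number */

    return (ppn << PAGE_SHIFT) | off;         /* physical address */
}

The sizing arithmetic from the lecture also falls out of this: one table per level is 1024 entries times 8 bytes, that is 8 KB, which is exactly one page.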
So, now we will switch a little bit: we are essentially moving out of the processor to take a look at main memory. What does it look like? The outermost level of the cache hierarchy talks to main memory on a cache miss. For example, if I have an L1 and an L2, only on an L2 miss will I talk to main memory; otherwise I am happy with my cache hierarchy, I can get the data from there. And of course, the I/O devices may have a separate path to main memory. Now, the interface to memory is normally through a memory controller. The memory controller is connected to the bus interface unit, which in turn connects the system bus, also traditionally known as the front-side bus, to the outermost-level cache controller. So, let us suppose this is my L3 cache, I have L2 and L1 before it, this is my memory controller, and this is my main memory; the L3 controller talks to the memory controller, and there may be other paths through which the I/O devices talk to the memory controller. This is the bus which connects the L3 cache and the memory controller. The cache controller, that is, the L3 cache controller, usually puts a request in a buffer, and the bus interface unit grants the bus to the request at some point. Talking about this particular bus: there is a queue here where L3 puts the requests; eventually a request is picked up by the bus interface unit, which usually carries out an arbitration algorithm among the various types of buffers connecting the memory controller and the cache controller. For example, there will be a queue for cache misses and a queue for evicted cache blocks; there will be several such queues in the bus interface unit. The request ultimately gets queued into the memory controller's input queue, that is, this request ultimately gets transferred to a queue here, and the memory controller's job is to decode the request and take appropriate action. You can think of the memory controller as a very simple processor: it picks up a request from this queue, decodes it, figures out the address and the command type, and launches the access to the DRAM. So, here are the steps involved in servicing a cache miss from the outermost level, beyond the miss detection. You queue the miss request in a BIU buffer, that is, a bus interface unit buffer. How many request types, and hence how many different logical queues, are there? That depends very much on your architecture, but you can assume there will be at least one request type and queue for cache miss requests and one for evicted cache blocks. The bus interface unit schedules a request and drives the address, control, and data lines according to the bus protocol; essentially, when it picks up a request from here, it launches the address on this bus along with whatever control is needed, that is, the command and request type, and if it is an evicted cache block there will be data to be transferred as well. The request is picked up at the other end by the memory controller and put in a queue, this queue. The memory controller picks requests out of this queue, normally in order; there is no out-of-order scheduling going on here. It decodes the request type and the address and sends the access to the DRAM, your dynamic random access memory. The DRAM access involves decoding a row and decoding a column, because the DRAM is arranged as a two-dimensional matrix of bits.
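Here is a small sketch of what decoding the request into DRAM coordinates might look like: the memory controller splits the physical block address into a bank, a row, and a column. The bit widths and the ordering of the fields are arbitrary illustrative choices; real controllers use carefully tuned, often XOR-based, mappings.

/* Sketch of a memory controller's address decode: split a physical block
 * address into DRAM coordinates.  Field widths and ordering are illustrative. */
#include <stdint.h>

#define COL_BITS  10
#define BANK_BITS  3                  /* 8 banks */
#define ROW_BITS  15

typedef struct {
    unsigned column;                  /* column within the open row */
    unsigned bank;                    /* which bank inside the DRAM */
    unsigned row;                     /* row to activate */
} dram_coord_t;

dram_coord_t decode(uint64_t block_addr)
{
    dram_coord_t c;
    c.column = block_addr & ((1u << COL_BITS) - 1);   block_addr >>= COL_BITS;
    c.bank   = block_addr & ((1u << BANK_BITS) - 1);  block_addr >>= BANK_BITS;
    c.row    = block_addr & ((1u << ROW_BITS) - 1);
    return c;
}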
So, there is a row ID and a column ID, and at the intersection point sits the data; that is what is meant by decoding the row and decoding the column. The data reply is received by the memory controller: eventually the reply comes back to the memory controller, it is picked up there, and it goes into a bus interface unit queue on this side for scheduling. Eventually it is scheduled, and the L3 cache controller is notified that the data has arrived. This particular bus that connects the memory controller to the DRAM is often called a channel. Now, there are two important parameters when we talk about main memory: latency and bandwidth. Latency is essentially the time from when the address and command are sent to the DRAM to when the memory controller gets the data back; that is the latency of a memory access. Bandwidth is essentially, once the first piece of data comes out, how fast you can send out the remaining pieces of data. For example, when we talk about a cache block, the assumption has been that we have 64-byte cache blocks, and typically the DRAM interface would be 64 bits; this channel would be, for example, 64 bits wide, which means that transferring a 64-byte cache block takes eight transactions over the bus. The point is that if you had a wider bus, you could reduce the number of transfers; that is what bandwidth is about: once you have read the data, how fast can you transfer it to the memory controller. Now, there are typically two types of applications when we look at software programs. One class is called compute-bound, where, given a particular data point, you do a lot of computation; these are compute-bound applications, and typically these programs do not require much memory bandwidth, because given a data point the program spends a lot of time working on it, so the requests to memory are spaced out. In memory-bound applications, on the other hand, you typically do a small amount of computation on each data point and you touch a large amount of data, so these are very much bound by memory bandwidth. This category typically contains graphics- or game-like applications, where a lot of data needs to be transferred from memory to the graphics processor. Similarly, there are other classifications, like latency-bound and bandwidth-bound applications. Latency-bound applications do not ask for too much data, but when they ask, the memory needs to respond very quickly; otherwise their performance suffers. Bandwidth-bound applications, on the other hand, ask for a lot of data in a very short span, so in their case the bandwidth is more important than the latency. Now, it is relatively easy to improve bandwidth. For example, you can have a wider memory and bus; you can have interleaved banks in memory so that you can access multiple banks in parallel; or you can have a smart scheduler at this queue, and we will talk about scheduling algorithms that decide which request should be scheduled next. If you have memory banks, your goal should be to utilize as many banks as possible, so you should not be scheduling requests that go to the same bank at the same time, because they cannot be serviced in parallel.
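To put numbers on the transfer-count point above, here is a tiny worked example; the transfer rate used for the peak-bandwidth figure is an assumed example value, not a claim about any specific DRAM part.

/* Worked example: moving one 64-byte cache block over a 64-bit (8-byte) channel. */
#include <stdio.h>

int main(void)
{
    int block_bytes   = 64;
    int channel_bytes = 8;                              /* 64-bit data bus */
    int transfers     = block_bytes / channel_bytes;    /* = 8 bus transfers */

    double transfers_per_sec = 1.6e9;                   /* assumed example rate */
    double peak_gb_per_sec   = channel_bytes * transfers_per_sec / 1e9;   /* 12.8 GB/s */

    printf("%d transfers per block, %.1f GB/s peak\n", transfers, peak_gb_per_sec);
    return 0;
}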
Requests to the same bank have to be serviced one after another. So, the question is, how many banks should we support? Is there a relationship with the access latency? If I tell you that one request takes n cycles to service, how many banks should I support? Is there a relationship at all? Is there any point having more than n banks if one request requires n cycles? If I have n banks, every cycle I can potentially send a request to a different bank, so if I have a request for every bank, I am good. But this relationship is not very useful in practice, because the access time is usually very large and you cannot cover it by having that many banks: today's high-density DRAM designs come with a small number of banks, that is what today's DRAM market gives you, typically 4 to 8 banks. So, you have to deal with this particular constraint. You have to have a scheduler that does not keep picking requests that conflict in a bank; the goal of the scheduler should be to pick requests going to different banks, which means you may not be able to do in-order scheduling, and you may have to skip past a request and look for requests that go to independent banks. So, typically, what happens today is that this particular queue actually gets split into per-bank queues: the memory controller picks up a request, decodes it, finds out the bank number, and puts it in the corresponding bank queue; then it is up to the bank controller to schedule requests one after another from its bank queue. Finally, how do we improve latency? There are architectural techniques; mostly they revolve around latency hiding, and we have talked about many of these already, for example prefetching, where you prefetch to hide the latency of memory. There are lower-level techniques that try to improve the raw access latency, but they deal with DRAM technology, which we will not cover here; latency does, of course, improve with technology improvements.
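To close this part, here is a minimal sketch of the per-bank queue organization just described: incoming requests are decoded into a bank number and enqueued there, and each cycle the controller tries to issue to every bank that is currently idle. The sizes, the address-to-bank mapping, and the simple issue policy are all illustrative assumptions.

/* Sketch of per-bank request queues in a memory controller.  Every constant
 * and the issue policy are illustrative; real schedulers are far smarter
 * (for example, they also exploit open rows). */
#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS  8
#define QUEUE_LEN 16

typedef struct { uint64_t addr; bool is_write; } request_t;

typedef struct {
    request_t q[QUEUE_LEN];
    int head, tail, count;
    bool busy;                          /* bank currently servicing a request */
} bank_queue_t;

static bank_queue_t banks[NUM_BANKS];

static unsigned bank_of(uint64_t addr) { return (addr >> 6) % NUM_BANKS; }  /* 64 B blocks */

/* Decode an incoming request and drop it into its bank's queue. */
bool enqueue(request_t r)
{
    bank_queue_t *b = &banks[bank_of(r.addr)];
    if (b->count == QUEUE_LEN)
        return false;                   /* queue full: apply back-pressure */
    b->q[b->tail] = r;
    b->tail = (b->tail + 1) % QUEUE_LEN;
    b->count++;
    return true;
}

/* Each cycle, issue at most one request per idle bank, so that requests to
 * independent banks proceed in parallel while same-bank requests wait. */
void schedule_one_cycle(void)
{
    for (int i = 0; i < NUM_BANKS; i++) {
        bank_queue_t *b = &banks[i];
        if (!b->busy && b->count > 0) {
            /* issue b->q[b->head] to bank i here */
            b->head = (b->head + 1) % QUEUE_LEN;
            b->count--;
            b->busy = true;             /* cleared when the bank signals completion */
        }
    }
}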