talk about the general system architecture, that is, we have a hierarchy of caches inside the processor. So, we said that the L3 cache controller has at least 3 queues, 2 of them going towards the memory controller: one is a miss queue, another one is a write back queue to the memory controller. So, this one carries the requests that miss in the cache and this one carries the dirty evicted cache blocks to be written back to main memory. And there is a queue coming in this direction where you get the responses to the misses. So, I will call it a response queue, and there is an arbiter which schedules these 3 queues according to some protocol. And whenever it picks up one of these 2 queues, it will pick up the head of the queue and put it in the memory controller queue. And the memory controller's job is to schedule this queue: pick up requests, decode them, meaning that it will look at the address and the command, and from the address it will figure out the bank number. So, we will today see how you decode that bank ID from the address. And there are a bunch of queues, one for each bank of the DRAM, and we will put the corresponding request into the corresponding queue, and the bank controller will schedule this request according to some protocol. And it will send the request to the DRAM and eventually the response will come back into this queue, and the response will be sent over the bus into the response queue of the L3 cache. If it is a write back request, then of course there won't be any response; the data will just be written back to the corresponding bank. So, is this overview clear? Alright, so today what we will do is, assuming this general architecture, we are going to open this up and see what goes inside. So, to start with, at a very high level the DRAM is organized as rows and columns of bits. So, there are a bunch of rows. So, as we mentioned last time, the DRAM will have several banks inside. 
So, we are looking at one bank, we are just opening up one bank. So, it will have a bunch of rows and a bunch of columns. Now, the way the DRAM organization is defined today, a column is not a column of bits, it is usually multiple bits. So, for example, this could be a typical column; it contains several bits. So, this is the intersection of a row and a column; a column is not a single bit, it is several bits. Now, how do we really read a particular bank? So, whenever a request goes, as I said, the memory controller will figure out the bank ID and put the request in the corresponding bank request queue. The first command that will go from the bank request queue to the DRAM bank is called a row address strobe, or a RAS operation. So, what does it do? It essentially activates the corresponding row where the requested data resides. So, again, the row number can be decoded from the given address; we will talk about that also. So, essentially what happens is, when you activate a row, you read out this entire row into a row buffer. Each bank gets a row buffer. So, you read this entire row into the row buffer. And then the next operation that happens is called a column address strobe, or CAS, where you send the column address that tells you which column contains the requested data. And essentially the job of the DRAM is to take the column address and take those bits out of the corresponding row. And now, within this particular column, you may not actually require all the bits. So, there is a column offset also that is decoded from the address. So, certain bits may be needed from this column. And finally, those bits go out from this particular bank. So, that is how you read the DRAM. Now, one thing to observe here is that the content of the row buffer survives until you get a request to a different row. Which means, if this bank request scheduler is smart enough, you would actually cluster all the requests going to the same row together. You send them one after another. 
That would save your RAS operations. You can actually do one RAS followed by a sequence of CASes satisfying all the requests. So, that is a very common scheduling technique used in all DRAMs today, where the bank request scheduler prioritizes the requests that go to the same row of the bank. Now, of course, eventually it will happen that you do not have any requests in the bank request queue that fall on the same row as the currently open row. So, then what do you do? Of course, you have to take the extra latency. So, now there are three things that happen. The first is called a precharge operation; it closes the currently open row. RAS, we said, is essentially an activate operation, and after it you wait a certain number of cycles before you can send a CAS; that is called the row to column access delay. So, we are talking about a case where the currently open row is x and the bank controller has scheduled a request to row y. Clearly your row buffer does not contain the desired data, so you precharge, which closes this particular row. Then you do the same thing as before: you activate row y, the new row, which reads the row out into the row buffer. And then you wait for the row to column access delay and then you issue the column access, the CAS. So, the first scenario, where a request maps to the currently open row, is called a row hit, which has a latency equal to tCAS, just the time to do a column access. The second scenario, where the currently requested row is not the open one, is called a row conflict. And it has a latency equal to the precharge time, sometimes called tRP, that is row precharge, plus tRCD, that is the row to column access delay, plus tCAS. So, you can now clearly see that the row conflict time is going to be much higher than the row hit time. So, naturally a smart scheduling technique would be to walk this particular bank request queue and cluster all the requests going to the same row y. 
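To make the two latencies concrete, here is a minimal sketch; the timing values are illustrative assumptions in controller cycles (real tRP, tRCD, and tCAS come from the DRAM datasheet), not numbers stated in the lecture.

```python
# Illustrative timing parameters, in memory-controller cycles (assumed).
T_RP = 10    # row precharge time
T_RCD = 10   # row-to-column access delay
T_CAS = 10   # column access time

def row_hit_latency():
    # Requested row is already in the row buffer: column access only.
    return T_CAS

def row_conflict_latency():
    # Wrong row is open: precharge it, activate the new row, then CAS.
    return T_RP + T_RCD + T_CAS
```

With these numbers a row conflict costs three times a row hit, which is exactly why the scheduler clusters same-row requests.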
So, then for the whole cluster you will do this activation only once and it will take tCAS for the subsequent requests. No, there is no separate time; activate is just a command. However, the row buffer is not stable until this much time has passed; you have to wait for this much time. Now, there is a third thing; you might be wondering why I used the words row miss and row hit. So, there is a third thing which is called a row miss. This is slightly different from a row conflict. In the case of a row miss, what happens is that there is no open row currently. So, essentially you do not do any precharge; all you do is an activate and a CAS. Now, you might wonder why this should ever happen, because I always have an open row, right, the one corresponding to the last executed request. At the start only, right? However, there are certain DRAM controllers which follow something called a closed page policy, which basically says that if the DRAM controller can infer that this row will not be needed in the future, it can actually close the row early, so that the precharge latency is hidden from the next request. So, the precharge time is not on the critical path. That is called a closed page policy. An extreme of the closed page policy would be that after every request you close the row, which is of course not going to be very good if you have locality in the access stream. So, essentially then what will happen is that every request will take this much time, the row miss latency. And the other one is called the open page policy, where you keep the row open until you run into a conflict. So, in that case what will happen is that the first time you incur a row miss, but otherwise you will either have a row hit or a row conflict. And this particular policy is exercised by the memory controller. So, the memory controller designer will figure out what to do. If the designer can design a good prediction policy for closing the row buffer, it will actually use the closed page policy. 
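The three outcomes under an open page policy can be sketched as a tiny bank model; this is a sketch of the classification logic only (the class name and interface are my own, not from the lecture).

```python
class Bank:
    """Minimal open-page bank: remembers which row is in the row buffer."""

    def __init__(self):
        self.open_row = None        # no row open initially (or after a close)

    def access(self, row):
        if self.open_row is None:
            kind = "miss"           # no open row: activate + CAS
        elif self.open_row == row:
            kind = "hit"            # row already open: CAS only
        else:
            kind = "conflict"       # wrong row open: precharge + activate + CAS
        self.open_row = row         # open page policy: leave the row open
        return kind
```

The very first access is a miss, repeated accesses to the same row are hits, and switching rows gives a conflict; a closed page policy would instead reset `open_row` to `None` whenever the controller predicts the row is dead.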
So, by the way, I am using the word page here because often this is also called a page buffer or a DRAM page, this particular one. Do not get confused with your OS page; it has nothing to do with the OS page. So, is this clear to everybody? How I access the DRAM, what actually matters when I am accessing the DRAM, what are the decisions that influence my DRAM latency. So, this is a very simple overview. There are many other timing parameters that you have to take care of. He asked whether there is a time required for activating a row. So, there are many other timing parameters, but this is the first order effect of accessing a DRAM. For example, there are other second order effects, like when you write to a DRAM bank, that is, you write to the row buffer, how much time must elapse before you can read the row buffer again? There are certain constraints on that; you cannot immediately read the row buffer. So, if you are interested, I can send you a link; you can read the DRAM vendors' datasheets and see many more details. I do not know if you have ever tried to explore DRAMs, but if you do, you will find that DRAM latency is normally mentioned as something like 10-10-10, three numbers usually. So, those are these three numbers. So, this is what you actually need. So, what I will do is I will walk you through one example where I take a typical DRAM and actually look at how the addresses are decoded into banks, rows, columns, etcetera. So, this is one thing that I can show you in this course; unfortunately, the other things are all inside the chips. So, I brought these memory cards. If you have seen these things, memory cards, this is what they look like. So, you will see these black chips, right; these are DRAM chips, and they normally appear on both sides. They can appear on only one side also. 
So, this one has 4 chips on this side and 4 here, so there are 8 chips, and these are DDR2 512 megabyte memory cards, and these are the pin connectors; you can see these copper contacts. And this one is a DDR card and these are DDR2; same capacity, but the layout is different. So, one way to figure out whether it is DDR2 or DDR is that if you align the DDR card with the DDR2 card, you find that these two notches here are not aligned, actually. The DDR2 notch is offset relative to the DDR notch. So, one of these is the DDR card and the other is the DDR2 card. You can read about the exact layouts of DDR2 and DDR, why they do not align, etcetera. So, what I am going to talk about is one such card, what exactly it means. So, you can see that 9 chips are on this side, 9 on that side; there are 18 chips, out of which actually 16 are data chips and 2 are error correcting chips. So, let us try to see what actually goes inside this. So, this particular write up is also posted on the course web page, so you can read about that; this is just a summary of one particular DRAM from Micron. So, what we do is we look at one of these. So, this is called a dual inline memory module; both of these are dual inline memory modules, or DIMMs; these are called DIMM cards. There used to be a single inline memory module, or SIMM card, long back in the 90s; they had a smaller number of pins, but today you won't be able to buy them; they are not in the market, you only get DIMM cards. And remember that it has nothing to do with whether the chips are on one side or both sides; that has nothing to do with SIMM or DIMM. So, we are going to look at one 4 GB ECC DIMM with 2 gigabit x8 DRAM chips. So, I am going to explain what this notation means. So, this one stands for ECC. So, each DRAM chip is 2 gigabits in size. So, how many chips do I need to cover 4 gigabytes? 16, because this is 4 gigabytes and this is 2 gigabits. 
So, normally whenever you see a DRAM chip capacity, it is mentioned in gigabits. So, there are 16 chips, but this DIMM card actually has 18 chips, 2 of which are ECC chips. So, what is the ECC density, how many ECC bits per byte of data? Usually the simplest ECC protocol is that you store the XOR of the 8 bits in a single bit. With XOR, if 1 bit flips, then I should be able to catch it. So, we have 18 chips, 2 of which are ECC chips, so 16 are data chips. So, each side of the card has 9 chips; there are some small control chips also on the card. Now, this x8 here means that each DRAM chip can provide 8 bits of output. Whenever you send a particular command, the DRAM chip can either write or read 8 bits. So, usually these DIMMs have a 64-bit data interface with the memory controller. So, how many chips do I need to fill up 64 bits if each chip can provide 8 bits? 8 chips. So, whenever a request goes, normally I will activate 8 DRAM chips out of the 16. Of course, one ECC chip will also be activated, and then what will happen is that these 8 chips will give me a total of 64 bits of data. So, let us see how it actually does that. Each side of the chip forms a rank, sorry, each side of the card. So, in this case it is just an accident that one side of the card forms a rank. In general, a rank means a subset of DRAM chips that participate in generating a data packet. In this particular case it happens that you need 8 DRAM chips to generate a data packet, which is one side of the card. But you can easily come up with a different DIMM architecture where you find that only half of one side forms a rank. So, sorry, this one is actually a 72-bit interface, which is 64 bits of data plus 8 bits of ECC. So, we have two ranks here; each side is a rank. Now, the memory controller, when it receives a request, will receive a request for a cache block. Now, let us assume that we have a cache block size of 128 bytes. 
Now, we get 64 bits of data in one transfer from the DRAM. So, how many bus transactions do I need here? 16, right; we need 16 such transfers to fill up one cache block. So, this is often called the burst length of the DRAM. The burst length is configured when you boot the memory controller. In this case it is going to be a burst length of 16. Now, definitely what we really want is that the first transfer, that is, the first 8 bytes requested for the cache block, may have a row conflict or row miss depending on the policy, but the remaining 15 transfers should be row hits. That is what I would expect; that is how we must place our row and column IDs in the address, so that for a cache block we have 15 row hits and at most one row miss or conflict, or all 16 could be row hits. So, let us see. And this one is called a channel, which we discussed last time, and you can have multi-channel memory controllers as well; that is also possible. So, in that case what will happen is that you connect one DIMM to one channel and another DIMM to the other channel. So, there will be two separate DIMMs connected to the two channels, and that will give you better delivered bandwidth. And let us assume that each DRAM chip has 8 banks internally. Each bank looks like this: it has a bunch of rows and a bunch of columns. So, that means when I want to access a piece of data, what I need is, for this particular one, to give the row ID. So, first of all I need to choose my bank ID, within the bank I will give the row ID, I will give the column ID, and I will also give the column offset, that is, from where these 8 bits should come within the column, because one column will not be 8 bits, it will be longer than that. So, for this particular 2 gigabit DRAM chip, let us assume that we have the number of rows equal to 32K, which needs 15 bits, the number of columns equal to 128, and the column width is 64 bits. So, that uniquely defines my bank organization. 
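The burst length arithmetic above can be sketched in two lines, using the lecture's numbers (128-byte cache block, 64-bit data interface):

```python
CACHE_BLOCK_BYTES = 128   # cache block size assumed in the lecture
TRANSFER_BYTES = 8        # 64-bit data interface: 8 bytes per transfer

# Number of bus transfers (the burst length) to move one cache block.
burst_length = CACHE_BLOCK_BYTES // TRANSFER_BYTES
```

With a 64-byte cache block the same arithmetic would give a burst length of 8, which is why the burst length is a boot-time configuration of the memory controller rather than a fixed constant.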
So, how big is the row buffer? 128 columns times 64 bits, that is a 1 kilobyte row buffer per bank. So, let us see how the address is interpreted. So, this is my physical address. Now, whenever the memory controller sends a request to the DRAM, it aligns it to an 8-byte boundary. Why is that? Because the interface is 64 bits of data; it always talks in terms of 64-bit chunks. So, it aligns the address to an 8-byte boundary, which means the last 3 bits are 0. So, any physical address that the DRAM sees will have the last 3 bits equal to 0, which means those bits can be dropped. After that comes the column offset, which tells me, within a column, which bits should be these 8 bits. So, how many bits do I need for the column offset? The column width is 64 bits, so 3 bits: I have 8 possibilities of taking an 8-bit output from the 64-bit column. So, this is a 3-bit column offset. Then comes the column ID, which is 7 bits, then the bank ID, which is 3 bits, then comes the rank bit, which is 1 bit, and we need 15 bits for the row. How much does it come to? 32 bits, right. And that is what we would expect: to access 4 GB of memory I need a 32-bit address. So, when the memory controller gets a particular physical address, its job is to do this partitioning of the address into these different fields; it removes the last 3 bits and tells the DRAM about the following things: the column offset, the column ID, and the row ID. And it sends these things to the corresponding rank, to the corresponding bank. And the request gets broadcast to all the chips in that rank that would actually provide the data. So, 8 chips would now work concurrently, and each of these chips will access this particular row, this particular column, and within the column this particular offset. And each chip will provide you 8 bits, and finally you get out 64 bits of data plus 8 bits of ECC. And then you communicate that over the channel to the memory controller. 
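The address partition just described can be written down as a small decoder; this is a sketch assuming exactly the example geometry (8-byte alignment, 64-bit columns, 128 columns, 8 banks, 2 ranks, 32K rows), and the function name and field names are mine.

```python
def decode(addr):
    """Split a 32-bit physical address into the example DRAM fields."""
    addr >>= 3                            # drop 3 alignment bits (8-byte aligned)
    col_off = addr & 0x7;   addr >>= 3    # 3 bits: which 8 bits of the 64-bit column
    col_id  = addr & 0x7F;  addr >>= 7    # 7 bits: 128 columns
    bank_id = addr & 0x7;   addr >>= 3    # 3 bits: 8 banks per chip
    rank_id = addr & 0x1;   addr >>= 1    # 1 bit : 2 ranks (one per card side)
    row_id  = addr & 0x7FFF               # 15 bits: 32K rows
    return {"col_off": col_off, "col_id": col_id, "bank_id": bank_id,
            "rank_id": rank_id, "row_id": row_id}
```

Note how 16 consecutive 8-byte chunks of a 128-byte cache block differ only in `col_off` and the low bit of `col_id`, so after the first transfer the rest are row hits, which is the property we wanted.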
And then the burst length is set to 16, which means that the DRAM will now, every cycle, keep on providing one such transfer until the burst length is exhausted, at which point the request completes. In the case of a write, of course, there won't be any response to the memory controller; the writes will happen one after another in chunks of 8 bytes. Is that clear? Each transfer is part of the transaction; the whole transaction is usually called a burst, and the offset tells you which transfer within a burst; each transfer is 8 bytes. Yes, that is a burst; you get 16 transfers together. Any question on this? Now there are a couple of things that you have to resolve. If I have a multi-channel memory controller, where should I put my channel bits in the address? In other words, given an address, I must be able to figure out on which channel to send this particular request. And each channel has a DIMM connected. Where should I put my channel bits? What makes most sense? On this side. Here. Yes. Why? What is the reason? We are ignoring the last three bits. That is okay; that is because the addresses are aligned. Yes. So imagine that I have a dual-channel memory controller. What is the purpose of having dual channels? So that I can utilize both the channels concurrently. What makes me able to do that? What will enable both the channels to be used concurrently? How should I place the channel bits? What is the stream of requests? We will schedule the requests so that both channels are used. That is right. Yes. The first bit? The first bit of this linear address? Why do you say so? I want two independent requests to go to the two channels, so they can work independently. What is the stream of requests that the memory controller sees? Sorry? Right, cache blocks, right? I send one cache block to this channel, the next one to that channel, right? So I have just alternated the cache blocks across the channels. So I have to put it right after the... See, in this case I have a 128-byte cache block. 
So I have put the seventh bit here. This would be my channel bit. Okay. The column ID will get split, unfortunately, in this case. Right? Okay. So this is probably the most useful way of placing the channel bit: you can alternate between cache blocks. What about the other bits? Yeah, I will get to that. Yes, I will come to that, why this is done in this way. Just one more question before we get there. What if I have multiple multi-channel memory controllers? Then I have to select the memory controller as well. Where should I put that? What makes most sense? Does it make sense to send alternate cache blocks to alternate memory controllers? No, because here we have multiple multi-channel memory controllers. Fine. So within each memory controller I will alternate between channels. That makes sense. Why do we put the row ID on the most significant side? Can anybody explain why? Think about this. Let me see. Does it maximize the amount of contiguous data in the row if I put the row ID on the most significant side? Yes. It does, right? Yes. So it means if I have sequential accesses, I would have mostly row hits except the first access. But then you introduce the channel ID there. Yes. You break the locality. Yes, I do break the locality. But if you look inside a channel, within a channel, it still continues. Yes, overall I am breaking the locality, but within a channel it still continues. What if the channel ID were at the least significant bits, right after the three alignment bits? No, there is a technical problem in doing that. Then essentially you are splitting bits across channels, which cannot be done. The column offset cannot be split across channels. That is not possible. Essentially what you are saying is, why would you split one request across channels? That is what you are saying, right? Yes. 
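The cache-block interleaving just discussed can be sketched directly: with 128-byte cache blocks, bit 7 of the physical address selects the channel, so consecutive blocks alternate between the two channels (the function name is mine, for illustration).

```python
CACHE_BLOCK_BYTES = 128   # so the channel bit sits at bit 7 of the address

def channel_of(addr, n_channels=2):
    """Channel for this address under cache-block interleaving."""
    return (addr // CACHE_BLOCK_BYTES) % n_channels
```

Every address inside one cache block maps to the same channel, so a single request is never split across channels, while back-to-back blocks land on different channels and can proceed independently.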
So I'm saying, why wouldn't you give independent requests to the channels so that they can proceed independently? The problem with what you are proposing is that then you have to wait for both the channels to complete before responding to the processor. But with two independent cache blocks given to two channels, they may be responded to out of order, actually. As soon as one completes, I can respond on the bus. That's about it. So, the latency parameters I have just mentioned are specified like tCAS, tRCD, tRP. So those are the three numbers. You often find a fourth number, which is tRAS, the time taken for a RAS operation. So that's how the DRAM datasheets normally specify the latency parameters. Otherwise, DIMMs are normally talked about like this, and the DRAM chips would be like this. Also, along with this, there will be a DDR specification and a frequency parameter. So they may look like this: DDR3-1333. What this means is that it is a third generation double data rate DRAM, and the effective frequency is 1333 MHz. By effective frequency, what I mean is that a double data rate DRAM can transfer data on both clock edges. Which means, in this case, if it is a DDR DRAM, it will transfer 64 bits on each clock edge. So, to transfer a burst of 16, you actually require 8 cycles. The question is, what is the frequency of these cycles, what is the cycle time? This one is not the actual frequency; half of this is the actual frequency. This is an effective frequency: at this frequency you would require 16 cycles, at half the frequency you require 8 cycles. So, that's pretty much it. That's what you need to know if you want to make an educated decision about buying a DIMM card. I don't think you need to know anything else, so that the marketing people can't fool you. So, this is about DRAM. So, we're talking about the main memory. 
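The effective-versus-actual frequency point can be checked with a little arithmetic, using the DDR3-1333 example from above (the peak-bandwidth line is my addition, following directly from the 64-bit interface):

```python
effective_mhz = 1333            # "DDR3-1333": transfers per microsecond
actual_mhz = effective_mhz / 2  # data moves on both edges, so the real
                                # clock runs at half the effective rate
burst_length = 16
# Two transfers per actual clock cycle, so a 16-transfer burst
# occupies 8 actual cycles on the data bus.
cycles_for_burst = burst_length // 2

# Peak bandwidth of the 64-bit (8-byte) channel, in MB/s.
peak_bw_mb_s = effective_mhz * 8
```

So a DDR3-1333 channel peaks at roughly 10.7 GB/s, which matches the "PC3-10600"-style module names vendors use.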
And you might notice that, so an obvious question that one could ask is, why do you separate the RAS and CAS operations? Why don't you send the row address and column address together? That would definitely save a lot of time, because I could actually combine the commands together and probably give out the data much earlier. Because here what I am doing is, I first activate the row, then I read the row out, and then I access my column. Instead, if I had the row address and the column address together, I could probably read it out in one go and give you the data much faster. Any idea why these are separated into two phases, RAS and CAS? What could be the reason for doing this? So normally, the row address and the column address, as I said, when I mentioned that the memory controller sends the row ID, column ID, and the column offset to the DRAM bank controller, they actually don't get sent together. First you send the row ID, and then you communicate the column ID and the column offset on the same address bus. You multiplex all of these. Why is that? What is the reason? Under what circumstances do you normally think about multiplexing? Sorry? No, here we are not talking about a multiplexer. We are talking about multiplexing two things on a single physical resource, that is, the address bus. You first send the row address, and then you send the column address. Why do you need to do this? Sorry? Yes, you are short of wires. Right, I do not have enough wires to communicate the row and column address in parallel. So the point is that the DRAM chips are pin-limited. They are severely pin-limited. And the reason is, of course, the obvious question is why, why can't we increase the number of pins? What hurts when we increase the number of pins? The reason is that it hurts the density. In DRAM, the density is extremely important: how many bits you can pack per unit area. 
That is the deciding factor when you design a DRAM. So, when you have more pins, you normally have more peripheral circuitry, and that hurts the density of the DRAM. So here, we never mentioned the area actually, but that is a very important factor: how much area do you need to pack these bits of data. So, that is why the DRAM chips are pin-limited, which is why you multiplex the row and column address on the same address bus. This one is called a sense amplifier; that is the technical name. If you want to read up on that, I can give you some references. So, an array of sense amplifiers actually holds the data; that is the row buffer. And it involves a lot of electrical engineering to design good sense amplifiers. Regarding refresh, I have already talked about it. Periodically, the memory controller will be scheduling refresh cycles. So, in a refresh cycle, what you normally do is, first of all, the first thing that happens is that the bank which is currently being refreshed cannot be accessed by the processor at that time. So, that is the performance cost of a refresh cycle. During refresh, what you do is you close the row, whichever is open in this bank, and then read out one row at a time, one after another: read out the row, write it back. That is all that the refresh does. You just read one row at a time, and that refreshes the contents of the row. So, that is the refresh cycle. But remember that here you have these 32K rows; you have to refresh all of them, and that takes a lot of time. Which is why DRAM refresh is really an expensive operation, and it is normally not done very frequently. Every DRAM datasheet comes with the nominal time that the DRAM can operate without a refresh. So, the memory controller designer has to schedule the refresh operations within that nominal time so that the data is not lost. All right. So, one more small thing that I want to mention. I want to open this up a little bit more, just to show you what goes inside. 
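To see why the controller spreads refreshes out, here is a back-of-the-envelope sketch; the 64 ms retention window is an assumption (a typical datasheet figure, not a number given in this lecture), combined with the 32K rows of the example bank.

```python
RETENTION_MS = 64.0     # assumed nominal time the DRAM survives without refresh
ROWS = 32 * 1024        # rows per bank in the example geometry

# If every row must be refreshed once per retention window, a distributed
# refresh schedule issues one row refresh about every this many microseconds.
refresh_interval_us = RETENTION_MS * 1000.0 / ROWS
```

So under these assumptions the controller steals the bank for one row refresh roughly every couple of microseconds, rather than freezing it for one long burst of 32K refreshes.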
So I'll show you two rows of the DRAM, just to show you how it works. So, these are four-bit rows that I'm showing; I'm showing two rows of the DRAM. This is the DRAM cell, which stores the bit, and this is the switch, which closes whenever I apply a high voltage on this particular wire. Whenever I apply a high voltage here, this closes, and whatever the value here is will appear on this particular wire. So, here I'm showing two different rows. This is called a word line, and this row of cells is often called a word. This is called a bit line. Now, suppose I want to access this particular row. What do I do? I apply a high voltage on this word line, and then I read out these four bit lines. So, I get the value in that particular row; as simple as that. Now, of course, a lot of engineering goes into this: how to design this particular bit cell, how to maintain it. And there are many DRAM cell designs; for example, there are DRAM cells with only one transistor, which store the bit in a single capacitor accessed through that one transistor. Now, suppose I want to change this two-row DRAM so that I can access both the rows simultaneously. How do I do this? Currently it's not possible. Try to appreciate that: if I enable both the word lines, the two rows will get jumbled up, actually; I'll not get any good data. So, how do I enable this, if I want to access both the rows? What change do I need to make? Sorry, say that again? No, that I cannot change; there are two rows. So, since I cannot access both the rows here, this is called a single-ported DRAM. In a single-ported DRAM, I can only access one of the rows at a time. And you can see that I can add more rows in just the same way. Now, how do I make it dual-ported, so that I can access two rows? Just more bit lines, is that enough? No sir, we need switches to connect to both bit lines. More word lines? Yes. So, I add one more word line per row, and I add one more bit line per column. 
And I add one more switch, et cetera. Now it's clear, right? I can now enable this row, et cetera, and so on. So, I can not only access the same row twice, I can access these two rows also. So, this is a dual-ported DRAM. Now, what price do we pay for making it dual-ported? Sorry? Anything else? Anything in terms of latency? I tell you that it is slower than a single-ported one. Why? Sorry? Well, you tell me exactly why it is slower. So, compared to the single-ported DRAM, is it true that the word line length increases, the length of the word line spanning one row? Does it increase? It does, right? Because there has to be some gap between the bit lines, so this width increases. Does the length of the bit line increase? It does, right? And the time to charge a line is proportional to the length of the line; it is an RC delay, and both R and C grow with length. So, as you add more ports, it's going to get slower and slower. That is the reason why. I think I'm going to stop here. So, one more thing. These are called access transistors, these things, and often, taken together, this whole blob here is called an access stack. So, you need a bigger access stack if you have more ports; that's what the implication is. So, that's pretty much it about DRAM. So, next time, I'll take a look at the SRAM a little bit, the static random access memory, which is used for caches. There are just a few extra points that need to be touched upon, and then we'll move on to something else. But that's after the break. This particular bit, in DRAM, is normally implemented using one capacitor, just one capacitor, which leaks. In static RAM, this bit is normally implemented as cross-coupled inverters, which you have learned about in your digital electronics course. If you cross-couple inverters, that acts as a latch, right, which holds a bit. So, there is no question of leaking out. So, what if we increase the length of the word line? Yes. And decrease the number of rows, which decreases the length of the bit line? 
Yeah, so we increase the latency. You increase? Yes, it slows down. I mean, suppose we increase the length of each row and decrease the number of rows, so you make the RAM wider. That's what I'm saying; it depends on how wide the RAM is. You see, the precharge latency is the time to precharge the bit line, but we also have to count the time to charge the word line, because until that stabilizes, the access transistors just cannot be activated; the switch cannot act properly, right? So, both of these are on the critical path. Yeah, it may be true that you reach a crossover point where the total latency becomes comparable.